Speaking

{
    "id": "<string>",
    "customer_id": "<string>",
    "reference_type": "<string>",
    "reference_id": "<string>",
    "language": "<string>",
    "provider": "<string>",
    "voice_id": "<string>",
    "direction": "<string>",
    "status": "<string>",
    "tm_create": "<string>",
    "tm_update": "<string>",
    "tm_delete": "<string>"
}

id (UUID): The speaking session’s unique identifier. Returned when creating a TTS session via POST /speakings or listing via GET /speakings.
customer_id (UUID): The customer who owns this speaking session. Obtained from GET /customer.
reference_type (enum string): The type of resource receiving TTS audio. See Reference Type.
reference_id (UUID): The ID of the resource receiving TTS audio. Depending on reference_type, obtained from GET /calls or GET /conferences.
language (string, BCP 47): The language and locale for TTS synthesis (e.g., en-US, ko-KR). Must be one of the languages the selected provider supports.
provider (enum string, optional): The TTS provider used for synthesis. See Provider. If omitted, defaults to elevenlabs.
voice_id (string, optional): A provider-specific voice identifier. If omitted, the provider’s default voice for the specified language is used. Obtain available voices from the provider’s documentation.
direction (enum string): The audio routing direction. See Direction.
status (enum string): The speaking session’s current status. See Status.
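tm_create (string, ISO 8601): Timestamp when the speaking session was created. Timestamps use a space between date and time and microsecond precision (e.g., 2025-06-15 14:30:00.123456).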
tm_update (string, ISO 8601): Timestamp of the last update to any speaking property.
tm_delete (string, ISO 8601): Timestamp when the speaking session was deleted. Set to 9999-01-01 00:00:00.000000 if not deleted.

Note

AI Implementation Hint

Timestamps set to 9999-01-01 00:00:00.000000 indicate the event has not yet occurred. For example, tm_delete with this value means the speaking session has not been deleted.
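
The sentinel rule above can be checked in client code. The following is a minimal sketch, assuming timestamps always use the `YYYY-MM-DD HH:MM:SS.ffffff` layout seen in the example object; the helper names `parse_timestamp` and `has_occurred` are illustrative and not part of the API.

```python
from datetime import datetime

# Assumed layout, taken from the example object in this document.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# Sentinel meaning "this event has not happened yet".
NOT_YET = datetime(9999, 1, 1)


def parse_timestamp(value: str) -> datetime:
    """Parse a tm_* string into a naive datetime."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def has_occurred(value: str) -> bool:
    """Return False when the timestamp equals the 9999-01-01 sentinel."""
    return parse_timestamp(value) != NOT_YET


# A session whose tm_delete is the sentinel has not been deleted.
is_deleted = has_occurred("9999-01-01 00:00:00.000000")
```

Comparing parsed values rather than raw strings keeps the check working if the number of fractional digits ever differs.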

Example

{
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
    "reference_type": "call",
    "reference_id": "12f8f1c9-a6c3-4f81-93db-ae445dcf188f",
    "language": "en-US",
    "provider": "elevenlabs",
    "voice_id": "",
    "direction": "both",
    "status": "active",
    "tm_create": "2025-06-15 14:30:00.123456",
    "tm_update": "2025-06-15 14:30:02.456789",
    "tm_delete": "9999-01-01 00:00:00.000000"
}
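
For clients that prefer typed access, the object above might be loaded into a small data class. This is only a sketch: the `Speaking` class and its `from_dict` helper are hypothetical names, and every field is kept as a plain string exactly as it appears in the JSON.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class Speaking:
    """Typed view of a speaking object; field names mirror the JSON keys."""

    id: str
    customer_id: str
    reference_type: str
    reference_id: str
    language: str
    provider: str
    voice_id: str
    direction: str
    status: str
    tm_create: str
    tm_update: str
    tm_delete: str

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Speaking":
        # Ignore unknown keys so extra response fields do not raise.
        # A missing key still raises TypeError, which surfaces bad input early.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


example = {
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
    "reference_type": "call",
    "reference_id": "12f8f1c9-a6c3-4f81-93db-ae445dcf188f",
    "language": "en-US",
    "provider": "elevenlabs",
    "voice_id": "",
    "direction": "both",
    "status": "active",
    "tm_create": "2025-06-15 14:30:00.123456",
    "tm_update": "2025-06-15 14:30:02.456789",
    "tm_delete": "9999-01-01 00:00:00.000000",
}

speaking = Speaking.from_dict(example)
```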

reference_type

All possible values for the reference_type field:

Type	Description
call	Attach TTS to a live call. The `reference_id` is a call ID from `GET /calls`.
confbridge	Attach TTS to a live conference. The `reference_id` is a conference ID from `GET /conferences`.

provider

All possible values for the provider field:

Provider	Description
elevenlabs	ElevenLabs TTS. High-quality neural voices. Default provider if omitted.
gcp	Google Cloud Text-to-Speech. Wide language support with WaveNet and Neural2 voices.
aws	Amazon Polly. Neural and standard voices with SSML support.

When creating a speaking session, the provider field is optional. If omitted, VoIPBIN defaults to elevenlabs.
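
The optional-provider behavior could be handled on the client side as sketched below. This assumes the body of `POST /speakings` uses the same field names as the speaking object, which this section does not confirm; `build_create_body` is a hypothetical helper, not part of any VoIPBIN SDK.

```python
from typing import Dict, Optional

# Provider values listed in the table above.
PROVIDERS = {"elevenlabs", "gcp", "aws"}


def build_create_body(
    reference_type: str,
    reference_id: str,
    language: str,
    direction: str,
    provider: Optional[str] = None,
    voice_id: Optional[str] = None,
) -> Dict[str, str]:
    """Build a request body, leaving optional fields out when unset.

    Assumption: the request uses the same keys as the speaking object.
    """
    if provider is not None and provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")

    body = {
        "reference_type": reference_type,
        "reference_id": reference_id,
        "language": language,
        "direction": direction,
    }
    # Omitting "provider" lets the server apply its default (elevenlabs).
    if provider is not None:
        body["provider"] = provider
    # Omitting "voice_id" selects the provider's default voice.
    if voice_id:
        body["voice_id"] = voice_id
    return body


default_body = build_create_body(
    "call", "12f8f1c9-a6c3-4f81-93db-ae445dcf188f", "en-US", "both"
)
```

Leaving the key out, rather than sending an empty string, avoids relying on how the server treats empty values.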

status

All possible values for the status field:

Status	Description
initiating	TTS session is being set up. Provider connection is being established. Do not call `POST /speakings/{id}/say` in this state.
active	TTS session is ready. Send text via `POST /speakings/{id}/say`; the synthesized audio is injected into the call.
stopped	TTS session has ended. Stopped via `POST /speakings/{id}/stop` or the call was hung up.
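
Because text must not be sent while a session is still `initiating`, a client may want to wait for `active` first. The sketch below shows one way to do that. `get_status` is a hypothetical callable that returns the session's current `status` string; how it fetches that value is left to the caller, and the retry count and delay are arbitrary.

```python
import time
from typing import Callable


def can_say(status: str) -> bool:
    """Text may be sent only while the session is active."""
    return status == "active"


def wait_until_active(
    get_status: Callable[[], str],
    attempts: int = 10,
    delay: float = 0.5,
) -> bool:
    """Poll until the session is active.

    Returns True once the status is "active". Returns False if the
    session is "stopped" or the attempts run out while "initiating".
    """
    for _ in range(attempts):
        status = get_status()
        if can_say(status):
            return True
        if status == "stopped":
            # A stopped session never becomes active again; give up early.
            return False
        time.sleep(delay)
    return False
```

Passing the status lookup in as a callable keeps the waiting logic independent of any particular HTTP client.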

direction

All possible values for the direction field:

Direction	Description
in	Audio injected toward the caller (remote party hears it, local party does not).
out	Audio injected toward the callee/local side (local party hears it, remote party does not).
both	Audio injected to both sides of the call. Both parties hear the synthesized speech.
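
The table above can be restated as a lookup from `direction` to the parties that hear the audio. This is an illustrative sketch only: the labels `"remote"` and `"local"` and the helper `heard_by` are naming assumptions made here, not values the API returns.

```python
from typing import FrozenSet

# Who hears the synthesized speech for each direction value,
# following the descriptions in the table above.
HEARD_BY = {
    "in": frozenset({"remote"}),
    "out": frozenset({"local"}),
    "both": frozenset({"remote", "local"}),
}


def heard_by(direction: str) -> FrozenSet[str]:
    """Return the set of parties that hear the audio."""
    try:
        return HEARD_BY[direction]
    except KeyError:
        raise ValueError(f"unknown direction: {direction!r}") from None
```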