Speaking

Real-time text-to-speech (TTS) injection into calls and conferences with support for multiple providers and voice selection.

API Reference: Speaking endpoints

Overview

Note

AI Context

  • Complexity: Low

  • Cost: Chargeable (per TTS synthesis request)

  • Async: Yes. POST /speakings returns immediately with status initiating. Poll GET /speakings/{id} until status is active before calling POST /speakings/{id}/say.

The Speaking API enables you to inject synthesized speech into live calls and conferences in real-time. You can choose from multiple TTS providers, select specific voices, control audio direction (who hears the speech), and queue multiple speech segments for continuous playback.

Key capabilities:

  • Inject synthesized speech into live calls and conferences

  • Choose from multiple TTS providers (ElevenLabs, Google Cloud, AWS)

  • Select specific voices or use provider defaults

  • Control audio direction (caller only, callee only, or both)

  • Queue multiple speech segments with flush control

How Speaking Works

The Speaking API synthesizes text-to-speech in real-time and delivers it to the specified audio target—either a call or a conference bridge.

+--------+        +----------------+        +-------------+
|  Call  |<-audio-|  TTS Engine    |<-text--|  Your App   |
+--------+        +----------------+        | POST /say   |
                         |                  +-------------+
+------------+           |
| Conference |<--audio---+
+------------+

Key components:

  • Audio Target: A live call or conference that will receive the synthesized audio

  • TTS Engine: The voice synthesis provider (ElevenLabs, Google Cloud, or AWS)

  • Your App: Sends text via POST /speakings/{id}/say to be synthesized and played

Speaking Lifecycle

Each speaking session progresses through a series of states from creation through termination.

POST /speakings
     |
     v
+-------------+                    +-------------+
| initiating  |----setup done----->|   active    |
+-------------+                    +------+------+
                                         |
                       POST /speakings/{id}/stop or call hangup
                                         |
                                         v
                                  +-------------+
                                  |   stopped   |
                                  +-------------+

Status values:

Status      Description
----------  -----------
initiating  TTS session is being set up. Provider connection is being established. Do not call /say in this state.
active      TTS session is ready. You can send text via POST /speakings/{id}/say. Audio is being injected into the call.
stopped     TTS session has ended. Either stopped explicitly via POST /speakings/{id}/stop or the call was hung up.

Note

AI Implementation Hint

Always poll GET /speakings/{id} until status is active before calling POST /speakings/{id}/say. Sending text while status is initiating will fail. Typical setup time is 2-3 seconds. Only one active speaking session per call is allowed—create a new session only after the previous one is stopped.

Providers

The Speaking API supports multiple TTS providers, each with distinct voice libraries and pricing.

Provider    Description
----------  -----------
elevenlabs  ElevenLabs TTS. High-quality neural voices with natural intonation. Default provider if omitted.
gcp         Google Cloud Text-to-Speech. Wide language support with WaveNet and Neural2 voices.
aws         Amazon Polly. Neural and standard voices with SSML support.

The provider field is optional and defaults to elevenlabs if omitted.

Direction

Control who hears the synthesized speech by specifying the audio direction.

Direction  Description
---------  -----------
in         Audio injected toward the caller (remote party hears it, local party does not).
out        Audio injected toward the callee/local side (local party hears it, remote party does not).
both       Audio injected to both sides of the call. Both parties hear the synthesized speech.

Reference Types

Attach the speaking session to either a call or a conference.

Type        Description
----------  -----------
call        Attach TTS to a live call. The reference_id is a call ID from GET /calls.
confbridge  Attach TTS to a live conference. The reference_id is a conference ID from GET /conferences.
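
The provider, direction, and reference type values listed above can be checked on the client before a request is sent. The following is a minimal Python sketch under that assumption. The helper name build_speaking_payload is hypothetical and not part of any VoIPBIN SDK, and the server remains the authority on which values are accepted.

```python
# Allowed values, copied from the Providers, Direction, and
# Reference Types sections of this page.
PROVIDERS = {"elevenlabs", "gcp", "aws"}
DIRECTIONS = {"in", "out", "both"}
REFERENCE_TYPES = {"call", "confbridge"}


def build_speaking_payload(reference_type, reference_id, language,
                           direction, provider=None, voice_id=None):
    """Return a dict usable as the JSON body of POST /speakings.

    Hypothetical helper: it only validates enum values locally.
    """
    if reference_type not in REFERENCE_TYPES:
        raise ValueError(f"unknown reference_type: {reference_type!r}")
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction!r}")
    if provider is not None and provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")

    payload = {
        "reference_type": reference_type,
        "reference_id": reference_id,
        "language": language,
        "direction": direction,
    }
    # Optional fields are left out so the documented defaults apply.
    if provider is not None:
        payload["provider"] = provider
    if voice_id is not None:
        payload["voice_id"] = voice_id
    return payload
```

Leaving provider and voice_id out of the body, rather than sending empty strings, keeps the request aligned with the documented default behavior.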

Best Practices

  • Always wait for active status before sending text via POST /speakings/{id}/say

  • Keep individual say requests under 5,000 characters to avoid timeouts and added latency

  • Use flush to interrupt current speech when user speaks (barge-in scenarios)

  • Clean up sessions explicitly with POST /speakings/{id}/stop and DELETE /speakings/{id} when finished

  • Choose direction based on use case: both for announcements, out for agent coaching, in for IVR prompts

  • Monitor session status actively—if the underlying call hangs up, the speaking session auto-stops
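
To stay within the 5,000-character guideline above, long text can be split before it is sent. The sketch below is one possible approach in Python: it keeps sentences whole where it can and cuts hard only when a single sentence exceeds the limit. The function name split_for_say is hypothetical, and the limit is treated as inclusive, which is an assumption; lower it if the API proves stricter.

```python
import re

MAX_SAY_CHARS = 5000  # limit quoted on this page; treated as inclusive here


def split_for_say(text, limit=MAX_SAY_CHARS):
    """Split text into chunks no longer than `limit` characters.

    Sentences are kept whole where possible; a single sentence longer
    than `limit` is cut hard at the limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-split any sentence that cannot fit in one chunk.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own say request, in order.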

Troubleshooting

  • 400 Bad Request:
    • Cause: Invalid language code or missing required fields.

    • Fix: Verify language is a valid BCP47 code (e.g., en-US). Ensure reference_type and reference_id are provided.

  • 404 Not Found:
    • Cause: The speaking session ID does not exist or belongs to another customer.

    • Fix: Verify the UUID was obtained from a recent POST /speakings or GET /speakings response.

  • 409 Conflict:
    • Cause: Another speaking session is already active on this call, or the call is not in progressing status.

    • Fix: Stop the existing session via POST /speakings/{id}/stop first. Verify the call status is progressing via GET /calls/{id}.

Speaking

{
    "id": "<string>",
    "customer_id": "<string>",
    "reference_type": "<string>",
    "reference_id": "<string>",
    "language": "<string>",
    "provider": "<string>",
    "voice_id": "<string>",
    "direction": "<string>",
    "status": "<string>",
    "tm_create": "<string>",
    "tm_update": "<string>",
    "tm_delete": "<string>"
}

  • id (UUID): The speaking session’s unique identifier. Returned when creating a TTS session via POST /speakings or listing via GET /speakings.

  • customer_id (UUID): The customer who owns this speaking session. Obtained from GET /customer.

  • reference_type (enum string): The type of resource receiving TTS audio. See Reference Type.

  • reference_id (UUID): The ID of the resource receiving TTS audio. Depending on reference_type, obtained from GET /calls or GET /conferences.

  • language (string, BCP47): The language and locale for TTS synthesis (e.g., en-US, ko-KR). Must be a language supported by the selected provider.

  • provider (enum string, optional): The TTS provider used for synthesis. See Provider. If omitted, defaults to elevenlabs.

  • voice_id (string, optional): A provider-specific voice identifier. If omitted, the provider’s default voice for the specified language is used. Obtain available voices from the provider’s documentation.

  • direction (enum string): The audio routing direction. See Direction.

  • status (enum string): The speaking session’s current status. See Status.

  • tm_create (string, ISO 8601): Timestamp when the speaking session was created.

  • tm_update (string, ISO 8601): Timestamp of the last update to any speaking property.

  • tm_delete (string, ISO 8601): Timestamp when the speaking session was deleted. Set to 9999-01-01 00:00:00.000000 if not deleted.

Note

AI Implementation Hint

Timestamps set to 9999-01-01 00:00:00.000000 indicate the event has not yet occurred. For example, tm_delete with this value means the speaking session has not been deleted.
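
Client code usually needs to turn the sentinel described above into something easier to test for. The following is a small Python sketch that maps the sentinel to None. The timestamp format string is inferred from the example values on this page, and the helper names are hypothetical.

```python
from datetime import datetime

# Format inferred from the example timestamps on this page.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
NOT_OCCURRED = "9999-01-01 00:00:00.000000"  # sentinel described above


def parse_timestamp(value):
    """Return a datetime, or None when the sentinel marks 'not yet occurred'."""
    if value == NOT_OCCURRED:
        return None
    return datetime.strptime(value, TIMESTAMP_FORMAT)


def is_deleted(speaking):
    """True when tm_delete holds a real timestamp rather than the sentinel."""
    return parse_timestamp(speaking["tm_delete"]) is not None
```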

Example

{
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
    "reference_type": "call",
    "reference_id": "12f8f1c9-a6c3-4f81-93db-ae445dcf188f",
    "language": "en-US",
    "provider": "elevenlabs",
    "voice_id": "",
    "direction": "both",
    "status": "active",
    "tm_create": "2025-06-15 14:30:00.123456",
    "tm_update": "2025-06-15 14:30:02.456789",
    "tm_delete": "9999-01-01 00:00:00.000000"
}
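
A response like the example above can be loaded into a typed object on the client. The following Python sketch shows one way to do that. The Speaking dataclass is a hypothetical client-side type, not an official SDK class; every field except id has an empty default because some responses on this page show only a subset of fields.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Speaking:
    """Client-side mirror of the Speaking object (hypothetical type)."""
    id: str
    customer_id: str = ""
    reference_type: str = ""
    reference_id: str = ""
    language: str = ""
    provider: str = ""
    voice_id: str = ""
    direction: str = ""
    status: str = ""
    tm_create: str = ""
    tm_update: str = ""
    tm_delete: str = ""

    @classmethod
    def from_json(cls, raw):
        """Parse a JSON document, ignoring keys this sketch does not know."""
        data = json.loads(raw)
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})
```

Ignoring unknown keys is a design choice of this sketch: it keeps the client working if the API adds fields later.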

reference_type

All possible values for the reference_type field:

Type        Description
----------  -----------
call        Attach TTS to a live call. The reference_id is a call ID from GET /calls.
confbridge  Attach TTS to a live conference. The reference_id is a conference ID from GET /conferences.

provider

All possible values for the provider field:

Provider    Description
----------  -----------
elevenlabs  ElevenLabs TTS. High-quality neural voices. Default provider if omitted.
gcp         Google Cloud Text-to-Speech. Wide language support with WaveNet and Neural2 voices.
aws         Amazon Polly. Neural and standard voices with SSML support.

When creating a speaking session, the provider field is optional. If omitted, VoIPBIN defaults to elevenlabs.

status

All possible values for the status field:

Status      Description
----------  -----------
initiating  TTS session is being set up. Provider connection is being established. Do not call POST /speakings/{id}/say in this state.
active      TTS session is ready. Send text via POST /speakings/{id}/say. Audio is being injected into the call.
stopped     TTS session has ended. Stopped via POST /speakings/{id}/stop or the call was hung up.

direction

All possible values for the direction field:

Direction  Description
---------  -----------
in         Audio injected toward the caller (remote party hears it, local party does not).
out        Audio injected toward the callee/local side (local party hears it, remote party does not).
both       Audio injected to both sides of the call. Both parties hear the synthesized speech.

Tutorial

Before using the Speaking API, you need:

  • An authentication token. Obtain one via POST /login or use your access key via ?accesskey=<your-accesskey>.

  • An active call in progressing status (create one via POST /calls and poll GET /calls/{id} until it is answered), or an active conference created via POST /conferences.

  • A language code in BCP47 format (e.g., en-US, ko-KR, ja-JP).

  • (Optional) A provider-specific voice ID. Defaults to the provider’s default voice if omitted.

Note

AI Implementation Hint

The call must be in progressing status before attaching a speaking session. Poll GET /calls/{id} until status is progressing. If the call reaches hangup status, the call was not answered and you must retry. For conferences, ensure at least one participant has joined before attaching TTS.

Create a Speaking Session

Attach a TTS session to a live call. The session starts in initiating status while the provider connection is established.

$ curl --location --request POST 'https://api.voipbin.net/v1.0/speakings?token=<YOUR_AUTH_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "reference_type": "call",
    "reference_id": "12f8f1c9-a6c3-4f81-93db-ae445dcf188f",
    "language": "en-US",
    "direction": "both"
}'

{
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
    "reference_type": "call",
    "reference_id": "12f8f1c9-a6c3-4f81-93db-ae445dcf188f",
    "language": "en-US",
    "provider": "elevenlabs",
    "voice_id": "",
    "direction": "both",
    "status": "initiating",
    "tm_create": "2025-06-15 14:30:00.123456",
    "tm_update": "2025-06-15 14:30:00.123456",
    "tm_delete": "9999-01-01 00:00:00.000000"
}
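
The same request can be built from application code. The following is a minimal Python sketch using only the standard library. The base URL and the token query parameter come from the curl example above; the function name create_speaking_request is hypothetical. The request is built but not sent, so it can be inspected or logged first.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.voipbin.net/v1.0"  # taken from the curl examples


def create_speaking_request(token, payload):
    """Build (but do not send) the POST /speakings request shown above."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}/speakings?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending is a separate step, for example:
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       speaking = json.load(resp)
```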

Poll until the session becomes active:

$ curl --location --request GET 'https://api.voipbin.net/v1.0/speakings/a1b2c3d4-e5f6-7890-abcd-ef1234567890?token=<YOUR_AUTH_TOKEN>'

{
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "status": "active",
    ...
}

Wait for active before proceeding. Typical setup time is 2-3 seconds.
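
The polling step can be wrapped in a small loop with a timeout. The Python sketch below assumes a caller-supplied fetch function that performs GET /speakings/{id} and returns the decoded JSON as a dict; that function, the helper name wait_until_active, and the default timeout and interval are all assumptions of this example, not documented values.

```python
import time


def wait_until_active(fetch, speaking_id, timeout=15.0, interval=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the session is active and return the last response dict.

    `fetch(speaking_id)` is supplied by the caller and must return the
    decoded JSON of GET /speakings/{id}.
    """
    deadline = clock() + timeout
    while True:
        speaking = fetch(speaking_id)
        status = speaking.get("status")
        if status == "active":
            return speaking
        if status == "stopped":
            # The call may have hung up during setup; polling further is pointless.
            raise RuntimeError("speaking session stopped before becoming active")
        if clock() >= deadline:
            raise TimeoutError(f"session still {status!r} after {timeout}s")
        sleep(interval)
```

The sleep and clock parameters exist only so the loop can be exercised in tests without real delays.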

Send Text to Speak

Once the session is active, send text to be synthesized and played into the call.

$ curl --location --request POST 'https://api.voipbin.net/v1.0/speakings/a1b2c3d4-e5f6-7890-abcd-ef1234567890/say?token=<YOUR_AUTH_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "text": "Hello! This is your AI agent speaking. How can I help you today?"
}'

{
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "status": "active",
    "reference_type": "call",
    "reference_id": "12f8f1c9-a6c3-4f81-93db-ae445dcf188f"
}

You can call /say multiple times to queue additional speech segments. Each segment is synthesized and played in order. Maximum text length per request is 5,000 characters.
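
Queuing several segments amounts to one /say request per segment, sent in order. The Python sketch below illustrates this with a caller-supplied post function that sends the request and returns the decoded JSON response; that function and the name say_all are hypothetical. All segments are validated before anything is sent, so a bad segment does not leave a partially queued utterance.

```python
MAX_SAY_CHARS = 5000  # per-request maximum stated above


def say_all(post, speaking_id, segments):
    """Queue each segment with its own POST /speakings/{id}/say call.

    `post(path, body)` is supplied by the caller; it sends the request
    and returns the decoded JSON response.
    """
    segments = list(segments)
    # Validate everything first so nothing is sent if any segment is bad.
    for text in segments:
        if not text.strip():
            raise ValueError("empty text would be rejected with 400 Bad Request")
        if len(text) > MAX_SAY_CHARS:
            raise ValueError(f"segment exceeds {MAX_SAY_CHARS} characters")
    return [post(f"/speakings/{speaking_id}/say", {"text": text})
            for text in segments]
```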

Choose a TTS Provider

Specify the provider field when creating a session to select a TTS engine. If omitted, ElevenLabs is used by default.

ElevenLabs (default):

{
    "reference_type": "call",
    "reference_id": "<call-id>",
    "language": "en-US",
    "provider": "elevenlabs",
    "direction": "both"
}

ElevenLabs provides high-quality neural voices with natural intonation.

Google Cloud TTS:

{
    "reference_type": "call",
    "reference_id": "<call-id>",
    "language": "en-US",
    "provider": "gcp",
    "direction": "both"
}

Google Cloud TTS offers wide language support with WaveNet and Neural2 voices.

Amazon Polly:

{
    "reference_type": "call",
    "reference_id": "<call-id>",
    "language": "en-US",
    "provider": "aws",
    "direction": "both"
}

Amazon Polly provides neural and standard voices with SSML support.

Select a Voice

Use the voice_id field to choose a specific voice from the provider’s voice library.

$ curl --location --request POST 'https://api.voipbin.net/v1.0/speakings?token=<YOUR_AUTH_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "reference_type": "call",
    "reference_id": "12f8f1c9-a6c3-4f81-93db-ae445dcf188f",
    "language": "en-US",
    "provider": "elevenlabs",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "direction": "both"
}'

{
    "id": "b2c3d4e5-f6a7-8901-bcde-f23456789012",
    "provider": "elevenlabs",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "status": "initiating",
    ...
}

The voice_id is provider-specific. Obtain available voice IDs from your TTS provider’s documentation. If omitted, the provider’s default voice for the specified language is used.

Control Audio Direction

The direction field controls who hears the synthesized speech.

Both directions (announcements):

Use "direction": "both" when both parties should hear the speech. Suitable for announcements, greetings, or AI agent conversations.

{
    "direction": "both"
}

Outgoing only (agent coaching):

Use "direction": "out" so only the local party (callee) hears the speech. The remote caller does not. Suitable for real-time agent coaching or whisper prompts.

{
    "direction": "out"
}

Incoming only (IVR replacement):

Use "direction": "in" so only the remote caller hears the speech. The local party does not. Suitable for IVR-style prompts or one-sided announcements.

{
    "direction": "in"
}

Flush the Speech Queue

Clear queued text and stop the currently playing audio.

$ curl --location --request POST 'https://api.voipbin.net/v1.0/speakings/a1b2c3d4-e5f6-7890-abcd-ef1234567890/flush?token=<YOUR_AUTH_TOKEN>'

{
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "status": "active",
    "reference_type": "call",
    "reference_id": "12f8f1c9-a6c3-4f81-93db-ae445dcf188f"
}

Use flush to implement barge-in behavior—when the user starts speaking, flush the queue and listen instead. After flushing, you can send new text via /say to continue the conversation.
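
A barge-in handler can be sketched as a flush followed by an optional new say request. The Python example below assumes a caller-supplied post function that sends a POST request for the given path and body; that function and the name barge_in are hypothetical, and how user speech is detected is outside the scope of this sketch.

```python
def barge_in(post, speaking_id, reply_text=None):
    """Flush queued and playing speech, then optionally queue a reply.

    `post(path, body)` is supplied by the caller and sends a POST
    request; `body` is None for requests without a JSON body.
    """
    # Stop current playback and drop anything still queued.
    post(f"/speakings/{speaking_id}/flush", None)
    if reply_text:
        # The session stays active after a flush, so new text can follow.
        return post(f"/speakings/{speaking_id}/say", {"text": reply_text})
    return None
```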

Attach Speaking to a Conference

Attach TTS to a conference so all participants hear the synthesized speech.

$ curl --location --request POST 'https://api.voipbin.net/v1.0/speakings?token=<YOUR_AUTH_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "reference_type": "confbridge",
    "reference_id": "c0d1e2f3-a4b5-6c7d-8e9f-0a1b2c3d4e5f",
    "language": "en-US",
    "provider": "elevenlabs",
    "direction": "both"
}'

{
    "id": "d4e5f6a7-b8c9-0123-def4-567890123456",
    "reference_type": "confbridge",
    "reference_id": "c0d1e2f3-a4b5-6c7d-8e9f-0a1b2c3d4e5f",
    "language": "en-US",
    "provider": "elevenlabs",
    "voice_id": "",
    "direction": "both",
    "status": "initiating",
    "tm_create": "2025-06-15 15:00:00.123456",
    "tm_update": "2025-06-15 15:00:00.123456",
    "tm_delete": "9999-01-01 00:00:00.000000"
}

The reference_id is a conference ID obtained from GET /conferences. All conference participants hear the synthesized speech when direction is both.

Stop and Delete a Speaking Session

When finished, stop the session first, then delete it.

Stop the session:

$ curl --location --request POST 'https://api.voipbin.net/v1.0/speakings/a1b2c3d4-e5f6-7890-abcd-ef1234567890/stop?token=<YOUR_AUTH_TOKEN>'

{
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "status": "stopped",
    ...
}

Delete the session:

$ curl --location --request DELETE 'https://api.voipbin.net/v1.0/speakings/a1b2c3d4-e5f6-7890-abcd-ef1234567890?token=<YOUR_AUTH_TOKEN>'

Always stop a session before deleting it. If the call is hung up, the speaking session is automatically stopped, but you should still delete it to clean up resources.
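
The stop-then-delete order can be enforced with a context manager so that cleanup also runs when the surrounding code raises. The following Python sketch assumes a caller-supplied request function that sends an HTTP request for a method and path; that function and the name speaking_session are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def speaking_session(request, speaking_id):
    """Yield the session ID, then stop and delete the session on exit.

    `request(method, path)` is supplied by the caller and sends the
    HTTP request; it is an assumption of this sketch.
    """
    try:
        yield speaking_id
    finally:
        try:
            request("POST", f"/speakings/{speaking_id}/stop")
        finally:
            # Attempt the delete even if the stop request raised, so the
            # session is not left behind.
            request("DELETE", f"/speakings/{speaking_id}")
```

A typical use is `with speaking_session(request, sid): ...`, with the say and flush calls placed inside the block.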

Troubleshooting

  • 400 Bad Request:
    • Cause: Invalid language code, empty text in /say, or missing required fields (reference_type, reference_id).

    • Fix: Verify language is a valid BCP47 code (e.g., en-US). Ensure text in /say is non-empty and under 5,000 characters.

  • 404 Not Found:
    • Cause: The speaking session ID does not exist or belongs to another customer.

    • Fix: Verify the UUID was obtained from a recent POST /speakings or GET /speakings response.

  • 409 Conflict:
    • Cause: Another speaking session is already active on this call, or the call is not in progressing status.

    • Fix: Stop the existing session via POST /speakings/{id}/stop first. Verify the call status is progressing via GET /calls/{id}.
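
The status codes above can be attached to client-side errors so that logs carry the suggested fix. The Python sketch below shows one way to do this. The exception class SpeakingAPIError is hypothetical, and the hint texts are paraphrased from the troubleshooting list rather than returned by the API.

```python
# Hints paraphrased from the troubleshooting list above.
REMEDIES = {
    400: "Check the language code, the required fields, and the /say text.",
    404: "Re-check the session ID against POST /speakings or GET /speakings.",
    409: "Stop the existing session and confirm the call is progressing.",
}


class SpeakingAPIError(Exception):
    """Error carrying the HTTP status and a hint for documented codes."""

    def __init__(self, status_code):
        self.status_code = status_code
        self.hint = REMEDIES.get(
            status_code, "Undocumented status; inspect the response body.")
        super().__init__(f"HTTP {status_code}: {self.hint}")
```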