Tutorial
Prerequisites
Before building a custom AI voice agent, you need:
An authentication token (String). Obtain one via POST /auth/login or use an access key from GET /accesskeys.
A source phone number in E.164 format (e.g., +15551234567). Obtain one via GET /numbers.
A destination phone number or extension to call.
A webhook URL or WebSocket connection to receive transcript_created events. Configure your webhook URL via PUT https://api.voipbin.net/v1.0/customer.
An AI backend (LLM, RAG, or NLP system) that can receive text and return a response.
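Because the source number must be in E.164 format, it can help to check numbers before sending them to the API. The following is a minimal sketch: the helper name is_e164 is hypothetical and not part of the VoIPBIN API, and the pattern checks only the general E.164 shape (a plus sign followed by at most 15 digits, the first of which is nonzero), not whether the number actually exists.

```python
import re

# General E.164 shape: "+", one nonzero digit, then 1 to 14 more digits.
_E164_PATTERN = re.compile(r"\+[1-9]\d{1,14}")


def is_e164(number: str) -> bool:
    """Return True if the whole string looks like an E.164 number."""
    return _E164_PATTERN.fullmatch(number) is not None
```

For example, is_e164("+15551234567") accepts the sample number above, while a number without the leading plus sign is rejected.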
Note
AI Implementation Hint
The call must reach progressing status (answered, audio flowing) before you can start transcription or speaking sessions. If the call is still in dialing or ringing status, the API will reject the request.
Step 1: Create an Outbound Call
Create a call to the destination. The call must be answered before starting STT/TTS.
Request:
$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "source": {
            "type": "tel",
            "target": "+15551234567"
        },
        "destinations": [
            {
                "type": "tel",
                "target": "+15559876543"
            }
        ],
        "flow_id": "00000000-0000-0000-0000-000000000000"
    }'
Response (200 OK):
{
    "calls": [],
    "groupcalls": [
        {
            "id": "c1d2e3f4-a5b6-7890-cdef-123456789012",
            "status": "progressing",
            "source": {
                "type": "tel",
                "target": "+15551234567"
            },
            "destinations": [
                {
                    "type": "tel",
                    "target": "+15559876543"
                }
            ],
            "call_ids": [
                "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
            ],
            ...
        }
    ]
}
The response contains calls and groupcalls arrays. Extract your call_id from the call_ids field inside the groupcall object (e.g., groupcalls[0].call_ids[0]). This is the UUID you will use for transcription and speaking sessions. Wait for the call to reach progressing status by listening for a call_progressing webhook event, or poll GET /calls/{call_id} until status is progressing.
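The extraction and polling described above could be sketched as follows. This is an illustration under stated assumptions, not an official client: the helper names extract_call_ids and wait_for_status are hypothetical, the attempts and delay defaults are arbitrary, and fetch_status stands for any callable you supply that performs GET /calls/{call_id} and returns the status field.

```python
import time
from typing import Callable, List


def extract_call_ids(response: dict) -> List[str]:
    """Collect every call ID listed under the groupcalls array."""
    call_ids: List[str] = []
    for groupcall in response.get("groupcalls", []):
        call_ids.extend(groupcall.get("call_ids", []))
    return call_ids


def wait_for_status(
    fetch_status: Callable[[], str],
    target: str = "progressing",
    attempts: int = 20,          # illustrative default, not a documented value
    delay: float = 0.5,          # seconds between polls, also illustrative
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll fetch_status() until it returns target or attempts run out."""
    for _ in range(attempts):
        if fetch_status() == target:
            return True
        sleep(delay)
    return False
```

Passing the status lookup in as a callable keeps the polling logic independent of the HTTP library and makes it easy to exercise without a live call.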
Step 2: Start Transcription
Once the call is in progressing status, start a transcription session.
Request:
$ curl --location --request POST 'https://api.voipbin.net/v1.0/transcribes?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "reference_type": "call",
        "reference_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "language": "en-US",
        "direction": "both"
    }'
Response (200 OK):
{
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "customer_id": "7c4d2f3a-1b8e-4f5c-9a6d-3e2f1a0b4c5d",
    "reference_type": "call",
    "reference_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "status": "progressing",
    "language": "en-US",
    "direction": "both",
    "tm_create": "2026-01-15T09:30:00.000000Z",
    "tm_update": "2026-01-15T09:30:00.000000Z",
    "tm_delete": null
}
Save the id value — this is your transcribe_id (UUID).
Step 3: Create a Speaking Session
Create a speaking session on the same call so you can inject TTS audio.
Request:
$ curl --location --request POST 'https://api.voipbin.net/v1.0/speakings?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "reference_type": "call",
        "reference_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "language": "en-US",
        "provider": "elevenlabs",
        "voice_id": "21m00Tcm4TlvDq8ikWAM",
        "direction": "both"
    }'
Response (201 Created):
{
    "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
    "customer_id": "7c4d2f3a-1b8e-4f5c-9a6d-3e2f1a0b4c5d",
    "reference_type": "call",
    "reference_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "language": "en-US",
    "provider": "elevenlabs",
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "direction": "both",
    "status": "initiating",
    "tm_create": "2026-01-15T09:30:01.000000Z",
    "tm_update": "2026-01-15T09:30:01.000000Z",
    "tm_delete": null
}
Save the id value — this is your speaking_id (UUID). Wait for the speaking_started webhook event or poll GET /speakings/{speaking_id} until status is active before sending text.
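If you prefer the webhook route over polling, one option is to queue the greeting when the session is created and release it when speaking_started arrives. The sketch below assumes, without confirmation from this page, that the event payload carries the speaking session ID in data.id; the class name PendingGreetings and the say callback are hypothetical, and say stands for whatever function you use to call POST /speakings/{speaking_id}/say.

```python
from typing import Callable, Dict


class PendingGreetings:
    """Holds one greeting per speaking session until the session starts."""

    def __init__(self, say: Callable[[str, str], None]) -> None:
        self._say = say                      # say(speaking_id, text)
        self._pending: Dict[str, str] = {}

    def queue(self, speaking_id: str, text: str) -> None:
        """Remember the greeting to send once the session is active."""
        self._pending[speaking_id] = text

    def on_event(self, payload: dict) -> bool:
        """Send the queued greeting on speaking_started; report if sent."""
        if payload.get("type") != "speaking_started":
            return False
        # Assumption: data.id is the speaking session ID.
        speaking_id = payload.get("data", {}).get("id")
        text = self._pending.pop(speaking_id, None)
        if text is None:
            return False
        self._say(speaking_id, text)
        return True
```

Popping the entry before sending means a duplicate speaking_started delivery does not repeat the greeting.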
Step 4: Send an Initial Greeting
Once the speaking session is active, send an initial greeting to the caller.
Request:
$ curl --location --request POST 'https://api.voipbin.net/v1.0/speakings/b2c3d4e5-f6a7-8901-bcde-f12345678901/say?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "text": "Hello, thank you for calling. How can I help you today?"
    }'
Response (200 OK):
{
    "id": "b2c3d4e5-f6a7-8901-bcde-f12345678901",
    "status": "active",
    ...
}
The caller now hears the greeting. VoIPBIN synthesizes the text using the configured TTS provider and voice.
Step 5: Receive and Process Transcripts
When the caller responds, VoIPBIN delivers a transcript_created event to your webhook URL.
Webhook Payload:
{
    "type": "transcript_created",
    "data": {
        "id": "9d59e7f0-7bdc-4c52-bb8c-bab718952050",
        "transcribe_id": "550e8400-e29b-41d4-a716-446655440000",
        "direction": "in",
        "message": "Hi, I need help with my account balance.",
        "tm_transcript": "0001-01-01 00:00:08.991840",
        "tm_create": "2026-01-15T09:30:15.000000Z"
    }
}
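A typed view of this payload can make the handler easier to reason about. The sketch below is illustrative only: the Transcript dataclass and parse_transcript_event are hypothetical names, and treating tm_transcript as an offset counted from 0001-01-01 00:00:00 is an assumption inferred from the sample value above, not something this page states.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed zero point for tm_transcript, inferred from the sample payload.
_ZERO = datetime(1, 1, 1)


@dataclass(frozen=True)
class Transcript:
    id: str
    transcribe_id: str
    direction: str
    message: str
    offset: timedelta  # assumed position of the utterance in the session


def parse_transcript_event(payload: dict) -> Optional[Transcript]:
    """Return a Transcript for transcript_created events, else None."""
    if payload.get("type") != "transcript_created":
        return None
    data = payload["data"]
    stamp = datetime.strptime(data["tm_transcript"], "%Y-%m-%d %H:%M:%S.%f")
    return Transcript(
        id=data["id"],
        transcribe_id=data["transcribe_id"],
        direction=data["direction"],
        message=data["message"],
        offset=stamp - _ZERO,
    )
```

With the sample payload, the parsed offset is a little under nine seconds, which would place the utterance about nine seconds into the session if the assumption holds.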
Process in your AI backend:
# Python example: webhook handler
from flask import Flask, request, jsonify
import requests

VOIPBIN_TOKEN = "<YOUR_AUTH_TOKEN>"
TRANSCRIBE_ID = "550e8400-e29b-41d4-a716-446655440000"
SPEAKING_ID = "b2c3d4e5-f6a7-8901-bcde-f12345678901"

app = Flask(__name__)


@app.route('/webhook', methods=['POST'])
def handle_webhook():
    payload = request.get_json()
    if payload.get('type') == 'transcript_created':
        transcript = payload['data']
        # Only process caller's speech (direction: "in")
        if transcript['direction'] == 'in':
            caller_text = transcript['message']
            # Send to your LLM
            ai_response = call_your_llm(caller_text)
            # Send AI response back to the call via Speaking API
            requests.post(
                f'https://api.voipbin.net/v1.0/speakings/{SPEAKING_ID}/say?token={VOIPBIN_TOKEN}',
                json={'text': ai_response}
            )
    return jsonify({'status': 'ok'}), 200


def call_your_llm(text):
    # Replace with your actual LLM call
    # e.g., OpenAI, Anthropic, local model, etc.
    return f"I understand you need help with: {text}"
Step 6: Handle Interruptions
If the caller speaks while TTS is playing, you receive a new transcript_created event. Flush the current playback before sending the new response.
@app.route('/webhook', methods=['POST'])
def handle_webhook():
    payload = request.get_json()
    if payload.get('type') == 'transcript_created':
        transcript = payload['data']
        if transcript['direction'] == 'in':
            # Flush current TTS playback
            requests.post(
                f'https://api.voipbin.net/v1.0/speakings/{SPEAKING_ID}/flush?token={VOIPBIN_TOKEN}'
            )
            # Process and respond
            ai_response = call_your_llm(transcript['message'])
            requests.post(
                f'https://api.voipbin.net/v1.0/speakings/{SPEAKING_ID}/say?token={VOIPBIN_TOKEN}',
                json={'text': ai_response}
            )
    return jsonify({'status': 'ok'}), 200
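One gap the handler above leaves open: the flush happens before the LLM call, so if the caller speaks again while a slow LLM request is still running, the older response could be spoken after the newer speech. A possible mitigation, offered here as a sketch rather than anything VoIPBIN prescribes, is to number caller turns per speaking session and skip the say request when a newer turn has started. The class name TurnTracker and its methods are hypothetical.

```python
import threading
from typing import Dict


class TurnTracker:
    """Tracks the newest caller turn for each speaking session."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Dict[str, int] = {}

    def begin_turn(self, speaking_id: str) -> int:
        """Register a new caller turn and return its sequence number."""
        with self._lock:
            turn = self._latest.get(speaking_id, 0) + 1
            self._latest[speaking_id] = turn
            return turn

    def is_current(self, speaking_id: str, turn: int) -> bool:
        """True if no newer turn has begun since this one."""
        with self._lock:
            return self._latest.get(speaking_id) == turn
```

In a handler, you would call begin_turn before flushing, run the LLM, and send the say request only if is_current still returns True for that turn.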
Step 7: Clean Up on Call End
When the call ends, stop the transcription and speaking sessions to release resources.
Listen for the call_hangup webhook event and clean up:
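# Add this branch inside handle_webhook(), next to the transcript_created branch
if payload.get('type') == 'call_hangup':
    # ID of the call that ended; use it to look up sessions if you track several calls
    call_id = payload['data']['id']
    # Stop transcription
    requests.post(
        f'https://api.voipbin.net/v1.0/transcribes/{TRANSCRIBE_ID}/stop?token={VOIPBIN_TOKEN}'
    )
    # Stop speaking
    requests.post(
        f'https://api.voipbin.net/v1.0/speakings/{SPEAKING_ID}/stop?token={VOIPBIN_TOKEN}'
    )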
Note
AI Implementation Hint
Transcription and speaking sessions may stop automatically when the call ends. However, explicitly stopping them ensures immediate resource cleanup and prevents charges for idle sessions.
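Because a session may already have stopped by the time your handler runs, it seems prudent to attempt both stop requests independently so that a failure on one does not skip the other. The sketch below assumes nothing about how the API responds to a stop on an already-stopped session; stop_sessions is a hypothetical helper, and post stands for a callable you provide that sends a POST to the given path and raises on failure.

```python
from typing import Callable, Dict, List


def stop_sessions(session: Dict[str, str], post: Callable[[str], None]) -> List[str]:
    """Attempt both stop requests; return the paths that raised an error."""
    paths = [
        f"/transcribes/{session['transcribe_id']}/stop",
        f"/speakings/{session['speaking_id']}/stop",
    ]
    failed: List[str] = []
    for path in paths:
        try:
            post(path)
        except Exception:
            # Keep going so the remaining session still gets stopped.
            failed.append(path)
    return failed
```

The returned list can be logged or retried later, depending on how strictly you need to confirm cleanup.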
Complete Python Example
This example combines all steps into a complete webhook server that implements a custom AI voice agent.
from flask import Flask, request, jsonify
import requests
import os
import time

app = Flask(__name__)

VOIPBIN_BASE = "https://api.voipbin.net/v1.0"
TOKEN = os.environ.get("VOIPBIN_TOKEN")

# Store session IDs per call (in production, use a database)
call_sessions = {}


@app.route('/webhook', methods=['POST'])
def webhook():
    payload = request.get_json()
    event_type = payload.get('type')
    if event_type == 'call_progressing':
        handle_call_progressing(payload['data'])
    elif event_type == 'transcript_created':
        handle_transcript(payload['data'])
    elif event_type == 'call_hangup':
        handle_call_hangup(payload['data'])
    return jsonify({'status': 'ok'}), 200


def wait_until_active(speaking_id, attempts=10, delay=0.5):
    """Poll GET /speakings/{speaking_id} until status is active."""
    # attempts and delay are illustrative values, not documented limits
    for _ in range(attempts):
        resp = requests.get(f'{VOIPBIN_BASE}/speakings/{speaking_id}?token={TOKEN}')
        if resp.json().get('status') == 'active':
            return True
        time.sleep(delay)
    return False


def handle_call_progressing(call):
    """Call answered: start transcription and speaking."""
    call_id = call['id']
    # Start transcription
    resp = requests.post(
        f'{VOIPBIN_BASE}/transcribes?token={TOKEN}',
        json={
            'reference_type': 'call',
            'reference_id': call_id,
            'language': 'en-US',
            'direction': 'both'
        }
    )
    transcribe_id = resp.json()['id']
    # Create speaking session
    resp = requests.post(
        f'{VOIPBIN_BASE}/speakings?token={TOKEN}',
        json={
            'reference_type': 'call',
            'reference_id': call_id,
            'language': 'en-US',
            'provider': 'elevenlabs',
            'direction': 'both'
        }
    )
    speaking_id = resp.json()['id']
    call_sessions[call_id] = {
        'transcribe_id': transcribe_id,
        'speaking_id': speaking_id
    }
    # Send initial greeting once the speaking session is active
    if wait_until_active(speaking_id):
        requests.post(
            f'{VOIPBIN_BASE}/speakings/{speaking_id}/say?token={TOKEN}',
            json={'text': 'Hello, how can I help you today?'}
        )


def handle_transcript(transcript):
    """Process caller speech and respond."""
    if transcript['direction'] != 'in':
        return
    # Find the speaking session for this transcribe
    transcribe_id = transcript['transcribe_id']
    session = next(
        (s for s in call_sessions.values() if s['transcribe_id'] == transcribe_id),
        None
    )
    if not session:
        return
    speaking_id = session['speaking_id']
    # Flush any current playback
    requests.post(
        f'{VOIPBIN_BASE}/speakings/{speaking_id}/flush?token={TOKEN}'
    )
    # Generate AI response
    ai_response = call_your_llm(transcript['message'])
    # Send response to caller
    requests.post(
        f'{VOIPBIN_BASE}/speakings/{speaking_id}/say?token={TOKEN}',
        json={'text': ai_response}
    )


def handle_call_hangup(call):
    """Clean up sessions when call ends."""
    call_id = call['id']
    session = call_sessions.pop(call_id, None)
    if session:
        requests.post(
            f'{VOIPBIN_BASE}/transcribes/{session["transcribe_id"]}/stop?token={TOKEN}'
        )
        requests.post(
            f'{VOIPBIN_BASE}/speakings/{session["speaking_id"]}/stop?token={TOKEN}'
        )


def call_your_llm(text):
    """Replace with your actual LLM integration."""
    # Example: OpenAI, Anthropic, local model, etc.
    return f"I understand you said: {text}. Let me help you with that."


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
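The example above finds a session by scanning every entry in call_sessions for a matching transcribe_id. If you handle many concurrent calls, a second index keyed by transcribe_id avoids that scan. The following is a sketch of one way to do it; SessionStore and its method names are hypothetical, and it is an in-memory structure only, so the earlier advice to use a database in production still applies.

```python
from typing import Dict, Optional


class SessionStore:
    """In-memory session lookup by call ID or by transcribe ID."""

    def __init__(self) -> None:
        self._by_call: Dict[str, Dict[str, str]] = {}
        self._call_by_transcribe: Dict[str, str] = {}

    def add(self, call_id: str, transcribe_id: str, speaking_id: str) -> None:
        """Record both session IDs for a call and index the transcribe ID."""
        self._by_call[call_id] = {
            "transcribe_id": transcribe_id,
            "speaking_id": speaking_id,
        }
        self._call_by_transcribe[transcribe_id] = call_id

    def by_transcribe(self, transcribe_id: str) -> Optional[Dict[str, str]]:
        """Return the session for a transcribe ID, or None if unknown."""
        call_id = self._call_by_transcribe.get(transcribe_id)
        return self._by_call.get(call_id) if call_id is not None else None

    def pop(self, call_id: str) -> Optional[Dict[str, str]]:
        """Remove and return the session for a call, keeping indexes in sync."""
        session = self._by_call.pop(call_id, None)
        if session is not None:
            self._call_by_transcribe.pop(session["transcribe_id"], None)
        return session
```

In the webhook server, by_transcribe would replace the next(...) scan in handle_transcript, and pop would replace call_sessions.pop in handle_call_hangup.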