Transcribe
Real-time speech-to-text transcription for voice calls, converting spoken audio into text during active conversations.
API Reference: Transcribe endpoints
Overview
Note
AI Context
Complexity: Medium
Cost: Chargeable (per minute of audio transcribed)
Async: Yes.
POST /transcribes returns immediately with status progressing. Transcripts are delivered asynchronously via webhook (transcript_created events) or WebSocket subscription. Poll GET /transcribes/{id} to check for done status when complete.
VoIPBIN’s Transcription API converts spoken audio from calls and conferences into text in real-time. Whether you need transcripts for compliance, searchable call logs, AI analysis, or accessibility, the Transcription API delivers accurate text as conversations happen.
With the Transcription API you can:
Transcribe calls and conferences in real-time
Distinguish between incoming and outgoing speech
Receive transcripts via webhooks or WebSocket
Support 70+ languages and regional variants
Integrate with AI systems for sentiment analysis and summarization
How Transcription Works
When you start transcription, VoIPBIN captures audio from the call or conference, sends it to a speech-to-text (STT) engine, and delivers the resulting text to your application.
Transcription Architecture
+--------+ +----------------+ +------------+
| Call |--audio-->| STT |--text-->| Webhook |
+--------+ | Engine | | or |
+----------------+ | WebSocket |
+------------+ | +------------+
| Conference |--audio----+ |
+------------+ v
+------------+
| Your App |
+------------+
Key Components
Audio Source: The call or conference being transcribed
STT Engine: Google Cloud Speech-to-Text or Amazon Transcribe (selectable per request)
Delivery: Webhooks (push) or WebSocket (subscribe) to your application
Transcription Types
| Type | Description |
|---|---|
| Call Transcription | Transcribes a single call with direction detection |
| Conference Transcription | Transcribes all participants (direction indicates speaker relative to conference) |
Transcription Lifecycle
Transcription runs continuously while active, generating transcript segments as speech is detected.
Lifecycle Diagram
POST /transcribes or flow action
|
v
+-------------+ +-------------+
| starting |------active------------>| transcribing|
+-------------+ +------+------+
|
POST /transcribe_stop, hangup, or timeout
|
v
+-------------+
| stopped |
+-------------+
State Descriptions
| State | What's happening |
|---|---|
| starting | Transcription initialization. STT engine is connecting. |
| transcribing | Actively processing audio. Transcripts are being generated. |
| stopped | Transcription has ended. No more transcripts will be generated. |
Transcript Delivery Flow
Call Audio VoIPBIN STT Your App
| | |
|====audio chunk====>| |
| | process |
| |----+ |
| |<---+ |
| | |
| | transcript_created |
| +------------------->|
| | |
|====audio chunk====>| |
| | process |
| +------------------->|
| | |
Each transcript segment is delivered as soon as speech is recognized, enabling real-time processing.
Starting Transcription
VoIPBIN provides two methods to start transcription based on your use case.
Note
AI Implementation Hint
The language parameter uses BCP47 codes (e.g., en-US, ko-KR). Using the wrong language code significantly degrades accuracy. If the speaker’s language is unknown, start with the most likely code and consider switching if results are poor. There is no auto-detect mode; you must specify a language explicitly.
Method 1: Via Flow Action
Use transcribe_start and transcribe_stop actions in your call flow for automatic control.
Your Flow VoIPBIN Your App
| | |
| transcribe_start action | |
+-------------------------->| |
| | Initialize STT |
| | |
| |<====audio stream==== |
| | |
| | transcript_created |
| +-------------------------->|
| | |
| transcribe_stop action | |
+-------------------------->| |
| | |
Example flow with transcription:
{
"actions": [
{
"type": "answer"
},
{
"type": "transcribe_start",
"option": {
"language": "en-US"
}
},
{
"type": "talk",
"option": {
"text": "Hello, how can I help you today?"
}
},
{
"type": "connect",
"option": {
"destinations": [{"type": "tel", "target": "+15551234567"}]
}
},
{
"type": "transcribe_stop"
}
]
}
See detail here.
Method 2: Via API (Interrupt Method)
Start transcription on an active call or conference programmatically.
Start transcription:
$ curl -X POST 'https://api.voipbin.net/v1.0/transcribes?token=<token>' \
--header 'Content-Type: application/json' \
--data '{
"reference_type": "call",
"reference_id": "8c71bcb6-e7e7-4ed2-8aba-44bc2deda9a5",
"language": "en-US",
"direction": "both",
"provider": "gcp"
}'
Parameters:

| Parameter | Description |
|---|---|
| reference_type | Type of resource to transcribe: call, recording, or confbridge |
| reference_id | ID of the call, recording, or conference being transcribed |
| language | BCP47 language code (e.g., en-US); required |
| direction | Which audio to transcribe: in, out, or both |
| provider | STT provider: gcp or aws (optional; defaults to automatic selection) |
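For application code, the same start request can be issued from Python using only the standard library. This is a sketch mirroring the curl example above; `build_transcribe_request` and `start_transcription` are illustrative helper names, not part of any VoIPBIN SDK.

```python
import json
import urllib.request

API_BASE = "https://api.voipbin.net/v1.0"

def build_transcribe_request(reference_id, language="en-US",
                             direction="both", provider="gcp"):
    """Request body matching the curl example above."""
    return {
        "reference_type": "call",
        "reference_id": reference_id,
        "language": language,    # BCP47 code; required (no auto-detect)
        "direction": direction,  # "in", "out", or "both"
        "provider": provider,    # optional: "gcp" or "aws"
    }

def start_transcription(token, reference_id, **options):
    """POST /transcribes and return the created transcribe session."""
    body = json.dumps(build_transcribe_request(reference_id, **options)).encode()
    req = urllib.request.Request(
        f"{API_BASE}/transcribes?token={token}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())  # includes the transcribe session "id"
```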
When to Use Each Method
| Method | Best for |
|---|---|
| Flow Action | Automated transcription based on call flow logic |
| API (Interrupt) | Dynamic control - start/stop based on external events |
Receiving Transcripts
VoIPBIN delivers transcripts to your application via webhooks or WebSocket subscription.
Webhook Event Types
VoIPBIN generates the following events during a transcription session:
| Event Type | Description |
|---|---|
| transcript_created | Final transcribed text segment |
| transcribe_speech_started | Voice activity detected (speaker began talking) |
| transcribe_speech_interim | Partial transcript while speaker is still talking |
| transcribe_speech_ended | Voice activity ended (speaker stopped talking) |
The transcript_created event delivers final, complete transcript segments. The speech events provide real-time voice activity detection and interim results, useful for AI voice agent integrations. See Speech Webhook Message for the speech event payload structure.
Webhook Delivery (Push)
Configure a webhook URL in your customer settings to receive transcript_created events automatically.
VoIPBIN Your App
| |
| POST /your-webhook-endpoint |
| {transcript_created event} |
+-------------------------------->|
| |
| 200 OK |
|<--------------------------------+
| |
Webhook Payload:
{
"type": "transcript_created",
"data": {
"id": "9d59e7f0-7bdc-4c52-bb8c-bab718952050",
"transcribe_id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
"direction": "out",
"message": "Hello, this is transcribe test call.",
"tm_transcript": "0001-01-01 00:00:08.991840",
"tm_create": "2024-04-04 07:15:59.233415"
}
}
WebSocket Subscription (Subscribe)
Subscribe to transcript events via WebSocket for real-time streaming.
Your App VoIPBIN
| |
| WebSocket connect |
+-------------------------------->|
| |
| Subscribe to transcript events |
+-------------------------------->|
| |
|<======= transcript events =====>|
|<======= transcript events =====>|
| |
| Unsubscribe |
+-------------------------------->|
| |
Comparison: Webhook vs WebSocket
| Aspect | Webhook | WebSocket |
|---|---|---|
| Connection | VoIPBIN initiates POST | Your app maintains connection |
| Latency | Higher (HTTP overhead) | Lower (persistent connection) |
| Reliability | Retry on failure | Must handle reconnection |
| Best for | Simple integration, batch processing | Real-time UI, low-latency applications |
Understanding Transcript Direction
Each transcript includes a direction field indicating whether the speech was incoming or outgoing relative to VoIPBIN.
Direction Detection
+----------+ +---------+
| Caller |-----> direction: "in" ----->| VoIPBIN |
| | | |
| |<---- direction: "out" <-----| |
+----------+ +---------+
Example Conversation:
[in] "Hello, I need help with my account"
[out] "Sure, I can help you with that"
[in] "My account number is 12345"
[out] "Let me look that up for you"
Direction Values

| Direction | Description |
|---|---|
| in | Incoming speech toward VoIPBIN (what the caller said) |
| out | Outgoing speech from VoIPBIN (TTS, recordings, or the other party in the call) |
Transcript Data Structure:
[
{
"id": "06af78f0-b063-48c0-b22d-d31a5af0aa88",
"transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
"direction": "in",
"message": "Hi, good to see you. How are you today.",
"tm_transcript": "0001-01-01 00:01:04.441160",
"tm_create": "2024-04-01 07:22:07.229309"
},
{
"id": "3c95ea10-a5b7-4a68-aebf-ed1903baf110",
"transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
"direction": "out",
"message": "Welcome to the transcribe test scenario.",
"tm_transcript": "0001-01-01 00:00:43.116830",
"tm_create": "2024-04-01 07:17:27.208337"
}
]
Working with Transcripts
Timestamp Fields
| Field | Description |
|---|---|
| tm_transcript | Time offset within the call when speech occurred |
| tm_create | Absolute timestamp when transcript was created |
Combining Transcripts into Conversation
To reconstruct a conversation, sort transcripts by tm_transcript:
Transcripts received (order of delivery):
[out] 00:00:05 "Welcome to VoIPBIN support"
[in] 00:00:12 "Hi, I have a billing question"
[out] 00:00:18 "I'd be happy to help"
[in] 00:00:08 "Hello?"
Sorted by tm_transcript:
[out] 00:00:05 "Welcome to VoIPBIN support"
[in] 00:00:08 "Hello?"
[in] 00:00:12 "Hi, I have a billing question"
[out] 00:00:18 "I'd be happy to help"
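This sorting step can be sketched in Python. Because tm_transcript values are zero-padded ("0001-01-01 HH:MM:SS.ffffff"), plain lexicographic string sorting gives chronological order; the helper names are illustrative.

```python
def sort_conversation(transcripts):
    """Order transcript segments by speech time.

    tm_transcript is zero-padded ("0001-01-01 HH:MM:SS.ffffff"), so
    lexicographic sorting of the raw strings is chronological.
    """
    return sorted(transcripts, key=lambda t: t["tm_transcript"])

def format_conversation(transcripts):
    """Render the sorted segments as "[direction] message" lines."""
    return "\n".join(
        f"[{t['direction']}] {t['message']}"
        for t in sort_conversation(transcripts)
    )
```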
Storing Transcripts
For long-term storage, consider:
Store raw transcripts with all metadata
Index by transcribe_id to group by session
Use direction for speaker attribution
Create searchable text indexes on the message field
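The storage practices above can be sketched with the standard-library sqlite3 module. The schema and helper names are assumptions for illustration, not part of VoIPBIN; using the transcript id as primary key also makes re-delivered webhook events harmless.

```python
import sqlite3

def init_store(path=":memory:"):
    """Create a transcript table indexed by session, keyed by transcript id."""
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS transcripts (
            id TEXT PRIMARY KEY,          -- enables idempotent inserts
            transcribe_id TEXT NOT NULL,  -- groups segments by session
            direction TEXT NOT NULL,      -- speaker attribution
            message TEXT NOT NULL,
            tm_transcript TEXT NOT NULL,
            tm_create TEXT NOT NULL
        )
    """)
    db.execute(
        "CREATE INDEX IF NOT EXISTS idx_session ON transcripts (transcribe_id)"
    )
    return db

def store_transcript(db, t):
    # INSERT OR IGNORE: a duplicate webhook delivery is silently skipped.
    db.execute(
        "INSERT OR IGNORE INTO transcripts VALUES (?, ?, ?, ?, ?, ?)",
        (t["id"], t["transcribe_id"], t["direction"],
         t["message"], t["tm_transcript"], t["tm_create"]),
    )
    db.commit()
```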
Common Scenarios
Scenario 1: Real-Time Call Transcription
Transcribe a call from start to finish with webhook delivery.
Call starts
|
v
+--------------------+
| transcribe_start |
| language: "en-US" |
+--------+-----------+
|
v
+===================+
| Call in progress |------> transcript_created events
+===================+ to your webhook
|
v
+--------------------+
| Call ends |
| (auto-stop) |
+--------------------+
Scenario 2: Conference with Multiple Speakers
Transcribe all participants in a conference.
Conference
+-------------------------------------------------------+
| +------+ +------+ +------+ |
| |User A| |User B| |User C| |
| +--+---+ +--+---+ +--+---+ |
| | | | |
| +-----+-----+-----+-----+ |
| | |
| v |
| +------------+ |
| |Transcription|----> transcript_created events |
| +------------+ (direction indicates speaker)|
+-------------------------------------------------------+
Scenario 3: AI Integration
Send transcripts to an AI system for real-time analysis.
VoIPBIN Your App AI Service
| | |
| transcript_created | |
+--------------------->| |
| | Analyze sentiment |
| +--------------------->|
| | |
| | sentiment: positive |
| |<---------------------+
| | |
| | Update agent UI |
| | |
Scenario 4: Compliance Recording with Transcription
Combine recording and transcription for complete call documentation.
{
"actions": [
{"type": "answer"},
{"type": "recording_start"},
{"type": "transcribe_start", "option": {"language": "en-US"}},
{"type": "connect", "option": {"destinations": [...]}},
{"type": "transcribe_stop"},
{"type": "recording_stop"}
]
}
Supported Languages
VoIPBIN supports transcription in 70+ languages and regional variants. Specify the language using the language option (e.g., en-US, ko-KR).
Common Languages
| Language Code | Language |
|---|---|
| en-US | English (United States) |
| en-GB | English (United Kingdom) |
| es-ES | Spanish (Spain) |
| es-MX | Spanish (Mexico) |
| fr-FR | French (France) |
| de-DE | German (Germany) |
| it-IT | Italian (Italy) |
| pt-BR | Portuguese (Brazil) |
| ja-JP | Japanese (Japan) |
| ko-KR | Korean (South Korea) |
| zh-CN | Chinese (Mandarin) |
| ar-SA | Arabic (Saudi Arabia) |
| hi-IN | Hindi (India) |
| nl-NL | Dutch (Netherlands) |
| ru-RU | Russian (Russia) |
VoIPBIN supports 70+ languages including regional variants for Arabic, Spanish, English, and more. Contact support for the complete language list.
To ensure optimal transcription results, choose the code that best matches your speaker’s language and dialect.
Best Practices
1. Language Selection
Use the most specific regional variant (e.g., en-AU for Australian speakers, not just en-US)
Mismatched language codes significantly reduce accuracy
For multi-language calls, consider separate transcription sessions
2. Audio Quality
Clear audio produces better transcripts
Reduce background noise when possible
Avoid overlapping speech in group calls
3. Handling High Volume
Use WebSocket for real-time applications with many concurrent calls
Batch process webhooks for analytics workloads
Index transcripts for efficient searching
4. Storage and Compliance
Define retention policies for transcript data
Store transcripts with call metadata for context
Consider encryption for sensitive conversations
Troubleshooting
Transcription Not Starting
| Symptom | Solution |
|---|---|
| No transcribe_id returned | Verify the call/conference is in "progressing" status before starting transcription |
| Permission denied | Check that the API token has transcription permissions |
| Invalid language code | Verify the language code is in the supported list |
Poor Accuracy
| Symptom | Solution |
|---|---|
| Words frequently wrong | Check the language code matches the speaker's dialect |
| Missing words | Check audio quality - background noise or low volume reduces accuracy |
| Technical terms wrong | STT may not recognize domain-specific terms; consider post-processing |
Missing Transcripts
| Symptom | Solution |
|---|---|
| Webhook not receiving | Verify the webhook URL is configured in customer settings and is publicly accessible |
| WebSocket disconnects | Implement reconnection logic; check for network issues |
| Gaps in transcript | Silence or unclear audio produces no transcripts - this is expected behavior |
Webhook Delivery Issues
| Symptom | Solution |
|---|---|
| Events delayed | Check webhook endpoint response time; it should respond within 5 seconds |
| Duplicate events | Implement idempotency using the transcript id |
| Events out of order | Sort by tm_transcript, which reflects speech time |
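The duplicate-event guard can be sketched as a small wrapper around your handler. This is illustrative only; in production the seen ids should be persisted (e.g., in the transcript store) rather than held in memory.

```python
def make_idempotent(handler):
    """Wrap a transcript handler so re-delivered events are processed once.

    Deduplicates on the transcript's "id" field. The in-memory set is a
    sketch; persist seen ids for real deployments.
    """
    seen = set()

    def wrapped(transcript):
        if transcript["id"] in seen:
            return  # duplicate webhook delivery; already handled
        seen.add(transcript["id"])
        handler(transcript)

    return wrapped
```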
Tutorial
Before working with transcription, you need:
An authentication token. Obtain one via POST /auth/login or use an access key from GET /accesskeys.
An active call or conference in progressing status. Obtain the call ID via GET /calls or the conference ID via GET /conferences.
A BCP47 language code matching the speaker's language (e.g., en-US, ko-KR). See Supported Languages.
(Optional for recording transcription) A recording ID from GET /recordings.
Note
AI Implementation Hint
Transcription can only be started on a call or conference that is in progressing status (i.e., answered and active). For recording transcription, the recording must exist and be in ended status. The language parameter is required and must be a valid BCP47 code; there is no auto-detect mode.
Start Transcription with Flow Action
The easiest way to enable transcription is by adding a transcribe_start action to your call flow. This automatically begins transcription when the call reaches that action.
Create Call with Automatic Transcription:
$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
"source": {
"type": "tel",
"target": "+15551234567"
},
"destinations": [
{
"type": "tel",
"target": "+15559876543"
}
],
"actions": [
{
"type": "answer"
},
{
"type": "transcribe_start",
"option": {
"language": "en-US"
}
},
{
"type": "talk",
"option": {
"text": "This call is being transcribed for quality assurance",
"language": "en-US"
}
}
]
}'
Transcription starts when the call reaches the transcribe_start action and continues until the call ends.
Start Transcription via API (Manual)
For existing calls or conferences, start transcription manually by making an API request.
Transcribe an Active Call:
$ curl --location --request POST 'https://api.voipbin.net/v1.0/transcribes?token=<YOUR_AUTH_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
"reference_type": "call",
"reference_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"language": "en-US"
}'
{
"id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
"customer_id": "12345678-1234-1234-1234-123456789012",
"reference_type": "call",
"reference_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"language": "en-US",
"status": "progressing",
"tm_create": "2026-01-20 12:00:00.000000",
"tm_update": "2026-01-20 12:00:00.000000",
"tm_delete": "9999-01-01 00:00:00.000000"
}
Transcribe a Conference:
$ curl --location --request POST 'https://api.voipbin.net/v1.0/transcribes?token=<YOUR_AUTH_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
"reference_type": "confbridge",
"reference_id": "c1d2e3f4-a5b6-7890-cdef-123456789abc",
"language": "en-US"
}'
Transcribe a Recording:
$ curl --location --request POST 'https://api.voipbin.net/v1.0/transcribes?token=<YOUR_AUTH_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
"reference_type": "recording",
"reference_id": "r1s2t3u4-v5w6-x789-yz01-234567890def",
"language": "en-US"
}'
Get Transcription Results
Retrieve transcription data after the transcription completes or during real-time transcription.
Get Transcription by ID:
$ curl --location --request GET 'https://api.voipbin.net/v1.0/transcribes/8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce?token=<YOUR_AUTH_TOKEN>'
{
"id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
"customer_id": "12345678-1234-1234-1234-123456789012",
"reference_type": "call",
"reference_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"language": "en-US",
"status": "done",
"tm_create": "2026-01-20 12:00:00.000000",
"tm_update": "2026-01-20 12:05:00.000000",
"tm_delete": "9999-01-01 00:00:00.000000"
}
Get Transcripts (Text Results):
$ curl --location --request GET 'https://api.voipbin.net/v1.0/transcripts?token=<YOUR_AUTH_TOKEN>&transcribe_id=8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce'
{
"result": [
{
"id": "06af78f0-b063-48c0-b22d-d31a5af0aa88",
"transcribe_id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
"direction": "in",
"message": "Hi, good to see you. How are you today?",
"tm_transcript": "0001-01-01 00:01:04.441160",
"tm_create": "2024-04-01 07:22:07.229309"
},
{
"id": "3c95ea10-a5b7-4a68-aebf-ed1903baf110",
"transcribe_id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
"direction": "out",
"message": "Welcome to the transcribe test. All your voice will be transcribed.",
"tm_transcript": "0001-01-01 00:00:43.116830",
"tm_create": "2024-04-01 07:17:27.208337"
}
]
}
Understanding Transcription Direction
VoIPBIN distinguishes between incoming and outgoing audio:
Direction: “in” - Audio from the customer/caller to VoIPBIN
Direction: “out” - Audio from VoIPBIN to the customer/caller
Customer -----"in"------> VoIPBIN
<----"out"-------
This helps identify who said what in the conversation:

"in": What the customer said
"out": What VoIPBIN played (TTS, recordings, or the other party in the call)
Real-Time Transcription with WebSocket
Subscribe to real-time transcription events via WebSocket to get transcripts as they’re generated during the call.
1. Connect to WebSocket:
wss://api.voipbin.net/v1.0/ws?token=<YOUR_AUTH_TOKEN>
2. Subscribe to Transcription Events:
{
"type": "subscribe",
"topics": [
"customer_id:12345678-1234-1234-1234-123456789012:transcribe:8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce"
]
}
3. Receive Real-Time Transcripts:
{
"event_type": "transcript_created",
"timestamp": "2026-01-20T12:00:00.000000Z",
"data": {
"id": "9d59e7f0-7bdc-4c52-bb8c-bab718952050",
"transcribe_id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
"direction": "out",
"message": "Hello, this is a transcribe test call.",
"tm_transcript": "0001-01-01 00:00:08.991840",
"tm_create": "2024-04-04 07:15:59.233415"
}
}
Python WebSocket Example:
import websocket  # third-party: websocket-client
import json

def on_message(ws, message):
    data = json.loads(message)
    if data.get('event_type') == 'transcript_created':
        transcript = data['data']
        direction = transcript['direction']
        text = transcript['message']
        print(f"[{direction}] {text}")
        # Process transcription in real-time
        # - Display in UI
        # - Run sentiment analysis
        # - Detect keywords

def on_open(ws):
    # Subscribe to transcription events
    subscription = {
        "type": "subscribe",
        "topics": [
            "customer_id:12345678-1234-1234-1234-123456789012:transcribe:*"
        ]
    }
    ws.send(json.dumps(subscription))
    print("Subscribed to transcription events")

token = "<YOUR_AUTH_TOKEN>"
ws_url = f"wss://api.voipbin.net/v1.0/ws?token={token}"

ws = websocket.WebSocketApp(
    ws_url,
    on_open=on_open,
    on_message=on_message
)
ws.run_forever()
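As noted under Troubleshooting, your app must handle WebSocket reconnection itself. A simple wrapper with exponential backoff might look like this sketch; `make_ws` is any zero-argument function returning a fresh WebSocketApp like the one built above, and the helper names are illustrative.

```python
import time

def backoff_delays(base=1.0, cap=30.0):
    """Yield exponential backoff delays: base, 2*base, ... capped at `cap`."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)

def run_with_reconnect(make_ws):
    """Re-run a WebSocketApp whenever the connection drops.

    make_ws() should build a fresh app (run_forever returns on disconnect).
    Break out of the loop yourself on an intentional shutdown.
    """
    delays = backoff_delays()
    while True:
        ws = make_ws()
        ws.run_forever()
        wait = next(delays)
        print(f"disconnected; reconnecting in {wait:.0f}s")
        time.sleep(wait)
```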
Receive Transcripts via Webhook
Configure webhooks to automatically receive transcription events.
1. Create Webhook:
$ curl --location --request POST 'https://api.voipbin.net/v1.0/webhooks?token=<YOUR_AUTH_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
"name": "Transcription Webhook",
"uri": "https://your-server.com/webhook",
"method": "POST",
"event_types": [
"transcribe.started",
"transcribe.completed",
"transcript.created"
]
}'
2. Webhook Payload Example:
POST https://your-server.com/webhook
{
"event_type": "transcript_created",
"timestamp": "2026-01-20T12:00:00.000000Z",
"data": {
"id": "9d59e7f0-7bdc-4c52-bb8c-bab718952050",
"transcribe_id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
"direction": "in",
"message": "I need help with my account",
"tm_transcript": "0001-01-01 00:00:15.500000",
"tm_create": "2024-04-04 07:16:05.100000"
}
}
3. Process Webhook in Your Server:
# Python Flask example
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def transcription_webhook():
    payload = request.get_json()
    event_type = payload.get('event_type')

    if event_type == 'transcript_created':
        transcript = payload['data']
        transcribe_id = transcript['transcribe_id']
        message = transcript['message']
        direction = transcript['direction']

        # Store transcript in database
        store_transcript(transcribe_id, message, direction)

        # Analyze content
        sentiment = analyze_sentiment(message)
        keywords = extract_keywords(message)

        # Trigger actions based on content
        if 'urgent' in message.lower():
            alert_supervisor(transcribe_id)

    return jsonify({'status': 'received'}), 200
Supported Languages
VoIPBIN supports transcription in multiple languages. See supported languages.
Common Languages:
- en-US - English (United States)
- en-GB - English (United Kingdom)
- es-ES - Spanish (Spain)
- fr-FR - French (France)
- de-DE - German (Germany)
- ja-JP - Japanese (Japan)
- ko-KR - Korean (Korea)
- zh-CN - Chinese (Simplified)
Example with Different Language:
{
"type": "transcribe_start",
"option": {
"language": "ja-JP"
}
}
Common Use Cases
1. Customer Service Quality Assurance:
# Monitor customer service calls
def on_transcript(transcript):
    # Check for quality metrics
    if contains_greeting(transcript):
        mark_greeting_present()
    if contains_problem_resolution(transcript):
        mark_resolved()
    # Flag negative sentiment
    if analyze_sentiment(transcript) < 0.3:
        flag_for_review()
2. Compliance and Record-Keeping:
# Store all call transcripts for compliance
def store_for_compliance(transcribe_id):
    transcripts = get_transcripts(transcribe_id)
    # Create formatted record
    record = {
        'call_id': call_id,
        'date': datetime.now(),
        'full_transcript': format_transcript(transcripts),
        'participants': get_participants(call_id)
    }
    # Store in compliance database
    compliance_db.store(record)
3. Real-Time Agent Assistance:
# Help agents during calls
def on_real_time_transcript(transcript):
# Detect customer questions
if is_question(transcript['message']):
# Suggest answers to agent
answers = knowledge_base.search(transcript['message'])
display_to_agent(answers)
# Detect customer frustration
if detect_frustration(transcript['message']):
suggest_supervisor_escalation()
4. Automated Call Summarization:
# Generate call summaries
def summarize_call(transcribe_id):
    transcripts = get_all_transcripts(transcribe_id)
    # Combine all transcripts
    full_text = ' '.join([t['message'] for t in transcripts])
    # Generate summary using AI
    summary = ai_summarize(full_text)
    # Extract key points
    action_items = extract_action_items(full_text)
    topics = extract_topics(full_text)
    return {
        'summary': summary,
        'action_items': action_items,
        'topics': topics
    }
5. Keyword Detection and Alerting:
# Monitor for important keywords
ALERT_KEYWORDS = ['urgent', 'emergency', 'cancel', 'complaint', 'lawsuit']

def on_transcript(transcript):
    message = transcript['message'].lower()
    for keyword in ALERT_KEYWORDS:
        if keyword in message:
            # Send immediate alert
            send_alert(
                transcribe_id=transcript['transcribe_id'],
                keyword=keyword,
                context=message
            )
            # Escalate to supervisor
            escalate_call(transcript['transcribe_id'])
6. Multi-Language Customer Support:
# Transcribe in the customer's language. VoIPBIN has no auto-detect mode,
# so the language must be chosen explicitly - detect_language() here stands
# for your own external detection step (e.g., an IVR language menu or a
# third-party detection service).
def start_multilingual_transcription(call_id):
    # Determine the caller's language outside of VoIPBIN
    detected_language = detect_language(call_id)
    # Start transcription with the chosen BCP47 code
    start_transcribe(
        reference_id=call_id,
        language=detected_language
    )
    # Optionally translate to the agent's language
    if detected_language != 'en-US':
        enable_translation(call_id, target_lang='en-US')
Best Practices
1. Choose the Right Trigger Method:
Flow Action: Use when transcription is always needed for specific flows
Manual API: Use when transcription is conditional or triggered by user action

2. Handle Real-Time Events Efficiently:
Process transcripts asynchronously to avoid blocking
Buffer transcripts if processing takes time
Use queues for high-volume scenarios

3. Language Selection:
Set the language explicitly - there is no auto-detect mode
Use the correct regional variant for better accuracy
Test with actual customer accents and dialects

4. Data Management:
Store transcripts separately from call records
Implement retention policies (GDPR, compliance)
Encrypt sensitive transcriptions

5. Error Handling:
Handle cases where transcription fails
Add retry logic for temporary failures
Log failures for debugging

6. Testing:
Test with various audio qualities
Verify accuracy with different accents
Test real-time latency
Transcription Lifecycle
1. Start Transcription:
POST /v1.0/transcribes
→ Returns transcribe_id
2. Active Transcription:
Status: "progressing"
→ Transcripts being generated in real-time
3. Receive Transcripts:
Via WebSocket: transcript_created events
Via Webhook: POST to your endpoint
Via API: GET /v1.0/transcripts?transcribe_id=...
4. Completion:
Status: "done"
→ All transcripts available via API
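For batch workflows, the completion step can be sketched as a polling loop using the status values from the Transcribe resource reference (progressing/done). The helper names are illustrative, and the fetcher is injectable so the loop is easy to test.

```python
import json
import time
import urllib.request

def wait_until_done(fetch_session, interval=5.0, timeout=600.0):
    """Poll a transcribe session until its status is "done".

    fetch_session() must return the session dict from GET /transcribes/{id}.
    Raises TimeoutError if the session does not finish within `timeout`.
    """
    deadline = time.monotonic() + timeout
    while True:
        session = fetch_session()
        if session["status"] == "done":
            return session
        if time.monotonic() + interval > deadline:
            raise TimeoutError("transcription did not finish in time")
        time.sleep(interval)

def fetch_via_api(token, transcribe_id):
    """GET /transcribes/{id} with the stdlib HTTP client."""
    url = f"https://api.voipbin.net/v1.0/transcribes/{transcribe_id}?token={token}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read())
```

Usage: `wait_until_done(lambda: fetch_via_api(token, transcribe_id))`, then list the text with GET /transcripts.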
Troubleshooting
Common Issues:
No transcripts generated:
Verify the call has audio
Check the language setting is correct
Ensure transcription started successfully

Poor transcription accuracy:
Use the correct language code
Check audio quality
Verify clear speech (no background noise)

Missing real-time events:
Verify the WebSocket subscription is active
Check the topic pattern matches the transcribe_id
Ensure the network connection is stable

Delayed transcripts:
Real-time transcription has a ~2-5 second delay (normal)
Check network latency
Verify your server can handle the webhook volume
For more information about transcription features and configuration, see Transcribe Overview.
Transcribe
{
"id": "<string>",
"customer_id": "<string>",
"reference_type": "<string>",
"reference_id": "<string>",
"status": "<string>",
"language": "<string>",
"provider": "<string>",
"tm_create": "<string>",
"tm_update": "<string>",
"tm_delete": "<string>"
}
id (UUID): The transcribe session's unique identifier. Returned when creating a transcription via POST /transcribes or listing via GET /transcribes.
customer_id (UUID): The customer who owns this transcription. Obtained from GET /customers.
reference_type (enum string): The type of resource being transcribed. See Reference Type.
reference_id (UUID): The ID of the resource being transcribed. Depending on reference_type, obtained from GET /calls, GET /recordings, or GET /conferences.
status (enum string): The transcription session's current status. See Status.
language (string, BCP47): The language code for transcription (e.g., en-US, ko-KR, ja-JP). See Supported Languages.
provider (enum string, optional): The STT provider used for this transcription. See Provider.
tm_create (string, ISO 8601): Timestamp when the transcribe session was created.
tm_update (string, ISO 8601): Timestamp of the last update to any transcribe property.
tm_delete (string, ISO 8601): Timestamp when the transcribe session was deleted. Set to 9999-01-01 00:00:00.000000 if not deleted.
Note
AI Implementation Hint
Timestamps set to 9999-01-01 00:00:00.000000 indicate the event has not yet occurred. For example, tm_delete with this value means the transcription has not been deleted.
Example
{
"id": "bbf08426-3979-41bc-a544-5fc92c237848",
"customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
"reference_type": "call",
"reference_id": "12f8f1c9-a6c3-4f81-93db-ae445dcf188f",
"status": "done",
"language": "en-US",
"provider": "gcp",
"tm_create": "2024-04-01 07:17:04.091019",
"tm_update": "2024-04-01 13:25:32.428602",
"tm_delete": "9999-01-01 00:00:00.000000"
}
reference_type
All possible values for the reference_type field:
| Type | Description |
|---|---|
| call | Transcribing a live call in real-time. The reference_id is the call ID (from GET /calls). |
| recording | Transcribing a previously recorded audio file. The reference_id is the recording ID (from GET /recordings). |
| confbridge | Transcribing a live conference. The reference_id is the conference ID (from GET /conferences). |
provider
All possible values for the provider field:
| Provider | Description |
|---|---|
| gcp | Google Cloud Speech-to-Text |
| aws | Amazon Transcribe |
When creating a transcription, the provider field is optional. If omitted, VoIPBIN selects the best available provider automatically (default order: GCP, then AWS). If a specific provider is requested but unavailable, the system falls back to the default order.
status
All possible values for the status field:
| Status | Description |
|---|---|
| progressing | Transcription is actively in progress. New transcript segments are being generated and delivered via webhook or WebSocket. |
| done | Transcription is complete. No more transcript segments will be generated. All transcripts are available via GET /transcripts. |
Transcription
{
"id": "<string>",
"transcribe_id": "<string>",
"direction": "<string>",
"message": "<string>",
"tm_transcript": "<string>",
"tm_create": "<string>"
}
id (UUID): The individual transcript segment's unique identifier.
transcribe_id (UUID): The parent transcribe session's ID. Obtained from GET /transcribes or the response of POST /transcribes.
direction (enum string): Whether the speech was incoming or outgoing. See Direction.
message (string): The transcribed text content of this speech segment.
tm_transcript (string): Time offset within the call when this speech occurred. Uses 0001-01-01 00:00:00 as the epoch; the time portion represents the offset from the start of the transcription session (e.g., 0001-01-01 00:01:04.441160 means 1 minute and 4 seconds into the call). Sort by this field to reconstruct conversation order.
tm_create (string, ISO 8601): Absolute timestamp when this transcript segment was created.
Note
AI Implementation Hint
The tm_transcript field is a time offset, not an absolute timestamp. Its date part (0001-01-01) is a sentinel value meaning “relative to the start of the transcription session.” To reconstruct a conversation in order, sort all transcript segments by tm_transcript, not by tm_create (which reflects delivery time, not speech time).
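A stdlib-only sketch of converting a tm_transcript value into a plain offset from session start (the helper name is illustrative):

```python
from datetime import datetime, timedelta

# tm_transcript uses 0001-01-01 as a sentinel epoch; the time-of-day part
# is the offset from the start of the transcription session.
EPOCH = datetime(1, 1, 1)

def transcript_offset(tm_transcript: str) -> timedelta:
    """Return the speech time as a timedelta from the session start."""
    parsed = datetime.strptime(tm_transcript, "%Y-%m-%d %H:%M:%S.%f")
    return parsed - EPOCH
```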
Example
{
"id": "06af78f0-b063-48c0-b22d-d31a5af0aa88",
"transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
"direction": "in",
"message": "Hi, good to see you. How are you today.",
"tm_transcript": "0001-01-01 00:05:04.441160",
"tm_create": "2024-04-01 07:22:07.229309"
}
direction
All possible values for the direction field:
| Direction | Description |
|---|---|
| in | Incoming speech toward VoIPBIN (i.e., what the caller/remote party said). |
| out | Outgoing speech from VoIPBIN (i.e., TTS audio, recorded prompts, or the connected party's speech sent from VoIPBIN). |
Speech Webhook Message
The speech webhook message is the payload delivered for transcribe_speech_started, transcribe_speech_interim, and transcribe_speech_ended events. These events are generated during a real-time streaming transcription session when voice activity is detected.
{
"id": "<string>",
"customer_id": "<string>",
"streaming_id": "<string>",
"transcribe_id": "<string>",
"direction": "<string>",
"message": "<string>",
"tm_event": "<string>",
"tm_create": "<string>"
}
id (UUID): The unique identifier of the speech event.
customer_id (UUID): The customer who owns this transcription session. Obtained from GET /customers.
streaming_id (UUID): The unique identifier of the audio streaming session that produced this event.
transcribe_id (UUID): The parent transcribe session's ID. Obtained from GET /transcribes or the response of POST /transcribes.
direction (enum string): Whether the speech was incoming or outgoing. See Direction.
message (string): The interim transcribed text. Present for transcribe_speech_interim events. Empty for transcribe_speech_started and transcribe_speech_ended events.
tm_event (string, ISO 8601): Timestamp when the speech event occurred.
tm_create (string, ISO 8601): Timestamp when the speech event record was created.
Example
transcribe_speech_started:
{
"type": "transcribe_speech_started",
"data": {
"id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
"streaming_id": "c0d1e2f3-a4b5-6c7d-8e9f-0a1b2c3d4e5f",
"transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
"direction": "in",
"tm_event": "2024-04-01 07:22:07.229309",
"tm_create": "2024-04-01 07:22:07.229309"
}
}
transcribe_speech_interim:
{
"type": "transcribe_speech_interim",
"data": {
"id": "b2c3d4e5-f6a7-8901-bcde-f23456789012",
"customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
"streaming_id": "c0d1e2f3-a4b5-6c7d-8e9f-0a1b2c3d4e5f",
"transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
"direction": "in",
"message": "Hello, I need help with my account",
"tm_event": "2024-04-01 07:22:08.115000",
"tm_create": "2024-04-01 07:22:08.115000"
}
}
transcribe_speech_ended:
{
"type": "transcribe_speech_ended",
"data": {
"id": "c3d4e5f6-a7b8-9012-cdef-345678901234",
"customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
"streaming_id": "c0d1e2f3-a4b5-6c7d-8e9f-0a1b2c3d4e5f",
"transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
"direction": "in",
"tm_event": "2024-04-01 07:22:12.500000",
"tm_create": "2024-04-01 07:22:12.500000"
}
}