Transcribe

Real-time speech-to-text transcription for voice calls, converting spoken audio into text during active conversations.

API Reference: Transcribe endpoints

Overview

Note

AI Context

  • Complexity: Medium

  • Cost: Chargeable (per minute of audio transcribed)

  • Async: Yes. POST /transcribes returns immediately with status progressing. Transcripts are delivered asynchronously via webhook (transcript_created events) or WebSocket subscription. Poll GET /transcribes/{id} to check whether the status has reached done.

VoIPBIN’s Transcription API converts spoken audio from calls and conferences into text in real-time. Whether you need transcripts for compliance, searchable call logs, AI analysis, or accessibility, the Transcription API delivers accurate text as conversations happen.

With the Transcription API you can:

  • Transcribe calls and conferences in real-time

  • Distinguish between incoming and outgoing speech

  • Receive transcripts via webhooks or WebSocket

  • Support 70+ languages and regional variants

  • Integrate with AI systems for sentiment analysis and summarization

How Transcription Works

When you start transcription, VoIPBIN captures audio from the call or conference, sends it to a speech-to-text (STT) engine, and delivers the resulting text to your application.

Transcription Architecture

+--------+         +------------+         +------------+
|  Call  |--audio->|    STT     |--text-->|  Webhook   |
+--------+         |   Engine   |         |     or     |
                   +-----+------+         | WebSocket  |
+------------+           |                +-----+------+
| Conference |--audio----+                      |
+------------+                                  v
                                          +------------+
                                          |  Your App  |
                                          +------------+

Key Components

  • Audio Source: The call or conference being transcribed

  • STT Engine: Google Cloud Speech-to-Text or Amazon Transcribe (selectable per request)

  • Delivery: Webhooks (push) or WebSocket (subscribe) to your application

Transcription Types

  • Call Transcription: Transcribes a single call with direction detection

  • Conference Transcription: Transcribes all participants (direction indicates speaker relative to conference)

Transcription Lifecycle

Transcription runs continuously while active, generating transcript segments as speech is detected.

Lifecycle Diagram

POST /transcribes or flow action
       |
       v
+-------------+                         +-------------+
|  starting   |------active------------>| transcribing|
+-------------+                         +------+------+
                                               |
                          POST /transcribe_stop, hangup, or timeout
                                               |
                                               v
                                        +-------------+
                                        |   stopped   |
                                        +-------------+

State Descriptions

  • starting: Transcription initialization. The STT engine is connecting.

  • transcribing: Actively processing audio. Transcripts are being generated.

  • stopped: Transcription has ended. No more transcripts will be generated.

Transcript Delivery Flow

Call Audio          VoIPBIN STT           Your App
    |                    |                    |
    |====audio chunk====>|                    |
    |                    | process            |
    |                    |----+               |
    |                    |<---+               |
    |                    |                    |
    |                    | transcript_created |
    |                    +------------------->|
    |                    |                    |
    |====audio chunk====>|                    |
    |                    | process            |
    |                    +------------------->|
    |                    |                    |

Each transcript segment is delivered as soon as speech is recognized, enabling real-time processing.

Starting Transcription

VoIPBIN provides two methods to start transcription based on your use case.

Note

AI Implementation Hint

The language parameter uses BCP47 codes (e.g., en-US, ko-KR). Using the wrong language code significantly degrades accuracy. If the speaker’s language is unknown, start with the most likely code and consider switching if results are poor. There is no auto-detect mode; you must specify a language explicitly.

Method 1: Via Flow Action

Use transcribe_start and transcribe_stop actions in your call flow for automatic control.

Your Flow                    VoIPBIN                     Your App
    |                           |                           |
    | transcribe_start action   |                           |
    +-------------------------->|                           |
    |                           | Initialize STT            |
    |                           |                           |
    |                           |<====audio stream====      |
    |                           |                           |
    |                           | transcript_created        |
    |                           +-------------------------->|
    |                           |                           |
    | transcribe_stop action    |                           |
    +-------------------------->|                           |
    |                           |                           |

Example flow with transcription:

{
    "actions": [
        {
            "type": "answer"
        },
        {
            "type": "transcribe_start",
            "option": {
                "language": "en-US"
            }
        },
        {
            "type": "talk",
            "option": {
                "text": "Hello, how can I help you today?"
            }
        },
        {
            "type": "connect",
            "option": {
                "destinations": [{"type": "tel", "target": "+15551234567"}]
            }
        },
        {
            "type": "transcribe_stop"
        }
    ]
}


Method 2: Via API (Interrupt Method)

Start transcription on an active call or conference programmatically.

Start transcription:

$ curl -X POST 'https://api.voipbin.net/v1.0/transcribes?token=<token>' \
    --header 'Content-Type: application/json' \
    --data '{
        "reference_type": "call",
        "reference_id": "8c71bcb6-e7e7-4ed2-8aba-44bc2deda9a5",
        "language": "en-US",
        "direction": "both",
        "provider": "gcp"
    }'

Parameters:

  • reference_type: Type of resource to transcribe (call or conference)

  • reference_id: ID of the call or conference to transcribe

  • language: BCP47 language code (e.g., en-US)

  • direction: Which audio to transcribe (in, out, or both)

  • provider: STT provider to use (e.g., gcp for Google Cloud Speech-to-Text)

When to Use Each Method

  • Flow Action: Automated transcription based on call flow logic

  • API (Interrupt): Dynamic control - start/stop based on external events

Receiving Transcripts

VoIPBIN delivers transcripts to your application via webhooks or WebSocket subscription.

Webhook Event Types

VoIPBIN generates the following events during a transcription session:

  • transcript_created: Final transcribed text segment

  • transcribe_speech_started: Voice activity detected (speaker began talking)

  • transcribe_speech_interim: Partial transcript while speaker is still talking

  • transcribe_speech_ended: Voice activity ended (speaker stopped talking)

The transcript_created event delivers final, complete transcript segments. The speech events provide real-time voice activity detection and interim results, useful for AI voice agent integrations. See Speech Webhook Message for the speech event payload structure.
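For a voice-agent integration, these four event types are typically routed through a small dispatcher that tracks the current partial utterance and commits final segments. The sketch below is illustrative (the dispatcher and its state layout are not part of the API):

```python
def make_dispatcher():
    # Tracks one in-flight utterance: interim text overwrites the current
    # partial line; transcript_created commits a final segment.
    state = {"partial": None, "final": []}

    def dispatch(event):
        etype = event.get("type")
        data = event.get("data", {})
        if etype == "transcribe_speech_started":
            state["partial"] = ""                            # speaker began talking
        elif etype == "transcribe_speech_interim":
            state["partial"] = data.get("message", "")       # partial text, may change
        elif etype == "transcript_created":
            state["final"].append(data.get("message", ""))   # final segment
            state["partial"] = None
        elif etype == "transcribe_speech_ended":
            pass                                             # voice activity ended
        return state

    return dispatch
```

Interim results are useful for showing live captions or interrupting TTS early; only transcript_created segments should be persisted.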

Webhook Delivery (Push)

Configure a webhook URL in your customer settings to receive transcript_created events automatically.

VoIPBIN                           Your App
    |                                 |
    | POST /your-webhook-endpoint     |
    | {transcript_created event}      |
    +-------------------------------->|
    |                                 |
    |            200 OK               |
    |<--------------------------------+
    |                                 |

Webhook Payload:

{
    "type": "transcript_created",
    "data": {
        "id": "9d59e7f0-7bdc-4c52-bb8c-bab718952050",
        "transcribe_id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
        "direction": "out",
        "message": "Hello, this is transcribe test call.",
        "tm_transcript": "0001-01-01 00:00:08.991840",
        "tm_create": "2024-04-04 07:15:59.233415"
    }
}

WebSocket Subscription (Subscribe)

Subscribe to transcript events via WebSocket for real-time streaming.

Your App                          VoIPBIN
    |                                 |
    | WebSocket connect               |
    +-------------------------------->|
    |                                 |
    | Subscribe to transcript events  |
    +-------------------------------->|
    |                                 |
    |<======= transcript events =====>|
    |<======= transcript events =====>|
    |                                 |
    | Unsubscribe                     |
    +-------------------------------->|
    |                                 |

Comparison: Webhook vs WebSocket

Aspect         Webhook                                 WebSocket
Connection     VoIPBIN initiates POST                  Your app maintains connection
Latency        Higher (HTTP overhead)                  Lower (persistent connection)
Reliability    Retry on failure                        Must handle reconnection
Best for       Simple integration, batch processing    Real-time UI, low-latency applications

Understanding Transcript Direction

Each transcript includes a direction field indicating whether the speech was incoming or outgoing relative to VoIPBIN.

Direction Detection

+----------+                             +---------+
|  Caller  |-----> direction: "in" ----->| VoIPBIN |
|          |                             |         |
|          |<---- direction: "out" <-----|         |
+----------+                             +---------+

Example Conversation:

[in]  "Hello, I need help with my account"
[out] "Sure, I can help you with that"
[in]  "My account number is 12345"
[out] "Let me look that up for you"

Direction Values

  • in: Audio from the remote party toward VoIPBIN (what the caller said)

  • out: Audio from VoIPBIN toward the remote party (TTS, recordings, or the other call leg)

Transcript Data Structure:

[
    {
        "id": "06af78f0-b063-48c0-b22d-d31a5af0aa88",
        "transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
        "direction": "in",
        "message": "Hi, good to see you. How are you today.",
        "tm_transcript": "0001-01-01 00:01:04.441160",
        "tm_create": "2024-04-01 07:22:07.229309"
    },
    {
        "id": "3c95ea10-a5b7-4a68-aebf-ed1903baf110",
        "transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
        "direction": "out",
        "message": "Welcome to the transcribe test scenario.",
        "tm_transcript": "0001-01-01 00:00:43.116830",
        "tm_create": "2024-04-01 07:17:27.208337"
    }
]

Working with Transcripts

Timestamp Fields

  • tm_transcript: Time offset within the call when the speech occurred

  • tm_create: Absolute timestamp when the transcript was created
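As the examples in this document show, tm_transcript encodes the in-call offset as a time on a year-1 epoch (0001-01-01 00:00:00). Assuming that format, converting it to plain seconds is a one-liner:

```python
from datetime import datetime

# tm_transcript values like "0001-01-01 00:01:04.441160" encode an offset
# from the start of the call on a year-1 epoch.
EPOCH = datetime(1, 1, 1)

def transcript_offset_seconds(tm_transcript: str) -> float:
    parsed = datetime.strptime(tm_transcript, "%Y-%m-%d %H:%M:%S.%f")
    return (parsed - EPOCH).total_seconds()
```

For example, transcript_offset_seconds("0001-01-01 00:01:04.441160") yields 64.44116 seconds into the call.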

Combining Transcripts into Conversation

To reconstruct a conversation, sort transcripts by tm_transcript:

Transcripts received (order of delivery):
[out] 00:00:05 "Welcome to VoIPBIN support"
[in]  00:00:12 "Hi, I have a billing question"
[out] 00:00:18 "I'd be happy to help"
[in]  00:00:08 "Hello?"

Sorted by tm_transcript:
[out] 00:00:05 "Welcome to VoIPBIN support"
[in]  00:00:08 "Hello?"
[in]  00:00:12 "Hi, I have a billing question"
[out] 00:00:18 "I'd be happy to help"
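In code, the reconstruction above is a single sort keyed on tm_transcript. A minimal sketch (the helper name is illustrative):

```python
def reconstruct_conversation(transcripts):
    # Delivery order is not guaranteed; order segments by in-call offset.
    # tm_transcript strings are fixed-width, so lexicographic order equals
    # chronological order.
    ordered = sorted(transcripts, key=lambda t: t["tm_transcript"])
    return [f"[{t['direction']}] {t['message']}" for t in ordered]
```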

Storing Transcripts

For long-term storage, consider:

  • Store raw transcripts with all metadata

  • Index by transcribe_id to group by session

  • Use direction for speaker attribution

  • Create searchable text indexes on message field
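The storage points above can be sketched with SQLite (the table and column names are illustrative, not a prescribed schema):

```python
import sqlite3

def init_db(conn):
    # Keep all metadata; index by transcribe_id to group a session's segments.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS transcripts (
            id            TEXT PRIMARY KEY,   -- transcript id (also an idempotency key)
            transcribe_id TEXT NOT NULL,      -- groups segments by session
            direction     TEXT NOT NULL,      -- 'in' or 'out' for speaker attribution
            message       TEXT NOT NULL,      -- candidate for a full-text index
            tm_transcript TEXT NOT NULL,
            tm_create     TEXT NOT NULL
        )""")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_session ON transcripts (transcribe_id)")

def store(conn, t):
    # INSERT OR IGNORE makes redelivered webhook events harmless.
    conn.execute(
        "INSERT OR IGNORE INTO transcripts "
        "VALUES (:id, :transcribe_id, :direction, :message, :tm_transcript, :tm_create)",
        t,
    )
```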

Common Scenarios

Scenario 1: Real-Time Call Transcription

Transcribe a call from start to finish with webhook delivery.

Call starts
     |
     v
+--------------------+
| transcribe_start   |
| language: "en-US"  |
+--------+-----------+
         |
         v
+===================+
| Call in progress  |------> transcript_created events
+===================+           to your webhook
         |
         v
+--------------------+
| Call ends          |
| (auto-stop)        |
+--------------------+

Scenario 2: Conference with Multiple Speakers

Transcribe all participants in a conference.

Conference
+-------------------------------------------------------+
|  +------+    +------+    +------+                     |
|  |User A|    |User B|    |User C|                     |
|  +--+---+    +--+---+    +--+---+                     |
|     |           |           |                         |
|     +-----+-----+-----+-----+                         |
|           |                                           |
|           v                                           |
|     +-------------+                                   |
|     |Transcription|----> transcript_created events    |
|     +-------------+      (direction indicates speaker)|
+-------------------------------------------------------+

Scenario 3: AI Integration

Send transcripts to an AI system for real-time analysis.

VoIPBIN                Your App               AI Service
    |                      |                      |
    | transcript_created   |                      |
    +--------------------->|                      |
    |                      | Analyze sentiment    |
    |                      +--------------------->|
    |                      |                      |
    |                      | sentiment: positive  |
    |                      |<---------------------+
    |                      |                      |
    |                      | Update agent UI      |
    |                      |                      |

Scenario 4: Compliance Recording with Transcription

Combine recording and transcription for complete call documentation.

{
    "actions": [
        {"type": "answer"},
        {"type": "recording_start"},
        {"type": "transcribe_start", "option": {"language": "en-US"}},
        {"type": "connect", "option": {"destinations": [...]}},
        {"type": "transcribe_stop"},
        {"type": "recording_stop"}
    ]
}

Supported Languages

VoIPBIN supports transcription in 70+ languages and regional variants. Specify the language using the language option (e.g., en-US, ko-KR).

Common Languages

Language Code    Language
en-US            English (United States)
en-GB            English (United Kingdom)
es-ES            Spanish (Spain)
es-MX            Spanish (Mexico)
fr-FR            French (France)
de-DE            German (Germany)
it-IT            Italian (Italy)
pt-BR            Portuguese (Brazil)
ja-JP            Japanese (Japan)
ko-KR            Korean (South Korea)
zh-CN            Chinese (Mandarin)
ar-SA            Arabic (Saudi Arabia)
hi-IN            Hindi (India)
nl-NL            Dutch (Netherlands)
ru-RU            Russian (Russia)

VoIPBIN supports 70+ languages including regional variants for Arabic, Spanish, English, and more. Contact support for the complete language list.

To ensure optimal transcription results, choose the code that best matches your speaker’s language and dialect.
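Because a wrong language code silently degrades accuracy, it can help to fail fast before starting a session. A minimal sketch, validating against the common codes listed above (only a subset of the 70+ supported languages; extend the set for your deployment):

```python
# Subset of supported BCP47 codes, taken from the table above.
COMMON_LANGUAGES = {
    "en-US", "en-GB", "es-ES", "es-MX", "fr-FR", "de-DE", "it-IT",
    "pt-BR", "ja-JP", "ko-KR", "zh-CN", "ar-SA", "hi-IN", "nl-NL", "ru-RU",
}

def validate_language(code: str) -> str:
    # There is no auto-detect mode, so an explicit, valid code is required.
    if code not in COMMON_LANGUAGES:
        raise ValueError(f"unsupported or unknown language code: {code}")
    return code
```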

Best Practices

1. Language Selection

  • Use the most specific regional variant (e.g., en-AU rather than generic en-US for Australian speakers)

  • Mismatched language codes significantly reduce accuracy

  • For multi-language calls, consider separate transcription sessions

2. Audio Quality

  • Clear audio produces better transcripts

  • Reduce background noise when possible

  • Avoid overlapping speech in group calls

3. Handling High Volume

  • Use WebSocket for real-time applications with many concurrent calls

  • Batch process webhooks for analytics workloads

  • Index transcripts for efficient searching

4. Storage and Compliance

  • Define retention policies for transcript data

  • Store transcripts with call metadata for context

  • Consider encryption for sensitive conversations

Troubleshooting

Transcription Not Starting

  • No transcribe_id returned: Verify the call/conference is in "progressing" status before starting transcription

  • Permission denied: Check that the API token has transcription permissions

  • Invalid language code: Verify the language code is in the supported list

Poor Accuracy

  • Words frequently wrong: Check that the language code matches the speaker's dialect

  • Missing words: Check audio quality - background noise or low volume reduces accuracy

  • Technical terms wrong: STT may not recognize domain-specific terms; consider post-processing

Missing Transcripts

  • Webhook not receiving: Verify the webhook URL is configured in customer settings and is publicly accessible

  • WebSocket disconnects: Implement reconnection logic; check for network issues

  • Gaps in transcript: Silence or unclear audio produces no transcripts - this is expected behavior

Webhook Delivery Issues

  • Events delayed: Check webhook endpoint response time; it should respond within 5 seconds

  • Duplicate events: Implement idempotency using the transcript id

  • Events out of order: Sort by tm_transcript to reconstruct conversation order
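For the duplicate-event case, the transcript id works directly as an idempotency key. A minimal in-memory sketch (a production system would persist seen ids, e.g., in the transcript store itself):

```python
def make_deduper():
    seen = set()

    def handle_once(event, process):
        # Use the transcript id as an idempotency key; skip redeliveries.
        tid = event["data"]["id"]
        if tid in seen:
            return False
        seen.add(tid)
        process(event)
        return True

    return handle_once
```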

Tutorial

Before working with transcription, you need:

  • An authentication token. Obtain one via POST /auth/login or use an access key from GET /accesskeys.

  • An active call or conference in progressing status. Obtain the call ID via GET /calls or conference ID via GET /conferences.

  • A BCP47 language code matching the speaker’s language (e.g., en-US, ko-KR). See Supported Languages.

  • (Optional for recording transcription) A recording ID from GET /recordings.

Note

AI Implementation Hint

Transcription can only be started on a call or conference that is in progressing status (i.e., answered and active). For recording transcription, the recording must exist and be in ended status. The language parameter is required and must be a valid BCP47 code; there is no auto-detect mode.

Start Transcription with Flow Action

The easiest way to enable transcription is by adding a transcribe_start action to your call flow. This automatically begins transcription when the call reaches that action.

Create Call with Automatic Transcription:

$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "source": {
            "type": "tel",
            "target": "+15551234567"
        },
        "destinations": [
            {
                "type": "tel",
                "target": "+15559876543"
            }
        ],
        "actions": [
            {
                "type": "answer"
            },
            {
                "type": "transcribe_start",
                "option": {
                    "language": "en-US"
                }
            },
            {
                "type": "talk",
                "option": {
                    "text": "This call is being transcribed for quality assurance",
                    "language": "en-US"
                }
            }
        ]
    }'

Transcription starts when the call reaches the transcribe_start action and continues until the call ends.

Start Transcription via API (Manual)

For existing calls or conferences, start transcription manually by making an API request.

Transcribe an Active Call:

$ curl --location --request POST 'https://api.voipbin.net/v1.0/transcribes?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "reference_type": "call",
        "reference_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "language": "en-US"
    }'

{
    "id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
    "customer_id": "12345678-1234-1234-1234-123456789012",
    "reference_type": "call",
    "reference_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "language": "en-US",
    "status": "progressing",
    "tm_create": "2026-01-20 12:00:00.000000",
    "tm_update": "2026-01-20 12:00:00.000000",
    "tm_delete": "9999-01-01 00:00:00.000000"
}

Transcribe a Conference:

$ curl --location --request POST 'https://api.voipbin.net/v1.0/transcribes?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "reference_type": "conference",
        "reference_id": "c1d2e3f4-a5b6-7890-cdef-123456789abc",
        "language": "en-US"
    }'

Transcribe a Recording:

$ curl --location --request POST 'https://api.voipbin.net/v1.0/transcribes?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "reference_type": "recording",
        "reference_id": "r1s2t3u4-v5w6-x789-yz01-234567890def",
        "language": "en-US"
    }'

Get Transcription Results

Retrieve transcription data after the transcription completes or during real-time transcription.

Get Transcription by ID:

$ curl --location --request GET 'https://api.voipbin.net/v1.0/transcribes/8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce?token=<YOUR_AUTH_TOKEN>'

{
    "id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
    "customer_id": "12345678-1234-1234-1234-123456789012",
    "reference_type": "call",
    "reference_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "language": "en-US",
    "status": "done",
    "tm_create": "2026-01-20 12:00:00.000000",
    "tm_update": "2026-01-20 12:05:00.000000",
    "tm_delete": "9999-01-01 00:00:00.000000"
}

Get Transcripts (Text Results):

$ curl --location --request GET 'https://api.voipbin.net/v1.0/transcripts?token=<YOUR_AUTH_TOKEN>&transcribe_id=8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce'

{
    "result": [
        {
            "id": "06af78f0-b063-48c0-b22d-d31a5af0aa88",
            "transcribe_id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
            "direction": "in",
            "message": "Hi, good to see you. How are you today?",
            "tm_transcript": "0001-01-01 00:01:04.441160",
            "tm_create": "2024-04-01 07:22:07.229309"
        },
        {
            "id": "3c95ea10-a5b7-4a68-aebf-ed1903baf110",
            "transcribe_id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
            "direction": "out",
            "message": "Welcome to the transcribe test. All your voice will be transcribed.",
            "tm_transcript": "0001-01-01 00:00:43.116830",
            "tm_create": "2024-04-01 07:17:27.208337"
        }
    ]
}

Understanding Transcription Direction

VoIPBIN distinguishes between incoming and outgoing audio:

Direction: “in” - Audio from the customer/caller to VoIPBIN

Direction: “out” - Audio from VoIPBIN to the customer/caller

Customer  -----"in"------>  VoIPBIN
         <----"out"-------

This helps identify who said what in the conversation:

  • "in": What the customer said

  • "out": What VoIPBIN played (TTS, recordings, or the other party in the call)

Real-Time Transcription with WebSocket

Subscribe to real-time transcription events via WebSocket to get transcripts as they’re generated during the call.

1. Connect to WebSocket:

wss://api.voipbin.net/v1.0/ws?token=<YOUR_AUTH_TOKEN>

2. Subscribe to Transcription Events:

{
    "type": "subscribe",
    "topics": [
        "customer_id:12345678-1234-1234-1234-123456789012:transcribe:8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce"
    ]
}

3. Receive Real-Time Transcripts:

{
    "event_type": "transcript_created",
    "timestamp": "2026-01-20T12:00:00.000000Z",
    "data": {
        "id": "9d59e7f0-7bdc-4c52-bb8c-bab718952050",
        "transcribe_id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
        "direction": "out",
        "message": "Hello, this is a transcribe test call.",
        "tm_transcript": "0001-01-01 00:00:08.991840",
        "tm_create": "2024-04-04 07:15:59.233415"
    }
}

Python WebSocket Example:

import websocket
import json

def on_message(ws, message):
    data = json.loads(message)

    if data.get('event_type') == 'transcript_created':
        transcript = data['data']
        direction = transcript['direction']
        text = transcript['message']

        print(f"[{direction}] {text}")

        # Process transcription in real-time
        # - Display in UI
        # - Run sentiment analysis
        # - Detect keywords

def on_open(ws):
    # Subscribe to transcription events
    subscription = {
        "type": "subscribe",
        "topics": [
            "customer_id:12345678-1234-1234-1234-123456789012:transcribe:*"
        ]
    }
    ws.send(json.dumps(subscription))
    print("Subscribed to transcription events")

token = "<YOUR_AUTH_TOKEN>"
ws_url = f"wss://api.voipbin.net/v1.0/ws?token={token}"

ws = websocket.WebSocketApp(
    ws_url,
    on_open=on_open,
    on_message=on_message
)

ws.run_forever()
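The example above exits when the connection drops. Per the reconnection note in Troubleshooting, you might wrap run_forever in a retry loop with capped exponential backoff. A sketch under assumed defaults (the 1s base / 30s cap schedule is an assumption, not a VoIPBIN requirement):

```python
import time

def backoff_delays(base=1.0, cap=30.0):
    # Capped exponential backoff: 1, 2, 4, ... up to `cap` seconds.
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)

def run_with_reconnect(make_ws, max_attempts=None, sleep=time.sleep):
    # make_ws builds a fresh WebSocketApp; run_forever returns when the
    # socket closes, so each loop iteration is one connection attempt.
    attempts = 0
    for delay in backoff_delays():
        attempts += 1
        make_ws().run_forever()
        if max_attempts and attempts >= max_attempts:
            break
        sleep(delay)
```

In practice you would also reset the backoff after a connection that stayed up for a while, and resubscribe to your topics in on_open after every reconnect.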

Receive Transcripts via Webhook

Configure webhooks to automatically receive transcription events.

1. Create Webhook:

$ curl --location --request POST 'https://api.voipbin.net/v1.0/webhooks?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "name": "Transcription Webhook",
        "uri": "https://your-server.com/webhook",
        "method": "POST",
        "event_types": [
            "transcribe_started",
            "transcribe_completed",
            "transcript_created"
        ]
    }'

2. Webhook Payload Example:

POST https://your-server.com/webhook

{
    "event_type": "transcript_created",
    "timestamp": "2026-01-20T12:00:00.000000Z",
    "data": {
        "id": "9d59e7f0-7bdc-4c52-bb8c-bab718952050",
        "transcribe_id": "8c5a9e2a-2a7f-4a6f-9f1d-debd72c279ce",
        "direction": "in",
        "message": "I need help with my account",
        "tm_transcript": "0001-01-01 00:00:15.500000",
        "tm_create": "2024-04-04 07:16:05.100000"
    }
}

3. Process Webhook in Your Server:

# Python Flask example
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def transcription_webhook():
    payload = request.get_json()
    event_type = payload.get('event_type')

    if event_type == 'transcript_created':
        transcript = payload['data']
        transcribe_id = transcript['transcribe_id']
        message = transcript['message']
        direction = transcript['direction']

        # Store transcript in database
        store_transcript(transcribe_id, message, direction)

        # Analyze content
        sentiment = analyze_sentiment(message)
        keywords = extract_keywords(message)

        # Trigger actions based on content
        if 'urgent' in message.lower():
            alert_supervisor(transcribe_id)

    return jsonify({'status': 'received'}), 200

Supported Languages

VoIPBIN supports transcription in multiple languages. See supported languages.

Common Languages: - en-US - English (United States) - en-GB - English (United Kingdom) - es-ES - Spanish (Spain) - fr-FR - French (France) - de-DE - German (Germany) - ja-JP - Japanese (Japan) - ko-KR - Korean (Korea) - zh-CN - Chinese (Simplified)

Example with Different Language:

{
    "type": "transcribe_start",
    "option": {
        "language": "ja-JP"
    }
}

Common Use Cases

1. Customer Service Quality Assurance:

# Monitor customer service calls
def on_transcript(transcript):
    # Check for quality metrics
    if contains_greeting(transcript):
        mark_greeting_present()

    if contains_problem_resolution(transcript):
        mark_resolved()

    # Flag negative sentiment
    if analyze_sentiment(transcript) < 0.3:
        flag_for_review()

2. Compliance and Record-Keeping:

# Store all call transcripts for compliance
def store_for_compliance(call_id, transcribe_id):
    transcripts = get_transcripts(transcribe_id)

    # Create formatted record
    record = {
        'call_id': call_id,
        'date': datetime.now(),
        'full_transcript': format_transcript(transcripts),
        'participants': get_participants(call_id)
    }

    # Store in compliance database
    compliance_db.store(record)

3. Real-Time Agent Assistance:

# Help agents during calls
def on_real_time_transcript(transcript):
    # Detect customer questions
    if is_question(transcript['message']):
        # Suggest answers to agent
        answers = knowledge_base.search(transcript['message'])
        display_to_agent(answers)

    # Detect customer frustration
    if detect_frustration(transcript['message']):
        suggest_supervisor_escalation()

4. Automated Call Summarization:

# Generate call summaries
def summarize_call(transcribe_id):
    transcripts = get_all_transcripts(transcribe_id)

    # Combine all transcripts
    full_text = ' '.join([t['message'] for t in transcripts])

    # Generate summary using AI
    summary = ai_summarize(full_text)

    # Extract key points
    action_items = extract_action_items(full_text)
    topics = extract_topics(full_text)

    return {
        'summary': summary,
        'action_items': action_items,
        'topics': topics
    }

5. Keyword Detection and Alerting:

# Monitor for important keywords
ALERT_KEYWORDS = ['urgent', 'emergency', 'cancel', 'complaint', 'lawsuit']

def on_transcript(transcript):
    message = transcript['message'].lower()

    for keyword in ALERT_KEYWORDS:
        if keyword in message:
            # Send immediate alert
            send_alert(
                transcribe_id=transcript['transcribe_id'],
                keyword=keyword,
                context=message
            )

            # Escalate to supervisor
            escalate_call(transcript['transcribe_id'])

6. Multi-Language Customer Support:

# Detect the caller's language yourself, then transcribe in it
def start_multilingual_transcription(call_id):
    # VoIPBIN has no auto-detect mode; use your own detection
    # (e.g., an external service on the first few seconds of audio)
    detected_language = detect_language(call_id)

    # Start transcription in detected language
    start_transcribe(
        reference_id=call_id,
        language=detected_language
    )

    # Optionally translate to agent's language
    if detected_language != 'en-US':
        enable_translation(call_id, target_lang='en-US')

Best Practices

1. Choose the Right Trigger Method:

  • Flow Action: Use when transcription is always needed for specific flows

  • Manual API: Use when transcription is conditional or triggered by user action

2. Handle Real-Time Events Efficiently:

  • Process transcripts asynchronously to avoid blocking

  • Buffer transcripts if processing takes time

  • Use queues for high-volume scenarios

3. Language Selection:

  • Specify the language explicitly - there is no auto-detect mode

  • Set the correct regional variant for better accuracy

  • Test with actual customer accents and dialects

4. Data Management:

  • Store transcripts separately from call records

  • Implement retention policies (GDPR, compliance)

  • Encrypt sensitive transcriptions

5. Error Handling:

  • Handle cases where transcription fails

  • Add retry logic for temporary failures

  • Log failures for debugging

6. Testing:

  • Test with various audio qualities

  • Verify accuracy with different accents

  • Test real-time latency
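The asynchronous-processing practice above can be sketched with a worker thread draining a queue, so the webhook handler only enqueues and returns immediately (names here are illustrative):

```python
import queue
import threading

transcript_queue = queue.Queue()

def worker(process):
    # Drain the queue in the background; webhook handlers just enqueue.
    while True:
        item = transcript_queue.get()
        if item is None:          # sentinel: stop the worker
            break
        process(item)
        transcript_queue.task_done()

def start_worker(process):
    t = threading.Thread(target=worker, args=(process,), daemon=True)
    t.start()
    return t
```

A webhook handler would then call transcript_queue.put(payload['data']) and return 200 OK without waiting for sentiment analysis or storage to finish.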

Transcription Lifecycle

1. Start Transcription:

POST /v1.0/transcribes
→ Returns transcribe_id

2. Active Transcription:

Status: "progressing"
→ Transcripts being generated in real-time

3. Receive Transcripts:

Via WebSocket: transcript_created events
Via Webhook: POST to your endpoint
Via API: GET /v1.0/transcripts?transcribe_id=...

4. Completion:

Status: "done"
→ All transcripts available via API
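Since transcription is asynchronous, step 4 is often driven by a poll loop. A sketch with the HTTP call injected so the control flow stays clear (fetch_status stands in for a GET /transcribes/{id} request returning the status field):

```python
import time

def wait_until_done(fetch_status, poll_interval=2.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    # Poll GET /transcribes/{id} until the session reports "done",
    # giving up after `timeout` seconds.
    deadline = clock() + timeout
    while clock() < deadline:
        if fetch_status() == "done":
            return True
        sleep(poll_interval)
    return False
```

Once this returns True, fetch the text with GET /v1.0/transcripts?transcribe_id=... as shown earlier.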

Troubleshooting

Common Issues:

No transcripts generated:

  • Verify the call has audio

  • Check the language setting is correct

  • Ensure transcription started successfully

Poor transcription accuracy:

  • Use the correct language code

  • Check audio quality

  • Verify clear speech (no background noise)

Missing real-time events:

  • Verify the WebSocket subscription is active

  • Check the topic pattern matches the transcribe_id

  • Ensure the network connection is stable

Delayed transcripts:

  • Real-time transcription has a ~2-5 second delay (normal)

  • Check network latency

  • Verify your server can handle the webhook volume

For more information about transcription features and configuration, see Transcribe Overview.

Transcribe

Transcribe

{
    "id": "<string>",
    "customer_id": "<string>",
    "reference_type": "<string>",
    "reference_id": "<string>",
    "status": "<string>",
    "language": "<string>",
    "provider": "<string>",
    "tm_create": "<string>",
    "tm_update": "<string>",
    "tm_delete": "<string>"
}
  • id (UUID): The transcribe session’s unique identifier. Returned when creating a transcription via POST /transcribes or listing via GET /transcribes.

  • customer_id (UUID): The customer who owns this transcription. Obtained from GET /customers.

  • reference_type (enum string): The type of resource being transcribed. See Reference Type.

  • reference_id (UUID): The ID of the resource being transcribed. Depending on reference_type, obtained from GET /calls, GET /recordings, or GET /conferences.

  • status (enum string): The transcription session’s current status. See Status.

  • language (string, BCP47): The language code for transcription (e.g., en-US, ko-KR, ja-JP). See Supported Languages.

  • provider (enum string, optional): The STT provider used for this transcription. See Provider.

  • tm_create (string, ISO 8601): Timestamp when the transcribe session was created.

  • tm_update (string, ISO 8601): Timestamp of the last update to any transcribe property.

  • tm_delete (string, ISO 8601): Timestamp when the transcribe session was deleted. Set to 9999-01-01 00:00:00.000000 if not deleted.

Note

AI Implementation Hint

Timestamps set to 9999-01-01 00:00:00.000000 indicate the event has not yet occurred. For example, tm_delete with this value means the transcription has not been deleted.
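
A small helper illustrating the sentinel check; `is_deleted` is an illustrative name, not an API call:

```python
# VoIPBIN's "not yet happened" sentinel for timestamp fields.
SENTINEL = "9999-01-01 00:00:00.000000"

def is_deleted(transcribe):
    """True if the transcribe session has been soft-deleted.

    Any tm_delete value other than the sentinel means a deletion
    timestamp was actually recorded.
    """
    return transcribe.get("tm_delete", SENTINEL) != SENTINEL
```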

Example

{
    "id": "bbf08426-3979-41bc-a544-5fc92c237848",
    "customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
    "reference_type": "call",
    "reference_id": "12f8f1c9-a6c3-4f81-93db-ae445dcf188f",
    "status": "done",
    "language": "en-US",
    "provider": "gcp",
    "tm_create": "2024-04-01 07:17:04.091019",
    "tm_update": "2024-04-01 13:25:32.428602",
    "tm_delete": "9999-01-01 00:00:00.000000"
}

reference_type

All possible values for the reference_type field:

  • call: Transcribing a live call in real-time. The reference_id is a call ID from GET /calls.

  • recording: Transcribing a previously recorded audio file. The reference_id is a recording ID from GET /recordings.

  • confbridge: Transcribing a live conference. The reference_id is a conference ID from GET /conferences.

provider

All possible values for the provider field:

  • gcp: Google Cloud Speech-to-Text

  • aws: Amazon Transcribe

When creating a transcription, the provider field is optional. If omitted, VoIPBIN selects the best available provider automatically (default order: GCP, then AWS). If a specific provider is requested but unavailable, the system falls back to the default order.
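
The selection rule described above can be approximated client-side as follows; `choose_provider` and the `available` set are hypothetical, since the real decision happens server-side:

```python
# Documented default fallback order: GCP first, then AWS.
DEFAULT_ORDER = ["gcp", "aws"]

def choose_provider(requested, available):
    """Use the requested provider if available; otherwise fall back
    through the default order."""
    candidates = ([requested] if requested else []) + DEFAULT_ORDER
    for provider in candidates:
        if provider in available:
            return provider
    raise RuntimeError("no STT provider available")
```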

status

All possible values for the status field:

  • progressing: Transcription is actively in progress. New transcript segments are being generated and delivered via webhook or WebSocket.

  • done: Transcription is complete. No more transcript segments will be generated. All transcripts are available via GET /transcripts?transcribe_id={id}.

Transcription

{
    "id": "<string>",
    "transcribe_id": "<string>",
    "direction": "<string>",
    "message": "<string>",
    "tm_transcript": "<string>",
    "tm_create": "<string>",
}
  • id (UUID): The individual transcript segment’s unique identifier.

  • transcribe_id (UUID): The parent transcribe session’s ID. Obtained from GET /transcribes or the response of POST /transcribes.

  • direction (enum string): Whether the speech was incoming or outgoing. See Direction.

  • message (string): The transcribed text content of this speech segment.

  • tm_transcript (string, offset): Time offset within the call when this speech occurred. Uses 0001-01-01 00:00:00 as epoch; the time portion represents the offset from the start of the transcription session (e.g., 0001-01-01 00:01:04.441160 means 1 minute and 4 seconds into the call). Sort by this field to reconstruct conversation order.

  • tm_create (string, ISO 8601): Absolute timestamp when this transcript segment was created.

Note

AI Implementation Hint

The tm_transcript field is a time offset, not an absolute timestamp. Its date part (0001-01-01) is a sentinel value meaning “relative to the start of the transcription session.” To reconstruct a conversation in order, sort all transcript segments by tm_transcript, not by tm_create (which reflects delivery time, not speech time).
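
A sketch of decoding the offset and ordering segments, following the rule above (helper names are illustrative):

```python
from datetime import datetime

# tm_transcript's date part is the 0001-01-01 sentinel epoch.
EPOCH = datetime(1, 1, 1)

def offset_seconds(tm_transcript):
    """Convert a tm_transcript value into seconds from session start."""
    parsed = datetime.strptime(tm_transcript, "%Y-%m-%d %H:%M:%S.%f")
    return (parsed - EPOCH).total_seconds()

def in_order(transcripts):
    """Sort transcript segments into spoken (not delivery) order."""
    return sorted(transcripts,
                  key=lambda t: offset_seconds(t["tm_transcript"]))
```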

Example

{
    "id": "06af78f0-b063-48c0-b22d-d31a5af0aa88",
    "transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
    "direction": "in",
    "message": "Hi, good to see you. How are you today.",
    "tm_transcript": "0001-01-01 00:05:04.441160",
    "tm_create": "2024-04-01 07:22:07.229309"
}

direction

All possible values for the direction field:

  • in: Incoming speech toward VoIPBIN (i.e., what the caller/remote party said).

  • out: Outgoing speech from VoIPBIN (i.e., TTS audio, recorded prompts, or the connected party’s speech sent from VoIPBIN).

Speech Webhook Message

The speech webhook message is the payload delivered for transcribe_speech_started, transcribe_speech_interim, and transcribe_speech_ended events. These events are generated during a real-time streaming transcription session when voice activity is detected.

{
    "id": "<string>",
    "customer_id": "<string>",
    "streaming_id": "<string>",
    "transcribe_id": "<string>",
    "direction": "<string>",
    "message": "<string>",
    "tm_event": "<string>",
    "tm_create": "<string>"
}
  • id (UUID): The unique identifier of the speech event.

  • customer_id (UUID): The customer who owns this transcription session. Obtained from GET /customers.

  • streaming_id (UUID): The unique identifier of the audio streaming session that produced this event.

  • transcribe_id (UUID): The parent transcribe session’s ID. Obtained from GET /transcribes or the response of POST /transcribes.

  • direction (enum string): Whether the speech was incoming or outgoing. See Direction.

  • message (string): The interim transcribed text. Present for transcribe_speech_interim events; omitted for transcribe_speech_started and transcribe_speech_ended events.

  • tm_event (string, ISO 8601): Timestamp when the speech event occurred.

  • tm_create (string, ISO 8601): Timestamp when the speech event record was created.

Example

transcribe_speech_started:

{
    "type": "transcribe_speech_started",
    "data": {
        "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
        "customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
        "streaming_id": "c0d1e2f3-a4b5-6c7d-8e9f-0a1b2c3d4e5f",
        "transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
        "direction": "in",
        "tm_event": "2024-04-01 07:22:07.229309",
        "tm_create": "2024-04-01 07:22:07.229309"
    }
}

transcribe_speech_interim:

{
    "type": "transcribe_speech_interim",
    "data": {
        "id": "b2c3d4e5-f6a7-8901-bcde-f23456789012",
        "customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
        "streaming_id": "c0d1e2f3-a4b5-6c7d-8e9f-0a1b2c3d4e5f",
        "transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
        "direction": "in",
        "message": "Hello, I need help with my account",
        "tm_event": "2024-04-01 07:22:08.115000",
        "tm_create": "2024-04-01 07:22:08.115000"
    }
}

transcribe_speech_ended:

{
    "type": "transcribe_speech_ended",
    "data": {
        "id": "c3d4e5f6-a7b8-9012-cdef-345678901234",
        "customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
        "streaming_id": "c0d1e2f3-a4b5-6c7d-8e9f-0a1b2c3d4e5f",
        "transcribe_id": "bbf08426-3979-41bc-a544-5fc92c237848",
        "direction": "in",
        "tm_event": "2024-04-01 07:22:12.500000",
        "tm_create": "2024-04-01 07:22:12.500000"
    }
}
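
A minimal dispatcher for these three event types might look like this; the callback names are illustrative, and the payload shapes follow the examples above:

```python
def handle_speech_event(event, on_started=None, on_interim=None,
                        on_ended=None):
    """Route a speech webhook payload to the matching callback.

    Returns the callback's result, or None when no handler is
    registered for the event type.
    """
    handlers = {
        "transcribe_speech_started": on_started,
        "transcribe_speech_interim": on_interim,
        "transcribe_speech_ended": on_ended,
    }
    handler = handlers.get(event["type"])
    if handler is None:
        return None
    return handler(event["data"])
```

For live captions, for example, `on_interim` could render `data["message"]` as it arrives and `on_ended` could finalize the caption line.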