Tutorial

Bi-Directional Media Streaming for Calls

Connect to a call’s media stream via WebSocket to send and receive audio in real time. This lets you build custom audio-processing applications without handling SIP signaling.

Establish WebSocket Connection:

GET https://api.voipbin.net/v1.0/calls/<call-id>/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

Example:

GET https://api.voipbin.net/v1.0/calls/652af662-eb45-11ee-b1a5-6fde165f9226/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

This creates a bi-directional WebSocket connection where you can:

- Receive audio from the call (what the other party is saying)
- Send audio to the call (inject audio into the conversation)

Bi-Directional Media Streaming for Conferences

Access a conference’s media stream to monitor or participate in the conference audio.

GET https://api.voipbin.net/v1.0/conferences/<conference-id>/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

Example:

GET https://api.voipbin.net/v1.0/conferences/1ed12456-eb4b-11ee-bba8-1bfb2838807a/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

This allows you to:

- Listen to all conference participants
- Inject audio into the conference
- Build custom conference recording or analysis tools
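For example, the following minimal sketch (using the Python websocket-client package) connects to a conference stream and appends the received audio to a file; the conference ID and token are the placeholders from the example above:

# Minimal sketch: record a conference's mixed audio to a raw file.
# Assumes the websocket-client package (pip install websocket-client);
# the conference ID and token are placeholders.
import websocket

conference_id = "1ed12456-eb4b-11ee-bba8-1bfb2838807a"
token = "<YOUR_AUTH_TOKEN>"
ws_url = (
    f"wss://api.voipbin.net/v1.0/conferences/{conference_id}/media_stream"
    f"?encapsulation=rtp&token={token}"
)

def on_message(ws, message):
    # Strip the 12-byte RTP header and append the ulaw payload to a file
    with open("conference_audio.ulaw", "ab") as f:
        f.write(message[12:])

ws = websocket.WebSocketApp(ws_url, on_message=on_message)
ws.run_forever()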

Encapsulation Types

VoIPBIN supports three encapsulation types for media streaming:

1. RTP (Real-time Transport Protocol)

Standard protocol for audio/video over IP networks.

?encapsulation=rtp

Use cases:

- Standard VoIP integration
- Compatible with most audio processing tools
- Industry-standard protocol

2. SLN (Signed Linear Mono)

Raw audio stream without headers or padding.

?encapsulation=sln

Use cases:

- Minimal overhead needed
- Simple audio processing
- Direct PCM audio access
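Because SLN is headerless 16-bit signed PCM, received frames can be written directly into a WAV container for playback. A minimal sketch, assuming the websocket-client package and little-endian byte order (matching the AudioSocket format described below); the call ID and token are placeholders:

# Minimal sketch: capture an SLN stream into a playable WAV file.
# SLN frames are headerless 16-bit signed PCM at 8 kHz mono,
# so they can be written into a WAV container as-is.
import wave
import websocket

ws_url = (
    "wss://api.voipbin.net/v1.0/calls/<call-id>/media_stream"
    "?encapsulation=sln&token=<YOUR_AUTH_TOKEN>"
)

wav = wave.open("call_audio.wav", "wb")
wav.setnchannels(1)   # mono
wav.setsampwidth(2)   # 16-bit samples
wav.setframerate(8000)

def on_message(ws, message):
    wav.writeframes(message)  # no header to strip; SLN is raw PCM

def on_close(ws, status, msg):
    wav.close()

ws = websocket.WebSocketApp(ws_url, on_message=on_message, on_close=on_close)
ws.run_forever()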

3. AudioSocket

Asterisk-specific protocol for simple audio streaming.

?encapsulation=audiosocket

Use cases:

- Asterisk integration
- Low-overhead streaming
- Simple audio applications

Codec: All formats carry 8 kHz mono audio: μ-law (G.711) for RTP, 16-bit signed linear PCM for SLN, and 16-bit PCM little-endian for AudioSocket.

WebSocket Client Examples

Python Example (RTP Streaming):

import websocket

def process_audio(rtp_packet):
    """Process received RTP audio"""
    # Extract payload from the RTP packet.
    # The fixed RTP header is 12 bytes; packets with CSRC entries or
    # header extensions are longer, so parse the header in production.
    payload = rtp_packet[12:]

    # Save or process audio
    with open('received_audio.raw', 'ab') as f:
        f.write(payload)

def generate_audio():
    """Generate RTP packets to send"""
    # This is a simplified example.
    # In production, construct proper RTP packets (header with sequence
    # number, timestamp, and SSRC) using an RTP library.

    # Read 20ms of 8kHz audio (160 bytes)
    with open('audio_to_inject.raw', 'rb') as f:
        audio_data = f.read(160)

    return audio_data

def on_message(ws, message):
    """Receive audio data from the call"""
    # message contains RTP packets
    print(f"Received {len(message)} bytes of audio")

    # Process audio here:
    # - Save to file
    # - Run speech recognition
    # - Analyze audio
    process_audio(message)

def on_open(ws):
    """Connection established, can start sending audio"""
    print("Media stream connected")

    # Send audio to the call.
    # audio_data should be RTP packets, sent as binary frames.
    audio_data = generate_audio()
    ws.send(audio_data, opcode=websocket.ABNF.OPCODE_BINARY)

def on_error(ws, error):
    print(f"Error: {error}")

def on_close(ws, close_status_code, close_msg):
    print(f"Connection closed: {close_status_code}")

# Connect to media stream
call_id = "652af662-eb45-11ee-b1a5-6fde165f9226"
token = "<YOUR_AUTH_TOKEN>"
ws_url = f"wss://api.voipbin.net/v1.0/calls/{call_id}/media_stream?encapsulation=rtp&token={token}"

ws = websocket.WebSocketApp(
    ws_url,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)

ws.run_forever()

JavaScript Example (Browser):

const callId = '652af662-eb45-11ee-b1a5-6fde165f9226';
const token = '<YOUR_AUTH_TOKEN>';
const wsUrl = `wss://api.voipbin.net/v1.0/calls/${callId}/media_stream?encapsulation=rtp&token=${token}`;

const ws = new WebSocket(wsUrl);
ws.binaryType = 'arraybuffer';

ws.onopen = function() {
    console.log('Media stream connected');

    // Send audio to the call
    const audioData = generateAudio();
    ws.send(audioData);
};

ws.onmessage = function(event) {
    // Receive audio from the call
    const audioData = event.data;
    console.log(`Received ${audioData.byteLength} bytes`);

    // Process audio
    processAudio(new Uint8Array(audioData));
};

ws.onerror = function(error) {
    console.error('WebSocket error:', error);
};

ws.onclose = function() {
    console.log('Media stream closed');
};

function processAudio(audioBuffer) {
    // Process received audio
    // - Play through Web Audio API
    // - Run speech recognition
    // - Visualize audio
}

function generateAudio() {
    // Generate audio to send.
    // Returns an ArrayBuffer containing one RTP packet:
    // 12-byte header + 160-byte payload (20ms of 8kHz ulaw audio)
    return new ArrayBuffer(172);
}

Node.js Example (AudioSocket):

const WebSocket = require('ws');
const fs = require('fs');

const callId = '652af662-eb45-11ee-b1a5-6fde165f9226';
const token = '<YOUR_AUTH_TOKEN>';
const wsUrl = `wss://api.voipbin.net/v1.0/calls/${callId}/media_stream?encapsulation=audiosocket&token=${token}`;

const ws = new WebSocket(wsUrl);

ws.on('open', function() {
    console.log('AudioSocket connected');

    // Send audio file
    const audioFile = fs.readFileSync('audio.pcm');

    // Send in 20ms chunks (320 bytes for 16-bit 8kHz mono),
    // paced at real-time rate rather than all at once
    const chunkSize = 320;
    let offset = 0;
    const timer = setInterval(() => {
        if (offset >= audioFile.length) {
            clearInterval(timer);
            return;
        }
        ws.send(audioFile.slice(offset, offset + chunkSize));
        offset += chunkSize;
    }, 20);
});

ws.on('message', function(data) {
    // Receive audio from call
    console.log(`Received ${data.length} bytes`);

    // Save received audio
    fs.appendFileSync('received_audio.pcm', data);
});

ws.on('error', function(error) {
    console.error('Error:', error);
});

ws.on('close', function() {
    console.log('AudioSocket closed');
});

Uni-Directional Streaming with Flow Action

For sending audio to a call without receiving audio back, use the external_media_start flow action.

Create Call with External Media:

$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "source": {
            "type": "tel",
            "target": "+15551234567"
        },
        "destinations": [
            {
                "type": "tel",
                "target": "+15559876543"
            }
        ],
        "actions": [
            {
                "type": "answer"
            },
            {
                "type": "external_media_start",
                "option": {
                    "url": "wss://your-media-server.com/audio-stream",
                    "encapsulation": "audiosocket"
                }
            }
        ]
    }'

This creates a uni-directional stream where VoIPBIN:

1. Establishes the call
2. Connects to your media server via WebSocket
3. Receives audio from your server
4. Plays that audio to the call participant

Your media server receives:

WebSocket connection from VoIPBIN
→ Send audio chunks (PCM format for AudioSocket)
→ VoIPBIN plays audio to call
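A minimal sketch of such a media server, assuming the Python websockets package (version 10.1 or later for the single-argument handler); the port and the prompt.pcm filename are placeholders:

# Minimal sketch of the media server that VoIPBIN connects to.
# Audio is 16-bit 8kHz mono PCM (AudioSocket format), sent in
# paced 20ms chunks of 320 bytes.
import asyncio
import websockets

CHUNK_SIZE = 320   # 20ms of 16-bit 8kHz mono PCM
CHUNK_SEC = 0.02

async def handle_stream(websocket):
    print("VoIPBIN connected")
    with open("prompt.pcm", "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            await websocket.send(chunk)      # binary frame
            await asyncio.sleep(CHUNK_SEC)   # pace at real-time rate

async def main():
    async with websockets.serve(handle_stream, "0.0.0.0", 8080):
        await asyncio.Future()  # run forever

asyncio.run(main())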

Common Use Cases

1. Real-Time Speech Recognition:

# Python example
def on_message(ws, message):
    # Extract audio from RTP packet
    audio = extract_audio(message)

    # Send to speech recognition API
    text = speech_to_text(audio)
    print(f"Recognized: {text}")

    # Store transcription
    save_transcription(text)

2. Audio Injection / IVR Replacement:

// Node.js example
ws.on('open', function() {
    // Play custom audio prompts
    const prompt1 = fs.readFileSync('welcome.pcm');
    ws.send(prompt1);

    // Wait for DTMF or speech
    // Then play next prompt
});

3. Conference Recording:

# Python example
def on_message(ws, message):
    # Save all conference audio
    with open(f'conference_{conference_id}.raw', 'ab') as f:
        f.write(extract_audio(message))

4. Real-Time Audio Analysis:

def on_message(ws, message):
    audio = extract_audio(message)

    # Detect emotion
    emotion = analyze_emotion(audio)

    # Detect keywords
    if detect_keyword(audio, ['help', 'urgent']):
        alert_supervisor()

    # Calculate audio quality
    quality = measure_quality(audio)

5. Custom Music on Hold:

ws.on('open', function() {
    // Play custom music or messages
    const music = fs.readFileSync('hold_music.pcm');

    // Loop the music in paced 20ms chunks (320 bytes) while the call
    // is on hold, rather than re-sending the whole file at once
    let offset = 0;
    setInterval(() => {
        ws.send(music.slice(offset, offset + 320));
        offset = (offset + 320) % music.length;
    }, 20);
});

6. AI-Powered Voice Assistant:

ws.on('message', async function(data) {
    // Receive customer audio
    const audio = extractAudio(data);

    // Send to AI for processing
    const response = await aiProcess(audio);

    // Convert AI response to audio
    const responseAudio = textToSpeech(response);

    // Send back to call
    ws.send(responseAudio);
});

Audio Format Details

RTP Format:

- Codec: ulaw (G.711 μ-law)
- Sample rate: 8 kHz
- Bits: 8-bit μ-law samples (decode to 16-bit linear PCM)
- Channels: Mono
- Packet size: 160-byte payload (20ms of audio)

SLN Format:

- Raw PCM audio
- No headers or padding
- Sample rate: 8 kHz
- Bits: 16-bit signed
- Channels: Mono

AudioSocket Format:

- PCM little-endian
- Sample rate: 8 kHz
- Bits: 16-bit
- Channels: Mono
- Chunk size: 320 bytes (20ms of audio)
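Since RTP carries μ-law while SLN and AudioSocket carry 16-bit linear PCM, converting between them requires G.711 expansion. A minimal sketch of the standard μ-law decoding algorithm in Python:

# Minimal sketch: decode a G.711 ulaw RTP payload to 16-bit
# little-endian linear PCM, matching the SLN/AudioSocket sample format.
import struct

def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law sample to a 16-bit signed value."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def decode_payload(ulaw_payload):
    """Convert a ulaw RTP payload (e.g. 160 bytes) to 16-bit PCM."""
    samples = [ulaw_to_linear(b) for b in ulaw_payload]
    return struct.pack(f'<{len(samples)}h', *samples)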

Best Practices

1. Buffer Management:

- Maintain audio buffers to handle jitter
- Send audio in consistent 20ms chunks
- Don’t send faster or slower than real-time rate (see the sketch after this list)

2. Error Handling:

- Implement reconnection logic (see the sketch after this list)
- Handle WebSocket disconnections gracefully
- Log errors for debugging

3. Audio Quality:

- Use proper RTP packet construction
- Maintain correct timing for audio chunks
- Monitor for packet loss

4. Resource Management:

- Close the WebSocket when done
- Don’t leave connections open indefinitely
- Clean up audio buffers and files

5. Testing:

- Test with various network conditions
- Verify audio quality with real calls
- Monitor latency and packet loss

6. Security:

- Use WSS (secure WebSocket) in production
- Validate authentication tokens
- Encrypt sensitive audio data
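The following sketch illustrates the buffer-management and error-handling practices above: it paces outbound audio in 20ms chunks and reconnects with exponential backoff after a drop. It assumes the websocket-client package; the URL, call ID, and filename are placeholders.

# Minimal sketch: paced sending plus reconnection with backoff.
import time
import websocket

CHUNK_SIZE = 320   # 20ms of 16-bit 8kHz mono PCM
CHUNK_SEC = 0.02

def stream_file(ws_url, path):
    ws = websocket.create_connection(ws_url)
    try:
        with open(path, 'rb') as f:
            next_send = time.monotonic()
            while chunk := f.read(CHUNK_SIZE):
                ws.send_binary(chunk)
                # Pace at real-time rate: one chunk every 20ms
                next_send += CHUNK_SEC
                time.sleep(max(0, next_send - time.monotonic()))
    finally:
        ws.close()  # don't leave the connection open

def stream_with_retry(ws_url, path, max_attempts=5):
    delay = 1
    for attempt in range(max_attempts):
        try:
            stream_file(ws_url, path)
            return
        except (websocket.WebSocketException, ConnectionError) as err:
            print(f"Connection lost ({err}); retrying in {delay}s")
            time.sleep(delay)
            delay = min(delay * 2, 30)  # exponential backoff

stream_with_retry(
    "wss://api.voipbin.net/v1.0/calls/<call-id>/media_stream"
    "?encapsulation=audiosocket&token=<YOUR_AUTH_TOKEN>",
    "audio.pcm",
)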

Connection Lifecycle

1. Establish Connection:

GET /v1.0/calls/<call-id>/media_stream?encapsulation=rtp&token=<token>

2. WebSocket Upgrade:

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade

3. Bi-Directional Communication:

Client ←→ VoIPBIN
- Send audio: Binary frames with RTP packets
- Receive audio: Binary frames with RTP packets

4. Close Connection:

ws.close()

Troubleshooting

Common Issues:

No audio received:

- Check that the WebSocket connection is established
- Verify the call is active and answered
- Ensure the correct encapsulation type

Poor audio quality:

- Check network latency
- Verify the audio format matches requirements
- Monitor packet loss

Connection drops:

- Implement reconnection logic
- Check firewall rules for WebSocket traffic
- Verify the authentication token is valid

Can’t send audio:

- Ensure binary frames are used (not text frames)
- Verify the audio format is correct
- Check the audio chunk size (typically 20ms)

For more information about media stream configuration, see Media Stream Overview.