Mediastream

A mediastream provides real-time audio streaming over WebSocket or audiosocket, enabling live audio processing such as speech-to-text and AI-driven voice interactions.

API Reference: Mediastream endpoints

Overview

Note

AI Context

  • Complexity: High

  • Cost: Free (no additional charge beyond the underlying call/conference costs)

  • Async: Yes. GET /calls/{id}/media_stream upgrades to a WebSocket connection for real-time bi-directional audio streaming. The connection remains open for the duration of the call. Uni-directional streaming via the external_media_start flow action is initiated by VoIPBIN connecting to your server.

VoIPBIN’s Media Stream API provides direct access to call and conference audio via WebSocket connections. Instead of relying on SIP signaling for media control, you can stream audio bidirectionally with your applications for real-time processing, AI integration, custom IVR, and more.

With the Media Stream API you can:

  • Stream live call audio to your application in real-time

  • Inject audio into calls and conferences

  • Build AI voice assistants with direct audio access

  • Create custom speech recognition pipelines

  • Implement real-time audio analysis and monitoring

How Media Streaming Works

When you connect to a media stream, VoIPBIN establishes a WebSocket connection that carries audio data directly between the call/conference and your application.

Media Stream Architecture

Traditional VoIP:                   VoIPBIN Media Stream:
+-------+   SIP   +-------+         +-------+   WebSocket  +----------+
| Phone |<------->|VoIPBIN|         | Call  |<============>| Your App |
+-------+         +-------+         +-------+              +----------+
     (signaling only)                    (direct audio access)

Key Differences from Traditional VoIP

Aspect         Traditional SIP            Media Streaming
-------------  -------------------------  ----------------------------
Audio Access   Via RTP to SIP endpoints   Direct WebSocket to your app
Control        SIP signaling              API and WebSocket
Integration    Requires SIP stack         Simple WebSocket client
Use Cases      Phone-to-phone calls       AI, custom IVR, analysis

System Components

+--------+                                              +-----------+
|  Call  |<------- RTP ------->+                        |           |
+--------+                     |                        |  Your     |
                          +----+-----+                  |  App      |
                          | VoIPBIN  |<== WebSocket ===>|           |
                          | Media    |                  | - AI/ML   |
                          | Bridge   |                  | - STT/TTS |
+------------+            +----+-----+                  | - IVR     |
| Conference |<-- RTP --->+                             |           |
+------------+                                          +-----------+

The Media Bridge handles protocol conversion between RTP (VoIP standard) and WebSocket (web standard), enabling any WebSocket-capable application to process call audio.

Streaming Modes

VoIPBIN supports two streaming modes based on your application’s needs.

Note

AI Implementation Hint

Media streaming requires the call or conference to be in progressing status (answered). The GET /calls/{id}/media_stream endpoint is a WebSocket upgrade, not a regular HTTP GET. Use a WebSocket client library, not a standard HTTP client. Audio must be sent as binary WebSocket frames in consistent 20ms chunks matching the selected encapsulation format.
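
As a quick reference, the expected 20ms frame sizes can be expressed as a small helper. This is an illustrative sketch, not part of any VoIPBIN SDK; the sizes come from the format specifications on this page:

```python
# Bytes per 20ms audio frame for each encapsulation, per the specs on
# this page. 8 kHz * 0.020 s = 160 samples per frame.
FRAME_BYTES = {
    "rtp": 12 + 160,     # 12-byte RTP header + 160-byte G.711 ulaw payload
    "sln": 160 * 2,      # raw 16-bit PCM, 2 bytes per sample
    "audiosocket": 320,  # 16-bit PCM chunk (protocol header is extra)
}

def frame_size(encapsulation: str) -> int:
    """Return the expected audio frame size in bytes for one 20ms chunk."""
    return FRAME_BYTES[encapsulation]
```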

Bi-Directional Streaming

Your application both receives and sends audio through the same WebSocket connection.

+----------+                              +----------+
|          |======= audio IN ============>|          |
| VoIPBIN  |                              | Your App |
|          |<====== audio OUT ============|          |
+----------+                              +----------+

Initiate via API:

GET https://api.voipbin.net/v1.0/calls/<call-id>/media_stream?encapsulation=rtp&token=<token>

Use cases:
- AI voice assistants (listen and respond)
- Interactive IVR systems
- Real-time audio processing with feedback
- Call bridging to custom systems

Uni-Directional Streaming

VoIPBIN receives audio from your server and plays it to the call. Your app sends audio but doesn’t receive call audio.

+----------+                              +----------+
|          |                              |          |
| VoIPBIN  |<====== audio only ===========| Your App |
|          |                              |          |
+----------+                              +----------+

Initiate via Flow Action:

{
    "type": "external_media_start",
    "option": {
        "url": "wss://your-server.com/audio",
        "encapsulation": "audiosocket"
    }
}

See the external_media_start flow action documentation for details.

Use cases:
- Custom music on hold
- Pre-recorded message playback
- Text-to-speech from external service
- Audio announcements

Mode Comparison

Aspect        Bi-Directional                  Uni-Directional
------------  ------------------------------  ------------------------------
Initiated by  Your app (WebSocket to API)     VoIPBIN (external_media_start)
Audio flow    Receive and send                Send to the call only
Typical use   AI assistants, interactive IVR  Hold music, announcements

Encapsulation Types

VoIPBIN supports three encapsulation types for different integration scenarios.

Decision Guide

                  What's your use case?
                         |
        +----------------+----------------+
        |                                 |
  Standard VoIP                      Simple audio
  integration?                       processing?
        |                                 |
  +-----+-----+                     +-----+-----+
  |           |                     |           |
 Yes         No                    Yes         No
  |           |                     |           |
  v           |                     v           |
[RTP]         |                   [SLN]         |
              |                                 |
         Asterisk                               |
         integration?                           |
              |                                 |
        +-----+-----+                           |
        |           |                           |
       Yes         No                           |
        |           |                           |
        v           +---------------------------+
  [AudioSocket]                 |
                                v
                          [RTP default]

RTP (Real-time Transport Protocol)

The standard protocol for audio/video over IP networks.

+------------------+------------------+
|   RTP Header     |   Audio Payload  |
|   (12 bytes)     |   (160 bytes)    |
+------------------+------------------+

Specification  Value
-------------  ------------------------------------------
Protocol       RTP over WebSocket
Codec          G.711 μ-law (ulaw)
Sample Rate    8 kHz
Sample Size    8-bit μ-law (decodes to 16-bit linear PCM)
Channels       Mono
Packet Size    172 bytes (12 header + 160 payload = 20ms)

Best for: Standard VoIP tools, industry compatibility, existing RTP processing pipelines.
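
A minimal sketch of packing and unpacking this 12-byte header with Python's standard struct module. This is illustrative only; a production system should use a full RTP library that also handles padding, extensions, and CSRC lists:

```python
import struct

RTP_HEADER = struct.Struct("!BBHII")  # flags, marker/PT, seq, timestamp, SSRC

def build_rtp(payload: bytes, seq: int, timestamp: int, ssrc: int,
              payload_type: int = 0) -> bytes:
    """Wrap a 160-byte ulaw payload in a minimal 12-byte RTP header.
    Version 2, no padding/extension/CSRC; PT 0 = PCMU (G.711 ulaw)."""
    header = RTP_HEADER.pack(0x80, payload_type & 0x7F,
                             seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

def parse_rtp(packet: bytes):
    """Split an RTP packet into (seq, timestamp, ssrc, payload)."""
    flags, m_pt, seq, timestamp, ssrc = RTP_HEADER.unpack_from(packet)
    assert flags >> 6 == 2, "not RTP version 2"
    return seq, timestamp, ssrc, packet[RTP_HEADER.size:]
```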

SLN (Signed Linear)

Raw audio without protocol overhead.

+----------------------------------+
|   Raw PCM Audio Data             |
|   (no headers, no padding)       |
+----------------------------------+

Specification  Value
-------------  ----------------------
Format         Raw PCM, signed linear
Sample Rate    8 kHz
Bit Depth      16-bit signed
Channels       Mono
Byte Order     Native

Best for: Minimal overhead, simple audio processing, direct PCM access without parsing.
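
Because SLN frames are headerless PCM, each frame can be processed directly as 16-bit samples. As an illustration, a simple level meter (sln_level is a hypothetical helper, not a VoIPBIN API):

```python
import array
import math

def sln_level(frame: bytes) -> float:
    """RMS level (0.0-1.0) of one SLN frame (16-bit signed, native order).
    SLN carries raw PCM, so the bytes can be interpreted directly."""
    samples = array.array("h", frame)  # 16-bit signed integers
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms / 32768.0
```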

AudioSocket

Asterisk-specific protocol designed for simple audio streaming.

+------------------+------------------+
| AudioSocket Hdr  |   PCM Audio      |
+------------------+------------------+

Specification  Value
-------------  -------------------------
Protocol       Asterisk AudioSocket
Format         PCM little-endian
Sample Rate    8 kHz
Bit Depth      16-bit
Channels       Mono
Chunk Size     320 bytes (20ms of audio)

Best for: Asterisk integration, simple streaming with minimal overhead.

See Asterisk AudioSocket Documentation for protocol details.
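
Assuming the standard Asterisk AudioSocket framing (a 1-byte kind, a 2-byte big-endian payload length, then the payload), frames can be built and parsed as in this sketch. The kind constants follow the Asterisk protocol documentation referenced above:

```python
import struct

# AudioSocket frame kinds (per the Asterisk AudioSocket protocol)
KIND_HANGUP, KIND_UUID, KIND_AUDIO, KIND_ERROR = 0x00, 0x01, 0x10, 0xFF

def build_frame(kind: int, payload: bytes = b"") -> bytes:
    """1-byte kind + 2-byte big-endian payload length + payload."""
    return struct.pack("!BH", kind, len(payload)) + payload

def read_frame(buf: bytes):
    """Parse one frame from buf; returns (kind, payload, bytes_consumed)."""
    kind, length = struct.unpack_from("!BH", buf)
    end = 3 + length
    return kind, buf[3:end], end
```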

Encapsulation Comparison

Aspect            RTP                SLN      AudioSocket
----------------  -----------------  -------  -----------------
Headers           12 bytes           None     Protocol header
Compatibility     Industry standard  Simple   Asterisk
Overhead          Low                Minimal  Low
Parsing Required  Yes (RTP)          No       Yes (AudioSocket)

Supported Resources

Media streaming is available for both calls and conferences.

Call Media Streaming

Stream audio from a single call.

GET https://api.voipbin.net/v1.0/calls/<call-id>/media_stream?encapsulation=<type>&token=<token>

Audio contains: Both parties’ audio mixed together.

Conference Media Streaming

Stream audio from a conference with multiple participants.

GET https://api.voipbin.net/v1.0/conferences/<conference-id>/media_stream?encapsulation=<type>&token=<token>

Audio contains: All participants’ audio mixed together.

Resource Comparison

Aspect           Call                    Conference
---------------  ----------------------  -------------------------------
Audio Source     Two-party conversation  Multi-party conversation
Audio Mix        Caller + callee         All participants
Audio Injection  Heard by both parties   Heard by all participants
Use Case         1:1 AI assistant        Conference monitoring/recording

Connection Lifecycle

Understanding the WebSocket connection lifecycle helps build robust streaming applications.

Connection Flow

Your App                         VoIPBIN
    |                               |
    | GET /calls/{id}/media_stream  |
    +------------------------------>|
    |                               |
    | 101 Switching Protocols       |
    |<------------------------------+
    |                               |
    |<======= audio frames ========>|  (bi-directional)
    |<======= audio frames ========>|
    |<======= audio frames ========>|
    |                               |
    | close() or call ends          |
    +------------------------------>|
    |                               |

Connection States

connecting ---> open ---> streaming ---> closing ---> closed
                               |
                               v
                          (call ends)
                               |
                               v
                            closed

State Descriptions

State       What's happening
----------  ---------------------------------------
connecting  WebSocket handshake in progress
open        Connection established, ready for audio
streaming   Audio frames being sent/received
closing     Graceful shutdown initiated
closed      Connection terminated

Connection Termination

The WebSocket connection closes when:

  • Your application closes the connection

  • The call or conference ends

  • Network failure occurs

  • Authentication token expires

Integration Patterns

Common patterns for integrating media streaming with your applications.

Pattern 1: AI Voice Assistant

Call Audio         Your App           AI Service
    |                  |                   |
    |====audio====>    |                   |
    |                  | STT               |
    |                  +------------------>|
    |                  |                   |
    |                  | AI response       |
    |                  |<------------------+
    |                  |                   |
    |                  | TTS               |
    |    <====audio====+                   |
    |                  |                   |

Pattern 2: Real-Time Monitoring

Call Audio         Your App           Dashboard
    |                  |                   |
    |====audio====>    |                   |
    |                  | analyze           |
    |                  +------------------>|
    |                  |    sentiment,     |
    |                  |    keywords,      |
    |                  |    quality        |
    |                  |                   |

Pattern 3: Custom IVR

Call Audio         Your App           Logic Engine
    |                  |                   |
    |====audio====>    |                   |
    |                  | detect DTMF/speech|
    |                  +------------------>|
    |                  |                   |
    |                  | next action       |
    |                  |<------------------+
    |                  |                   |
    |    <====prompt===+                   |
    |                  |                   |

Pattern 4: Recording with Processing

Call Audio         Your App           Storage
    |                  |                   |
    |====audio====>    |                   |
    |                  | process           |
    |                  | (filter, enhance) |
    |                  |                   |
    |                  | store             |
    |                  +------------------>|
    |                  |                   |

For working code examples of these patterns, see the Media Stream Tutorial.

Best Practices

1. Audio Timing

  • Send audio in consistent 20ms chunks

  • Maintain proper timing to avoid audio gaps or overlaps

  • Buffer incoming audio to handle network jitter
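
The timing guidance above can be sketched as a deadline-based pacer. This is illustrative; the sleep and clock parameters are injectable only so the scheduling logic can be tested without real delays:

```python
import time

def pace_frames(frames, interval=0.020, sleep=time.sleep, clock=time.monotonic):
    """Yield frames on a fixed 20ms schedule.
    Scheduling against absolute deadlines (rather than sleeping a fixed
    amount after each send) prevents timing drift from accumulating."""
    start = clock()
    for i, frame in enumerate(frames):
        deadline = start + i * interval
        delay = deadline - clock()
        if delay > 0:
            sleep(delay)
        yield frame
```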

2. Connection Management

  • Implement automatic reconnection for dropped connections

  • Handle the onclose event gracefully

  • Close connections when no longer needed

3. Resource Efficiency

  • Process audio asynchronously to avoid blocking

  • Use appropriate buffer sizes (typically 320 bytes for 20ms)

  • Monitor memory usage for long-running streams

4. Error Handling

  • Log connection errors for debugging

  • Implement exponential backoff for reconnection attempts

  • Handle authentication failures gracefully
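
The reconnection guidance above can be sketched as follows (illustrative helpers, not part of any VoIPBIN SDK):

```python
import random

def backoff_delays(base=1.0, cap=30.0, attempts=6):
    """Exponential backoff schedule for reconnection attempts:
    base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(base * (2 ** n), cap) for n in range(attempts)]

def jittered(delay, rng=random.random):
    """Full jitter: wait a random amount up to the scheduled delay,
    so many clients reconnecting at once do not retry in lockstep."""
    return rng() * delay
```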

Troubleshooting

Connection Issues

Symptom             Solution
------------------  ------------------------------------------------------------
Connection refused  Verify call/conference is active and in "progressing" status
401 Unauthorized    Check API token is valid and has permissions
Connection drops    Implement reconnection logic; check network stability

Audio Issues

Symptom             Solution
------------------  -----------------------------------------------------------------------------
No audio received   Verify call is answered and audio is flowing; check encapsulation type
Audio quality poor  Check network latency; verify correct audio format; monitor packet loss
Audio choppy        Implement jitter buffer; send in consistent 20ms chunks; check CPU usage
Can't send audio    Use binary WebSocket frames; verify audio format matches encapsulation type

Tutorial

Before using media streaming, you need:

  • An authentication token. Obtain one via POST /auth/login or use an access key from GET /accesskeys.

  • An active call in progressing status. Obtain the call ID via GET /calls. Or an active conference via GET /conferences.

  • A WebSocket client library (not a standard HTTP client). The media stream endpoint upgrades the HTTP connection to WebSocket.

  • Knowledge of the audio format for your chosen encapsulation: rtp (G.711 ulaw, 12-byte header + 160-byte payload), sln (raw PCM, no headers), or audiosocket (Asterisk AudioSocket, 320-byte chunks).

Note

AI Implementation Hint

The media stream URL (GET /calls/{id}/media_stream) is a WebSocket upgrade endpoint. Do not use curl or standard HTTP libraries to connect. Use a WebSocket client (e.g., Python websocket-client, JavaScript WebSocket, Node.js ws). Audio data must be sent as binary frames, not text frames.

Bi-Directional Media Streaming for Calls

Connect to a call’s media stream via WebSocket to send and receive audio in real-time. This allows you to build custom audio processing applications without SIP signaling.

Establish WebSocket Connection:

GET https://api.voipbin.net/v1.0/calls/<call-id>/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

Example:

GET https://api.voipbin.net/v1.0/calls/652af662-eb45-11ee-b1a5-6fde165f9226/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

This creates a bi-directional WebSocket connection where you can:
- Receive audio from the call (what the other party is saying)
- Send audio to the call (inject audio into the conversation)

Bi-Directional Media Streaming for Conferences

Access a conference’s media stream to monitor or participate in the conference audio.

GET https://api.voipbin.net/v1.0/conferences/<conference-id>/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

Example:

GET https://api.voipbin.net/v1.0/conferences/1ed12456-eb4b-11ee-bba8-1bfb2838807a/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

This allows you to:
- Listen to all conference participants
- Inject audio into the conference
- Build custom conference recording or analysis tools

Encapsulation Types

VoIPBIN supports three encapsulation types for media streaming:

1. RTP (Real-time Transport Protocol)

Standard protocol for audio/video over IP networks.

?encapsulation=rtp

Use cases:
- Standard VoIP integration
- Compatible with most audio processing tools
- Industry-standard protocol

2. SLN (Signed Linear Mono)

Raw audio stream without headers or padding.

?encapsulation=sln

Use cases:
- Minimal overhead needed
- Simple audio processing
- Direct PCM audio access

3. AudioSocket

Asterisk-specific protocol for simple audio streaming.

?encapsulation=audiosocket

Use cases:
- Asterisk integration
- Low-overhead streaming
- Simple audio applications

Codec: All formats carry 8 kHz mono audio — G.711 μ-law for RTP, 16-bit signed linear PCM for SLN, and 16-bit PCM little-endian for AudioSocket.
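
For RTP streams, each payload byte is one μ-law sample. The standard G.711 expansion can be sketched as follows (for illustration only; in production, use an audio library or a precomputed 256-entry lookup table):

```python
def ulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte to a 16-bit signed linear sample,
    using the standard expansion formula."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def decode_ulaw_frame(payload: bytes) -> list:
    """Decode a 160-byte RTP ulaw payload into 160 linear samples."""
    return [ulaw_to_linear(b) for b in payload]
```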

WebSocket Client Examples

Python Example (RTP Streaming):

import websocket

def process_audio(rtp_packet):
    """Process received RTP audio"""
    # Extract payload from RTP packet (the RTP header is typically 12 bytes)
    payload = rtp_packet[12:]

    # Save or process audio
    with open('received_audio.raw', 'ab') as f:
        f.write(payload)

def generate_audio():
    """Generate audio to send"""
    # Simplified example: in production, construct proper RTP packets
    # around each payload using an RTP library.
    with open('audio_to_inject.raw', 'rb') as f:
        return f.read(160)  # 20ms of 8kHz ulaw audio

def on_message(ws, message):
    """Receive audio data from the call"""
    # message contains RTP packets
    print(f"Received {len(message)} bytes of audio")

    # Process audio here:
    # - Save to file
    # - Run speech recognition
    # - Analyze audio
    process_audio(message)

def on_open(ws):
    """Connection established, can start sending audio"""
    print("Media stream connected")

    # Send audio to the call; audio_data should be RTP packets
    audio_data = generate_audio()
    ws.send(audio_data, opcode=websocket.ABNF.OPCODE_BINARY)

def on_error(ws, error):
    print(f"Error: {error}")

def on_close(ws, close_status_code, close_msg):
    print(f"Connection closed: {close_status_code}")

# Connect to media stream
call_id = "652af662-eb45-11ee-b1a5-6fde165f9226"
token = "<YOUR_AUTH_TOKEN>"
ws_url = f"wss://api.voipbin.net/v1.0/calls/{call_id}/media_stream?encapsulation=rtp&token={token}"

ws = websocket.WebSocketApp(
    ws_url,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
)

ws.run_forever()

JavaScript Example (Browser):

const callId = '652af662-eb45-11ee-b1a5-6fde165f9226';
const token = '<YOUR_AUTH_TOKEN>';
const wsUrl = `wss://api.voipbin.net/v1.0/calls/${callId}/media_stream?encapsulation=rtp&token=${token}`;

const ws = new WebSocket(wsUrl);
ws.binaryType = 'arraybuffer';

ws.onopen = function() {
    console.log('Media stream connected');

    // Send audio to the call
    const audioData = generateAudio();
    ws.send(audioData);
};

ws.onmessage = function(event) {
    // Receive audio from the call
    const audioData = event.data;
    console.log(`Received ${audioData.byteLength} bytes`);

    // Process audio
    processAudio(new Uint8Array(audioData));
};

ws.onerror = function(error) {
    console.error('WebSocket error:', error);
};

ws.onclose = function() {
    console.log('Media stream closed');
};

function processAudio(audioBuffer) {
    // Process received audio
    // - Play through Web Audio API
    // - Run speech recognition
    // - Visualize audio
}

function generateAudio() {
    // Generate audio to send
    // Returns ArrayBuffer with RTP packets
    return new ArrayBuffer(172); // RTP packet size
}

Node.js Example (AudioSocket):

const WebSocket = require('ws');
const fs = require('fs');

const callId = '652af662-eb45-11ee-b1a5-6fde165f9226';
const token = '<YOUR_AUTH_TOKEN>';
const wsUrl = `wss://api.voipbin.net/v1.0/calls/${callId}/media_stream?encapsulation=audiosocket&token=${token}`;

const ws = new WebSocket(wsUrl);

ws.on('open', function() {
    console.log('AudioSocket connected');

    // Send audio file
    const audioFile = fs.readFileSync('audio.pcm');

    // Send in 20ms chunks (320 bytes for 16-bit 8kHz mono),
    // paced in real time so audio is not delivered faster than playback
    const chunkSize = 320;
    let offset = 0;
    const timer = setInterval(() => {
        if (offset >= audioFile.length) {
            clearInterval(timer);
            return;
        }
        ws.send(audioFile.slice(offset, offset + chunkSize));
        offset += chunkSize;
    }, 20);
});

ws.on('message', function(data) {
    // Receive audio from call
    console.log(`Received ${data.length} bytes`);

    // Save received audio
    fs.appendFileSync('received_audio.pcm', data);
});

ws.on('error', function(error) {
    console.error('Error:', error);
});

ws.on('close', function() {
    console.log('AudioSocket closed');
});

Uni-Directional Streaming with Flow Action

For sending audio to a call without receiving audio back, use the external_media_start flow action.

Create Call with External Media:

$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "source": {
            "type": "tel",
            "target": "+15551234567"
        },
        "destinations": [
            {
                "type": "tel",
                "target": "+15559876543"
            }
        ],
        "actions": [
            {
                "type": "answer"
            },
            {
                "type": "external_media_start",
                "option": {
                    "url": "wss://your-media-server.com/audio-stream",
                    "encapsulation": "audiosocket"
                }
            }
        ]
    }'

This creates a uni-directional stream where VoIPBIN:

1. Establishes the call
2. Connects to your media server via WebSocket
3. Receives audio from your server
4. Plays that audio to the call participant

Your media server receives:

WebSocket connection from VoIPBIN
→ Send audio chunks (PCM format for AudioSocket)
→ VoIPBIN plays audio to call
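
On the media-server side, the outgoing audio must be delivered in uniform 20ms chunks. A sketch of preparing raw PCM for AudioSocket delivery (pcm_chunks is a hypothetical helper, not a VoIPBIN API):

```python
def pcm_chunks(pcm: bytes, chunk_size: int = 320):
    """Split raw 16-bit/8kHz mono PCM into 20ms (320-byte) chunks,
    zero-padding (silence-filling) the final chunk so every frame
    sent over the WebSocket is the same length."""
    for i in range(0, len(pcm), chunk_size):
        chunk = pcm[i:i + chunk_size]
        if len(chunk) < chunk_size:
            chunk = chunk + b"\x00" * (chunk_size - len(chunk))
        yield chunk
```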

Common Use Cases

1. Real-Time Speech Recognition:

# Python example
def on_message(ws, message):
    # Extract audio from RTP packet
    audio = extract_audio(message)

    # Send to speech recognition API
    text = speech_to_text(audio)
    print(f"Recognized: {text}")

    # Store transcription
    save_transcription(text)

2. Audio Injection / IVR Replacement:

// Node.js example
ws.on('open', function() {
    // Play custom audio prompts
    const prompt1 = fs.readFileSync('welcome.pcm');
    ws.send(prompt1);

    // Wait for DTMF or speech
    // Then play next prompt
});

3. Conference Recording:

# Python example
def on_message(ws, message):
    # Save all conference audio
    with open(f'conference_{conference_id}.raw', 'ab') as f:
        f.write(extract_audio(message))

4. Real-Time Audio Analysis:

def on_message(ws, message):
    audio = extract_audio(message)

    # Detect emotion
    emotion = analyze_emotion(audio)

    # Detect keywords
    if detect_keyword(audio, ['help', 'urgent']):
        alert_supervisor()

    # Calculate audio quality
    quality = measure_quality(audio)

5. Custom Music on Hold:

ws.on('open', function() {
    // Stream custom music or messages in 20ms (320-byte) chunks,
    // looping while the call is on hold
    const music = fs.readFileSync('hold_music.pcm');
    const chunkSize = 320;
    let offset = 0;

    setInterval(() => {
        ws.send(music.slice(offset, offset + chunkSize));
        offset = (offset + chunkSize) % music.length;
    }, 20);
});

6. AI-Powered Voice Assistant:

ws.on('message', async function(data) {
    // Receive customer audio
    const audio = extractAudio(data);

    // Send to AI for processing
    const response = await aiProcess(audio);

    // Convert AI response to audio
    const responseAudio = textToSpeech(response);

    // Send back to call
    ws.send(responseAudio);
});

Audio Format Details

RTP Format:
- Codec: G.711 μ-law (ulaw)
- Sample rate: 8 kHz
- Sample size: 8-bit μ-law (decodes to 16-bit linear)
- Channels: Mono
- Packet size: 160 bytes payload (20ms audio)

SLN Format:
- Raw PCM audio
- No headers or padding
- Sample rate: 8 kHz
- Bits: 16-bit signed
- Channels: Mono

AudioSocket Format:
- PCM little-endian
- Sample rate: 8 kHz
- Bits: 16-bit
- Channels: Mono
- Chunk size: 320 bytes (20ms of audio)

Best Practices

1. Buffer Management:
- Maintain audio buffers to handle jitter
- Send audio in consistent 20ms chunks
- Don't send too fast or too slow

2. Error Handling:
- Implement reconnection logic
- Handle WebSocket disconnections gracefully
- Log errors for debugging

3. Audio Quality:
- Use proper RTP packet construction
- Maintain correct timing for audio chunks
- Monitor for packet loss

4. Resource Management:
- Close WebSocket when done
- Don't leave connections open indefinitely
- Clean up audio buffers and files

5. Testing:
- Test with various network conditions
- Verify audio quality with real calls
- Monitor latency and packet loss

6. Security:
- Use WSS (secure WebSocket) in production
- Validate authentication tokens
- Encrypt sensitive audio data

Connection Lifecycle

1. Establish Connection:

GET /v1.0/calls/<call-id>/media_stream?encapsulation=rtp&token=<token>

2. WebSocket Upgrade:

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade

3. Bi-Directional Communication:

Client ←→ VoIPBIN
- Send audio: Binary frames with RTP packets
- Receive audio: Binary frames with RTP packets

4. Close Connection:

ws.close()

Troubleshooting

Common Issues:

No audio received:
- Check WebSocket connection is established
- Verify call is active and answered
- Ensure correct encapsulation type

Audio quality poor:
- Check network latency
- Verify audio format matches requirements
- Monitor packet loss

Connection drops:
- Implement reconnection logic
- Check firewall rules for WebSocket
- Verify authentication token is valid

Can't send audio:
- Ensure binary frames are used (not text)
- Verify audio format is correct
- Check audio chunk size (typically 20ms)

For more information about media stream configuration, see Media Stream Overview.