Mediastream
A media stream provides real-time audio streaming over WebSocket or AudioSocket, enabling live audio processing such as speech-to-text and AI-driven voice interactions.
API Reference: Mediastream endpoints
Overview
Note
AI Context
Complexity: High
Cost: Free (no additional charge beyond the underlying call/conference costs)
Async: Yes.
GET /calls/{id}/media_stream upgrades to a WebSocket connection for real-time bi-directional audio streaming. The connection remains open for the duration of the call. Uni-directional streaming via the external_media_start flow action is initiated by VoIPBIN connecting to your server.
VoIPBIN’s Media Stream API provides direct access to call and conference audio via WebSocket connections. Instead of relying on SIP signaling for media control, you can stream audio bidirectionally with your applications for real-time processing, AI integration, custom IVR, and more.
With the Media Stream API you can:
Stream live call audio to your application in real-time
Inject audio into calls and conferences
Build AI voice assistants with direct audio access
Create custom speech recognition pipelines
Implement real-time audio analysis and monitoring
How Media Streaming Works
When you connect to a media stream, VoIPBIN establishes a WebSocket connection that carries audio data directly between the call/conference and your application.
Media Stream Architecture
Traditional VoIP:                    VoIPBIN Media Stream:

+-------+   SIP   +-------+          +-------+  WebSocket  +----------+
| Phone |<------->|VoIPBIN|          | Call  |<===========>| Your App |
+-------+         +-------+          +-------+             +----------+
    (signaling only)                     (direct audio access)
Key Differences from Traditional VoIP
| Aspect | Traditional SIP | Media Streaming |
|---|---|---|
| Audio Access | Via RTP to SIP endpoints | Direct WebSocket to your app |
| Control | SIP signaling | API and WebSocket |
| Integration | Requires SIP stack | Simple WebSocket client |
| Use Cases | Phone-to-phone calls | AI, custom IVR, analysis |
System Components
+------------+            +-----------+                 +-----------+
|    Call    |<--- RTP -->|           |                 |   Your    |
+------------+            |  VoIPBIN  |                 |    App    |
                          |   Media   |<== WebSocket ==>| - AI/ML   |
+------------+            |  Bridge   |                 | - STT/TTS |
| Conference |<--- RTP -->|           |                 | - IVR     |
+------------+            +-----------+                 +-----------+
The Media Bridge handles protocol conversion between RTP (VoIP standard) and WebSocket (web standard), enabling any WebSocket-capable application to process call audio.
Streaming Modes
VoIPBIN supports two streaming modes based on your application’s needs.
Note
AI Implementation Hint
Media streaming requires the call or conference to be in progressing status (answered). The GET /calls/{id}/media_stream endpoint is a WebSocket upgrade, not a regular HTTP GET. Use a WebSocket client library, not a standard HTTP client. Audio must be sent as binary WebSocket frames in consistent 20ms chunks matching the selected encapsulation format.
Bi-Directional Streaming
Your application both receives and sends audio through the same WebSocket connection.
+----------+ +----------+
| |======= audio IN ============>| |
| VoIPBIN | | Your App |
| |<====== audio OUT ============| |
+----------+ +----------+
Initiate via API:
GET https://api.voipbin.net/v1.0/calls/<call-id>/media_stream?encapsulation=rtp&token=<token>
Use cases: - AI voice assistants (listen and respond) - Interactive IVR systems - Real-time audio processing with feedback - Call bridging to custom systems
Uni-Directional Streaming
VoIPBIN receives audio from your server and plays it to the call. Your app sends audio but doesn’t receive call audio.
+----------+ +----------+
| | | |
| VoIPBIN |<====== audio only ===========| Your App |
| | | |
+----------+ +----------+
Initiate via Flow Action:
{
    "type": "external_media_start",
    "option": {
        "url": "wss://your-server.com/audio",
        "encapsulation": "audiosocket"
    }
}
See the external_media_start flow action documentation for details.
Use cases: - Custom music on hold - Pre-recorded message playback - Text-to-speech from external service - Audio announcements
Mode Comparison

| Aspect | Bi-Directional | Uni-Directional |
|---|---|---|
| Initiated by | Your application (API request) | VoIPBIN (external_media_start flow action) |
| Receives call audio | Yes | No |
| Sends audio to call | Yes | Yes |
| Connection direction | You connect to VoIPBIN | VoIPBIN connects to your server |
Encapsulation Types
VoIPBIN supports three encapsulation types for different integration scenarios.
Decision Guide
What's your use case?
|
+----------------+----------------+
| |
Standard VoIP Simple audio
integration? processing?
| |
+-----+-----+ +-----+-----+
| | | |
Yes No Yes No
| | | |
v | v |
[RTP] | [SLN] |
| |
Asterisk |
integration? |
| |
+-----+-----+ |
| | |
Yes No |
| | |
v +---------------------------+
[AudioSocket] |
v
[RTP default]
RTP (Real-time Transport Protocol)
The standard protocol for audio/video over IP networks.
+------------------+------------------+
| RTP Header | Audio Payload |
| (12 bytes) | (160 bytes) |
+------------------+------------------+
| Specification | Value |
|---|---|
| Protocol | RTP over WebSocket |
| Codec | G.711 μ-law (ulaw) |
| Sample Rate | 8 kHz |
| Bit Depth | 16-bit |
| Channels | Mono |
| Packet Size | 172 bytes (12 header + 160 payload = 20ms) |
Best for: Standard VoIP tools, industry compatibility, existing RTP processing pipelines.
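To make the framing concrete, here is a minimal sketch that splits one 172-byte frame into its RTP header fields and ulaw payload (header layout per RFC 3550; the function name is ours, not part of the VoIPBIN API):

```python
import struct

def parse_rtp(frame: bytes) -> dict:
    """Split one RTP frame into header fields and the ulaw payload."""
    if len(frame) < 12:
        raise ValueError("frame shorter than the 12-byte RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", frame[:12])
    return {
        "version": b0 >> 6,          # RTP version, normally 2
        "payload_type": b1 & 0x7F,   # 0 = PCMU (G.711 ulaw)
        "sequence": seq,             # increments by 1 per packet
        "timestamp": timestamp,      # increments by 160 per 20ms at 8 kHz
        "ssrc": ssrc,                # stream identifier
        "payload": frame[12:],       # 160 bytes of ulaw audio
    }
```

Gaps in the `sequence` values of successive frames are a simple way to detect packet loss.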
SLN (Signed Linear)
Raw audio without protocol overhead.
+----------------------------------+
| Raw PCM Audio Data |
| (no headers, no padding) |
+----------------------------------+
| Specification | Value |
|---|---|
| Format | Raw PCM, signed linear |
| Sample Rate | 8 kHz |
| Bit Depth | 16-bit signed |
| Channels | Mono |
| Byte Order | Native |
Best for: Minimal overhead, simple audio processing, direct PCM access without parsing.
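Because SLN frames are headerless PCM, processing is direct. As an illustration, a small sketch that computes the RMS level of a chunk, e.g. for silence detection (function name ours; the table above lists byte order as native, so pass the byte order explicitly if your platform differs):

```python
import math
import struct

def rms_level(chunk: bytes, byteorder: str = "<") -> float:
    """RMS amplitude of a 16-bit signed PCM chunk (0.0 = silence)."""
    n = len(chunk) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"{byteorder}{n}h", chunk[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)
```

A running RMS per 20ms chunk is enough for a crude voice-activity gate before sending audio on to more expensive STT processing.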
AudioSocket
Asterisk-specific protocol designed for simple audio streaming.
+------------------+------------------+
| AudioSocket Hdr | PCM Audio |
+------------------+------------------+
| Specification | Value |
|---|---|
| Protocol | Asterisk AudioSocket |
| Format | PCM little-endian |
| Sample Rate | 8 kHz |
| Bit Depth | 16-bit |
| Channels | Mono |
| Chunk Size | 320 bytes (20ms of audio) |
Best for: Asterisk integration, simple streaming with minimal overhead.
See Asterisk AudioSocket Documentation for protocol details.
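Per Asterisk's spec, each AudioSocket audio frame carries a small header ahead of the PCM chunk: one kind byte (0x10 for audio) and a 16-bit big-endian payload length. A hedged sketch of packing and unpacking such frames (helper names ours; consult the spec for the other frame kinds, e.g. UUID and hangup):

```python
import struct

KIND_AUDIO = 0x10  # "audio" frame kind in Asterisk's AudioSocket protocol

def pack_audiosocket(pcm: bytes) -> bytes:
    """Prefix a PCM chunk with the AudioSocket kind/length header."""
    return struct.pack("!BH", KIND_AUDIO, len(pcm)) + pcm

def unpack_audiosocket(frame: bytes) -> tuple:
    """Split one AudioSocket frame into (kind, payload)."""
    kind, length = struct.unpack("!BH", frame[:3])
    return kind, frame[3 : 3 + length]
```

A 20ms chunk of 16-bit 8 kHz mono audio is 320 bytes, so a full audio frame is 323 bytes on the wire.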
Encapsulation Comparison
| Aspect | RTP | SLN | AudioSocket |
|---|---|---|---|
| Headers | 12 bytes | None | Protocol header |
| Compatibility | Industry standard | Simple | Asterisk |
| Overhead | Low | Minimal | Low |
| Parsing Required | Yes (RTP) | No | Yes (AudioSocket) |
Supported Resources
Media streaming is available for both calls and conferences.
Call Media Streaming
Stream audio from a single call.
GET https://api.voipbin.net/v1.0/calls/<call-id>/media_stream?encapsulation=<type>&token=<token>
Audio contains: Both parties’ audio mixed together.
Conference Media Streaming
Stream audio from a conference with multiple participants.
GET https://api.voipbin.net/v1.0/conferences/<conference-id>/media_stream?encapsulation=<type>&token=<token>
Audio contains: All participants’ audio mixed together.
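"Mixed together" means, conceptually, a sample-wise sum of the sources. If you ever need to combine PCM streams yourself (for example, overlaying injected audio onto received audio for a local recording), clamp the sum to the 16-bit range; a minimal sketch (function name ours; little-endian 16-bit assumed):

```python
import struct

def mix_pcm16(a: bytes, b: bytes) -> bytes:
    """Mix two 16-bit little-endian PCM buffers sample-wise, with clipping."""
    n = min(len(a), len(b)) // 2
    sa = struct.unpack(f"<{n}h", a[: n * 2])
    sb = struct.unpack(f"<{n}h", b[: n * 2])
    # Clamp each summed sample to the signed 16-bit range
    mixed = (max(-32768, min(32767, x + y)) for x, y in zip(sa, sb))
    return struct.pack(f"<{n}h", *mixed)
```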
Resource Comparison
| Aspect | Call | Conference |
|---|---|---|
| Audio Source | Two-party conversation | Multi-party conversation |
| Audio Mix | Caller + callee | All participants |
| Audio Injection | Heard by both parties | Heard by all participants |
| Use Case | 1:1 AI assistant | Conference monitoring/recording |
Connection Lifecycle
Understanding the WebSocket connection lifecycle helps build robust streaming applications.
Connection Flow
Your App VoIPBIN
| |
| GET /calls/{id}/media_stream |
+------------------------------>|
| |
| 101 Switching Protocols |
|<------------------------------+
| |
|<======= audio frames ========>| (bi-directional)
|<======= audio frames ========>|
|<======= audio frames ========>|
| |
| close() or call ends |
+------------------------------>|
| |
Connection States
connecting ---> open ---> streaming ---> closing ---> closed
                              |
                              v
                         (call ends)
                              |
                              v
                           closed
State Descriptions
| State | What's happening |
|---|---|
| connecting | WebSocket handshake in progress |
| open | Connection established, ready for audio |
| streaming | Audio frames being sent/received |
| closing | Graceful shutdown initiated |
| closed | Connection terminated |
Connection Termination
The WebSocket connection closes when:
Your application closes the connection
The call or conference ends
Network failure occurs
Authentication token expires
Integration Patterns
Common patterns for integrating media streaming with your applications.
Pattern 1: AI Voice Assistant
Call Audio Your App AI Service
| | |
|====audio====> | |
| | STT |
| +------------------>|
| | |
| | AI response |
| |<------------------+
| | |
| | TTS |
| <====audio====+ |
| | |
Pattern 2: Real-Time Monitoring
Call Audio Your App Dashboard
| | |
|====audio====> | |
| | analyze |
| +------------------>|
| | sentiment, |
| | keywords, |
| | quality |
| | |
Pattern 3: Custom IVR
Call Audio Your App Logic Engine
| | |
|====audio====> | |
| | detect DTMF/speech|
| +------------------>|
| | |
| | next action |
| |<------------------+
| | |
| <====prompt===+ |
| | |
Pattern 4: Recording with Processing
Call Audio Your App Storage
| | |
|====audio====> | |
| | process |
| | (filter, enhance) |
| | |
| | store |
| +------------------>|
| | |
For working code examples of these patterns, see the Media Stream Tutorial.
Best Practices
1. Audio Timing
Send audio in consistent 20ms chunks
Maintain proper timing to avoid audio gaps or overlaps
Buffer incoming audio to handle network jitter
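The 20ms rule above can be wrapped in a small helper that slices raw PCM into fixed 320-byte frames (16-bit, 8 kHz, mono) and zero-pads the tail; pacing the sends at 20ms intervals is left to your event loop or a timer (helper name ours):

```python
def chunk_pcm(audio: bytes, frame_bytes: int = 320):
    """Yield fixed-size 20ms frames, zero-padding the final partial frame."""
    for i in range(0, len(audio), frame_bytes):
        frame = audio[i : i + frame_bytes]
        if len(frame) < frame_bytes:
            # Pad the tail so every frame is exactly one 20ms chunk
            frame = frame + b"\x00" * (frame_bytes - len(frame))
        yield frame
```

Sending these frames in a tight loop will overrun the call's real-time clock; sleep roughly 20ms between sends (or use a scheduler) so audio arrives at playback rate.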
2. Connection Management
Implement automatic reconnection for dropped connections
Handle the onclose event gracefully
Close connections when no longer needed
3. Resource Efficiency
Process audio asynchronously to avoid blocking
Use appropriate buffer sizes (typically 320 bytes for 20ms)
Monitor memory usage for long-running streams
4. Error Handling
Log connection errors for debugging
Implement exponential backoff for reconnection attempts
Handle authentication failures gracefully
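The exponential-backoff advice above can be sketched as a delay generator (base, cap, and jitter values here are illustrative choices, not VoIPBIN requirements):

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 30.0, factor: float = 2.0):
    """Yield growing reconnect delays with jitter, capped at `cap` seconds."""
    delay = base
    while True:
        # Jitter spreads reconnect attempts so many clients don't retry in sync
        yield delay * random.uniform(0.8, 1.2)
        delay = min(delay * factor, cap)
```

Typical use: pull the next delay from the generator before each reconnect attempt, and reset the generator after a connection stays healthy for a while.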
Troubleshooting
Connection Issues
| Symptom | Solution |
|---|---|
| Connection refused | Verify call/conference is active and in "progressing" status |
| 401 Unauthorized | Check API token is valid and has permissions |
| Connection drops | Implement reconnection logic; check network stability |
Audio Issues
| Symptom | Solution |
|---|---|
| No audio received | Verify call is answered and audio is flowing; check encapsulation type |
| Audio quality poor | Check network latency; verify correct audio format; monitor packet loss |
| Audio choppy | Implement jitter buffer; send in consistent 20ms chunks; check CPU usage |
| Can't send audio | Use binary WebSocket frames; verify audio format matches encapsulation type |
Tutorial
Before using media streaming, you need:
An authentication token. Obtain one via POST /auth/login or use an access key from GET /accesskeys.
An active call in progressing status. Obtain the call ID via GET /calls. Or an active conference via GET /conferences.
A WebSocket client library (not a standard HTTP client). The media stream endpoint upgrades the HTTP connection to WebSocket.
Knowledge of the audio format for your chosen encapsulation: rtp (G.711 ulaw, 12-byte header + 160-byte payload), sln (raw PCM, no headers), or audiosocket (Asterisk AudioSocket, 320-byte chunks).
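With those prerequisites in hand, the stream URL can be assembled like this (a small sketch; the helper name is ours):

```python
from urllib.parse import urlencode

def media_stream_url(call_id: str, token: str, encapsulation: str = "rtp") -> str:
    """Build the WebSocket URL for a call's media stream."""
    query = urlencode({"encapsulation": encapsulation, "token": token})
    return f"wss://api.voipbin.net/v1.0/calls/{call_id}/media_stream?{query}"
```

Note the wss:// scheme: the HTTP GET shown in the examples is upgraded to a WebSocket, so the client library connects with a WebSocket URL.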
Note
AI Implementation Hint
The media stream URL (GET /calls/{id}/media_stream) is a WebSocket upgrade endpoint. Do not use curl or standard HTTP libraries to connect. Use a WebSocket client (e.g., Python websocket-client, JavaScript WebSocket, Node.js ws). Audio data must be sent as binary frames, not text frames.
Bi-Directional Media Streaming for Calls
Connect to a call’s media stream via WebSocket to send and receive audio in real-time. This allows you to build custom audio processing applications without SIP signaling.
Establish WebSocket Connection:
GET https://api.voipbin.net/v1.0/calls/<call-id>/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>
Example:
GET https://api.voipbin.net/v1.0/calls/652af662-eb45-11ee-b1a5-6fde165f9226/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>
This creates a bi-directional WebSocket connection where you can: - Receive audio from the call (what the other party is saying) - Send audio to the call (inject audio into the conversation)
Bi-Directional Media Streaming for Conferences
Access a conference’s media stream to monitor or participate in the conference audio.
GET https://api.voipbin.net/v1.0/conferences/<conference-id>/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>
Example:
GET https://api.voipbin.net/v1.0/conferences/1ed12456-eb4b-11ee-bba8-1bfb2838807a/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>
This allows you to: - Listen to all conference participants - Inject audio into the conference - Build custom conference recording or analysis tools
Encapsulation Types
VoIPBIN supports three encapsulation types for media streaming:
1. RTP (Real-time Transport Protocol)
Standard protocol for audio/video over IP networks.
?encapsulation=rtp
Use cases: - Standard VoIP integration - Compatible with most audio processing tools - Industry-standard protocol
2. SLN (Signed Linear Mono)
Raw audio stream without headers or padding.
?encapsulation=sln
Use cases: - Minimal overhead needed - Simple audio processing - Direct PCM audio access
3. AudioSocket
Asterisk-specific protocol for simple audio streaming.
?encapsulation=audiosocket
Use cases: - Asterisk integration - Low-overhead streaming - Simple audio applications
Codec: All formats use 16-bit, 8 kHz, mono audio (G.711 ulaw payload for RTP, raw signed linear PCM for SLN, PCM little-endian for AudioSocket)
WebSocket Client Examples
Python Example (RTP Streaming):
import websocket

def process_audio(rtp_packet):
    """Process received RTP audio"""
    # Extract payload from the RTP packet; the RTP header is 12 bytes
    payload = rtp_packet[12:]
    # Save or process audio
    with open('received_audio.raw', 'ab') as f:
        f.write(payload)

def generate_audio():
    """Read one 20ms chunk of audio to send"""
    # Simplified example: in production, construct proper RTP packets
    # (12-byte header + payload) with an RTP library
    with open('audio_to_inject.raw', 'rb') as f:
        return f.read(160)  # 20ms of 8kHz audio

def on_message(ws, message):
    """Receive audio data from the call"""
    # message contains RTP packets
    print(f"Received {len(message)} bytes of audio")
    # Process audio here:
    # - Save to file
    # - Run speech recognition
    # - Analyze audio
    process_audio(message)

def on_open(ws):
    """Connection established, can start sending audio"""
    print("Media stream connected")
    # Send audio to the call; audio_data should be RTP packets
    audio_data = generate_audio()
    ws.send(audio_data, opcode=websocket.ABNF.OPCODE_BINARY)

def on_error(ws, error):
    print(f"Error: {error}")

def on_close(ws, close_status_code, close_msg):
    print(f"Connection closed: {close_status_code}")

# Connect to media stream (helpers are defined above so the callbacks
# can resolve them once run_forever() starts dispatching events)
call_id = "652af662-eb45-11ee-b1a5-6fde165f9226"
token = "<YOUR_AUTH_TOKEN>"
ws_url = f"wss://api.voipbin.net/v1.0/calls/{call_id}/media_stream?encapsulation=rtp&token={token}"

ws = websocket.WebSocketApp(
    ws_url,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
)
ws.run_forever()
JavaScript Example (Browser):
const callId = '652af662-eb45-11ee-b1a5-6fde165f9226';
const token = '<YOUR_AUTH_TOKEN>';
const wsUrl = `wss://api.voipbin.net/v1.0/calls/${callId}/media_stream?encapsulation=rtp&token=${token}`;

const ws = new WebSocket(wsUrl);
ws.binaryType = 'arraybuffer';

ws.onopen = function() {
    console.log('Media stream connected');
    // Send audio to the call
    const audioData = generateAudio();
    ws.send(audioData);
};

ws.onmessage = function(event) {
    // Receive audio from the call
    const audioData = event.data;
    console.log(`Received ${audioData.byteLength} bytes`);
    // Process audio
    processAudio(new Uint8Array(audioData));
};

ws.onerror = function(error) {
    console.error('WebSocket error:', error);
};

ws.onclose = function() {
    console.log('Media stream closed');
};

function processAudio(audioBuffer) {
    // Process received audio:
    // - Play through Web Audio API
    // - Run speech recognition
    // - Visualize audio
}

function generateAudio() {
    // Generate audio to send
    // Returns an ArrayBuffer containing one RTP packet
    return new ArrayBuffer(172); // 12-byte header + 160-byte payload
}
Node.js Example (AudioSocket):
const WebSocket = require('ws');
const fs = require('fs');

const callId = '652af662-eb45-11ee-b1a5-6fde165f9226';
const token = '<YOUR_AUTH_TOKEN>';
const wsUrl = `wss://api.voipbin.net/v1.0/calls/${callId}/media_stream?encapsulation=audiosocket&token=${token}`;

const ws = new WebSocket(wsUrl);

ws.on('open', function() {
    console.log('AudioSocket connected');
    // Send audio file in chunks (20ms = 320 bytes for 16-bit 8kHz mono).
    // In production, pace these sends at 20ms intervals instead of
    // bursting the whole file at once.
    const audioFile = fs.readFileSync('audio.pcm');
    const chunkSize = 320;
    for (let i = 0; i < audioFile.length; i += chunkSize) {
        const chunk = audioFile.slice(i, i + chunkSize);
        ws.send(chunk);
    }
});

ws.on('message', function(data) {
    // Receive audio from call
    console.log(`Received ${data.length} bytes`);
    // Save received audio
    fs.appendFileSync('received_audio.pcm', data);
});

ws.on('error', function(error) {
    console.error('Error:', error);
});

ws.on('close', function() {
    console.log('AudioSocket closed');
});
Uni-Directional Streaming with Flow Action
For sending audio to a call without receiving audio back, use the external_media_start flow action.
Create Call with External Media:
$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "source": {
        "type": "tel",
        "target": "+15551234567"
    },
    "destinations": [
        {
            "type": "tel",
            "target": "+15559876543"
        }
    ],
    "actions": [
        {
            "type": "answer"
        },
        {
            "type": "external_media_start",
            "option": {
                "url": "wss://your-media-server.com/audio-stream",
                "encapsulation": "audiosocket"
            }
        }
    ]
}'
This creates a uni-directional stream where VoIPBIN: 1. Establishes the call 2. Connects to your media server via WebSocket 3. Receives audio from your server 4. Plays that audio to the call participant
Your media server receives:
WebSocket connection from VoIPBIN
→ Send audio chunks (PCM format for AudioSocket)
→ VoIPBIN plays audio to call
Common Use Cases
1. Real-Time Speech Recognition:
# Python example
def on_message(ws, message):
    # Extract audio from RTP packet
    audio = extract_audio(message)
    # Send to speech recognition API
    text = speech_to_text(audio)
    print(f"Recognized: {text}")
    # Store transcription
    save_transcription(text)
2. Audio Injection / IVR Replacement:
// Node.js example
ws.on('open', function() {
    // Play custom audio prompts
    const prompt1 = fs.readFileSync('welcome.pcm');
    ws.send(prompt1);
    // Wait for DTMF or speech
    // Then play next prompt
});
3. Conference Recording:
# Python example
def on_message(ws, message):
    # Save all conference audio
    with open(f'conference_{conference_id}.raw', 'ab') as f:
        f.write(extract_audio(message))
4. Real-Time Audio Analysis:
def on_message(ws, message):
    audio = extract_audio(message)
    # Detect emotion
    emotion = analyze_emotion(audio)
    # Detect keywords
    if detect_keyword(audio, ['help', 'urgent']):
        alert_supervisor()
    # Calculate audio quality
    quality = measure_quality(audio)
5. Custom Music on Hold:
ws.on('open', function() {
    // Play custom music or messages
    const music = fs.readFileSync('hold_music.pcm');
    // Loop music while the call is on hold. In production, pace sends
    // in 20ms chunks matching the audio's real-time duration instead
    // of re-sending the whole file on a fixed timer.
    setInterval(() => {
        ws.send(music);
    }, 1000);
});
6. AI-Powered Voice Assistant:
ws.on('message', async function(data) {
    // Receive customer audio
    const audio = extractAudio(data);
    // Send to AI for processing (the callback is async so we can await)
    const response = await aiProcess(audio);
    // Convert AI response to audio
    const responseAudio = textToSpeech(response);
    // Send back to call
    ws.send(responseAudio);
});
Audio Format Details
RTP Format: - Codec: ulaw (G.711 μ-law) - Sample rate: 8 kHz - Bits: 16-bit - Channels: Mono - Packet size: 160 bytes payload (20ms audio)
SLN Format: - Raw PCM audio - No headers or padding - Sample rate: 8 kHz - Bits: 16-bit signed - Channels: Mono
AudioSocket Format: - PCM little-endian - Sample rate: 8 kHz - Bits: 16-bit - Channels: Mono - Chunk size: 320 bytes (20ms of audio)
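If you need the RTP ulaw payload as linear PCM without pulling in a codec library, G.711 μ-law expansion is only a few lines. This sketch follows the standard G.711 expansion formula (function names ours; verify the output against a known decoder before relying on it):

```python
import struct

def ulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a 16-bit signed linear sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored complemented
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def decode_ulaw(payload: bytes) -> bytes:
    """Decode a ulaw payload (e.g. 160 bytes) to little-endian 16-bit PCM."""
    return struct.pack(f"<{len(payload)}h", *(ulaw_to_linear(b) for b in payload))
```

A 160-byte ulaw payload (one 20ms RTP packet) decodes to 320 bytes of PCM, matching the SLN/AudioSocket chunk size above.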
Best Practices
1. Buffer Management: - Maintain audio buffers to handle jitter - Send audio in consistent 20ms chunks - Don’t send too fast or too slow
2. Error Handling: - Implement reconnection logic - Handle WebSocket disconnections gracefully - Log errors for debugging
3. Audio Quality: - Use proper RTP packet construction - Maintain correct timing for audio chunks - Monitor for packet loss
4. Resource Management: - Close WebSocket when done - Don’t leave connections open indefinitely - Clean up audio buffers and files
5. Testing: - Test with various network conditions - Verify audio quality with real calls - Monitor latency and packet loss
6. Security: - Use WSS (secure WebSocket) in production - Validate authentication tokens - Encrypt sensitive audio data
Connection Lifecycle
1. Establish Connection:
GET /v1.0/calls/<call-id>/media_stream?encapsulation=rtp&token=<token>
2. WebSocket Upgrade:
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
3. Bi-Directional Communication:
Client ←→ VoIPBIN
- Send audio: Binary frames with RTP packets
- Receive audio: Binary frames with RTP packets
4. Close Connection:
ws.close()
Troubleshooting
Common Issues:
No audio received: - Check WebSocket connection is established - Verify call is active and answered - Ensure correct encapsulation type
Audio quality poor: - Check network latency - Verify audio format matches requirements - Monitor packet loss
Connection drops: - Implement reconnection logic - Check firewall rules for WebSocket - Verify authentication token is valid
Can’t send audio: - Ensure binary frames are used (not text) - Verify audio format is correct - Check audio chunk size (typically 20ms)
For more information about media stream configuration, see Media Stream Overview.