Overview
VoIPBIN’s Media Stream API provides direct access to call and conference audio via WebSocket connections. Instead of relying on SIP signaling for media control, you can stream audio bidirectionally with your applications for real-time processing, AI integration, custom IVR, and more.
With the Media Stream API you can:
- Stream live call audio to your application in real time
- Inject audio into calls and conferences
- Build AI voice assistants with direct audio access
- Create custom speech recognition pipelines
- Implement real-time audio analysis and monitoring
How Media Streaming Works
When you connect to a media stream, VoIPBIN establishes a WebSocket connection that carries audio data directly between the call/conference and your application.
Media Stream Architecture
Traditional VoIP:                   VoIPBIN Media Stream:
+-------+   SIP   +-------+         +-------+  WebSocket   +----------+
| Phone |<------->|VoIPBIN|         | Call  |<============>| Your App |
+-------+         +-------+         +-------+              +----------+
     (signaling only)                      (direct audio access)
Key Differences from Traditional VoIP
| Aspect | Traditional SIP | Media Streaming |
|---|---|---|
| Audio Access | Via RTP to SIP endpoints | Direct WebSocket to your app |
| Control | SIP signaling | API and WebSocket |
| Integration | Requires SIP stack | Simple WebSocket client |
| Use Cases | Phone-to-phone calls | AI, custom IVR, analysis |
System Components
+------------+                                      +-----------+
|    Call    |<--- RTP --->+                        |           |
+------------+             |                        |   Your    |
                      +----+-----+                  |   App     |
                      | VoIPBIN  |<== WebSocket ===>|           |
                      |  Media   |                  | - AI/ML   |
                      |  Bridge  |                  | - STT/TTS |
+------------+        +----+-----+                  | - IVR     |
| Conference |<--- RTP --->+                        |           |
+------------+                                      +-----------+
The Media Bridge handles protocol conversion between RTP (VoIP standard) and WebSocket (web standard), enabling any WebSocket-capable application to process call audio.
Streaming Modes
VoIPBIN supports two streaming modes based on your application’s needs.
Bi-Directional Streaming
Your application both receives and sends audio through the same WebSocket connection.
+----------+                              +----------+
|          |======= audio IN ============>|          |
| VoIPBIN  |                              | Your App |
|          |<====== audio OUT ============|          |
+----------+                              +----------+
Initiate via API:
GET https://api.voipbin.net/v1.0/calls/<call-id>/media_stream?encapsulation=rtp&token=<token>
Use cases:
- AI voice assistants (listen and respond)
- Interactive IVR systems
- Real-time audio processing with feedback
- Call bridging to custom systems
Uni-Directional Streaming
VoIPBIN receives audio from your server and plays it to the call. Your app sends audio but doesn’t receive call audio.
+----------+                              +----------+
|          |                              |          |
| VoIPBIN  |<====== audio only ===========| Your App |
|          |                              |          |
+----------+                              +----------+
Initiate via Flow Action:
{
"type": "external_media_start",
"option": {
"url": "wss://your-server.com/audio",
"encapsulation": "audiosocket"
}
}
See the external_media_start flow action documentation for details.
Use cases:
- Custom music on hold
- Pre-recorded message playback
- Text-to-speech from external service
- Audio announcements
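The flow action shown above can also be generated from code. The following is a minimal sketch, not an official client: the helper name `external_media_action` and the set of accepted encapsulation strings are assumptions for illustration (only `rtp` and `audiosocket` appear literally in this document); the `type`, `url`, and `encapsulation` fields are taken from the example.

```python
import json

# Encapsulation names as used in this document. The exact strings the
# platform accepts are an assumption; "sln" in particular is a guess.
ENCAPSULATIONS = {"rtp", "sln", "audiosocket"}


def external_media_action(url: str, encapsulation: str = "audiosocket") -> dict:
    """Build an external_media_start flow action (hypothetical helper)."""
    if not url.startswith(("ws://", "wss://")):
        raise ValueError("url must be a WebSocket URL (ws:// or wss://)")
    if encapsulation not in ENCAPSULATIONS:
        raise ValueError(f"unknown encapsulation: {encapsulation}")
    return {
        "type": "external_media_start",
        "option": {"url": url, "encapsulation": encapsulation},
    }


action = external_media_action("wss://your-server.com/audio")
print(json.dumps(action, indent=2))
```

Validating the URL scheme and encapsulation before submitting the flow tends to surface typos earlier than a failed stream would.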
Mode Comparison
| Aspect | Bi-Directional | Uni-Directional |
|---|---|---|
| Audio Direction | Call to your app and your app to call | Your app to call only |
| Initiated Via | API request (media_stream endpoint) | Flow action (external_media_start) |
| Typical Use | AI voice assistants, interactive IVR | Music on hold, announcements, TTS playback |
Encapsulation Types
VoIPBIN supports three encapsulation types for different integration scenarios.
Decision Guide
                 What's your use case?
                           |
          +----------------+----------------+
          |                                 |
    Standard VoIP                     Simple audio
    integration?                       processing?
          |                                 |
    +-----+-----+                     +-----+-----+
    |           |                     |           |
   Yes         No                    Yes         No
    |           |                     |           |
    v           |                     v           |
  [RTP]         |                   [SLN]         |
                |                                 |
            Asterisk                              |
          integration?                            |
                |                                 |
          +-----+-----+                           |
          |           |                           |
         Yes         No                           |
          |           |                           |
          v           +---------------------------+
    [AudioSocket]                   |
                                    v
                              [RTP default]
RTP (Real-time Transport Protocol)
The standard protocol for audio/video over IP networks.
+------------------+------------------+
| RTP Header | Audio Payload |
| (12 bytes) | (160 bytes) |
+------------------+------------------+
| Specification | Value |
|---|---|
| Protocol | RTP over WebSocket |
| Codec | G.711 μ-law (ulaw) |
| Sample Rate | 8 kHz |
| Bit Depth | 8 bits per sample on the wire (μ-law companded; expands to 16-bit linear PCM) |
| Channels | Mono |
| Packet Size | 172 bytes (12-byte header + 160-byte payload, 20 ms of audio) |
Best for: Standard VoIP tools, industry compatibility, existing RTP processing pipelines.
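To make the packet layout concrete, here is a minimal sketch that splits a packet into the fixed 12-byte RTP header and its payload, then expands the μ-law bytes to linear samples. It assumes packets carry no CSRC list or header extension (consistent with the 12-byte header described above); the function names are illustrative, and the synthetic packet stands in for a real WebSocket message.

```python
import struct


def parse_rtp(packet: bytes) -> dict:
    """Split an RTP packet into fixed-header fields and payload.

    Assumes a plain 12-byte header (no CSRC entries, no extension).
    """
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "payload_type": b1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[12:],
    }


def ulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed linear sample."""
    u = ~byte & 0xFF
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u >> 4) & 0x07)
    magnitude -= 0x84
    return -magnitude if u & 0x80 else magnitude


# Synthetic packet: version 2, payload type 0 (PCMU), 160 bytes of silence.
header = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234)
packet = header + b"\xff" * 160
info = parse_rtp(packet)
samples = [ulaw_to_linear(b) for b in info["payload"]]
print(len(packet), info["sequence"], len(samples))
```

The sequence number and timestamp are what a receiver would typically use to reorder packets and detect loss.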
SLN (Signed Linear)
Raw audio without protocol overhead.
+----------------------------------+
| Raw PCM Audio Data |
| (no headers, no padding) |
+----------------------------------+
| Specification | Value |
|---|---|
| Format | Raw PCM, signed linear |
| Sample Rate | 8 kHz |
| Bit Depth | 16-bit signed |
| Channels | Mono |
| Byte Order | Native |
Best for: Minimal overhead, simple audio processing, direct PCM access without parsing.
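Because SLN has no framing, a received message can be reinterpreted directly as an array of samples. The sketch below assumes each message holds whole 16-bit samples in the byte order of the machine running the code (matching the "Native" entry above, which may differ from the sender's order across platforms); `sln_to_samples` and `rms` are hypothetical helper names.

```python
import array
import math


def sln_to_samples(data: bytes) -> array.array:
    """Interpret raw SLN bytes as signed 16-bit samples (native byte order)."""
    if len(data) % 2:
        raise ValueError("SLN data must contain whole 16-bit samples")
    samples = array.array("h")
    samples.frombytes(data)
    return samples


def rms(samples) -> float:
    """Root-mean-square level of a block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# One synthetic 20 ms frame: 160 samples at 8 kHz, 2 bytes each.
frame = array.array("h", [1000, -1000] * 80).tobytes()
decoded = sln_to_samples(frame)
print(len(frame), len(decoded), rms(decoded))
```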
AudioSocket
Asterisk-specific protocol designed for simple audio streaming.
+------------------+------------------+
| AudioSocket Hdr | PCM Audio |
+------------------+------------------+
| Specification | Value |
|---|---|
| Protocol | Asterisk AudioSocket |
| Format | PCM little-endian |
| Sample Rate | 8 kHz |
| Bit Depth | 16-bit |
| Channels | Mono |
| Chunk Size | 320 bytes (20 ms of audio) |
Best for: Asterisk integration, simple streaming with minimal overhead.
See Asterisk AudioSocket Documentation for protocol details.
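As a rough illustration of the framing, the sketch below encodes and decodes frames using the layout described in the Asterisk AudioSocket protocol: a 1-byte kind, a 2-byte big-endian payload length, then the payload. Whether VoIPBIN delivers exactly one such frame per WebSocket message is an assumption to verify; the helper names are hypothetical.

```python
import struct

# Message kinds from the Asterisk AudioSocket protocol description.
KIND_HANGUP = 0x00
KIND_UUID = 0x01
KIND_AUDIO = 0x10   # signed linear, 16-bit, 8 kHz, mono, little-endian
KIND_ERROR = 0xFF

HEADER_SIZE = 3


def encode_frame(kind: int, payload: bytes = b"") -> bytes:
    """Prefix a payload with the kind byte and big-endian length."""
    return struct.pack("!BH", kind, len(payload)) + payload


def decode_frames(buffer: bytes):
    """Return (frames, remainder); frames is a list of (kind, payload).

    Incomplete trailing data is returned untouched in remainder.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER_SIZE:
        kind, length = struct.unpack_from("!BH", buffer, offset)
        start = offset + HEADER_SIZE
        if len(buffer) - start < length:
            break
        frames.append((kind, buffer[start:start + length]))
        offset = start + length
    return frames, buffer[offset:]


audio = encode_frame(KIND_AUDIO, b"\x00" * 320)
stream = audio + encode_frame(KIND_HANGUP) + b"\x10\x01"  # last frame truncated
frames, rest = decode_frames(stream)
print(len(audio), [kind for kind, _ in frames], rest)
```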
Encapsulation Comparison
| Aspect | RTP | SLN | AudioSocket |
|---|---|---|---|
| Headers | 12 bytes | None | Protocol header |
| Compatibility | Industry standard | Simple | Asterisk |
| Overhead | Low | Minimal | Low |
| Parsing Required | Yes (RTP) | No | Yes (AudioSocket) |
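The frame sizes quoted in the tables follow from the sample rate, frame duration, and bytes per sample. A small worked sketch (the function name `frame_bytes` is illustrative):

```python
def frame_bytes(sample_rate_hz: int, frame_ms: int,
                bytes_per_sample: int, header_bytes: int = 0) -> int:
    """Size of one audio frame, including any fixed header."""
    samples = sample_rate_hz * frame_ms // 1000
    return header_bytes + samples * bytes_per_sample


# RTP: 8 kHz, 20 ms, 1 byte per mu-law sample, 12-byte header.
rtp_packet = frame_bytes(8000, 20, 1, header_bytes=12)
# SLN and the AudioSocket audio chunk: 8 kHz, 20 ms, 2 bytes per sample.
pcm_chunk = frame_bytes(8000, 20, 2)

print(rtp_packet, pcm_chunk)  # 172 320
```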
Supported Resources
Media streaming is available for both calls and conferences.
Call Media Streaming
Stream audio from a single call.
GET https://api.voipbin.net/v1.0/calls/<call-id>/media_stream?encapsulation=<type>&token=<token>
Audio contains: Both parties’ audio mixed together.
Conference Media Streaming
Stream audio from a conference with multiple participants.
GET https://api.voipbin.net/v1.0/conferences/<conference-id>/media_stream?encapsulation=<type>&token=<token>
Audio contains: All participants’ audio mixed together.
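Both endpoints share the same shape, so the URL can be built by one helper. This is a sketch under stated assumptions: `media_stream_url` is a hypothetical name, and the base URL and query parameters are copied from the requests shown above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.voipbin.net/v1.0"


def media_stream_url(resource: str, resource_id: str,
                     encapsulation: str, token: str,
                     base_url: str = BASE_URL) -> str:
    """Build the media_stream URL for a call or conference."""
    if resource not in ("calls", "conferences"):
        raise ValueError("resource must be 'calls' or 'conferences'")
    query = urlencode({"encapsulation": encapsulation, "token": token})
    return f"{base_url}/{resource}/{quote(resource_id, safe='')}/media_stream?{query}"


print(media_stream_url("calls", "<call-id>", "rtp", "<token>"))
```

Some WebSocket client libraries only accept `ws://` or `wss://` schemes; whether the scheme needs to be swapped for your client is something to check against its documentation.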
Resource Comparison
| Aspect | Call | Conference |
|---|---|---|
| Audio Source | Two-party conversation | Multi-party conversation |
| Audio Mix | Caller + callee | All participants |
| Audio Injection | Heard by both parties | Heard by all participants |
| Use Case | 1:1 AI assistant | Conference monitoring/recording |
Connection Lifecycle
Understanding the WebSocket connection lifecycle helps build robust streaming applications.
Connection Flow
Your App                        VoIPBIN
   |                               |
   |  GET /calls/{id}/media_stream |
   +------------------------------>|
   |                               |
   |  101 Switching Protocols      |
   |<------------------------------+
   |                               |
   |<======= audio frames ========>|  (bi-directional)
   |<======= audio frames ========>|
   |<======= audio frames ========>|
   |                               |
   |  close() or call ends         |
   +------------------------------>|
   |                               |
Connection States
connecting ---> open ---> streaming ---> closing ---> closed
                              |
                              v
                         (call ends)
                              |
                              v
                           closed
State Descriptions
| State | What’s happening |
|---|---|
| connecting | WebSocket handshake in progress |
| open | Connection established, ready for audio |
| streaming | Audio frames being sent/received |
| closing | Graceful shutdown initiated |
| closed | Connection terminated |
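Tracking these states explicitly on the client side can make logging and reconnection decisions easier. The sketch below encodes the transitions from the diagram; the extra edges `connecting -> closed` (failed handshake) and `open -> closing/closed` are assumptions added for completeness, and the class names are illustrative.

```python
from enum import Enum


class StreamState(Enum):
    CONNECTING = "connecting"
    OPEN = "open"
    STREAMING = "streaming"
    CLOSING = "closing"
    CLOSED = "closed"


# Allowed transitions. streaming -> closed models "call ends".
TRANSITIONS = {
    StreamState.CONNECTING: {StreamState.OPEN, StreamState.CLOSED},
    StreamState.OPEN: {StreamState.STREAMING, StreamState.CLOSING, StreamState.CLOSED},
    StreamState.STREAMING: {StreamState.CLOSING, StreamState.CLOSED},
    StreamState.CLOSING: {StreamState.CLOSED},
    StreamState.CLOSED: set(),
}


class StreamTracker:
    """Client-side record of where a media stream is in its lifecycle."""

    def __init__(self) -> None:
        self.state = StreamState.CONNECTING

    def transition(self, new_state: StreamState) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"invalid transition {self.state.value} -> {new_state.value}")
        self.state = new_state


tracker = StreamTracker()
tracker.transition(StreamState.OPEN)
tracker.transition(StreamState.STREAMING)
tracker.transition(StreamState.CLOSED)  # call ended while streaming
print(tracker.state.value)
```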
Connection Termination
The WebSocket connection closes when:
- Your application closes the connection
- The call or conference ends
- A network failure occurs
- The authentication token expires
Integration Patterns
Common patterns for integrating media streaming with your applications.
Pattern 1: AI Voice Assistant
Call Audio          Your App           AI Service
    |                   |                   |
    |====audio====>     |                   |
    |                   |        STT        |
    |                   +------------------>|
    |                   |                   |
    |                   |    AI response    |
    |                   |<------------------+
    |                   |                   |
    |                   |        TTS        |
    |     <====audio====+                   |
    |                   |                   |
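The flow above can be sketched as a transport-agnostic loop: incoming frames go to a speech-to-text step, recognized text goes to a responder, and synthesized audio is cut into 20 ms frames for sending. Everything here is an assumption for illustration: `transcribe`, `respond`, and `synthesize` are stand-ins for whatever STT, AI, and TTS services you use, and frames are assumed to be 320-byte 16-bit PCM chunks.

```python
from typing import Callable, Iterable, Iterator, Optional

FRAME_BYTES = 320  # 20 ms of 16-bit, 8 kHz mono PCM


def assistant_loop(
    incoming: Iterable[bytes],
    transcribe: Callable[[bytes], Optional[str]],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Yield outgoing audio frames in response to incoming call audio."""
    for frame in incoming:
        text = transcribe(frame)  # None until an utterance is complete
        if text is None:
            continue
        reply_audio = synthesize(respond(text))
        for i in range(0, len(reply_audio), FRAME_BYTES):
            # Pad the final frame with silence so every frame is full size.
            yield reply_audio[i:i + FRAME_BYTES].ljust(FRAME_BYTES, b"\x00")


def make_fake_stt(frames_per_utterance: int = 3):
    """Stand-in STT: reports 'hello' after every N frames."""
    count = 0

    def transcribe(frame: bytes) -> Optional[str]:
        nonlocal count
        count += 1
        return "hello" if count % frames_per_utterance == 0 else None

    return transcribe


outgoing = list(assistant_loop(
    [b"\x00" * FRAME_BYTES] * 3,
    make_fake_stt(),
    respond=lambda text: text.upper(),
    synthesize=lambda text: b"\x01" * 500,  # stand-in TTS output
))
print(len(outgoing), [len(f) for f in outgoing])
```

In a real application the `incoming` iterable would be fed by the WebSocket receive loop and the yielded frames written back as binary messages.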
Pattern 2: Real-Time Monitoring
Call Audio          Your App            Dashboard
    |                   |                   |
    |====audio====>     |                   |
    |                   |      analyze      |
    |                   +------------------>|
    |                   |    sentiment,     |
    |                   |    keywords,      |
    |                   |    quality        |
    |                   |                   |
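One simple metric a monitoring application could derive from the stream is signal level per frame. The sketch below computes the level in dB relative to full scale and the fraction of frames above a threshold. It assumes 16-bit little-endian PCM frames; the -40 dB threshold and function names are arbitrary illustrative choices.

```python
import math
import struct


def frame_dbfs(frame: bytes) -> float:
    """Level of a 16-bit little-endian PCM frame in dBFS."""
    count = len(frame) // 2
    if count == 0:
        return float("-inf")
    samples = struct.unpack(f"<{count}h", frame[:count * 2])
    rms = math.sqrt(sum(s * s for s in samples) / count)
    return 20 * math.log10(rms / 32768) if rms > 0 else float("-inf")


def talk_ratio(frames, threshold_db: float = -40.0) -> float:
    """Fraction of frames whose level exceeds the threshold."""
    frames = list(frames)
    if not frames:
        return 0.0
    active = sum(1 for f in frames if frame_dbfs(f) > threshold_db)
    return active / len(frames)


silent = b"\x00" * 320
loud = struct.pack("<160h", *([16384, -16384] * 80))
print(frame_dbfs(loud), talk_ratio([silent, loud]))
```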
Pattern 3: Custom IVR
Call Audio          Your App          Logic Engine
    |                   |                   |
    |====audio====>     |                   |
    |                   | detect DTMF/speech|
    |                   +------------------>|
    |                   |                   |
    |                   |    next action    |
    |                   |<------------------+
    |                   |                   |
    |     <====prompt===+                   |
    |                   |                   |
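If DTMF tones arrive in-band in the audio (an assumption; they may instead be signaled out-of-band), one common way to detect them is the Goertzel algorithm, which measures energy at the eight DTMF frequencies. The sketch below is illustrative only: the power threshold is arbitrary, it works on linear PCM samples at 8 kHz, and a production detector would add checks such as twist and minimum duration.

```python
import math

SAMPLE_RATE = 8000
ROW_FREQS = (697, 770, 852, 941)
COL_FREQS = (1209, 1336, 1477, 1633)
KEYS = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq: float) -> float:
    """Signal power at one frequency using the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / SAMPLE_RATE)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_dtmf(samples, min_power: float = 1e9):
    """Return the detected key, or None if no tone pair is strong enough."""
    row_power = [goertzel_power(samples, f) for f in ROW_FREQS]
    col_power = [goertzel_power(samples, f) for f in COL_FREQS]
    r = max(range(4), key=row_power.__getitem__)
    c = max(range(4), key=col_power.__getitem__)
    if row_power[r] < min_power or col_power[c] < min_power:
        return None
    return KEYS[r][c]


def dtmf_tone(key: str, duration_ms: int = 40, amplitude: float = 8000.0):
    """Generate a synthetic DTMF tone for testing the detector."""
    for r, row in enumerate(KEYS):
        if key in row:
            c = row.index(key)
            break
    else:
        raise ValueError(f"not a DTMF key: {key}")
    n = SAMPLE_RATE * duration_ms // 1000
    w_row = 2.0 * math.pi * ROW_FREQS[r] / SAMPLE_RATE
    w_col = 2.0 * math.pi * COL_FREQS[c] / SAMPLE_RATE
    return [amplitude * (math.sin(w_row * t) + math.sin(w_col * t)) for t in range(n)]


print(detect_dtmf(dtmf_tone("5")), detect_dtmf([0.0] * 320))
```

The detected key can then be passed to whatever logic decides the next prompt.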
Pattern 4: Recording with Processing
Call Audio          Your App             Storage
    |                   |                   |
    |====audio====>     |                   |
    |                   |      process      |
    |                   | (filter, enhance) |
    |                   |                   |
    |                   |       store       |
    |                   +------------------>|
    |                   |                   |
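For the storage step, received PCM frames can be wrapped in a WAV container using only the standard library. This is a minimal sketch assuming 16-bit little-endian mono PCM at 8 kHz (as with SLN on a little-endian host, or AudioSocket audio); `pcm_frames_to_wav` is a hypothetical helper, and any filtering or enhancement would happen before the frames are passed in.

```python
import io
import wave


def pcm_frames_to_wav(frames, sample_rate: int = 8000) -> bytes:
    """Wrap 16-bit mono PCM frames in a WAV container and return the bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        for frame in frames:
            wav.writeframes(frame)
    return buffer.getvalue()


wav_bytes = pcm_frames_to_wav([b"\x00" * 320, b"\x00" * 320])

# Read it back to confirm the container is well-formed.
with wave.open(io.BytesIO(wav_bytes), "rb") as check:
    print(check.getnchannels(), check.getframerate(), check.getnframes())
```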
For working code examples of these patterns, see the Media Stream Tutorial.
Best Practices
1. Audio Timing
- Send audio in consistent 20 ms chunks
- Maintain proper timing to avoid audio gaps or overlaps
- Buffer incoming audio to handle network jitter
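One way to keep outgoing audio evenly paced is to split it into fixed-size frames and schedule each send against an absolute deadline, so small delays do not accumulate into drift. The sketch below assumes 320-byte PCM frames and a hypothetical async `send` callable supplied by your WebSocket client.

```python
import asyncio

FRAME_BYTES = 320   # 20 ms of 16-bit, 8 kHz mono PCM
FRAME_MS = 20


def split_frames(audio: bytes, frame_bytes: int = FRAME_BYTES) -> list:
    """Cut audio into equal frames, padding the last one with silence."""
    return [
        audio[i:i + frame_bytes].ljust(frame_bytes, b"\x00")
        for i in range(0, len(audio), frame_bytes)
    ]


async def paced_send(send, frames, frame_ms: int = FRAME_MS) -> None:
    """Send one frame per interval, timed against absolute deadlines."""
    loop = asyncio.get_running_loop()
    next_deadline = loop.time()
    for frame in frames:
        await send(frame)
        next_deadline += frame_ms / 1000
        delay = next_deadline - loop.time()
        if delay > 0:
            await asyncio.sleep(delay)


async def _demo() -> list:
    sent = []

    async def fake_send(frame: bytes) -> None:  # stand-in for a WebSocket send
        sent.append(frame)

    await paced_send(fake_send, split_frames(b"\x01" * 700))
    return sent


demo_sent = asyncio.run(_demo())
print(len(demo_sent), [len(f) for f in demo_sent])
```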
2. Connection Management
- Implement automatic reconnection for dropped connections
- Handle the onclose event gracefully
- Close connections when no longer needed
3. Resource Efficiency
- Process audio asynchronously to avoid blocking
- Use appropriate buffer sizes (320 bytes per 20 ms for 16-bit PCM at 8 kHz; 160 bytes per 20 ms for G.711 RTP payloads)
- Monitor memory usage for long-running streams
4. Error Handling
- Log connection errors for debugging
- Implement exponential backoff for reconnection attempts
- Handle authentication failures gracefully
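A minimal sketch of an exponential backoff schedule for reconnection attempts. The base delay, growth factor, cap, and attempt count are illustrative defaults rather than recommended values, and the optional jitter spreads out retries from many clients.

```python
import random


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 30.0,
                   attempts: int = 6, jitter: float = 0.0, rng=None):
    """Yield a delay in seconds for each reconnection attempt.

    jitter is a fraction of the delay added at random (0.0 disables it).
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        delay = min(cap, base * factor ** attempt)
        yield delay + rng.uniform(0.0, jitter * delay)


print(list(backoff_delays(attempts=8)))
# [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

Authentication failures are usually worth treating differently: retrying with the same rejected token is unlikely to succeed, so it may be better to refresh the token or stop rather than back off and retry.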
Troubleshooting
Connection Issues
| Symptom | Solution |
|---|---|
| Connection refused | Verify call/conference is active and in “progressing” status |
| 401 Unauthorized | Check API token is valid and has permissions |
| Connection drops | Implement reconnection logic; check network stability |
Audio Issues
| Symptom | Solution |
|---|---|
| No audio received | Verify call is answered and audio is flowing; check encapsulation type |
| Audio quality poor | Check network latency; verify correct audio format; monitor packet loss |
| Audio choppy | Implement jitter buffer; send in consistent 20 ms chunks; check CPU usage |
| Can’t send audio | Use binary WebSocket frames; verify audio format matches encapsulation type |
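For the choppy-audio case, a jitter buffer holds a few frames before playback so that late or reordered frames can be put back in sequence. The sketch below is deliberately simple and makes several assumptions: frames carry a sequence number (for example the RTP sequence field), wraparound of that number is not handled, and lost frames are skipped rather than concealed.

```python
import heapq


class JitterBuffer:
    """Reorder frames by sequence number after an initial buffering delay."""

    def __init__(self, depth: int = 3) -> None:
        self.depth = depth        # frames to collect before playback starts
        self._heap = []
        self._next_seq = None     # lowest sequence number still acceptable

    def push(self, seq: int, frame: bytes) -> None:
        if self._next_seq is not None and seq < self._next_seq:
            return                # arrived too late; already played past it
        heapq.heappush(self._heap, (seq, frame))

    def pop(self):
        """Return the next frame in order, or None if nothing is ready."""
        if self._next_seq is None and len(self._heap) < self.depth:
            return None           # still filling the initial buffer
        if not self._heap:
            return None
        seq, frame = heapq.heappop(self._heap)
        self._next_seq = seq + 1
        return frame


buf = JitterBuffer(depth=3)
buf.push(2, b"b")
first = buf.pop()                 # None: still buffering
buf.push(1, b"a")
buf.push(3, b"c")
ordered = [buf.pop(), buf.pop(), buf.pop()]
buf.push(2, b"late")              # dropped: sequence 2 was already played
print(first, ordered, buf.pop())
```

A larger `depth` tolerates more jitter at the cost of added latency (each buffered frame adds 20 ms).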