Mediastream

Overview

In VoipBin, the media stream feature empowers users to directly control media transmission without relying on SIP (Session Initiation Protocol) signaling. Traditionally, SIP signaling is used to establish, modify, and terminate communication sessions in VoIP systems. However, the media stream feature in VoipBin introduces an alternative method for managing media streams independently of SIP signaling.

With the media stream feature, users can initiate, manipulate, and terminate media streams directly through the VoipBin platform, bypassing the need for SIP signaling. This capability offers several advantages:

Flexibility: Users have greater flexibility in controlling media streams, as they can manage them independently of SIP signaling. This flexibility allows for more dynamic and customizable communication experiences.

Efficiency: By eliminating the dependency on SIP signaling for media control, the media stream feature can streamline the process of initiating and managing media streams. This can lead to more efficient use of resources and reduced latency in media transmission.

Scalability: The media stream feature can enhance the scalability of VoipBin by reducing the overhead associated with SIP signaling for media control. This can support a larger number of concurrent media streams and accommodate higher traffic volumes.

Enhanced User Experience: By enabling direct control over media streams, VoipBin can offer users a more seamless and responsive communication experience. Users can adjust media settings in real-time without the constraints imposed by SIP signaling.

Overall, the media stream feature in VoipBin empowers users with greater control and flexibility in managing media transmission, enhancing the efficiency, scalability, and user experience of the platform. This capability enables a wide range of applications, from real-time communication to multimedia streaming, with minimal reliance on traditional SIP signaling mechanisms.

Available resources

Currently, support the below types of resources. * Call: See detail here <call-overview>. * Conference: See detail here <conference-overview>.

Bi-directional streaming

Bi-directional streaming allows for simultaneous transmission and reception of media. To establish bi-directional streaming, an additional API call is necessary:

Media stream for call

GET https://api.voipbin.net/v1.0/calls/<call-id>/media_stream?encapsulation=<encapsulation-type>&token=<token>

https://api.voipbin.net/v1.0/calls/652af662-eb45-11ee-b1a5-6fde165f9226/media_stream?encapsulation=rtp&token=some_token

Media stream for conference

GET https://api.voipbin.net/v1.0/conferences/<conference-id>/media_stream?encapsulation=<encapsulation-type>&token=<token>

https://api.voipbin.net/v1.0/conferences/1ed12456-eb4b-11ee-bba8-1bfb2838807a/media_stream?encapsulation=rtp&token=some_token

By making this API call, a WebSocket connection is created, facilitating both the reception and transmission of media data. This means that both the server (VoipBin) and the client can send and receive media through the WebSocket connection.

+-----------------+        Websocket         +-----------------+
|                 |--------------------------|                 |
|     voipbin     |<----  Media In/Out  ---->|     Client      |
|                 |--------------------------|                 |
+-----------------+                          +-----------------+

Uni-directional streaming

Uni-directional streaming involves the transmission of media in one direction only. This can be achieved through the use of a flow action known as “external media start”. This flow action initiates the transmission of media from the server to the client, without the need for the client to send media back. See detail here.

Encapsulation Types

There are two encapsulation types supported for media streaming:

  • rtp: RTP (Real-time Transport Protocol). This is the standard protocol for transmitting audio and video over IP networks. It provides a means for transporting media streams (e.g., voice) from one endpoint to another.

  • sln: Signed Linear Mono. This media stream format does not include headers or padding. ulaw allowed.

  • audiosocket: AudioSocket. This is a protocol specific to Asterisk, known as Asterisk’s AudioSocket type. It is designed to facilitate simple audio streaming with minimal overhead. More details about AudioSocket can be found in the Asterisk AudioSocket Documentation(https://docs.asterisk.org/Configuration/Channel-Drivers/AudioSocket/).

Codec

  • VoipBin currently supports the ulaw codec. always 16-bit, 8kHz signed linear mono.

  • For AudioSocket, it uses 16-bit, 8kHz, mono PCM (little-endian).

Tutorial

Bi-Directional Media Streaming for Calls

Connect to a call’s media stream via WebSocket to send and receive audio in real-time. This allows you to build custom audio processing applications without SIP signaling.

Establish WebSocket Connection:

GET https://api.voipbin.net/v1.0/calls/<call-id>/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

Example:

GET https://api.voipbin.net/v1.0/calls/652af662-eb45-11ee-b1a5-6fde165f9226/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

This creates a bi-directional WebSocket connection where you can: - Receive audio from the call (what the other party is saying) - Send audio to the call (inject audio into the conversation)

Bi-Directional Media Streaming for Conferences

Access a conference’s media stream to monitor or participate in the conference audio.

GET https://api.voipbin.net/v1.0/conferences/<conference-id>/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

Example:

GET https://api.voipbin.net/v1.0/conferences/1ed12456-eb4b-11ee-bba8-1bfb2838807a/media_stream?encapsulation=rtp&token=<YOUR_AUTH_TOKEN>

This allows you to: - Listen to all conference participants - Inject audio into the conference - Build custom conference recording or analysis tools

Encapsulation Types

VoIPBIN supports three encapsulation types for media streaming:

1. RTP (Real-time Transport Protocol)

Standard protocol for audio/video over IP networks.

?encapsulation=rtp

Use cases: - Standard VoIP integration - Compatible with most audio processing tools - Industry-standard protocol

2. SLN (Signed Linear Mono)

Raw audio stream without headers or padding.

?encapsulation=sln

Use cases: - Minimal overhead needed - Simple audio processing - Direct PCM audio access

3. AudioSocket

Asterisk-specific protocol for simple audio streaming.

?encapsulation=audiosocket

Use cases: - Asterisk integration - Low-overhead streaming - Simple audio applications

Codec: All formats use 16-bit, 8kHz, mono audio (ulaw for RTP/SLN, PCM little-endian for AudioSocket)

WebSocket Client Examples

Python Example (RTP Streaming):

import websocket
import struct

def on_message(ws, message):
    """Receive audio data from the call"""
    # message contains RTP packets
    print(f"Received {len(message)} bytes of audio")

    # Process audio here
    # - Save to file
    # - Run speech recognition
    # - Analyze audio
    process_audio(message)

def on_open(ws):
    """Connection established, can start sending audio"""
    print("Media stream connected")

    # Send audio to the call
    # audio_data should be RTP packets
    audio_data = generate_audio()
    ws.send(audio_data, opcode=websocket.ABNF.OPCODE_BINARY)

def on_error(ws, error):
    print(f"Error: {error}")

def on_close(ws, close_status_code, close_msg):
    print(f"Connection closed: {close_status_code}")

# Connect to media stream
call_id = "652af662-eb45-11ee-b1a5-6fde165f9226"
token = "<YOUR_AUTH_TOKEN>"
ws_url = f"wss://api.voipbin.net/v1.0/calls/{call_id}/media_stream?encapsulation=rtp&token={token}"

ws = websocket.WebSocketApp(
    ws_url,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)

ws.run_forever()

def process_audio(rtp_packet):
    """Process received RTP audio"""
    # Extract payload from RTP packet
    # RTP header is typically 12 bytes
    payload = rtp_packet[12:]

    # Save or process audio
    with open('received_audio.raw', 'ab') as f:
        f.write(payload)

def generate_audio():
    """Generate RTP packets to send"""
    # This is a simplified example
    # In production, properly construct RTP packets

    # Read audio file
    with open('audio_to_inject.raw', 'rb') as f:
        audio_data = f.read(160)  # 20ms of 8kHz audio

    # Construct RTP header (simplified)
    # In production, use a proper RTP library
    return audio_data

JavaScript Example (Browser):

const callId = '652af662-eb45-11ee-b1a5-6fde165f9226';
const token = '<YOUR_AUTH_TOKEN>';
const wsUrl = `wss://api.voipbin.net/v1.0/calls/${callId}/media_stream?encapsulation=rtp&token=${token}`;

const ws = new WebSocket(wsUrl);
ws.binaryType = 'arraybuffer';

ws.onopen = function() {
    console.log('Media stream connected');

    // Send audio to the call
    const audioData = generateAudio();
    ws.send(audioData);
};

ws.onmessage = function(event) {
    // Receive audio from the call
    const audioData = event.data;
    console.log(`Received ${audioData.byteLength} bytes`);

    // Process audio
    processAudio(new Uint8Array(audioData));
};

ws.onerror = function(error) {
    console.error('WebSocket error:', error);
};

ws.onclose = function() {
    console.log('Media stream closed');
};

function processAudio(audioBuffer) {
    // Process received audio
    // - Play through Web Audio API
    // - Run speech recognition
    // - Visualize audio
}

function generateAudio() {
    // Generate audio to send
    // Returns ArrayBuffer with RTP packets
    return new ArrayBuffer(172); // RTP packet size
}

Node.js Example (AudioSocket):

const WebSocket = require('ws');
const fs = require('fs');

const callId = '652af662-eb45-11ee-b1a5-6fde165f9226';
const token = '<YOUR_AUTH_TOKEN>';
const wsUrl = `wss://api.voipbin.net/v1.0/calls/${callId}/media_stream?encapsulation=audiosocket&token=${token}`;

const ws = new WebSocket(wsUrl);

ws.on('open', function() {
    console.log('AudioSocket connected');

    // Send audio file
    const audioFile = fs.readFileSync('audio.pcm');

    // Send in chunks (20ms = 320 bytes for 16-bit 8kHz mono)
    const chunkSize = 320;
    for (let i = 0; i < audioFile.length; i += chunkSize) {
        const chunk = audioFile.slice(i, i + chunkSize);
        ws.send(chunk);
    }
});

ws.on('message', function(data) {
    // Receive audio from call
    console.log(`Received ${data.length} bytes`);

    // Save received audio
    fs.appendFileSync('received_audio.pcm', data);
});

ws.on('error', function(error) {
    console.error('Error:', error);
});

ws.on('close', function() {
    console.log('AudioSocket closed');
});

Uni-Directional Streaming with Flow Action

For sending audio to a call without receiving audio back, use the external_media_start flow action.

Create Call with External Media:

$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "source": {
            "type": "tel",
            "target": "+15551234567"
        },
        "destinations": [
            {
                "type": "tel",
                "target": "+15559876543"
            }
        ],
        "actions": [
            {
                "type": "answer"
            },
            {
                "type": "external_media_start",
                "option": {
                    "url": "wss://your-media-server.com/audio-stream",
                    "encapsulation": "audiosocket"
                }
            }
        ]
    }'

This creates a uni-directional stream where VoIPBIN: 1. Establishes the call 2. Connects to your media server via WebSocket 3. Receives audio from your server 4. Plays that audio to the call participant

Your media server receives:

WebSocket connection from VoIPBIN
→ Send audio chunks (PCM format for AudioSocket)
→ VoIPBIN plays audio to call

Common Use Cases

1. Real-Time Speech Recognition:

# Python example
def on_message(ws, message):
    # Extract audio from RTP packet
    audio = extract_audio(message)

    # Send to speech recognition API
    text = speech_to_text(audio)
    print(f"Recognized: {text}")

    # Store transcription
    save_transcription(text)

2. Audio Injection / IVR Replacement:

# Node.js example
ws.on('open', function() {
    // Play custom audio prompts
    const prompt1 = fs.readFileSync('welcome.pcm');
    ws.send(prompt1);

    // Wait for DTMF or speech
    // Then play next prompt
});

3. Conference Recording:

# Python example
def on_message(ws, message):
    # Save all conference audio
    with open(f'conference_{conference_id}.raw', 'ab') as f:
        f.write(extract_audio(message))

4. Real-Time Audio Analysis:

def on_message(ws, message):
    audio = extract_audio(message)

    # Detect emotion
    emotion = analyze_emotion(audio)

    # Detect keywords
    if detect_keyword(audio, ['help', 'urgent']):
        alert_supervisor()

    # Calculate audio quality
    quality = measure_quality(audio)

5. Custom Music on Hold:

ws.on('open', function() {
    // Play custom music or messages
    const music = fs.readFileSync('hold_music.pcm');

    // Loop music while call is on hold
    setInterval(() => {
        ws.send(music);
    }, 1000);
});

6. AI-Powered Voice Assistant:

ws.on('message', function(data) {
    // Receive customer audio
    const audio = extractAudio(data);

    // Send to AI for processing
    const response = await aiProcess(audio);

    // Convert AI response to audio
    const responseAudio = textToSpeech(response);

    // Send back to call
    ws.send(responseAudio);
});

Audio Format Details

RTP Format: - Codec: ulaw (G.711 μ-law) - Sample rate: 8 kHz - Bits: 16-bit - Channels: Mono - Packet size: 160 bytes payload (20ms audio)

SLN Format: - Raw PCM audio - No headers or padding - Sample rate: 8 kHz - Bits: 16-bit signed - Channels: Mono

AudioSocket Format: - PCM little-endian - Sample rate: 8 kHz - Bits: 16-bit - Channels: Mono - Chunk size: 320 bytes (20ms of audio)

Best Practices

1. Buffer Management: - Maintain audio buffers to handle jitter - Send audio in consistent 20ms chunks - Don’t send too fast or too slow

2. Error Handling: - Implement reconnection logic - Handle WebSocket disconnections gracefully - Log errors for debugging

3. Audio Quality: - Use proper RTP packet construction - Maintain correct timing for audio chunks - Monitor for packet loss

4. Resource Management: - Close WebSocket when done - Don’t leave connections open indefinitely - Clean up audio buffers and files

5. Testing: - Test with various network conditions - Verify audio quality with real calls - Monitor latency and packet loss

6. Security: - Use WSS (secure WebSocket) in production - Validate authentication tokens - Encrypt sensitive audio data

Connection Lifecycle

1. Establish Connection:

GET /v1.0/calls/<call-id>/media_stream?encapsulation=rtp&token=<token>

2. WebSocket Upgrade:

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade

3. Bi-Directional Communication:

Client ←→ VoIPBIN
- Send audio: Binary frames with RTP packets
- Receive audio: Binary frames with RTP packets

4. Close Connection:

ws.close()

Troubleshooting

Common Issues:

No audio received: - Check WebSocket connection is established - Verify call is active and answered - Ensure correct encapsulation type

Audio quality poor: - Check network latency - Verify audio format matches requirements - Monitor packet loss

Connection drops: - Implement reconnection logic - Check firewall rules for WebSocket - Verify authentication token is valid

Can’t send audio: - Ensure binary frames are used (not text) - Verify audio format is correct - Check audio chunk size (typically 20ms)

For more information about media stream configuration, see Media Stream Overview.