AI

Overview

VoIPBIN’s AI is a built-in AI agent that enables automated, intelligent voice interactions during live calls. Designed for seamless integration within VoIPBIN’s flow, the AI utilizes ChatGPT as its AI engine to process and respond to user inputs in real time. This allows developers to create dynamic and interactive voice experiences without requiring manual intervention.

How it works

Action component

The AI is integrated as one of the configurable components within a VoIPBIN flow. When a call reaches an AI action, the system triggers the AI to generate a response based on the provided prompt. The response is then processed and played back to the caller using text-to-speech (TTS). If the response is in a structured JSON format, VoIPBIN executes the defined actions accordingly.

AI component in action builder

TTS/STT + AI Engine

VoIPBIN’s AI is built using TTS/STT + AI Engine, where speech-to-text (STT) converts spoken words into text, and text-to-speech (TTS) converts responses back into audio. The system processes these in real time, enabling seamless conversations.

AI implementation using TTS/STT + AI Engine

Voice Detection and Play Interruption:

In addition to basic TTS and STT functionalities, VoIPBIN incorporates voice detection to create a more natural conversational flow. While the AI is speaking (i.e., playing TTS media), if the system detects the caller’s voice, it immediately stops the TTS playback and routes the caller’s speech (via STT) to the AI engine. This play interruption feature ensures that if the user starts talking, their input is prioritized, enabling a dynamic interaction that more closely resembles a real conversation.

Context Retention

VoIPBIN’s AI supports context saving. During a conversation, the AI remembers prior exchanges, allowing it to maintain continuity and respond based on earlier parts of the interaction. This provides a more natural and human-like dialogue experience.

Multilingual support

VoIPBIN’s AI supports multiple languages. See supported languages: supported languages.

External AI Agent Integration

For users who prefer to use external AI services, VoIPBIN offers media stream access via MCP (Media Control Protocol). This allows third-party AI engines to process voice data directly, enabling deeper customization and advanced AI capabilities.

MCP Server

A recommended open-source implementation is available here:

Using the AI

Initial Prompt

The initial prompt serves as the foundation for the AI’s behavior. A well-crafted prompt ensures accurate and relevant responses. There is no enforced limit to prompt length, but we recommend keeping this confidential to ensure consistent performance and security.

Example Prompt:

Pretend you are an expert customer service agent.

Please respond kindly.

AI Talk

AI Talk enables real-time conversational AI with voice in VoIPBIN, powered by ElevenLabs’ voice engine for natural-sounding speech.

AI Talk component in action builder

Key Features

  • Real-time Voice Interaction: AI generates responses in real-time based on user input and delivers them as speech.

  • Interruption Detection & Listening: If the other party speaks while the AI is talking, the system immediately stops the AI’s speech and switches to capturing the user’s voice via STT. This ensures a smooth and continuous conversation flow.

  • Low Latency Response: For longer prompts, AI Talk does not wait for the entire response to finish. Instead, it generates and plays speech in smaller chunks, reducing perceived response time for the user.

  • ElevenLabs Voice Engine: High-quality, natural-sounding voice output ensures the AI feels like a real conversation partner.

Built-in ElevenLabs Voice IDs

VoIPBIN uses a predefined set of voice IDs for various languages and genders. Here are the default ElevenLabs Voice IDs currently in use:

Language

Male Voice ID (Name)

Female Voice ID (Name)

Neutral Voice ID (Name)

English (Default)

21m00Tcm4TlvDq8ikWAM (Adam)

EXAVITQu4vr4xnSDxMaL (Rachel)

EXAVITQu4vr4xnSDxMaL (Rachel)

Japanese

Mv8AjrYZCBkdsmDHNwcB (Ishibashi)

PmgfHCGeS5b7sH90BOOJ (Fumi)

PmgfHCGeS5b7sH90BOOJ (Fumi)

Chinese

MI36FIkp9wRP7cpWKPTl (Evan)

ZL9dtgFhmkTzAHUUtQL8 (Xiao)

ZL9dtgFhmkTzAHUUtQL8 (Xiao)

German

uM8iMoqaSe1eDaJiWfxf (Felix)

nF7t9cuYo0u3kuVI9q4B (Dana)

nF7t9cuYo0u3kuVI9q4B (Dana)

French

IPgYtHTNLjC7Bq7IPHrm (Alexandre)

SmWACbi37pETyxxMhSpc

SmWACbi37pETyxxMhSpc

Hindi

IvLWq57RKibBrqZGpQrC (Leo)

MF4J4IDTRo0AxOO4dpFR (Devi)

MF4J4IDTRo0AxOO4dpFR (Devi)

Korean

nbrxrAz3eYm9NgojrmFK (Minjoon)

AW5wrnG1jVizOYY7R1Oo (Jiyoung)

AW5wrnG1jVizOYY7R1Oo (Jiyoung)

Italian

iLVmqjzCGGvqtMCk6vVQ

b8jhBTcGAq4kQGWmKprT (Sami)

b8jhBTcGAq4kQGWmKprT (Sami)

Spanish (Spain)

JjHBC66wF58p4ogebCNA (Eduardo)

UOIqAnmS11Reiei1Ytkc (Carolina)

UOIqAnmS11Reiei1Ytkc (Carolina)

Portuguese (Brazil)

NdHRjGnnDKGnnm2c19le (Tiago)

CZD4BJ803C6T0alQxsR7 (Andreia)

CZD4BJ803C6T0alQxsR7 (Andreia)

Dutch

G53Wkf3yrsXvhoQsmslL (James)

YUdpWWny7k5yb4QCeweX (Ruth)

YUdpWWny7k5yb4QCeweX (Ruth)

Russian

qJBO8ZmKp4te7NTtYgzz (Egor)

ymDCYd8puC7gYjxIamPt

ymDCYd8puC7gYjxIamPt

Arabic

s83SAGdFTflAwJcAV81K (Adeeb)

EXAVITQu4vr4xnSDxMaL (Farah)

4wf10lgibMnboGJGCLrP (Farah)

Polish

H5xTcsAIeS5RAykjz57a (Alex)

W0sqKm1Sfw1EzlCH14FQ (Beata)

W0sqKm1Sfw1EzlCH14FQ (Beata)

Other ElevenLabs Voice ID Options

VoIPBIN allows you to personalize the text-to-speech output by specifying a custom ElevenLabs Voice ID. By setting the voipbin.tts.elevenlabs.voice_id variable, you can override the default voice selection.

voipbin.tts.elevenlabs.voice_id: <Your Custom Voice ID>

See how to set the variables here.

AI Summary

The AI Summary feature in VoIPBIN generates structured summaries of call transcriptions, recordings, or conference discussions. It provides a concise summary of key points, decisions, and action items based on the provided transcription source.

AI summary component in action builder

Supported Resources

AI summaries work with a single resource at a time. The supported resources are:

Real-time Summary: * Live call transcription * Live conference transcription

Non-Real-time Summary: * Transcribed recordings (post-call) * Recorded conferences (post-call)

Choosing Between Real-time and Non-Real-time Summaries

Developers must decide whether to use a real-time or non-real-time summary based on their needs:

Use Case

Summary Type

Recommendation

Live call monitoring

Real-time

Use AI summary with a live call transcription

Live conference insights

Real-time

Use AI summary with a live conference transcription

Post-call analysis

Non-real-time

Use AI summary with transcribe_id from a completed call

Recorded conference summary

Non-real-time

Use AI summary with recording_id

AI Summary Behavior

  • The summary action processes only one resource at a time.

  • If multiple AI summary actions are used in a flow, each executes independently.

  • If an AI summary action is triggered multiple times for the same resource, it only returns the most recent segment.

  • In conference calls, the summary is unified across all participants rather than per speaker.

Ensuring Full Coverage

Since starting an AI summary action late in the call results in missing earlier conversations, developers should follow best practices: * Enable transcribe_start early: This ensures that transcriptions are available even if an AI summary action is triggered later. * Use transcribe_id instead of call_id: This allows summarizing a full transcription rather than just the latest segment. * For post-call summaries, use recording_id: This ensures that the full conversation is summarized from the recorded audio.

AI

AI

{
    "id": "<string>",
    "customer_id": "<string>",
    "name": "<string>",
    "detail": "<string>",
    "engine_type": "<string>",
    "init_prompt": "<string>",
    "tm_create": "<string>",
    "tm_update": "<string>",
    "tm_delete": "<string>"
}
  • id: AI’s ID.

  • customer_id: Customer’s ID.

  • name: AI’s name.

  • detail: AI’s detail.

  • engine_type: AI’s engine type. See detail here

  • init_prompt: Defines AI’s initial prompt. It will define the AI engine’s behavior.

Example

{
    "id": "a092c5d9-632c-48d7-b70b-499f2ca084b1",
    "customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b",
    "name": "test AI",
    "detail": "test AI for simple scenario",
    "engine_type": "chatGPT",
    "tm_create": "2023-02-09 07:01:35.666687",
    "tm_update": "9999-01-01 00:00:00.000000",
    "tm_delete": "9999-01-01 00:00:00.000000"
}

Type

AI’s type.

Type

Description

chatGPT

Openai’s Chat AI. https://chat.openai.com/chat

clova

Naver’s Clova AI (coming soon). https://clova.ai/

Tutorial

Simple AI Voice Assistant

Create a basic AI voice assistant that answers questions during a call. The AI will listen to the caller’s speech, process it, and respond using text-to-speech.

$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "source": {
            "type": "tel",
            "target": "+15551234567"
        },
        "destinations": [
            {
                "type": "tel",
                "target": "+15559876543"
            }
        ],
        "actions": [
            {
                "type": "answer"
            },
            {
                "type": "ai",
                "option": {
                    "initial_prompt": "You are a helpful customer service assistant. Answer questions politely and concisely.",
                    "language": "en-US",
                    "voice_type": "female"
                }
            }
        ]
    }'

This creates a call with an AI assistant that will: 1. Answer the incoming call 2. Listen to the caller’s speech using STT (Speech-to-Text) 3. Process the input through the AI engine with the given prompt 4. Respond using TTS (Text-to-Speech)

AI Talk with Real-Time Conversation

Use AI Talk for more natural, low-latency conversations powered by ElevenLabs. This enables interruption detection where the AI stops speaking when the caller starts talking.

$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "source": {
            "type": "tel",
            "target": "+15551234567"
        },
        "destinations": [
            {
                "type": "tel",
                "target": "+15559876543"
            }
        ],
        "actions": [
            {
                "type": "answer"
            },
            {
                "type": "ai_talk",
                "option": {
                    "initial_prompt": "You are an expert sales representative for VoIPBIN. Help customers understand our calling and messaging platform. Be enthusiastic but professional.",
                    "language": "en-US",
                    "voice_type": "male"
                }
            }
        ]
    }'

AI Talk provides: - Interruption Detection: Stops speaking when caller talks - Low Latency: Streams responses in chunks for faster perceived response time - Natural Voice: Uses ElevenLabs for high-quality voice output - Context Retention: Remembers previous conversation exchanges

AI with Custom Voice ID

Customize the AI voice by specifying an ElevenLabs Voice ID using variables.

$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "source": {
            "type": "tel",
            "target": "+15551234567"
        },
        "destinations": [
            {
                "type": "tel",
                "target": "+15559876543"
            }
        ],
        "actions": [
            {
                "type": "answer"
            },
            {
                "type": "variable_set",
                "option": {
                    "key": "voipbin.tts.elevenlabs.voice_id",
                    "value": "21m00Tcm4TlvDq8ikWAM"
                }
            },
            {
                "type": "ai_talk",
                "option": {
                    "initial_prompt": "You are a friendly receptionist. Greet callers warmly and help them with their inquiries.",
                    "language": "en-US"
                }
            }
        ]
    }'

See Built-in ElevenLabs Voice IDs for available voice options.

AI Summary for Call Transcription

Generate an AI-powered summary of a call transcription. This is useful for post-call analysis and record-keeping.

$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "source": {
            "type": "tel",
            "target": "+15551234567"
        },
        "destinations": [
            {
                "type": "tel",
                "target": "+15559876543"
            }
        ],
        "actions": [
            {
                "type": "answer"
            },
            {
                "type": "transcribe_start",
                "option": {
                    "language": "en-US"
                }
            },
            {
                "type": "talk",
                "option": {
                    "text": "Hello! This call is being transcribed and summarized. Please tell me about your experience with our service.",
                    "language": "en-US"
                }
            },
            {
                "type": "wait",
                "option": {
                    "duration": 30000
                }
            },
            {
                "type": "ai_summary",
                "option": {
                    "source_type": "transcribe",
                    "source_id": "${voipbin.transcribe.id}"
                }
            },
            {
                "type": "talk",
                "option": {
                    "text": "Thank you for your feedback. We have recorded and summarized your call.",
                    "language": "en-US"
                }
            }
        ]
    }'

The AI summary will: - Process the transcription from transcribe_start - Generate a structured summary of key points - Store the summary in ${voipbin.ai_summary.result} - Can be accessed via webhook or API after the call

Real-Time AI Summary

Get AI summaries while the call is still active. Useful for live call monitoring and agent assistance.

$ curl --location --request POST 'https://api.voipbin.net/v1.0/calls?token=<YOUR_AUTH_TOKEN>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "source": {
            "type": "tel",
            "target": "+15551234567"
        },
        "destinations": [
            {
                "type": "tel",
                "target": "+15559876543"
            }
        ],
        "actions": [
            {
                "type": "answer"
            },
            {
                "type": "transcribe_start",
                "option": {
                    "language": "en-US",
                    "real_time": true
                }
            },
            {
                "type": "ai_summary",
                "option": {
                    "source_type": "call",
                    "source_id": "${voipbin.call.id}",
                    "real_time": true
                }
            },
            {
                "type": "connect",
                "option": {
                    "source": {
                        "type": "tel",
                        "target": "+15551234567"
                    },
                    "destinations": [
                        {
                            "type": "tel",
                            "target": "+15551111111"
                        }
                    ]
                }
            }
        ]
    }'

Real-time summaries provide: - Live Updates: Summary updates as conversation progresses - Agent Assistance: Provides context to agents joining mid-call - Call Monitoring: Enables supervisors to quickly understand ongoing calls

Best Practices

Initial Prompt Design: - Be specific about the AI’s role and behavior - Include constraints (e.g., “Keep responses under 30 seconds”) - Define the tone (professional, friendly, technical, etc.)

Language Support: - AI supports multiple languages (see supported languages) - Match the language parameter with caller’s expected language - AI can detect and respond in multiple languages if not constrained

Context Retention: - AI remembers conversation history within the same call - Variables set during the call are available to AI - Use context to build multi-turn conversations

Error Handling: - Always include fallback actions after AI actions - Handle cases where AI may not understand the input - Provide clear instructions to callers about what they can ask

For more details on AI features and configuration, see AI Overview.