.. _speaking-struct-speaking: Speaking ======== Speaking -------- .. code:: { "id": "", "customer_id": "", "reference_type": "", "reference_id": "", "language": "", "provider": "", "voice_id": "", "direction": "", "status": "", "tm_create": "", "tm_update": "", "tm_delete": "", } * ``id`` (UUID): The speaking session's unique identifier. Returned when creating a TTS session via ``POST /speakings`` or listing via ``GET /speakings``. * ``customer_id`` (UUID): The customer who owns this speaking session. Obtained from ``GET /customer``. * ``reference_type`` (enum string): The type of resource receiving TTS audio. See :ref:`Reference Type `. * ``reference_id`` (UUID): The ID of the resource receiving TTS audio. Depending on ``reference_type``, obtained from ``GET /calls`` or ``GET /conferences``. * ``language`` (String, BCP47): The language and locale for TTS synthesis (e.g., ``en-US``, ``ko-KR``). Must match the provider's supported languages. * ``provider`` (enum string, optional): The TTS provider used for synthesis. See :ref:`Provider `. If omitted, defaults to ``elevenlabs``. * ``voice_id`` (String, optional): A provider-specific voice identifier. If omitted, the provider's default voice for the specified language is used. Obtain available voices from the provider's documentation. * ``direction`` (enum string): The audio routing direction. See :ref:`Direction `. * ``status`` (enum string): The speaking session's current status. See :ref:`Status `. * ``tm_create`` (string, ISO 8601): Timestamp when the speaking session was created. * ``tm_update`` (string, ISO 8601): Timestamp of the last update to any speaking property. * ``tm_delete`` (string, ISO 8601): Timestamp when the speaking session was deleted. Set to ``9999-01-01 00:00:00.000000`` if not deleted. .. note:: **AI Implementation Hint** Timestamps set to ``9999-01-01 00:00:00.000000`` indicate the event has not yet occurred. For example, ``tm_delete`` with this value means the speaking session has not been deleted. Example +++++++ .. code:: { "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890", "customer_id": "5e4a0680-804e-11ec-8477-2fea5968d85b", "reference_type": "call", "reference_id": "12f8f1c9-a6c3-4f81-93db-ae445dcf188f", "language": "en-US", "provider": "elevenlabs", "voice_id": "", "direction": "both", "status": "active", "tm_create": "2025-06-15 14:30:00.123456", "tm_update": "2025-06-15 14:30:02.456789", "tm_delete": "9999-01-01 00:00:00.000000" } .. _speaking-struct-speaking-reference_type: reference_type -------------- All possible values for the ``reference_type`` field: =========== ============ Type Description =========== ============ call Attach TTS to a live call. The ``reference_id`` is a call ID from ``GET /calls``. confbridge Attach TTS to a live conference. The ``reference_id`` is a conference ID from ``GET /conferences``. =========== ============ .. _speaking-struct-speaking-provider: provider -------- All possible values for the ``provider`` field: =========== ============ Provider Description =========== ============ elevenlabs ElevenLabs TTS. High-quality neural voices. Default provider if omitted. gcp Google Cloud Text-to-Speech. Wide language support with WaveNet and Neural2 voices. aws Amazon Polly. Neural and standard voices with SSML support. =========== ============ When creating a speaking session, the ``provider`` field is optional. If omitted, VoIPBIN defaults to ``elevenlabs``. .. _speaking-struct-speaking-status: status ------ All possible values for the ``status`` field: =========== ============ Status Description =========== ============ initiating TTS session is being set up. Provider connection is being established. Do not call ``POST /speakings/{id}/say`` in this state. active TTS session is ready. Send text via ``POST /speakings/{id}/say``. Audio is being injected into the call. stopped TTS session has ended. Stopped via ``POST /speakings/{id}/stop`` or the call was hung up. =========== ============ .. _speaking-struct-speaking-direction: direction --------- All possible values for the ``direction`` field: =========== ============ Direction Description =========== ============ in Audio injected toward the caller (remote party hears it, local party does not). out Audio injected toward the callee/local side (local party hears it, remote party does not). both Audio injected to both sides of the call. Both parties hear the synthesized speech. =========== ============