Dialogue tools

Kubeez

Generate single-voice text-to-speech audio on Replicate. The generate_dialogue tool accepts one prompt and one voice per call (multi-voice scenes are produced by stitching multiple calls), and supports two providers:

ElevenLabs v3 (provider: "elevenlabs", the default) — 26 human-named voices, the most natural prosody.
Google Gemini 3.1 Flash TTS (provider: "google") — 30 voices, 70+ languages, a natural-language style prompt, and inline expressive tags ([sigh], [laughing], [whispering], [shouting], [extremely fast], [like dracula]) that are actually performed.

Both providers bill at the same rate (26 credits per 1,000 characters). The full live catalog is also available through get_models (filter by model_type: "speech").

#ElevenLabs v3 (default)

elevenlabs/v3 accepts these 26 voices (case-sensitive). Any value outside this list is rejected with 400 Unsupported voice and you are not charged. Audio tags like [laughs] are stripped server-side before synthesis (not spoken).

#Female voices

| Voice ID | Description |
| --- | --- |
| Rachel | Calm, articulate American |
| Aria | Expressive, raspy American |
| Domi | Strong, confident young American |
| Sarah | Soft, warm young American |
| Jane | Mature, dignified Australian |
| Juniper | Natural, articulate American |
| Arabella | Mysterious British narrator |
| Hope | Bright, upbeat American |
| Blondie | Casual American conversationalist |
| Priyanka | Sultry, soothing Indian |
| Alexandra | Conversational young American |
| Monika | Deep, natural Indian |

#Male voices

| Voice ID | Description |
| --- | --- |
| Drew | Well-rounded American narrator |
| Clyde | Gritty war-veteran character |
| Paul | Authoritative ground reporter |
| Dave | Conversational young British |
| Roger | Classy American businessman |
| Fin | Sailor character, Irish accent |
| James | Calm Australian narrator |
| Bradford | Theatrical, articulate British |
| Reginald | Intense, dramatic British character |
| Gaming | Animated, energetic gaming character |
| Austin | Easygoing American country |
| Kuon | Cheerful, steady character voice |
| Mark | Casual, relaxed American |
| Grimblewood | Gruff fantasy creature character |
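
Because voice IDs are case-sensitive and an unknown value is rejected with 400 Unsupported voice, it can help to check the voice locally before calling the tool. The following is a minimal sketch, not part of the API: `check_elevenlabs_voice` is a hypothetical helper name, and the set is simply copied from the two tables above, so it can drift from the live catalog returned by get_models.

```python
# Voice IDs copied from the female and male tables above (assumed current).
ELEVENLABS_VOICES = frozenset({
    "Rachel", "Aria", "Domi", "Sarah", "Jane", "Juniper", "Arabella",
    "Hope", "Blondie", "Priyanka", "Alexandra", "Monika",
    "Drew", "Clyde", "Paul", "Dave", "Roger", "Fin", "James", "Bradford",
    "Reginald", "Gaming", "Austin", "Kuon", "Mark", "Grimblewood",
})


def check_elevenlabs_voice(voice: str) -> str:
    """Return the voice unchanged if it is listed, otherwise raise ValueError.

    Hypothetical client-side helper; the server remains the source of truth.
    """
    if voice in ELEVENLABS_VOICES:
        return voice
    # Matching is case-sensitive, so suggest the listed spelling
    # when only the casing differs.
    by_lower = {v.lower(): v for v in ELEVENLABS_VOICES}
    hint = by_lower.get(voice.lower())
    suffix = f" (did you mean {hint!r}?)" if hint else ""
    raise ValueError(f"Unsupported voice: {voice!r}{suffix}")
```

For example, `check_elevenlabs_voice("rachel")` raises an error that suggests `'Rachel'`, which avoids a round trip that would end in a 400.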

#Google Gemini 3.1 Flash TTS

Pass provider: "google". Gemini adds a natural-language style prompt (set the tone, pace, accent, emotion, or a character) and performs inline [tags] instead of stripping them. It accepts 30 voices (case-sensitive) and BCP-47 language codes.

Supported inline tags include [sigh], [laughing], [uhm], [whispering], [shouting], [sarcasm], [robotic], [extremely fast], [short pause] / [medium pause] / [long pause], and free-form descriptive tags such as [like dracula] or [excitedly].

#Female voices

| Voice ID | Character |
| --- | --- |
| Kore | Firm |
| Zephyr | Bright |
| Leda | Youthful |
| Aoede | Breezy |
| Callirrhoe | Easy-going |
| Autonoe | Bright |
| Despina | Smooth |
| Erinome | Clear |
| Laomedeia | Upbeat |
| Achernar | Soft |
| Gacrux | Mature |
| Pulcherrima | Forward |
| Vindemiatrix | Gentle |
| Sulafat | Warm |

#Male voices

| Voice ID | Character |
| --- | --- |
| Puck | Upbeat |
| Charon | Informative |
| Fenrir | Excitable |
| Orus | Firm |
| Enceladus | Breathy |
| Iapetus | Clear |
| Umbriel | Easy-going |
| Algenib | Gravelly |
| Algieba | Smooth |
| Schedar | Even |
| Achird | Friendly |
| Zubenelgenubi | Casual |
| Sadachbia | Lively |
| Sadaltager | Knowledgeable |
| Alnilam | Firm |
| Rasalgethi | Informative |

Languages: BCP-47 codes such as en-US, en-GB, es-ES, es-MX, pt-BR, fr-FR, de-DE, it-IT, ja-JP, ko-KR, hi-IN, ar-001, ru-RU, ro-RO, tr-TR, vi-VN, and th-TH (70+ languages in total), or auto to auto-detect. The default is en-US. Call get_limits_for_model('text-to-dialogue-gemini') for the full list.

#generate_dialogue

Generate a single-voice TTS clip.

Parameters:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `text` (or `prompt`) | string | Yes | ElevenLabs: 5–5,000 characters (after `[bracket]` tags are stripped). Google: up to 4,000 bytes (UTF-8), inline `[tags]` kept. |
| `provider` | string | No | `elevenlabs` (default) or `google`. |
| `voice` | string | No | A voice ID for the chosen provider. Default: `Rachel` (ElevenLabs) / `Kore` (Google). |
| `style_prompt` | string | No | Google only. Natural-language delivery direction (tone, pace, accent, emotion, character). Up to 4,000 bytes; `text` + `style_prompt` must be ≤ 8,000 bytes combined. Default: `Say the following.` |
| `language_code` | string | No | ElevenLabs: ISO code (default `en`; 29 supported). Google: BCP-47 code (default `en-US`) or `auto`. |
| `stability` | number | No | ElevenLabs only. `0..1`, default `0.5`. Higher = more stable, lower = more expressive. |
| `similarity_boost` | number | No | ElevenLabs only. `0..1`, default `0.75`. |
| `style` | number | No | ElevenLabs only. `0..1`, default `0`. Style exaggeration. |
| `speed` | number | No | ElevenLabs only. `0.7..1.2`, default `1.0`. |
| `previous_text` / `next_text` | string | No | ElevenLabs only. Surrounding context to keep prosody consistent across stitched chunks. |
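
The Google limits are expressed in UTF-8 bytes, not characters, so non-ASCII text reaches the limit sooner than its character count suggests. The sketch below shows one way to pre-check a request on the client. `check_google_sizes` is a hypothetical helper, and it assumes the server counts bytes exactly as Python's `str.encode("utf-8")` does.

```python
MAX_TEXT_BYTES = 4000      # limit for `text` (Google)
MAX_STYLE_BYTES = 4000     # limit for `style_prompt`
MAX_COMBINED_BYTES = 8000  # limit for `text` + `style_prompt`
DEFAULT_STYLE_PROMPT = "Say the following."


def check_google_sizes(text: str, style_prompt: str = DEFAULT_STYLE_PROMPT) -> list:
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical client-side check; the server may count differently.
    """
    text_bytes = len(text.encode("utf-8"))
    style_bytes = len(style_prompt.encode("utf-8"))
    problems = []
    if text_bytes > MAX_TEXT_BYTES:
        problems.append(f"text is {text_bytes} bytes (max {MAX_TEXT_BYTES})")
    if style_bytes > MAX_STYLE_BYTES:
        problems.append(f"style_prompt is {style_bytes} bytes (max {MAX_STYLE_BYTES})")
    if text_bytes + style_bytes > MAX_COMBINED_BYTES:
        problems.append(
            f"text + style_prompt is {text_bytes + style_bytes} bytes "
            f"(max {MAX_COMBINED_BYTES})"
        )
    return problems
```

A string of 2,001 "é" characters, for instance, encodes to 4,002 bytes and would be flagged even though it is well under 4,000 characters.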

ElevenLabs strips [HEY] / [laughs] / [whispers] before synthesis. Google performs inline tags — see the tag list above.
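
To estimate whether ElevenLabs text will pass the 5–5,000 character check after tag stripping, the stripping can be approximated locally. This is only a sketch: the regular expression and the whitespace collapsing are assumptions about the server-side behaviour, and both function names are hypothetical.

```python
import re

# Matches a single [bracketed] tag with no nested brackets (assumed tag shape).
TAG_PATTERN = re.compile(r"\[[^\[\]]*\]")


def strip_bracket_tags(text: str) -> str:
    """Remove [bracket] tags, then collapse the whitespace left behind.

    The whitespace handling is an assumption, not documented behaviour.
    """
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def elevenlabs_length_ok(text: str) -> bool:
    """Check the documented 5..5000 character range on the stripped text."""
    return 5 <= len(strip_bracket_tags(text)) <= 5000
```

With this approximation, `"[laughs] Hi"` fails the check because only two characters remain once the tag is removed.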

Example (Google Gemini):

{
  "provider": "google",
  "text": "[whispering] I have a secret. [laughing] Just kidding!",
  "voice": "Callirrhoe",
  "style_prompt": "Speak playfully, like sharing a fun secret with a friend.",
  "language_code": "en-US"
}

Example (ElevenLabs):

{
  "text": "Welcome back, ready to generate?",
  "voice": "Rachel",
  "stability": 0.5,
  "language_code": "en"
}

Response: Returns a generation_id. Poll with get_generation_status.

#Multi-voice scenes

generate_dialogue is single-voice. For dialogue between two speakers, call the tool once per line (ElevenLabs: pass previous_text / next_text for prosody continuity), then concatenate the resulting audio files yourself.
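
The per-line approach can be sketched as a function that turns an ordered script into one generate_dialogue payload per line. `build_scene_requests` is a hypothetical helper, and using the neighbouring lines of the script as previous_text / next_text is an assumption about what works well as context, not a documented requirement.

```python
def build_scene_requests(lines: list) -> list:
    """Build one ElevenLabs payload per (voice, text) pair, in script order.

    Each payload carries the neighbouring lines as previous_text / next_text
    where they exist (an assumed way to supply prosody context).
    """
    requests = []
    for i, (voice, text) in enumerate(lines):
        payload = {"provider": "elevenlabs", "voice": voice, "text": text}
        if i > 0:
            payload["previous_text"] = lines[i - 1][1]
        if i < len(lines) - 1:
            payload["next_text"] = lines[i + 1][1]
        requests.append(payload)
    return requests


scene = [
    ("Rachel", "Did you finish the report?"),
    ("Drew", "Almost. Give me ten more minutes."),
    ("Rachel", "Fine, but not a second longer."),
]
payloads = build_scene_requests(scene)
```

Each payload is then sent as a separate generate_dialogue call, and the resulting audio files are concatenated in the same order.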

#get_generation_status

Use the generation_id returned by generate_dialogue to check progress. When the status is completed, the audio file URL is in the outputs array (media_type: "audio").
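
A polling loop around get_generation_status might look like the sketch below. Several details are assumptions rather than documented behaviour: the status call is passed in as a plain callable (`get_status`), the response is treated as a dict with `status` and `outputs` keys, the audio URL is read from a `url` field, and a `failed` status is assumed to exist.

```python
import time
from typing import Callable


def wait_for_audio_url(
    generation_id: str,
    get_status: Callable[[str], dict],
    interval_s: float = 2.0,
    timeout_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the generation completes, then return the audio URL.

    `get_status` is a hypothetical wrapper around get_generation_status.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = get_status(generation_id)
        status = result.get("status")
        if status == "completed":
            for output in result.get("outputs", []):
                if output.get("media_type") == "audio":
                    return output["url"]  # field name is an assumption
            raise RuntimeError("Generation completed without an audio output")
        if status == "failed":  # assumed failure status
            raise RuntimeError(f"Generation {generation_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Generation {generation_id} did not complete in time")
        sleep(interval_s)
```

Injecting `get_status` and `sleep` keeps the loop independent of any particular client and makes it easy to exercise with canned responses.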

#Credits and limits

Cost: 26 credits per 1,000 characters for both providers. A fractional result is rounded down when its fractional part is ≤ 0.3 and rounded up when it is > 0.3.
Minimum: 1 credit for any non-empty text.
Minimum length: 5 characters after audio tags are stripped.
Maximum length: ElevenLabs 5,000 characters; Google 4,000 bytes for text (and ≤ 8,000 bytes for text + style_prompt combined).
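
The pricing rules above can be worked through in a few lines. This is a sketch of the arithmetic as described here, not the billing implementation: `dialogue_credits` is a hypothetical name, and it assumes the billed count is a plain character count supplied by the caller (which count the server actually bills, for example after tag stripping, is not specified in this section).

```python
def dialogue_credits(char_count: int) -> int:
    """Estimate credits for a clip: 26 credits per 1,000 characters.

    Integer arithmetic avoids floating-point error at the 0.3 boundary.
    """
    if char_count <= 0:
        return 0  # the 1-credit minimum applies to non-empty text only
    # 26 * chars / 1000, split into whole credits and thousandths of a credit.
    whole, thousandths = divmod(26 * char_count, 1000)
    # Fractional part <= 0.3 floors, > 0.3 ceils.
    credits = whole if thousandths <= 300 else whole + 1
    return max(1, credits)
```

For example, 50 characters give 1.3 credits, which rounds down to 1, while 51 characters give 1.326 credits, which rounds up to 2. A 10-character clip computes to 0.26 credits and is billed at the 1-credit minimum.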

See Limitations for full details.