Chatterbox Turbo: Free Voice Cloning TTS — 20 Voices, MIT License

Q: What is Chatterbox Turbo?

Chatterbox Turbo is an open-source text-to-speech model by Resemble AI with 350 million parameters. It delivers natural speech with zero-shot voice cloning, emotion exaggeration control, and paralinguistic tags — all under an MIT license.

Q: How does zero-shot voice cloning work in Chatterbox?

Chatterbox Turbo can clone any voice from just 5 seconds of reference audio. It analyzes the vocal characteristics — pitch, tone, cadence — and generates new speech matching that voice with no fine-tuning or training required.

Q: What are paralinguistic tags in Chatterbox Turbo?

Paralinguistic tags are text markers like [laugh], [cough], [sigh], and [chuckle] that you insert into your text. The model generates natural-sounding vocal reactions at those points, adding realism that most TTS models cannot achieve.

Q: How does emotion control work?

Chatterbox Turbo includes an expression exaggeration parameter that controls emotional intensity on a continuous scale. Set it low for monotone, neutral delivery or high for dramatically expressive speech — all from the same text and voice.

Q: Is Chatterbox Turbo free to use?

The model is MIT licensed — completely free to download, modify, and use commercially. Self-hosting requires a GPU. Cloud APIs like Replicate charge approximately $0.025 per 1,000 input characters.

Q: How does Chatterbox Turbo compare to ElevenLabs?

Chatterbox Turbo is fully open source (MIT), while ElevenLabs is proprietary. Chatterbox offers paralinguistic tags and continuous emotion control; ElevenLabs provides a larger voice library and a more mature platform. Through an API, Chatterbox is listed at $0.025 per 1K characters versus $0.30 for ElevenLabs, and self-hosting removes API costs entirely.

Q: What languages does Chatterbox Turbo support?

Chatterbox Turbo supports 23 languages including English, Spanish, French, German, Japanese, Chinese, Arabic, and more. English has the strongest support with all 20 pre-made voices optimized for it.

Q: What is the PerTh neural watermark?

Every audio file generated by Chatterbox includes Resemble AI's PerTh (Perceptual Threshold) watermark — an imperceptible neural signature that survives compression and editing, enabling detection of AI-generated speech for safety and provenance.

Pre-made Voices: 20
First-Chunk Latency: <150ms
Languages: 23
API Cost: $0.025/1K characters

Paralinguistic Tags

Insert natural vocal reactions directly into your text. Chatterbox generates these sounds in the cloned or selected voice — no post-processing, no splicing, no manual editing.

[laugh]

A natural, spontaneous laugh

[chuckle]

A soft, quiet laugh

[cough]

A realistic cough sound

[sigh]

An exhaled breath expressing emotion

[gasp]

A sharp intake of breath, surprise

[groan]

A low vocal expression of discomfort

[sniff]

A nasal inhalation

[clear throat]

A throat-clearing sound

[shush]

A hushing or shushing sound

Example Prompts

Storytelling with laughter

"And then she looked at me and said, 'You forgot the cake?' [laugh] I couldn't believe it either."

Emotional narration

[sigh] It had been a long journey. But standing at the top of the mountain, looking out at the world below, every step felt worth it.

Conversational reaction

[gasp] Wait, you're telling me the entire server went down during the demo? [chuckle] That's exactly what happened to us last year.

Professional with natural pauses

[clear throat] Good morning, everyone. Today I'd like to share some findings that I think will change how we approach this problem.
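Prompts like the ones above are easy to mistype, so it can help to check the bracketed tags before sending text to the model. The following is a minimal sketch in plain Python, assuming the nine tags listed on this page are the full supported set and that the hushing tag is spelled "shush". The function names are illustrative and are not part of Chatterbox.

```python
import re

# The nine tags listed on this page. Treat this set as an assumption and
# check it against the model version you actually run.
# Spelling of the hushing tag is assumed to be "shush".
SUPPORTED_TAGS = {
    "laugh", "chuckle", "cough", "sigh", "gasp",
    "groan", "sniff", "clear throat", "shush",
}

# Matches anything written between one pair of square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(text: str) -> list[str]:
    """Return every bracketed tag in the text, in order of appearance."""
    return [m.group(1).strip().lower() for m in TAG_PATTERN.finditer(text)]


def unknown_tags(text: str) -> list[str]:
    """Return bracketed tags that are not in SUPPORTED_TAGS."""
    return [tag for tag in find_tags(text) if tag not in SUPPORTED_TAGS]
```

Calling `unknown_tags(prompt)` before synthesis gives a list of tags to fix; an empty list means every tag in the prompt is one of the nine named here.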

Emotion Exaggeration Control

A single continuous parameter controls how expressive the voice is. Same text, same voice — completely different delivery. No other open-source model offers this level of control.

Low (0.1 – 0.3)

Monotone, neutral delivery. Best for IVR systems, automated announcements, and clinical readouts where personality should stay minimal.

Natural (0.6 – 0.9)

Balanced, conversational expression. The default range for narration, podcasts, tutorials, and general-purpose content.

High (1.2 – 1.6)

Expressive and animated. Ideal for storytelling, audio dramas, advertisements, and content that needs to grab attention.

Dramatic (1.7 – 2.0)

Maximum expression exaggeration. Best for character voices, comedic content, trailers, and highly theatrical delivery.
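The four ranges above can be written down as a small lookup, which is handy when a UI slider or config value needs a label. This is only a sketch: the band edges are copied from this page, values that fall between bands (such as 0.4 or 1.0) are not named here, and the helper names are hypothetical rather than part of any Chatterbox API.

```python
# Bands copied from the ranges listed on this page.
BANDS = [
    ("Low", 0.1, 0.3),
    ("Natural", 0.6, 0.9),
    ("High", 1.2, 1.6),
    ("Dramatic", 1.7, 2.0),
]


def describe_exaggeration(value: float) -> str:
    """Return the band name for a value, or "unlabeled" if it falls in a gap."""
    for name, low, high in BANDS:
        if low <= value <= high:
            return name
    return "unlabeled"


def clamp_exaggeration(value: float, low: float = 0.1, high: float = 2.0) -> float:
    """Clamp a value into the overall range covered by the bands."""
    return max(low, min(high, value))
```

For example, `describe_exaggeration(0.75)` falls in the Natural band, while `clamp_exaggeration(3.0)` is pulled back to 2.0 before it is passed on.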

Zero-Shot Voice Cloning

Clone any voice from a 5-second audio sample. No training, no fine-tuning, no waiting. The model captures pitch, tone, cadence, and vocal texture in a single forward pass.

1. Capture

Provide 5+ seconds of reference audio — a recording, an upload, or a microphone capture.

2. Analyze

Chatterbox extracts a vocal fingerprint: pitch range, tonal quality, speaking pace, and unique characteristics.

3. Generate

Type any text and Chatterbox generates speech in the cloned voice. Emotion control and paralinguistic tags work with cloned voices too.
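The Capture step asks for at least 5 seconds of reference audio. As a rough pre-flight check, the sketch below measures the length of an uncompressed PCM WAV file with Python's standard `wave` module. It assumes the reference clip is a WAV file; the function names are illustrative and not part of Chatterbox.

```python
import wave

MIN_REFERENCE_SECONDS = 5.0  # minimum stated on this page


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """Return True if the clip meets the minimum reference length."""
    return wav_duration_seconds(path) >= minimum
```

Other formats such as MP3 would need a different reader; the standard `wave` module handles uncompressed WAV only.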

Safety: PerTh Neural Watermark

Every generated audio file includes an imperceptible watermark that survives MP3 compression and editing. This enables detection of synthetic speech for deepfake prevention and content provenance.

All 20 Pre-Made Voices

Each voice has a distinct personality and style.

Aaron

Professional • male

Clear and professional male voice with confident delivery. Great for corporate content and presentations.

Abigail

Warm • female

Warm and approachable female voice with natural tone. Perfect for narration and conversational content.

Anaya

Expressive • female

Dynamic and expressive female voice with rich emotional range. Ideal for storytelling and creative content.

Andy

Friendly • male

Friendly and versatile male voice. The default Chatterbox voice, great for general-purpose TTS.

Archer

Authoritative • male

Strong and authoritative male voice with depth. Suited for documentaries and serious narration.

Brian

Casual • male

Laid-back and casual male voice with easygoing delivery. Works well for podcasts and informal content.

Chloe

Bright • female

Bright and energetic female voice with youthful enthusiasm. Great for social media and upbeat content.

Dylan

Narrative • male

Smooth narrative male voice with engaging delivery. Excellent for audiobooks and long-form content.

Emmanuel

Deep • male

Deep and resonant male voice with gravitas. Perfect for trailers, promos, and dramatic content.

Ethan

Conversational • male

Natural and conversational male voice. Ideal for tutorials, explainers, and casual narration.

Evelyn

Elegant • female

Sophisticated and elegant female voice with refined delivery. Great for luxury brands and premium content.

Gavin

Energetic • male

Energetic and dynamic male voice with enthusiasm. Perfect for gaming, sports, and high-energy content.

Gordon

Mature • male

Mature and seasoned male voice with wisdom. Suited for documentaries, history, and educational content.

Ivan

Commanding • male

Commanding male voice with presence and power. Ideal for announcements and authoritative content.

Laura

Soothing • female

Soothing and calming female voice with gentle delivery. Perfect for meditation, wellness, and ASMR.

Lucy

Lively • female

Lively and charismatic female voice with personality. Great for entertainment, ads, and engaging content.

Madison

Clear • female

Clear and articulate female voice with precision. Ideal for educational content and instructions.

Marisol

Vibrant • female

Vibrant and passionate female voice with energy. Works well for creative projects and storytelling.

Meera

Thoughtful • female

Thoughtful and measured female voice with intelligence. Perfect for tech content, science, and analysis.

Walter

Classic • male

Classic announcer-style male voice with polish. Ideal for commercials, intros, and professional voiceovers.

Use Cases

Where Chatterbox Turbo's unique capabilities create the most value.

Voice Cloning for Content Creators

Clone your own voice and scale content production without re-recording. Podcasters, YouTubers, and course creators use Chatterbox to generate drafts, localize content, or produce variations of their voice at any time.

5-second voice capture • No training or fine-tuning • Preserves vocal identity

Emotion-Rich Audiobooks & Audio Dramas

The emotion exaggeration slider lets narrators dial expression from subtle to theatrical. Combined with paralinguistic tags, Chatterbox produces audiobooks and dramas that feel performed, not generated.

Continuous emotion control • Character differentiation • Natural breathing and pauses

Voice Agents & Conversational AI

Sub-150ms first-chunk latency makes Chatterbox viable for real-time voice agents. The paralinguistic tags add human-like reactions: an agent that can [chuckle] at a joke or [sigh] with empathy feels fundamentally different from flat TTS.

<150ms first-chunk latency • Real-time streaming • Natural vocal reactions

Accessible Open-Source Development

MIT license means full model weights, free commercial use, and no vendor lock-in. Run it on your own GPU, modify the architecture, fine-tune on your data — complete freedom.

MIT license • 350M parameters • Self-hostable on consumer GPUs

How Chatterbox Compares

Feature comparison against leading TTS services.

Feature	Chatterbox Turbo	ElevenLabs	OpenAI TTS
License	MIT (fully open)	Proprietary	Proprietary
Voice cloning	5s zero-shot	Instant + Professional	Not available
Emotion control	Continuous slider	Style presets	Not available
Paralinguistic tags	9 tags (laugh, sigh, etc.)	Not available	Not available
Pre-made voices	20	1000+	9
Languages	23	29	57
Latency	<150ms	~300ms	~500ms
Self-hostable	Yes (GPU required)	No	No
Cost (API)	$0.025/1K chars	$0.30/1K chars	$0.015/1K chars
Neural watermark	PerTh built-in	Optional	Not available

Pricing

Two ways to run Chatterbox Turbo — cloud API or self-hosted.

Replicate API

$0.025

per 1K input characters

+ Pay-as-you-go
+ No subscription
+ Voice cloning included

Self-hosted

Free

MIT license

+ Full model weights
+ Commercial use
+ No API costs

Cost example: A 200-word blog post (~1,000 characters) costs about $0.025 to generate. A full audiobook chapter (~5,000 characters) costs about $0.13.
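The arithmetic behind the cost example is just characters divided by 1,000, times the rate. A small sketch, assuming the $0.025 per 1K characters rate quoted on this page (actual billing rules may differ) and using `Decimal` so that $0.125 rounds up to $0.13 instead of being affected by float rounding:

```python
from decimal import Decimal, ROUND_HALF_UP

# Rate quoted on this page; treat it as an assumption that may change.
RATE_PER_1K_CHARS = Decimal("0.025")


def estimate_cost(num_chars: int, rate_per_1k: Decimal = RATE_PER_1K_CHARS) -> Decimal:
    """Return the unrounded cost in dollars for a given character count."""
    return Decimal(num_chars) / Decimal(1000) * rate_per_1k


def to_cents(amount: Decimal) -> Decimal:
    """Round a dollar amount to two decimal places, halves rounding up."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(estimate_cost(1000))            # 0.025
print(to_cents(estimate_cost(5000)))  # 0.13
```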

Pros

+ MIT licensed: fully open source and free to self-host
+ Voice cloning from just 5 seconds of audio
+ Natural paralinguistic sounds (laughs, coughs, sighs)
+ Emotion intensity slider from monotone to dramatic
+ Very low cost via API ($0.025/1K chars)
+ 350M params: runs on modest hardware

Cons

- Newer model, smaller community than ElevenLabs
- English-primary (other languages expanding)
- No built-in editor or project management
- Self-hosting requires GPU

Technical Specifications

Model Size: 350M parameters
Architecture: Streaming encoder-decoder transformer
First-Chunk Latency: <150ms
Real-Time Factor: 0.499 on RTX 4090
Output Format: WAV
Voice Clone Input: 5+ seconds of reference audio
Max Input Length: 500 characters per request
Emotion Control: Temperature 0.05 – 2.0 (continuous)
Safety: PerTh neural watermark (imperceptible)
License: MIT (free commercial use)
Source Code: github.com/resemble-ai/chatterbox
Available On: HuggingFace, Replicate, Fal, RunPod, Modal
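Because the specifications list a maximum of 500 characters per request, longer scripts have to be split before synthesis. The sketch below is one possible way to do that in plain Python, assuming that breaking at sentence boundaries is acceptable; `split_for_tts` is a hypothetical helper, not a Chatterbox function.

```python
import re

MAX_CHARS = 500  # per-request limit listed in the specifications


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap a single sentence that is itself over the limit.
        # Note: this can cut through a word or a bracketed tag.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request and the resulting audio segments concatenated in order. Splitting on sentence ends keeps pauses in natural places, at the price of chunks that are often shorter than the limit.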
