Chatterbox Turbo

by Resemble AI • MIT License • 350M Parameters

An open-source TTS model with features that many proprietary services lack: voice cloning from 5 seconds of audio, a continuous emotion slider, and paralinguistic tags that make AI speech sound human. Laughs, sighs, coughs, and all.

Voice Cloning · Open Source · Emotion Control · Paralinguistic Tags
Pre-made Voices: 20
First-Chunk Latency: <150ms
Languages: 23
API Cost: $0.025/1K characters

Paralinguistic Tags

Insert natural vocal reactions directly into your text. Chatterbox generates these sounds in the cloned or selected voice — no post-processing, no splicing, no manual editing.

[laugh]

A natural, spontaneous laugh

[chuckle]

A soft, quiet laugh

[cough]

A realistic cough sound

[sigh]

An exhaled breath expressing emotion

[gasp]

A sharp intake of breath, surprise

[groan]

A low vocal expression of discomfort

[sniff]

A nasal inhalation

[clear throat]

A throat-clearing sound

[shush]

A hushing or shushing sound

Example Prompts

Storytelling with laughter

"And then she looked at me and said, 'You forgot the cake?' [laugh] I couldn't believe it either."

Emotional narration

[sigh] It had been a long journey. But standing at the top of the mountain, looking out at the world below, every step felt worth it.

Conversational reaction

[gasp] Wait, you're telling me the entire server went down during the demo? [chuckle] That's exactly what happened to us last year.

Professional with natural pauses

[clear throat] Good morning, everyone. Today I'd like to share some findings that I think will change how we approach this problem.
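
Because tags are plain text, a misspelled tag is easy to miss before a request is sent. The sketch below is a hypothetical pre-flight check, not part of Chatterbox: it pulls bracketed tags out of a prompt and reports any that are not in a known set. The names `SUPPORTED_TAGS`, `find_tags`, and `unknown_tags` are illustrative, and it assumes tags are lowercase and bracketed exactly as listed above. The set holds eight of the nine tags; add the hushing tag with the exact spelling your model version accepts.

```python
import re

# Tags copied from the list above. Assumption: the model expects them
# lowercase and in square brackets, exactly as written there.
SUPPORTED_TAGS = {
    "laugh", "chuckle", "cough", "sigh",
    "gasp", "groan", "sniff", "clear throat",
}

# Matches short bracketed runs of letters and spaces, e.g. "[clear throat]".
TAG_PATTERN = re.compile(r"\[([A-Za-z ]{1,20})\]")


def find_tags(text):
    """Return every bracketed tag in the text, in order of appearance."""
    return [match.group(1) for match in TAG_PATTERN.finditer(text)]


def unknown_tags(text, supported=SUPPORTED_TAGS):
    """Return the tags that are not in the supported set (likely typos)."""
    return [tag for tag in find_tags(text) if tag not in supported]


prompt = "[gasp] Wait, the entire server went down? [chuckle] Same here."
print(find_tags(prompt))
print(unknown_tags("[sigh] Long day. [giggle] Anyway."))
```

An empty result from `unknown_tags` means every tag in the prompt is in the known set; anything else is worth fixing before generating audio.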

Emotion Exaggeration Control

A single continuous parameter controls how expressive the voice is. Same text, same voice — completely different delivery. No other open-source model offers this level of control.

Low (0.1 – 0.3)

Monotone, neutral delivery. Best for IVR systems, automated announcements, and clinical readouts where personality should stay minimal.

Natural (0.6 – 0.9)

Balanced, conversational expression. The default range for narration, podcasts, tutorials, and general-purpose content.

High (1.2 – 1.6)

Expressive and animated. Ideal for storytelling, audio dramas, advertisements, and content that needs to grab attention.

Dramatic (1.7 – 2.0)

Maximum expression exaggeration. Best for character voices, comedic content, trailers, and highly theatrical delivery.
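
The four ranges above can be kept in one place so application code refers to a preset name rather than a raw number. This is a minimal sketch under stated assumptions: the preset names and the helpers `preset_midpoint` and `classify` are hypothetical, and the page does not say how the value is passed to the model, so the sketch stops at choosing a number. Note that the listed ranges leave gaps (for example 0.3 to 0.6), for which `classify` returns `None`.

```python
# Ranges copied from the four presets above: (low, high), both inclusive.
PRESETS = {
    "low": (0.1, 0.3),
    "natural": (0.6, 0.9),
    "high": (1.2, 1.6),
    "dramatic": (1.7, 2.0),
}


def preset_midpoint(name):
    """Return the middle of a preset's range as a reasonable starting value."""
    low, high = PRESETS[name]
    return round((low + high) / 2, 2)


def classify(value):
    """Return the preset whose range contains value, or None if it is in a gap."""
    for name, (low, high) in PRESETS.items():
        if low <= value <= high:
            return name
    return None


print(preset_midpoint("natural"))
print(classify(1.3))
print(classify(0.45))
```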

Zero-Shot Voice Cloning

Clone any voice from a 5-second audio sample. No training, no fine-tuning, no waiting. The model captures pitch, tone, cadence, and vocal texture in a single forward pass.

1. Capture

Provide 5+ seconds of reference audio — a recording, an upload, or a microphone capture.

2. Analyze

Chatterbox extracts a vocal fingerprint: pitch range, tonal quality, speaking pace, and unique characteristics.

3. Generate

Type any text and Chatterbox generates speech in the cloned voice. Emotion control and paralinguistic tags work with cloned voices too.
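
Before the capture step hands a clip to the model, it can help to confirm the clip meets the 5-second minimum. The sketch below uses only Python's standard `wave` module and assumes the reference clip is an uncompressed WAV file; the helper names are hypothetical, and `write_silence` exists only to make the demo self-contained.

```python
import os
import tempfile
import wave

MIN_REFERENCE_SECONDS = 5.0  # minimum reference length stated above


def wav_duration_seconds(path):
    """Duration of a WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path, minimum=MIN_REFERENCE_SECONDS):
    """True if the clip is at least `minimum` seconds long."""
    return wav_duration_seconds(path) >= minimum


def write_silence(path, seconds, sample_rate=16000):
    """Write mono 16-bit silence; stands in for a real recording in this demo."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))


with tempfile.TemporaryDirectory() as tmp:
    short_clip = os.path.join(tmp, "short.wav")
    long_clip = os.path.join(tmp, "long.wav")
    write_silence(short_clip, 3)
    write_silence(long_clip, 6)
    short_ok = is_long_enough(short_clip)
    long_ok = is_long_enough(long_clip)

print(short_ok, long_ok)
```

Duration is only a lower bound on usefulness; a clip that passes this check can still be too noisy or contain too little speech to clone well.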

Safety: PerTh Neural Watermark

Every generated audio file includes an imperceptible watermark that survives MP3 compression and editing. This enables detection of synthetic speech for deepfake prevention and content provenance.

All 20 Pre-Made Voices

Each voice has a distinct personality and style.

Aaron
Professional · male

Clear and professional male voice with confident delivery. Great for corporate content and presentations.

Abigail
Warm · female

Warm and approachable female voice with natural tone. Perfect for narration and conversational content.

Anaya
Expressive · female

Dynamic and expressive female voice with rich emotional range. Ideal for storytelling and creative content.

Andy
Friendly · male

Friendly and versatile male voice. The default Chatterbox voice, great for general-purpose TTS.

Archer
Authoritative · male

Strong and authoritative male voice with depth. Suited for documentaries and serious narration.

Brian
Casual · male

Laid-back and casual male voice with easygoing delivery. Works well for podcasts and informal content.

Chloe
Bright · female

Bright and energetic female voice with youthful enthusiasm. Great for social media and upbeat content.

Dylan
Narrative · male

Smooth narrative male voice with engaging delivery. Excellent for audiobooks and long-form content.

Emmanuel
Deep · male

Deep and resonant male voice with gravitas. Perfect for trailers, promos, and dramatic content.

Ethan
Conversational · male

Natural and conversational male voice. Ideal for tutorials, explainers, and casual narration.

Evelyn
Elegant · female

Sophisticated and elegant female voice with refined delivery. Great for luxury brands and premium content.

Gavin
Energetic · male

Energetic and dynamic male voice with enthusiasm. Perfect for gaming, sports, and high-energy content.

Gordon
Mature · male

Mature and seasoned male voice with wisdom. Suited for documentaries, history, and educational content.

Ivan
Commanding · male

Commanding male voice with presence and power. Ideal for announcements and authoritative content.

Laura
Soothing · female

Soothing and calming female voice with gentle delivery. Perfect for meditation, wellness, and ASMR.

Lucy
Lively · female

Lively and charismatic female voice with personality. Great for entertainment, ads, and engaging content.

Madison
Clear · female

Clear and articulate female voice with precision. Ideal for educational content and instructions.

Marisol
Vibrant · female

Vibrant and passionate female voice with energy. Works well for creative projects and storytelling.

Meera
Thoughtful · female

Thoughtful and measured female voice with intelligence. Perfect for tech content, science, and analysis.

Walter
Classic · male

Classic announcer-style male voice with polish. Ideal for commercials, intros, and professional voiceovers.

Use Cases

Where Chatterbox Turbo's unique capabilities create the most value.

Voice Cloning for Content Creators

Clone your own voice and scale content production without re-recording. Podcasters, YouTubers, and course creators use Chatterbox to generate drafts, localize content, or produce variations of their voice at any time.

  • 5-second voice capture
  • No training or fine-tuning
  • Preserves vocal identity

Emotion-Rich Audiobooks & Audio Dramas

The emotion exaggeration slider lets narrators dial expression from subtle to theatrical. Combined with paralinguistic tags, Chatterbox produces audiobooks and dramas that feel performed, not generated.

  • Continuous emotion control
  • Character differentiation
  • Natural breathing and pauses

Voice Agents & Conversational AI

Sub-150ms first-chunk latency makes Chatterbox viable for real-time voice agents. The paralinguistic tags add human-like reactions: an agent that can [chuckle] at a joke or [sigh] with empathy feels fundamentally different from flat TTS.

  • <150ms first-chunk latency
  • Real-time streaming
  • Natural vocal reactions

Accessible Open-Source Development

MIT license means full model weights, free commercial use, and no vendor lock-in. Run it on your own GPU, modify the architecture, fine-tune on your data — complete freedom.

  • MIT license
  • 350M parameters
  • Self-hostable on consumer GPUs

How Chatterbox Compares

Feature comparison against leading TTS services.

Feature | Chatterbox Turbo | ElevenLabs | OpenAI TTS
License | MIT (fully open) | Proprietary | Proprietary
Voice cloning | 5s zero-shot | Instant + Professional | Not available
Emotion control | Continuous slider | Style presets | Not available
Paralinguistic tags | 9 tags (laugh, sigh, etc.) | Not available | Not available
Pre-made voices | 20 | 1000+ | 9
Languages | 23 | 29 | 57
Latency | <150ms | ~300ms | ~500ms
Self-hostable | Yes (GPU required) | No | No
Cost (API) | $0.025/1K chars | $0.30/1K chars | $0.015/1K chars
Neural watermark | PerTh built-in | Optional | Not available

Pricing

Two ways to run Chatterbox Turbo — cloud API or self-hosted.

Replicate API
$0.025
per 1K input characters
  • + Pay-as-you-go
  • + No subscription
  • + Voice cloning included
Self-hosted
Free
MIT license
  • + Full model weights
  • + Commercial use
  • + No API costs

Cost example: A 200-word blog post (~1,000 characters) costs about $0.025 to generate. A full audiobook chapter (~5,000 characters) costs about $0.13.
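
The arithmetic behind the cost example can be sketched as follows. It assumes strictly linear pricing at the quoted Replicate rate with no minimum charge or rounding per request, which the page does not confirm; `estimate_cost` is a hypothetical helper.

```python
PRICE_PER_1K_CHARS = 0.025  # USD, the Replicate price quoted above


def estimate_cost(num_chars, price_per_1k=PRICE_PER_1K_CHARS):
    """Estimated USD cost for num_chars input characters (linear pricing assumed)."""
    return num_chars / 1000 * price_per_1k


for label, chars in [("blog post", 1000), ("audiobook chapter", 5000)]:
    print(f"{label}: ${estimate_cost(chars):.3f}")
```

The 5,000-character chapter comes to $0.125, which the example above rounds to $0.13.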

Pros

  • + MIT licensed: fully open source and free to self-host
  • + Voice cloning from just 5 seconds of audio
  • + Natural paralinguistic sounds (laughs, coughs, sighs)
  • + Emotion intensity slider from monotone to dramatic
  • + Very low cost via API ($0.025/1K chars)
  • + 350M params, runs on modest hardware

Cons

  • - Newer model, smaller community than ElevenLabs
  • - English-primary (other languages expanding)
  • - No built-in editor or project management
  • - Self-hosting requires GPU

Technical Specifications

Model Size: 350M parameters
Architecture: Streaming encoder-decoder transformer
First-Chunk Latency: <150ms
Real-Time Factor: 0.499 on RTX 4090
Output Format: WAV
Voice Clone Input: 5+ seconds of reference audio
Max Input Length: 500 characters per request
Emotion Control: Temperature 0.05 – 2.0 (continuous)
Safety: PerTh neural watermark (imperceptible)
License: MIT (free commercial use)
Source Code: github.com/resemble-ai/chatterbox
Available On: HuggingFace, Replicate, Fal, RunPod, Modal
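
With a 500-character limit per request, longer scripts have to be split before synthesis. The sketch below is one possible approach, not a Chatterbox feature: it greedily packs whole sentences into chunks of at most 500 characters and hard-splits any single sentence that exceeds the limit. The function name `split_for_tts` and the simple sentence-boundary rule (split after `.`, `!`, or `?`) are assumptions.

```python
import re

MAX_CHARS = 500  # per-request limit from the specifications above


def split_for_tts(text, max_chars=MAX_CHARS):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Fallback: hard-split a single sentence that is itself over the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:].lstrip()
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


script = "Hello there. " * 100
chunks = split_for_tts(script)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Each chunk can then be sent as its own request and the resulting audio joined in order. The hard-split fallback can cut through a word or a paralinguistic tag, so very long sentences are better broken up by hand.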

Frequently Asked Questions

What is Chatterbox Turbo?

Chatterbox Turbo is an open-source text-to-speech model by Resemble AI with 350 million parameters. It delivers natural speech with zero-shot voice cloning, emotion exaggeration control, and paralinguistic tags — all under an MIT license.

How does zero-shot voice cloning work in Chatterbox?

Chatterbox Turbo can clone any voice from just 5 seconds of reference audio. It analyzes the vocal characteristics — pitch, tone, cadence — and generates new speech matching that voice with no fine-tuning or training required.

What are paralinguistic tags in Chatterbox Turbo?

Paralinguistic tags are text markers like [laugh], [cough], [sigh], and [chuckle] that you insert into your text. The model generates natural-sounding vocal reactions at those points, adding realism that most TTS models cannot achieve.

How does emotion control work?

Chatterbox Turbo includes an expression exaggeration parameter that controls emotional intensity on a continuous scale. Set it low for monotone, neutral delivery or high for dramatically expressive speech — all from the same text and voice.

Is Chatterbox Turbo free to use?

The model is MIT licensed — completely free to download, modify, and use commercially. Self-hosting requires a GPU. Cloud APIs like Replicate charge approximately $0.025 per 1,000 input characters.

How does Chatterbox Turbo compare to ElevenLabs?

Chatterbox Turbo is fully open source (MIT) while ElevenLabs is proprietary. Chatterbox offers unique paralinguistic tags and emotion control, while ElevenLabs provides a larger voice library and more mature platform. Chatterbox is significantly cheaper, especially self-hosted.

What languages does Chatterbox Turbo support?

Chatterbox Turbo supports 23 languages including English, Spanish, French, German, Japanese, Chinese, Arabic, and more. English has the strongest support with all 20 pre-made voices optimized for it.

What is the PerTh neural watermark?

Every audio file generated by Chatterbox includes Resemble AI's PerTh (Perceptual Threshold) watermark — an imperceptible neural signature that survives compression and editing, enabling detection of AI-generated speech for safety and provenance.
