The Best Open-Source TTS Models Right Now
The gap between open-source and commercial TTS has nearly closed. In 2023, the best open-weight model trailed commercial leaders by 223 ELO points on the Artificial Analysis Speech Arena. By mid-2026, that gap is down to 81 ELO — and in blind tests, individual open-source models now beat ElevenLabs more often than not.
If you just want the quick recommendation: use Qwen3-TTS for the most capable all-rounder (10 languages, voice cloning, natural language direction — all free). Use Kokoro if you need something that runs on a Raspberry Pi with zero GPU. Use Chatterbox if you need voice cloning that beats ElevenLabs in blind tests. Use Fish Audio S2 Pro if you need commercial-grade quality with a cloud API fallback.
Why Open-Source TTS Makes Sense Now (When It Didn't Before)
Three years ago, open-source TTS meant robotic voices and painful setup. That's no longer true. Here's what changed:
- Quality parity: Chatterbox beats ElevenLabs in 65.3% of blind comparisons. Voxtral wins 68.4% of voice cloning comparisons against ElevenLabs Flash v2.5 in Mistral's own evaluation. Fish Audio S2 Pro scored the highest Bradley-Terry coefficient of any model (3.07) in independent A/B tests.
- Cost: ElevenLabs charges $0.18–$0.30 per 1,000 characters depending on plan. Open-source models carry no per-character fees when you run them on your own hardware; you pay only for the hardware and electricity. Even hosted options like Fish Audio's API charge just $15/1M characters.
- Vendor lock-in is real: Play.ht got acquired by Meta and shut down overnight, deleting all user data. Coqui AI shut down its commercial offering. With open-source, your models can't be taken away (see our Play.ht shutdown guide for the cautionary tale).
- Privacy: Local inference means your text never leaves your machine. Critical for medical, legal, and financial applications.
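To make the cost point concrete, here is a minimal sketch that turns the per-character prices quoted in the list above into a monthly bill. The 2M-character workload is a hypothetical figure chosen for illustration, and the function name is ours, not part of any vendor SDK.

```python
# Prices quoted in the list above, normalised to USD per 1M characters.
ELEVENLABS_PER_MILLION = (180.0, 300.0)  # $0.18-$0.30 per 1,000 characters
FISH_API_PER_MILLION = 15.0              # $15 per 1M characters


def monthly_cost(chars_per_month: int, usd_per_million: float) -> float:
    """Return the monthly bill in USD for a given character volume."""
    return chars_per_month / 1_000_000 * usd_per_million


if __name__ == "__main__":
    volume = 2_000_000  # hypothetical workload: 2M characters per month
    low, high = (monthly_cost(volume, p) for p in ELEVENLABS_PER_MILLION)
    fish = monthly_cost(volume, FISH_API_PER_MILLION)
    print(f"ElevenLabs: ${low:,.2f}-${high:,.2f}")
    print(f"Fish Audio API: ${fish:,.2f}")
```

Self-hosting does not appear in the calculation because it has no per-character price; its cost depends on the hardware you already own or rent.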
Open-Source TTS Models Compared (2026)
| Model | Params | License | Languages | Voice Clone | GPU Required |
|---|---|---|---|---|---|
| Qwen3-TTS | 1.7B | Apache 2.0 | 10 | 3 sec | Yes (6GB+) |
| Kokoro | 82M | Apache 2.0 | 1 (EN) | No | No (CPU) |
| Chatterbox | 350M | MIT | 23+ | 5 sec | Yes (4GB+) |
| Fish Audio S2 Pro | 4.4B | Apache 2.0* | 80+ | 10–30 sec | Yes (12GB+) |
| Voxtral | 4B | CC BY-NC 4.0* | 9 | 3 sec | Yes (16GB+) |
| Dia 2 | 1.6B | Apache 2.0 | 1 (EN) | No | Yes (8GB+) |
| CosyVoice2 | 0.5B | Apache 2.0 | 9+ | Yes | Yes (6GB+) |
| Piper | ~15M | MIT | 47 | No | No (CPU) |
* Fish Audio S2 Pro open weights are Apache 2.0 for research; commercial API use requires a paid license. Voxtral CC BY-NC 4.0 is non-commercial; commercial license available from Mistral.
Qwen3-TTS — The New All-Rounder (1.7B, Apache 2.0)
Released January 2026 by Alibaba, Qwen3-TTS is the most capable open-source TTS model available. It speaks 10 languages, clones voices from 3 seconds of reference audio, and — this is the killer feature — accepts natural language voice direction. Instead of SSML tags or phoneme markup, you tell it "speak slowly with a warm, reassuring tone" and it does.
The benchmarks back it up: lowest word error rate in 6 of 10 tested languages compared to ElevenLabs Multilingual v2, and a speaker similarity score of 0.789 for voice cloning. The 97ms inference latency makes it viable for near-real-time applications.
The catch: you need an NVIDIA GPU with at least 6GB VRAM (RTX 3060 12GB recommended). No Mac support, no AMD support. If you don't have compatible hardware, look at Kokoro (runs on CPU) or use Fish Audio's cloud API. Full setup instructions and benchmarks in our Qwen3-TTS review.
Kokoro — Tiny Model, Incredible Quality (82M, Apache 2.0)
Kokoro proves you don't need billions of parameters for great TTS. At 82 million parameters — roughly 300MB on disk — it hit #1 on the TTS Arena and achieves the highest Mean Opinion Score (4.2/5) in its weight class. It runs 96x real-time on GPU and 210x on optimized setups.
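A real-time factor is easier to reason about as wall-clock time. The sketch below assumes "Nx real-time" means N seconds of audio produced per second of compute and that throughput stays constant over long inputs; the helper name is illustrative.

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Wall-clock time needed to synthesise `audio_seconds` of speech."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


one_hour = 3600.0
for factor in (96, 210):
    seconds = synthesis_seconds(one_hour, factor)
    print(f"{factor}x real-time: {seconds:.1f} s per hour of audio")
```

Under those assumptions, an hour of narration takes about 37.5 seconds at 96x and about 17 seconds at 210x.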
The biggest selling point: Kokoro runs on CPU. No GPU, no CUDA, no NVIDIA dependency. It'll run on a $35 Raspberry Pi. This makes it the go-to for embedded systems, edge devices, accessibility tools, and anyone who just wants local TTS without hardware hassles.
Limitations are real: English only, no voice cloning, limited voice selection. If you need multilingual or voice cloning, Qwen3-TTS or Chatterbox are better fits. But for English narration, audiobook generation, and accessibility — Kokoro punches way above its weight. Full analysis in our Kokoro TTS review (which hit Google page 1 in its first week — clearly there's demand for independent reviews of this model).
Chatterbox — Best Voice Cloning, Beats ElevenLabs (350M, MIT)
Chatterbox is the headline act of open-source TTS in 2026. Built by Resemble AI on top of a 0.5B Llama backbone, it won 65.3% of blind tests against ElevenLabs (vs 24.5% for ElevenLabs, 10.2% ties). The voice cloning from just 5 seconds of reference audio is remarkably accurate.
What makes Chatterbox unique: native paralinguistic tags. You can insert [laugh], [cough], [sigh], [gasp] directly in your text and the model produces natural-sounding non-speech audio. No other model — open-source or commercial — handles this as cleanly.
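Because the tags are plain bracketed words inside the script, a typo can easily slip through. The following is a hypothetical pre-flight check, not part of Chatterbox itself: it flags bracketed tags that are not among the four named above. The model may support more tags than this set, so treat `KNOWN_TAGS` as an assumption to adjust.

```python
import re

# Only the tags named in this article; the model's full tag set may differ.
KNOWN_TAGS = {"laugh", "cough", "sigh", "gasp"}
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def unknown_tags(script: str) -> list[str]:
    """Return bracketed tags in `script` that are not in KNOWN_TAGS, in order."""
    return [tag for tag in TAG_PATTERN.findall(script) if tag not in KNOWN_TAGS]


script = "That was close [gasp] but we made it [laugh]. Anyway [yawn], moving on."
print(unknown_tags(script))
```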
The MIT license is among the most permissive on this list: use it commercially, modify it, and redistribute it, provided you keep the copyright and license notice with any copies. For teams that need voice cloning without per-character API costs, Chatterbox is the clear winner. Full review on our Chatterbox page.
Fish Audio S2 Pro — #1 in Blind Tests, Cloud + Self-Host (4.4B, Apache 2.0)
Fish Audio S2 Pro holds the highest Bradley-Terry score of any TTS model: 3.07, evaluated across 71,000 blind A/B comparison pairs. On the Artificial Analysis Arena, its ELO of 1123 makes it the #1 open-weight model overall — just 81 points behind the commercial leader. It wins 61% of its head-to-head matchups.
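Arena ratings and Bradley-Terry coefficients both translate into head-to-head win probabilities. The sketch below assumes the arena uses the conventional Elo curve (base 10, scale 400) and that Bradley-Terry coefficients are on a natural-log scale. Neither detail is confirmed by the sources cited here, and the function names are illustrative.

```python
import math


def elo_win_probability(rating_gap: float) -> float:
    """Expected score of the lower-rated model under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** (rating_gap / 400.0))


def elo_gap_for_win_rate(win_rate: float) -> float:
    """Rating lead implied by a given win rate (inverse of the Elo curve)."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def bradley_terry_win_probability(strength_a: float, strength_b: float) -> float:
    """P(A preferred over B), assuming log-scale Bradley-Terry coefficients."""
    return 1.0 / (1.0 + math.exp(strength_b - strength_a))


for gap in (223, 81):
    print(f"{gap}-point gap: trailing model wins {elo_win_probability(gap):.0%}")
print(f"61% win rate implies a lead of {elo_gap_for_win_rate(0.61):.0f} points")
```

Under these assumptions, a model trailing by 223 points would be expected to win about 22% of comparisons, versus about 39% at an 81-point gap, and a 61% win rate corresponds to a lead of roughly 78 points over the average opponent.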
The dual appeal: you can use the managed API at $15/1M characters (6–11x cheaper than ElevenLabs) or self-host under Apache 2.0. The model supports 80+ languages with a DualAR architecture trained on 300,000+ hours for English and Chinese. Self-hosting breakeven is roughly 50 hours of audio per month on an RTX 4090.
The gotcha: billing is per UTF-8 byte, not per character. For English, 1 character = 1 byte, so $15/1M characters is accurate. For Chinese, Japanese, or Korean, each character is 3 bytes — effectively 3x the cost. We break this down in our Fish Audio pricing guide. For a head-to-head with the commercial leader, see Fish Audio vs ElevenLabs.
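The byte-versus-character difference is easy to check before you send text to the API. This is a minimal sketch using only Python's built-in string encoding; the price constant comes from the $15/1M figure above, and the assumption that billing is strictly proportional to UTF-8 bytes follows the description in this section rather than any official formula.

```python
PRICE_PER_MILLION_BYTES = 15.0  # USD, from the $15/1M figure above


def billed_bytes(text: str) -> int:
    """Number of UTF-8 bytes the text occupies."""
    return len(text.encode("utf-8"))


def estimated_cost(text: str) -> float:
    """Estimated charge in USD if billing is strictly per UTF-8 byte."""
    return billed_bytes(text) / 1_000_000 * PRICE_PER_MILLION_BYTES


for sample in ("Hello, world", "你好，世界", "こんにちは"):
    print(f"{sample!r}: {len(sample)} chars, {billed_bytes(sample)} bytes")
```

Accented Latin letters such as "é" take 2 bytes in UTF-8, so European languages land between the English and CJK cases.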
Voxtral — Mistral's ElevenLabs Killer (4B, CC BY-NC)
Released March 26, 2026, Voxtral TTS is Mistral AI's entry into text-to-speech — and the benchmarks are aggressive. In Mistral's own evaluation (note: internal, not third-party), Voxtral won 68.4% of zero-shot voice cloning comparisons against ElevenLabs Flash v2.5 and 69.9% on voice customization tasks. Against the more expensive ElevenLabs v3, the wins narrow to 55.4% — still a majority.
The 4-billion parameter model runs on a single GPU (16GB VRAM minimum, RTX 4090 recommended) and supports 9 languages. Voice cloning from 3 seconds of audio. The API pricing at $0.016/1K characters is 73% cheaper than ElevenLabs Flash and 87% cheaper than ElevenLabs v3.
The critical licensing caveat: Voxtral uses CC BY-NC 4.0. That means free for research, personal projects, and non-commercial use — but commercial applications require a separate license from Mistral. This is the biggest practical difference from Apache 2.0 models like Qwen3 and Kokoro, which have no commercial restrictions. Check Mistral's pricing page for commercial licensing terms.
Dia 2 — Multi-Speaker Dialogue (1.6B, Apache 2.0)
Dia is the specialist for podcast-style and conversational content. Built by Nari Labs, it generates realistic multi-speaker dialogue with distinct voices, turn-taking, and emotion tags — all from a single model. No other open-source TTS handles multi-speaker scenarios this well.
The 1.6B parameter model runs on an 8GB+ GPU and supports streaming via fal.ai (approximately $40/1M characters hosted). Under Apache 2.0, you can self-host for free. The limitation is English-only, and voice cloning isn't supported — Dia generates its own speaker voices rather than cloning yours. Full review in our Dia TTS review.
CosyVoice2 — Multilingual Streaming (0.5B, Apache 2.0)
CosyVoice2-0.5B is the rising star in open-source TTS communities, consistently recommended alongside Fish Audio and IndexTTS in LocalLLaMA discussions. The standout feature is streaming latency as low as 150ms, with output quality virtually identical to non-streaming synthesis. It achieves human-parity on Chinese "hard" test sets.
The model supports 9 base languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) plus 18+ Chinese dialects. Cross-lingual voice cloning works well — clone a voice in English, speak in Chinese. At 0.5B parameters, it's one of the most efficient multilingual models available, with 30–50% fewer pronunciation errors than CosyVoice 1.
Piper — 900+ Voices for Embedded and Offline (MIT)
Piper isn't the most natural-sounding model on this list, but it's the most practical for offline and embedded use. With 900+ voices across 47 languages, ONNX runtime for cross-platform deployment, and models as small as ~15MB, Piper runs anywhere — Raspberry Pi, Home Assistant, offline kiosks, embedded devices.
If you're building a smart home integration, an offline accessibility tool, or a kiosk that needs TTS without internet — Piper is the answer. The voice quality is below the neural models above, but the reliability and deployment flexibility are unmatched.
What Happened to Coqui and Bark?
Two models you'll see in older guides but shouldn't use for new projects:
- Coqui TTS — The company behind Coqui shut down its commercial operations. The open-source library still works and has a community following, but it's no longer actively maintained. Quality is well below current models.
- Bark — Suno's text-to-audio model can generate speech, music, and sound effects. Interesting for creative applications, but unreliable for straightforward TTS — it sometimes hallucinates words or produces garbled output. Not recommended for production speech synthesis.
Both were significant for the open-source TTS ecosystem, but the models listed above have surpassed them in every dimension.
When to Use Open-Source vs Commercial TTS
Use Open-Source When:
- Cost is the primary constraint
- Privacy requirements prohibit external API calls
- You need to customize the model (fine-tuning, new voices)
- Vendor lock-in is a dealbreaker
- You have GPU hardware available
- You need lower latency than cloud APIs can deliver
Use Commercial When:
- You need maximum voice variety (1,000+ voices)
- Setup time matters more than recurring cost
- You don't want to manage GPU infrastructure
- Enterprise support, SLAs, and compliance matter
- You need a polished web interface, not just an API
- Volume is low enough that per-character pricing is cheaper than hardware
Many teams use both: open-source for bulk processing (training data, automated narration, CI/CD pipelines) and commercial APIs for customer-facing audio where maximum quality matters. Our TTS pricing comparison covers 11 services including self-hosted cost estimates, and the cost calculator can help you model the economics at your specific usage level.
License Comparison: What You Can Actually Do Commercially
This is where most guides get it wrong. "Open-source" doesn't always mean "free for commercial use." Here's the actual breakdown:
| License | Models | Commercial Use | Modifications |
|---|---|---|---|
| Apache 2.0 | Qwen3-TTS, Kokoro, Dia, Fish Audio S2 Pro*, CosyVoice2 | Free, unrestricted (*see the Fish Audio S2 Pro note under the model comparison table) | Allowed |
| MIT | Chatterbox, Piper | Free, unrestricted | Allowed |
| CC BY-NC 4.0 | Voxtral | License from Mistral required | Allowed (attribution required; non-commercial only) |
If you're building a commercial product and don't want to deal with licensing lawyers, stick with Apache 2.0 or MIT models. They're the safest options for production deployment.
Getting Started: Hardware and Setup
Hardware requirements vary dramatically across models. Here's a practical guide:
No GPU? Start Here
Kokoro — pip install, runs on any machine with Python 3.8+. 300MB model download. Works on Mac, Linux, Windows. The easiest on-ramp to local TTS. See our Kokoro setup guide.
Have an NVIDIA GPU?
Qwen3-TTS (6GB+ VRAM) or Chatterbox (4GB+ VRAM). Both install via pip with CUDA support. Qwen3 for multilingual, Chatterbox for voice cloning. See our Qwen3-TTS review.
Want a Cloud API Instead?
Fish Audio API ($15/1M chars) gives you S2 Pro quality without managing hardware. The best of both worlds — open-source model quality with API convenience. See our Fish Audio pricing breakdown.
Embedded / Offline?
Piper for maximum compatibility (ONNX, 900+ voices, 47 languages) or Kokoro for better quality if English-only is acceptable.
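The hardware guidance above can be expressed as a small filter over the comparison table. This is a sketch only: the rows are copied from the table earlier in this guide, the `Model` class and `shortlist` function are hypothetical names, and the license flag simply mirrors the rows marked with an asterisk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    languages: int              # count from the comparison table ("+" dropped)
    voice_clone: bool
    min_vram_gb: Optional[int]  # None means the model runs on CPU
    license_caveat: bool        # True for the rows marked * in the table


MODELS = [
    Model("Qwen3-TTS", 10, True, 6, False),
    Model("Kokoro", 1, False, None, False),
    Model("Chatterbox", 23, True, 4, False),
    Model("Fish Audio S2 Pro", 80, True, 12, True),
    Model("Voxtral", 9, True, 16, True),
    Model("Dia 2", 1, False, 8, False),
    Model("CosyVoice2", 9, True, 6, False),
    Model("Piper", 47, False, None, False),
]


def shortlist(vram_gb: int, need_cloning: bool = False,
              allow_license_caveats: bool = False) -> list[str]:
    """Names of models that fit the given constraints, in table order."""
    picks = []
    for m in MODELS:
        if m.min_vram_gb is not None and m.min_vram_gb > vram_gb:
            continue  # not enough GPU memory
        if need_cloning and not m.voice_clone:
            continue
        if m.license_caveat and not allow_license_caveats:
            continue
        picks.append(m.name)
    return picks


print(shortlist(vram_gb=0))                     # CPU-only machine
print(shortlist(vram_gb=8, need_cloning=True))  # 8 GB GPU, cloning required
```

With no GPU the filter returns Kokoro and Piper; with an 8 GB card and cloning required it returns Qwen3-TTS, Chatterbox, and CosyVoice2.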
How We Evaluated These Models
We cross-referenced three data sources: the Artificial Analysis Speech Arena (ELO-based blind human evaluation), Fish Audio's independent Bradley-Terry A/B testing (71,000 blind comparison pairs), and Mistral's internal benchmark results. We also tracked LocalLLaMA community recommendations (where developers share real-world results without marketing bias) and referenced our own reviews of Kokoro, Qwen3-TTS, Fish Audio, Dia, and Chatterbox.
For a commercial TTS comparison including all the paid services these open-source models compete against, see our best text-to-speech guide and best TTS API comparison.
By TextToLab Research Team · Last verified June 2026. ELO rankings from Artificial Analysis Speech Arena. Bradley-Terry scores from Fish Audio's independent blind testing (71K pairs). Voxtral benchmarks from Mistral AI internal evaluation. Community data from r/LocalLLaMA. Pricing from official product pages. ElevenLabs affiliate link disclosed — all other recommendations are independent.