What is the best open-source text to speech model in 2026?

Qwen3-TTS is the most capable all-rounder: 1.7B parameters, 10 languages, voice cloning from 3 seconds, and natural language voice direction — all under Apache 2.0. For English-only on CPU, Kokoro (82M params) achieves commercial-grade quality. For voice cloning specifically, Chatterbox wins 65.3% of blind tests against ElevenLabs.

Can open-source TTS compete with ElevenLabs?

Yes. In blind listening tests, Chatterbox beats ElevenLabs 65.3% vs 24.5%. Voxtral wins 68.4% of zero-shot voice cloning comparisons against ElevenLabs Flash v2.5. Fish Audio S2 Pro has the highest Bradley-Terry score (3.07) of any TTS model in independent A/B testing. The ELO gap between open-source and commercial models has narrowed from 223 to 81 points since 2023.

Which open-source TTS model runs without a GPU?

Kokoro (82M parameters, Apache 2.0) runs on CPU with no GPU requirement. It works on Mac, Linux, and Windows with just Python 3.8+. The model is only about 300MB on disk. Note that the 96x real-time speed often quoted for Kokoro is the GPU figure, so expect lower throughput on CPU. Piper TTS also runs on CPU and supports 47 languages with 900+ voices, which makes it a good fit for embedded systems.

Is Voxtral TTS really free for commercial use?

No. Voxtral uses a CC BY-NC 4.0 license, which prohibits commercial use without a separate license from Mistral AI. For free commercial use, choose Apache 2.0 models (Qwen3-TTS, Kokoro, Fish Audio S2 Pro, Dia, CosyVoice2) or MIT-licensed models (Chatterbox, Piper).

How much does it cost to self-host open-source TTS?

Hardware costs vary by model: Kokoro runs on any CPU ($0 if you have a computer). Chatterbox needs a 4GB+ GPU (RTX 3060 ~$250 used). Qwen3-TTS needs 6GB+ VRAM. Fish Audio S2 Pro needs 12GB+ VRAM (RTX 4090 ~$1,600). Self-hosting Fish Audio breaks even vs their $15/1M API at roughly 50 hours of audio per month.

What happened to Coqui TTS?

Coqui shut down its commercial operations. The open-source library still works but is no longer actively maintained. Voice quality is well below current models like Kokoro, Chatterbox, and Qwen3-TTS. It was significant for the open-source TTS ecosystem but has been surpassed in every dimension.

Best Open-Source Text to Speech 2026: 8 Models Compared (Benchmarks, Licenses, Hardware)

The Best Open-Source TTS Models Right Now

The gap between open-source and commercial TTS has nearly closed. In 2023, the best open-weight model trailed commercial leaders by 223 ELO points on the Artificial Analysis Speech Arena. By mid-2026, that gap is down to 81 ELO — and in blind tests, individual open-source models now beat ElevenLabs more often than not.
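To put those Elo gaps in perspective, here is a minimal Python sketch that converts a rating gap into an expected head-to-head score. It assumes the arena uses the standard logistic Elo formula with a 400-point scale, which the leaderboard figures quoted here do not confirm; the function name is ours.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score (wins plus half of ties) for the LOWER-rated side.

    Assumption: standard logistic Elo with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (rating_gap / scale))


if __name__ == "__main__":
    # Gaps quoted in the article for the best open-weight model vs the leader.
    for label, gap in [("2023", 223), ("mid-2026", 81)]:
        share = elo_expected_score(gap)
        print(f"{label}: gap of {gap} points -> expected score {share:.0%}")
```

Under that assumption, a 223-point gap means the trailing model is expected to take a bit over one in five comparisons, while an 81-point gap puts it a little under two in five.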

If you just want the quick recommendation: use Qwen3-TTS for the most capable all-rounder (10 languages, voice cloning, natural language direction — all free). Use Kokoro if you need something that runs on a Raspberry Pi with zero GPU. Use Chatterbox if you need voice cloning that beats ElevenLabs in blind tests. Use Fish Audio S2 Pro if you need commercial-grade quality with a cloud API fallback.

Why Open-Source TTS Makes Sense Now (When It Didn't Before)

Three years ago, open-source TTS meant robotic voices and painful setup. That's no longer true. Here's what changed:

Quality parity — Chatterbox beats ElevenLabs in 65.3% of blind comparisons. Voxtral wins 68.4% against ElevenLabs Flash v2.5. Fish Audio S2 Pro scored the highest Bradley-Terry coefficient of any model (3.07) in independent A/B tests.
Cost — ElevenLabs charges $0.18–$0.30 per 1,000 characters depending on plan. Open-source models cost $0 to run on your hardware. Even hosted options like Fish Audio's API charge just $15/1M characters.
Vendor lock-in is real — Play.ht got acquired by Meta and shut down overnight, deleting all user data. Coqui AI shut down its commercial offering. With open-source, your models can't be taken away (see our Play.ht shutdown guide for the cautionary tale).
Privacy — Local inference means your text never leaves your machine. Critical for medical, legal, and financial applications.
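The cost gap above is easier to see at a fixed volume. Here is a rough Python sketch using only the list prices quoted in this section; the function name and the one-million-character workload are illustrative assumptions, and real bills depend on plan details not covered here.

```python
# Prices as quoted in this article; check the vendors' pricing pages before relying on them.
ELEVENLABS_USD_PER_1K_CHARS = (0.18, 0.30)  # low and high plan rates
FISH_AUDIO_USD_PER_1M_CHARS = 15.00


def api_cost_usd(characters: int, price_usd: float, per_characters: int) -> float:
    """Cost of synthesizing `characters` at `price_usd` per `per_characters`."""
    return characters / per_characters * price_usd


if __name__ == "__main__":
    volume = 1_000_000  # hypothetical monthly workload, in characters
    low, high = (api_cost_usd(volume, p, 1_000) for p in ELEVENLABS_USD_PER_1K_CHARS)
    fish = api_cost_usd(volume, FISH_AUDIO_USD_PER_1M_CHARS, 1_000_000)
    print(f"ElevenLabs:     ${low:,.2f} to ${high:,.2f}")
    print(f"Fish Audio API: ${fish:,.2f}")
    print("Self-hosted:    $0 in API fees (hardware and power not included)")
```

At one million characters, the quoted ElevenLabs rates work out to $180 to $300, versus $15 on the Fish Audio API.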

Open-Source TTS Models Compared (2026)

Model	Params	License	Languages	Voice Clone (min. reference audio)	GPU Required (min. VRAM)
Qwen3-TTS	1.7B	Apache 2.0	10	3 sec	Yes (6GB+)
Kokoro	82M	Apache 2.0	1 (EN)	No	No (CPU)
Chatterbox	350M	MIT	23+	5 sec	Yes (4GB+)
Fish Audio S2 Pro	4.4B	Apache 2.0*	80+	10–30 sec	Yes (12GB+)
Voxtral	4B	CC BY-NC 4.0*	9	3 sec	Yes (16GB+)
Dia 2	1.6B	Apache 2.0	1 (EN)	No	Yes (8GB+)
CosyVoice2	0.5B	Apache 2.0	9+	Yes	Yes (6GB+)
Piper	~15M	MIT	47	No	No (CPU)

* Fish Audio S2 Pro open weights are Apache 2.0 for research; commercial API use requires a paid license. Voxtral CC BY-NC 4.0 is non-commercial; commercial license available from Mistral.
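If you want to filter the table by the hardware you actually have, a small lookup is enough. The sketch below transcribes the VRAM column from the table above (0 stands for "runs on CPU"); the data structure and function name are our own, and the figures are the article's minimums, not measured requirements.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    license: str
    min_vram_gb: int  # 0 means the table lists the model as CPU-capable


# Transcribed from the comparison table; order matches the table rows.
MODELS = [
    Model("Qwen3-TTS", "Apache 2.0", 6),
    Model("Kokoro", "Apache 2.0", 0),
    Model("Chatterbox", "MIT", 4),
    Model("Fish Audio S2 Pro", "Apache 2.0*", 12),
    Model("Voxtral", "CC BY-NC 4.0*", 16),
    Model("Dia 2", "Apache 2.0", 8),
    Model("CosyVoice2", "Apache 2.0", 6),
    Model("Piper", "MIT", 0),
]


def models_that_fit(vram_gb: int) -> List[str]:
    """Names of models whose listed minimum VRAM is at most `vram_gb`."""
    return [m.name for m in MODELS if m.min_vram_gb <= vram_gb]


if __name__ == "__main__":
    for vram in (0, 4, 8, 16):
        print(f"{vram:>2} GB: {', '.join(models_that_fit(vram))}")
```

Meeting the listed minimum only means the model should load; headroom for longer inputs or batching is not accounted for.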

Qwen3-TTS — The New All-Rounder (1.7B, Apache 2.0)

Released January 2026 by Alibaba, Qwen3-TTS is the most capable open-source TTS model available. It speaks 10 languages, clones voices from 3 seconds of reference audio, and — this is the killer feature — accepts natural language voice direction. Instead of SSML tags or phoneme markup, you tell it "speak slowly with a warm, reassuring tone" and it does.

The benchmarks back it up: lowest word error rate in 6 of 10 tested languages compared to ElevenLabs Multilingual v2, and a speaker similarity score of 0.789 for voice cloning. The 97ms inference latency makes it viable for near-real-time applications.
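For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference text and a transcript, divided by the number of reference words. The sketch below shows that standard definition in Python. How the benchmark above produced its transcripts is not stated here; TTS evaluations typically run the generated audio through a speech recognizer first, which is an assumption on our part.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    intended = "speak slowly with a warm tone"
    transcribed = "speak slowly with a worn tone"  # hypothetical ASR transcript
    print(f"WER: {word_error_rate(intended, transcribed):.2%}")
```

Lower is better: a WER of 0 means the transcript of the synthesized audio matched the input text word for word.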

The catch: you need an NVIDIA GPU with at least 6GB VRAM (RTX 3060 12GB recommended). No Mac support, no AMD support. If you don't have compatible hardware, look at Kokoro (runs on CPU) or use Fish Audio's cloud API. Full setup instructions and benchmarks in our Qwen3-TTS review.

Kokoro — Tiny Model, Incredible Quality (82M, Apache 2.0)

Kokoro proves you don't need billions of parameters for great TTS. At 82 million parameters — roughly 300MB on disk — it hit #1 on the TTS Arena and achieves the highest Mean Opinion Score (4.2/5) in its weight class. It runs 96x real-time on GPU and 210x on optimized setups.
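A "times real-time" figure is simply audio duration divided by synthesis time. A quick Python sketch of what the quoted multipliers imply for an hour of narration (the function name is ours, and the multipliers are the article's figures, not our measurements):

```python
def synthesis_seconds(audio_seconds: float, realtime_multiplier: float) -> float:
    """Wall-clock time needed to generate `audio_seconds` of speech."""
    if realtime_multiplier <= 0:
        raise ValueError("realtime_multiplier must be positive")
    return audio_seconds / realtime_multiplier


if __name__ == "__main__":
    one_hour = 3600.0
    for multiplier in (96, 210):
        seconds = synthesis_seconds(one_hour, multiplier)
        print(f"{multiplier}x real-time: 1 hour of audio in about {seconds:.1f} s")
```

At 96x, an hour of audio takes 37.5 seconds to generate; actual throughput depends on hardware, text length, and batching.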

The biggest selling point: Kokoro runs on CPU. No GPU, no CUDA, no NVIDIA dependency. It'll run on a $35 Raspberry Pi. This makes it the go-to for embedded systems, edge devices, accessibility tools, and anyone who just wants local TTS without hardware hassles.

Limitations are real: English only, no voice cloning, limited voice selection. If you need multilingual or voice cloning, Qwen3-TTS or Chatterbox are better fits. But for English narration, audiobook generation, and accessibility — Kokoro punches way above its weight. Full analysis in our Kokoro TTS review (which hit Google page 1 in its first week — clearly there's demand for independent reviews of this model).

Chatterbox — Best Voice Cloning, Beats ElevenLabs (350M, MIT)

Chatterbox is the headline act of open-source TTS in 2026. Built by Resemble AI on top of a 0.5B Llama backbone, it won 65.3% of blind tests against ElevenLabs (vs 24.5% for ElevenLabs, 10.2% ties). The voice cloning from just 5 seconds of reference audio is remarkably accurate.

What makes Chatterbox unique: native paralinguistic tags. You can insert [laugh], [cough], [sigh], [gasp] directly in your text and the model produces natural-sounding non-speech audio. No other model — open-source or commercial — handles this as cleanly.
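Because the tags live inline in the script, it helps to check them before sending text to the model. The Python sketch below only parses text and does not call Chatterbox. The tag set is limited to the four tags named above, and whether the model accepts other tags or other spellings is not something we verify here.

```python
import re
from typing import List

# Only the tags mentioned in this article; the model's full tag set is an unknown here.
KNOWN_TAGS = {"laugh", "cough", "sigh", "gasp"}
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_tags(text: str) -> List[str]:
    """Return every [tag] found in the text, in order of appearance."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text: str) -> List[str]:
    """Return tags that are not in KNOWN_TAGS, so typos can be caught early."""
    return [tag for tag in find_tags(text) if tag not in KNOWN_TAGS]


if __name__ == "__main__":
    script = "Well [sigh] that did not go as planned. [laugh] Let's try again [yawn]."
    print("tags found:", find_tags(script))
    print("unrecognized:", unknown_tags(script))
```

A check like this keeps a mistyped tag from being read aloud as literal text, if that is how the model treats unknown tags.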

The MIT license is the most permissive option on this list: use it commercially, modify it, and redistribute it, as long as copies keep the copyright and license notice. For teams that need voice cloning without per-character API costs, Chatterbox is the clear winner. Full review on our Chatterbox page.

Fish Audio S2 Pro — #1 in Blind Tests, Cloud + Self-Host (4.4B, Apache 2.0)

Fish Audio S2 Pro holds the highest Bradley-Terry score of any TTS model: 3.07, evaluated across 71,000 blind A/B comparison pairs. On the Artificial Analysis Arena, its ELO of 1123 makes it the #1 open-weight model overall — just 81 points behind the commercial leader. It wins 61% of its head-to-head matchups.

The dual appeal: you can use the managed API at $15/1M characters (roughly 12–20x cheaper than the $0.18–$0.30 per 1,000 characters quoted for ElevenLabs above) or self-host under Apache 2.0. The model supports 80+ languages with a DualAR architecture trained on 300,000+ hours for English and Chinese. Self-hosting breakeven is roughly 50 hours of audio per month on an RTX 4090.

The gotcha: billing is per UTF-8 byte, not per character. For plain ASCII English text, 1 character = 1 byte, so $15/1M characters is accurate. For Chinese, Japanese, or Korean, most characters take 3 bytes, so the effective cost is about 3x. Accented Latin letters take 2 bytes each, so other European languages land slightly above the English rate. We break this down in our Fish Audio pricing guide. For a head-to-head with the commercial leader, see Fish Audio vs ElevenLabs.
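You can measure the byte count of your own text before sending it. A minimal Python sketch, assuming the $15 rate applies per one million UTF-8 bytes as described above (the constant and function names are ours, and any rounding or minimum charges on the real invoice are not modeled):

```python
# Assumption: the quoted $15/1M rate is applied to UTF-8 bytes.
USD_PER_MILLION_BYTES = 15.00


def billed_bytes(text: str) -> int:
    """Number of UTF-8 bytes in the text, which is what gets billed."""
    return len(text.encode("utf-8"))


def estimated_cost_usd(text: str) -> float:
    """Estimated charge for synthesizing the text at the assumed byte rate."""
    return billed_bytes(text) / 1_000_000 * USD_PER_MILLION_BYTES


if __name__ == "__main__":
    samples = ["hello", "café", "你好", "こんにちは", "안녕"]
    for sample in samples:
        chars, size = len(sample), billed_bytes(sample)
        print(f"{sample!r}: {chars} characters, {size} bytes, {size / chars:.1f} bytes/char")
```

Running your real scripts through `billed_bytes` gives a better estimate than character counts, especially for mixed-language content.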

Voxtral — Mistral's ElevenLabs Killer (4B, CC BY-NC)

Released March 26, 2026, Voxtral TTS is Mistral AI's entry into text-to-speech — and the benchmarks are aggressive. In Mistral's own evaluation (note: internal, not third-party), Voxtral won 68.4% of zero-shot voice cloning comparisons against ElevenLabs Flash v2.5 and 69.9% on voice customization tasks. Against the more expensive ElevenLabs v3, the wins narrow to 55.4% — still a majority.

The 4-billion parameter model runs on a single GPU (16GB VRAM minimum, RTX 4090 recommended) and supports 9 languages. Voice cloning from 3 seconds of audio. The API pricing at $0.016/1K characters is 73% cheaper than ElevenLabs Flash and 87% cheaper than ElevenLabs v3.

The critical licensing caveat: Voxtral uses CC BY-NC 4.0. That means free for research, personal projects, and non-commercial use — but commercial applications require a separate license from Mistral. This is the biggest practical difference from Apache 2.0 models like Qwen3 and Kokoro, which have no commercial restrictions. Check Mistral's pricing page for commercial licensing terms.

Dia 2 — Multi-Speaker Dialogue (1.6B, Apache 2.0)

Dia is the specialist for podcast-style and conversational content. Built by Nari Labs, it generates realistic multi-speaker dialogue with distinct voices, turn-taking, and emotion tags — all from a single model. No other open-source TTS handles multi-speaker scenarios this well.

The 1.6B parameter model runs on an 8GB+ GPU and supports streaming via fal.ai (approximately $40/1M characters hosted). Under Apache 2.0, you can self-host for free. The limitation is English-only, and voice cloning isn't supported — Dia generates its own speaker voices rather than cloning yours. Full review in our Dia TTS review.

CosyVoice2 — Multilingual Streaming (0.5B, Apache 2.0)

CosyVoice2-0.5B is the rising star in open-source TTS communities, consistently recommended alongside Fish Audio and IndexTTS in LocalLLaMA discussions. The standout feature is streaming mode with latency as low as 150ms and output quality that is virtually lossless compared to batch synthesis. It achieves human parity on Chinese "hard" test sets.

The model supports 9 base languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) plus 18+ Chinese dialects. Cross-lingual voice cloning works well — clone a voice in English, speak in Chinese. At 0.5B parameters, it's one of the most efficient multilingual models available, with 30–50% fewer pronunciation errors than CosyVoice 1.

Piper — 900+ Voices for Embedded and Offline (MIT)

Piper isn't the most natural-sounding model on this list, but it's the most practical for offline and embedded use. With 900+ voices across 47 languages, ONNX runtime for cross-platform deployment, and models as small as ~15MB, Piper runs anywhere — Raspberry Pi, Home Assistant, offline kiosks, embedded devices.

If you're building a smart home integration, an offline accessibility tool, or a kiosk that needs TTS without internet — Piper is the answer. The voice quality is below the neural models above, but the reliability and deployment flexibility are unmatched.

What Happened to Coqui and Bark?

Two models you'll see in older guides but shouldn't use for new projects:

Coqui TTS — The company behind Coqui shut down its commercial operations. The open-source library still works and has a community following, but it's no longer actively maintained. Quality is well below current models.
Bark — Suno's text-to-audio model can generate speech, music, and sound effects. Interesting for creative applications, but unreliable for straightforward TTS — it sometimes hallucinates words or produces garbled output. Not recommended for production speech synthesis.

Both were significant for the open-source TTS ecosystem, but the models listed above have surpassed them in every dimension.

When to Use Open-Source vs Commercial TTS

Use Open-Source When:

Cost is the primary constraint
Privacy requirements prohibit external API calls
You need to customize the model (fine-tuning, new voices)
Vendor lock-in is a dealbreaker
You have GPU hardware available
You need lower latency than cloud APIs can deliver

Use Commercial When:

You need maximum voice variety (1,000+ voices)
Setup time matters more than recurring cost
You don't want to manage GPU infrastructure
Enterprise support, SLAs, and compliance matter
You need a polished web interface, not just an API
Volume is low enough that per-character pricing is cheaper than hardware

Many teams use both: open-source for bulk processing (training data, automated narration, CI/CD pipelines) and commercial APIs for customer-facing audio where maximum quality matters. Our TTS pricing comparison covers 11 services including self-hosted cost estimates, and the cost calculator can help you model the economics at your specific usage level.
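As a back-of-the-envelope version of that calculation, the Python sketch below estimates how many hours of audio per month you need before self-hosting beats an API. Every input is a parameter, and the example values for amortization period, running cost, and characters per hour of audio are hypothetical assumptions on our part; only the $1,600 GPU price and the $15/1M API rate come from this article.

```python
def breakeven_audio_hours_per_month(
    hardware_cost_usd: float,
    amortization_months: int,
    monthly_running_cost_usd: float,
    api_usd_per_million_chars: float,
    chars_per_audio_hour: float,
) -> float:
    """Monthly audio hours at which self-hosting costs the same as the API."""
    monthly_self_host = hardware_cost_usd / amortization_months + monthly_running_cost_usd
    api_cost_per_audio_hour = chars_per_audio_hour / 1_000_000 * api_usd_per_million_chars
    return monthly_self_host / api_cost_per_audio_hour


if __name__ == "__main__":
    hours = breakeven_audio_hours_per_month(
        hardware_cost_usd=1600,          # RTX 4090 price quoted in this article
        amortization_months=36,          # hypothetical: spread over three years
        monthly_running_cost_usd=0,      # hypothetical: ignores power and upkeep
        api_usd_per_million_chars=15,    # Fish Audio API rate quoted in this article
        chars_per_audio_hour=54_000,     # hypothetical speaking rate
    )
    print(f"Break-even at about {hours:.0f} hours of audio per month")
```

The result is very sensitive to the amortization period and the characters-per-hour figure, so plug in your own numbers rather than trusting any single break-even point.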

License Comparison: What You Can Actually Do Commercially

This is where most guides get it wrong. "Open-source" doesn't always mean "free for commercial use." Here's the actual breakdown:

License	Models	Commercial Use	Modifications
Apache 2.0	Qwen3-TTS, Kokoro, Dia 2, Fish Audio S2 Pro (self-hosted weights; see the note under the comparison table), CosyVoice2	Free, unrestricted	Allowed
MIT	Chatterbox, Piper	Free, unrestricted	Allowed
CC BY-NC 4.0	Voxtral	License from Mistral required	Allowed for non-commercial use (attribution required)

If you're building a commercial product and don't want to deal with licensing lawyers, stick with Apache 2.0 or MIT models. They're the safest options for production deployment.
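If you track several models in a project, it can help to encode this table once and check it in code. The Python sketch below mirrors the table above; the names are ours, and it is a convenience check under the article's summary of each license, not legal advice or a substitute for reading the license text.

```python
# License families as summarized in the table above.
COMMERCIAL_USE_WITHOUT_EXTRA_LICENSE = {
    "Apache 2.0": True,
    "MIT": True,
    "CC BY-NC 4.0": False,
}

MODEL_LICENSE = {
    "Qwen3-TTS": "Apache 2.0",
    "Kokoro": "Apache 2.0",
    "Dia 2": "Apache 2.0",
    "Fish Audio S2 Pro": "Apache 2.0",  # see the caveat noted with the comparison table
    "CosyVoice2": "Apache 2.0",
    "Chatterbox": "MIT",
    "Piper": "MIT",
    "Voxtral": "CC BY-NC 4.0",
}


def needs_separate_commercial_license(model: str) -> bool:
    """True if the table says commercial use requires a separate license."""
    return not COMMERCIAL_USE_WITHOUT_EXTRA_LICENSE[MODEL_LICENSE[model]]


if __name__ == "__main__":
    for name in MODEL_LICENSE:
        status = "separate license needed" if needs_separate_commercial_license(name) else "ok"
        print(f"{name}: {MODEL_LICENSE[name]} ({status})")
```

An unknown model name raises a `KeyError`, which is deliberate: a missing entry should stop a release check rather than silently pass.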

Getting Started: Hardware and Setup

Hardware requirements vary dramatically across models. Here's a practical guide:

No GPU? Start Here

Kokoro — pip install, runs on any machine with Python 3.8+. 300MB model download. Works on Mac, Linux, Windows. The easiest on-ramp to local TTS. See our Kokoro setup guide.

Have an NVIDIA GPU?

Qwen3-TTS (6GB+ VRAM) or Chatterbox (4GB+ VRAM). Both install via pip with CUDA support. Qwen3 for multilingual, Chatterbox for voice cloning. See our Qwen3-TTS review.

Want a Cloud API Instead?

Fish Audio API ($15/1M chars) gives you S2 Pro quality without managing hardware. The best of both worlds — open-source model quality with API convenience. See our Fish Audio pricing breakdown.

Embedded / Offline?

Piper for maximum compatibility (ONNX, 900+ voices, 47 languages) or Kokoro for better quality if English-only is acceptable.

How We Evaluated These Models

We cross-referenced three data sources: the Artificial Analysis Speech Arena (ELO-based blind human evaluation), the Bradley-Terry blind A/B testing reported for Fish Audio (71,000 comparison pairs), and Mistral's internal benchmark results. We also tracked LocalLLaMA community recommendations (where developers share real-world results without marketing bias) and referenced our own reviews of Kokoro, Qwen3-TTS, Fish Audio, Dia, and Chatterbox.

For a commercial TTS comparison including all the paid services these open-source models compete against, see our best text-to-speech guide and best TTS API comparison. For cloud-native alternatives with free tiers, check our Google Cloud TTS pricing guide and Azure TTS pricing guide.

By TextToLab Research Team · Last verified June 2026. ELO rankings from Artificial Analysis Speech Arena. Bradley-Terry scores from Fish Audio's independent blind testing (71K pairs). Voxtral benchmarks from Mistral AI internal evaluation. Community data from r/LocalLLaMA. Pricing from official product pages. ElevenLabs affiliate link disclosed — all other recommendations are independent.