Guide · 16 min read · June 3, 2026

By TextToLab Research Team

Best Open-Source Text to Speech 2026: 8 Models Compared (Benchmarks, Licenses, Hardware)

Independent comparison of 8 open-source TTS models: Qwen3-TTS, Kokoro, Chatterbox, Fish Audio S2 Pro, Voxtral, Dia, CosyVoice2, and Piper. With ELO rankings, blind test data, license comparison, and hardware requirements.

The Best Open-Source TTS Models Right Now

The gap between open-source and commercial TTS has nearly closed. In 2023, the best open-weight model trailed commercial leaders by 223 ELO points on the Artificial Analysis Speech Arena. By mid-2026, that gap is down to 81 ELO — and in blind tests, individual open-source models now beat ElevenLabs more often than not.
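
To put those gaps in perspective, here is a minimal sketch that converts an ELO gap into an expected head-to-head win rate. It assumes the arena uses the standard logistic Elo formula with a 400-point scale, which the leaderboard numbers above do not confirm, so treat the output as a rough intuition rather than an official figure. The function name is ours.

```python
def elo_expected_score(gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated model, given its rating lead.

    Assumes the standard logistic Elo model (400-point scale).
    """
    return 1.0 / (1.0 + 10.0 ** (-gap / scale))


if __name__ == "__main__":
    # Gaps quoted in the article: 223 points in 2023, 81 points by mid-2026.
    for label, gap in (("2023", 223), ("mid-2026", 81)):
        share = elo_expected_score(gap)
        print(f"{label}: gap of {gap} -> leader expected to win about {share:.0%}")
```

Under that assumption, a 223-point lead means the commercial leader would be preferred in roughly 78% of comparisons, while an 81-point lead drops that to roughly 61%.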

If you just want the quick recommendation: use Qwen3-TTS for the most capable all-rounder (10 languages, voice cloning, natural language direction — all free). Use Kokoro if you need something that runs on a Raspberry Pi with zero GPU. Use Chatterbox if you need voice cloning that beats ElevenLabs in blind tests. Use Fish Audio S2 Pro if you need commercial-grade quality with a cloud API fallback.

Why Open-Source TTS Makes Sense Now (When It Didn't Before)

Three years ago, open-source TTS meant robotic voices and painful setup. That's no longer true. Here's what changed:

Open-Source TTS Models Compared (2026)

Model | Params | License | Languages | Voice Clone | GPU Required
Qwen3-TTS | 1.7B | Apache 2.0 | 10 | 3 sec | Yes (6GB+)
Kokoro | 82M | Apache 2.0 | 1 (EN) | No | No (CPU)
Chatterbox | 350M | MIT | 23+ | 5 sec | Yes (4GB+)
Fish Audio S2 Pro | 4.4B | Apache 2.0* | 80+ | 10–30 sec | Yes (12GB+)
Voxtral | 4B | CC BY-NC 4.0* | 9 | 3 sec | Yes (16GB+)
Dia 2 | 1.6B | Apache 2.0 | 1 (EN) | No | Yes (8GB+)
CosyVoice2 | 0.5B | Apache 2.0 | 9+ | Yes | Yes (6GB+)
Piper | ~15M | MIT | 47 | No | No (CPU)

* Fish Audio S2 Pro open weights are Apache 2.0 for research; commercial API use requires a paid license. Voxtral CC BY-NC 4.0 is non-commercial; commercial license available from Mistral.

Qwen3-TTS — The New All-Rounder (1.7B, Apache 2.0)

Released January 2026 by Alibaba, Qwen3-TTS is the most capable open-source TTS model available. It speaks 10 languages, clones voices from 3 seconds of reference audio, and — this is the killer feature — accepts natural language voice direction. Instead of SSML tags or phoneme markup, you tell it "speak slowly with a warm, reassuring tone" and it does.

The benchmarks back it up: lowest word error rate in 6 of 10 tested languages compared to ElevenLabs Multilingual v2, and a speaker similarity score of 0.789 for voice cloning. The 97ms inference latency makes it viable for near-real-time applications.

The catch: you need an NVIDIA GPU with at least 6GB VRAM (RTX 3060 12GB recommended). No Mac support, no AMD support. If you don't have compatible hardware, look at Kokoro (runs on CPU) or use Fish Audio's cloud API. Full setup instructions and benchmarks in our Qwen3-TTS review.

Kokoro — Tiny Model, Incredible Quality (82M, Apache 2.0)

Kokoro proves you don't need billions of parameters for great TTS. At 82 million parameters — roughly 300MB on disk — it hit #1 on the TTS Arena and achieves the highest Mean Opinion Score (4.2/5) in its weight class. It runs 96x real-time on GPU and 210x on optimized setups.
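
What do those real-time multipliers mean in practice? A quick sketch, assuming "96x real-time" means 96 seconds of audio generated per second of compute (the usual reading, but our assumption here). The helper name is ours.

```python
def synthesis_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Compute time needed to generate `audio_seconds` of speech.

    Assumes `realtime_factor` is seconds of audio produced per second of compute.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


if __name__ == "__main__":
    one_hour = 3600.0
    for factor in (96, 210):  # the two figures quoted for Kokoro
        secs = synthesis_seconds(one_hour, factor)
        print(f"{factor}x real-time: 1 hour of audio in about {secs:.1f} s")
```

On that reading, an hour of narration takes well under a minute of compute at either speed.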

The biggest selling point: Kokoro runs on CPU. No GPU, no CUDA, no NVIDIA dependency. It'll run on a $35 Raspberry Pi. This makes it the go-to for embedded systems, edge devices, accessibility tools, and anyone who just wants local TTS without hardware hassles.

Limitations are real: English only, no voice cloning, limited voice selection. If you need multilingual or voice cloning, Qwen3-TTS or Chatterbox are better fits. But for English narration, audiobook generation, and accessibility — Kokoro punches way above its weight. Full analysis in our Kokoro TTS review (which hit Google page 1 in its first week — clearly there's demand for independent reviews of this model).

Chatterbox — Best Voice Cloning, Beats ElevenLabs (350M, MIT)

Chatterbox is the headline act of open-source TTS in 2026. Built by Resemble AI on top of a 0.5B Llama backbone, it won 65.3% of blind tests against ElevenLabs (vs 24.5% for ElevenLabs, 10.2% ties). The voice cloning from just 5 seconds of reference audio is remarkably accurate.

What makes Chatterbox unique: native paralinguistic tags. You can insert [laugh], [cough], [sigh], [gasp] directly in your text and the model produces natural-sounding non-speech audio. No other model — open-source or commercial — handles this as cleanly.
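
If you script with these tags, it can help to check them before sending text to the model. The sketch below is plain string handling, not part of Chatterbox itself. The tag set contains only the four tags named above and may not match the model's full list, and the helper names are hypothetical.

```python
import re

# Only the tags mentioned in this article; the model's real list may differ.
KNOWN_TAGS = {"laugh", "cough", "sigh", "gasp"}
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def find_tags(text: str) -> list:
    """Return every bracketed tag name in order of appearance."""
    return TAG_PATTERN.findall(text)


def unknown_tags(text: str) -> list:
    """Return tags that are not in KNOWN_TAGS (likely typos or unsupported)."""
    return [tag for tag in find_tags(text) if tag not in KNOWN_TAGS]


def strip_tags(text: str) -> str:
    """Remove tags, e.g. to reuse the same script with a model that lacks them."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


if __name__ == "__main__":
    script = "Well [sigh] that was close. [laugh]"
    print(find_tags(script))
    print(unknown_tags("Hi [yawn] there [gasp]"))
    print(strip_tags(script))
```

The stripped version is handy when the same script also feeds a model without tag support.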

The MIT license is the most permissive option on this list: use it commercially, modify it, redistribute it, as long as the copyright and license notice stay with copies of the software. For teams that need voice cloning without per-character API costs, Chatterbox is the clear winner. Full review on our Chatterbox page.

Fish Audio S2 Pro — #1 in Blind Tests, Cloud + Self-Host (4.4B, Apache 2.0)

Fish Audio S2 Pro holds the highest Bradley-Terry score of any TTS model: 3.07, evaluated across 71,000 blind A/B comparison pairs. On the Artificial Analysis Arena, its ELO of 1123 makes it the #1 open-weight model overall — just 81 points behind the commercial leader. It wins 61% of its head-to-head matchups.

The dual appeal: you can use the managed API at $15/1M characters (6–11x cheaper than ElevenLabs) or self-host under Apache 2.0. The model supports 80+ languages with a DualAR architecture trained on 300,000+ hours for English and Chinese. Self-hosting breakeven is roughly 50 hours of audio per month on an RTX 4090.

The gotcha: billing is per UTF-8 byte, not per character. For English, 1 character = 1 byte, so $15/1M characters is accurate. For Chinese, Japanese, or Korean, each character is 3 bytes — effectively 3x the cost. We break this down in our Fish Audio pricing guide. For a head-to-head with the commercial leader, see Fish Audio vs ElevenLabs.
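
You can check the byte count of your own text before you get the bill. A minimal sketch, assuming the $15 rate applies per 1M UTF-8 bytes as described above; the function names are ours, and the figures are estimates, not Fish Audio's billing logic.

```python
PRICE_PER_MILLION_BYTES_USD = 15.0  # rate quoted in this article


def billable_bytes(text: str) -> int:
    """Number of UTF-8 bytes in the text (what byte-based billing counts)."""
    return len(text.encode("utf-8"))


def estimated_cost_usd(text: str, price: float = PRICE_PER_MILLION_BYTES_USD) -> float:
    """Estimated cost for one synthesis request at a per-million-bytes price."""
    return billable_bytes(text) / 1_000_000 * price


if __name__ == "__main__":
    for sample in ("Hello", "你好", "こんにちは", "안녕하세요"):
        chars = len(sample)
        size = billable_bytes(sample)
        print(f"{sample!r}: {chars} characters, {size} bytes "
              f"({size / chars:.0f} bytes per character)")
```

Mixed-language scripts land somewhere in between, so measure the real text rather than multiplying character counts by a fixed factor.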

Voxtral — Mistral's ElevenLabs Killer (4B, CC BY-NC)

Released March 26, 2026, Voxtral TTS is Mistral AI's entry into text-to-speech — and the benchmarks are aggressive. In Mistral's own evaluation (note: internal, not third-party), Voxtral won 68.4% of zero-shot voice cloning comparisons against ElevenLabs Flash v2.5 and 69.9% on voice customization tasks. Against the more expensive ElevenLabs v3, the wins narrow to 55.4% — still a majority.

The 4-billion parameter model runs on a single GPU (16GB VRAM minimum, RTX 4090 recommended) and supports 9 languages. Voice cloning from 3 seconds of audio. The API pricing at $0.016/1K characters is 73% cheaper than ElevenLabs Flash and 87% cheaper than ElevenLabs v3.

The critical licensing caveat: Voxtral uses CC BY-NC 4.0. That means free for research, personal projects, and non-commercial use — but commercial applications require a separate license from Mistral. This is the biggest practical difference from Apache 2.0 models like Qwen3 and Kokoro, which have no commercial restrictions. Check Mistral's pricing page for commercial licensing terms.

Dia 2 — Multi-Speaker Dialogue (1.6B, Apache 2.0)

Dia is the specialist for podcast-style and conversational content. Built by Nari Labs, it generates realistic multi-speaker dialogue with distinct voices, turn-taking, and emotion tags — all from a single model. No other open-source TTS handles multi-speaker scenarios this well.

The 1.6B parameter model runs on an 8GB+ GPU and supports streaming via fal.ai (approximately $40/1M characters hosted). Under Apache 2.0, you can self-host for free. The limitation is English-only, and voice cloning isn't supported — Dia generates its own speaker voices rather than cloning yours. Full review in our Dia TTS review.

CosyVoice2 — Multilingual Streaming (0.5B, Apache 2.0)

CosyVoice2-0.5B is the rising star in open-source TTS communities, consistently recommended alongside Fish Audio and IndexTTS in LocalLLaMA discussions. The standout feature is 150ms latency in streaming mode, with output quality that is virtually lossless compared to batch synthesis. It achieves human-parity on Chinese "hard" test sets.

The model supports 9 base languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) plus 18+ Chinese dialects. Cross-lingual voice cloning works well — clone a voice in English, speak in Chinese. At 0.5B parameters, it's one of the most efficient multilingual models available, with 30–50% fewer pronunciation errors than CosyVoice 1.

Piper — 900+ Voices for Embedded and Offline (MIT)

Piper isn't the most natural-sounding model on this list, but it's the most practical for offline and embedded use. With 900+ voices across 47 languages, ONNX runtime for cross-platform deployment, and models as small as ~15MB, Piper runs anywhere — Raspberry Pi, Home Assistant, offline kiosks, embedded devices.

If you're building a smart home integration, an offline accessibility tool, or a kiosk that needs TTS without internet — Piper is the answer. The voice quality is below the neural models above, but the reliability and deployment flexibility are unmatched.

What Happened to Coqui and Bark?

Two models you'll see in older guides but shouldn't use for new projects:

Both were significant for the open-source TTS ecosystem, but the models listed above have surpassed them in every dimension.

When to Use Open-Source vs Commercial TTS

Use Open-Source When:

  • Cost is the primary constraint
  • Privacy requirements prohibit external API calls
  • You need to customize the model (fine-tuning, new voices)
  • Vendor lock-in is a dealbreaker
  • You have GPU hardware available
  • You need lower latency than cloud APIs can deliver

Use Commercial When:

  • You need maximum voice variety (1,000+ voices)
  • Setup time matters more than recurring cost
  • You don't want to manage GPU infrastructure
  • Enterprise support, SLAs, and compliance matter
  • You need a polished web interface, not just an API
  • Volume is low enough that per-character pricing is cheaper than hardware

Many teams use both: open-source for bulk processing (training data, automated narration, CI/CD pipelines) and commercial APIs for customer-facing audio where maximum quality matters. Our TTS pricing comparison covers 11 services including self-hosted cost estimates, and the cost calculator can help you model the economics at your specific usage level.
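
For a back-of-the-envelope version of that calculation, here is a small sketch. The API prices are the ones quoted in this guide; the $60/month self-hosting figure is a placeholder we made up for illustration, so substitute your own hardware, power, and maintenance costs. The sketch also ignores per-byte billing for non-English text.

```python
# Hosted prices quoted in this guide, in USD per 1M characters.
API_PRICE_PER_MILLION_CHARS_USD = {
    "Fish Audio S2 Pro": 15.0,
    "Voxtral": 16.0,        # $0.016 per 1K characters
    "Dia (fal.ai)": 40.0,   # approximate
}


def monthly_api_cost(chars_per_month: float, price_per_million: float) -> float:
    """API spend for a given monthly character volume."""
    return chars_per_month / 1_000_000 * price_per_million


def breakeven_chars(self_host_monthly_cost: float, price_per_million: float) -> float:
    """Monthly volume above which self-hosting is cheaper than the API."""
    return self_host_monthly_cost / price_per_million * 1_000_000


if __name__ == "__main__":
    hypothetical_self_host_cost = 60.0  # placeholder, not a measured figure
    volume = 2_000_000                  # example monthly volume in characters
    for name, price in API_PRICE_PER_MILLION_CHARS_USD.items():
        cost = monthly_api_cost(volume, price)
        threshold = breakeven_chars(hypothetical_self_host_cost, price)
        print(f"{name}: ${cost:.2f}/month at {volume:,} chars; "
              f"break-even near {threshold:,.0f} chars/month")
```

Below the break-even volume, per-character pricing wins; above it, owning the hardware starts to pay off.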

License Comparison: What You Can Actually Do Commercially

This is where most guides get it wrong. "Open-source" doesn't always mean "free for commercial use." Here's the actual breakdown:

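License | Models | Commercial Use | Modifications
Apache 2.0 | Qwen3-TTS, Kokoro, Dia, Fish Audio S2 Pro*, CosyVoice2 | Free, unrestricted | Allowed
MIT | Chatterbox, Piper | Free, unrestricted | Allowed
CC BY-NC 4.0 | Voxtral | License from Mistral required | Allowed (attribution required, non-commercial only)

* Per the note under the model comparison table, the Fish Audio S2 Pro weights are Apache 2.0 for research and commercial API use requires a paid license.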

If you're building a commercial product and don't want to deal with licensing lawyers, stick with Apache 2.0 or MIT models. They're the safest options for production deployment.
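
If you track model choices in code, the same breakdown fits in a small lookup. This is a sketch built only from the table above, not legal advice, and the names are ours. Fish Audio S2 Pro is flagged separately because of its paid-license caveat for commercial API use.

```python
MODEL_LICENSES = {
    "Qwen3-TTS": "Apache 2.0",
    "Kokoro": "Apache 2.0",
    "Dia": "Apache 2.0",
    "Fish Audio S2 Pro": "Apache 2.0",
    "CosyVoice2": "Apache 2.0",
    "Chatterbox": "MIT",
    "Piper": "MIT",
    "Voxtral": "CC BY-NC 4.0",
}

# Whether the license itself permits commercial use without a separate deal.
LICENSE_ALLOWS_COMMERCIAL = {
    "Apache 2.0": True,
    "MIT": True,
    "CC BY-NC 4.0": False,
}

# Models the article flags with an extra caveat; check terms before shipping.
CAVEATS = {
    "Fish Audio S2 Pro": "commercial API use requires a paid license",
}


def commercially_usable(models=MODEL_LICENSES):
    """Names of models whose license permits commercial use, sorted."""
    return sorted(
        name for name, lic in models.items() if LICENSE_ALLOWS_COMMERCIAL[lic]
    )


if __name__ == "__main__":
    for name in commercially_usable():
        note = CAVEATS.get(name)
        print(f"{name}: {MODEL_LICENSES[name]}" + (f" (note: {note})" if note else ""))
```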

Getting Started: Hardware and Setup

Hardware requirements vary dramatically across models. Here's a practical guide:

No GPU? Start Here

Kokoro — pip install, runs on any machine with Python 3.8+. 300MB model download. Works on Mac, Linux, Windows. The easiest on-ramp to local TTS. See our Kokoro setup guide.

Have an NVIDIA GPU?

Qwen3-TTS (6GB+ VRAM) or Chatterbox (4GB+ VRAM). Both install via pip with CUDA support. Qwen3 for multilingual, Chatterbox for voice cloning. See our Qwen3-TTS review.

Want a Cloud API Instead?

Fish Audio API ($15/1M chars) gives you S2 Pro quality without managing hardware. The best of both worlds — open-source model quality with API convenience. See our Fish Audio pricing breakdown.

Embedded / Offline?

Piper for maximum compatibility (ONNX, 900+ voices, 47 languages) or Kokoro for better quality if English-only is acceptable.
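
The same decision can be written as a quick filter. A minimal sketch using the minimum VRAM figures from the comparison table earlier in this guide; the function name is ours, and real memory use depends on precision, batch size, and runtime, so treat the thresholds as starting points.

```python
# Minimum VRAM in GB from the comparison table; 0 means the model runs on CPU.
MIN_VRAM_GB = {
    "Kokoro": 0,
    "Piper": 0,
    "Chatterbox": 4,
    "Qwen3-TTS": 6,
    "CosyVoice2": 6,
    "Dia 2": 8,
    "Fish Audio S2 Pro": 12,
    "Voxtral": 16,
}


def runnable_models(vram_gb: float):
    """Models whose listed minimum VRAM fits in `vram_gb` (0 = no GPU), sorted."""
    return sorted(name for name, need in MIN_VRAM_GB.items() if need <= vram_gb)


if __name__ == "__main__":
    for vram in (0, 6, 12, 16):
        label = "no GPU" if vram == 0 else f"{vram}GB NVIDIA GPU"
        print(f"{label}: {', '.join(runnable_models(vram))}")
```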

How We Evaluated These Models

We cross-referenced three data sources: the Artificial Analysis Speech Arena (ELO-based blind human evaluation), Fish Audio's independent Bradley-Terry A/B testing (71,000 blind comparison pairs), and Mistral's internal benchmark results. We also tracked LocalLLaMA community recommendations (where developers share real-world results without marketing bias) and referenced our own reviews of Kokoro, Qwen3-TTS, Fish Audio, Dia, and Chatterbox.

For a commercial TTS comparison including all the paid services these open-source models compete against, see our best text-to-speech guide and best TTS API comparison.

By TextToLab Research Team · Last verified June 2026. ELO rankings from Artificial Analysis Speech Arena. Bradley-Terry scores from Fish Audio's independent blind testing (71K pairs). Voxtral benchmarks from Mistral AI internal evaluation. Community data from r/LocalLLaMA. Pricing from official product pages. ElevenLabs affiliate link disclosed — all other recommendations are independent.