The 30-Second Verdict
Voxtral TTS is Mistral AI's 4-billion-parameter text-to-speech model that beats ElevenLabs Flash v2.5 in 68.4% of blind listening tests — and costs 73% less on the API. At $0.016 per 1,000 characters ($16/1M), it's one of the cheapest commercial TTS APIs available. Open weights are free to download for non-commercial use.
The model shipped on March 26, 2026, with zero-shot voice cloning from 3 seconds of audio, 70ms time-to-first-audio, streaming output, and support for 9 languages. It's built on a hybrid architecture that chains an autoregressive decoder with a flow-matching acoustic transformer, a design that produces noticeably more natural prosody than purely non-autoregressive approaches.
I've been testing Voxtral against every major TTS provider we cover on this site. Here's the honest breakdown: what's genuinely impressive, what's overhyped, and where the license catches most people off guard.
How Voxtral Works: The Three-Stage Pipeline
Most TTS models are either autoregressive (good at natural rhythm, slow at inference) or non-autoregressive (fast, but robotic pacing). Voxtral uses both. The architecture has three components that run in sequence, each handling a different part of speech generation:
- Decoder backbone (3.4B parameters): Built on Ministral 3B, this transformer autoregressively predicts semantic tokens, the "what to say and when" layer. It determines pacing, emphasis, and sentence-level prosody.
- Flow-matching acoustic transformer (390M): Takes the semantic tokens and converts them into detailed acoustic features, the "how it should sound" layer. This is where voice timbre, pitch contours, and micro-expression come from.
- Neural audio codec (300M): A convolutional-transformer autoencoder called Voxtral Codec that compresses 24kHz audio into 12.5Hz frames of 37 discrete tokens (1 semantic + 36 acoustic) at 2.14 kbps. It reconstructs the final waveform from the acoustic predictions.
The practical result: Voxtral generates speech at 9.7x real-time speed, a real-time factor of roughly 0.10, which makes streaming viable even on modest hardware. The hybrid approach explains why it sounds more natural than pure non-autoregressive models like Kokoro: it can model long-range dependencies in pacing that fixed-length methods miss.
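The numbers in this section can be sanity-checked with a few lines of arithmetic. The sketch below only restates figures quoted above; the constant and function names are my own, and the generation-time estimate ignores the 70ms time-to-first-audio and any network overhead.

```python
# Figures quoted in the article; all names here are illustrative.
DECODER_PARAMS = 3.4e9     # autoregressive decoder backbone
ACOUSTIC_PARAMS = 390e6    # flow-matching acoustic transformer
CODEC_PARAMS = 300e6       # neural audio codec

FRAME_RATE_HZ = 12.5       # codec frames per second
TOKENS_PER_FRAME = 37      # 1 semantic + 36 acoustic
BITRATE_BPS = 2140         # 2.14 kbps
SPEEDUP = 9.7              # generation speed relative to real time


def total_params() -> float:
    """Sum of the three components (the "4B" headline figure)."""
    return DECODER_PARAMS + ACOUSTIC_PARAMS + CODEC_PARAMS


def tokens_per_second() -> float:
    """Discrete tokens the codec emits per second of audio."""
    return FRAME_RATE_HZ * TOKENS_PER_FRAME


def bits_per_frame() -> float:
    """Bits available to encode one 37-token frame at the quoted bitrate."""
    return BITRATE_BPS / FRAME_RATE_HZ


def real_time_factor() -> float:
    """Compute seconds needed per second of audio (lower is faster)."""
    return 1.0 / SPEEDUP


def generation_seconds(audio_seconds: float) -> float:
    """Rough compute time for a clip, ignoring startup latency."""
    return audio_seconds * real_time_factor()


if __name__ == "__main__":
    print(f"total parameters: {total_params() / 1e9:.2f}B")
    print(f"tokens per second: {tokens_per_second():.1f}")
    print(f"bits per frame: {bits_per_frame():.1f}")
    print(f"real-time factor: {real_time_factor():.3f}")
    print(f"60 s of audio takes about {generation_seconds(60):.1f} s")
```

The three components add up to roughly 4.09B parameters, which is where the rounded "4B" comes from.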
Voice Quality: What the Blind Tests Actually Show
Mistral published blind test results comparing Voxtral against ElevenLabs — the current commercial benchmark. The headline number is real, but the details matter:
| Test Scenario | Voxtral Win Rate | vs Model | Context |
|---|---|---|---|
| Multilingual zero-shot cloning | 68.4% | ElevenLabs Flash v2.5 | Across 9 languages |
| Preset voices (implicit emotion) | 58.3% | ElevenLabs Flash v2.5 | Flagship preset voices |
| Preset voices | 55.4% | ElevenLabs v3 (flagship) | Narrower margin |
The 68.4% number comes from multilingual zero-shot scenarios, meaning cloned voices across different languages. That's Voxtral's strongest showing. Against ElevenLabs' flagship v3 model (which costs more per character), the win rate drops to 55.4%. Still a win, but a modest one.
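One way to get a feel for how different these margins are is to convert each win rate into odds. This is a minimal sketch, assuming each percentage is a simple pairwise preference rate with ties excluded (the published results as quoted here don't say how ties were handled); the function name is mine.

```python
def win_odds(win_rate: float) -> float:
    """Convert a pairwise win rate (0..1) into odds of being preferred."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return win_rate / (1.0 - win_rate)


# Win rates from the table above.
RESULTS = {
    "Zero-shot cloning vs Flash v2.5": 0.684,
    "Preset voices vs Flash v2.5": 0.583,
    "Preset voices vs v3": 0.554,
}

if __name__ == "__main__":
    for scenario, rate in RESULTS.items():
        print(f"{scenario}: {win_odds(rate):.2f} to 1")
```

Read this way, listeners preferred Voxtral a bit more than twice as often in the cloning test, but only about five times for every four against v3.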
In my testing, Voxtral's preset voices handle conversational content well — podcast-style narration, documentation reads, explainer videos. The prosody feels more grounded than Kokoro, especially for longer passages where Kokoro sometimes drifts into a monotone. French and Spanish output is notably good — Mistral's European roots show. Hindi and Arabic are functional but noticeably less polished than the European languages.
Voice Cloning: 3 Seconds to a Custom Voice
Voxtral's zero-shot voice cloning requires just 3 seconds of reference audio. That's the same minimum as ElevenLabs' Instant Voice Cloning. In practice, quality improves significantly with 10-30 seconds of clean audio.
Early community reports on the cloning quality are mixed. The voice timbre capture is strong — recognizably the same person. But some users have reported instability in cloned voice output for longer passages, particularly with non-English reference audio. Mistral has acknowledged this as an area for improvement.
The cloning works differently from ElevenLabs. ElevenLabs stores your voice profile on its servers and refines it over time. Voxtral processes the reference audio at inference time: no stored profile, no account needed. For privacy-conscious applications, this is a significant advantage. The trade-off is consistency: because the voice is re-derived from the reference clip on every call, expect slight variations between generations.
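Since the reference clip is sent with every request, it's worth checking its length before you use it. The sketch below is a hypothetical pre-flight check built on Python's standard `wave` module. The 3-second minimum and the 10-second "recommended" threshold come from the figures above; the function names and labels are my own and are not part of any Mistral tooling.

```python
import wave

MIN_SECONDS = 3.0           # minimum reference length quoted above
RECOMMENDED_SECONDS = 10.0  # lower end of the 10-30 s range quoted above


def clip_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def rate_reference_clip(duration: float) -> str:
    """Classify a reference clip by length (labels are illustrative)."""
    if duration < MIN_SECONDS:
        return "too short"
    if duration < RECOMMENDED_SECONDS:
        return "usable"
    return "recommended"


if __name__ == "__main__":
    import sys

    for wav_path in sys.argv[1:]:
        seconds = clip_duration_seconds(wav_path)
        print(f"{wav_path}: {seconds:.1f} s ({rate_reference_clip(seconds)})")
```

This only measures length. It says nothing about noise or clipping, which matter just as much for cloning quality.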
Pricing: $16/1M Characters vs the Competition
Voxtral's API pricing through Mistral's platform is straightforward: $0.016 per 1,000 characters, or $16 per million. No subscription tiers, no character quotas, just pay-per-use. That works out to roughly $0.79 per hour of generated audio at 150 words per minute, assuming about 5.5 characters per word including spaces.
| Service | Price per 1M Chars | Cost per Hour | vs Voxtral |
|---|---|---|---|
| Voxtral (Mistral API) | $16 | ~$0.79 | — |
| Fish Audio | $15 (UTF-8 bytes) | ~$0.74 | 6% cheaper |
| Deepgram Aura-2 | $30 | ~$1.49 | 1.9x more |
| OpenAI TTS | $15–$100 | ~$0.74–$4.95 | 1–6x more |
| Cartesia Sonic | ~$35–$47 | ~$1.73–$2.33 | 2–3x more |
| ElevenLabs | $60–$165 | ~$2.97–$8.17 | 4–10x more |
| Kokoro (self-hosted) | $0 | $0 (+ hardware) | Free (English only) |
The pricing puts Voxtral in an interesting position. It's not the cheapest (Fish Audio's UTF-8 byte pricing beats it slightly for English text), but it's dramatically cheaper than ElevenLabs and Cartesia, and it beat ElevenLabs in Mistral's published blind tests. For a more complete pricing analysis, check our TTS pricing comparison.
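The per-hour figures follow from a simple conversion, sketched below. The 150 words per minute comes from the text above; the 5.5 characters per word (spaces included) is my assumption, chosen because it reproduces the ~$0.79 figure for Voxtral. Your own text may run denser or lighter.

```python
WORDS_PER_MINUTE = 150   # speaking rate used in the article
CHARS_PER_WORD = 5.5     # assumption: average word length incl. trailing space


def chars_per_hour(wpm: float = WORDS_PER_MINUTE,
                   chars_per_word: float = CHARS_PER_WORD) -> float:
    """Characters of input text needed for one hour of speech."""
    return wpm * 60 * chars_per_word


def cost_per_hour(price_per_million_chars: float) -> float:
    """Dollar cost of one hour of audio at a given per-1M-character price."""
    return price_per_million_chars * chars_per_hour() / 1_000_000


if __name__ == "__main__":
    for name, price in [("Voxtral", 16), ("ElevenLabs (low)", 60),
                        ("ElevenLabs (high)", 165)]:
        print(f"{name}: ${cost_per_hour(price):.2f} per hour")
```

Swap in your own speaking rate or average word length to see how sensitive the hourly cost is to those assumptions.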
Self-Hosting: Free but Not for Business
Here's where Voxtral gets confusing — and where I've seen many articles get it wrong.
License Warning
Voxtral's open weights are released under CC BY-NC 4.0 — that's non-commercial only. You can download, modify, and self-host Voxtral for free for research, personal projects, and education. But you cannot use the self-hosted model to generate audio for any commercial purpose. For commercial use, you must go through Mistral's API at $0.016/1K chars.
This is a significant difference from models like Kokoro (Apache 2.0) and Chatterbox (MIT), which allow full commercial self-hosting. If your use case is commercial, Voxtral's "open-source" angle is really an API product with a free trial for non-commercial users.
Hardware Requirements for Self-Hosting
| Configuration | VRAM | Model Size | Cloud GPU Cost |
|---|---|---|---|
| BF16 (unquantized) | 16GB minimum | ~8GB | ~$0.40–$0.80/hr (A100/L4) |
| INT8 quantized | 8GB | ~4GB | ~$0.20–$0.40/hr (T4/L4) |
| INT4 quantized | 6GB | ~3GB | ~$0.15–$0.30/hr (T4) |
For non-commercial self-hosting, the GPU requirements are reasonable. A consumer RTX 3060 (12GB) handles quantized inference comfortably. There's also a pure C implementation on GitHub that runs without Python dependencies — useful for embedded or edge deployments.
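The model sizes in the table track a back-of-the-envelope rule: parameter count times bytes per parameter. Here is a minimal sketch of that estimate, assuming a flat 4B parameters and uniform precision across all weights. Real checkpoints often keep some layers at higher precision, which may be why the INT4 row lists ~3GB and not the 2GB this formula gives. VRAM needs sit above the weight size because of activations and caches.

```python
PARAMS = 4e9  # headline parameter count from the article


def weight_size_gb(params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


def fits(vram_gb: float, required_gb: float) -> bool:
    """True if a card's VRAM meets a configuration's stated requirement."""
    return vram_gb >= required_gb


if __name__ == "__main__":
    for label, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{label}: ~{weight_size_gb(PARAMS, bits):.0f} GB of weights")
    # RTX 3060 (12GB) against the VRAM column of the table above.
    print("RTX 3060 runs INT8:", fits(12, 8))
    print("RTX 3060 runs BF16:", fits(12, 16))
```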
Voxtral vs Every Open-Source TTS Model
The open-source TTS landscape changed dramatically in early 2026. Here's how Voxtral stacks up against the best alternatives. For a deeper comparison, see our complete open-source TTS guide.
| Feature | Voxtral | Kokoro | Fish Audio S2 | Chatterbox |
|---|---|---|---|---|
| Parameters | 4B | 82M | 4.4B | ~500M |
| License | CC BY-NC 4.0 | Apache 2.0 | CC BY-NC-SA 4.0 | MIT |
| Commercial Self-Host | No | Yes | No | Yes |
| Languages | 9 | 1 (English) | 13+ | 1 (English) |
| Voice Cloning | Yes (3s) | No | Yes (10s) | Yes (6s) |
| GPU Required | Yes (16GB) | No (CPU works) | Yes (24GB) | Yes (8GB) |
| Streaming | Yes | No | Yes | No |
| Best For | Multilingual API use | English, low-cost | CJK languages | Commercial self-host |
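If you'd rather filter than eyeball the table, the same data fits in a small structure. This sketch encodes only the rows above; the class, field names, and `shortlist` helper are hypothetical, and "13+" languages is stored as 13.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TTSModel:
    name: str
    commercial_self_host: bool
    languages: int
    voice_cloning: bool
    streaming: bool
    min_vram_gb: int  # 0 means it runs on CPU


# Values copied from the comparison table above.
MODELS = [
    TTSModel("Voxtral", False, 9, True, True, 16),
    TTSModel("Kokoro", True, 1, False, False, 0),
    TTSModel("Fish Audio S2", False, 13, True, True, 24),
    TTSModel("Chatterbox", True, 1, True, False, 8),
]


def shortlist(models: List[TTSModel], *,
              commercial_self_host: bool = False,
              voice_cloning: bool = False,
              streaming: bool = False,
              min_languages: int = 1,
              max_vram_gb: Optional[int] = None) -> List[str]:
    """Return names of models meeting every requirement that is set."""
    picks = []
    for m in models:
        if commercial_self_host and not m.commercial_self_host:
            continue
        if voice_cloning and not m.voice_cloning:
            continue
        if streaming and not m.streaming:
            continue
        if m.languages < min_languages:
            continue
        if max_vram_gb is not None and m.min_vram_gb > max_vram_gb:
            continue
        picks.append(m.name)
    return picks


if __name__ == "__main__":
    print(shortlist(MODELS, commercial_self_host=True, voice_cloning=True))
    print(shortlist(MODELS, streaming=True, min_languages=9))
```

The first query is the one most commercial teams care about, and it leaves a single candidate, which matches the "Best For" row.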
Voxtral vs ElevenLabs: Should You Switch?
The honest answer: probably not yet, unless price is your primary driver. Voxtral wins on cost and competitive quality, but ElevenLabs still leads in several areas that matter for production use:
- Language coverage: ElevenLabs supports 70+ languages vs Voxtral's 9. If you need Japanese, Korean, Mandarin, or most other Asian languages beyond Hindi and Arabic, Voxtral isn't an option.
- Voice library: ElevenLabs has 1,000+ community voices and Professional Voice Cloning with higher fidelity. Voxtral has 20 presets.
- Ecosystem: ElevenLabs offers ElevenReader, Spotify integration, Conversational AI agents, Projects for long-form, and dubbing tools. Voxtral is TTS only.
- Stability: ElevenLabs has been in production for 3+ years. Voxtral shipped 10 weeks ago. Enterprise SLAs, uptime guarantees, and support maturity aren't there yet.
Where Voxtral wins: if you're building a European-language product, doing batch audio generation, or spending $500+/month on ElevenLabs API calls for English/French/Spanish content. At those volumes, switching to Voxtral saves roughly $365+/month (73% of the bill at ElevenLabs' $60/1M rate, more on pricier tiers) with comparable quality. For a full pricing comparison, see our ElevenLabs pricing guide.
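How much you'd actually save depends on which ElevenLabs rate you pay today. The sketch below assumes the same character volume moves over unchanged and uses the $60–$165 per 1M range from the pricing table; the function name is mine.

```python
VOXTRAL_PRICE_PER_MILLION = 16.0  # from the pricing section


def monthly_savings(current_spend: float,
                    current_price_per_million: float,
                    new_price_per_million: float = VOXTRAL_PRICE_PER_MILLION
                    ) -> float:
    """Dollars saved per month if the same character volume moves over."""
    millions_of_chars = current_spend / current_price_per_million
    return current_spend - millions_of_chars * new_price_per_million


if __name__ == "__main__":
    for rate in (60, 165):
        saved = monthly_savings(500, rate)
        print(f"$500/month at ${rate}/1M chars: save about ${saved:.0f}")
```

At the low end of ElevenLabs' range the saving on a $500 bill is about $367; at the high end it is about $452.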
Limitations and Gotchas
- License confusion: Multiple articles incorrectly describe Voxtral as "Apache 2.0" or "fully open-source." It's CC BY-NC 4.0. Non-commercial self-hosting only. Commercial use requires the paid API.
- 9 languages vs 70+: The European language coverage is solid, but no support for CJK (Chinese, Japanese, Korean), Turkish, Thai, Vietnamese, or any African language. For CJK, look at Fish Audio S2 Pro.
- Voice cloning instability: Community reports of inconsistent output when using cloned voices for longer passages, particularly for non-English reference audio.
- No SSML support: Unlike Amazon Polly and Google Cloud TTS, Voxtral doesn't accept SSML markup for fine-grained control of pauses, emphasis, or pronunciation.
- API only for commercial: Unlike Kokoro and Chatterbox, you can't run Voxtral on your own servers for a commercial product. That's a dealbreaker for applications that require data sovereignty or air-gapped deployments.
- Young model: 10 weeks old. Edge cases, long-form coherence, and pronunciation of domain-specific terminology haven't been battle-tested at scale the way ElevenLabs or Amazon Polly have.
Who Should (and Shouldn't) Use Voxtral
Good fit
- European-language content at scale
- Developers wanting cheap API with good quality
- Researchers and hobbyists (free self-hosting)
- Projects needing voice cloning without stored profiles
- Multilingual content (EN/FR/DE/ES/PT)
- Cost-sensitive batch audio generation
Not the right choice
- East Asian language content (use Fish Audio or ElevenLabs)
- Enterprise needing SLAs and support
- Commercial self-hosting (use Kokoro or Chatterbox)
- Products needing SSML control (use Amazon Polly)
- Voice agent apps needing sub-50ms latency (use Cartesia)
If Voxtral doesn't fit, compare all your options on our TTS pricing comparison page or find a free alternative in our open-source TTS guide. You can also calculate your exact costs with the TTS cost calculator.
By TextToLab Research Team · Last verified June 2026. Model data from Mistral AI technical report and HuggingFace model card (mistralai/Voxtral-4B-TTS-2603). Blind test results from Mistral AI published evaluation. Architecture specifications from the Voxtral TTS research paper (arXiv 2603.25551). Pricing verified against Mistral AI API documentation. Open-source TTS comparison data from Artificial Analysis Speech Arena, Fish Audio blind test study, Kokoro GitHub (hexgrad/kokoro), and Chatterbox documentation.