Review · 12 min read · June 5, 2026

By TextToLab Research Team

Voxtral TTS Review: Beats ElevenLabs 68% of the Time at 73% Lower Cost (2026)

Independent review of Mistral AI's Voxtral TTS — 4B parameters, 70ms latency, $16/1M characters. Blind test data, CC BY-NC 4.0 license gotcha, self-hosting costs, and head-to-head comparison with ElevenLabs, Kokoro, Fish Audio, and Chatterbox.

The 30-Second Verdict

Voxtral TTS is Mistral AI's 4-billion-parameter text-to-speech model that beats ElevenLabs Flash v2.5 in 68.4% of blind listening tests — and costs 73% less on the API. At $0.016 per 1,000 characters ($16/1M), it's one of the cheapest commercial TTS APIs available. Open weights are free to download for non-commercial use.

The model shipped on March 26, 2026, with zero-shot voice cloning from 3 seconds of audio, 70ms time-to-first-audio, streaming output, and support for 9 languages. It's built on a hybrid architecture that chains an autoregressive decoder with a flow-matching acoustic transformer — a design that produces noticeably more natural prosody than pure autoregressive approaches.

I've been testing Voxtral against every major TTS provider we cover on this site. Here's the honest breakdown: what's genuinely impressive, what's overhyped, and where the license catches most people off guard.

Voxtral TTS at a Glance

| Specification | Value |
|---|---|
| Model | Voxtral-4B-TTS-2603 |
| Parameters | 4B total (3.4B decoder + 390M flow-matching + 300M codec) |
| Developer | Mistral AI (Paris, valued ~$6B) |
| Architecture | Hybrid autoregressive + flow-matching |
| License | CC BY-NC 4.0 (open weights) / Commercial via API |
| Languages | 9 (EN, FR, DE, ES, NL, PT, IT, HI, AR) |
| Latency | 70ms TTFA (streaming) |
| Voice Cloning | Zero-shot from 3 seconds of audio |
| Preset Voices | 20 voices with implicit emotion |
| API Price | $0.016/1K chars ($16/1M) |
| Self-Host Price | $0 (non-commercial only) |

How Voxtral Works: The Three-Stage Pipeline

Most TTS models are either autoregressive (good at natural rhythm, slow at inference) or non-autoregressive (fast, but robotic pacing). Voxtral combines both. The architecture has three components that run in sequence, each handling a different part of speech generation: a 3.4B-parameter autoregressive decoder, a 390M-parameter flow-matching acoustic transformer, and a 300M-parameter codec.

The practical result: Voxtral generates speech at 9.7x real-time speed, which corresponds to a real-time factor of roughly 0.10 and makes streaming viable even on modest hardware. The hybrid approach explains why it sounds more natural than pure non-autoregressive models like Kokoro: it can model long-range dependencies in pacing that fixed-length methods miss.
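To make the speed figure concrete, here is a minimal sketch that converts an "N x real-time" number into a real-time factor and an estimated generation time. The function names are illustrative, and the only input taken from this review is the 9.7x figure; actual throughput will depend on hardware and batch size.

```python
def real_time_factor(speedup: float) -> float:
    """Convert an 'N x real-time' speed into a real-time factor (RTF)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Estimate the compute time needed to generate a clip of the given length."""
    return audio_seconds * real_time_factor(speedup)


VOXTRAL_SPEEDUP = 9.7  # figure quoted in this review

rtf = real_time_factor(VOXTRAL_SPEEDUP)
one_minute = synthesis_seconds(60.0, VOXTRAL_SPEEDUP)

print(f"RTF: {rtf:.3f}")
print(f"60 s of audio takes about {one_minute:.1f} s to generate")
```

An RTF below 1.0 means audio is produced faster than it plays back, which is the basic requirement for streaming without gaps.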

Voice Quality: What the Blind Tests Actually Show

Mistral published blind test results comparing Voxtral against ElevenLabs — the current commercial benchmark. The headline number is real, but the details matter:

| Test Scenario | Voxtral Win Rate | vs Model | Context |
|---|---|---|---|
| Multilingual zero-shot cloning | 68.4% | ElevenLabs Flash v2.5 | Across 9 languages |
| Preset voices (implicit emotion) | 58.3% | ElevenLabs Flash v2.5 | Flagship preset voices |
| Preset voices | 55.4% | ElevenLabs v3 (flagship) | Narrower margin |

The 68.4% number comes from multilingual zero-shot scenarios — cloned voices across different languages. That's Voxtral's strongest showing. Against ElevenLabs' flagship v3 model (which costs more per character), the gap narrows to 55.4%. Still a win, but a modest one.

In my testing, Voxtral's preset voices handle conversational content well — podcast-style narration, documentation reads, explainer videos. The prosody feels more grounded than Kokoro, especially for longer passages where Kokoro sometimes drifts into a monotone. French and Spanish output is notably good — Mistral's European roots show. Hindi and Arabic are functional but noticeably less polished than the European languages.

Voice Cloning: 3 Seconds to a Custom Voice

Voxtral's zero-shot voice cloning requires just 3 seconds of reference audio. That's the same minimum as ElevenLabs' Instant Voice Cloning. In practice, quality improves significantly with 10-30 seconds of clean audio.

Early community reports on the cloning quality are mixed. The voice timbre capture is strong — recognizably the same person. But some users have reported instability in cloned voice output for longer passages, particularly with non-English reference audio. Mistral has acknowledged this as an area for improvement.

The cloning works differently from ElevenLabs. ElevenLabs stores your voice profile on their servers and refines it over time. Voxtral processes the reference audio at inference time — no stored profile, no account needed. For privacy-conscious applications, this is a significant advantage. For consistency across many API calls, it means slight variations between generations.

Pricing: $16/1M Characters vs the Competition

Voxtral's API pricing through Mistral's platform is straightforward: $0.016 per 1,000 characters, or $16 per million. No subscription tiers, no character quotas — pure pay-per-use. That works out to roughly $0.79 per hour of generated audio at 150 words per minute.
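The per-hour figure can be reproduced with a few lines of arithmetic. This is a sketch under one stated assumption: an average of 5.5 characters per word including spaces, which is not given in the pricing itself but is the value that yields about $0.79 at 150 words per minute.

```python
def cost_per_audio_hour(price_per_million_chars: float,
                        words_per_minute: float = 150.0,
                        chars_per_word: float = 5.5) -> float:
    """Estimate USD cost for one hour of generated speech.

    chars_per_word (spaces included) is an assumption, not a published figure.
    """
    chars_per_hour = words_per_minute * 60 * chars_per_word
    return price_per_million_chars * chars_per_hour / 1_000_000


voxtral_hourly = cost_per_audio_hour(16.0)
print(f"Voxtral: ${voxtral_hourly:.2f} per hour of audio")
```

Denser text or a faster speaking rate raises the characters consumed per hour, so treat the result as a planning estimate rather than a quote.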

| Service | Price per 1M Chars | Cost per Hour | vs Voxtral |
|---|---|---|---|
| Voxtral (Mistral API) | $16 | ~$0.79 | baseline |
| Fish Audio | $15 (UTF-8 bytes) | ~$0.74 | 6% cheaper |
| Deepgram Aura-2 | $30 | ~$1.49 | 1.9x more |
| OpenAI TTS | $15–$100 | ~$0.74–$4.95 | 1–6x more |
| Cartesia Sonic | ~$35–$47 | ~$1.73–$2.33 | 2–3x more |
| ElevenLabs | $60–$165 | ~$2.97–$8.17 | 4–10x more |
| Kokoro (self-hosted) | $0 | $0 (+ hardware) | Free (English only) |

The pricing puts Voxtral in an interesting position. It's not the cheapest (Fish Audio's UTF-8 byte pricing beats it slightly for English text), but it's dramatically cheaper than ElevenLabs and Cartesia, and it beat ElevenLabs in Mistral's published blind tests. For a more complete pricing analysis, check our TTS pricing comparison.
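The "for English text" caveat matters because per-byte and per-character billing diverge once text leaves ASCII. The sketch below compares the two using the list prices from the table. It assumes Voxtral counts one billable character per Unicode code point, which this review does not confirm, and the sample strings are arbitrary.

```python
VOXTRAL_PRICE_PER_M_CHARS = 16.0  # USD per 1M characters, from the table above
FISH_PRICE_PER_M_BYTES = 15.0     # USD per 1M UTF-8 bytes, from the table above


def char_count(text: str) -> int:
    """Count Unicode code points (assumed billing unit for Voxtral)."""
    return len(text)


def utf8_byte_count(text: str) -> int:
    """Count UTF-8 bytes (billing unit listed for Fish Audio)."""
    return len(text.encode("utf-8"))


def relative_cost(text: str) -> float:
    """Fish Audio cost divided by Voxtral cost for the same text."""
    fish = utf8_byte_count(text) * FISH_PRICE_PER_M_BYTES
    voxtral = char_count(text) * VOXTRAL_PRICE_PER_M_CHARS
    return fish / voxtral


samples = {
    "English": "Hello, world",
    "French": "Caf\u00e9 d\u00e9j\u00e0 vu",
    "Hindi": "\u0928\u092e\u0938\u094d\u0924\u0947",
}

for language, text in samples.items():
    print(language, char_count(text), utf8_byte_count(text),
          round(relative_cost(text), 2))
```

A ratio below 1.0 means per-byte billing is cheaper for that text. Accented Latin text pushes the ratio slightly above 1.0, and scripts such as Devanagari, which use three bytes per code point, push it well above.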

Self-Hosting: Free but Not for Business

Here's where Voxtral gets confusing — and where I've seen many articles get it wrong.

License Warning

Voxtral's open weights are released under CC BY-NC 4.0 — that's non-commercial only. You can download, modify, and self-host Voxtral for free for research, personal projects, and education. But you cannot use the self-hosted model to generate audio for any commercial purpose. For commercial use, you must go through Mistral's API at $0.016/1K chars.

This is a significant difference from models like Kokoro (Apache 2.0) and Chatterbox (MIT), which allow full commercial self-hosting. If your use case is commercial, Voxtral's "open-source" angle is really an API product with a free trial for non-commercial users.

Hardware Requirements for Self-Hosting

| Configuration | VRAM | Model Size | Cloud GPU Cost |
|---|---|---|---|
| BF16 (unquantized, 16-bit) | 16GB minimum | ~8GB | ~$0.40–$0.80/hr (A100/L4) |
| INT8 quantized | 8GB | ~4GB | ~$0.20–$0.40/hr (T4/L4) |
| INT4 quantized | 6GB | ~3GB | ~$0.15–$0.30/hr (T4) |

For non-commercial self-hosting, the GPU requirements are reasonable. A consumer RTX 3060 (12GB) handles quantized inference comfortably. There's also a pure C implementation on GitHub that runs without Python dependencies — useful for embedded or edge deployments.
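The model sizes in the table follow from parameter count times bytes per parameter. Here is a rough sketch of that estimate; it covers weights only and ignores activations, caches, and runtime overhead, which is why the VRAM column is larger than the model size.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_size_gb(num_params: float, precision: str) -> float:
    """Estimate weight storage in decimal gigabytes for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


VOXTRAL_PARAMS = 4e9  # "4B total" from the spec table

sizes = {p: weight_size_gb(VOXTRAL_PARAMS, p) for p in BYTES_PER_PARAM}
for precision, gb in sizes.items():
    print(f"{precision}: ~{gb:.1f} GB of weights")
```

The raw INT4 estimate comes out lower than the ~3GB listed above. One possible explanation is that some layers stay at higher precision after quantization, but that is an assumption rather than something the model card states here.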

Voxtral vs Every Open-Source TTS Model

The open-source TTS landscape changed dramatically in early 2026. Here's how Voxtral stacks up against the best alternatives. For a deeper comparison, see our complete open-source TTS guide.

| Feature | Voxtral | Kokoro | Fish Audio S2 | Chatterbox |
|---|---|---|---|---|
| Parameters | 4B | 82M | 4.4B | ~500M |
| License | CC BY-NC 4.0 | Apache 2.0 | CC BY-NC-SA 4.0 | MIT |
| Commercial Self-Host | No | Yes | No | Yes |
| Languages | 9 | 1 (English) | 13+ | 1 (English) |
| Voice Cloning | Yes (3s) | No | Yes (10s) | Yes (6s) |
| GPU Required | Yes (16GB) | No (CPU works) | Yes (24GB) | Yes (8GB) |
| Streaming | Yes | No | Yes | No |
| Best For | Multilingual API use | English, low-cost | CJK languages | Commercial self-host |
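If you want to apply the comparison to your own constraints, the table can be encoded as data and filtered. This is an illustrative sketch: the `TTSModel` class and `shortlist` helper are hypothetical names, the values are copied from the table, and "13+" languages is stored as its lower bound of 13.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSModel:
    name: str
    license: str
    commercial_self_host: bool
    languages: int       # lower bound where the table says "13+"
    voice_cloning: bool
    min_vram_gb: int     # 0 means the model runs on CPU
    streaming: bool


MODELS = [
    TTSModel("Voxtral", "CC BY-NC 4.0", False, 9, True, 16, True),
    TTSModel("Kokoro", "Apache 2.0", True, 1, False, 0, False),
    TTSModel("Fish Audio S2", "CC BY-NC-SA 4.0", False, 13, True, 24, True),
    TTSModel("Chatterbox", "MIT", True, 1, True, 8, False),
]


def shortlist(models, *, commercial_self_host=False,
              need_cloning=False, max_vram_gb=None):
    """Return names of models that satisfy every requested constraint."""
    picks = []
    for m in models:
        if commercial_self_host and not m.commercial_self_host:
            continue
        if need_cloning and not m.voice_cloning:
            continue
        if max_vram_gb is not None and m.min_vram_gb > max_vram_gb:
            continue
        picks.append(m.name)
    return picks


print(shortlist(MODELS, commercial_self_host=True, need_cloning=True))
```

Asking for commercial self-hosting plus voice cloning leaves only Chatterbox, which matches the "Best For" row.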

Voxtral vs ElevenLabs: Should You Switch?

The honest answer: probably not yet, unless price is your primary driver. Voxtral wins on cost and is competitive on quality, but ElevenLabs still leads in several areas that matter for production use.

Where Voxtral wins: if you're building a European-language product, doing batch audio generation, or spending $500+/month on ElevenLabs API calls for English/French/Spanish content. At those volumes, switching to Voxtral saves roughly $365/month or more (73% of $500) with comparable quality. For a full pricing comparison, see our ElevenLabs pricing guide.
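The savings estimate depends on which ElevenLabs rate you are paying today. A minimal sketch of the arithmetic, assuming the low end of the ElevenLabs range from the pricing table ($60 per 1M characters) and the same character volume after switching:

```python
def monthly_savings(current_spend: float,
                    current_price_per_m: float,
                    new_price_per_m: float) -> float:
    """Estimate USD saved per month at an unchanged character volume."""
    millions_of_chars = current_spend / current_price_per_m
    new_spend = millions_of_chars * new_price_per_m
    return current_spend - new_spend


ELEVENLABS_PRICE = 60.0  # USD per 1M chars, low end of the table's range (assumption)
VOXTRAL_PRICE = 16.0     # USD per 1M chars

savings = monthly_savings(500.0, ELEVENLABS_PRICE, VOXTRAL_PRICE)
percent = 100 * savings / 500.0

print(f"Estimated savings: ${savings:.0f}/month ({percent:.0f}% lower)")
```

At higher ElevenLabs rates the same calculation yields larger savings, so the low-end price gives a conservative floor.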

Limitations and Gotchas

Who Should (and Shouldn't) Use Voxtral

Good fit

  • European-language content at scale
  • Developers wanting a cheap API with good quality
  • Researchers and hobbyists (free self-hosting)
  • Projects needing voice cloning without stored profiles
  • Multilingual content (EN/FR/DE/ES/PT)
  • Cost-sensitive batch audio generation

Not the right choice

  • Asian language content (use Fish Audio or ElevenLabs)
  • Enterprise needing SLAs and support
  • Commercial self-hosting (use Kokoro or Chatterbox)
  • Products needing SSML control (use Amazon Polly)
  • Voice agent apps needing sub-50ms latency (use Cartesia)

If Voxtral doesn't fit, compare all your options on our TTS pricing comparison page or find a free alternative in our open-source TTS guide. You can also calculate your exact costs with the TTS cost calculator.


By TextToLab Research Team · Last verified June 2026. Model data from Mistral AI technical report and HuggingFace model card (mistralai/Voxtral-4B-TTS-2603). Blind test results from Mistral AI published evaluation. Architecture specifications from the Voxtral TTS research paper (arXiv 2603.25551). Pricing verified against Mistral AI API documentation. Open-source TTS comparison data from Artificial Analysis Speech Arena, Fish Audio blind test study, Kokoro GitHub (hexgrad/kokoro), and Chatterbox documentation.