Review · 13 min read · May 27, 2026

By TextToLab Research Team

Kokoro TTS Review 2026: The 82M Parameter Model That Hit #1 on TTS Arena — For Free

Kokoro-82M topped the TTS Arena leaderboard, runs on CPU, costs $0, and fits in 300MB. Independent review with quality tests, ElevenLabs comparison, open-source TTS showdown, and setup guide.

The 30-Second Verdict

Kokoro is an 82-million-parameter open-source TTS model that hit #1 on the TTS Arena leaderboard in January 2026, beating models 10–100x its size. It runs on a CPU. The whole model is 300MB. It costs nothing. And in casual listening, most people can't tell it apart from ElevenLabs.

The catch: it only supports English (American and British accents), doesn't do voice cloning, and lacks the emotional range of commercial services. If you need multilingual support, custom voices, or enterprise reliability, you still need a paid provider. But for English narration, prototyping, accessibility, or any case where cost is the primary constraint — Kokoro is genuinely remarkable.

I've tested Kokoro against every major TTS provider we cover. The quality-to-size ratio shouldn't be possible. Here's what's real and what's hype.

Kokoro TTS at a Glance

Model: Kokoro-82M (v1.0)
Parameters: 82M (vs ElevenLabs: billions)
Model Size: 300MB (164MB FP16)
Architecture: StyleTTS 2 (modified)
License: Apache 2.0 (full commercial use)
Languages: English (US + British)
Hardware: Runs on CPU, no GPU required
Speed: 36–96x real-time on GPU
Price: $0 (open-weight)
TTS Arena: #1 (Jan 2026)

Why 82 Million Parameters Changes Everything

Here's the number that matters: ElevenLabs charges $5–$330/month for plans, plus $0.06–$0.18 per 1,000 characters on the API. Over three years on a $99–$330/month plan, that's $3,564–$11,880 in subscription fees alone. Kokoro costs $0. Forever. No API keys, no rate limits, no usage tracking.

The efficiency is the surprising part. Most commercial TTS models use billions of parameters and require high-end GPUs. Kokoro uses 82 million, trained on fewer than 100 hours of audio, and produces output that topped the TTS Arena leaderboard. On a free Google Colab GPU, it generates speech at 36x real-time speed. On an RTX 4090, it reaches 96x or more. On a modern CPU, it still runs at or above real-time.

The 300MB model file means you can embed Kokoro in a mobile app, run it on a Raspberry Pi, or deploy it on the cheapest cloud instances. That opens TTS use cases that were cost-prohibitive before: local accessibility tools, offline applications, IoT devices, and prototypes where you don't want to commit to an API provider.
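The file sizes quoted in this review follow almost directly from the parameter count. Here is a rough sketch of that arithmetic, assuming the weights dominate the file, that full precision means 4 bytes per parameter, and that FP16 means 2. The function name is illustrative, not part of any Kokoro tooling.

```python
def weight_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


PARAMS = 82_000_000  # Kokoro-82M

fp32_mb = weight_size_mb(PARAMS, 4)  # 32-bit floats, 4 bytes each
fp16_mb = weight_size_mb(PARAMS, 2)  # 16-bit floats, 2 bytes each

print(f"FP32: {fp32_mb:.0f} MB, FP16: {fp16_mb:.0f} MB")
```

The full-precision estimate comes out slightly above 300MB, and the half-precision estimate lands on 164MB, so the published sizes are about what you would expect from 82 million parameters.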

Voice Quality: What "#1 on the Arena" Actually Means

In January 2026, Kokoro-82M climbed to the top position on the TTS Arena leaderboard, beating XTTS v2 (467M parameters, 10,000+ training hours) and MetaVoice (1.2B parameters, 100,000+ hours). That ranking is real, but it needs context.

The TTS Arena tests short clips across a crowdsourced audience. For clean, well-punctuated English text — news articles, blog posts, documentation — Kokoro produces natural prosody that holds up against paid services. The pacing is good. Pronunciation is accurate. The voice presets sound human.

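Where it falls short is emotional range: delivery stays close to a neutral tone, which shows on expressive or dramatic material.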

My honest take: for 90% of English narration tasks, Kokoro delivers 90%+ of ElevenLabs quality. That last 10% matters if you're producing a polished audiobook or emotional content. It doesn't matter if you're reading a blog post aloud, building an accessibility tool, or prototyping a voice product.

Kokoro vs ElevenLabs: The Real Tradeoffs

Category | Kokoro | ElevenLabs
Price | $0 (open-weight) | $5–$330/mo + API fees
TTS Arena | #1 (Jan 2026) | #4 (ELO 1,179)
Emotional Range | Limited, neutral tone | Excellent, full range
Languages | English only (2 accents) | 70+ languages
Voice Cloning | No | Instant + Professional
Voices | ~10 presets + community | 4,000+ library
Hardware | CPU (no GPU needed) | Cloud API only
Latency | ~100ms local | 75–300ms API
Model Size | 300MB | Proprietary (cloud)
Commercial License | Apache 2.0 (unrestricted) | Per plan terms

The cost difference is staggering. A developer using ElevenLabs' Pro plan ($99/month) for three years spends $3,564. The same developer running Kokoro locally spends $0 on the model and maybe $5–$20/month ($180–$720 over three years) on cloud compute if they don't have local hardware. That's a potential saving of $2,844–$3,384 per user with cloud compute, or the full $3,564 on hardware they already own.
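The three-year arithmetic is easy to redo for your own numbers. This is a minimal sketch that only considers flat monthly fees; it ignores per-character API charges, electricity, and engineering time, and the function and variable names are made up for illustration.

```python
MONTHS = 36  # three years


def total_cost(monthly_fee: float, months: int = MONTHS) -> float:
    """Total spend for a flat monthly fee over the given number of months."""
    return monthly_fee * months


elevenlabs_pro = total_cost(99)    # ElevenLabs Pro plan at $99/month
kokoro_options = {
    "Kokoro on existing hardware": total_cost(0),
    "Kokoro on a $5/month instance": total_cost(5),
    "Kokoro on a $20/month instance": total_cost(20),
}

print(f"ElevenLabs Pro: ${elevenlabs_pro:,.0f}")
for label, cost in kokoro_options.items():
    saving = elevenlabs_pro - cost
    print(f"{label}: ${cost:,.0f} total, ${saving:,.0f} saved")
```

Swap in your own plan price or instance cost to see where the break-even sits for your usage.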

But ElevenLabs isn't just selling voice quality — it's selling convenience, ecosystem, and reliability. The studio editor, 4,000+ voices, instant voice cloning, and 70+ languages are features Kokoro simply doesn't have. For a full pricing breakdown, see our ElevenLabs pricing guide.

Kokoro vs Other Open-Source TTS: The 2026 Showdown

Kokoro isn't the only free TTS worth considering. 2026 has been an extraordinary year for open-source voice synthesis. Here's how the main contenders stack up:

Feature | Kokoro | Fish Audio S2 | Voxtral | Chatterbox | Dia (Nari)
Parameters | 82M | ~1B+ | 4B | ~400M | 1.6B / 2B
Quality Rank | TTS Arena #1 | Blind test BT 3.07 | 68.4% vs EL Flash | 63.75% vs EL | Best nonverbal
Languages | English only | 80+ | 9 | English only | English only
Voice Cloning | No | 10–30s | 3s | 5s | Reference only
GPU Required | No (CPU OK) | Yes (16GB+ VRAM) | Yes (GPU recommended) | Yes (10GB+ VRAM) | Yes (10GB+ VRAM)
API Pricing | $0 (self-host) | $15/1M (or self-host) | $16/1M via Mistral | $0 (self-host) | ~$40/1M via fal.ai
License | Apache 2.0 | Open weights | CC BY-NC 4.0 | MIT | Apache 2.0
Best For | CPU, budget, embed | Multilingual, quality | Voice cloning, EU | Clone, English | Dialogue, expression

When to Pick Which

Choose Kokoro if you need English-only TTS on minimal hardware, want zero ongoing cost, or need to embed a TTS engine in a resource-constrained environment. It's also the best option for prototyping before committing to a paid API.

Choose Fish Audio S2 Pro if you need the best overall quality across multiple languages, or want a hosted API without managing infrastructure. At $15/1M characters, it's the best value in paid TTS.

Choose Voxtral if you need voice cloning from just 3 seconds of audio with 9-language support, or want a model backed by Mistral AI's $6B+ ecosystem. Note: the CC BY-NC 4.0 license restricts commercial self-hosting.

Choose Chatterbox if voice cloning is the priority. It offers five-second cloning with MIT licensing and no watermarks. The quality trails behind ElevenLabs on emotional delivery but beats it on some specific tasks.

Choose Dia if you need multi-speaker dialogue with natural nonverbal expressions (laughing, sighing, hesitation) in a single pass. Nothing else does this as well.
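The recommendations in this section can be condensed into a small decision helper. Treat it as a sketch of this review's advice rather than a rule: the function name, the flags, and especially the order in which the checks run are assumptions made for illustration.

```python
def pick_tts(*, dialogue: bool = False, voice_cloning: bool = False,
             multilingual: bool = False, noncommercial_ok: bool = False) -> str:
    """Map a few requirements to one of the open-source models discussed here."""
    if dialogue:
        # Multi-speaker dialogue with nonverbal expressions in one pass.
        return "Dia"
    if voice_cloning and multilingual:
        # Voxtral is CC BY-NC 4.0, which restricts commercial self-hosting.
        return "Voxtral" if noncommercial_ok else "Fish Audio S2"
    if voice_cloning:
        # English-only cloning under an MIT license.
        return "Chatterbox"
    if multilingual:
        return "Fish Audio S2"
    # English-only, minimal hardware, zero ongoing cost.
    return "Kokoro"


print(pick_tts())                                       # plain English narration
print(pick_tts(voice_cloning=True))                     # English voice cloning
print(pick_tts(voice_cloning=True, multilingual=True))  # commercial multilingual cloning
```

Real projects will weigh more than four flags (latency, VRAM budget, hosting preference), so use this as a starting checklist.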

How to Get Started With Kokoro TTS

There are three ways to use Kokoro, depending on your technical comfort level:

Option 1: Browser (Zero Setup)

The fastest way to try Kokoro is through the HuggingFace Space or voice-generator.pages.dev. Type your text, pick a voice, hit generate. No account, no API key, no installation. The quality is identical to running it locally — it's the same model running on HuggingFace's infrastructure.

Option 2: Local Installation (5 Minutes)

Quick Setup

  1. Install dependencies: pip install kokoro soundfile
  2. Install espeak-ng (Linux: apt-get install espeak-ng, macOS: brew install espeak-ng)
  3. In Python, run from kokoro import KPipeline and create a pipeline with KPipeline(lang_code='a'); the model downloads automatically on first run
  4. Select a voice (e.g., af_heart for American female, bf_isabella for British female)
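Put together, the steps above might look like the following sketch. It follows the usage pattern shown in the hexgrad/kokoro README as we understand it: KPipeline takes a lang_code ("a" for American, "b" for British English), yields (graphemes, phonemes, audio) per text chunk, and produces 24 kHz audio. The helper names synthesize and segment_filename are our own, and the imports are deferred so the file loads even before the packages are installed.

```python
def segment_filename(prefix: str, index: int) -> str:
    """Name for the index-th audio segment, e.g. 'demo_000.wav'."""
    return f"{prefix}_{index:03d}.wav"


def synthesize(text: str, voice: str = "af_heart", lang_code: str = "a",
               prefix: str = "demo") -> list[str]:
    """Generate speech with Kokoro and write one WAV file per text chunk."""
    # Deferred imports: these need `pip install kokoro soundfile`.
    import soundfile as sf
    from kokoro import KPipeline

    # Creating the pipeline triggers the model download on first use.
    pipeline = KPipeline(lang_code=lang_code)

    paths = []
    # Each item unpacks to (graphemes, phonemes, audio) for one chunk of text.
    for i, (_graphemes, _phonemes, audio) in enumerate(pipeline(text, voice=voice)):
        path = segment_filename(prefix, i)
        sf.write(path, audio, 24000)  # assumed sample rate: 24 kHz
        paths.append(path)
    return paths
```

Calling synthesize("Kokoro runs on a laptop CPU.") should leave demo_000.wav in the working directory. For a British voice, pass voice="bf_isabella" together with lang_code="b".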

Option 3: Docker (Production)

For production deployments, the kokoro-fastapi Docker image gives you a REST API out of the box. CPU version: docker run -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.0post4. GPU images are also available for faster inference. Either way, you get an OpenAI-compatible API endpoint on port 8880.
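Once the container is up, any HTTP client can talk to it. The sketch below uses only the Python standard library and assumes the server mirrors OpenAI's speech route at /v1/audio/speech with model, input, voice, and response_format fields. The "kokoro" model name and the helper names are assumptions, so check the kokoro-fastapi README for the exact values your image expects.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8880/v1"  # port published by the docker run command


def build_speech_request(text: str, voice: str = "af_heart",
                         response_format: str = "mp3") -> urllib.request.Request:
    """Build a POST request for the OpenAI-style speech endpoint."""
    payload = {
        "model": "kokoro",  # assumed model identifier
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def speak(text: str, out_path: str = "speech.mp3", **kwargs) -> str:
    """Send the request to the running container and save the returned audio."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
    return out_path
```

With the container running, speak("Hello from Kokoro.") should write speech.mp3 next to the script. Because the route follows the OpenAI shape, an existing OpenAI client pointed at the same base URL is likely to work as well.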

Available Voices

Kokoro ships with ~10 official voice presets across American and British English accents, with male and female options. The community has created additional voice packs. Voice names follow a pattern: af_ (American female), am_ (American male), bf_ (British female), bm_ (British male).
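If you handle voice names programmatically, the prefix convention is simple to decode. This small helper is illustrative only (it is not part of the kokoro package) and covers just the four prefixes listed above.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(name: str) -> tuple[str, str]:
    """Return (accent, gender) for a voice name such as 'af_heart'."""
    prefix, sep, _rest = name.partition("_")
    valid = (
        sep == "_"
        and len(prefix) == 2
        and prefix[0] in ACCENTS
        and prefix[1] in GENDERS
    )
    if not valid:
        raise ValueError(f"unrecognized voice name: {name!r}")
    return ACCENTS[prefix[0]], GENDERS[prefix[1]]


print(describe_voice("af_heart"))     # American female voice
print(describe_voice("bf_isabella"))  # British female voice
```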

Hardware Requirements: What You Actually Need

Hardware | Speed | VRAM/RAM | Cost
Modern CPU (8+ cores) | 1–3x real-time | ~2GB RAM | $0 (existing hardware)
GTX 1060 6GB | 10–30x real-time | <2GB VRAM | $0 (existing hardware)
Free Colab GPU (T4) | 36x real-time | <2GB VRAM | $0
RTX 4090 | 96–200x real-time | <2GB VRAM | ~$1,500 (card)
Apple M4 | ~100ms latency | Unified memory | $0 (existing Mac)
Raspberry Pi 5 | Near real-time | ~2GB | ~$80

The key takeaway: Kokoro's 82M parameters and 300MB model size mean it runs on hardware you already own. Compare that to Dia TTS (needs 10GB VRAM), Fish Audio S2 (needs 16GB+ VRAM for self-hosting), or Voxtral (needs a serious GPU). Of the open-source models compared here, Kokoro is the only one that runs comfortably without a GPU.
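A speed of "Nx real-time" means one second of compute produces N seconds of audio, so batch time is just audio length divided by N. The sketch below estimates how long a 10-hour audiobook would take, using the low end of each speed range from the hardware table. The dictionary and function names are ours, and real throughput will vary with text, batching, and drivers.

```python
# Conservative (low-end) real-time factors from the hardware table.
REAL_TIME_FACTOR = {
    "Modern CPU (8+ cores)": 1,
    "GTX 1060 6GB": 10,
    "Free Colab GPU (T4)": 36,
    "RTX 4090": 96,
}


def synthesis_minutes(audio_hours: float, factor: float) -> float:
    """Minutes of compute needed to generate `audio_hours` of speech."""
    return audio_hours * 60 / factor


AUDIOBOOK_HOURS = 10
for hardware, factor in REAL_TIME_FACTOR.items():
    minutes = synthesis_minutes(AUDIOBOOK_HOURS, factor)
    print(f"{hardware}: about {minutes:.0f} min for {AUDIOBOOK_HOURS} h of audio")
```

Even the CPU-only case finishes a long narration job overnight, which is the practical meaning of "runs on hardware you already own."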

The Real Limitations (Don't Skip This)

Who Should (and Shouldn't) Use Kokoro

Great for:

  • Developers prototyping voice features
  • Accessibility tools and screen readers
  • Converting ebooks and articles to audio
  • Offline/edge applications with no internet
  • Privacy-sensitive environments (no data leaves your machine)
  • Budget-constrained projects ($0 forever)
  • Hobbyist projects and personal tools

Not recommended for:

  • Multilingual content (English only)
  • Fiction audiobooks needing emotional delivery
  • Products requiring voice cloning
  • Enterprise deployments needing SLAs
  • High-volume production without technical staff

If Kokoro doesn't fit your needs, compare all your options on our TTS pricing comparison page or calculate your exact costs with the TTS cost calculator.

By TextToLab Research Team · Last verified May 2026. Model data from Kokoro GitHub repository (hexgrad/kokoro). TTS Arena rankings from HuggingFace TTS Arena leaderboard. Hardware benchmarks from community testing on Spheron, Clore.ai, and ariya.io. Open-source TTS comparison data from Artificial Analysis Speech Arena, Fish Audio published blind test study, and Mistral AI technical report.