The Short Answer
The best text-to-speech API depends on what you're building. ElevenLabs ($60–$180/1M chars) wins for voice quality and cloning. Cartesia Sonic 3 (40ms TTFA) wins for real-time voice agents. Fish Audio S2 Pro ($15/1M chars) wins for quality-per-dollar. OpenAI TTS ($15/1M) wins for simplicity if you're already in the OpenAI ecosystem. And Kokoro ($0, runs on CPU) wins for prototyping and cost-sensitive projects.
I spent weeks testing every major TTS API and checking the results against the pricing data in our 10 individual pricing breakdowns. Most other "best TTS API" lists on Google are written by a TTS provider ranking itself first. This one isn't. Our only affiliate relationship is with ElevenLabs (disclosed), and we'll tell you when a $15/1M option genuinely beats the $180/1M one.
Quick Comparison: 12 TTS APIs at a Glance
Prices verified against official documentation as of May 2026. Arena rankings from the Speech Arena blind evaluation (crowd-sourced ELO).
| API | Price / 1M Chars | Latency (TTFA) | Languages | Voice Cloning | Free Tier |
|---|---|---|---|---|---|
| ElevenLabs v3 | $60–$180 | 75–300ms | 32 | Yes (30s sample) | 10K chars/mo |
| Fish Audio S2 Pro | $15 | ~200ms | 80+ | Yes (10–30s) | 1K chars/day |
| Cartesia Sonic 3 | $20–$33 | 40ms | 42 | Yes (3s sample) | 20K chars/mo |
| OpenAI TTS | $15 (tts-1) | 200–400ms | 57 | No | None |
| Amazon Polly | $4–$30 | 100–200ms | 33 | No | 5M chars/12mo |
| Deepgram Aura-2 | $27–$30 | ~150ms | 7 | No | $200 credit |
| Inworld TTS-2 | $25–$35 | <200ms | 25+ | Custom voices | Contact sales |
| Google Cloud TTS | $4–$30 | 100–250ms | 70+ | Custom Voice (enterprise) | 1M chars/mo |
| Azure Speech | $16–$24 | 100–200ms | 140+ | Custom Neural Voice | 500K chars/mo |
| Murf Falcon API | $10–$30 | 150–300ms | 20 | No | None |
| Kokoro-82M | $0 (self-hosted) | ~50ms (GPU) | 1 (English) | No | Free forever |
| Chatterbox | $0 (self-hosted) | Variable | 1 (English) | Yes (5s sample) | Free forever |
Use our TTS cost calculator to estimate costs for your specific usage volume.
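As a rough illustration of the arithmetic behind such an estimate, here is a minimal sketch that applies per-1M-character price ranges to a monthly volume. The names `PRICE_PER_MILLION`, `monthly_cost_range`, and `cheapest_first` are hypothetical, the rates are copied from a subset of the table rows above, and the sketch ignores free tiers, subscription plans, and surcharges.

```python
from typing import List, Tuple

# USD per 1M characters (low, high), copied from a subset of the table above.
PRICE_PER_MILLION = {
    "ElevenLabs v3": (60.0, 180.0),
    "Fish Audio S2 Pro": (15.0, 15.0),
    "Cartesia Sonic 3": (20.0, 33.0),
    "OpenAI TTS (tts-1)": (15.0, 15.0),
    "Deepgram Aura-2": (27.0, 30.0),
}


def monthly_cost_range(chars_per_month: int, low: float, high: float) -> Tuple[float, float]:
    """Return the (min, max) monthly cost in USD for a per-1M-character range."""
    millions = chars_per_month / 1_000_000
    return (millions * low, millions * high)


def cheapest_first(chars_per_month: int) -> List[Tuple[str, float, float]]:
    """List (provider, min cost, max cost), sorted by the worst-case cost."""
    rows = []
    for name, (low, high) in PRICE_PER_MILLION.items():
        lo_cost, hi_cost = monthly_cost_range(chars_per_month, low, high)
        rows.append((name, lo_cost, hi_cost))
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    for name, lo_cost, hi_cost in cheapest_first(5_000_000):
        print(f"{name:22s} ${lo_cost:,.0f} to ${hi_cost:,.0f} per month")
```

At 5M characters a month, this puts ElevenLabs at $300 to $900 and Fish Audio at $75, which is the 4–12x gap in concrete terms.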
Best Overall: ElevenLabs
ElevenLabs is still the default recommendation for most projects, and it's not close on voice quality. Their v3 model handles emotion, pacing, and pronunciation better than any other commercial API. Voice cloning from 30 seconds of audio actually works — I've tested it against Cartesia (3-second cloning) and Fish Audio (10-second), and ElevenLabs produces the most consistent, natural-sounding clones.
The Python and Node.js SDKs are excellent, and WebSocket streaming is stable. The documentation is the best in the industry: it shows working code examples instead of pseudo-code. And the 1,000+ pre-built voices mean you can ship a prototype in hours.
The downside is cost. At $60–$180/1M characters on the API, ElevenLabs is 4–12x more expensive than Fish Audio or OpenAI's tts-1 (both $15/1M). For high-volume production (generating millions of characters monthly), that difference compounds fast. See our full ElevenLabs pricing breakdown for tier-by-tier numbers.
- Best for: Content creation, audiobooks, apps where voice quality is the primary differentiator
- SDK support: Python, Node.js, Go, REST, WebSocket streaming
- Gotcha: Free tier is only 10K characters/month — about 2 minutes of speech. You'll burn through it testing.
Best Quality-Per-Dollar: Fish Audio S2 Pro
Fish Audio is the most underrated TTS API in 2026. Their S2 Pro model ranked #1 in blind A/B tests with a Bradley-Terry score of 3.07 from 71,000+ evaluation pairs, meaning listeners preferred it to every other model, including ElevenLabs, when they couldn't see the brand name. At $15/1M characters, that's up to 12x cheaper than ElevenLabs ($60–$180/1M) for arguably better raw quality.
The 80+ language support is legitimate — Fish Audio is particularly strong in CJK languages (Chinese, Japanese, Korean), which is where most Western TTS providers fall flat. Voice cloning works from 10–30 seconds of audio.
Watch out for the billing model: Fish Audio charges per UTF-8 byte, not per character. For English text, 1 character ≈ 1 byte. For Chinese or Japanese text, 1 character = 3 bytes — tripling your effective cost. At $45/1M CJK characters, it's still cheaper than ElevenLabs, but the headline "$15/1M" rate is misleading for non-Latin scripts. Read our Fish Audio vs ElevenLabs comparison for the full picture.
- Best for: High-volume production, multilingual content, CJK languages, cost-conscious teams that still need top-tier quality
- SDK support: Python, REST, WebSocket
- Gotcha: UTF-8 byte billing makes CJK text 3x the headline rate
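To see the byte-billing effect on your own text before sending it, a minimal sketch like the following can help. It only measures UTF-8 length with Python's standard `str.encode`. The function names are hypothetical, and the $15 per 1M bytes default is the headline rate quoted above, not a value read from any Fish Audio API.

```python
def billed_bytes(text: str) -> int:
    """Number of UTF-8 bytes in the text, which is the billing unit described above."""
    return len(text.encode("utf-8"))


def effective_rate_per_million_chars(text: str, usd_per_million_bytes: float = 15.0) -> float:
    """Effective USD per 1M *characters* for this particular text."""
    if not text:
        raise ValueError("text must not be empty")
    bytes_per_char = billed_bytes(text) / len(text)
    return bytes_per_char * usd_per_million_bytes


if __name__ == "__main__":
    for sample in ("Hello, world", "你好世界"):
        print(
            f"{sample!r}: {len(sample)} chars, {billed_bytes(sample)} bytes, "
            f"${effective_rate_per_million_chars(sample):.2f} per 1M chars"
        )
```

The English sample is 12 characters and 12 bytes, so it bills at the headline $15/1M. The Chinese sample is 4 characters and 12 bytes, so it bills at an effective $45/1M. Mixed-script text lands somewhere in between, which is why measuring a representative sample beats assuming a flat 3x.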
Best for Real-Time Voice Agents: Cartesia Sonic 3
If you're building a voice agent — phone bot, customer service AI, interactive NPC — latency matters more than voice quality. Users tolerate slightly mechanical speech far better than awkward silence. Cartesia's Sonic 3 delivers 40ms time-to-first-audio, beating every other provider by 2–5x. The Turbo variants push that even lower.
Built on SSM architecture (from Stanford founders with $100M+ in funding), Cartesia's approach is fundamentally different from transformer-based TTS. It trades some voice expressiveness for deterministic, low-latency streaming via WebSocket. Voice cloning from just 3 seconds of audio is fast but less accurate than ElevenLabs' 30-second approach.
Pricing is credit-based and confusing at first — see our Cartesia pricing breakdown for the effective per-character rates. At scale, expect $20–$33/1M characters with 1.5x surcharges for Pro Voice Cloning and per-minute phone connection fees ($0.014/min).
- Best for: Voice agents, phone bots, real-time conversational AI, gaming NPCs
- SDK support: Python, Node.js, WebSocket streaming, gRPC
- Gotcha: Voice quality ranks #10 on the Arena (ELO 1,054), noticeably below ElevenLabs and Fish Audio in side-by-side tests. See our Cartesia vs ElevenLabs comparison.
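Published TTFA numbers are worth checking from your own network location. The sketch below shows one way to time the first audio chunk from any streaming iterator. `measure_ttfa` and `fake_stream` are hypothetical names, and `fake_stream` is only a stand-in that simulates a delay; in practice you would pass the chunk iterator returned by whichever provider SDK you use.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfa(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = time.perf_counter()
    ttfa = None
    total = 0
    for chunk in chunks:
        if chunk and ttfa is None:
            # First audio has arrived: record the elapsed time once.
            ttfa = time.perf_counter() - start
        total += len(chunk)
    if ttfa is None:
        raise ValueError("stream produced no audio")
    return ttfa, total


def fake_stream(first_delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Simulated stream: waits, then yields fixed-size chunks of silence."""
    time.sleep(first_delay_s)
    for _ in range(n_chunks):
        yield b"\x00" * 320


if __name__ == "__main__":
    ttfa, total = measure_ttfa(fake_stream())
    print(f"TTFA: {ttfa * 1000:.0f} ms, received {total} bytes")
```

For the measurement to be meaningful, the request has to be sent lazily (when iteration starts) or the timer has to be started just before the request call. Run it many times and look at the median and tail, not a single sample.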
Best for OpenAI Stack: OpenAI TTS
If you're already paying for GPT-4o and using the OpenAI SDK, adding TTS is two lines of code. That's the real value proposition — not voice quality, which is mid-tier, but zero additional vendor management.
Three models are available: tts-1 ($15/1M chars, lower quality but faster), tts-1-hd ($30/1M, higher quality), and gpt-4o-mini-tts (~$15/1M with steerable instructions). The steerable model is genuinely useful: you can tell it "speak excitedly" or "whisper this part" in plain English. You get 11 preset voices and 57 languages, but no cloning.
The biggest limitation: no free tier. At all. You pay from character one. ElevenLabs gives you 10K free, Google gives 1M free, Amazon gives 5M free for 12 months. OpenAI gives you nothing. For prototyping and testing, that's a real friction point. Full breakdown in our OpenAI TTS pricing guide.
- Best for: Teams already using OpenAI APIs who want unified billing and a single SDK
- SDK support: Python, Node.js (official), REST
- Gotcha: No free tier, no voice cloning, 11 preset voices only
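For reference, here is a minimal sketch of the underlying REST call using only the standard library, so it runs without the official SDK installed. The endpoint path, the `model`, `voice`, and `input` fields, and the optional `instructions` field reflect my understanding of OpenAI's speech endpoint; verify them against the current API reference before relying on this. The helper names are hypothetical.

```python
import json
import urllib.request
from typing import Optional

SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(
    text: str,
    api_key: str,
    model: str = "tts-1",
    voice: str = "alloy",
    instructions: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for speech synthesis."""
    payload = {"model": model, "voice": voice, "input": text}
    if instructions is not None:
        # Assumed to apply only to the steerable model (gpt-4o-mini-tts).
        payload["instructions"] = instructions
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text: str, api_key: str, path: str = "speech.mp3", **kwargs) -> None:
    """Send the request and write the returned audio bytes to disk (billed call)."""
    request = build_speech_request(text, api_key, **kwargs)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling `synthesize_to_file("Hello there", api_key)` makes a real, billed request, since there is no free tier. Keeping request construction separate from sending makes it possible to inspect or unit-test the payload without spending anything.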
Best for AWS Stack: Amazon Polly
Amazon Polly is the cheapest commercial TTS API if you're already on AWS. Standard voices at $4/1M characters, Neural voices at $16/1M, and the newer Generative engine at $30/1M. The free tier — 5M standard or 1M neural characters per month for 12 months — is the most generous in the industry.
Where Polly shines is infrastructure integration. SpeechMarks give you word-level timing for lip sync and subtitles. SSML support is the most complete of any provider. Integration with AWS Lambda, S3, CloudFront is turnkey. For IVR systems, accessibility features, or any AWS-native application, Polly is hard to beat on price.
The tradeoff is voice quality. Even Polly's Neural voices sound noticeably more robotic than ElevenLabs, Fish Audio, or even OpenAI. The Generative engine closes the gap but is 7.5x the standard price. And there's no voice cloning at all. Full analysis in our Amazon Polly pricing guide.
- Best for: AWS-native apps, IVR/telephony, accessibility features, high-volume/low-cost batch processing
- SDK support: Python (boto3), Java, .NET, Go, PHP, Ruby — full AWS SDK coverage
- Gotcha: SSML markup counts toward billing, adding 10–30% to effective cost
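The word-level timing mentioned above arrives as newline-delimited JSON. The sketch below parses that output into (word, start time) pairs for subtitles or lip sync. It assumes the speech-mark shape I understand Polly to return (one JSON object per line with `time` in milliseconds, `type`, and `value`). The sample data is illustrative rather than captured from a real call, and `word_timings` is a hypothetical helper.

```python
import json
from typing import List, Tuple

# Illustrative sample in the assumed speech-mark format (one JSON object per line).
SAMPLE_MARKS = "\n".join([
    '{"time":0,"type":"sentence","start":0,"end":23,"value":"Mary had a little lamb."}',
    '{"time":6,"type":"word","start":0,"end":4,"value":"Mary"}',
    '{"time":373,"type":"word","start":5,"end":8,"value":"had"}',
    '{"time":604,"type":"word","start":9,"end":10,"value":"a"}',
])


def word_timings(speech_marks: str) -> List[Tuple[str, int]]:
    """Return (word, start time in ms) for every word-type mark."""
    timings = []
    for line in speech_marks.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between objects
        mark = json.loads(line)
        if mark.get("type") == "word":
            timings.append((mark["value"], mark["time"]))
    return timings


if __name__ == "__main__":
    for word, start_ms in word_timings(SAMPLE_MARKS):
        print(f"{start_ms:>5} ms  {word}")
```

From these pairs, a subtitle cue is just a group of consecutive words with the first word's start time and the next group's start time as its end.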
Best for Voice Agent Platforms: Deepgram & Inworld
Deepgram and Inworld take a different approach: instead of selling TTS as a standalone API, they bundle it into voice agent platforms with STT + LLM + TTS pipelines optimized to work together.
Deepgram's Aura-2 runs at $30/1M characters standalone, but the Voice Agent API bundles STT + TTS for $4.50/hour — which is often cheaper if your agents are handling conversations at volume. The $200 free credit goes a long way for testing. Main limitation: only 7 languages.
Inworld's TTS-2 is the highest-ranked commercial model on the Speech Arena (ELO 1,236+). Sub-200ms latency, closed-loop architecture optimized for real-time conversations. Pricing starts at $25–$35/1M but drops to $5–$10/1M at enterprise volume. The catch: you mostly need to talk to sales, and they push their full character platform rather than TTS-only access.
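Whether Deepgram's bundle is cheaper depends on how many characters your agent speaks per hour. This sketch computes the break-even point using only the two Deepgram figures quoted above. The function name is hypothetical, and the calculation deliberately ignores the speech-to-text that the bundle also includes, so it understates the bundle's value.

```python
def breakeven_chars_per_hour(
    bundle_usd_per_hour: float = 4.50,
    tts_usd_per_million_chars: float = 30.0,
) -> float:
    """TTS characters per hour above which the hourly bundle costs less
    than paying the standalone per-character rate (STT ignored)."""
    return bundle_usd_per_hour * 1_000_000 / tts_usd_per_million_chars


def standalone_tts_cost_per_hour(chars_per_hour: float, tts_usd_per_million_chars: float = 30.0) -> float:
    """Hourly cost of standalone TTS at a given speaking volume."""
    return chars_per_hour * tts_usd_per_million_chars / 1_000_000


if __name__ == "__main__":
    print(f"Break-even: {breakeven_chars_per_hour():,.0f} chars/hour")
```

With the quoted rates the break-even is 150,000 TTS characters per hour. Below that, the bundle only wins once you add what you would otherwise pay for transcription, so measure both sides of a real conversation before deciding.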
Best Free / Open Source: Kokoro, Chatterbox & Qwen3-TTS
The open-source TTS landscape changed dramatically in early 2026. Three models now offer genuinely usable quality at $0:
- Kokoro-82M — 300MB model that hit #1 on the TTS Arena. Runs on CPU. English only, no voice cloning, but the quality is indistinguishable from paid services in blind tests. Perfect for prototyping and English narration.
- Chatterbox — MIT-licensed voice cloning from 5 seconds of audio. English only. Quality is a step below ElevenLabs cloning, but it's free and runs locally. Best for projects that need voice cloning without API costs.
- Qwen3-TTS — Alibaba's 1.7B model supports 10 languages, voice cloning from 3 seconds, and natural-language voice direction ("speak sadly"). Needs a GPU (6–8GB VRAM). The most capable open-source TTS overall, but the hardware requirement limits who can use it.
For a detailed showdown between all open-source options, see our Kokoro review which includes a head-to-head comparison table.
Also Worth Considering
Google Cloud Text-to-Speech
The most generous ongoing free tier (1M chars/month standard, 100K neural, with no 12-month expiry) and 70+ languages. Chirp 3 HD at $30/1M is competitive with ElevenLabs on quality. Custom Voice for enterprise is expensive but effective. The issue is complexity: GCP auth, service accounts, and IAM roles add friction that AWS and OpenAI don't.
Azure Cognitive Services Speech
140+ languages (more than anyone else), Custom Neural Voice with professional quality, and tight integration with Azure Bot Service. The 500K chars/month free tier is solid. Best choice for enterprise Microsoft shops with compliance requirements.
Gemini Flash TTS
Google's newest entry at ~$12/1M characters output with 200+ audio style tags and 70+ languages. Ranked #2 on the Speech Arena. Still in preview with limited documentation, but the quality-per-dollar ratio is excellent. Worth watching. Read our Gemini TTS review for details.
Murf Falcon API
Murf's API offering runs $10–$30/1M characters. Decent voice quality for scripted content, but the Gen-3 engine doesn't match newer models. Limited to 20 languages. Best for teams already using Murf Studio for voiceover production.
Free Tier Comparison
Free tiers matter for prototyping and testing. Here's what each provider gives you before you pay:
| Provider | Free Allowance | Duration | ≈ Minutes of Speech |
|---|---|---|---|
| Amazon Polly (Standard) | 5M chars/mo | 12 months | ~833 min/mo |
| Google Cloud (Standard) | 1M chars/mo | Ongoing | ~167 min/mo |
| Azure Speech | 500K chars/mo | Ongoing | ~83 min/mo |
| Deepgram | $200 credit | One-time | ~1,111 min total |
| Cartesia | 20K chars/mo | Ongoing | ~3 min/mo |
| ElevenLabs | 10K chars/mo | Ongoing | ~2 min/mo |
| OpenAI TTS | None | — | 0 |
| Kokoro / Chatterbox | Unlimited | Forever | ∞ |
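The minutes column appears to assume roughly 6,000 characters per minute of speech. That rate is inferred from the table itself (5M characters against ~833 minutes), not stated anywhere, and real speaking rates vary with voice, language, and content. The sketch below keeps the rate as a parameter so you can substitute a value measured from your own output. Function names are hypothetical.

```python
def minutes_of_speech(chars: float, chars_per_minute: float = 6000.0) -> float:
    """Convert a character allowance to minutes of audio at a given rate.

    The 6,000 default is the rate implied by the table, not a measured value.
    """
    return chars / chars_per_minute


def chars_from_credit(credit_usd: float, usd_per_million_chars: float) -> float:
    """Characters a dollar credit buys at a per-1M-character price."""
    return credit_usd * 1_000_000 / usd_per_million_chars


if __name__ == "__main__":
    print(round(minutes_of_speech(5_000_000)))                    # Polly Standard
    print(round(minutes_of_speech(1_000_000)))                    # Google Cloud
    print(round(minutes_of_speech(chars_from_credit(200, 30))))   # Deepgram credit
```

With the default rate this reproduces the table's 833, 167, and 1,111 minute figures. Change `chars_per_minute` and every row scales with it, so treat the minutes as relative comparisons rather than guaranteed audio durations.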
How to Choose: Decision Tree by Use Case
Content Creation / Audiobooks
Quality is everything. Budget secondary.
→ ElevenLabs (best quality) or Fish Audio (best value)
Voice Agents / Phone Bots
Latency is everything. Sub-200ms required.
→ Cartesia Sonic 3 (40ms TTFA) or Deepgram / Inworld (bundled voice agent platforms)
Prototyping / MVP
Speed to ship + minimal cost. Quality can improve later.
→ OpenAI TTS (if using GPT) or Kokoro (free, runs locally)
Enterprise / Compliance
Data residency, SLAs, SOC 2. Voice quality is secondary.
→ Azure Speech or Amazon Polly (cloud compliance built in)
Multilingual Content
Need 20+ languages with natural pronunciation.
→ Azure (140+), Google Cloud (70+), or Fish Audio (80+, best CJK)
5 Pricing Gotchas That Add 20–50% to Your Bill
After writing 10 individual pricing breakdowns, we keep seeing the same hidden costs catch developers off guard:
- SSML markup billing (Amazon Polly) — SSML tags count as characters. A paragraph with emphasis, breaks, and prosody tags can be 30% longer than the visible text.
- UTF-8 byte billing (Fish Audio) — 1 Chinese character = 3 bytes. What looks like $15/1M becomes $45/1M for CJK content.
- Concurrent session limits (Cartesia, Inworld) — Hitting the concurrent limit queues requests, adding latency that defeats the purpose of choosing a low-latency provider.
- Pro Voice Cloning surcharge (Cartesia) — 1.5x multiplier on all characters generated with cloned voices. Plus $0.014/min phone connection fees.
- Output token billing (Gemini Flash) — Gemini charges per output audio token, not input character. Cost depends on speech duration, not text length — longer pauses cost more.
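Surcharges like these are easy to model before you commit. As one example, the sketch below applies the Cartesia figures quoted above (a 1.5x multiplier on characters generated with cloned voices, plus $0.014 per phone-connection minute) to a base per-1M-character rate. The function and parameter names are hypothetical, and the base rate is whatever effective rate your plan works out to.

```python
def cartesia_monthly_estimate(
    chars: float,
    usd_per_million_chars: float,
    cloned_fraction: float = 0.0,
    phone_minutes: float = 0.0,
    clone_multiplier: float = 1.5,
    phone_usd_per_minute: float = 0.014,
) -> float:
    """Estimate monthly spend including the cloning surcharge and phone fees."""
    if not 0.0 <= cloned_fraction <= 1.0:
        raise ValueError("cloned_fraction must be between 0 and 1")
    base = chars / 1_000_000 * usd_per_million_chars
    plain_cost = base * (1.0 - cloned_fraction)
    cloned_cost = base * cloned_fraction * clone_multiplier
    phone_cost = phone_minutes * phone_usd_per_minute
    return plain_cost + cloned_cost + phone_cost


if __name__ == "__main__":
    headline = cartesia_monthly_estimate(1_000_000, 20.0)
    actual = cartesia_monthly_estimate(1_000_000, 20.0, cloned_fraction=1.0, phone_minutes=1000)
    print(f"Headline: ${headline:.2f}, with surcharges: ${actual:.2f}")
```

In this example, 1M characters at a $20 base rate costs $20 on paper but about $44 once every character uses a cloned voice and 1,000 phone minutes are added, more than double the headline figure.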
SDK & Streaming Support Matrix
| API | Python | Node.js | Go | Streaming |
|---|---|---|---|---|
| ElevenLabs | ✓ Official | ✓ Official | ✓ Official | WebSocket |
| OpenAI | ✓ Official | ✓ Official | Community | HTTP chunked |
| Cartesia | ✓ Official | ✓ Official | — | WebSocket + gRPC |
| Fish Audio | ✓ Official | — | — | WebSocket |
| Amazon Polly | ✓ boto3 | ✓ AWS SDK | ✓ AWS SDK | HTTP chunked |
| Deepgram | ✓ Official | ✓ Official | ✓ Official | WebSocket |
| Kokoro | ✓ pip install | — | — | Local only |
The Bottom Line
The TTS API market split into three tiers in 2026: premium quality (ElevenLabs, Fish Audio S2 Pro, Inworld TTS-2), mid-tier value (OpenAI, Cartesia, Deepgram, Gemini Flash), and free open-source (Kokoro, Chatterbox, Qwen3-TTS). The quality gap between tiers shrinks every quarter. A year ago, using anything other than ElevenLabs for production content felt like a compromise. Now Fish Audio matches or beats it in blind tests at a fraction of the price, and open-source models handle English narration convincingly.
My recommendation: start with Kokoro or OpenAI TTS for prototyping, then evaluate Fish Audio and ElevenLabs for production based on your quality and language requirements. Use our TTS cost calculator to model costs at your expected volume before committing.
By TextToLab Research Team · Last verified May 2026. Pricing from official API documentation. Arena rankings from Speech Arena (blind crowdsourced evaluation). ElevenLabs affiliate link disclosed — all other recommendations are independent.