Guide · 14 min read · May 29, 2026

By TextToLab Research Team

Best Text to Speech API 2026: 12 TTS APIs Compared (Pricing, Latency, SDKs)

Independent comparison of 12 TTS APIs including ElevenLabs, Fish Audio, Cartesia, OpenAI, and Amazon Polly. Verified pricing, latency benchmarks, SDK support, and free tier comparison from 10 pricing deep-dives.

The Short Answer

The best text-to-speech API depends on what you're building. ElevenLabs ($60–$180/1M chars) wins for voice quality and cloning. Cartesia Sonic 3 (40ms TTFA) wins for real-time voice agents. Fish Audio S2 Pro ($15/1M chars) wins for quality-per-dollar. OpenAI TTS ($15/1M) wins for simplicity if you're already in the OpenAI ecosystem. And Kokoro ($0, runs on CPU) wins for prototyping and cost-sensitive projects.

I've spent weeks testing every major TTS API against our pricing data from 10 individual pricing breakdowns. Most other "best TTS API" lists on Google are written by a TTS provider ranking themselves first. This one isn't. Our only affiliate relationship is with ElevenLabs (disclosed), and we'll tell you when a $15/1M option genuinely beats the $180/1M one.

Quick Comparison: 12 TTS APIs at a Glance

Prices verified against official documentation as of May 2026. Arena rankings from the Speech Arena blind evaluation (crowd-sourced ELO).

| API | Price / 1M Chars | Latency (TTFA) | Languages | Voice Cloning | Free Tier |
|---|---|---|---|---|---|
| ElevenLabs v3 | $60–$180 | 75–300ms | 32 | Yes (30s sample) | 10K chars/mo |
| Fish Audio S2 Pro | $15 | ~200ms | 80+ | Yes (10–30s) | 1K chars/day |
| Cartesia Sonic 3 | $20–$33 | 40ms | 42 | Yes (3s sample) | 20K chars/mo |
| OpenAI TTS | $15 (tts-1) | 200–400ms | 57 | No | None |
| Amazon Polly | $4–$16 | 100–200ms | 33 | No | 5M chars/mo for 12 months |
| Deepgram Aura-2 | $27–$30 | ~150ms | 7 | No | $200 credit |
| Inworld TTS-2 | $25–$35 | <200ms | 25+ | Custom voices | Contact sales |
| Google Cloud TTS | $4–$30 | 100–250ms | 70+ | Custom Voice (enterprise) | 1M chars/mo |
| Azure Speech | $16–$24 | 100–200ms | 140+ | Custom Neural Voice | 500K chars/mo |
| Murf Falcon API | $10–$30 | 150–300ms | 20 | No | None |
| Kokoro-82M | $0 (self-hosted) | ~50ms (GPU) | 1 (English) | No | Free forever |
| Chatterbox | $0 (self-hosted) | Variable | 1 (English) | Yes (5s sample) | Free forever |

Use our TTS cost calculator to estimate costs for your specific usage volume.

Best Overall: ElevenLabs

ElevenLabs is still the default recommendation for most projects, and it's not close on voice quality. Their v3 model handles emotion, pacing, and pronunciation better than any other commercial API. Voice cloning from 30 seconds of audio actually works — I've tested it against Cartesia (3-second cloning) and Fish Audio (10-second), and ElevenLabs produces the most consistent, natural-sounding clones.

The Python and Node.js SDKs are excellent. WebSocket streaming is stable. Documentation is the best in the industry: it shows working code examples instead of pseudo-code. And the 1,000+ pre-built voices mean you can ship a prototype in hours.

The downside is cost. At $60–$180/1M characters on the API, ElevenLabs is 4–12x more expensive than Fish Audio or OpenAI's tts-1, both of which charge $15/1M. For high-volume production (generating millions of characters monthly), that difference compounds fast. See our full ElevenLabs pricing breakdown for tier-by-tier numbers.
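To make that gap concrete, here is a minimal sketch that applies the per-1M rates quoted in this guide to a monthly volume. The function name and the 5M-character volume are illustrative assumptions, not figures from any provider.

```python
# Per-1M-character rates quoted in this guide (USD), as (low, high) ranges.
RATES_PER_1M = {
    "ElevenLabs": (60.0, 180.0),
    "Fish Audio S2 Pro": (15.0, 15.0),
    "OpenAI tts-1": (15.0, 15.0),
}


def monthly_cost(chars_per_month: int, rate_per_1m: float) -> float:
    """Return the monthly cost in USD for a given character volume."""
    return chars_per_month / 1_000_000 * rate_per_1m


if __name__ == "__main__":
    volume = 5_000_000  # hypothetical monthly volume
    for name, (low, high) in RATES_PER_1M.items():
        low_cost = monthly_cost(volume, low)
        high_cost = monthly_cost(volume, high)
        print(f"{name}: ${low_cost:,.0f} to ${high_cost:,.0f} per month")
```

Swap in your own volume and the rates from whichever tier you expect to land on; the ranges above ignore plan minimums and overage pricing.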

Best Quality-Per-Dollar: Fish Audio S2 Pro

Fish Audio is the most underrated TTS API in 2026. Their S2 Pro model ranked #1 in blind A/B tests with a Bradley-Terry score of 3.07 from 71,000+ evaluation pairs, meaning listeners preferred it to every other model, including ElevenLabs, when they couldn't see the brand name. At $15/1M characters, that's 4–12x cheaper than ElevenLabs for arguably better raw quality.

The 80+ language support is legitimate — Fish Audio is particularly strong in CJK languages (Chinese, Japanese, Korean), which is where most Western TTS providers fall flat. Voice cloning works from 10–30 seconds of audio.

Watch out for the billing model: Fish Audio charges per UTF-8 byte, not per character. For English text, 1 character ≈ 1 byte. For Chinese or Japanese text, 1 character = 3 bytes — tripling your effective cost. At $45/1M CJK characters, it's still cheaper than ElevenLabs, but the headline "$15/1M" rate is misleading for non-Latin scripts. Read our Fish Audio vs ElevenLabs comparison for the full picture.
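The byte-versus-character difference is easy to check before you send anything. The sketch below assumes billing is a flat rate per 1M UTF-8 bytes, as described above; the function names are hypothetical and this is not Fish Audio's own calculator.

```python
def billable_bytes(text: str) -> int:
    """Number of UTF-8 bytes in the text, which is what gets billed."""
    return len(text.encode("utf-8"))


def effective_rate_per_1m_chars(text: str, rate_per_1m_bytes: float = 15.0) -> float:
    """Effective USD rate per 1M characters for text with this byte/char mix."""
    if not text:
        raise ValueError("text must not be empty")
    return rate_per_1m_bytes * billable_bytes(text) / len(text)


if __name__ == "__main__":
    for sample in ("Hello", "你好世界", "こんにちは"):
        print(
            sample,
            billable_bytes(sample),
            effective_rate_per_1m_chars(sample),
        )
```

Running this over a representative sample of your real scripts gives a better estimate than the headline rate, since mixed-language text lands somewhere between 1 and 3 bytes per character.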

Best for Real-Time Voice Agents: Cartesia Sonic 3

If you're building a voice agent — phone bot, customer service AI, interactive NPC — latency matters more than voice quality. Users tolerate slightly mechanical speech far better than awkward silence. Cartesia's Sonic 3 delivers 40ms time-to-first-audio, beating every other provider by 2–5x. The Turbo variants push that even lower.

Built on a state space model (SSM) architecture (from Stanford founders with $100M+ in funding), Cartesia's approach is fundamentally different from transformer-based TTS. It trades some voice expressiveness for deterministic, low-latency streaming via WebSocket. Voice cloning from just 3 seconds of audio is fast but less accurate than ElevenLabs' 30-second approach.

Pricing is credit-based and confusing at first — see our Cartesia pricing breakdown for the effective per-character rates. At scale, expect $20–$33/1M characters with 1.5x surcharges for Pro Voice Cloning and per-minute phone connection fees ($0.014/min).

Best for OpenAI Stack: OpenAI TTS

If you're already paying for GPT-4o and using the OpenAI SDK, adding TTS is two lines of code. That's the real value proposition — not voice quality, which is mid-tier, but zero additional vendor management.

Three models available: tts-1 ($15/1M chars, lower quality but faster), tts-1-hd ($30/1M, higher quality), and gpt-4o-mini-tts (~$15/1M with steerable instructions). The steerable model is genuinely useful — you can tell it "speak excitedly" or "whisper this part" in plain English. 11 preset voices, no cloning, 57 languages.
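As a rough illustration of the steerable model, the sketch below builds the request separately from sending it. It assumes the official `openai` Python package (v1 or later) and an `OPENAI_API_KEY` in the environment; the helper names and the `alloy` voice choice are illustrative. The `instructions` field is only attached for gpt-4o-mini-tts, since the tts-1 models do not use it.

```python
from typing import Optional


def build_speech_request(
    text: str,
    model: str = "gpt-4o-mini-tts",
    voice: str = "alloy",
    instructions: Optional[str] = None,
) -> dict:
    """Build keyword arguments for the speech endpoint."""
    request = {"model": model, "voice": voice, "input": text}
    # Plain-English steering is only meaningful for the steerable model.
    if instructions is not None and model == "gpt-4o-mini-tts":
        request["instructions"] = instructions
    return request


def synthesize(text: str, out_path: str, **kwargs) -> None:
    """Send the request and stream the audio to a file (needs network + API key)."""
    from openai import OpenAI  # imported lazily so the builder works without it

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    request = build_speech_request(text, **kwargs)
    with client.audio.speech.with_streaming_response.create(**request) as response:
        response.stream_to_file(out_path)


if __name__ == "__main__":
    print(build_speech_request("Welcome back!", instructions="Speak excitedly"))
```

Keeping the request builder separate makes it easy to log or unit-test what you send without spending characters, which matters when there is no free tier.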

The biggest limitation: no free tier. At all. You pay from character one. ElevenLabs gives you 10K free, Google gives 1M free, Amazon gives 5M free for 12 months. OpenAI gives you nothing. For prototyping and testing, that's a real friction point. Full breakdown in our OpenAI TTS pricing guide.

Best for AWS Stack: Amazon Polly

Amazon Polly is the cheapest commercial TTS API if you're already on AWS. Standard voices at $4/1M characters, Neural voices at $16/1M, and the newer Generative engine at $30/1M. The free tier (5M standard or 1M neural characters per month for 12 months) is the largest monthly allowance of any provider here, though unlike Google's and Azure's it expires after the first year.

Where Polly shines is infrastructure integration. SpeechMarks give you word-level timing for lip sync and subtitles. SSML support is the most complete of any provider. Integration with AWS Lambda, S3, CloudFront is turnkey. For IVR systems, accessibility features, or any AWS-native application, Polly is hard to beat on price.
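Speech marks come back as newline-delimited JSON, one object per mark, with `time` in milliseconds from the start of the audio. The sketch below parses word marks into timings you could feed a subtitle or lip-sync step. The sample payload is made up for illustration; in practice it would come from a boto3 `synthesize_speech` call with `OutputFormat="json"` and `SpeechMarkTypes=["word"]`.

```python
import json

# Illustrative payload in the speech-mark line format (values are invented).
SAMPLE_MARKS = (
    '{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}\n'
    '{"time":732,"type":"word","start":7,"end":12,"value":"world"}\n'
)


def parse_word_marks(raw: str) -> list[tuple[int, str]]:
    """Return (start_ms, word) pairs from newline-delimited speech marks."""
    marks = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        mark = json.loads(line)
        if mark.get("type") == "word":
            marks.append((mark["time"], mark["value"]))
    return marks


def format_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT-style HH:MM:SS,mmm timestamp."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


if __name__ == "__main__":
    for start_ms, word in parse_word_marks(SAMPLE_MARKS):
        print(format_timestamp(start_ms), word)
```

Speech marks are requested separately from the audio itself, so a subtitle pipeline typically makes two calls for the same text: one for the marks and one for the audio stream.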

The tradeoff is voice quality. Even Polly's Neural voices sound noticeably more robotic than ElevenLabs, Fish Audio, or even OpenAI. The Generative engine closes the gap but is 7.5x the standard price. And there's no voice cloning at all. Full analysis in our Amazon Polly pricing guide.

Best for Voice Agent Platforms: Deepgram & Inworld

Deepgram and Inworld take a different approach: instead of selling TTS as a standalone API, they bundle it into voice agent platforms with STT + LLM + TTS pipelines optimized to work together.

Deepgram's Aura-2 runs at $30/1M characters standalone, but the Voice Agent API bundles STT + TTS for $4.50/hour — which is often cheaper if your agents are handling conversations at volume. The $200 free credit goes a long way for testing. Main limitation: only 7 languages.
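Whether the bundle is cheaper depends on how many characters your agent actually speaks per hour. A minimal sketch of the break-even point, using the two rates quoted above; it deliberately ignores the STT that the bundle also covers, so treat the result as a conservative threshold rather than a full comparison. The function name is hypothetical.

```python
def break_even_chars_per_hour(
    bundle_usd_per_hour: float = 4.50,
    tts_usd_per_1m_chars: float = 30.0,
) -> float:
    """Characters of TTS per hour at which standalone cost equals the bundle."""
    return bundle_usd_per_hour / tts_usd_per_1m_chars * 1_000_000


if __name__ == "__main__":
    threshold = break_even_chars_per_hour()
    print(f"Bundle wins on TTS alone above ~{threshold:,.0f} chars per agent-hour")
```

Below that threshold the bundle only pays off once you add what you would otherwise spend on speech-to-text for the same hour.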

Inworld's TTS-2 is the highest-ranked commercial model on the Speech Arena (ELO 1,236+). Sub-200ms latency, closed-loop architecture optimized for real-time conversations. Pricing starts at $25–$35/1M but drops to $5–$10/1M at enterprise volume. The catch: you mostly need to talk to sales, and they push their full character platform rather than TTS-only access.

Best Free / Open Source: Kokoro, Chatterbox & Qwen3-TTS

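The open-source TTS landscape changed dramatically in early 2026. Three models now offer genuinely usable quality at $0: Kokoro-82M (English only, runs on CPU), Chatterbox (English only, voice cloning from a 5-second sample), and Qwen3-TTS (GPU, multilingual, with voice cloning).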

For a detailed showdown between all open-source options, see our Kokoro review which includes a head-to-head comparison table.

Also Worth Considering

Google Cloud Text-to-Speech

The most generous ongoing free tier (1M chars/month standard, 100K neural) and 70+ languages. Chirp 3 HD at $30/1M is competitive with ElevenLabs on quality. Custom Neural Voice for enterprise is expensive but effective. The issue is complexity: GCP auth, service accounts, and IAM roles add friction that AWS and OpenAI don't.

Azure Cognitive Services Speech

140+ languages (more than anyone else), Custom Neural Voice with professional quality, and tight integration with Azure Bot Service. The 500K chars/month free tier is solid. Best choice for enterprise Microsoft shops with compliance requirements.

Gemini Flash TTS

Google's newest entry at ~$12/1M characters output with 200+ audio style tags and 70+ languages. Ranked #2 on the Speech Arena. Still in preview with limited documentation, but the quality-per-dollar ratio is excellent. Worth watching. Read our Gemini TTS review for details.

Murf Falcon API

Murf's API offering at $10–$30/1M characters. Decent voice quality for scripted content, but the Gen-3 engine doesn't match newer models. Limited to 20 languages. Best for teams already using Murf Studio for voiceover production.

Free Tier Comparison

Free tiers matter for prototyping and testing. Here's what each provider gives you before you pay:

| Provider | Free Allowance | Duration | ≈ Minutes of Speech |
|---|---|---|---|
| Amazon Polly (Standard) | 5M chars/mo | 12 months | ~833 min/mo |
| Google Cloud (Standard) | 1M chars/mo | Ongoing | ~167 min/mo |
| Azure Speech | 500K chars/mo | Ongoing | ~83 min/mo |
| Deepgram | $200 credit | One-time | ~1,111 min total |
| Cartesia | 20K chars/mo | Ongoing | ~3 min/mo |
| ElevenLabs | 10K chars/mo | Ongoing | ~2 min/mo |
| OpenAI TTS | None | n/a | 0 |
| Kokoro / Chatterbox | Unlimited | Forever | n/a |
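The minute figures in the table are consistent with a conversion of roughly 6,000 characters per minute (for example, 1M characters ≈ 167 minutes). The sketch below reproduces them and keeps the rate as a parameter, since it is an assumption: narration at around 150 words per minute usually works out closer to 1,000 characters per minute, which would make every allowance last several times longer in audio terms. Function names are illustrative.

```python
def minutes_of_speech(chars: float, chars_per_minute: float = 6000.0) -> float:
    """Convert a character allowance to minutes of audio at a given rate."""
    return chars / chars_per_minute


def chars_from_credit(credit_usd: float, usd_per_1m_chars: float) -> float:
    """Characters a one-time credit buys at a given per-1M rate."""
    return credit_usd / usd_per_1m_chars * 1_000_000


if __name__ == "__main__":
    print(round(minutes_of_speech(5_000_000)))  # Amazon Polly standard
    print(round(minutes_of_speech(1_000_000)))  # Google Cloud standard
    # Deepgram: $200 credit at the $30/1M Aura-2 rate quoted in this guide.
    print(round(minutes_of_speech(chars_from_credit(200, 30))))
```

Pass `chars_per_minute=1000` to see the same allowances under the slower, more typical narration rate.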

How to Choose: Decision Tree by Use Case

Content Creation / Audiobooks

Quality is everything. Budget secondary.

→ ElevenLabs (best quality) or Fish Audio (best value)

Voice Agents / Phone Bots

Latency is everything. Sub-200ms required.

→ Cartesia (fastest) or Inworld (best quality + speed)

Prototyping / MVP

Speed to ship + minimal cost. Quality can improve later.

→ OpenAI TTS (if using GPT) or Kokoro (free, runs locally)

Enterprise / Compliance

Data residency, SLAs, SOC 2. Voice quality is secondary.

→ Azure Speech or Amazon Polly (cloud compliance built in)

Multilingual Content

Need 20+ languages with natural pronunciation.

→ Azure (140+), Google Cloud (70+), or Fish Audio (80+, best CJK)

Zero Budget / Open Source

No API costs. Self-hosted acceptable.

→ Kokoro (CPU, English) or Qwen3-TTS (GPU, multilingual + cloning)
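If your requirements don't map cleanly onto one use case, the same decision can be approached as a filter over hard constraints. This sketch encodes a few rows from the comparison table (low end of each price range, best-case latency) and returns the APIs that pass. The class, field, and function names are hypothetical, and best-case numbers flatter every provider, so treat the output as a shortlist to test rather than a verdict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TtsApi:
    name: str
    min_price_per_1m: float  # USD, low end of the range in the comparison table
    best_ttfa_ms: int        # best-case latency from the comparison table
    cloning: bool


APIS = [
    TtsApi("ElevenLabs v3", 60.0, 75, True),
    TtsApi("Fish Audio S2 Pro", 15.0, 200, True),
    TtsApi("Cartesia Sonic 3", 20.0, 40, True),
    TtsApi("OpenAI TTS", 15.0, 200, False),
    TtsApi("Amazon Polly", 4.0, 100, False),
    TtsApi("Deepgram Aura-2", 27.0, 150, False),
]


def shortlist(
    max_ttfa_ms: Optional[int] = None,
    max_price_per_1m: Optional[float] = None,
    need_cloning: bool = False,
) -> list[str]:
    """Return API names that satisfy every constraint that was given."""
    result = []
    for api in APIS:
        if max_ttfa_ms is not None and api.best_ttfa_ms > max_ttfa_ms:
            continue
        if max_price_per_1m is not None and api.min_price_per_1m > max_price_per_1m:
            continue
        if need_cloning and not api.cloning:
            continue
        result.append(api.name)
    return result


if __name__ == "__main__":
    print(shortlist(max_ttfa_ms=100, need_cloning=True))
    print(shortlist(max_price_per_1m=15.0))
```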

5 Pricing Gotchas That Add 20–50% to Your Bill

After writing 10 individual pricing breakdowns, these are the hidden costs that catch developers off guard:

SDK & Streaming Support Matrix

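| API | Python | Node.js | Go | Streaming |
|---|---|---|---|---|
| ElevenLabs | ✓ Official | ✓ Official | ✓ Official | WebSocket |
| OpenAI | ✓ Official | ✓ Official | Community | HTTP chunked |
| Cartesia | ✓ Official | ✓ Official | n/a | WebSocket + gRPC |
| Fish Audio | ✓ Official | n/a | n/a | WebSocket |
| Amazon Polly | ✓ boto3 | ✓ AWS SDK | ✓ AWS SDK | HTTP chunked |
| Deepgram | ✓ Official | ✓ Official | ✓ Official | WebSocket |
| Kokoro | ✓ pip install | n/a | n/a | Local only |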

The Bottom Line

The TTS API market split into three tiers in 2026: premium quality (ElevenLabs, Fish Audio S2 Pro, Inworld TTS-2), mid-tier value (OpenAI, Cartesia, Deepgram, Gemini Flash), and free open-source (Kokoro, Chatterbox, Qwen3-TTS). The quality gap between tiers shrinks every quarter. A year ago, using anything other than ElevenLabs for production content felt like a compromise. Now Fish Audio matches or beats it in blind tests at a fraction of the price, and open-source models handle English narration convincingly.

My recommendation: start with Kokoro or OpenAI TTS for prototyping, then evaluate Fish Audio and ElevenLabs for production based on your quality and language requirements. Use our TTS cost calculator to model costs at your expected volume before committing.


By TextToLab Research Team · Last verified May 2026. Pricing from official API documentation. Arena rankings from Speech Arena (blind crowdsourced evaluation). ElevenLabs affiliate link disclosed — all other recommendations are independent.