What is the best text to speech API in 2026?

ElevenLabs is the best overall for voice quality and cloning ($0.18–$0.30/1K characters). Fish Audio S2 Pro offers the best quality-per-dollar at $15/1M characters — ranked #1 in blind tests. Cartesia Sonic 3 is best for real-time voice agents with 40ms latency. OpenAI TTS ($15/1M) is best for teams already using the OpenAI SDK. Kokoro is best for free/open-source at $0 on CPU.

Which TTS API has a free tier?

Amazon Polly has the most generous free tier: 5M standard characters per month for 12 months. Google Cloud offers 1M standard characters per month ongoing. Azure gives 500K characters per month. ElevenLabs offers 10K characters per month. Cartesia provides 20K characters per month. OpenAI TTS has no free tier at all. Open-source models like Kokoro and Chatterbox are completely free to self-host.

What is the cheapest TTS API?

For managed APIs: Amazon Polly Standard is the cheapest at $4/1M characters, with Google Cloud's pricing starting at the same rate. Murf Falcon starts at $10/1M, and OpenAI TTS (tts-1) and Fish Audio S2 Pro both cost $15/1M. Fish Audio S2 Pro offers the best quality-per-dollar at that price, with #1 blind test rankings. For free: Kokoro, Chatterbox, and Qwen3-TTS are open-source and cost $0 to self-host.

What TTS API has the lowest latency?

Cartesia Sonic 3 has the lowest time-to-first-audio at approximately 40ms — about 2-5x faster than any competitor. Deepgram Aura-2 follows at ~150ms. ElevenLabs ranges from 75-300ms depending on model. OpenAI TTS is typically 200-400ms. For real-time voice agents and phone bots, Cartesia or Deepgram are the recommended choices.

Which TTS API is best for voice cloning?

ElevenLabs offers the best voice cloning quality from 30 seconds of audio, with Professional Clone available for even higher fidelity. Fish Audio S2 Pro clones from 10-30 seconds with excellent results at a fraction of the cost. Cartesia offers the fastest clone setup from just 3 seconds. For free voice cloning, Qwen3-TTS clones from 3 seconds and Chatterbox from 5 seconds — both open-source.

Is there a free open-source text to speech API?

Yes. Kokoro-82M (Apache 2.0, runs on CPU, English only, 300MB) hit #1 on the TTS Arena. Chatterbox (MIT license, English, includes voice cloning from 5 seconds) is another strong option. Qwen3-TTS (Apache 2.0, 10 languages, voice cloning from 3 seconds) is the most capable but requires an NVIDIA GPU. All three are free for commercial use.

Best Text to Speech API 2026: 12 TTS APIs Compared (Pricing, Latency, SDKs)

The Short Answer

The best text-to-speech API depends on what you're building. ElevenLabs ($0.18–$0.30/1K chars) wins for voice quality and cloning. Cartesia Sonic 3 (40ms TTFA) wins for real-time voice agents. Fish Audio S2 Pro ($15/1M chars) wins for quality-per-dollar. OpenAI TTS ($15/1M) wins for simplicity if you're already in the OpenAI ecosystem. And Kokoro ($0, runs on CPU) wins for prototyping and cost-sensitive projects.

I've spent weeks testing every major TTS API and checking the results against the pricing data from our 10 individual pricing breakdowns. Every other "best TTS API" list on Google is written by a TTS provider ranking themselves first. This one isn't. Our only affiliate relationship is with ElevenLabs (disclosed), and we'll tell you when a $15/1M option genuinely beats the $180/1M one.

Quick Comparison: 12 TTS APIs at a Glance

Prices verified against official documentation as of May 2026. Arena rankings from the Speech Arena blind evaluation (crowd-sourced ELO).

API	Price / 1M Chars	Latency (TTFA)	Languages	Voice Cloning	Free Tier
ElevenLabs v3	$60–$180	75–300ms	32	Yes (30s sample)	10K chars/mo
Fish Audio S2 Pro	$15	~200ms	80+	Yes (10–30s)	1K chars/day
Cartesia Sonic 3	$20–$33	40ms	42	Yes (3s sample)	20K chars/mo
OpenAI TTS	$15 (tts-1)	200–400ms	57	No	None
Amazon Polly	$4–$30	100–200ms	33	No	5M chars/mo for 12 months
Deepgram Aura-2	$27–$30	~150ms	7	No	$200 credit
Inworld TTS-2	$25–$35	<200ms	25+	Custom voices	Contact sales
Google Cloud TTS	$4–$30	100–250ms	70+	Custom Voice (enterprise)	1M chars/mo
Azure Speech	$16–$24	100–200ms	140+	Custom Neural Voice	500K chars/mo
Murf Falcon API	$10–$30	150–300ms	20	No	None
Kokoro-82M	$0 (self-hosted)	~50ms (GPU)	1 (English)	No	Free forever
Chatterbox	$0 (self-hosted)	Variable	1 (English)	Yes (5s sample)	Free forever

Use our TTS cost calculator to estimate costs for your specific usage volume.

Best Overall: ElevenLabs

ElevenLabs is still the default recommendation for most projects, and it's not close on voice quality. Their v3 model handles emotion, pacing, and pronunciation better than any other commercial API. Voice cloning from 30 seconds of audio actually works — I've tested it against Cartesia (3-second cloning) and Fish Audio (10-second), and ElevenLabs produces the most consistent, natural-sounding clones.

The Python and Node.js SDKs are excellent. WebSocket streaming is stable. Documentation is the best in the industry: it shows working code examples instead of pseudo-code. And the 1,000+ pre-built voices mean you can ship a prototype in hours.

The downside is cost. At $60–$180/1M characters on the API, ElevenLabs is 4–12x more expensive than Fish Audio S2 Pro or OpenAI's tts-1, both of which cost $15/1M. For high-volume production (generating millions of characters monthly), that difference compounds fast. See our full ElevenLabs pricing breakdown for tier-by-tier numbers.
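To make that gap concrete, here is a rough sketch of the arithmetic using the per-1M-character rates from the comparison table. The `PRICE_PER_MILLION` dictionary and the `monthly_cost` helper are illustrative names of my own, not part of any provider SDK, and the 5M characters/month volume is a made-up example. Real invoices depend on plan tiers and overage rules.

```python
# Per-1M-character price ranges in USD, copied from the comparison table.
# The names and structure here are illustrative only.
PRICE_PER_MILLION = {
    "ElevenLabs v3": (60.0, 180.0),
    "Fish Audio S2 Pro": (15.0, 15.0),
    "OpenAI tts-1": (15.0, 15.0),
}


def monthly_cost(chars_per_month: int, price_per_million: float) -> float:
    """Cost in USD for a monthly character volume at a flat per-1M rate."""
    return chars_per_month / 1_000_000 * price_per_million


if __name__ == "__main__":
    volume = 5_000_000  # hypothetical volume: 5M characters per month
    for name, (low, high) in PRICE_PER_MILLION.items():
        low_cost = monthly_cost(volume, low)
        high_cost = monthly_cost(volume, high)
        print(f"{name}: ${low_cost:,.0f} to ${high_cost:,.0f} per month")
```

At that example volume, the $15/1M options come to $75 a month, while the ElevenLabs range works out to $300–$900.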

Best for: Content creation, audiobooks, apps where voice quality is the primary differentiator
SDK support: Python, Node.js, Go, REST, WebSocket streaming
Gotcha: Free tier is only 10K characters/month, roughly 10 minutes of speech at a typical speaking pace of about 1,000 characters per minute. You'll burn through it testing.

Best Quality-Per-Dollar: Fish Audio S2 Pro

Fish Audio is the most underrated TTS API in 2026. Their S2 Pro model ranked #1 in blind A/B tests with a Bradley-Terry score of 3.07 from 71,000+ evaluation pairs, meaning listeners preferred it to every other model, including ElevenLabs, when they couldn't see the brand name. At $15/1M characters, it costs between a quarter and a twelfth of ElevenLabs' $60–$180/1M API rate for arguably better raw quality.

The 80+ language support is legitimate — Fish Audio is particularly strong in CJK languages (Chinese, Japanese, Korean), which is where most Western TTS providers fall flat. Voice cloning works from 10–30 seconds of audio.

Watch out for the billing model: Fish Audio charges per UTF-8 byte, not per character. For English text, 1 character ≈ 1 byte. For Chinese or Japanese text, 1 character = 3 bytes — tripling your effective cost. At $45/1M CJK characters, it's still cheaper than ElevenLabs, but the headline "$15/1M" rate is misleading for non-Latin scripts. Read our Fish Audio vs ElevenLabs comparison for the full picture.
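The byte-versus-character gap is easy to check before you commit to a budget. The sketch below assumes the headline $15 applies per 1M UTF-8 bytes, as described above. The helper names are my own, not part of Fish Audio's SDK.

```python
def billed_bytes(text: str) -> int:
    """Number of UTF-8 bytes in the text, the unit charged under byte-based billing."""
    return len(text.encode("utf-8"))


def effective_rate_per_million_chars(text: str, rate_per_million_bytes: float) -> float:
    """Effective USD per 1M characters for this text, given a per-1M-byte rate."""
    if not text:
        raise ValueError("text must not be empty")
    return rate_per_million_bytes * billed_bytes(text) / len(text)


if __name__ == "__main__":
    for sample in ("Hello, world", "你好世界"):
        rate = effective_rate_per_million_chars(sample, 15.0)
        print(f"{sample!r}: {len(sample)} chars, {billed_bytes(sample)} bytes, "
              f"${rate:.2f} per 1M chars")
```

Run it on a representative sample of your real content rather than a single sentence. Mixed scripts land somewhere between the two rates, because ASCII digits and punctuation inside CJK text still cost one byte each.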

Best for: High-volume production, multilingual content, CJK languages, cost-conscious teams that still need top-tier quality
SDK support: Python, REST, WebSocket
Gotcha: UTF-8 byte billing makes CJK text 3x the headline rate

Best for Real-Time Voice Agents: Cartesia Sonic 3

If you're building a voice agent — phone bot, customer service AI, interactive NPC — latency matters more than voice quality. Users tolerate slightly mechanical speech far better than awkward silence. Cartesia's Sonic 3 delivers 40ms time-to-first-audio, beating every other provider by 2–5x. The Turbo variants push that even lower.
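Published TTFA figures are measured under the provider's conditions, so it is worth timing the first audio chunk from your own network. Here is a minimal sketch, assuming your client exposes the response as an iterable of byte chunks. `measure_stream` and `fake_stream` are hypothetical names, and the fake stream only stands in for a real streaming response.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float, bytes]:
    """Consume a stream and return (seconds to first audio, total seconds, audio bytes)."""
    start = time.perf_counter()
    first_audio_at = None
    audio = bytearray()
    for chunk in chunks:
        # Record the moment the first non-empty chunk arrives.
        if chunk and first_audio_at is None:
            first_audio_at = time.perf_counter() - start
        audio.extend(chunk)
    total = time.perf_counter() - start
    if first_audio_at is None:
        raise RuntimeError("stream ended before any audio arrived")
    return first_audio_at, total, bytes(audio)


def fake_stream(delay: float = 0.05) -> Iterator[bytes]:
    """Stand-in for a provider stream: waits, then yields two audio chunks."""
    time.sleep(delay)
    yield b"abc"
    yield b"def"


if __name__ == "__main__":
    ttfa, total, audio = measure_stream(fake_stream())
    print(f"first audio after {ttfa * 1000:.0f} ms, {len(audio)} bytes in {total * 1000:.0f} ms")
```

The timer only covers the whole round trip if the request is sent lazily when iteration begins. If your client sends the request earlier, start the clock at that point instead.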

Built on a state space model (SSM) architecture (from Stanford founders with $100M+ in funding), Cartesia's approach is fundamentally different from transformer-based TTS. It trades some voice expressiveness for deterministic, low-latency streaming via WebSocket. Voice cloning from just 3 seconds of audio is fast but less accurate than ElevenLabs' 30-second approach.

Pricing is credit-based and confusing at first — see our Cartesia pricing breakdown for the effective per-character rates. At scale, expect $20–$33/1M characters with 1.5x surcharges for Pro Voice Cloning and per-minute phone connection fees ($0.014/min).
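The surcharges stack, so it helps to model them together. This sketch uses only the figures quoted above (a per-1M base rate, the 1.5x cloned-voice multiplier, and $0.014 per phone minute). The function name and the example volumes are hypothetical, and the credit system may round differently in practice.

```python
def estimate_voice_agent_cost(
    chars: int,
    base_rate_per_million: float,
    cloned_voice: bool = False,
    phone_minutes: float = 0.0,
    clone_multiplier: float = 1.5,        # surcharge quoted in the text above
    phone_fee_per_minute: float = 0.014,  # per-minute connection fee quoted above
) -> float:
    """Rough USD estimate: character cost, optional clone surcharge, plus phone minutes."""
    tts_cost = chars / 1_000_000 * base_rate_per_million
    if cloned_voice:
        tts_cost *= clone_multiplier
    return tts_cost + phone_minutes * phone_fee_per_minute


if __name__ == "__main__":
    # Hypothetical month: 2M characters at $20/1M, cloned voice, 10,000 phone minutes.
    total = estimate_voice_agent_cost(2_000_000, 20.0, cloned_voice=True, phone_minutes=10_000)
    print(f"estimated total: ${total:,.2f}")
```

In that example the phone connection fees ($140) outweigh the speech itself ($60), which is why the per-character rate alone understates the cost of a phone bot.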

Best for: Voice agents, phone bots, real-time conversational AI, gaming NPCs
SDK support: Python, Node.js, WebSocket streaming, gRPC
Gotcha: Voice quality ranks #10 on the Arena (ELO 1,054), noticeably below ElevenLabs and Fish Audio in side-by-side tests. See our Cartesia vs ElevenLabs comparison.

Best for OpenAI Stack: OpenAI TTS

If you're already paying for GPT-4o and using the OpenAI SDK, adding TTS is two lines of code. That's the real value proposition — not voice quality, which is mid-tier, but zero additional vendor management.

Three models available: tts-1 ($15/1M chars, lower quality but faster), tts-1-hd ($30/1M, higher quality), and gpt-4o-mini-tts (~$15/1M with steerable instructions). The steerable model is genuinely useful — you can tell it "speak excitedly" or "whisper this part" in plain English. 11 preset voices, no cloning, 57 languages.
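For a sense of what a steerable request looks like, here is a dependency-free sketch that builds the HTTP call with Python's standard library instead of the official SDK. The endpoint path and the `model`, `voice`, `input`, and `instructions` fields reflect OpenAI's speech endpoint as I understand it, so check the current API reference before relying on them. `build_speech_request` and `synthesize_to_file` are illustrative names.

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, model="gpt-4o-mini-tts", voice="alloy",
                         instructions=None, api_key=None):
    """Build (but do not send) a POST request for the speech endpoint."""
    payload = {"model": model, "voice": voice, "input": text}
    if instructions:
        # Plain-English direction; assumed to apply to the steerable model only.
        payload["instructions"] = instructions
    key = api_key or os.environ.get("OPENAI_API_KEY", "")
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {key}", "Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, path, **kwargs):
    """Send the request and save the returned audio. This makes a billable API call."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


if __name__ == "__main__":
    req = build_speech_request("Your order has shipped!", instructions="Speak excitedly")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Because there is no free tier, it is worth building and inspecting the request this way first and calling `synthesize_to_file` only when the payload looks right.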

The biggest limitation: no free tier. At all. You pay from character one. ElevenLabs gives you 10K free, Google gives 1M free, Amazon gives 5M free for 12 months. OpenAI gives you nothing. For prototyping and testing, that's a real friction point. Full breakdown in our OpenAI TTS pricing guide.

Best for: Teams already using OpenAI APIs who want unified billing and a single SDK
SDK support: Python, Node.js (official), REST
Gotcha: No free tier, no voice cloning, 11 preset voices only

Best for AWS Stack: Amazon Polly

Amazon Polly is the cheapest commercial TTS API if you're already on AWS. Standard voices at $4/1M characters, Neural voices at $16/1M, and the newer Generative engine at $30/1M. The free tier — 5M standard or 1M neural characters per month for 12 months — is the most generous in the industry.

Where Polly shines is infrastructure integration. SpeechMarks give you word-level timing for lip sync and subtitles. SSML support is the most complete of any provider. Integration with AWS Lambda, S3, CloudFront is turnkey. For IVR systems, accessibility features, or any AWS-native application, Polly is hard to beat on price.
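To show what word-level timing gives you, here is a small sketch that turns speech marks into subtitle-style cues. It assumes the marks arrive as newline-delimited JSON objects with `time` (milliseconds), `type`, and `value` fields, which is how I understand Polly's speech mark output. The sample data is invented, and `final_duration_ms` is an arbitrary guess because marks carry start times only.

```python
import json

# Invented sample in the assumed speech mark format: one JSON object per line.
SAMPLE_MARKS = """\
{"time": 6, "type": "word", "start": 0, "end": 5, "value": "Hello"}
{"time": 370, "type": "word", "start": 7, "end": 12, "value": "world"}
"""


def word_cues(marks_text, final_duration_ms=300):
    """Return (start_ms, end_ms, word) tuples; each cue ends when the next word starts."""
    marks = [json.loads(line) for line in marks_text.splitlines() if line.strip()]
    words = [mark for mark in marks if mark.get("type") == "word"]
    cues = []
    for index, mark in enumerate(words):
        if index + 1 < len(words):
            end_ms = words[index + 1]["time"]
        else:
            # Last word: no following mark, so fall back to an assumed duration.
            end_ms = mark["time"] + final_duration_ms
        cues.append((mark["time"], end_ms, mark["value"]))
    return cues


if __name__ == "__main__":
    for start_ms, end_ms, word in word_cues(SAMPLE_MARKS):
        print(f"{start_ms:>5} ms to {end_ms:>5} ms  {word}")
```

The same cue list can drive highlighted captions or, with viseme marks instead of word marks, a lip-sync animation.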

The tradeoff is voice quality. Even Polly's Neural voices sound noticeably more robotic than ElevenLabs, Fish Audio, or even OpenAI. The Generative engine closes the gap but is 7.5x the standard price. And there's no voice cloning at all. Full analysis in our Amazon Polly pricing guide.

Best for: AWS-native apps, IVR/telephony, accessibility features, high-volume/low-cost batch processing
SDK support: Python (boto3), Java, .NET, Go, PHP, Ruby — full AWS SDK coverage
Gotcha: SSML markup counts toward billing, adding 10–30% to effective cost

Best for Voice Agent Platforms: Deepgram & Inworld

Deepgram and Inworld take a different approach: instead of selling TTS as a standalone API, they bundle it into voice agent platforms with STT + LLM + TTS pipelines optimized to work together.

Deepgram's Aura-2 runs at $30/1M characters standalone, but the Voice Agent API bundles STT + TTS for $4.50/hour — which is often cheaper if your agents are handling conversations at volume. The $200 free credit goes a long way for testing. Main limitation: only 7 languages.
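Whether the bundle is cheaper depends on how much speech your agent generates per hour. Here is a quick break-even sketch using only the two rates quoted above. The function name is my own, and the comparison ignores that the bundle also includes STT, which you would otherwise pay for separately.

```python
def break_even_chars_per_hour(bundle_rate_per_hour: float, tts_rate_per_million: float) -> float:
    """TTS characters per hour at which standalone TTS costs the same as the hourly bundle."""
    return bundle_rate_per_hour * 1_000_000 / tts_rate_per_million


if __name__ == "__main__":
    threshold = break_even_chars_per_hour(4.50, 30.0)
    print(f"break-even: {threshold:,.0f} TTS characters per conversation hour")
```

With the quoted rates the threshold is 150,000 characters per hour. Below that volume the bundle only wins once you add the STT cost you would otherwise be paying, so price both sides before switching.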

Inworld's TTS-2 is the highest-ranked commercial model on the Speech Arena (ELO 1,236+). Sub-200ms latency, closed-loop architecture optimized for real-time conversations. Pricing starts at $25–$35/1M but drops to $5–$10/1M at enterprise volume. The catch: you mostly need to talk to sales, and they push their full character platform rather than TTS-only access.

Best Free / Open Source: Kokoro, Chatterbox & Qwen3-TTS

The open-source TTS landscape changed dramatically in early 2026. Three models now offer genuinely usable quality at $0:

Kokoro-82M — 300MB model that hit #1 on the TTS Arena. Runs on CPU. English only, no voice cloning, but the quality is indistinguishable from paid services in blind tests. Perfect for prototyping and English narration.
Chatterbox — MIT-licensed voice cloning from 5 seconds of audio. English only. Quality is a step below ElevenLabs cloning, but it's free and runs locally. Best for projects that need voice cloning without API costs.
Qwen3-TTS — Alibaba's 1.7B model supports 10 languages, voice cloning from 3 seconds, and natural-language voice direction ("speak sadly"). Needs a GPU (6–8GB VRAM). The most capable open-source TTS overall, but the hardware requirement limits who can use it.

For a detailed showdown between all open-source options — including benchmarks, licenses, and hardware requirements for each model — see our open-source text-to-speech comparison.

Also Worth Considering

Google Cloud Text-to-Speech

The most generous ongoing free tier (1M chars/month standard, 100K neural) and 70+ languages. Chirp 3 HD at $30/1M is competitive with ElevenLabs on quality. Custom Voice for enterprise is expensive but effective. The issue is complexity: GCP auth, service accounts, and IAM roles add friction that AWS and OpenAI don't. See our Google Cloud TTS pricing guide for a full tier breakdown.

Azure Cognitive Services Speech

140+ languages (more than anyone else), Custom Neural Voice with professional quality, and tight integration with Azure Bot Service. The 500K chars/month free tier is solid. Best choice for enterprise Microsoft shops with compliance requirements. Full breakdown in our Azure Text to Speech pricing guide.

Gemini Flash TTS

Google's newest entry at ~$12/1M characters output with 200+ audio style tags and 70+ languages. Ranked #2 on the Speech Arena. Still in preview with limited documentation, but the quality-per-dollar ratio is excellent. Worth watching. Read our Gemini TTS review for details.

Murf Falcon API

Murf's API offering at $10–$30/1M characters. Decent voice quality for scripted content, but the Gen-3 engine doesn't match newer models. Limited to 20 languages. Best for teams already using Murf Studio for voiceover production.

Free Tier Comparison

Free tiers matter for prototyping and testing. Here's what each provider gives you before you pay:

Provider	Free Allowance	Duration	≈ Minutes of Speech (at ~1,000 chars/min)
Amazon Polly (Standard)	5M chars/mo	12 months	~5,000 min/mo
Google Cloud (Standard)	1M chars/mo	Ongoing	~1,000 min/mo
Azure Speech	500K chars/mo	Ongoing	~500 min/mo
Deepgram	$200 credit	One-time	~6,667 min total (at $30/1M)
Cartesia	20K chars/mo	Ongoing	~20 min/mo
ElevenLabs	10K chars/mo	Ongoing	~10 min/mo
OpenAI TTS	None	—	0
Kokoro / Chatterbox	Unlimited	Forever	∞

How to Choose: Decision Tree by Use Case

Content Creation / Audiobooks

Quality is everything. Budget secondary.

→ ElevenLabs (best quality) or Fish Audio (best value)

Voice Agents / Phone Bots

Latency is everything. Sub-200ms required.

→ Cartesia (fastest) or Inworld (best quality + speed)

Prototyping / MVP

Speed to ship + minimal cost. Quality can improve later.

→ OpenAI TTS (if using GPT) or Kokoro (free, runs locally)

Enterprise / Compliance

Data residency, SLAs, SOC 2. Voice quality is secondary.

→ Azure Speech or Amazon Polly (cloud compliance built in)

Multilingual Content

Need 20+ languages with natural pronunciation.

→ Azure (140+), Google Cloud (70+), or Fish Audio (80+, best CJK)

Zero Budget / Open Source

No API costs. Self-hosted acceptable.

→ Kokoro (CPU, English) or Qwen3-TTS (GPU, multilingual + cloning)
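If you want this decision tree inside a planning script or an internal tool, it reduces to a simple lookup. The sketch below encodes the six branches exactly as listed above. The use-case keys and the `recommend` function are hypothetical names chosen for illustration.

```python
from typing import Dict, List

# The six branches of the decision tree above, keyed by illustrative use-case names.
RECOMMENDATIONS: Dict[str, List[str]] = {
    "content_creation": ["ElevenLabs", "Fish Audio"],
    "voice_agent": ["Cartesia", "Inworld"],
    "prototype": ["OpenAI TTS", "Kokoro"],
    "enterprise": ["Azure Speech", "Amazon Polly"],
    "multilingual": ["Azure", "Google Cloud", "Fish Audio"],
    "zero_budget": ["Kokoro", "Qwen3-TTS"],
}


def recommend(use_case: str) -> List[str]:
    """Return the shortlisted providers for a use case, or raise with the valid options."""
    try:
        return RECOMMENDATIONS[use_case]
    except KeyError:
        options = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"unknown use case {use_case!r}; expected one of: {options}") from None


if __name__ == "__main__":
    print(recommend("voice_agent"))
```

Real projects often span two branches (a multilingual voice agent, say), in which case the overlap between the two shortlists is the place to start testing.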

5 Pricing Gotchas That Add 20–50% to Your Bill

After writing 10 individual pricing breakdowns, these are the hidden costs that catch developers off guard:

SSML markup billing (Amazon Polly) — SSML tags count as characters. A paragraph with emphasis, breaks, and prosody tags can be 30% longer than the visible text.
UTF-8 byte billing (Fish Audio) — 1 Chinese character = 3 bytes. What looks like $15/1M becomes $45/1M for CJK content.
Concurrent session limits (Cartesia, Inworld) — Hitting the concurrent limit queues requests, adding latency that defeats the purpose of choosing a low-latency provider.
Pro Voice Cloning surcharge (Cartesia) — 1.5x multiplier on all characters generated with cloned voices. Plus $0.014/min phone connection fees.
Output token billing (Gemini Flash) — Gemini charges per output audio token, not input character. Cost depends on speech duration, not text length — longer pauses cost more.
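The concurrent-session gotcha is the one you can partly engineer around: cap concurrency on your side so requests wait in your own queue rather than the provider's. Here is a minimal asyncio sketch. `synthesize` is a hypothetical coroutine standing in for whatever client call you use, and `max_concurrent` should match your plan's actual limit.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def synthesize_all(
    texts: Sequence[str],
    synthesize: Callable[[str], Awaitable[bytes]],
    max_concurrent: int = 4,
) -> List[bytes]:
    """Run synthesize() for every text with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(text: str) -> bytes:
        async with semaphore:  # waits here once the cap is reached
            return await synthesize(text)

    # gather() preserves input order in its results.
    return await asyncio.gather(*(run_one(text) for text in texts))


async def fake_synthesize(text: str) -> bytes:
    """Stand-in for a real API call: sleeps briefly and returns placeholder bytes."""
    await asyncio.sleep(0.01)
    return text.upper().encode("utf-8")


if __name__ == "__main__":
    results = asyncio.run(synthesize_all(["one", "two", "three"], fake_synthesize, max_concurrent=2))
    print(results)
```

A client-side cap does not remove the wait, but it lets you decide which requests go first, for example live calls ahead of batch jobs.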

SDK & Streaming Support Matrix

API	Python	Node.js	Go	Streaming
ElevenLabs	✓ Official	✓ Official	✓ Official	WebSocket
OpenAI	✓ Official	✓ Official	Community	HTTP chunked
Cartesia	✓ Official	✓ Official	—	WebSocket + gRPC
Fish Audio	✓ Official	—	—	WebSocket
Amazon Polly	✓ boto3	✓ AWS SDK	✓ AWS SDK	HTTP chunked
Deepgram	✓ Official	✓ Official	✓ Official	WebSocket
Kokoro	✓ pip install	—	—	Local only

The Bottom Line

The TTS API market split into three tiers in 2026: premium quality (ElevenLabs, Fish Audio S2 Pro, Inworld TTS-2), mid-tier value (OpenAI, Cartesia, Deepgram, Gemini Flash), and free open-source (Kokoro, Chatterbox, Qwen3-TTS). The quality gap between tiers shrinks every quarter. A year ago, using anything other than ElevenLabs for production content felt like a compromise. Now Fish Audio matches or beats it in blind tests at a fraction of the price, and open-source models handle English narration convincingly.

My recommendation: start with Kokoro or OpenAI TTS for prototyping, then evaluate Fish Audio and ElevenLabs for production based on your quality and language requirements. Use our TTS cost calculator to model costs at your expected volume before committing.

Related Pricing & Comparison Guides

ElevenLabs Pricing 2026: $5–$330/mo plans + API rates per character
OpenAI TTS Pricing: tts-1, tts-1-hd, and gpt-4o-mini-tts per-character costs
Cartesia Pricing: Free tier to $299/mo Scale plan, credit system explained
Fish Audio Pricing: $15/1M chars for S2 Pro, UTF-8 byte billing explained
Amazon Polly Pricing: $4–$30/1M across Standard, Neural, and Generative engines
Deepgram Pricing: Aura-2 TTS + Voice Agent API bundle at $4.50/hr
Fish Audio vs ElevenLabs: $15/1M vs $180/1M, blind test data and real comparison
Cartesia vs ElevenLabs: Speed vs quality, when 40ms TTFA beats voice fidelity
ElevenLabs vs Amazon Polly: Premium quality vs cloud-native cost, which API wins for your use case
Kindle Text to Speech: Free built-in reader + 3 better AI alternatives compared
Canva Text to Speech: Free AI voiceover for videos (setup, limits, and better options)
Play.ht Shutdown Guide: Migration paths for Play.ht users switching TTS providers
Open-Source TTS Guide: 8 free models compared on benchmarks, licenses, hardware

By TextToLab Research Team · Last verified June 2026. Pricing from official API documentation. Arena rankings from Speech Arena (blind crowdsourced evaluation). ElevenLabs affiliate link disclosed — all other recommendations are independent.