What is the fastest TTS for voice agents?

Cartesia Sonic 3.5 achieves approximately 40ms time-to-first-audio (TTFA) — the fastest in the market as of June 2026. ElevenLabs Flash v2.5 follows at ~75ms with tighter consistency (28ms IQR vs Cartesia's 100ms). For voice agents, sub-200ms TTFA is considered table stakes for natural-sounding conversation.

How much does voice agent TTS cost per minute?

TTS alone ranges from ~$0.03/min (Fish Audio) to $0.12/min (ElevenLabs Premium). But TTS is only one component — a full voice agent stack (STT + LLM + TTS + telephony) typically costs $0.08-$0.40/min total. Bundled platforms like Deepgram's Voice Agent API ($0.08/min) and Bland AI ($0.09/min all-in) offer the most predictable pricing.

Is Vapi really $0.05 per minute?

The $0.05/min is Vapi's orchestration fee only. You still pay separately for TTS (ElevenLabs ~$0.08/min), STT (Deepgram ~$0.01/min), LLM (GPT-4o-mini ~$0.02/min), and telephony (Twilio ~$0.013/min). Real total cost is $0.15-$0.40/min depending on your provider choices — 3-8x the advertised rate.

What TTS should I use for phone-based voice agents?

Cartesia Sonic 3.5 for latency-critical phone agents (40ms TTFA, built-in telephony at $0.014/min). ElevenLabs for sales calls where voice quality drives conversion ($0.08-$0.12/min). For high-volume outbound, Bland AI's all-inclusive $0.09/min is 30-50% cheaper than building on Vapi.

Can I use free open-source TTS for voice agents?

Kokoro (Apache 2.0, 82M params) is free and produces #1 TTS Arena quality, but it lacks native streaming — a dealbreaker for production phone agents. It works for internal prototypes and low-volume web-based chatbots. For production voice agents, streaming TTS providers like Cartesia or ElevenLabs are necessary.

What is the cheapest voice agent platform?

For outbound calling: Bland AI at ~$0.09/min all-inclusive (30-50% cheaper than Vapi). For inbound customer service: Retell AI at $0.11-$0.15/min total with predictable pricing. For DIY: self-hosted Kokoro TTS + Whisper STT + open-source LLM can run for ~$0.05/min on cloud GPUs, but requires significant engineering.

Best TTS for AI Voice Agents 2026: 7 APIs Ranked by Latency, Cost & Quality

The Short Answer

For most voice agent deployments in 2026, Cartesia Sonic 3.5 is the best TTS if latency is your top priority (40ms TTFA), ElevenLabs Turbo v2.5 if voice quality matters most, and Deepgram Aura-2 if you want the simplest bundled STT+TTS stack. The right choice depends on whether you're building a phone agent, a chatbot, or a customer service pipeline — and how much you're willing to pay per conversation minute.

Voice agent TTS is fundamentally different from regular text-to-speech. When someone's waiting on a phone call, a 300ms delay between their question and your agent's response makes the difference between "this feels natural" and "something is wrong." The sub-200ms time-to-first-audio (TTFA) threshold is table stakes. Streaming output is mandatory. And pricing shifts from per-character to per-minute, which changes the economics entirely.

I've reviewed every major TTS API on this site. This guide narrows it down to the 7 that actually work for voice agent use cases, ranked by the metrics that matter: latency, quality at conversational speed, and real cost per conversation.

7 Voice Agent TTS APIs Compared

Provider	TTFA	Cost/Min	Best For	Streaming
Cartesia Sonic 3.5	~40ms	~$0.05–$0.08	Lowest latency	Yes
ElevenLabs Turbo v2.5	~75ms	$0.08–$0.12	Best voice quality	Yes
Inworld TTS-2	~130ms	~$0.06–$0.10	#1 Speech Arena	Yes
Deepgram Aura-2	~313ms	~$0.08 bundled	Bundled STT+TTS	Yes
OpenAI gpt-4o-mini-tts	~150ms	~$0.10–$0.15	Steerable with instructions	Yes
Fish Audio S2 Pro	~200ms	~$0.03–$0.05	Cheapest with cloning	Yes
Kokoro (self-hosted)	~100ms*	$0 (+ GPU)	Free, English only	No

*Kokoro TTFA is for GPU inference. CPU inference is slower (~400ms+). Kokoro lacks native streaming, which adds latency in voice agent pipelines.

Why You Can't Just Use Any TTS for Voice Agents

Regular TTS generates a complete audio file from text. Voice agent TTS has to do something harder: stream audio in real-time during a live conversation, handle interruptions mid-sentence, and maintain consistent voice quality across hundreds of turns per call.

Three metrics determine whether a TTS provider works for voice agents:

Time-to-first-audio (TTFA): How quickly the first audio bytes arrive after sending text. Under 200ms feels natural. Over 500ms feels broken. This is the single most important metric for voice agents.
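TTFA consistency (IQR): Median latency doesn't tell the whole story. A provider with a 100ms median but a 500ms P95 will deliver a terrible experience on 1 in 20 responses. The interquartile range (the spread between the 25th and 75th percentiles) matters as much as the median, and tail percentiles such as P95 show how bad the outliers get.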
Streaming support: Non-streaming TTS generates the entire audio clip before sending anything. For a 30-word response, that's 2-3 seconds of silence. Streaming starts playback within milliseconds.
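To make these three metrics concrete, here is a minimal sketch of how you might measure TTFA on any streaming response and summarize a batch of measurements. The function names and the sample latencies are illustrative assumptions, not data from the benchmarks cited in this guide, and the percentile method (Python's "inclusive" quantiles) is one of several reasonable choices.

```python
import statistics
import time
from typing import Dict, Iterable, List


def time_to_first_audio(chunks: Iterable[bytes]) -> float:
    """Seconds elapsed until the first non-empty audio chunk arrives."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:
            return time.perf_counter() - start
    raise ValueError("stream ended without producing audio")


def summarize_latency(samples_ms: List[float]) -> Dict[str, float]:
    """Median, interquartile range, and P95 for a list of TTFA samples."""
    q1, median, q3 = statistics.quantiles(samples_ms, n=4, method="inclusive")
    # With n=20 the last cut point is the 95th percentile.
    p95 = statistics.quantiles(samples_ms, n=20, method="inclusive")[-1]
    return {"p50": median, "iqr": q3 - q1, "p95": p95}


# Made-up samples: a fast median with one slow outlier.
samples = [40, 42, 45, 50, 55, 60, 70, 80, 120, 500]
print(summarize_latency(samples))
```

In this made-up sample the median looks healthy while the P95 is several times larger, which is the pattern the consistency metric is meant to expose.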

#1: Cartesia Sonic 3.5 — The Latency Leader

TTFA: ~40ms (fastest in market)
Architecture: State Space Model (SSM), scales linearly
Pricing: From $4/mo (Pro) to $239/mo (Scale)
Phone Integration: Cartesia Line, $0.014/min telephony

Cartesia's SSM architecture is purpose-built for streaming. Where transformer-based TTS models have quadratic scaling costs (each new token looks at every previous token), SSMs scale linearly. The practical result: 40ms TTFA that holds steady under load, not just in benchmarks.
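The scaling difference is easy to see with a toy operation count. This is an illustration of the quadratic-versus-linear argument only, under the simplifying assumption that each attention lookup and each state update costs one unit; it is not a model of Cartesia's actual implementation.

```python
def attention_lookups(num_tokens: int) -> int:
    """Token t attends to itself and the t-1 tokens before it."""
    return sum(range(1, num_tokens + 1))


def ssm_updates(num_tokens: int) -> int:
    """An SSM performs one fixed-size state update per token."""
    return num_tokens


for n in (1_000, 2_000):
    print(n, attention_lookups(n), ssm_updates(n))
```

Doubling the sequence length roughly quadruples the attention count but only doubles the SSM count.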

For voice agent builders, Cartesia offers Cartesia Line, a telephony integration at $0.014/min that handles phone connections directly. That means no separate Twilio integration (Twilio runs $0.013-$0.026/min per call leg). At 10,000 calls/month with 3-minute average duration (30,000 minutes), that's $420 for Cartesia Line vs $390-$780 per call leg for Twilio, or $780-$1,560 if each call bills two legs.
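The telephony arithmetic can be checked in a few lines. The rates and volumes come from the paragraph above; the `legs` parameter is an assumption about how many Twilio call legs are billed per call, since that is what separates a $390-$780 bill from a $780-$1,560 one.

```python
CALLS_PER_MONTH = 10_000
AVG_CALL_MINUTES = 3


def monthly_telephony_cost(rate_per_min: float, legs: int = 1) -> float:
    """Monthly telephony spend in USD for a per-minute rate and leg count."""
    minutes = CALLS_PER_MONTH * AVG_CALL_MINUTES
    return round(minutes * rate_per_min * legs, 2)


print("Cartesia Line:", monthly_telephony_cost(0.014))
print("Twilio, one leg:", monthly_telephony_cost(0.013), "to", monthly_telephony_cost(0.026))
print("Twilio, two legs:", monthly_telephony_cost(0.013, legs=2), "to", monthly_telephony_cost(0.026, legs=2))
```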

The tradeoff: voice quality. Cartesia sounds good — clean, natural pacing. But in direct comparison with ElevenLabs, the voices lack the subtle emotional range that makes conversations feel genuinely human. For customer service calls where efficiency matters more than warmth, Cartesia is the best choice. For sales calls where rapport drives conversion, ElevenLabs is worth the extra cost. Read our full Cartesia review and pricing breakdown.

#2: ElevenLabs Turbo v2.5 — Best Voice Quality

TTFA: ~75ms
Agent Pricing: $0.08/min (Standard) to $0.12/min (Premium)
Task Completion: 94% (ElevenLabs reported)
Telephony: BYO Twilio/Vonage ($0.013-$0.026/min extra)

ElevenLabs crossed $500M ARR in 2026, largely on the back of voice quality that's indistinguishable from human in most scenarios. Their Conversational AI product bills per-minute (not per-character), with three tiers: Standard at $0.08/min, Turbo at $0.10/min, and Premium (GPT-4o + Flash v2.5) at $0.12/min.

The 75ms TTFA is fast enough for natural conversation — you won't notice the latency in practice. Where ElevenLabs stands apart is voice consistency and emotional range. Their 1,000+ voice library means you can find a voice that matches your brand personality. Professional Voice Cloning lets you create a proprietary agent voice that sounds like a real team member.

The downside: no built-in telephony. You need Twilio or Vonage for phone connections, adding $0.013-$0.026/min per call leg. And burst pricing can double your rate during traffic spikes — agents can temporarily handle 3x your concurrency limit, but excess calls are charged at 2x. At scale, this gets expensive. See our full ElevenLabs pricing breakdown.

#3: Inworld TTS-2 — Top of the Speech Arena

TTFA: ~130ms (Realtime API)
Pricing: $25–$35/1M chars ($5–$10 at volume)
Speech Arena: #1 overall (ELO 1,236)
Focus: Purpose-built for voice agents

Inworld's TTS-2 holds the #1 position on the Artificial Analysis Speech Arena leaderboard as of mid-2026. It's not a TTS-first company that added agent features — it's a voice agent platform that built TTS specifically for conversational use. The model was trained on dialogues, not narration, so it handles turn-taking patterns, filler words, and conversational rhythm better than models optimized for audiobook-style content.

Pricing is per-character: $25-$35/1M at standard rates, dropping to $5-$10/1M at enterprise volumes. At 500 voice agent calls per day (3 minutes average), that works out to roughly $339/month — competitive with ElevenLabs at similar volume. Full pricing analysis in our Inworld pricing guide.

#4: Deepgram Aura-2 — The All-in-One Bundle

TTFA: ~313ms (P50)
Voice Agent API: $0.08/min bundled (STT + TTS + turn-taking)
STT Engine: Nova-3 (industry-leading accuracy)
Consistency: 68ms IQR

Deepgram's pitch is simplicity. Their Voice Agent API bundles speech-to-text (Nova-3), text-to-speech (Aura-2), and conversational turn-taking logic into a single API at $0.08/min. No stitching together Twilio + STT provider + LLM + TTS provider + orchestration layer.

The 313ms median TTFA is the main limitation — it's noticeably slower than Cartesia or ElevenLabs. For phone agents where every millisecond matters, that gap is real. But for web-based chatbots, internal tools, or applications where a slight pause is acceptable, the all-in-one approach saves significant engineering time. Their STT accuracy with Nova-3 is industry-leading, especially for healthcare, finance, and compliance-heavy verticals. Read our Deepgram pricing breakdown.

#5: OpenAI gpt-4o-mini-tts — Steerable Voice

TTFA: ~150ms
Pricing: $15/1M chars (TTS), $24/1M (Realtime API)
Unique Feature: Natural language instructions for voice style
Voices: 9 built-in voices

OpenAI's TTS has a unique trick: you can steer the voice style with natural language instructions. "Speak slowly and empathetically" or "sound excited and energetic" actually change the output in meaningful ways. For voice agents that need different tones for different conversation states (calm during troubleshooting, upbeat during upselling), this flexibility is powerful without requiring multiple voice profiles.
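As a rough sketch of how that might look in an agent, the snippet below maps conversation states to style instructions and builds the request for OpenAI's speech endpoint. The state names, the `STYLE_BY_STATE` table, and the helper functions are hypothetical. The `synthesize` function assumes the official `openai` Python package and an `OPENAI_API_KEY` in the environment, so check the current API reference before relying on it.

```python
from typing import Dict

# Hypothetical mapping from conversation state to a style instruction.
STYLE_BY_STATE: Dict[str, str] = {
    "troubleshooting": "Speak slowly and empathetically.",
    "upsell": "Sound excited and energetic.",
}
DEFAULT_STYLE = "Speak in a clear, neutral tone."


def build_speech_request(text: str, state: str, voice: str = "alloy") -> Dict[str, str]:
    """Assemble keyword arguments for a speech request for the given state."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": STYLE_BY_STATE.get(state, DEFAULT_STYLE),
    }


def synthesize(text: str, state: str) -> bytes:
    """Call the API (requires the openai package and network access)."""
    from openai import OpenAI  # imported lazily so the helpers above run without it

    client = OpenAI()
    response = client.audio.speech.create(**build_speech_request(text, state))
    return response.read()


print(build_speech_request("Let's restart your router.", "troubleshooting"))
```

Keeping the instruction lookup separate from the API call makes it easy to test the tone logic without spending credits.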

The Realtime API at $24/1M characters enables voice-to-voice interaction without a separate STT step — the model processes audio input directly. That cuts latency for simple Q&A agents, though it sacrifices the control you get from a traditional STT → LLM → TTS pipeline. See our full OpenAI TTS pricing analysis.

#6: Fish Audio S2 Pro — Best Budget Option

TTFA: ~200ms
Pricing: $15/1M UTF-8 bytes (~$0.03-$0.05/min)
Blind Tests: 60% win rate vs ElevenLabs
Languages: 13+ (strong CJK)

Fish Audio's S2 Pro hits an unusual sweet spot: near-ElevenLabs quality at roughly one-third the price. The 60% blind test win rate isn't a typo — in community evaluations, Fish Audio S2 Pro was preferred over ElevenLabs more often than not. The catch is that billing uses UTF-8 bytes, not characters. For English text, 1 byte ≈ 1 character. For CJK text, each character is 3 bytes, tripling the effective cost.
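The byte-versus-character difference is simple to verify yourself. This sketch uses the $15/1M-byte rate quoted above. The helper names are illustrative, and real invoices may round or batch differently.

```python
PRICE_PER_MILLION_BYTES_USD = 15.00


def billed_bytes(text: str) -> int:
    """Number of UTF-8 bytes the text occupies."""
    return len(text.encode("utf-8"))


def estimated_cost_usd(text: str) -> float:
    """Estimated charge for synthesizing the text at the per-byte rate."""
    return billed_bytes(text) * PRICE_PER_MILLION_BYTES_USD / 1_000_000


for sample in ("Hello", "こんにちは"):
    print(sample, len(sample), "chars,", billed_bytes(sample), "bytes")
```

Both samples are five characters long, but the Japanese greeting is billed as 15 bytes rather than 5.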

At ~200ms TTFA, Fish Audio sits at the edge of acceptable voice agent latency. It works well for web-based chat agents and internal tools, but phone agents will feel slightly delayed compared to Cartesia or ElevenLabs. The self-hosting option (CC BY-NC-SA) eliminates API costs for non-commercial projects. Full comparison in our Fish Audio vs ElevenLabs analysis and pricing guide.

#7: Kokoro — Zero-Cost English Option

TTFA: ~100ms (GPU) / 400ms+ (CPU)
Pricing: $0 (Apache 2.0)
Parameters: 82M (300MB model file)
Limitation: English only, no streaming, no voice cloning

Kokoro is the only TTS model in this list with a fully permissive license (Apache 2.0) and zero API costs. Self-host it on a $0.20/hr GPU and your TTS bill drops to nearly nothing. The quality reached #1 on the TTS Arena in January 2026.

The dealbreaker for most voice agents: no native streaming. Kokoro generates the entire audio clip before returning it, which adds latency that streaming providers avoid. There are community workarounds (chunking text into sentences and generating concurrently), but it's engineering overhead that managed APIs handle automatically. For internal prototypes or low-volume English agents, it's a viable option. For production phone agents, you need streaming. Read our full Kokoro TTS review.
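The sentence-chunking workaround might look roughly like the sketch below. The `synthesize` argument is a hypothetical stand-in for whatever function wraps your Kokoro inference call, and the regex splitter is deliberately naive (it will mis-split abbreviations such as "Dr."). Treat it as an outline of the idea, not a production implementation.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Naive split on whitespace that follows ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def stream_by_sentence(
    text: str,
    synthesize: Callable[[str], bytes],
    max_workers: int = 4,
) -> Iterator[bytes]:
    """Synthesize sentences concurrently and yield audio in spoken order."""
    sentences = split_sentences(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(synthesize, s) for s in sentences]
        for future in futures:
            # Playback of sentence 1 can begin while later sentences render.
            yield future.result()


def fake_synthesize(sentence: str) -> bytes:
    """Placeholder that returns the text as bytes instead of real audio."""
    return sentence.encode("utf-8")


for audio in stream_by_sentence("Hi there. How are you? Fine!", fake_synthesize):
    print(audio)
```

The caller still waits for the whole first sentence to render, so this narrows the gap with true streaming without closing it.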

Latency Benchmarks: TTFA and Consistency

These numbers come from independent benchmarks (Gradium, Coval), not provider marketing. Median TTFA tells you the typical experience. IQR (interquartile range) tells you how consistent it is — lower is better.

Provider	P50 TTFA	IQR	Verdict
Cartesia Sonic 3.5	~40ms	100ms	Fastest median. Widest IQR of those measured.
ElevenLabs Flash v2.5	~75ms	28ms	Very consistent. Tight IQR.
Inworld TTS-2	~130ms	—	Good enough for most agents.
OpenAI gpt-4o-mini-tts	~150ms	—	Acceptable for web agents.
Deepgram Aura-2	~313ms	68ms	Noticeable delay on phone calls.

An important nuance: Cartesia has the fastest median but a wider IQR (100ms) than ElevenLabs (28ms). That means Cartesia is faster on average but less predictable. For most voice agents, the median matters more than the tail — users notice average response time, not occasional outliers. But for high-stakes calls (healthcare triage, emergency services), ElevenLabs' consistency is worth the 35ms slower median.

What Voice Agents Actually Cost Per Conversation

TTS is only one piece of the voice agent cost puzzle. A complete pipeline includes speech-to-text, LLM inference, TTS, and telephony. Here's what the full stack costs at different scales, assuming 3-minute average call duration:

Stack	TTS/Min	Total/Min	1K Calls/Mo	10K Calls/Mo
Cartesia + Deepgram STT + Twilio	~$0.06	~$0.11	$330	$3,300
ElevenLabs Conversational AI	$0.08	~$0.12	$360	$3,600
Deepgram Voice Agent API	bundled	~$0.08	$240	$2,400
Fish Audio + Whisper + Twilio	~$0.04	~$0.09	$270	$2,700
Kokoro (self-hosted) + Whisper	~$0.01	~$0.05	$150	$1,500

Total/min includes estimated STT ($0.01/min), LLM inference ($0.02/min for GPT-4o-mini), and telephony ($0.013/min via Twilio) where applicable. Actual costs vary based on LLM choice, call complexity, and volume discounts.
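The monthly columns follow directly from the Total/Min figures and the 3-minute call assumption. A small sketch for plugging in your own volume (the dictionary simply restates the table, and the function name is illustrative):

```python
AVG_CALL_MINUTES = 3

# Approximate total cost per minute, taken from the table above.
TOTAL_PER_MIN_USD = {
    "Cartesia + Deepgram STT + Twilio": 0.11,
    "ElevenLabs Conversational AI": 0.12,
    "Deepgram Voice Agent API": 0.08,
    "Fish Audio + Whisper + Twilio": 0.09,
    "Kokoro (self-hosted) + Whisper": 0.05,
}


def monthly_cost_usd(rate_per_min: float, calls_per_month: int) -> int:
    """Monthly spend in whole dollars for a per-minute rate and call volume."""
    return round(rate_per_min * calls_per_month * AVG_CALL_MINUTES)


for stack, rate in TOTAL_PER_MIN_USD.items():
    print(stack, monthly_cost_usd(rate, 1_000), monthly_cost_usd(rate, 10_000))
```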

The hidden cost multiplier is LLM inference. At $0.02/min for GPT-4o-mini, the LLM often costs as much as the TTS layer itself. Voice agent builders spending $5,000+/month should look at smaller fine-tuned models or Groq for LLM inference — that can cut the LLM line item by 80%. Use our TTS cost calculator to model your specific scenario.

Bundled vs BYO: Voice Agent Platform Pricing

You don't have to assemble the stack yourself. Voice agent platforms bundle TTS + STT + orchestration + telephony into a single product. The trade-off is higher per-minute cost for less engineering work.

Platform	Advertised Rate	True Cost/Min	What's Included
Vapi	$0.05/min	$0.15–$0.40/min	Orchestration only — BYO TTS, STT, LLM, telephony
Retell AI	$0.07/min	$0.11–$0.15/min	Voice engine + optional BYO LLM + telephony at $0.015/min
Bland AI	~$0.09/min	~$0.09/min	All-inclusive for outbound. $0.015 per attempt.
ElevenLabs Agents	$0.08/min	$0.10–$0.14/min	Voice + LLM included. BYO telephony ($0.013-$0.026/min).

The Pricing Trap Nobody Talks About

Vapi's $0.05/min sounds cheapest, but it's an orchestration fee on top of your provider costs. Add ElevenLabs TTS ($0.08/min), Deepgram STT ($0.01/min), GPT-4o-mini ($0.02/min), and Twilio ($0.013/min), and that stack comes to about $0.17/min. Across provider choices the real total runs $0.15-$0.40/min, or 3-8x the advertised rate. Bland AI's $0.09/min is genuinely all-inclusive for outbound calls, making it 30-50% cheaper than Vapi at high volume. But Bland is outbound-only. For inbound customer service agents, Retell AI at $0.11-$0.15/min total offers the most predictable pricing.
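Summing the line items for that example stack shows where the multiple comes from. The component rates are the ones quoted in the paragraph above; the variable and function names are illustrative.

```python
VAPI_STACK_USD_PER_MIN = {
    "Vapi orchestration": 0.05,
    "ElevenLabs TTS": 0.08,
    "Deepgram STT": 0.01,
    "GPT-4o-mini": 0.02,
    "Twilio telephony": 0.013,
}
BLAND_ALL_IN_USD_PER_MIN = 0.09


def stack_total(components: dict) -> float:
    """Total per-minute cost of a stack, rounded to a tenth of a cent."""
    return round(sum(components.values()), 3)


total = stack_total(VAPI_STACK_USD_PER_MIN)
multiple = round(total / VAPI_STACK_USD_PER_MIN["Vapi orchestration"], 2)
savings_pct = round((total - BLAND_ALL_IN_USD_PER_MIN) / total * 100)

print(f"Stack total: ${total}/min ({multiple}x the advertised rate)")
print(f"Bland AI is about {savings_pct}% cheaper than this stack")
```

Swapping in a pricier TTS tier or LLM pushes the total toward the upper end of the $0.15-$0.40 range.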

How to Choose: A Decision Framework

After reviewing every provider on this site, here's the framework I'd use:

Phone agents with latency-critical UX: Cartesia Sonic 3.5. Nothing else hits 40ms TTFA with streaming. If you need telephony built in, their Line product saves a Twilio integration.
Sales/marketing calls where voice quality drives conversion: ElevenLabs. The voice quality gap still matters for rapport-dependent conversations. The roughly $0.03-$0.04/min premium over Cartesia pays for itself if your conversion rate improves even slightly.
Enterprise with compliance requirements: Deepgram. HIPAA, SOC 2, and industry-specific STT models (medical, legal, financial) make it the safest choice for regulated industries. Accept the latency trade-off.
High-volume outbound at lowest cost: Bland AI at $0.09/min all-in. At 10K+ calls/month, the savings over Vapi ($0.15-$0.40/min) are massive. Limited to outbound only.
Prototype or internal tool: Kokoro (free, self-hosted) or Fish Audio ($15/1M). Don't spend $0.12/min on ElevenLabs for an internal tool that 5 people use.
Multilingual voice agents: ElevenLabs (70+ languages) or Fish Audio (13+ with strong CJK). Cartesia's language support is more limited.

For detailed pricing on any of these providers, compare them all on our TTS pricing comparison page or explore our complete TTS API comparison.

By TextToLab Research Team · Last verified June 2026. Latency benchmarks from Gradium TTS Benchmark (2026) and Coval evaluation platform. Voice quality rankings from Artificial Analysis Speech Arena leaderboard. Platform pricing verified against Vapi, Retell AI, Bland AI, ElevenLabs, Deepgram, Cartesia, and OpenAI pricing pages (June 2026). Voice agent market size from Grand View Research ($4.8B, 38% CAGR). ElevenLabs task completion data from ElevenLabs published benchmarks. Individual provider analyses from our pricing deep-dives (see links above).