The Short Answer
For most voice agent deployments in 2026, Cartesia Sonic 3.5 is the best TTS if latency is your top priority (40ms TTFA), ElevenLabs Turbo v2.5 if voice quality matters most, and Deepgram Aura-2 if you want the simplest bundled STT+TTS stack. The right choice depends on whether you're building a phone agent, a chatbot, or a customer service pipeline — and how much you're willing to pay per conversation minute.
Voice agent TTS is fundamentally different from regular text-to-speech. When someone's waiting on a phone call, a 300ms delay between their question and your agent's response makes the difference between "this feels natural" and "something is wrong." The sub-200ms time-to-first-audio (TTFA) threshold is table stakes. Streaming output is mandatory. And pricing shifts from per-character to per-minute, which changes the economics entirely.
I've reviewed every major TTS API on this site. This guide narrows it down to the 7 that actually work for voice agent use cases, ranked by the metrics that matter: latency, quality at conversational speed, and real cost per conversation.
7 Voice Agent TTS APIs Compared
| Provider | TTFA | Cost/Min | Best For | Streaming |
|---|---|---|---|---|
| Cartesia Sonic 3.5 | ~40ms | ~$0.05–$0.08 | Lowest latency | Yes |
| ElevenLabs Turbo v2.5 | ~75ms | $0.08–$0.12 | Best voice quality | Yes |
| Inworld TTS-2 | ~130ms | ~$0.06–$0.10 | #1 Speech Arena | Yes |
| Deepgram Aura-2 | ~313ms | ~$0.08 bundled | Bundled STT+TTS | Yes |
| OpenAI gpt-4o-mini-tts | ~150ms | ~$0.10–$0.15 | Steerable with instructions | Yes |
| Fish Audio S2 Pro | ~200ms | ~$0.03–$0.05 | Cheapest with cloning | Yes |
| Kokoro (self-hosted) | ~100ms* | $0 (+ GPU) | Free, English only | No |
*Kokoro TTFA is for GPU inference. CPU inference is slower (~400ms+). Kokoro lacks native streaming, which adds latency in voice agent pipelines.
Why You Can't Just Use Any TTS for Voice Agents
Regular TTS generates a complete audio file from text. Voice agent TTS has to do something harder: stream audio in real-time during a live conversation, handle interruptions mid-sentence, and maintain consistent voice quality across hundreds of turns per call.
Three metrics determine whether a TTS provider works for voice agents:
- Time-to-first-audio (TTFA): How quickly the first audio bytes arrive after sending text. Under 200ms feels natural. Over 500ms feels broken. This is the single most important metric for voice agents.
- TTFA consistency (IQR): Median latency doesn't tell the whole story. A provider with a 100ms median but a 500ms P95 delivers a noticeably slow response on roughly 1 in 20 turns. The spread (the interquartile range, plus tail percentiles such as P95) matters as much as the median.
- Streaming support: Non-streaming TTS generates the entire audio clip before sending anything. For a 30-word response, that's 2-3 seconds of silence. Streaming starts playback within milliseconds.
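To make the TTFA definition concrete, here is a minimal sketch of how you might time it against any streaming response. The `fake_tts_stream` generator is a hypothetical stand-in for a provider's chunk iterator, not a real SDK call; in a real test you would start the clock at the moment the request is sent.

```python
import time
from typing import Iterable, Iterator, Optional


def time_to_first_audio(chunks: Iterable[bytes]) -> Optional[float]:
    """Return seconds until the first non-empty chunk, or None if nothing arrives."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:  # ignore empty frames, count only real audio bytes
            return time.perf_counter() - start
    return None


def fake_tts_stream(first_chunk_delay_s: float, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical provider stream: waits, then yields fixed-size audio chunks."""
    time.sleep(first_chunk_delay_s)
    for _ in range(n_chunks):
        yield b"\x00" * 320


ttfa = time_to_first_audio(fake_tts_stream(0.05))
print(f"TTFA: {ttfa * 1000:.0f} ms")
```

Run it many times per provider rather than once; a single sample says nothing about consistency.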
#1: Cartesia Sonic 3.5 — The Latency Leader
Cartesia's state space model (SSM) architecture is purpose-built for streaming. Where transformer-based TTS models have quadratic scaling costs (each new token looks at every previous token), SSMs scale linearly with sequence length. The practical result: 40ms TTFA that holds steady under load, not just in benchmarks.
For voice agent builders, Cartesia offers Cartesia Line, a telephony integration at $0.014/min that handles phone connections directly. That means no separate Twilio integration (saving $0.013-$0.026/min per call leg). At 10,000 calls/month with a 3-minute average duration (30,000 minutes), that's $420 for Cartesia Line vs $780-$1,560 for Twilio when two call legs are billed per call.
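The telephony comparison is simple arithmetic, sketched below. The two-leg assumption for Twilio is mine: it is the leg count that reproduces the $780-$1,560 range from the per-leg rates quoted above.

```python
CALLS_PER_MONTH = 10_000
MINUTES_PER_CALL = 3
MONTHLY_MINUTES = CALLS_PER_MONTH * MINUTES_PER_CALL  # 30,000 minutes

CARTESIA_LINE_RATE = 0.014                     # $/min, from the article
TWILIO_RATE_LOW, TWILIO_RATE_HIGH = 0.013, 0.026  # $/min per call leg
TWILIO_LEGS = 2                                # assumption: two billed legs per call


def monthly_telephony_cost(rate_per_min: float, minutes: int, legs: int = 1) -> float:
    """Monthly telephony bill in dollars, rounded to cents."""
    return round(rate_per_min * minutes * legs, 2)


cartesia = monthly_telephony_cost(CARTESIA_LINE_RATE, MONTHLY_MINUTES)
twilio_low = monthly_telephony_cost(TWILIO_RATE_LOW, MONTHLY_MINUTES, TWILIO_LEGS)
twilio_high = monthly_telephony_cost(TWILIO_RATE_HIGH, MONTHLY_MINUTES, TWILIO_LEGS)
print(cartesia, twilio_low, twilio_high)
```

If your calls only bill one Twilio leg, set `TWILIO_LEGS = 1` and the gap narrows considerably.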
The tradeoff: voice quality. Cartesia sounds good — clean, natural pacing. But in direct comparison with ElevenLabs, the voices lack the subtle emotional range that makes conversations feel genuinely human. For customer service calls where efficiency matters more than warmth, Cartesia is the best choice. For sales calls where rapport drives conversion, ElevenLabs is worth the extra cost. Read our full Cartesia review and pricing breakdown.
#2: ElevenLabs Turbo v2.5 — Best Voice Quality
ElevenLabs crossed $500M ARR in 2026, largely on the back of voice quality that's indistinguishable from human in most scenarios. Their Conversational AI product bills per-minute (not per-character), with three tiers: Standard at $0.08/min, Turbo at $0.10/min, and Premium (GPT-4o + Flash v2.5) at $0.12/min.
The 75ms TTFA is fast enough for natural conversation — you won't notice the latency in practice. Where ElevenLabs stands apart is voice consistency and emotional range. Their 1,000+ voice library means you can find a voice that matches your brand personality. Professional Voice Cloning lets you create a proprietary agent voice that sounds like a real team member.
The downside: no built-in telephony. You need Twilio or Vonage for phone connections, adding $0.013-$0.026/min per call leg. And burst pricing can double your rate during traffic spikes — agents can temporarily handle 3x your concurrency limit, but excess calls are charged at 2x. At scale, this gets expensive. See our full ElevenLabs pricing breakdown.
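To see how burst pricing moves your effective rate, here is a rough sketch. It assumes burst minutes are simply billed at 2x the base rate; the split between normal and burst minutes is a made-up example, not ElevenLabs data.

```python
def burst_adjusted_cost(
    base_rate: float,
    normal_minutes: float,
    burst_minutes: float,
    burst_multiplier: float = 2.0,
) -> float:
    """Monthly bill when minutes above the concurrency limit cost a multiple of the base rate."""
    normal = base_rate * normal_minutes
    burst = base_rate * burst_multiplier * burst_minutes
    return round(normal + burst, 2)


# Hypothetical month: 30,000 minutes at $0.10/min, 10% of them during bursts.
bill = burst_adjusted_cost(0.10, normal_minutes=27_000, burst_minutes=3_000)
effective_rate = bill / 30_000
print(bill, round(effective_rate, 3))
```

Even a modest share of burst traffic raises the blended per-minute rate, which is worth modeling before you pick a concurrency tier.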
#3: Inworld TTS-2 — Top of the Speech Arena
Inworld's TTS-2 holds the #1 position on the Artificial Analysis Speech Arena leaderboard as of mid-2026. It's not a TTS-first company that added agent features — it's a voice agent platform that built TTS specifically for conversational use. The model was trained on dialogues, not narration, so it handles turn-taking patterns, filler words, and conversational rhythm better than models optimized for audiobook-style content.
Pricing is per-character: $25-$35/1M at standard rates, dropping to $5-$10/1M at enterprise volumes. At 500 voice agent calls per day (3 minutes average), that works out to roughly $339/month — competitive with ElevenLabs at similar volume. Full pricing analysis in our Inworld pricing guide.
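Per-character pricing only becomes a per-minute number once you assume how many characters the agent speaks per call minute. The sketch below runs the conversion both ways. The 30-day month is my assumption, and the implied characters-per-minute figure is backed out from the $339/month estimate rather than stated by Inworld.

```python
def monthly_tts_cost(
    calls_per_day: int,
    minutes_per_call: float,
    chars_per_minute: float,
    price_per_million: float,
    days: int = 30,
) -> float:
    """Monthly TTS bill in dollars for per-character pricing."""
    chars = calls_per_day * days * minutes_per_call * chars_per_minute
    return chars / 1_000_000 * price_per_million


def implied_chars_per_minute(
    monthly_cost: float,
    calls_per_day: int,
    minutes_per_call: float,
    price_per_million: float,
    days: int = 30,
) -> float:
    """Back out the characters synthesized per call minute from a monthly bill."""
    minutes = calls_per_day * days * minutes_per_call
    return monthly_cost / price_per_million * 1_000_000 / minutes


low_price = implied_chars_per_minute(339, 500, 3, price_per_million=25)
high_price = implied_chars_per_minute(339, 500, 3, price_per_million=35)
print(round(low_price), round(high_price))
```

Under these assumptions the estimate corresponds to a few hundred synthesized characters per call minute, which is plausible when the caller is talking for much of the call. Plug in your own transcripts to get a real number.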
#4: Deepgram Aura-2 — The All-in-One Bundle
Deepgram's pitch is simplicity. Their Voice Agent API bundles speech-to-text (Nova-3), text-to-speech (Aura-2), and conversational turn-taking logic into a single API at $0.08/min. No stitching together Twilio + STT provider + LLM + TTS provider + orchestration layer.
The 313ms median TTFA is the main limitation — it's noticeably slower than Cartesia or ElevenLabs. For phone agents where every millisecond matters, that gap is real. But for web-based chatbots, internal tools, or applications where a slight pause is acceptable, the all-in-one approach saves significant engineering time. Their STT accuracy with Nova-3 is industry-leading, especially for healthcare, finance, and compliance-heavy verticals. Read our Deepgram pricing breakdown.
#5: OpenAI gpt-4o-mini-tts — Steerable Voice
OpenAI's TTS has a unique trick: you can steer the voice style with natural language instructions. "Speak slowly and empathetically" or "sound excited and energetic" actually change the output in meaningful ways. For voice agents that need different tones for different conversation states (calm during troubleshooting, upbeat during upselling), this flexibility is powerful without requiring multiple voice profiles.
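Here is one way the per-state steering could be wired up. It only builds the request parameters; the state names, the default instruction, and the `alloy` voice choice are illustrative, and the dictionary keys reflect my understanding of OpenAI's speech endpoint parameters, so check the current API reference before relying on them.

```python
from typing import Dict

# Conversation state -> natural-language style instruction (states are hypothetical).
STATE_INSTRUCTIONS: Dict[str, str] = {
    "troubleshooting": "Speak slowly and empathetically.",
    "upsell": "Sound excited and energetic.",
}
DEFAULT_INSTRUCTIONS = "Speak in a clear, neutral tone."


def build_speech_request(text: str, state: str, voice: str = "alloy") -> Dict[str, str]:
    """Assemble TTS request parameters for the current conversation state."""
    return {
        "model": "gpt-4o-mini-tts",
        "voice": voice,
        "input": text,
        "instructions": STATE_INSTRUCTIONS.get(state, DEFAULT_INSTRUCTIONS),
    }


request = build_speech_request("Let's get your router back online.", "troubleshooting")
print(request["instructions"])
```

The point of the design is that one voice covers every state: you change a string per turn instead of maintaining separate voice profiles.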
The Realtime API at $24/1M characters enables voice-to-voice interaction without a separate STT step — the model processes audio input directly. That cuts latency for simple Q&A agents, though it sacrifices the control you get from a traditional STT → LLM → TTS pipeline. See our full OpenAI TTS pricing analysis.
#6: Fish Audio S2 Pro — Best Budget Option
Fish Audio's S2 Pro hits an unusual sweet spot: near-ElevenLabs quality at roughly one-third the price. The 60% blind test win rate isn't a typo — in community evaluations, Fish Audio S2 Pro was preferred over ElevenLabs more often than not. The catch is that billing uses UTF-8 bytes, not characters. For English text, 1 byte ≈ 1 character. For CJK text, each character is 3 bytes, tripling the effective cost.
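The byte-versus-character difference is easy to check yourself. This sketch counts UTF-8 bytes with Python's standard string encoding; the sample sentences are mine.

```python
def billable_bytes(text: str) -> int:
    """Number of UTF-8 bytes in the text, the unit byte-based billing counts."""
    return len(text.encode("utf-8"))


def bytes_per_character(text: str) -> float:
    """Average UTF-8 bytes per character, i.e. the cost multiplier vs per-character pricing."""
    return billable_bytes(text) / len(text)


english = "Hello, how can I help you today?"
chinese = "你好，请问有什么可以帮您？"

for sample in (english, chinese):
    print(len(sample), billable_bytes(sample), bytes_per_character(sample))
```

Mixed-language text lands somewhere in between, so estimate from real transcripts rather than a flat multiplier.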
At ~200ms TTFA, Fish Audio sits at the edge of acceptable voice agent latency. It works well for web-based chat agents and internal tools, but phone agents will feel slightly delayed compared to Cartesia or ElevenLabs. The self-hosting option (CC BY-NC-SA) eliminates API costs for non-commercial projects. Full comparison in our Fish Audio vs ElevenLabs analysis and pricing guide.
#7: Kokoro — Zero-Cost English Option
Kokoro is the only TTS model in this list with a fully permissive license (Apache 2.0) and zero API costs. Self-host it on a $0.20/hr GPU and your TTS bill drops to nearly nothing. The quality reached #1 on the TTS Arena in January 2026.
The dealbreaker for most voice agents: no native streaming. Kokoro generates the entire audio clip before returning it, which adds latency that streaming providers avoid. There are community workarounds (chunking text into sentences and generating concurrently), but it's engineering overhead that managed APIs handle automatically. For internal prototypes or low-volume English agents, it's a viable option. For production phone agents, you need streaming. Read our full Kokoro TTS review.
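The sentence-chunking workaround looks roughly like this. It is a sketch, not Kokoro's API: `fake_synthesize` is a hypothetical stand-in for whatever inference call you use, and the regex splitter is deliberately naive.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pseudo_stream(
    text: str, synthesize: Callable[[str], bytes], max_workers: int = 4
) -> Iterator[bytes]:
    """Yield one audio clip per sentence, in order, while later sentences render in parallel."""
    sentences = split_sentences(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(synthesize, s) for s in sentences]
        for future in futures:
            yield future.result()  # blocks only until this sentence is ready


def fake_synthesize(sentence: str) -> bytes:
    """Hypothetical stand-in for a non-streaming TTS inference call."""
    return sentence.encode("utf-8")


clips = list(pseudo_stream("Hi there. How can I help? Thanks!", fake_synthesize))
print(len(clips))
```

Playback can start once the first sentence is done, so perceived latency drops to roughly one sentence's generation time. Whether the parallel submissions actually overlap depends on your hardware; a single GPU may serialize them.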
Latency Benchmarks: TTFA and Consistency
These numbers come from independent benchmarks (Gradium, Coval), not provider marketing. Median TTFA tells you the typical experience. IQR (interquartile range) tells you how consistent it is — lower is better.
| Provider | P50 TTFA | IQR | Verdict |
|---|---|---|---|
| Cartesia Sonic 3.5 | ~40ms | 100ms | Fastest median. Widest IQR. |
| ElevenLabs Flash v2.5 | ~75ms | 28ms | Very consistent. Tight IQR. |
| Inworld TTS-2 | ~130ms | — | Good enough for most agents. |
| OpenAI gpt-4o-mini-tts | ~150ms | — | Acceptable for web agents. |
| Deepgram Aura-2 | ~313ms | 68ms | Noticeable delay on phone calls. |
An important nuance: Cartesia has the fastest median but a wider IQR (100ms) than ElevenLabs (28ms). That means Cartesia is faster on average but less predictable. For most voice agents, the median matters more than the tail — users notice average response time, not occasional outliers. But for high-stakes calls (healthcare triage, emergency services), ElevenLabs' consistency is worth the 35ms slower median.
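If you collect your own TTFA samples, median and IQR take a few lines with Python's standard `statistics` module. The sample lists below are synthetic, invented to show two latency profiles with different spread; they are not benchmark data.

```python
import statistics
from typing import Dict, Sequence


def summarize_ttfa(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Median, interquartile range, and P95 of TTFA samples in milliseconds."""
    q1, q2, q3 = statistics.quantiles(samples_ms, n=4)   # quartile cut points
    p95 = statistics.quantiles(samples_ms, n=20)[-1]     # last of 19 cut points = 95th percentile
    return {"p50": q2, "iqr": q3 - q1, "p95": p95}


# Synthetic examples: one tight distribution, one fast but spread out.
tight = [70, 72, 74, 75, 76, 78, 80, 85]
wide = [20, 25, 35, 40, 45, 90, 150, 400]

for name, samples in (("tight", tight), ("wide", wide)):
    print(name, summarize_ttfa(samples))
```

Note that IQR describes the middle half of responses, while P95 describes the tail. A provider can look fine on one and poor on the other, so report both.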
What Voice Agents Actually Cost Per Conversation
TTS is only one piece of the voice agent cost puzzle. A complete pipeline includes speech-to-text, LLM inference, TTS, and telephony. Here's what the full stack costs at different scales, assuming 3-minute average call duration:
| Stack | TTS/Min | Total/Min | 1K Calls/Mo | 10K Calls/Mo |
|---|---|---|---|---|
| Cartesia + Deepgram STT + Twilio | ~$0.06 | ~$0.11 | $330 | $3,300 |
| ElevenLabs Conversational AI | $0.08 | ~$0.12 | $360 | $3,600 |
| Deepgram Voice Agent API | bundled | ~$0.08 | $240 | $2,400 |
| Fish Audio + Whisper + Twilio | ~$0.04 | ~$0.09 | $270 | $2,700 |
| Kokoro (self-hosted) + Whisper | ~$0.01 | ~$0.05 | $150 | $1,500 |
Total/min includes estimated STT ($0.01/min), LLM inference ($0.02/min for GPT-4o-mini), and telephony ($0.013/min via Twilio) where applicable. Actual costs vary based on LLM choice, call complexity, and volume discounts.
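The table can be reproduced with a small cost model, sketched here using the component estimates from the note above. The function names are mine. The exact sum for the Cartesia stack is $0.103/min, which the table rounds up to ~$0.11 before multiplying out.

```python
STT_PER_MIN = 0.01         # estimated speech-to-text cost
LLM_PER_MIN = 0.02         # estimated GPT-4o-mini inference cost
TELEPHONY_PER_MIN = 0.013  # Twilio, where applicable
MINUTES_PER_CALL = 3


def total_per_minute(
    tts_per_min: float,
    stt: float = STT_PER_MIN,
    llm: float = LLM_PER_MIN,
    telephony: float = TELEPHONY_PER_MIN,
) -> float:
    """Sum the per-minute cost of every layer in the voice agent stack."""
    return tts_per_min + stt + llm + telephony


def monthly_cost(total_per_min: float, calls_per_month: int) -> float:
    """Monthly bill in dollars for a given all-in per-minute rate."""
    return round(total_per_min * calls_per_month * MINUTES_PER_CALL, 2)


cartesia_exact = total_per_minute(0.06)
print(round(cartesia_exact, 3), monthly_cost(0.11, 1_000), monthly_cost(0.11, 10_000))
```

Swap in your own call length and component rates; the 3-minute average drives every number in the table.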
The hidden cost multiplier is LLM inference. At $0.02/min for GPT-4o-mini, the LLM adds a quarter to a half of the TTS cost in the API-based stacks above, and costs more than the TTS layer itself once you self-host Kokoro. Voice agent builders spending $5,000+/month should look at smaller fine-tuned models or Groq for LLM inference, which can cut the LLM line item by 80%. Use our TTS cost calculator to model your specific scenario.
Bundled vs BYO: Voice Agent Platform Pricing
You don't have to assemble the stack yourself. Voice agent platforms bundle TTS + STT + orchestration + telephony into a single product. The trade-off is higher per-minute cost for less engineering work.
| Platform | Advertised Rate | True Cost/Min | What's Included |
|---|---|---|---|
| Vapi | $0.05/min | $0.15–$0.40/min | Orchestration only — BYO TTS, STT, LLM, telephony |
| Retell AI | $0.07/min | $0.11–$0.15/min | Voice engine + optional BYO LLM + telephony at $0.015/min |
| Bland AI | ~$0.09/min | ~$0.09/min | All-inclusive for outbound. $0.015 per attempt. |
| ElevenLabs Agents | $0.08/min | $0.10–$0.14/min | Voice + LLM included. BYO telephony ($0.013-$0.026/min). |
The Pricing Trap Nobody Talks About
Vapi's $0.05/min sounds cheapest, but it's an orchestration fee on top of your provider costs. Add ElevenLabs TTS ($0.08/min), Deepgram STT ($0.01/min), GPT-4o-mini ($0.02/min), and Twilio ($0.013/min), and that stack alone comes to about $0.17/min; pricier component choices push it toward $0.40/min. That is 3-8x the advertised rate. Bland AI's $0.09/min rate is genuinely all-inclusive for outbound calls, making it 30-50% cheaper than Vapi at high volume. But Bland is outbound-only. For inbound customer service agents, Retell AI at $0.11-$0.15/min total is the most predictable pricing.
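The low end of the Vapi range is just the sum of the line items listed above. A quick sketch of that arithmetic, using the article's example stack:

```python
# Per-minute line items for the example Vapi stack, in dollars.
VAPI_STACK = {
    "Vapi orchestration": 0.05,
    "ElevenLabs TTS": 0.08,
    "Deepgram STT": 0.01,
    "GPT-4o-mini": 0.02,
    "Twilio": 0.013,
}

advertised = VAPI_STACK["Vapi orchestration"]
true_cost = round(sum(VAPI_STACK.values()), 3)
multiple = true_cost / advertised

print(f"${true_cost}/min, {multiple:.1f}x the advertised rate")
```

Run the same sum for any orchestration-only platform before comparing it against an all-inclusive rate.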
How to Choose: A Decision Framework
After reviewing every provider on this site, here's the framework I'd use:
- Phone agents with latency-critical UX: Cartesia Sonic 3.5. Nothing else hits 40ms TTFA with streaming. If you need telephony built in, their Line product saves a Twilio integration.
- Sales/marketing calls where voice quality drives conversion: ElevenLabs. The voice quality gap still matters for rapport-dependent conversations. The premium over Cartesia (roughly $0.02-$0.04/min at the rates listed above) pays for itself if your conversion rate improves even slightly.
- Enterprise with compliance requirements: Deepgram. HIPAA, SOC 2, and industry-specific STT models (medical, legal, financial) make it the safest choice for regulated industries. Accept the latency trade-off.
- High-volume outbound at lowest cost: Bland AI at $0.09/min all-in. At 10K+ calls/month, the savings over Vapi ($0.15-$0.40/min) are massive. Limited to outbound only.
- Prototype or internal tool: Kokoro (free, self-hosted) or Fish Audio ($15/1M). Don't spend $0.12/min on ElevenLabs for an internal tool that 5 people use.
- Multilingual voice agents: ElevenLabs (70+ languages) or Fish Audio (13+ with strong CJK). Cartesia's language support is more limited.
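As a starting point for the first cut, here is a small sketch that filters the comparison table by a latency budget and a streaming requirement. The TTFA figures are the approximate values from the table at the top of this guide; the function and its parameters are illustrative, and quality, price, and compliance still need a human decision.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Provider:
    name: str
    ttfa_ms: int      # approximate TTFA from the comparison table
    streaming: bool


PROVIDERS = [
    Provider("Cartesia Sonic 3.5", 40, True),
    Provider("ElevenLabs Turbo v2.5", 75, True),
    Provider("Inworld TTS-2", 130, True),
    Provider("Deepgram Aura-2", 313, True),
    Provider("OpenAI gpt-4o-mini-tts", 150, True),
    Provider("Fish Audio S2 Pro", 200, True),
    Provider("Kokoro (self-hosted)", 100, False),
]


def shortlist(max_ttfa_ms: int, require_streaming: bool = True) -> List[str]:
    """Providers within the latency budget, fastest first."""
    picks = [
        p for p in PROVIDERS
        if p.ttfa_ms <= max_ttfa_ms and (p.streaming or not require_streaming)
    ]
    return [p.name for p in sorted(picks, key=lambda p: p.ttfa_ms)]


print(shortlist(150))
print(shortlist(100, require_streaming=False))
```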
For detailed pricing on any of these providers, compare them all on our TTS pricing comparison page or explore our complete TTS API comparison.
By TextToLab Research Team · Last verified June 2026. Latency benchmarks from Gradium TTS Benchmark (2026) and Coval evaluation platform. Voice quality rankings from Artificial Analysis Speech Arena leaderboard. Platform pricing verified against Vapi, Retell AI, Bland AI, ElevenLabs, Deepgram, Cartesia, and OpenAI pricing pages (June 2026). Voice agent market size from Grand View Research ($4.8B, 38% CAGR). ElevenLabs task completion data from ElevenLabs published benchmarks. Individual provider analyses from our pricing deep-dives (see links above).