How much does Inworld TTS cost?

Inworld TTS pricing is tiered: Standard on-demand is $50/1M characters for Max, $25/1M for Mini. Founder rates (locked before May 7, 2026) are $10/1M for Max, $5/1M for Mini. On third-party platforms like Replicate and fal.ai, pricing is around $10/1M characters. There is no ongoing free tier, but new accounts get 40 free minutes.

Is Inworld TTS better than ElevenLabs?

Inworld TTS-1.5 Max is ranked #1 on the Artificial Analysis Speech Arena (ELO 1,236), ahead of ElevenLabs v3 (#4, ELO 1,179). Inworld has higher voice quality in blind tests. However, ElevenLabs offers more voices (1,000+), better voice cloning, consumer apps, and a free tier. Choose Inworld for raw quality; choose ElevenLabs for features and accessibility.

Does Inworld TTS have a free tier?

No. Inworld TTS does not offer an ongoing free tier. They provide trial credits for evaluation, but you pay from the first character in production. For free TTS options, consider ElevenLabs (10,000 credits/month), Amazon Polly (5M chars free for 12 months), or Chatterbox (fully free, open-source).

What is Inworld TTS latency?

Inworld TTS-1.5 Max has a P90 latency of under 250ms. The Standard variant runs under 200ms, and the Mini variant under 130ms. For comparison, ElevenLabs Turbo is around 300ms, and Cartesia Sonic claims under 90ms TTFA.

Can non-developers use Inworld TTS?

No. Inworld TTS is API-only — there is no consumer product, web app, or studio editor. You need to write code or use an API client to generate audio. For non-technical users, consider Murf AI (studio editor), ElevenLabs (web app), or Speechify (consumer product).

Inworld TTS 1.5 Review 2026: The #1 Ranked Voice AI Tested

Does Inworld TTS support voice cloning?

Yes. Inworld TTS supports instant voice cloning where you upload a short audio reference and the model replicates the voice characteristics. The cloning quality is competitive with ElevenLabs' instant clone feature, though ElevenLabs' professional voice cloning tier produces higher fidelity.

Inworld TTS Review: The Bottom Line

Inworld TTS-1.5 Max is the highest-ranked text-to-speech model in the world right now. It holds the #1 position on the Artificial Analysis Speech Arena with an ELO of 1,236 — ahead of Gemini 3.1 Flash TTS (1,211) and ElevenLabs v3 (1,179). At $0.030 per 1,000 characters, it costs roughly the same as OpenAI's TTS-1-HD but sounds significantly better. The trade-off: it's API-only, there's no consumer product, no free tier, and the company is relatively new. If you're a developer building voice agents or real-time applications and voice quality is your top priority, Inworld is the current leader. For everyone else, the lack of a studio interface and consumer tools makes it harder to justify over ElevenLabs or Murf AI.

Quick Ratings

Category	Rating	Notes
Voice Quality	5/5	#1 globally on Speech Arena
Latency	4.5/5	<250ms Max, <130ms Mini
Pricing Value	3.5/5	$30/1M chars, mid-range
Voice Cloning	4/5	Instant cloning available
API / Developer	4/5	REST + WebSocket, good docs
Ease of Use	2/5	API-only, no consumer product

What Is Inworld TTS?

Inworld AI was founded in 2021 by the team that built Dialogflow — the conversational AI platform Google acquired in 2016. CEO Kylan Gibbs previously led product for LLMs at DeepMind. That pedigree matters: these aren't newcomers to voice AI. They started by building NPC dialogue systems for games (clients include Disney, Ubisoft, Xbox, NVIDIA), and the TTS product evolved from that work into what's now the top-ranked model globally.

The company is headquartered in Mountain View, California. They've raised $125.7 million total, with a $500M post-money valuation on their 2023 Series B led by Lightspeed Venture Partners. Other backers include Intel Capital, Microsoft's M12, Samsung Next, and Stanford University. The funding gives real confidence they're not a fly-by-night operation, but they're still smaller and newer than ElevenLabs.

What's unusual about Inworld is that their TTS training code is open-source on GitHub. This is a transparency move — you can inspect the training pipeline even if you're paying for the hosted API. No other top-tier TTS provider does this. Google, ElevenLabs, and OpenAI keep their training code completely proprietary.

What Changed in TTS-1.5

TTS-1.5 is the generation that pushed Inworld to the #1 spot. If you tested TTS-1 last year and weren't convinced, TTS-1.5 is a meaningful improvement:

30% more expressive — measured by human evaluators comparing emotional range between TTS-1 and TTS-1.5 on identical scripts. The voice handles excitement, sadness, and conversational nuance better.
40% fewer word errors — pronunciation accuracy improved significantly. Technical terms, proper nouns, and numbers are handled more reliably.
4x faster inference — reduced latency from previous generation. The Mini variant hits sub-130ms P90 latency.
Enhanced multilingual support — improved quality for non-English languages, though English remains the strongest.

Three model variants are available: Max (highest quality, <250ms latency), Standard (balanced), and Mini (fastest, <130ms latency, slightly lower quality). For most use cases, Max is the right choice. Mini is only worth considering if you're building real-time voice agents where sub-150ms response time is critical.

Voice Quality: What #1 Actually Means

The Artificial Analysis Speech Arena runs thousands of blind A/B comparisons where real people listen to two AI voices reading the same text and choose which sounds more natural. It's the most credible quality benchmark in TTS right now.

Here's the current top five as of April 2026:

Rank	Model	ELO	Cost/1K Chars
#1	Inworld TTS-1.5 Max	1,236	$0.030
#2	Gemini 3.1 Flash	1,211	~$0.012
#3	Inworld TTS-1.5 Standard	1,195	$0.030
#4	ElevenLabs v3	1,179	$0.050–$0.180
#5	Inworld TTS-1.5 Mini	1,162	$0.030

Inworld holds three of the top five positions. That's not a fluke — it means even their smaller, faster models outrank most competitors' flagship offerings.
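To put the rating gaps in perspective, here is a minimal sketch that converts two ratings into an expected head-to-head preference rate. It assumes the arena scores behave like standard Elo ratings on a 400-point logistic scale; Artificial Analysis may compute its scores differently, so treat the output as a rough guide. The function name is ours, and the ratings are taken from the table above.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the standard Elo model.

    Assumption: the arena score follows the usual 400-point logistic scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Ratings from the table above.
INWORLD_MAX = 1236
GEMINI_FLASH = 1211
ELEVENLABS_V3 = 1179

if __name__ == "__main__":
    print(f"Max vs Gemini Flash:  {expected_win_rate(INWORLD_MAX, GEMINI_FLASH):.1%}")
    print(f"Max vs ElevenLabs v3: {expected_win_rate(INWORLD_MAX, ELEVENLABS_V3):.1%}")
```

Under that assumption, a 57-point lead over ElevenLabs v3 means listeners would pick Inworld Max in roughly 58% of blind comparisons, and the 25-point lead over Gemini Flash works out to roughly 54%. That is a consistent edge, not a landslide.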

In my testing, the quality advantage over ElevenLabs is subtle but real. Inworld TTS-1.5 Max produces slightly more natural transitions between sentences, better handling of punctuation-based pauses, and more consistent prosody across long passages. The gap narrows on short-form content. For a single sentence, you'd struggle to tell them apart. Over a five-minute narration, Inworld's consistency edge becomes noticeable.

Voice Cloning: Quick Setup, Strong Results

Inworld supports instant voice cloning — you upload a short audio reference and the model replicates the voice characteristics. The cloning quality is competitive with ElevenLabs' instant clone feature, though ElevenLabs' professional voice cloning (which requires more training data) still produces higher-fidelity results.

The primary use case for Inworld's cloning is game character voices and AI tutor personalization — both areas where the company has deep experience from their character AI origins. One YouTube creator demonstrated cloning his own voice and building an AI tutor with Inworld TTS-1.5, which shows the practical applicability of the feature.

If voice cloning is your primary need, ElevenLabs is still the safer bet — they've been doing it longer, have more documentation, and offer both instant and professional cloning tiers. But Inworld's cloning quality is strong enough for most applications.

Latency: Fast Enough for Real-Time

Latency matters enormously for voice agents, chatbots, and any application where the user is waiting for a response. Inworld's numbers:

Model	P90 Latency	Use Case
TTS-1.5 Max	<250ms	Highest quality, near-real-time apps
TTS-1.5 Standard	<200ms	Balanced quality and speed
TTS-1.5 Mini	<130ms	Real-time voice agents, lowest latency

For context: Cartesia Sonic claims <90ms TTFA (time-to-first-audio), making it faster than Inworld Mini. ElevenLabs Turbo runs around 300ms. OpenAI TTS typically lands near 400ms. Murf's Falcon API reports 55ms model latency and 130ms median TTFA. These figures mix metrics (P90, median, TTFA), so treat the comparison as approximate.

Inworld's latency is competitive for its quality tier. If raw speed is your top priority (voice calls, live customer support), Cartesia or Murf Falcon are faster. If quality is the priority and you need reasonable latency, Inworld's Max at <250ms is excellent.
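One way to turn the latency table into a decision rule is to pick the highest-ranked variant whose published P90 ceiling fits the time you can spend on speech synthesis. The sketch below is illustrative only: `pick_model` and the budget figures are our own, and the ceilings are the published "<N ms" bounds, not measurements.

```python
from typing import Optional

# (variant, published P90 ceiling in ms), ordered from highest to lowest arena rank.
MODELS = [
    ("TTS-1.5 Max", 250),
    ("TTS-1.5 Standard", 200),
    ("TTS-1.5 Mini", 130),
]


def pick_model(tts_budget_ms: float) -> Optional[str]:
    """Return the best-ranked variant whose P90 ceiling fits the budget, else None."""
    for name, p90_ceiling_ms in MODELS:
        if p90_ceiling_ms <= tts_budget_ms:
            return name
    return None  # no variant fits; a faster provider would be needed


if __name__ == "__main__":
    for budget in (300, 220, 150, 100):
        print(budget, "ms ->", pick_model(budget))
```

A budget under 130ms returns `None`, which corresponds to the case where Cartesia or Murf Falcon would be the better fit.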

Pricing: What It Actually Costs

Inworld's pricing is tiered. The standard on-demand rate is $50/1M characters for Max and $25/1M for Mini. But there's a “Founder rate” — $10/1M for Max, $5/1M for Mini — available if you lock in before May 7, 2026. On third-party platforms like Replicate and fal.ai, pricing sits around $10/1M characters. For full pricing context across providers, check our TTS pricing comparison.

Real-World Cost Examples

Blog post (2,000 words / ~12,000 chars): ~$0.12 at Founder rate, ~$0.60 at standard. Gemini: ~$0.14. ElevenLabs Flash: ~$0.60.
E-learning course (~180K chars, roughly 200 minutes of audio at ~900 chars/min): ~$1.80 at Founder rate, ~$9.00 at standard. Gemini: ~$2.16. Amazon Polly Neural: ~$2.88.
Voice agent (10,000 interactions/month, avg 200 chars each, 2M chars total): ~$20 at Founder rate, ~$100 at standard. Gemini: ~$24. OpenAI TTS-1: ~$30.
Enterprise scale (1M chars/month): $10 at Founder rate, $50 at standard. Gemini: ~$12. ElevenLabs Flash: ~$50. Use our TTS cost calculator to estimate your specific volume.
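The figures above are plain per-character multiplication, so they are easy to reproduce for your own volume. Here is a minimal sketch using the per-million rates quoted in this review; the dictionary keys and the `cost_usd` helper are our own names, and the rates should be checked against current provider pricing before you rely on them.

```python
# USD per 1M characters, as quoted in this review.
RATES_USD_PER_MILLION = {
    "Inworld Max (Founder)": 10.0,
    "Inworld Max (standard)": 50.0,
    "Gemini Flash": 12.0,
}


def cost_usd(characters: int, rate_per_million: float) -> float:
    """Cost of synthesizing `characters` at a per-million-character rate."""
    return characters / 1_000_000 * rate_per_million


if __name__ == "__main__":
    scenarios = {
        "Blog post (12K chars)": 12_000,
        "Voice agent (10,000 x 200 chars)": 10_000 * 200,
        "1M chars/month": 1_000_000,
    }
    for label, chars in scenarios.items():
        line = ", ".join(
            f"{name}: ${cost_usd(chars, rate):.2f}"
            for name, rate in RATES_USD_PER_MILLION.items()
        )
        print(f"{label}: {line}")
```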

Cost Per Minute Estimate

Average English speech is about 150 words per minute, or roughly 900 characters per minute. At the Founder rate ($10/1M), one minute of Inworld TTS Max costs approximately $0.009. At the standard rate ($50/1M), it's about $0.045/min. Even the standard rate is dramatically cheaper than a human voice actor ($200–$400/hour). For comparison, Gemini at ~$12/1M works out to about $0.011/min, and Amazon Polly Standard to about $0.004/min.
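The per-minute estimate can be reproduced with a small helper. This sketch assumes 150 words per minute and about 6 characters per word (letters plus a space), which gives the 900 characters per minute used above. Both defaults are rough averages, and the function names are ours.

```python
def cost_per_minute(
    rate_per_million: float,
    words_per_minute: int = 150,
    chars_per_word: int = 6,  # assumption: ~5 letters plus one space
) -> float:
    """Approximate USD cost of one minute of synthesized speech."""
    chars_per_minute = words_per_minute * chars_per_word
    return chars_per_minute * rate_per_million / 1_000_000


def cost_per_hour(rate_per_million: float) -> float:
    """Approximate USD cost of one hour of synthesized speech."""
    return 60 * cost_per_minute(rate_per_million)


if __name__ == "__main__":
    for label, rate in (("Founder", 10.0), ("Standard", 50.0)):
        print(f"{label}: ${cost_per_minute(rate):.3f}/min, ${cost_per_hour(rate):.2f}/hour")
```

On these assumptions an hour of audio costs about $0.54 at the Founder rate and $2.70 at the standard rate, which is the number to set against the $200–$400 hourly voice actor figure.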

API and Integration

Inworld provides both REST and WebSocket APIs. The REST API handles standard text-to-speech requests — you send text, receive an audio file. The WebSocket API enables streaming, where your application starts playing audio before the full response is generated. Streaming is essential for voice agents and chat interfaces.
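To illustrate why streaming matters, here is a transport-agnostic sketch that measures time to first chunk against total generation time. It does not call Inworld's API: `fake_audio_stream` is a hypothetical stand-in that simulates chunks arriving over time, and in a real integration you would replace it with whatever iterator your WebSocket client exposes.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_audio_stream(num_chunks: int = 5, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (not a real API)."""
    for i in range(num_chunks):
        time.sleep(chunk_delay_s)  # simulated generation time for this chunk
        yield bytes([i]) * 320  # placeholder bytes, not real audio


def consume(stream: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (time_to_first_chunk_s, total_time_s, total_bytes)."""
    start = time.perf_counter()
    first_chunk_s = None
    total_bytes = 0
    for chunk in stream:
        if first_chunk_s is None:
            # With streaming, playback could begin at this point.
            first_chunk_s = time.perf_counter() - start
        total_bytes += len(chunk)
    total_s = time.perf_counter() - start
    if first_chunk_s is None:  # empty stream
        first_chunk_s = total_s
    return first_chunk_s, total_s, total_bytes


if __name__ == "__main__":
    first, total, size = consume(fake_audio_stream())
    print(f"first chunk after {first * 1000:.0f} ms, all {size} bytes after {total * 1000:.0f} ms")
```

A non-streaming REST call makes the listener wait for the full duration. With streaming, the wait is only the time to the first chunk, which is the figure that TTFA benchmarks report.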

The API accepts plain text input and returns audio in multiple formats including MP3, PCM, and Opus. Authentication is via API key. The TTS API comparison page has implementation details across all major providers if you're evaluating multiple options.

One thing worth noting: Inworld's documentation is decent but not as polished as OpenAI's or Google's. You'll find everything you need, but expect to spend a bit more time reading through it. The community is also smaller, so if you hit an edge case, you'll more likely need to contact support directly than find an answer on Stack Overflow.

What's Missing: Honest Limitations

No Consumer Product

This is the single biggest barrier. ElevenLabs has a web app, browser extension, and mobile apps. Murf has its studio editor. Speechify is built entirely around a consumer experience. Inworld has... an API. If you're not a developer or don't have developers on your team, Inworld is not for you. There's no way to generate audio without writing code or using an API client.

No Free Tier

ElevenLabs gives you 10,000 free credits per month. Amazon Polly offers 5 million free characters for 12 months. Chatterbox is entirely free. Gemini has a generous free tier via Google AI Studio. Inworld? You pay from character one. They offer trial credits for evaluation, but nothing ongoing. This makes it harder to test and onboard compared to competitors.

Relatively New Company

ElevenLabs has millions of users and a $3B+ valuation. Google and Amazon aren't going anywhere. Inworld is well-funded ($125.7M raised) but still building its reputation outside the gaming industry. If you're choosing a TTS provider for a multi-year enterprise deployment, longevity risk is worth considering. The open-source training code helps: even if Inworld disappeared, the underlying approach is documented.

Only 15 Languages

Inworld supports 15 languages: English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew. That's notably fewer than ElevenLabs (70+), Gemini (70+), or even Amazon Polly (33). If you need Thai, Swahili, Vietnamese, or other less common languages, Inworld can't help.

How Inworld Compares to 7 TTS Alternatives

Here's Inworld TTS-1.5 Max against every major provider we track. For a broader view, visit our best text-to-speech comparison.

Service	Cost/1K Chars	Arena Rank	Cloning	Best For
Inworld TTS-1.5 Max	$0.010–$0.050	#1	Yes	Best quality, voice agents
Gemini Flash	~$0.012	#2	No	Price-quality balance, expressiveness
ElevenLabs	$0.050–$0.180	#4	Yes (best)	Premium quality, cloning, consumer
OpenAI TTS-1-HD	$0.030	Below 10	No	Simple API, developer ecosystem
Amazon Polly Neural	$0.016	Not ranked	No	AWS integration, high volume
Murf Falcon	$0.010	Not ranked	Yes (Business+)	Studio editor, non-technical teams
Grok TTS	$0.004	Not ranked	No	Budget developers, xAI ecosystem
Chatterbox	Free	Not ranked	Yes	Free, open-source, self-hosted

The key insight: at Founder rates ($10/1M Max), Inworld is cheaper than Gemini Flash and outranks every option here on arena quality. At standard rates ($50/1M), it matches the low end of ElevenLabs' range and costs more than everything else on the list. The Founder pricing expires May 7, 2026, so if you're serious about Inworld, lock it in now. After that, Gemini Flash at ~$12/1M becomes the better value for most use cases unless you specifically need voice cloning or the absolute best quality ranking.

Who Should Use Inworld TTS

Best for:

Voice agent developers who need top-tier quality
Game studios building NPC dialogue systems
Real-time applications where quality + reasonable latency matter
Teams who want open-source training code transparency
Projects that need voice cloning + #1 ranked quality in one API

Not for:

Non-developers (no consumer product, no studio interface)
Budget-constrained projects (Gemini Flash is 60% cheaper for similar quality)
Teams needing extensive voice controls (Gemini's 200+ audio tags offer more fine-tuning)
Enterprise deployments where vendor longevity is a hard requirement
Anyone who needs a free tier to evaluate before committing

My honest take: Inworld TTS-1.5 Max deserves the #1 ranking. The voice quality is the best I've tested. But quality alone doesn't make it the right choice for everyone. For 80% of TTS use cases, Gemini Flash at $0.012/1K chars or ElevenLabs with its consumer tools will be the more practical option. Inworld is the right choice when quality is the deciding factor and you have the technical resources to integrate an API-only product.

By TextToLab Team. Speech Arena rankings from Artificial Analysis TTS leaderboard as of April 2026. Pricing verified against Inworld AI developer documentation. Latency figures are Inworld's published P90 benchmarks. Cost-per-character calculations assume average English speaking rates.