Azure Text-to-Speech Pricing: The Quick Answer
Azure Text-to-Speech costs $16 per million characters for prebuilt Neural voices and $22/1M for the newer Neural HD voices (dropped from $30 in March 2026). The free tier gives you 500K characters per month. Commitment tiers push the effective rate as low as $7.50/1M — a 53% discount off pay-as-you-go.
If you're coming from the Microsoft ecosystem — Teams, Dynamics 365, Power Apps — Azure TTS is the natural pick. It integrates tightly with the rest of Azure AI services and offers the deepest customization options of any major cloud provider. The trade-off is that the base pricing isn't the cheapest. At $16/1M for standard Neural voices, you're paying the same rate as Amazon Polly Neural and slightly more than OpenAI TTS at $15/1M. But Azure wins on volume — those commitment tiers can cut your costs in half.
Quick note: Microsoft recently rebranded the service to “Azure Speech in Foundry Tools” under the Azure AI Foundry umbrella. The API endpoints and pricing haven't changed — it's purely a naming/portal reorganization. I'll cover the rebrand details below.
Azure TTS at a Glance
Voice Tier Pricing Breakdown
The pricing gap between tiers is real but smaller than it looks. Prebuilt Neural at $16/1M handles 90% of production use cases. Neural HD at $22/1M only justifies its 37.5% premium on long-form content where prosody consistency matters — audiobooks, podcasts, full-article narration. Custom Neural ($24-$48/1M) adds training and hosting costs that push the true per-character rate much higher than the sticker price.
| Voice Tier | Price/1M Chars | Free Tier | Quality | Best For |
|---|---|---|---|---|
| Prebuilt Neural | $16.00 | 500K chars/mo | Natural, smooth | General apps, content, IVR |
| Neural HD | $22.00 | 500K chars/mo | High fidelity, expressive | Premium content, media production |
| Custom Neural Standard | $24.00 | None | Brand-matched voice | Custom brand voice at scale |
| Custom Neural HD | $48.00 | None | Highest fidelity custom | Premium brand experiences |
The Prebuilt Neural tier is where most teams should start. At $16/1M characters, you get access to all 500+ voices across 140+ languages with no setup required. Neural HD adds noticeable quality improvements — more natural prosody, better handling of complex sentences — but at $22/1M, it's 37.5% more expensive. Whether that premium is worth it depends on your use case.
Understanding Azure's Voice Tiers
Prebuilt Neural Voices — $16/1M Characters
This is Azure's bread and butter. You get 500+ voices across 140+ languages and locales, all powered by neural TTS models. The voice quality is genuinely good — comparable to Amazon Polly's Neural engine at the same $16/1M price point. SSML support is excellent, with fine-grained prosody control for pitch, rate, volume, and emphasis. I've found Azure's SSML implementation to be the most comprehensive among the major cloud providers — you can adjust speaking styles (cheerful, sad, whispering) on supported voices, which is a feature neither AWS nor Google offer at this tier.
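To make the speaking-style feature concrete, here is a minimal sketch that builds an SSML document requesting a style through the `mstts:express-as` element. The helper name `build_styled_ssml` is hypothetical, and the default voice (`en-US-JennyNeural`) and style (`cheerful`) are assumptions for illustration; which voices support which styles should be checked against Azure's voice list.

```python
from xml.sax.saxutils import escape, quoteattr


def build_styled_ssml(
    text: str,
    voice: str = "en-US-JennyNeural",  # assumed voice, used only as an example
    style: str = "cheerful",           # assumed style; support varies by voice
    lang: str = "en-US",
) -> str:
    """Wrap plain text in SSML that asks for a speaking style."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, <, > so the document stays well-formed
        "</mstts:express-as></voice></speak>"
    )


if __name__ == "__main__":
    print(build_styled_ssml("Your order has shipped & is on its way!"))
```

The function only produces the markup; sending it to the service is left out because it depends on your SDK or REST setup.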
The 500+ voice count is legitimate, though many are regional variants of the same base voice. In practice, you'll have 15-20 genuinely distinct English voices to choose from, with strong coverage in Chinese, Japanese, Spanish, German, and French. For most applications — e-learning, customer support, content narration — Prebuilt Neural does the job well.
Neural HD Voices — $22/1M Characters
Neural HD is Azure's premium prebuilt tier. Microsoft dropped the price from $30 to $22/1M in March 2026, which makes it much more competitive. At the old $30 rate, it was hard to justify over standard Neural. At $22, the math changes — you're paying a 37.5% premium for noticeably better output on long-form content.
In my testing, Neural HD handles paragraph-length text better than standard Neural. The prosody is more consistent across long passages, and there's less of the “robotic drift” you sometimes hear when standard Neural processes a full article. That said, Neural HD is still expanding to more regions and doesn't cover all 140+ languages yet. Check Azure's region availability page before committing to HD for production workloads.
Custom Neural Voice — $24-$48/1M + Training Costs
Custom Neural Voice (CNV) lets you build a voice that sounds like your brand. This is where Azure's pricing gets complex. The per-character cost is $24/1M for standard quality and $48/1M for HD quality — but that's only part of the story.
You also pay for voice training at $52 per compute hour, capped at $936 per training session. A typical training run takes 10-18 compute hours, so expect to spend $520-$936 upfront to create a single custom voice. Then there's endpoint hosting at $4.04 per model per hour. That's $4.04/hr whether you're using the endpoint or not — roughly $2,909/month for a single always-on endpoint.
Let me do the real math on Custom Neural Voice. If you train one custom voice ($936 max), host it 24/7 for a month ($2,909), and synthesize 10M characters at the standard rate ($240), your total first-month cost is roughly $4,085. That's a steep entry point. Custom Neural Voice really only makes sense for enterprises with consistent high-volume needs and a strong brand-voice requirement.
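The arithmetic above can be reproduced with a short sketch. The rates are the ones quoted in this article; the function name `cnv_first_month_cost` and the 30-day month (720 hosted hours) are assumptions made for the example.

```python
TRAINING_RATE_PER_HOUR = 52.00   # USD per compute hour (from the article)
TRAINING_CAP = 936.00            # USD cap per training session
HOSTING_RATE_PER_HOUR = 4.04     # USD per hosted model per hour
RATE_PER_MILLION = {"standard": 24.00, "hd": 48.00}  # USD per 1M characters


def cnv_first_month_cost(
    training_hours: float,
    million_chars: float,
    quality: str = "standard",
    hosted_hours: float = 24 * 30,  # assumes a 30-day month, always-on endpoint
) -> dict:
    """Break down the first-month cost of one custom voice."""
    training = min(training_hours * TRAINING_RATE_PER_HOUR, TRAINING_CAP)
    hosting = hosted_hours * HOSTING_RATE_PER_HOUR
    synthesis = million_chars * RATE_PER_MILLION[quality]
    return {
        "training": round(training, 2),
        "hosting": round(hosting, 2),
        "synthesis": round(synthesis, 2),
        "total": round(training + hosting + synthesis, 2),
    }


if __name__ == "__main__":
    # 18 compute hours hits the training cap; 10M characters at the standard rate.
    print(cnv_first_month_cost(training_hours=18, million_chars=10))
```

Changing `hosted_hours` shows how much of the bill is hosting: an endpoint that is only deployed during business hours costs a fraction of the always-on figure.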
The Audio Content Creation tool is included with Custom Neural Voice at no extra cost. It provides a web-based editor for fine-tuning pronunciation and prosody, which saves time compared to hand-editing SSML.
Personal Voice — Privacy-Focused Voice Cloning
Personal Voice is Azure's accessibility-focused voice cloning feature. Pricing varies, and access requires an application. It's designed for scenarios such as people at risk of losing their voice due to medical conditions: you record a short voice sample and Azure creates a synthetic replica. Unlike Custom Neural Voice, Personal Voice is gated behind consent verification and intended for individual rather than commercial use. If you're looking for commercial voice cloning, check out ElevenLabs or Fish Audio, which offer more straightforward cloning APIs.
Commitment Tiers: Enterprise Volume Discounts
Azure's commitment tiers are where the pricing gets genuinely competitive. If you can commit to a monthly minimum, the per-character rate drops sharply. These tiers are the most overlooked part of Azure TTS pricing: at the highest tier, the effective rate of $7.50/1M is less than half the $16/1M that both Azure and Amazon Polly charge for neural voices at pay-as-you-go.
| Tier | Monthly Commitment | Characters Included | Effective Rate/1M | Savings vs PAYG |
|---|---|---|---|---|
| Pay-As-You-Go | None | Unlimited | $16.00 | — |
| Commitment 1 | $960/mo | 80M characters | $12.00 | 25% |
| Commitment 2 | $3,840/mo | 400M characters | $9.60 | 40% |
| Commitment 3 | $15,000/mo | 2,000M characters | $7.50 | 53% |
Critical detail: these are monthly commitments, not usage-based discounts. You pay $960, $3,840, or $15,000 per month whether you use the full allotment or not. There's no rollover. If you commit to the 80M tier and only use 40M characters, you still pay $960. Plan carefully.
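Because the fee is flat, the rate you actually pay per million characters rises as usage falls below the allotment. The sketch below illustrates this with the tier figures from the table; the function name `effective_rate` is hypothetical, and usage beyond the allotment is rejected because overage pricing is not covered in this article.

```python
# (monthly fee in USD, included characters in millions), from the table above
COMMITMENT_TIERS = {
    "Commitment 1": (960.00, 80),
    "Commitment 2": (3840.00, 400),
    "Commitment 3": (15000.00, 2000),
}


def effective_rate(tier: str, used_million_chars: float) -> float:
    """Return the USD cost per 1M characters actually used on a flat-fee tier."""
    fee, included = COMMITMENT_TIERS[tier]
    if used_million_chars <= 0:
        raise ValueError("usage must be positive")
    if used_million_chars > included:
        # Overage rates are not given in the article, so they are not modeled.
        raise ValueError("usage exceeds the tier allotment")
    return fee / used_million_chars


if __name__ == "__main__":
    for used in (80, 60, 40):
        rate = effective_rate("Commitment 1", used)
        print(f"{used}M chars on Commitment 1: ${rate:.2f}/1M")
```

At 40M characters the Commitment 1 rate works out to $24/1M, which is worse than pay-as-you-go, so unused allotment quickly erases the discount.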
That said, the economics are compelling at scale. The Commitment 3 tier at $7.50/1M is cheaper than OpenAI TTS ($15/1M), ElevenLabs (starts at $60/1M on API), and even Amazon Polly Neural ($16/1M). Only Google Cloud's standard TTS at $4/1M and Polly Standard at $4/1M are cheaper — and those use older, lower-quality engines. Use our TTS cost calculator to model the exact break-even point for your volume.
Free Tier Deep-Dive: What 500K Characters Gets You
Azure's free tier (F0) gives you 500,000 characters per month for neural TTS at no cost. That's roughly 10.4 hours of synthesized audio per month (assuming an average of 800 characters per minute of speech). The free tier never expires — as long as your Azure account is active, you get 500K characters every month.
How does Azure's free tier compare to the competition?
| Service | Free Allowance | Duration | Quality Tier |
|---|---|---|---|
| Azure TTS | 500K chars/mo | Never expires | Neural |
| Amazon Polly | 5M chars/mo | 12 months only | Standard (1M Neural) |
| Google Cloud TTS | 4M chars/mo | Never expires | Standard only |
| ElevenLabs | 10K chars/mo | Never expires | Full quality |
Azure's free tier is the smallest of the three major cloud providers, but it comes with two advantages: it never expires, and it includes neural-quality voices (not the lower-quality standard engine that Google's free tier is limited to). Amazon Polly's free tier is far more generous at 5M characters, but it disappears after 12 months. For personal projects, prototyping, or low-volume production use, Azure's 500K/month covers roughly 10 hours of narration each month, indefinitely.
Real-World Cost Examples
Abstract pricing is useless without context. Here's what Azure TTS actually costs for common use cases, comparing pay-as-you-go with the best available commitment tier.
| Use Case | Monthly Volume | PAYG Cost | Commitment Cost |
|---|---|---|---|
| Blog narrator | 500K chars | $0 (free tier) | N/A |
| E-learning platform | 10M chars | $160 | $960 (Tier 1; PAYG is cheaper) |
| Customer service IVR | 100M chars | $1,600 | $960 + overage (Tier 1, 80M included) |
| Enterprise multi-product | 500M chars | $8,000 | $3,840 + overage (Tier 2, 400M included) |
The e-learning example is instructive. At 10M characters per month, you're far below the 80M characters that Tier 1 includes, so the commitment doesn't make financial sense yet: you'd pay $960 for an 80M allotment when the 10M you need costs $160 on PAYG. Tier 1 only pays off once you're consistently using 60M+ characters per month, the point where the PAYG bill (60M × $16/1M = $960) matches the flat fee. Below that, PAYG is cheaper.
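The break-even reasoning generalizes to every tier: divide the monthly fee by the PAYG rate. Here is a rough sketch using the figures from the commitment table. The function names are hypothetical, and a tier is only considered when its allotment covers the whole volume, because overage pricing is not given in this article.

```python
PAYG_RATE = 16.00  # USD per 1M characters, Prebuilt Neural

# (name, monthly fee in USD, included characters in millions)
TIERS = [
    ("Commitment 1", 960.00, 80),
    ("Commitment 2", 3840.00, 400),
    ("Commitment 3", 15000.00, 2000),
]


def break_even_million_chars(monthly_fee: float, payg_rate: float = PAYG_RATE) -> float:
    """Monthly volume (in millions of characters) where PAYG equals the flat fee."""
    return monthly_fee / payg_rate


def cheapest_plan(volume_millions: float) -> tuple:
    """Pick the cheapest of PAYG and every tier whose allotment covers the volume."""
    options = [("Pay-As-You-Go", volume_millions * PAYG_RATE)]
    for name, fee, included in TIERS:
        if volume_millions <= included:  # overage is not modeled
            options.append((name, fee))
    return min(options, key=lambda option: option[1])


if __name__ == "__main__":
    for name, fee, _ in TIERS:
        print(f"{name}: break-even at {break_even_million_chars(fee):g}M chars/month")
    print(cheapest_plan(10))
    print(cheapest_plan(70))
```

Under these assumptions the break-even points are 60M, 240M, and 937.5M characters per month for the three tiers.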
For the enterprise example, the savings are large: up to $4,160/month versus PAYG, or nearly $50,000/year. Keep in mind that 500M characters exceeds Tier 2's 400M allotment, so any overage charges on the extra 100M would reduce that figure. If you're processing 500M+ characters monthly, Azure's commitment tiers make it one of the most cost-effective neural TTS options available. Use our TTS cost calculator to model your specific volume.
Azure vs Amazon Polly vs Google Cloud vs OpenAI (Full Comparison)
Here's how Azure TTS stacks up against every major TTS service. I've included both the standard rates and the best available volume pricing where applicable. Check our full TTS pricing comparison for even more detail.
| Service | Price/1M | Free Tier | Voices | Languages | Best For |
|---|---|---|---|---|---|
| Azure Neural | $16.00 | 500K/mo | 500+ | 140+ | Microsoft ecosystem, enterprise |
| Azure Neural HD | $22.00 | 500K/mo | Expanding | Limited | Premium content |
| Azure Commitment | $7.50 | N/A | 500+ | 140+ | High-volume enterprise |
| Amazon Polly Standard | $4.00 | 5M/mo (12mo) | 30+ | 30+ | Cheapest bulk TTS |
| Amazon Polly Neural | $16.00 | 1M/mo (12mo) | 60+ | 30+ | AWS-native apps |
| Google Cloud WaveNet | $4.00 | 4M/mo (std) | 40+ | 50+ | Cheap neural quality |
| Google Cloud Chirp 3 HD | $30.00 | None | 10+ | 30+ | Highest GCP quality |
| OpenAI TTS | $15.00 | None | 6 | 57+ | Simple API, good quality |
| ElevenLabs | ~$60+ | 10K/mo | 1000+ | 32+ | Best overall quality, cloning |
| Deepgram Aura-2 | $30.00 | $200 credit | 40+ | 7 | Voice agents (STT + TTS) |
| Cartesia Sonic | ~$30 | Free tier | Custom | 15+ | Lowest latency voice agents |
| Fish Audio | $15.00 | 10K/day | 200K+ | 13+ | Budget voice cloning |
The table tells the rate-card story; here's the strategic one. At PAYG, Azure is unremarkable: $16/1M sits in the fat middle of the market, and you're better off with OpenAI ($15/1M, zero provisioning overhead) or Google WaveNet ($4/1M if quality tradeoffs are acceptable). Azure's actual pricing moat is commitment-tier volume. At $7.50/1M on the 2,000M-character tier, the only neural option in the table that undercuts it is Google WaveNet at $4/1M, which offers a much smaller catalog (40+ voices and 50+ languages versus Azure's 500+ and 140+). That makes Azure the clear pick for organizations already processing 100M+ characters per month, and a mediocre pick for everyone else.
The “Foundry Tools” Rebrand: What Actually Changed
In late 2025, Microsoft reorganized its AI services under the Azure AI Foundry umbrella. Azure Cognitive Services Speech became “Azure Speech in Foundry Tools.” If you're searching for Azure TTS pricing and landing on pages referencing “Foundry Tools” or “Azure AI Speech,” it's the same service. Here's what changed and what didn't.
What Changed
- Branding: “Azure Cognitive Services Speech” is now “Azure Speech in Foundry Tools” in documentation and the Azure portal
- Portal location: The service now lives under the Azure AI Foundry section in the Azure portal, not the standalone Cognitive Services menu
- Documentation URLs: Some docs moved to new URLs under the AI Foundry docs structure — old URLs redirect
- Integration: Tighter bundling with other Azure AI services (OpenAI, Vision, Language) in the Foundry platform
What Didn't Change
- API endpoints: Same REST API and SDK endpoints — no code changes needed
- Pricing: Identical pricing structure and rates
- Voice catalog: Same 500+ voices, same quality
- SSML support: Unchanged
- Existing resources: Your existing Speech resources continue to work without migration
In practice, the rebrand is mostly confusing for people searching for pricing information. You might see “Azure Speech,” “Azure AI Speech,” “Cognitive Services Speech,” or “Azure Speech in Foundry Tools” — they all refer to the same service with the same pricing. If you're an existing user, nothing breaks. If you're new, just search for “Speech” in the Azure portal and you'll find it.
Hidden Costs and Gotchas
Azure's pricing page tells you the per-character rates, but there are several costs and billing behaviors that aren't immediately obvious. I've hit most of these in production.
Custom Voice Training Is Expensive Up Front
The $52/compute-hour training cost is capped at $936 per session, but that's per voice. If you need five custom voices, you're looking at up to $4,680 in training costs alone. And the $4.04/model/hour hosting fee is ongoing — you pay whether the endpoint receives requests or not. A single custom voice endpoint running 24/7 costs ~$2,909/month. Two endpoints? $5,818/month. This is by far the most overlooked cost in Azure TTS.
SSML Characters Count Toward Billing
When you use SSML markup to control prosody, pauses, and pronunciation, the SSML tags themselves count toward your character total. A 1,000-character plain text passage might become 1,500+ characters with SSML tags. Providers differ on this (Amazon Polly, for example, does not bill for SSML tags), so factor it into your cost estimates and cross-provider comparisons, especially if you use heavy SSML markup. Stripping unnecessary whitespace and using shorthand SSML attributes can save 10-20% on character counts.
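To estimate how much markup inflates a request, a small sketch like the one below can compare the full SSML length with the spoken text alone. It assumes every character of markup is billed, as described above, so treat the result as an upper bound and check the provider's billing rules for exemptions. The function name `ssml_overhead` is hypothetical.

```python
import re

# Matches any XML tag; good enough for a rough estimate on well-formed SSML.
TAG_PATTERN = re.compile(r"<[^>]+>")


def ssml_overhead(ssml: str) -> dict:
    """Compare total SSML length with the length of the spoken text alone."""
    spoken = TAG_PATTERN.sub("", ssml)
    total = len(ssml)
    markup = total - len(spoken)
    return {
        "total_chars": total,
        "spoken_chars": len(spoken),
        "markup_chars": markup,
        "markup_share": markup / total if total else 0.0,
    }


if __name__ == "__main__":
    sample = '<prosody rate="slow" pitch="low">Please hold.</prosody><break time="500ms"/>'
    stats = ssml_overhead(sample)
    print(stats)
    print(f"Markup is {stats['markup_share']:.0%} of the request")
```

Running this over a representative batch of real requests gives a more reliable multiplier for cost estimates than a single rule of thumb.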
Real-Time vs Batch: Same Price, Different Limits
Azure charges the same per-character rate for both real-time and batch synthesis, but the operational constraints differ sharply. Real-time synthesis caps at 200 concurrent WebSocket connections per region on the S0 tier (you can request increases via Azure support). Each request is limited to ~10 minutes of audio. Batch synthesis removes those caps — you submit up to 2 GB of SSML per job and poll for results — but adds minutes of queue latency. The practical rule: if your workload exceeds 50,000 characters per minute or needs to pre-render hours of audio overnight, batch is the better architecture. Same bill either way.
Region Availability for Neural HD
Neural HD voices aren't available in all Azure regions yet. If your application is deployed in a region without HD support, you'll either need to route requests to a supported region (adding latency) or fall back to standard Neural voices. Check Azure's region availability table before building your architecture around HD voices.
No Explicit Caching Policy
Azure doesn't publish a clear policy on caching synthesized audio. If you synthesize the same text twice, you pay twice. For applications that re-generate the same content (like IVR prompts or frequently read articles), implement your own audio caching layer to avoid redundant charges. Check the Azure Terms of Service for any restrictions on storing and replaying synthesized audio.
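A caching layer of the kind described above can be quite small. The following is a minimal sketch, assuming a file-based cache keyed on a hash of the voice and SSML; `cached_synthesize` and the `synthesize` callable are hypothetical names, and the callable stands in for whatever SDK or REST call you actually use.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_synthesize(
    ssml: str,
    voice: str,
    synthesize: Callable[[str, str], bytes],  # hypothetical: your real TTS call
    cache_dir: Path,
) -> bytes:
    """Return cached audio if this (voice, ssml) pair was synthesized before."""
    key = hashlib.sha256(f"{voice}\n{ssml}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.audio"
    if path.exists():
        return path.read_bytes()  # cache hit: no billable request is made
    audio = synthesize(ssml, voice)  # cache miss: this is the only billed call
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio


if __name__ == "__main__":
    import tempfile

    def fake_tts(ssml: str, voice: str) -> bytes:
        print("synthesizing...")
        return b"fake-audio-bytes"

    with tempfile.TemporaryDirectory() as tmp:
        for _ in range(2):  # "synthesizing..." is printed only once
            cached_synthesize("<speak>Welcome</speak>", "example-voice", fake_tts, Path(tmp))
```

Including the voice in the key matters: the same text rendered with a different voice or output format is different audio and needs its own cache entry.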
Bandwidth and Egress: The Good News
Unlike Google Cloud, Azure doesn't charge for data egress on Speech service responses. The audio data returned from synthesis requests is included in the per-character price. This is a meaningful advantage if you're generating large volumes of audio — Google's egress fees can add up on high-volume workloads.
When to Choose Azure TTS (And When to Skip It)
Choose Azure TTS If:
- You're already in the Microsoft ecosystem (Teams, Dynamics 365, Power Platform, Azure Functions)
- You need a custom brand voice and can justify the training + hosting costs
- You process 80M+ characters/month and want commitment-tier discounts ($7.50-$12/1M)
- You need 140+ language support with consistent neural quality
- SSML-level prosody control is important (Azure has the best SSML support among cloud providers)
- Enterprise compliance requirements favor Azure (HIPAA, SOC 2, FedRAMP available)
Skip Azure TTS If:
- You want the lowest per-character rate: go with Google Cloud WaveNet at $4/1M, or Amazon Polly Standard at $4/1M if a non-neural engine is acceptable
- Voice quality is your top priority — check ElevenLabs for the best-sounding output
- You're building real-time voice agents and need sub-100ms latency — look at Cartesia or read our best TTS for voice agents guide
- You want the simplest API integration — OpenAI's TTS API is the easiest to use (one endpoint, no resource provisioning)
- You need instant voice cloning without an enterprise contract — ElevenLabs or Fish Audio are better options
My honest take: Azure TTS is an excellent choice for enterprises that are already invested in the Microsoft stack. The commitment tiers make it genuinely cheap at scale, the language coverage is unmatched, and the SSML support is the best in the industry. But if you're a startup or indie developer just looking for good TTS at a fair price, the onboarding friction of Azure (resource groups, subscription setup, regional endpoints) isn't worth it when OpenAI TTS gives you nearly the same quality at $15/1M with a single API key. Read our best TTS API guide for a full breakdown of which service fits which use case.
Frequently Asked Questions
How much does Azure Text-to-Speech cost per month?
It depends entirely on your usage volume. At $16/1M characters for Prebuilt Neural voices, 1 million characters of synthesis costs $16. The free tier covers 500K characters/month at no cost. For heavy users, commitment tiers drop the effective rate to $7.50-$12/1M. A typical e-learning platform processing 10M characters per month would pay about $160 on PAYG. Use our TTS cost calculator to get an exact estimate.
Is Azure Text-to-Speech free?
Partially. The F0 (free) tier gives you 500,000 characters per month of neural TTS at no cost, and it never expires. That's enough for roughly 10 hours of audio per month. You do need an Azure account to access it, but you can use the free tier without entering payment information on the free Azure subscription.
What's the difference between Azure Neural and Neural HD?
Neural HD produces higher-fidelity audio with better prosody on longer passages. Standard Neural costs $16/1M characters while Neural HD costs $22/1M (reduced from $30 in March 2026). The quality difference is most noticeable on paragraphs and longer texts — for short utterances like IVR prompts, standard Neural is fine. Neural HD is still expanding to more Azure regions.
How does Azure TTS compare to Amazon Polly?
Azure Neural and Polly Neural both cost $16/1M characters. Polly has a much larger free tier (5M chars vs 500K) but it expires after 12 months. Azure has more voices (500+ vs 60+), more languages (140+ vs 30+), and better SSML support. Polly also offers a Standard engine at $4/1M that Azure doesn't match — it's lower quality but great for IVR/notification use cases. Read our full Amazon Polly pricing breakdown for the complete comparison.
What is Azure Speech in Foundry Tools?
It's the new name for what was previously Azure Cognitive Services Speech. Microsoft rebranded it as part of the Azure AI Foundry platform reorganization. The pricing, API endpoints, voice catalog, and functionality are all unchanged. It's purely a naming/portal change. Your existing code and integrations continue to work without modification.
Can I reduce my Azure TTS costs?
Yes, several ways. Commitment tiers save 25-53% off PAYG rates for predictable high-volume use. Cache synthesized audio to avoid re-synthesizing the same content. Minimize SSML markup since tags count toward character billing. Use batch synthesis for non-real-time workloads to maximize throughput. And start with the free tier — 500K characters/month covers many small-scale applications entirely.
How many characters is one hour of speech?
Roughly 48,000 characters per hour of synthesized speech (based on an average speaking rate of ~150 words per minute and ~5.3 characters per word). This means 1 million characters produces about 20.8 hours of audio. At $16/1M, that works out to about $0.77 per hour of audio — which is significantly cheaper than hiring a voice actor.
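The conversion is easy to rerun with your own speaking-rate assumptions. This sketch uses the 150 words per minute and 5.3 characters per word quoted above; the function names are hypothetical.

```python
WORDS_PER_MINUTE = 150   # average speaking rate assumed in the article
CHARS_PER_WORD = 5.3     # average characters per word, including the space


def chars_per_hour(wpm: float = WORDS_PER_MINUTE, cpw: float = CHARS_PER_WORD) -> float:
    """Characters needed for one hour of synthesized speech."""
    return wpm * cpw * 60


def audio_hours(chars: float) -> float:
    """Approximate hours of audio produced by a given character count."""
    return chars / chars_per_hour()


def cost_per_audio_hour(rate_per_million: float) -> float:
    """USD per hour of audio at a given per-1M-character rate."""
    return rate_per_million * chars_per_hour() / 1_000_000


if __name__ == "__main__":
    print(f"{chars_per_hour():,.0f} chars per hour")
    print(f"{audio_hours(1_000_000):.1f} hours per 1M chars")
    print(f"${cost_per_audio_hour(16.00):.2f} per audio hour at $16/1M")
```

The unrounded product is 47,700 characters per hour, which the answer above rounds to 48,000, so the script's figures differ slightly from the rounded ones in the text.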
Does Azure TTS support SSML?
Yes — and in my experience, Azure has the most comprehensive SSML implementation of any major TTS provider. You get full prosody control (pitch, rate, volume), emphasis markers, break/pause tags, phoneme overrides, and speaking style selection (cheerful, sad, whispering, newscast) on supported voices. The best TTS API guide covers SSML support across all providers.
By TextToLab Research Team · Last verified June 2026 against Azure's official pricing page (azure.microsoft.com/pricing/details/ai-speech/). Commitment tier rates confirmed via Azure pricing calculator. Competitor rates verified against each provider's pricing page.