Text to Speech API: Developer Guide

Complete guide to integrating text to speech APIs into your applications. Compare the top TTS APIs, explore code examples, and learn best practices for speech synthesis integration.

What is a Text to Speech API?

A text to speech API (TTS API) is a web service that converts written text into natural-sounding audio using artificial intelligence. Developers send HTTP requests with text input and receive audio files in formats like MP3, WAV, or streaming audio chunks in response.

Modern TTS APIs use neural networks and deep learning to produce remarkably human-like speech. Services like ElevenLabs, OpenAI TTS, and Amazon Polly offer real-time streaming, multiple voices, and support for dozens of languages.

Standard protocols: REST/gRPC
Typical latency: ~200ms
Price per 1M chars: $4-30
Languages supported: 140+

TTS API Comparison

Side-by-side comparison of the top text to speech APIs for developers.

API | Pricing | Rate Limit | Streaming | Voices | Latency | Best For
OpenAI TTS | $15-30/1M chars | 500 req/min | Yes | 9 | ~200ms | Simple integration, consistent quality
ElevenLabs | $5-99/mo subscription | Varies by plan | Yes | 1000+ | ~300ms | Voice cloning, premium quality
Amazon Polly | $4-100/1M chars | 80 TPS | Yes | 60+ | ~150ms | AWS integration, enterprise scale
Google Cloud TTS | $4-16/1M chars | 1000 chars/req | Yes (gRPC) | 400+ | ~200ms | Language variety, WaveNet quality
Azure Speech | $4-16/1M chars | 200 req/min | Yes | 400+ | ~200ms | Enterprise, SSML support

Code Examples

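Basic examples for each TTS API, using curl for the HTTP APIs and the AWS CLI for Amazon Polly. Replace the placeholders (the $..._API_KEY variables, {voice_id}, and {region}) with your own values.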

OpenAI TTS (curl)
curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, welcome to our application!",
    "voice": "alloy"
  }' \
  --output speech.mp3
ElevenLabs (curl)
curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, welcome to our application!",
    "model_id": "eleven_monolingual_v1"
  }' \
  --output speech.mp3
Amazon Polly (AWS CLI)
aws polly synthesize-speech \
  --text "Hello, welcome to our application!" \
  --output-format mp3 \
  --voice-id Joanna \
  --engine neural \
  speech.mp3
Google Cloud TTS (curl)
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://texttospeech.googleapis.com/v1/text:synthesize" \
  -d '{
    "input": {"text": "Hello, welcome to our application!"},
    "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-D"},
    "audioConfig": {"audioEncoding": "MP3"}
  }' | jq -r '.audioContent' | base64 --decode > speech.mp3
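
The Google Cloud example above pipes the response through jq and base64 because the audio arrives base64-encoded in the audioContent field of a JSON body. A minimal Python sketch of that decoding step follows; extract_audio is a hypothetical helper name, and the response body is assumed to have already been fetched by whatever HTTP client you use.

```python
import base64
import json


def extract_audio(response_body: str) -> bytes:
    """Return raw audio bytes from a text:synthesize JSON response body."""
    payload = json.loads(response_body)
    # The audio is a base64 string under "audioContent" (the same field
    # the jq filter in the curl example selects).
    return base64.b64decode(payload["audioContent"])
```

The returned bytes can be written to speech.mp3 in binary mode, which matches what the shell pipeline produces.
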
Azure Speech (curl)
curl -X POST \
  "https://{region}.tts.speech.microsoft.com/cognitiveservices/v1" \
  -H "Ocp-Apim-Subscription-Key: $AZURE_SPEECH_KEY" \
  -H "Content-Type: application/ssml+xml" \
  -H "X-Microsoft-OutputFormat: audio-16khz-128kbitrate-mono-mp3" \
  -d '<speak version="1.0" xml:lang="en-US">
        <voice name="en-US-JennyNeural">
          Hello, welcome to our application!
        </voice>
      </speak>' \
  --output speech.mp3
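
The Azure example sends SSML, which is XML, so text containing characters such as & or < has to be escaped before it is placed inside the voice element. The sketch below shows one way to build the same SSML envelope in Python using only the standard library; build_ssml is a hypothetical helper, and the default voice and language are simply the ones used in the curl example.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice: str = "en-US-JennyNeural", lang: str = "en-US") -> str:
    """Wrap plain text in the SSML envelope used in the Azure curl example."""
    # escape() replaces &, < and > so user-supplied text cannot break the XML;
    # quoteattr() quotes and escapes the attribute values.
    return (
        f'<speak version="1.0" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )
```

The resulting string can be sent as the request body with the application/ssml+xml content type shown above.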

TTS API Pricing Breakdown

Per-character and subscription pricing for each provider.

OpenAI TTS

tts-1: $15/1M chars
tts-1-hd: $30/1M chars
~$0.015-0.03 per 1K chars

ElevenLabs

Free: 10K chars/mo
Starter: $5/mo (30K chars)
Pro: $99/mo (500K chars)
~$0.17-0.20 per 1K chars

Amazon Polly

Standard: $4/1M chars
Neural: $16/1M chars
Generative: $30/1M chars
Free tier: 5M chars, first 12 months

Google Cloud TTS

Standard: $4/1M chars
WaveNet: $16/1M chars
Neural2: $16/1M chars
Free tier: 4M chars/mo

Azure Speech

Neural: $16/1M chars
Custom Neural: $24/1M chars
Personal Voice: $100/1M chars
Free tier: 500K chars/mo
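
To turn the per-character rates above into a budget figure, multiply the billable characters by the rate per million. The sketch below does that arithmetic; the rate table is copied from this section, while the provider and tier keys, the estimate_cost name, and the free_characters parameter are illustrative assumptions. How a free tier applies to each voice type is not modeled, so the caller supplies the allowance.

```python
# USD per 1M characters, copied from the pricing breakdown in this guide.
RATES_PER_MILLION = {
    ("openai", "tts-1"): 15.0,
    ("openai", "tts-1-hd"): 30.0,
    ("polly", "standard"): 4.0,
    ("polly", "neural"): 16.0,
    ("polly", "generative"): 30.0,
    ("google", "standard"): 4.0,
    ("google", "wavenet"): 16.0,
    ("google", "neural2"): 16.0,
    ("azure", "neural"): 16.0,
    ("azure", "custom-neural"): 24.0,
    ("azure", "personal-voice"): 100.0,
}


def estimate_cost(provider: str, tier: str, characters: int, free_characters: int = 0) -> float:
    """Estimate the USD cost of synthesizing `characters` characters."""
    # Characters covered by a free allowance are not billed.
    billable = max(0, characters - free_characters)
    return billable * RATES_PER_MILLION[(provider, tier)] / 1_000_000
```

For example, estimate_cost("openai", "tts-1-hd", 250_000) returns 7.5, which matches the $0.03 per 1K chars figure listed for tts-1-hd.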

Streaming vs Batch Processing

Streaming TTS

Audio chunks are returned as they are generated, enabling playback before the full response is ready.

Low perceived latency (~100-200ms to first byte)
Ideal for chatbots and voice assistants
Real-time narration and live applications
Supported by: OpenAI, ElevenLabs, Amazon Polly, Google (gRPC), Azure
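
The streaming pattern described above can be sketched without tying it to any one provider: iterate over the audio chunks as they arrive and hand each one to a player or file immediately. In this sketch, consume_stream and sink are hypothetical names, and chunks stands in for whatever chunk iterator your HTTP client or SDK exposes.

```python
import time
from typing import Callable, Iterable, Optional


def consume_stream(chunks: Iterable[bytes], sink: Callable[[bytes], None]) -> Optional[float]:
    """Forward chunks to `sink` as they arrive.

    Returns the seconds elapsed until the first non-empty chunk, or None
    if the stream produced no audio.
    """
    start = time.monotonic()
    first_chunk_delay = None
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks
        if first_chunk_delay is None:
            first_chunk_delay = time.monotonic() - start
        sink(chunk)  # playback or writing can begin before the stream ends
    return first_chunk_delay
```

The returned delay is a rough client-side measure of time to first byte, which is the figure that matters for perceived latency.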

Batch Processing

The complete audio file is generated before being returned. Best for pre-rendered content.

Potential for higher-quality output
Ideal for audiobooks and podcasts
Easier to cache and store
Best for: Audiobooks, podcasts, video voiceovers, e-learning

Best TTS API by Use Case

Recommended APIs based on specific requirements and use cases.

Best for Low Latency

1. OpenAI tts-1: ~200ms, optimized for real-time
2. Amazon Polly: ~150ms, excellent AWS integration
3. ElevenLabs Turbo: ~300ms with premium quality

Best for High Volume

1. Amazon Polly Standard: $4/1M chars, 80 TPS
2. Google Cloud TTS: $4/1M chars, scalable
3. OpenAI TTS: $15/1M chars, 500 req/min

Best for Voice Cloning

1. ElevenLabs: Industry-leading voice cloning
2. Azure Personal Voice: Enterprise voice cloning
3. Speechify: Voice cloning on Premium+

Best for Enterprise

1. Amazon Polly: AWS ecosystem, SLA, compliance
2. Azure Speech: Microsoft ecosystem, SSML
3. Google Cloud TTS: GCP integration, WaveNet

Quick Start Guide

1. Get API credentials

Sign up for an account with your chosen TTS provider (OpenAI, ElevenLabs, Amazon Polly, Google Cloud, or Azure). Navigate to the API section of your dashboard and generate an API key. Store this key securely - never expose it in client-side code.
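
One common way to keep the key out of source code is to read it from an environment variable on the server, as the curl examples in this guide do with $OPENAI_API_KEY. A minimal sketch, assuming that variable name; load_api_key is a hypothetical helper.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read an API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API")
    return key
```

Failing at startup when the variable is missing is usually easier to diagnose than an authentication error on the first request.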

2. Install SDK or prepare HTTP client

Install the official SDK for your programming language (e.g., openai for Python, @google-cloud/text-to-speech for Node.js) or use any HTTP client library to make REST API calls directly.

3. Make your first API request

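Send a POST request to the TTS endpoint with your text input, desired voice, and output format. Include your credentials in the header your provider expects: OpenAI and Google Cloud use an Authorization header, while ElevenLabs and Azure use custom headers (see Authentication Methods). The API will return audio data that you can save to a file or stream to users.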
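
As a rough Python counterpart to the OpenAI curl example earlier in this guide, the sketch below builds the same POST request with the standard library. The endpoint, JSON fields, and header come from that example; build_speech_request and synthesize are hypothetical helper names, and the 30-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_speech_request(api_key: str, text: str, voice: str = "alloy",
                         model: str = "tts-1") -> urllib.request.Request:
    """Build the POST request from the OpenAI curl example."""
    body = json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        "https://api.openai.com/v1/audio/speech",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(api_key: str, text: str) -> bytes:
    """Send the request and return the audio bytes (performs a network call)."""
    with urllib.request.urlopen(build_speech_request(api_key, text), timeout=30) as resp:
        return resp.read()
```

Keeping request construction separate from sending makes the payload easy to inspect or unit test without touching the network.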

4. Handle the audio response

Save the returned audio bytes to a file (MP3, WAV, etc.) or stream directly to your application. For streaming APIs, process audio chunks as they arrive to minimize latency. Implement error handling for rate limits and API errors.
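
A sketch of this step, assuming you already have the HTTP status code and response body from your client: save the bytes on success, and raise an error that records whether a retry makes sense. HTTP 429 is the standard "Too Many Requests" status; treating the listed 5xx codes as retryable is an assumption here, not something the providers guarantee. TTSError and save_audio are hypothetical names.

```python
from pathlib import Path

# 429 signals rate limiting; the 5xx codes are assumed to be transient.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


class TTSError(Exception):
    """Raised when the TTS API returns a non-success status."""

    def __init__(self, status: int, retryable: bool):
        super().__init__(f"TTS request failed with HTTP {status}")
        self.status = status
        self.retryable = retryable


def save_audio(status: int, body: bytes, path: str) -> int:
    """Write the audio to `path` on success; return the number of bytes written."""
    if status != 200:
        raise TTSError(status, retryable=status in RETRYABLE_STATUSES)
    return Path(path).write_bytes(body)
```

Callers can check the retryable flag to decide between retrying later and surfacing the error to the user.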

5. Optimize for production

Implement caching for repeated text, use streaming for real-time applications, handle rate limits with exponential backoff, and monitor usage to optimize costs. Consider using a queue for high-volume batch processing.
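
Two of these practices, caching repeated text and retrying rate-limited calls with exponential backoff, can be combined in a small wrapper. The sketch below is provider-agnostic: fetch is a hypothetical callable that performs the real API call and raises RateLimited when the provider rejects the request, and cache is any dict-like store. The base delay, retry count, and 10% jitter are arbitrary starting values.

```python
import hashlib
import random
import time


class RateLimited(Exception):
    """Raised by `fetch` when the provider reports a rate limit."""


def cache_key(text: str, voice: str, model: str) -> str:
    """Stable key so identical text/voice/model requests share one cache entry."""
    raw = "\x1f".join((model, voice, text)).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def synthesize_cached(text, voice, model, fetch, cache,
                      max_retries=4, base_delay=0.5, sleep=time.sleep):
    """Return cached audio if present; otherwise call `fetch` with backoff."""
    key = cache_key(text, voice, model)
    if key in cache:
        return cache[key]
    for attempt in range(max_retries + 1):
        try:
            audio = fetch(text, voice, model)
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries, let the caller decide
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) plus up to 10% jitter.
            delay = base_delay * 2 ** attempt
            sleep(delay + random.uniform(0, delay * 0.1))
        else:
            cache[key] = audio
            return audio
```

An in-memory dict works for a single process; a shared store would be needed if several workers should reuse the same audio.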

Authentication Methods

API | Auth Method | Header | Notes
OpenAI | Bearer Token | Authorization: Bearer sk-... | API key from OpenAI dashboard
ElevenLabs | API Key Header | xi-api-key: ... | Custom header for API key
Amazon Polly | AWS Sig v4 | AWS SDK handles auth | IAM credentials required
Google Cloud | OAuth 2.0 / API Key | Authorization: Bearer ... | Service account or API key
Azure Speech | Subscription Key | Ocp-Apim-Subscription-Key: ... | Azure portal subscription key
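
The header differences in the table can be captured in one small function. This is a sketch only: auth_headers and the lowercase provider names are hypothetical, the Google branch assumes an OAuth access token rather than an API key, and Amazon Polly is deliberately excluded because its Signature Version 4 signing is handled by the AWS SDK rather than a single static header.

```python
def auth_headers(provider: str, credential: str) -> dict:
    """Return the auth header each provider expects, per the table above."""
    if provider in ("openai", "google"):
        # Bearer token in the standard Authorization header.
        return {"Authorization": f"Bearer {credential}"}
    if provider == "elevenlabs":
        return {"xi-api-key": credential}
    if provider == "azure":
        return {"Ocp-Apim-Subscription-Key": credential}
    if provider == "polly":
        raise ValueError("Amazon Polly requests are signed by the AWS SDK (Sig v4)")
    raise ValueError(f"Unknown provider: {provider}")
```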

Frequently Asked Questions

What is a text to speech API?

A text to speech API (TTS API) is a web service that converts written text into spoken audio using artificial intelligence. Developers send text via HTTP requests and receive audio files (MP3, WAV, etc.) in response. Popular TTS APIs include OpenAI TTS, ElevenLabs, Amazon Polly, Google Cloud Text-to-Speech, and Microsoft Azure Speech.

Which TTS API has the lowest latency?

For real-time applications, the comparison above lists Amazon Polly at around 150ms, OpenAI tts-1 and Google Cloud TTS at around 200ms, and ElevenLabs Turbo v2 at around 300ms, all of which are suitable for interactive use. Streaming APIs further reduce perceived latency by returning audio chunks as they are generated.

How much does a text to speech API cost?

TTS API pricing varies by provider. Amazon Polly Standard is cheapest at $4/1M characters. OpenAI TTS costs $15-30/1M characters. ElevenLabs uses subscription pricing from $5/month (30K chars) to $99/month (500K chars). Google Cloud and Azure charge $4-16/1M characters depending on voice type.

Which TTS API is best for developers?

OpenAI TTS offers the simplest API with excellent documentation and consistent quality. ElevenLabs provides the most features including voice cloning and streaming. Amazon Polly integrates well with AWS services. Google Cloud TTS offers extensive language support. Choose based on your specific requirements for quality, features, and infrastructure.

Do TTS APIs support real-time streaming?

Yes, most modern TTS APIs support streaming. OpenAI, ElevenLabs, Amazon Polly, Google Cloud, and Azure all offer streaming endpoints that return audio chunks as they are generated. This enables real-time applications like voice assistants and live narration with minimal perceived latency.

What audio formats do TTS APIs support?

Common TTS API output formats include MP3 (most compatible), WAV (uncompressed), OGG/Opus (efficient compression), AAC (Apple compatible), FLAC (lossless), and PCM (raw audio). OpenAI supports all major formats. Amazon Polly and Google Cloud offer MP3, OGG, and PCM. Choose based on your application's requirements.

Can I use TTS APIs for commercial projects?

Yes, all major TTS APIs allow commercial use with appropriate licensing. OpenAI, Amazon Polly, Google Cloud, and Azure include commercial rights. ElevenLabs requires paid plans for commercial use. Always review the terms of service for specific usage rights, attribution requirements, and content restrictions.

What are TTS API rate limits?

Rate limits vary by provider and plan. OpenAI allows 500 requests/minute by default. ElevenLabs limits depend on subscription tier. Amazon Polly allows 80 transactions/second. Google Cloud has per-minute character limits. Most providers offer increased limits for enterprise customers.
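
To stay under a quota such as 500 requests/minute, a client can throttle itself before sending. The token bucket below is one common approach, shown as a sketch: RateLimiter and try_acquire are hypothetical names, and the limits you configure should come from your provider's current documentation rather than from this example.

```python
import time


class RateLimiter:
    """Token bucket allowing `rate` requests per `per` seconds."""

    def __init__(self, rate: float, per: float = 60.0, clock=time.monotonic):
        self.capacity = float(rate)
        self.tokens = float(rate)          # start with a full bucket
        self.refill_per_second = rate / per
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With RateLimiter(500, per=60.0), a worker calls try_acquire() before each request and sleeps briefly or queues the job when it returns False.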

How do I choose between streaming and batch TTS?

Use streaming TTS for real-time applications like chatbots, voice assistants, and live narration where low latency is critical. Use batch TTS for pre-rendering content like audiobooks, podcasts, and video voiceovers where quality matters more than speed. Batch processing often produces higher quality output.

Which TTS API has the best voice quality?

ElevenLabs is widely considered to have the most natural-sounding voices, especially for emotional expression and voice cloning. OpenAI tts-1-hd offers excellent quality for general use. Amazon Polly Generative and Google Cloud WaveNet voices are also highly regarded. Quality perception varies by language and use case.
