Complete guide to integrating text to speech APIs into your applications. Compare the top TTS APIs, explore code examples, and learn best practices for speech synthesis integration.
A text to speech API (TTS API) is a web service that converts written text into natural-sounding audio using artificial intelligence. Developers send HTTP requests with text input and receive audio in response, either as a complete file in a format such as MP3 or WAV, or as a stream of audio chunks.
Modern TTS APIs use neural networks and deep learning to produce remarkably human-like speech. Services like ElevenLabs, OpenAI TTS, and Amazon Polly offer real-time streaming, multiple voices, and support for dozens of languages.
Side-by-side comparison of the top text to speech APIs for developers.
| API | Pricing | Rate Limit | Streaming | Voices | Latency | Best For |
|---|---|---|---|---|---|---|
| OpenAI TTS | $15-30/1M chars | 500 req/min | Yes | 9 | ~200ms | Simple integration, consistent quality |
| ElevenLabs | $5-99/mo subscription | Varies by plan | Yes | 1000+ | ~300ms | Voice cloning, premium quality |
| Amazon Polly | $4-100/1M chars | 80 TPS | Yes | 60+ | ~150ms | AWS integration, enterprise scale |
| Google Cloud TTS | $4-16/1M chars | 1000 chars/req | Yes (gRPC) | 400+ | ~200ms | Language variety, WaveNet quality |
| Azure Speech | $4-16/1M chars | 200 req/min | Yes | 400+ | ~200ms | Enterprise, SSML support |
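
Per-request character limits such as the 1000 chars/req listed for Google Cloud TTS mean that longer text has to be split before it is sent. The following is a minimal sketch of one way to do that, preferring sentence boundaries. The function name `split_text` and the default limit are illustrative assumptions, not part of any provider SDK; check your provider's documentation for its actual limit.

```python
import re


def split_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries.

    max_chars defaults to the per-request figure from the comparison table;
    treat it as an assumption and adjust it for your provider.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request and the resulting audio segments concatenated in order.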
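Basic HTTP examples for each TTS API using curl (the Amazon Polly example uses the AWS CLI, which handles request signing). Replace placeholders such as {voice_id} and {region}, and supply your own API keys.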
OpenAI TTS:

curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, welcome to our application!",
    "voice": "alloy"
  }' \
  --output speech.mp3

ElevenLabs:

curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, welcome to our application!",
    "model_id": "eleven_monolingual_v1"
  }' \
  --output speech.mp3

Amazon Polly (AWS CLI):

aws polly synthesize-speech \
  --text "Hello, welcome to our application!" \
  --output-format mp3 \
  --voice-id Joanna \
  --engine neural \
  speech.mp3

Google Cloud TTS:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://texttospeech.googleapis.com/v1/text:synthesize" \
  -d '{
    "input": {"text": "Hello, welcome to our application!"},
    "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-D"},
    "audioConfig": {"audioEncoding": "MP3"}
  }' | jq -r '.audioContent' | base64 --decode > speech.mp3

Azure Speech:

curl -X POST \
  "https://{region}.tts.speech.microsoft.com/cognitiveservices/v1" \
  -H "Ocp-Apim-Subscription-Key: $AZURE_SPEECH_KEY" \
  -H "Content-Type: application/ssml+xml" \
  -H "X-Microsoft-OutputFormat: audio-16khz-128kbitrate-mono-mp3" \
  -d '<speak version="1.0" xml:lang="en-US">
    <voice name="en-US-JennyNeural">
      Hello, welcome to our application!
    </voice>
  </speak>' \
  --output speech.mp3

Detailed pricing per API call and character for each provider.
Streaming: Audio chunks are returned as they are generated, enabling playback before the full response is ready.
Batch: The complete audio file is generated before being returned. Best for pre-rendered content.
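
The difference between the two delivery modes can be sketched with plain Python iterables standing in for a network response. This is an illustration under assumptions: `play_streaming` and `play_batch` are hypothetical helper names, and `sink` is any writable binary object (a file, a socket wrapper, an audio player's input).

```python
import io
from typing import BinaryIO, Iterable


def play_streaming(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Forward each chunk to the sink as soon as it arrives."""
    total = 0
    for chunk in chunks:
        if chunk:  # skip empty chunks
            sink.write(chunk)
            total += len(chunk)
    return total


def play_batch(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Collect the complete audio first, then hand it over in one write."""
    audio = b"".join(chunks)
    sink.write(audio)
    return len(audio)


# Stand-in data: a real client would iterate over the HTTP response body.
demo_sink = io.BytesIO()
demo_bytes = play_streaming([b"chunk-1", b"chunk-2"], demo_sink)
```

With streaming, the first `sink.write` happens after the first chunk; with batch, nothing reaches the sink until every chunk has been received.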
Recommended APIs based on specific requirements and use cases.

Latency:
~200ms, optimized for real-time
~150ms, excellent AWS integration
~300ms with premium quality

Cost and throughput:
$4/1M chars, 80 TPS
$4/1M chars, scalable
$15/1M chars, 500 req/min

Voice cloning:
Industry-leading voice cloning
Enterprise voice cloning
Voice cloning on Premium+

Ecosystem integration:
AWS ecosystem, SLA, compliance
Microsoft ecosystem, SSML
GCP integration, WaveNet
Sign up for an account with your chosen TTS provider (OpenAI, ElevenLabs, Amazon Polly, Google Cloud, or Azure). Navigate to the API section of your dashboard and generate an API key. Store this key securely, for example in an environment variable or a secrets manager, and never expose it in client-side code.
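
One common way to keep the key out of source code is to read it from an environment variable at startup. A minimal sketch follows; `load_api_key` is a hypothetical helper, and the default variable name `OPENAI_API_KEY` is simply the one used in the curl example.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read an API key from the environment instead of hard-coding it."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message rather than sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API")
    return key
```

Failing at startup when the variable is missing tends to be easier to debug than an authentication error returned by the first API call.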
Install the official SDK for your programming language (e.g., openai for Python, @google-cloud/text-to-speech for Node.js) or use any HTTP client library to make REST API calls directly.
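Send a POST request to the TTS endpoint with your text input, desired voice, and output format. Include your API key in the header your provider expects: OpenAI and Google Cloud use the Authorization header, while ElevenLabs and Azure use their own key headers (see the authentication table below). The API will return audio data that you can save to a file or stream to users.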
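
As a rough Python counterpart to the OpenAI curl example, the request can be built with the standard library alone. This is a sketch, not the official SDK: the helper names `build_speech_request` and `synthesize_to_file` are made up for illustration, and the URL, model, and voice values are copied from the curl example.

```python
import json
import urllib.request

# Endpoint, model, and voice are taken from the curl example.
OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(api_key: str, text: str, voice: str = "alloy",
                         model: str = "tts-1") -> urllib.request.Request:
    """Build (but do not send) the POST request for speech synthesis."""
    payload = json.dumps({"model": model, "input": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        OPENAI_SPEECH_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(api_key: str, text: str, path: str = "speech.mp3") -> int:
    """Send the request and write the returned audio bytes to path."""
    request = build_speech_request(api_key, text)
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(path, "wb") as f:
        f.write(audio)
    return len(audio)
```

Keeping request construction separate from sending makes the request easy to inspect or unit-test without network access.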
Save the returned audio bytes to a file (MP3, WAV, etc.) or stream directly to your application. For streaming APIs, process audio chunks as they arrive to minimize latency. Implement error handling for rate limits and API errors.
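
Error handling usually starts with deciding which failures are worth retrying. The sketch below classifies HTTP status codes using standard HTTP semantics (429 for rate limiting, 5xx for server errors); `classify_status` and its return labels are hypothetical, and individual providers may attach additional meaning to specific codes.

```python
def classify_status(status: int) -> str:
    """Map an HTTP status code to a coarse handling decision."""
    if 200 <= status < 300:
        return "ok"
    if status == 429:
        return "retry"  # rate limited: wait, then try again
    if status in (401, 403):
        return "auth"  # bad or missing credentials: retrying will not help
    if 500 <= status < 600:
        return "retry"  # server-side problem: often transient
    return "fail"  # other client errors: fix the request instead
```

A caller would typically retry on `"retry"`, surface `"auth"` as a configuration problem, and log `"fail"` together with the response body.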
Implement caching for repeated text, use streaming for real-time applications, handle rate limits with exponential backoff, and monitor usage to optimize costs. Consider using a queue for high-volume batch processing.
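
Caching for repeated text can be as simple as keying the audio on a hash of everything that affects the output. The following is a minimal in-memory sketch; the class name `CachedTTS` and the `synthesize(text, voice)` callable are assumptions standing in for whichever provider call you use.

```python
import hashlib
from typing import Callable, Dict


class CachedTTS:
    """Wrap a synthesize(text, voice) callable with an in-memory cache."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize
        self._cache: Dict[str, bytes] = {}
        self.misses = 0  # number of times the provider was actually called

    @staticmethod
    def _key(text: str, voice: str) -> str:
        # Include every parameter that changes the audio in the key.
        return hashlib.sha256(f"{voice}\x00{text}".encode("utf-8")).hexdigest()

    def speak(self, text: str, voice: str) -> bytes:
        key = self._key(text, voice)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._synthesize(text, voice)
        return self._cache[key]
```

For production use, the dictionary would normally be replaced by disk or object storage so that cached audio survives restarts; if you vary model or output format, add those to the key as well.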
| API | Auth Method | Header | Notes |
|---|---|---|---|
| OpenAI | Bearer Token | Authorization: Bearer sk-... | API key from OpenAI dashboard |
| ElevenLabs | API Key Header | xi-api-key: ... | Custom header for API key |
| Amazon Polly | AWS Sig v4 | AWS SDK handles auth | IAM credentials required |
| Google Cloud | OAuth 2.0 / API Key | Authorization: Bearer ... | Service account or API key |
| Azure Speech | Subscription Key | Ocp-Apim-Subscription-Key: ... | Azure portal subscription key |
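
The header differences in the table can be captured in one small helper. This is a sketch: the provider identifiers (`"openai"`, `"elevenlabs"`, and so on) and the function name `auth_headers` are made up for illustration, while the header names come from the table.

```python
def auth_headers(provider: str, credential: str) -> dict[str, str]:
    """Return the authentication header for a provider, per the table above."""
    if provider in ("openai", "google"):
        # Bearer token: an API key for OpenAI, an OAuth 2.0 access token for Google Cloud.
        return {"Authorization": f"Bearer {credential}"}
    if provider == "elevenlabs":
        return {"xi-api-key": credential}
    if provider == "azure":
        return {"Ocp-Apim-Subscription-Key": credential}
    if provider == "polly":
        # Polly requests are signed with AWS Signature v4, not a static header.
        raise ValueError("Amazon Polly uses AWS Signature v4; use the AWS SDK or CLI")
    raise ValueError(f"Unknown provider: {provider}")
```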
A text to speech API (TTS API) is a web service that converts written text into spoken audio using artificial intelligence. Developers send text via HTTP requests and receive audio files (MP3, WAV, etc.) in response. Popular TTS APIs include OpenAI TTS, ElevenLabs, Amazon Polly, Google Cloud Text-to-Speech, and Microsoft Azure Speech.
For real-time applications, OpenAI tts-1 and ElevenLabs Turbo v2 are tuned for low latency, at around 200-300ms. Amazon Polly Neural and Google Cloud TTS also respond fast enough for interactive use; the comparison table above lists Amazon Polly at around 150ms. Streaming APIs further reduce perceived latency by returning audio chunks as they are generated.
TTS API pricing varies by provider. Amazon Polly Standard is cheapest at $4/1M characters. OpenAI TTS costs $15-30/1M characters. ElevenLabs uses subscription pricing from $5/month (30K chars) to $99/month (500K chars). Google Cloud and Azure charge $4-16/1M characters depending on voice type.
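
To compare per-character and subscription pricing on the same footing, it helps to convert everything to dollars per million characters. The sketch below does that arithmetic with the figures quoted above; the function names are illustrative, and the rates should be checked against each provider's current price list.

```python
def estimate_cost(characters: int, usd_per_million_chars: float) -> float:
    """Cost in USD for a given character count at a per-million-character rate."""
    if characters < 0:
        raise ValueError("characters must not be negative")
    return characters / 1_000_000 * usd_per_million_chars


def effective_rate_per_million(monthly_price_usd: float, included_chars: int) -> float:
    """Convert a subscription (price plus included characters) to USD per 1M characters."""
    if included_chars <= 0:
        raise ValueError("included_chars must be positive")
    return monthly_price_usd / included_chars * 1_000_000


# 250,000 characters at the quoted per-character rates.
polly_standard = estimate_cost(250_000, 4.0)   # $4/1M characters
openai_low = estimate_cost(250_000, 15.0)      # $15/1M characters

# The $5/month plan with 30K characters, expressed per 1M characters.
subscription_rate = round(effective_rate_per_million(5.0, 30_000), 2)
```

Subscription plans only reach their effective rate if the included characters are fully used, so the comparison is a best case for the subscription.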
OpenAI TTS offers the simplest API with excellent documentation and consistent quality. ElevenLabs provides the most features including voice cloning and streaming. Amazon Polly integrates well with AWS services. Google Cloud TTS offers extensive language support. Choose based on your specific requirements for quality, features, and infrastructure.
Yes, most modern TTS APIs support streaming. OpenAI, ElevenLabs, Amazon Polly, Google Cloud, and Azure all offer streaming endpoints that return audio chunks as they are generated. This enables real-time applications like voice assistants and live narration with minimal perceived latency.
Common TTS API output formats include MP3 (most compatible), WAV (uncompressed), OGG/Opus (efficient compression), AAC (Apple compatible), FLAC (lossless), and PCM (raw audio). OpenAI supports all major formats. Amazon Polly and Google Cloud offer MP3, OGG, and PCM. Choose based on your application's requirements.
Yes, all major TTS APIs allow commercial use with appropriate licensing. OpenAI, Amazon Polly, Google Cloud, and Azure include commercial rights. ElevenLabs requires paid plans for commercial use. Always review the terms of service for specific usage rights, attribution requirements, and content restrictions.
Rate limits vary by provider and plan. OpenAI allows 500 requests/minute by default. ElevenLabs limits depend on subscription tier. Amazon Polly allows 80 transactions/second. Google Cloud has per-minute character limits. Most providers offer increased limits for enterprise customers.
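
When a limit is hit, the usual remedy is to retry with exponential backoff. Below is a minimal sketch with jitter. `RateLimitError` is a hypothetical exception that your own HTTP layer would raise on a 429 response, and the default delays are illustrative rather than provider recommendations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised by the caller's HTTP layer on a 429 response."""


def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 0.5, max_delay: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Double the delay on each attempt, cap it, then add random jitter
            # so that many clients do not retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.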
Use streaming TTS for real-time applications like chatbots, voice assistants, and live narration where low latency is critical. Use batch TTS for pre-rendering content like audiobooks, podcasts, and video voiceovers where quality matters more than speed. Because batch jobs are not latency-bound, they also leave room for slower, higher-quality models such as OpenAI tts-1-hd.
ElevenLabs is widely considered to have the most natural-sounding voices, especially for emotional expression and voice cloning. OpenAI tts-1-hd offers excellent quality for general use. Amazon Polly Generative and Google Cloud WaveNet voices are also highly regarded. Quality perception varies by language and use case.