Text to Speech API: Developer Guide

Complete guide to integrating text to speech APIs into your applications. Compare the top TTS APIs, explore code examples, and learn best practices for speech synthesis integration.

What is a Text to Speech API?

A text to speech API (TTS API) is a web service that converts written text into natural-sounding audio using artificial intelligence. Developers send HTTP requests containing text and receive audio in response, either as a complete file in a format like MP3 or WAV or as a stream of audio chunks.

Modern TTS APIs use neural networks and deep learning to produce remarkably human-like speech. Services like ElevenLabs, OpenAI TTS, and Amazon Polly offer real-time streaming, multiple voices, and support for dozens of languages.

Standard protocols: REST/gRPC

Typical latency: ~200ms

Price per 1M chars: $4-30

Languages supported: 140+

TTS API Comparison

Side-by-side comparison of the top text to speech APIs for developers.

API	Pricing	Rate Limit	Streaming	Voices	Latency	Best For
OpenAI TTS	$15-30/1M chars	500 req/min	Yes	9	~200ms	Simple integration, consistent quality
ElevenLabs	$5-99/mo subscription	Varies by plan	Yes	1000+	~300ms	Voice cloning, premium quality
Amazon Polly	$4-100/1M chars	80 TPS	Yes	60+	~150ms	AWS integration, enterprise scale
Google Cloud TTS	$4-16/1M chars	1000 chars/req	Yes (gRPC)	400+	~200ms	Language variety, WaveNet quality
Azure Speech	$4-16/1M chars	200 req/min	Yes	400+	~200ms	Enterprise, SSML support

Code Examples

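Basic examples for each TTS API using curl (or the AWS CLI for Amazon Polly). The examples read API keys from environment variables such as $OPENAI_API_KEY; replace placeholders like {voice_id} and {region} with your own values.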

OpenAI TTS (curl)

curl https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "Hello, welcome to our application!",
    "voice": "alloy"
  }' \
  --output speech.mp3

ElevenLabs (curl)

curl -X POST "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, welcome to our application!",
    "model_id": "eleven_monolingual_v1"
  }' \
  --output speech.mp3

Amazon Polly (AWS CLI)

aws polly synthesize-speech \
  --text "Hello, welcome to our application!" \
  --output-format mp3 \
  --voice-id Joanna \
  --engine neural \
  speech.mp3

Google Cloud TTS (curl)

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://texttospeech.googleapis.com/v1/text:synthesize" \
  -d '{
    "input": {"text": "Hello, welcome to our application!"},
    "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-D"},
    "audioConfig": {"audioEncoding": "MP3"}
  }' | jq -r '.audioContent' | base64 --decode > speech.mp3

Azure Speech (curl)

curl -X POST \
  "https://{region}.tts.speech.microsoft.com/cognitiveservices/v1" \
  -H "Ocp-Apim-Subscription-Key: $AZURE_SPEECH_KEY" \
  -H "Content-Type: application/ssml+xml" \
  -H "X-Microsoft-OutputFormat: audio-16khz-128kbitrate-mono-mp3" \
  -d '<speak version="1.0" xml:lang="en-US">
        <voice name="en-US-JennyNeural">
          Hello, welcome to our application!
        </voice>
      </speak>' \
  --output speech.mp3

TTS API Pricing Breakdown

Detailed pricing per API call and character for each provider.

OpenAI TTS

tts-1: $15/1M chars

tts-1-hd: $30/1M chars

~$0.015-0.03 per 1K chars

ElevenLabs

Free: 10K chars/mo

Starter: $5/mo (30K chars)

Pro: $99/mo (500K chars)

~$0.17-0.20 per 1K chars

Amazon Polly

Standard: $4/1M chars

Neural: $16/1M chars

Generative: $30/1M chars

Free tier: 5M chars/mo for the first 12 months

Google Cloud TTS

Standard: $4/1M chars

WaveNet: $16/1M chars

Neural2: $16/1M chars

Free tier: 4M chars/mo

Azure Speech

Neural: $16/1M chars

Custom Neural: $24/1M chars

Personal Voice: $100/1M chars

Free tier: 500K chars/mo
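
The per-character rates above turn into a budget estimate with simple arithmetic. The following is a rough sketch, not a billing tool: the `RATES_PER_MILLION` table only copies figures quoted in this guide, and the function names `estimate_cost` and `effective_rate_per_1k` are hypothetical. Check each provider's current price list before relying on the numbers.

```python
# Rates in USD per 1M characters, copied from the breakdown above.
# These are this guide's figures and may be out of date.
RATES_PER_MILLION = {
    "openai/tts-1": 15.0,
    "openai/tts-1-hd": 30.0,
    "polly/standard": 4.0,
    "polly/neural": 16.0,
    "google/standard": 4.0,
    "google/wavenet": 16.0,
    "azure/neural": 16.0,
}


def estimate_cost(characters: int, rate_per_million: float,
                  free_characters: int = 0) -> float:
    """Return the estimated USD cost after subtracting any free allowance."""
    billable = max(0, characters - free_characters)
    return billable * rate_per_million / 1_000_000


def effective_rate_per_1k(monthly_fee: float, included_characters: int) -> float:
    """Convert a subscription plan into an equivalent USD price per 1K chars."""
    return monthly_fee / included_characters * 1000
```

For example, `estimate_cost(2_000_000, RATES_PER_MILLION["openai/tts-1"])` gives 30.0, and `effective_rate_per_1k(99, 500_000)` gives roughly 0.198, which matches the upper end of the ElevenLabs per-1K range listed above.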

Streaming vs Batch Processing

Streaming TTS

Audio chunks are returned as they are generated, enabling playback before the full response is ready.

Low perceived latency (~100-200ms to first byte)

Ideal for chatbots and voice assistants

Real-time narration and live applications

Supported by: OpenAI, ElevenLabs, Amazon Polly, Google (gRPC), Azure
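
To illustrate why streaming lowers perceived latency, the sketch below forwards each chunk to a playback callback as soon as it arrives and records how long the first chunk took. It makes no network calls: `chunks` is assumed to be any iterable of bytes (for example, the chunk iterator your HTTP client exposes for a streamed response), and `play_stream` and `on_chunk` are hypothetical names.

```python
import time
from typing import Callable, Iterable, Tuple


def play_stream(chunks: Iterable[bytes],
                on_chunk: Callable[[bytes], None]) -> Tuple[float, int]:
    """Hand each audio chunk to on_chunk as it arrives.

    Returns (seconds until the first non-empty chunk, total bytes received).
    """
    start = time.monotonic()
    first_chunk_delay = None
    total_bytes = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive chunks
        if first_chunk_delay is None:
            # Playback can begin here, before the rest has been generated.
            first_chunk_delay = time.monotonic() - start
        on_chunk(chunk)
        total_bytes += len(chunk)
    if first_chunk_delay is None:
        raise ValueError("stream ended without any audio data")
    return first_chunk_delay, total_bytes
```

In a real application `on_chunk` would feed an audio player or a WebSocket; in a test it can simply append to a list.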

Batch Processing

The complete audio file is generated before being returned. Best for pre-rendered content.

Potential for higher-quality output

Ideal for audiobooks and podcasts

Easier to cache and store

Best for: Audiobooks, podcasts, video voiceovers, e-learning

Best TTS API by Use Case

Recommended APIs based on specific requirements and use cases.

Best for Low Latency

OpenAI tts-1: ~200ms, optimized for real-time

Amazon Polly: ~150ms, excellent AWS integration

ElevenLabs Turbo: ~300ms with premium quality

Best for High Volume

Amazon Polly Standard: $4/1M chars, 80 TPS

Google Cloud TTS: $4/1M chars, scalable

OpenAI TTS: $15/1M chars, 500 req/min

Best for Voice Cloning

ElevenLabs: Industry-leading voice cloning

Azure Personal Voice: Enterprise voice cloning

Speechify: Voice cloning on Premium+

Best for Enterprise

Amazon Polly: AWS ecosystem, SLA, compliance

Azure Speech: Microsoft ecosystem, SSML

Google Cloud TTS: GCP integration, WaveNet

Quick Start Guide

Get API credentials

Sign up for an account with your chosen TTS provider (OpenAI, ElevenLabs, Amazon Polly, Google Cloud, or Azure). Navigate to the API section of your dashboard and generate an API key. Store this key securely - never expose it in client-side code.
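
One common way to keep a key out of source code is to read it from an environment variable at startup, as the curl examples in this guide do. A minimal sketch, assuming the key has been exported in the shell; the helper name `load_api_key` is hypothetical:

```python
import os


def load_api_key(env_var: str) -> str:
    """Read an API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before calling the API"
        )
    return key
```

Calling `load_api_key("OPENAI_API_KEY")` on the server side keeps the secret out of both the repository and any code shipped to the browser.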

Install SDK or prepare HTTP client

Install the official SDK for your programming language (e.g., openai for Python, @google-cloud/text-to-speech for Node.js) or use any HTTP client library to make REST API calls directly.

Make your first API request

Send a POST request to the TTS endpoint with your text input, desired voice, and output format. Include your credentials in the header your provider expects (see Authentication Methods): some use an Authorization header, others a custom key header. The API will return audio data that you can save to a file or stream to users.
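
As a sketch of what such a request looks like without an SDK, the function below builds the same request as the OpenAI curl example in this guide, using only the Python standard library. The endpoint, model, and voice come from that example; the function name `build_speech_request` is hypothetical, and nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

# Endpoint taken from the OpenAI curl example in this guide.
OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text: str, api_key: str,
                         model: str = "tts-1",
                         voice: str = "alloy") -> urllib.request.Request:
    """Build (but do not send) a POST request for speech synthesis."""
    payload = json.dumps({"model": model, "input": text, "voice": voice})
    return urllib.request.Request(
        OPENAI_SPEECH_URL,
        data=payload.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sending it (requires a valid key and network access):
# with urllib.request.urlopen(build_speech_request("Hello", key)) as resp:
#     audio_bytes = resp.read()
```

Separating request construction from sending makes the payload easy to inspect and unit-test without spending API credits.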

Handle the audio response

Save the returned audio bytes to a file (MP3, WAV, etc.) or stream directly to your application. For streaming APIs, process audio chunks as they arrive to minimize latency. Implement error handling for rate limits and API errors.
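
The saving step can be sketched as follows. This is one possible approach, not a provider requirement: chunks are written to a temporary file first and renamed only on success, so an interrupted download never leaves a truncated audio file behind. `save_audio` is a hypothetical helper, and `chunks` is assumed to be any iterable of bytes (a single-element list works for non-streaming responses).

```python
import os
import tempfile
from typing import Iterable


def save_audio(chunks: Iterable[bytes], path: str) -> int:
    """Write audio chunks to path and return the number of bytes written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".part")
    written = 0
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
                written += len(chunk)
        if written == 0:
            raise ValueError("no audio data received")
        # Rename into place only after every chunk was written successfully.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial file, then let the caller see the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return written
```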

Optimize for production

Implement caching for repeated text, use streaming for real-time applications, handle rate limits with exponential backoff, and monitor usage to optimize costs. Consider using a queue for high-volume batch processing.
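
Caching for repeated text can be as simple as keying stored audio on everything that affects the output. The sketch below is a minimal in-memory version under stated assumptions: `synthesize` stands in for whatever function actually calls your provider, and `cache_key` and `CachedSynthesizer` are hypothetical names. A production system would more likely store the audio on disk or in object storage.

```python
import hashlib
from typing import Callable, Dict


def cache_key(text: str, voice: str, model: str, audio_format: str = "mp3") -> str:
    """Hash every input that changes the audio, so different voices never collide."""
    raw = "\x1f".join([model, voice, audio_format, text])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class CachedSynthesizer:
    """Wrap a synthesize(text, voice, model) -> bytes function with a cache."""

    def __init__(self, synthesize: Callable[[str, str, str], bytes]):
        self._synthesize = synthesize
        self._cache: Dict[str, bytes] = {}
        self.misses = 0  # number of calls that actually reached the API

    def get(self, text: str, voice: str, model: str) -> bytes:
        key = cache_key(text, voice, model)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._synthesize(text, voice, model)
        return self._cache[key]
```

Because billing is per character, every cache hit on a repeated prompt (greetings, menu options, error messages) is a request that costs nothing.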

Authentication Methods

API	Auth Method	Header	Notes
OpenAI	Bearer Token	`Authorization: Bearer sk-...`	API key from OpenAI dashboard
ElevenLabs	API Key Header	`xi-api-key: ...`	Custom header for API key
Amazon Polly	AWS Sig v4	(none set manually; the AWS SDK or CLI signs requests)	IAM credentials required
Google Cloud	OAuth 2.0 / API Key	`Authorization: Bearer ...`	Service account or API key
Azure Speech	Subscription Key	`Ocp-Apim-Subscription-Key: ...`	Azure portal subscription key
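
The table above can be expressed as a small lookup that returns the right header for each provider. This is an illustrative sketch: the header names come from the table, while the function name `auth_headers` and the provider labels are hypothetical. Amazon Polly is deliberately excluded because Signature v4 signing is best left to the AWS SDK or CLI.

```python
def auth_headers(provider: str, secret: str) -> dict:
    """Return the HTTP auth header for a provider, per the table above."""
    provider = provider.lower()
    if provider in ("openai", "google"):
        # Google expects an OAuth 2.0 access token here, OpenAI an API key.
        return {"Authorization": f"Bearer {secret}"}
    if provider == "elevenlabs":
        return {"xi-api-key": secret}
    if provider == "azure":
        return {"Ocp-Apim-Subscription-Key": secret}
    if provider == "polly":
        raise ValueError(
            "Amazon Polly uses AWS Signature v4; let the AWS SDK or CLI sign requests"
        )
    raise ValueError(f"unknown provider: {provider}")
```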

Frequently Asked Questions

What is a text to speech API?

A text to speech API (TTS API) is a web service that converts written text into spoken audio using artificial intelligence. Developers send text via HTTP requests and receive audio files (MP3, WAV, etc.) in response. Popular TTS APIs include OpenAI TTS, ElevenLabs, Amazon Polly, Google Cloud Text-to-Speech, and Microsoft Azure Speech.

Which TTS API has the lowest latency?

By the figures in the comparison table, Amazon Polly has the lowest typical latency at around 150ms, followed by OpenAI tts-1, Google Cloud TTS, and Azure Speech at around 200ms, and ElevenLabs Turbo v2 at around 300ms. All are fast enough for interactive applications. Streaming APIs further reduce perceived latency by returning audio chunks as they are generated.

How much does a text to speech API cost?

TTS API pricing varies by provider. Amazon Polly Standard is cheapest at $4/1M characters. OpenAI TTS costs $15-30/1M characters. ElevenLabs uses subscription pricing from $5/month (30K chars) to $99/month (500K chars). Google Cloud and Azure charge $4-16/1M characters depending on voice type.

Which TTS API is best for developers?

OpenAI TTS offers the simplest API with excellent documentation and consistent quality. ElevenLabs provides the most features including voice cloning and streaming. Amazon Polly integrates well with AWS services. Google Cloud TTS offers extensive language support. Choose based on your specific requirements for quality, features, and infrastructure.

Do TTS APIs support real-time streaming?

Yes, most modern TTS APIs support streaming. OpenAI, ElevenLabs, Amazon Polly, Google Cloud, and Azure all offer streaming endpoints that return audio chunks as they are generated. This enables real-time applications like voice assistants and live narration with minimal perceived latency.

What audio formats do TTS APIs support?

Common TTS API output formats include MP3 (most compatible), WAV (uncompressed), OGG/Opus (efficient compression), AAC (Apple compatible), FLAC (lossless), and PCM (raw audio). OpenAI supports all major formats. Amazon Polly and Google Cloud offer MP3, OGG, and PCM. Choose based on your application's requirements.

Can I use TTS APIs for commercial projects?

Yes, all major TTS APIs allow commercial use with appropriate licensing. OpenAI, Amazon Polly, Google Cloud, and Azure include commercial rights. ElevenLabs requires paid plans for commercial use. Always review the terms of service for specific usage rights, attribution requirements, and content restrictions.

What are TTS API rate limits?

Rate limits vary by provider and plan. OpenAI allows 500 requests/minute by default. ElevenLabs limits depend on subscription tier. Amazon Polly allows 80 transactions/second. Google Cloud has per-minute character limits. Most providers offer increased limits for enterprise customers.
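
When a request does hit one of these limits, a common pattern is to retry with exponential backoff and jitter. The sketch below assumes your client code raises an exception when the provider responds with HTTP 429; `RateLimited` and `with_backoff` are hypothetical names, and the delay values are illustrative defaults rather than provider recommendations.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error your client raises on an HTTP 429 response."""


def with_backoff(call, max_attempts=5, base_delay=0.5, max_delay=30.0,
                 sleep=time.sleep):
    """Run call(), retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            # The ceiling doubles each attempt: 0.5s, 1s, 2s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait below the ceiling spreads out
            # retries from many clients.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so tests can replace the real delay with a stub; production code can leave it at the default.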

How do I choose between streaming and batch TTS?

Use streaming TTS for real-time applications like chatbots, voice assistants, and live narration where low latency is critical. Use batch TTS for pre-rendering content like audiobooks, podcasts, and video voiceovers where quality matters more than speed. Batch processing often produces higher quality output.

Which TTS API has the best voice quality?

ElevenLabs is widely considered to have the most natural-sounding voices, especially for emotional expression and voice cloning. OpenAI tts-1-hd offers excellent quality for general use. Amazon Polly Generative and Google Cloud WaveNet voices are also highly regarded. Quality perception varies by language and use case.
