Audio Cost - Transcription, TTS, and Voice Agent Pricing

Meet Sven Mikkelsen. Product Lead at a 40-person customer service tool. "We want to add voice support - transcribe calls, AI assist, generate speech for outbound. What does the audio side cost?"

🔥 1,000 calls/day average. Audio could become 60% of our AI bill.

The story

Audio AI has a different pricing model from text. STT (speech-to-text) is priced per minute. TTS (text-to-speech) is priced per character. Voice agents (real-time, bidirectional) are priced per minute on a different scale. Vendor price lists look similar until you line up the units: Whisper at $0.006/min, Deepgram Nova at $0.0043/min, ElevenLabs at $0.30/1K characters, and OpenAI Realtime at $0.06/min input plus $0.24/min output.

Sven's customer service tool: 1,000 calls × ~6 min avg × both directions transcribed = 12,000 minutes/day. Add voice agent for 30% of those = 3,600 voice-agent minutes. Plus TTS for outbound greetings = 50K characters/day. Total: ~$600/mo on Deepgram + ~$2,400/mo on OpenAI Realtime + ~$450/mo ElevenLabs = $3,450/mo.
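
The volume arithmetic in Sven's case can be sketched as follows. This is a minimal illustration, not the calculator's implementation: the function names and the flat 30-day month are assumptions, and only the TTS line is priced here because its per-character rate ($0.30/1K characters) is stated outright above. The STT and voice-agent dollar figures depend on the per-minute rate actually billed.

```python
# Hypothetical helpers; names and the 30-day month are illustrative assumptions.
DAYS_PER_MONTH = 30


def daily_volumes(calls_per_day, avg_minutes, directions, agent_share_pct):
    """Return (total audio minutes/day, voice-agent minutes/day)."""
    total_minutes = calls_per_day * avg_minutes * directions
    agent_minutes = total_minutes * agent_share_pct / 100
    return total_minutes, agent_minutes


def monthly_tts_cost(chars_per_day, price_per_1k_chars):
    """Monthly TTS spend for a per-1K-character price."""
    return chars_per_day * DAYS_PER_MONTH / 1000 * price_per_1k_chars


total, agent = daily_volumes(1_000, 6, 2, 30)
tts = monthly_tts_cost(50_000, 0.30)
print(f"{total:,} min/day total, {agent:,.0f} min/day voice agent")
print(f"TTS: ${tts:,.2f}/mo")
```

With the story's inputs this reproduces the 12,000 total minutes, the 3,600 voice-agent minutes, and the roughly $450/mo TTS line.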

Voice agents are the price disruptor. Old-school: STT → LLM → TTS pipeline costs ~$0.04/min. New voice-native models (OpenAI Realtime, Gemini Live): $0.06-0.30/min - more expensive per minute, but ~3× lower latency and significantly better conversation quality. Worth the premium for high-stakes interactions.

About this calculator: Audio Cost - Transcription, TTS, and Voice Agent Pricing

Speech-to-text per minute, text-to-speech per character, voice agent stack cost. Whisper, Deepgram, ElevenLabs, OpenAI Realtime - when each wins.

Inputs you control

Input	Impact on result	Range	Typical
Total audio minutes per day	Sum of all audio processed (STT + voice-agent + TTS converted to minutes). Sven's case: 1K calls × 6 min × 2 directions = 12K min.	10 – 1M	12000
Voice agent share (%) - realtime LLM bidirectional	Fraction needing real-time AI conversation (vs just transcription). Voice agents cost ~10-50× transcription per minute.	0 – 100	30
TTS characters per day	Outbound speech synthesis. 1 minute of speech ≈ 800-1200 characters.	0 – 10M	50000
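
The characters-to-minutes rule of thumb in the last row can be applied as in the sketch below. The 800-1200 characters-per-minute band is taken from the table; the function name is a hypothetical helper, and real speech rates vary by voice and language.

```python
# The 800-1200 characters-per-minute band comes from the inputs table;
# the function name is a hypothetical helper for illustration.
CHARS_PER_MINUTE_LOW = 800
CHARS_PER_MINUTE_HIGH = 1200


def tts_minutes_range(characters):
    """Return (min_minutes, max_minutes) of speech for a character count."""
    shortest = characters / CHARS_PER_MINUTE_HIGH  # more characters per minute, fewer minutes
    longest = characters / CHARS_PER_MINUTE_LOW    # fewer characters per minute, more minutes
    return shortest, longest


low, high = tts_minutes_range(50_000)
print(f"50,000 characters is roughly {low:.1f} to {high:.1f} minutes of speech")
```

For the typical 50,000 characters per day this works out to roughly 42 to 63 minutes of synthesized speech, which is small next to 12,000 minutes of transcription.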

Outputs computed for you · model: `audio_stack`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12

Reading your result

Three line items: STT, voice agent, TTS. Each has its own scale. STT typically dominates volume, voice agent dominates cost-per-minute, TTS is usually small unless you're outbound-heavy.

Voice agent vs old pipeline is the strategic choice. Voice agent: $0.06-0.30/min, ~500ms perceived latency, natural turn-taking. Old pipeline (STT → LLM → TTS): ~$0.04/min, ~1.5-2s latency, awkward interruptions. UX-sensitive workflows go agent; cost-sensitive batch workflows stay pipeline.
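
To put a number on that choice, the sketch below compares monthly spend for the same conversational minutes on both architectures. The per-minute rates are the ones quoted in the paragraph above; the 30-day month, the function names, and the use of Sven's 3,600 voice-agent minutes per day as the volume are assumptions for illustration.

```python
DAYS_PER_MONTH = 30  # assumption: flat 30-day month

PIPELINE_RATE = 0.04             # $/min, STT -> LLM -> TTS
AGENT_RATE_RANGE = (0.06, 0.30)  # $/min, voice-native models


def monthly_cost(minutes_per_day, rate_per_minute):
    return minutes_per_day * DAYS_PER_MONTH * rate_per_minute


def agent_premium(minutes_per_day):
    """Return the (low, high) extra monthly spend of a voice agent over the pipeline."""
    base = monthly_cost(minutes_per_day, PIPELINE_RATE)
    return tuple(monthly_cost(minutes_per_day, r) - base for r in AGENT_RATE_RANGE)


minutes = 3_600  # conversational minutes per day, as in Sven's case
pipeline = monthly_cost(minutes, PIPELINE_RATE)
low, high = agent_premium(minutes)
print(f"Pipeline: ${pipeline:,.0f}/mo; voice agent adds ${low:,.0f} to ${high:,.0f}/mo")
```

At this volume the pipeline costs about $4,320/mo and the voice agent adds somewhere between roughly $2,160 and $28,080/mo, so the width of the agent price range matters more than the pipeline baseline.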

Latency matters more here than in text. Sub-300ms time to first token (TTFT) is the threshold for natural conversation. Gemini Live and OpenAI Realtime hit it; old pipelines don't. If your voice agent feels awkward, the model isn't the problem - the architecture is.

What "good" looks like:

Pure STT call recording: $0.003-0.008/min. Whisper, Deepgram, AssemblyAI competitive.
Voice agent (bidirectional): $0.06-0.30/min. OpenAI Realtime, Gemini Live.
TTS (high quality): $0.10-0.30/1K chars. ElevenLabs premium, OpenAI HD.
TTS (basic): $0.015-0.04/1K chars. OpenAI standard, Google standard.
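
One way to use these bands is as a quick sanity check on a vendor quote. The sketch below copies the four ranges from the list above; the dictionary keys and the function name are hypothetical, and a price outside a band is a prompt to look closer, not a verdict.

```python
# Bands copied from the list above; keys and the function name are hypothetical.
GOOD_BANDS = {
    "stt_per_min": (0.003, 0.008),
    "voice_agent_per_min": (0.06, 0.30),
    "tts_high_quality_per_1k_chars": (0.10, 0.30),
    "tts_basic_per_1k_chars": (0.015, 0.04),
}


def assess(category, quoted_price):
    """Classify a quoted price as 'below', 'within', or 'above' the band."""
    low, high = GOOD_BANDS[category]
    if quoted_price < low:
        return "below"
    if quoted_price > high:
        return "above"
    return "within"


print(assess("stt_per_min", 0.0043))                  # Deepgram Nova rate quoted earlier
print(assess("tts_high_quality_per_1k_chars", 0.30))  # ElevenLabs rate quoted earlier
print(assess("voice_agent_per_min", 0.04))            # old pipeline rate, for contrast
```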

Top transcription + TTS providers right now

Verified 20 hours ago

1. GPT-5 Mini: $0.250 in · $2.00 out
2. Command: $1.00 in · $2.00 out
3. devstral-2: $0.400 in · $2.00 out

Three real scenarios

Same calculator, three different team sizes. The scenario shown here is a call-recording workload at 50K minutes per day.

$6,450 / month ≈ $77,400 / year

Call recording archival + searchable transcripts. Deepgram Nova at $0.0043/min × 50K min × 30 days ≈ $6,450/mo. No agent overhead.

Healthy range: $6-9K/mo for 50K min/day

Inputs used:

minutesPerDay: 50,000
voiceAgentPctOfTotal: 0
ttsCharactersPerDay: 0
sttPricePerMinute: 0.0043
voiceAgentPricePerMinute: 0
ttsPricePerThousandChars: 0
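
The page does not publish the `audio_stack` formula, so the following is only a sketch of how these six inputs could combine into the monthly and annual outputs. The snake_case field names mirror the inputs listed above, the 30-day month follows the "× 30 days" in the scenario text, and the way minutes are split between STT and voice agent is an assumption.

```python
from dataclasses import dataclass

DAYS_PER_MONTH = 30  # assumption; consistent with "× 30 days" in the scenario text


@dataclass
class AudioStackInputs:
    minutes_per_day: float
    voice_agent_pct_of_total: float      # 0-100
    tts_characters_per_day: float
    stt_price_per_minute: float
    voice_agent_price_per_minute: float
    tts_price_per_thousand_chars: float


def monthly_usd(inputs: AudioStackInputs) -> float:
    # Assumed split: the voice-agent share is billed at the agent rate,
    # the remainder at the STT rate.
    agent_minutes = inputs.minutes_per_day * inputs.voice_agent_pct_of_total / 100
    stt_minutes = inputs.minutes_per_day - agent_minutes
    daily = (
        stt_minutes * inputs.stt_price_per_minute
        + agent_minutes * inputs.voice_agent_price_per_minute
        + inputs.tts_characters_per_day / 1000 * inputs.tts_price_per_thousand_chars
    )
    return daily * DAYS_PER_MONTH


def annual_usd(inputs: AudioStackInputs) -> float:
    return monthly_usd(inputs) * 12  # matches "monthlyUsd × 12" in the outputs table


recording = AudioStackInputs(50_000, 0, 0, 0.0043, 0, 0)
print(f"${monthly_usd(recording):,.0f}/mo, ${annual_usd(recording):,.0f}/yr")
```

With the Deepgram Nova rate of $0.0043/min, this reproduces the $6,450/mo and $77,400/yr shown for the scenario.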

Trade-offs

Cost isn't the only dimension. Recommendations change with the constraint you prioritize.

Best fit for "cost":

Deepgram Nova: $0.0043/min, cheapest STT
OpenAI Whisper: $0.006/min, quality benchmark
Old pipeline (STT+LLM+TTS): cheaper than a voice agent for non-realtime work

STT pricing is competitive. Voice agent pricing is 10-50× higher. The cost decision is mostly: do you need real-time conversation quality? If yes, voice agent. If no, pipeline.

Use cases

Pre-loaded scenarios for the most common applications, with realistic numbers. The one shown here is bulk transcription.

$720.00 / month ≈ $8,640 / year

Bulk podcast/audiobook transcription. Whisper or Deepgram batch. Pure STT, batch processing. ~$720/mo.

Healthy range: $700-1K/mo at 8K min/day

Inputs used:

minutesPerDay: 8,000
voiceAgentPctOfTotal: 0
ttsCharactersPerDay: 0
sttPricePerMinute: 0.003
voiceAgentPricePerMinute: 0
ttsPricePerThousandChars: 0
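
Run in reverse, the same arithmetic shows what per-minute STT rate a monthly budget implies at a fixed volume. This is a rough sketch: the function name is hypothetical and a flat 30-day month is assumed.

```python
DAYS_PER_MONTH = 30  # assumption: flat 30-day month


def implied_rate_per_minute(monthly_budget_usd, minutes_per_day):
    """Per-minute STT rate that a monthly budget can afford at a given volume."""
    return monthly_budget_usd / (minutes_per_day * DAYS_PER_MONTH)


for budget in (700, 1_000):
    rate = implied_rate_per_minute(budget, 8_000)
    print(f"${budget}/mo at 8,000 min/day -> ${rate:.4f}/min")
```

Read this way, the $700-1K/mo healthy range corresponds to roughly $0.0029 to $0.0042 per minute, which brackets the $0.003 rate used in this scenario.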

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

Per-minute pricing varies by quality tier and language - Hindi/Mandarin sometimes priced higher than English.
Doesn't model VAD/streaming infrastructure cost (typically small but real).
Voice-agent latency assumptions vary by network conditions.
TTS quality differences (ElevenLabs vs OpenAI vs Google) are workload-specific - test before committing.

For these, use: Voice Agent Stack for full architecture. Multimodal RAG if mixing audio + text retrieval.

Where to go next

Full voice agent architecture: STT + LLM + TTS or voice-native, with full stack pricing.
Voice at 10× scale: audio costs compound fast; see where the cliffs are.
Voice vendor lock-in: higher migration cost than text, so plan for it.

Methodology

Source: https://platform.openai.com/docs/guides/realtime
Extraction: Per-vendor audio pricing extracted weekly. Latency benchmarks from Artificial Analysis.
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
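
As a back-of-the-envelope check on that claim, a cumulative drop over 24 months can be converted into an equivalent 12-month drop. The sketch below assumes a constant geometric rate of decline, which is a simplification: real price cuts tend to arrive as step changes. The function name is hypothetical.

```python
def annualized_drop(total_drop, months_observed, horizon_months=12):
    """Convert a cumulative price drop into the drop over a shorter horizon.

    Assumes a constant geometric rate of decline, which real price cuts
    (usually step changes) only approximate.
    """
    remaining = 1 - total_drop
    return 1 - remaining ** (horizon_months / months_observed)


for drop in (0.40, 0.90):
    print(f"{drop:.0%} over 24 months -> {annualized_drop(drop, 24):.1%} over 12 months")
```

Under that assumption, a 40-90% drop over 24 months is equivalent to roughly 23-68% over 12 months, a range that contains the 30%+ figure.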

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Vendor	Last verified	Source	Snapshot history
Anthropic	2026-06-05	https://www.anthropic.com/pricing	Daily snapshot since Sep 2023 · 578 days captured
Anthropic Docs	2026-06-05	https://platform.claude.com/docs/en/about-claude/pricing	Daily snapshot since Sep 2023 · 578 days captured
OpenAI	2026-06-05	https://openai.com/api/pricing/	Daily snapshot since Sep 2023 · 579 days captured
Google AI	2026-06-05	https://ai.google.dev/gemini-api/docs/pricing	Daily snapshot since Dec 2023 · 554 days captured
Google Vertex	2026-06-05	https://cloud.google.com/vertex-ai/generative-ai/pricing	Daily snapshot since Dec 2023 · 554 days captured
DeepSeek	2026-06-05	https://api-docs.deepseek.com/quick_start/pricing	Daily snapshot since May 2024 · 493 days captured
xAI	2026-06-05	https://x.ai/api	Daily snapshot since Nov 2024 · 411 days captured
Mistral	2026-06-05	https://mistral.ai/pricing	Daily snapshot since Dec 2023 · 552 days captured
Cohere	2026-06-05	https://cohere.com/pricing	Daily snapshot since Sep 2023 · 578 days captured
Voyage AI	2026-06-05	https://docs.voyageai.com/docs/pricing	not listed

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.
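
The derivation rules in this table reduce to two multipliers applied to the published base rates. The sketch below shows that arithmetic; the function name and the sample base rates are hypothetical, and the field names follow the `Field` column above. It is an illustration of the convention, not the site's extraction code.

```python
def derive_inferred(input_price, output_price, cached_ratio=0.10, batch_ratio=0.50):
    """Fill in the inferred fields from published base rates (USD per 1M tokens)."""
    return {
        "cachedInput": round(input_price * cached_ratio, 6),
        "batchInput": round(input_price * batch_ratio, 6),
        "batchOutput": round(output_price * batch_ratio, 6),
    }


# Hypothetical base rates, not taken from any vendor's price list.
default = derive_inferred(1.00, 4.00)
# Rows that use a 25% cached rate override the ratio.
quarter_cached = derive_inferred(1.00, 4.00, cached_ratio=0.25)
print(default)
print(quarter_cached["cachedInput"])
```

Keeping the ratios as parameters makes the exceptions in the table (the rows derived at 25% of input) explicit instead of hard-coding a single convention.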

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail.
