Audio Cost · for voice agents, transcription, TTS

What do voice + audio workloads cost?

Transcription, TTS, and audio-in-LLM pricing. Build voice agents, transcribe podcasts, or stream audio to multimodal LLMs.

Pricing verified: 2026-06-05 Whisper · Deepgram · ElevenLabs · Gemini

What this calculator does

Estimates speech-to-text cost per minute of audio, text-to-speech cost per character of text, and the combined cost of a full voice agent loop.

Why use it

STT is usually billed per minute of audio and TTS per character of text, not per token, so costs are easy to miscalculate
Compare Whisper, Deepgram, ElevenLabs, and Cartesia side by side

📊 Calculator at a glance

🎤 Transcription (STT)

🔊 Text-to-Speech (TTS)

🗣 Voice agent (full loop)

🎛 CALCULATOR

🎤 Your transcription workload

Converting speech to text. Billed per minute of audio.

Service

Hours of audio / month: e.g., podcast transcription, call recordings, meeting notes. 100 hrs/mo ≈ 1 full-time call center agent.

Monthly transcription cost

Per minute

Per hour

Annual
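The per-minute arithmetic behind these four outputs can be sketched as follows. This is a minimal illustration, not the calculator's actual code: the function name `transcription_cost` and the rate `EXAMPLE_RATE` are hypothetical, and the rate is a placeholder rather than any vendor's quoted price.

```python
def transcription_cost(hours_per_month: float, price_per_minute: float) -> dict:
    """Return per-minute, per-hour, monthly, and annual STT cost in dollars."""
    if hours_per_month < 0 or price_per_minute < 0:
        raise ValueError("hours and price must be non-negative")
    minutes = hours_per_month * 60          # STT is billed per minute of audio
    monthly = minutes * price_per_minute
    return {
        "per_minute": price_per_minute,
        "per_hour": price_per_minute * 60,
        "monthly": monthly,
        "annual": monthly * 12,
    }


# Placeholder rate for illustration only, not a real vendor price.
EXAMPLE_RATE = 0.005
result = transcription_cost(hours_per_month=100, price_per_minute=EXAMPLE_RATE)
print(result)
```

Swapping in the current published rate of each service gives the rows of the comparison table.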

📊 Compare all transcription services

Service	$/min	Monthly	Annual

🔊 Your TTS workload

Converting text to speech. Billed per character of text.

TTS provider

Characters / month: 1M chars ≈ 150K words ≈ 800 pages. At the same ratio, a voicebot speaking 1K words in a 5-min call uses about 6,700 chars/call.

Monthly TTS cost

Per 1K chars

Per 1M chars

Annual
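Per-character billing works the same way, with characters in place of minutes. A minimal sketch, assuming the provider quotes a flat price per 1M characters; `tts_cost` and the example price are hypothetical and not taken from any provider's price list.

```python
def tts_cost(chars_per_month: float, price_per_million_chars: float) -> dict:
    """Return per-1K, per-1M, monthly, and annual TTS cost in dollars."""
    if chars_per_month < 0 or price_per_million_chars < 0:
        raise ValueError("characters and price must be non-negative")
    monthly = chars_per_month / 1_000_000 * price_per_million_chars
    return {
        "per_1k_chars": price_per_million_chars / 1_000,
        "per_1m_chars": price_per_million_chars,
        "monthly": monthly,
        "annual": monthly * 12,
    }


# Placeholder price per 1M characters, for illustration only.
tts_result = tts_cost(chars_per_month=2_000_000, price_per_million_chars=15.0)
print(tts_result)
```

Tiered or subscription plans would need a different model; this sketch covers only flat per-character pricing.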

📊 Compare all TTS providers

Service	$/M chars	Monthly	Annual

🗣 Full voice agent loop

Transcription + LLM reasoning + TTS. The real cost of a voice bot call.

Transcription (STT)

LLM (reasoning)

TTS (speech output)

Avg call duration (minutes)

User words/min (during call)

Bot words/min (during call)

Calls per day

Monthly voice agent cost

Per call

Per minute

Annual

💡 Breakdown per call

Component	Cost	% of call
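One way to combine the three components into a per-call breakdown is sketched below. Several things here are assumptions rather than the calculator's documented method: words are converted to characters using the page's "1M chars ≈ 150K words" rule of thumb, words are converted to tokens at a rough 1.3 tokens per word, a month is taken as 30 days, STT is billed for the full call duration, and the LLM sees each word only once (no system prompt and no re-sent conversation history). All names and prices are hypothetical placeholders.

```python
from dataclasses import dataclass

CHARS_PER_WORD = 1_000_000 / 150_000  # from the "1M chars ≈ 150K words" rule of thumb
TOKENS_PER_WORD = 1.3                 # assumption: varies by tokenizer and language
DAYS_PER_MONTH = 30                   # assumption


@dataclass
class VoiceAgentInputs:
    call_minutes: float
    user_wpm: float                 # user words per minute of call
    bot_wpm: float                  # bot words per minute of call
    calls_per_day: float
    stt_per_minute: float           # $ per minute of audio
    llm_input_per_m_tokens: float   # $ per 1M input tokens
    llm_output_per_m_tokens: float  # $ per 1M output tokens
    tts_per_m_chars: float          # $ per 1M characters


def per_call_breakdown(x: VoiceAgentInputs) -> dict:
    """Cost of one call, split into STT, LLM, and TTS."""
    user_words = x.call_minutes * x.user_wpm
    bot_words = x.call_minutes * x.bot_wpm
    stt = x.call_minutes * x.stt_per_minute
    llm = (user_words * TOKENS_PER_WORD * x.llm_input_per_m_tokens
           + bot_words * TOKENS_PER_WORD * x.llm_output_per_m_tokens) / 1_000_000
    tts = bot_words * CHARS_PER_WORD * x.tts_per_m_chars / 1_000_000
    return {"stt": stt, "llm": llm, "tts": tts}


def summarize(x: VoiceAgentInputs) -> dict:
    """Per-call, per-minute, monthly, and annual cost plus each component's share."""
    if x.call_minutes <= 0:
        raise ValueError("call_minutes must be positive")
    parts = per_call_breakdown(x)
    per_call = sum(parts.values())
    monthly = per_call * x.calls_per_day * DAYS_PER_MONTH
    share = {k: (100 * v / per_call if per_call else 0.0) for k, v in parts.items()}
    return {
        "per_call": per_call,
        "per_minute": per_call / x.call_minutes,
        "monthly": monthly,
        "annual": monthly * 12,
        "share_pct": share,
    }


# All prices below are placeholders, not real vendor rates.
example = VoiceAgentInputs(
    call_minutes=5, user_wpm=60, bot_wpm=100, calls_per_day=100,
    stt_per_minute=0.005, llm_input_per_m_tokens=1.0,
    llm_output_per_m_tokens=4.0, tts_per_m_chars=15.0,
)
summary = summarize(example)
print(summary)
```

In a real agent the LLM share is usually higher than this sketch suggests, because the conversation history is typically re-sent as input on every turn.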

🎯 Optimization notes

🎯 Use this result to

🎤 Pick STT and TTS providers: Deepgram, AssemblyAI, OpenAI, and ElevenLabs are all priced differently, so compare them before committing.
🤖 Cost a voice agent stack: STT + LLM + TTS + latency engineering, rolled up into a full per-minute cost.
📉 Find the scale break-even: at certain volumes, self-hosting beats managed services, and the calculator surfaces that threshold.
🔌 Integrate with your AI agents: an MCP integration is available for agentic workflows and cost-aware audio routing.
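The break-even idea in the list above can be illustrated with a simple linear model. This is only a sketch under stated assumptions, not the calculator's actual threshold logic: self-hosting is modeled as a fixed monthly cost plus an optional marginal cost per minute, the managed service as a flat per-minute rate, and every number is a placeholder. The function name `break_even_minutes` is hypothetical.

```python
from typing import Optional


def break_even_minutes(managed_per_minute: float,
                       self_host_fixed_monthly: float,
                       self_host_per_minute: float = 0.0) -> Optional[float]:
    """Monthly minutes above which self-hosting is cheaper, or None if it never is."""
    saving_per_minute = managed_per_minute - self_host_per_minute
    if saving_per_minute <= 0:
        return None  # managed is at least as cheap at every volume
    return self_host_fixed_monthly / saving_per_minute


# Placeholder figures for illustration only.
minutes = break_even_minutes(managed_per_minute=0.005,
                             self_host_fixed_monthly=1500.0,
                             self_host_per_minute=0.001)
print(f"break-even at about {minutes / 60:,.0f} hours of audio per month")
```

The model ignores engineering time, utilization gaps, and volume discounts, all of which shift the threshold in practice.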

Vendor / Model	Field	Why it’s inferred
Anthropic / Claude Sonnet 4.6	cachedInput	Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier.
Anthropic / Claude Sonnet 4.5	cachedInput	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic / Claude Sonnet 4.5	batchInput	Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount.
Anthropic / Claude Sonnet 4.5	batchOutput	Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount.
Anthropic / Claude Haiku 4.5	cachedInput	Derived at 10% of input rate; Anthropic 90% cache-hit discount convention.
OpenAI / GPT-5.4 Mini	cachedInput	Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI / GPT-5.4 Nano	cachedInput	Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Nano	batchInput	Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Nano	batchOutput	Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro	cachedInput	Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Pro	batchInput	Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro	batchOutput	Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.2	cachedInput	Derived at 10% of input; no residency uplift.
OpenAI / GPT-5.2	batchInput	Derived at 50% of input.
OpenAI / GPT-5.2	batchOutput	Derived at 50% of output.
OpenAI / GPT-5	cachedInput	Derived at 10% of input.
OpenAI / GPT-5	batchInput	Derived at 50% of input.
OpenAI / GPT-5	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.5 Pro	cachedInput	Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI / GPT-5.5 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5.5 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.2 Pro	cachedInput	Derived at 10% of input; pro-tier convention.
OpenAI / GPT-5.2 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5.2 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.1	batchInput	Derived at 50% of input.
OpenAI / GPT-5.1	batchOutput	Derived at 50% of output.
OpenAI / GPT-5 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5 Nano	cachedInput	Derived at 10% of input.
OpenAI / GPT-5 Nano	batchInput	Derived at 50% of input.
OpenAI / GPT-5 Nano	batchOutput	Derived at 50% of output.
Google / Gemini 3 Flash	cachedInput	Derived at 10% of input; Google caching discount convention ~90%.
Google / Gemini 3.1 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 3.1 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 3.1 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Pro	cachedInput	Derived at 10% of input.
Google / Gemini 2.5 Flash	cachedInput	Derived at 10% of input.
Google / Gemini 2.5 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 2.5 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash	cachedInput	Derived at 25% of input per Google 2.0 family caching rates.
Google / Gemini 2.0 Flash	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 2.0 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
xAI / Grok 4 (legacy)	cachedInput	Extrapolated at 25% of base.
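The derivation rules in these rows reduce to two multipliers: a cache fraction applied to the input rate and a batch fraction applied to both input and output rates. A minimal sketch, assuming base rates are given in dollars per 1M tokens; the helper name `infer_rates` and the base rates in the example are hypothetical placeholders, not published prices.

```python
def infer_rates(input_rate: float, output_rate: float,
                cache_fraction: float = 0.10,
                batch_fraction: float = 0.50) -> dict:
    """Derive cachedInput, batchInput, and batchOutput from base rates."""
    return {
        "cachedInput": input_rate * cache_fraction,   # e.g. 10% of input
        "batchInput": input_rate * batch_fraction,    # e.g. 50% of input
        "batchOutput": output_rate * batch_fraction,  # e.g. 50% of output
    }


# Placeholder base rates ($ per 1M tokens), for illustration only.
default_rates = infer_rates(input_rate=2.00, output_rate=8.00)

# Rows that use a 25% cache fraction override the default.
quarter_cache = infer_rates(input_rate=2.00, output_rate=8.00, cache_fraction=0.25)
print(default_rates, quarter_cache)
```

Values produced this way are estimates and should be replaced whenever a vendor publishes the actual cached or batch rate.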

Go deeper

The calculator's an estimate. Want the real number?

Methodology

Primary sources

Inferred values (marked with * in calculator tables)