Audio Cost - Transcription, TTS, and Voice Agent Pricing
Meet Sven Mikkelsen. Product Lead at a 40-person customer service tool. "We want to add voice support - transcribe calls, AI assist, generate speech for outbound. What does the audio side cost?"
🔥 1,000 calls/day average. Audio could become 60% of our AI bill.
Audio AI has a different pricing model from text. STT (speech-to-text) is per-minute. TTS (text-to-speech) is per-character. Voice agents (real-time bidirectional) are per-minute on a different scale. Vendor price lists look comparable until you line up the units: Whisper at $0.006/min and Deepgram Nova at $0.0043/min for STT, ElevenLabs at $0.30/1K chars for TTS, and OpenAI Realtime at $0.06/min input + $0.24/min output for voice agents.
Sven's customer service tool: 1,000 calls × ~6 min avg × both directions transcribed = 12,000 minutes/day. Add voice agent for 30% of those = 3,600 voice-agent minutes. Plus TTS for outbound greetings = 50K characters/day. Total: ~$600/mo on Deepgram + ~$2,400/mo on OpenAI Realtime + ~$450/mo ElevenLabs = $3,450/mo.
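The arithmetic above can be reproduced from the per-unit rates quoted earlier. What follows is a minimal sketch, assuming a 30-day month, Deepgram Nova's $0.0043/min for every transcribed minute, and OpenAI Realtime's $0.06/min input rate as a floor for agent minutes (output audio at $0.24/min would add to it). The function and parameter names are illustrative. At these list rates the total lands near $8.5K/mo rather than $3,450, in line with the "Sven's mix" scenario further down.

```python
DAYS_PER_MONTH = 30  # assumption: flat 30-day month


def monthly_audio_cost(stt_min_per_day, agent_min_per_day, tts_chars_per_day,
                       stt_rate=0.0043, agent_rate=0.06, tts_rate_per_1k=0.30):
    """Return (stt, agent, tts, total) in USD per month at the given list rates."""
    stt = stt_min_per_day * stt_rate * DAYS_PER_MONTH
    agent = agent_min_per_day * agent_rate * DAYS_PER_MONTH
    tts = tts_chars_per_day / 1000 * tts_rate_per_1k * DAYS_PER_MONTH
    return round(stt, 2), round(agent, 2), round(tts, 2), round(stt + agent + tts, 2)


calls_per_day = 1_000
avg_minutes = 6
stt_minutes = calls_per_day * avg_minutes * 2   # both directions transcribed
agent_minutes = round(stt_minutes * 0.30)       # 30% handled by a voice agent
stt, agent, tts, total = monthly_audio_cost(stt_minutes, agent_minutes, 50_000)
print(f"STT ${stt:,.0f}  agent ${agent:,.0f}  TTS ${tts:,.0f}  total ${total:,.0f}")
```

Swapping `agent_rate` for a blended input/output figure is the quickest way to see how sensitive the total is to the voice-agent line.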
Voice agents are the price disruptor. Old-school: STT → LLM → TTS pipeline costs ~$0.04/min. New voice-native models (OpenAI Realtime, Gemini Live): $0.06-0.30/min - more expensive per minute, but ~3× lower latency and significantly better conversation quality. Worth the premium for high-stakes interactions.
Speech-to-text per minute, text-to-speech per character, voice agent stack cost. Whisper, Deepgram, ElevenLabs, OpenAI Realtime - when each wins.
Three line items: STT, voice agent, TTS. Each has its own scale. STT typically dominates volume, voice agent dominates cost-per-minute, TTS is usually small unless you're outbound-heavy.
Voice agent vs old pipeline is the strategic choice. Voice agent: $0.06-0.30/min, ~500ms perceived latency, natural turn-taking. Old pipeline (STT → LLM → TTS): ~$0.04/min, ~1.5-2s latency, awkward interruptions. UX-sensitive workflows go agent; cost-sensitive batch workflows stay pipeline.
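To put a number on that choice, the premium for routing minutes to a voice agent instead of the pipeline can be sketched as below. This assumes the ~$0.04/min pipeline figure and the $0.06-0.30/min agent range quoted above, a 30-day month, and the 3,600 agent minutes/day from Sven's example. `agent_premium` is a hypothetical helper, not part of any vendor SDK.

```python
def agent_premium(minutes_per_day, agent_rate, pipeline_rate=0.04, days=30):
    """Extra monthly spend (USD) from using a voice agent instead of the STT -> LLM -> TTS pipeline."""
    return round(minutes_per_day * (agent_rate - pipeline_rate) * days, 2)


for rate in (0.06, 0.30):  # low and high end of the quoted agent range
    premium = agent_premium(3_600, rate)
    print(f"agent at ${rate:.2f}/min: +${premium:,.0f}/mo over the pipeline")
```

The spread between the two ends of the range is large, so the per-minute rate actually negotiated matters more than the architecture label.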
Latency matters more here than it does for text. Sub-300ms TTFT is the threshold for natural conversation. Gemini Live + OpenAI Realtime hit it; old pipelines don't. If your voice agent feels awkward, the model isn't the problem - the architecture is.
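The latency bands this guide uses (sub-300ms natural, 800ms+ awkward, 1.5s+ broken) can be turned into a simple check for monitoring. This is a rough sketch: the "borderline" label for the 300-800ms gap is an assumption, since the guide does not name that band, and the two sample measurements are hypothetical.

```python
def conversational_feel(latency_ms: float) -> str:
    """Map perceived response latency to the feel bands described in this guide."""
    if latency_ms < 300:
        return "natural"
    if latency_ms < 800:
        return "borderline"  # assumption: the guide does not name the 300-800 ms band
    if latency_ms < 1500:
        return "awkward"
    return "broken"


# Hypothetical measurements, for illustration only.
for label, ms in [("voice-native agent", 250), ("STT -> LLM -> TTS pipeline", 1750)]:
    print(f"{label}: {ms} ms feels {conversational_feel(ms)}")
```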
The same calculation at three different team sizes shows how the numbers shift.
Call recording archival + searchable transcripts. Deepgram Nova at $0.0043/min × 50K min × 30 days ≈ $6,450/mo. No agent overhead.
Healthy range: $6-9K/mo for 50K min/day
Sven's mix. STT for 70% of minutes ($1,500), voice agent for 30% ($6,500), TTS for outbound ($450). Total ~$8.5K/mo, well above the $3,450 estimate in the introduction. Voice agent is the cost driver.
Healthy range: $3-4K/mo total
Premium voice assistant - 90% real-time agent. 5K min/day × 90% × $0.10/min × 30 days = $13.5K. Plus STT for the remaining 10% ($90), plus TTS ($1,800). Total ~$15.4K/mo. Cost-justified only at high LTV per call.
Healthy range: $15K-20K/mo at 5K min/day
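The scenario totals above follow one blended formula. A minimal sketch, assuming a 30-day month; the $0.006/min STT rate in the premium case is an assumption (it is the Whisper rate, and it is what reproduces the $90 figure), and `blended_monthly` is an illustrative name.

```python
def blended_monthly(minutes_per_day, agent_share, agent_rate, stt_rate,
                    tts_monthly=0, days=30):
    """Split daily minutes between voice agent and plain STT; return monthly USD by line item."""
    agent = minutes_per_day * agent_share * agent_rate * days
    stt = minutes_per_day * (1 - agent_share) * stt_rate * days
    return {
        "agent": round(agent),
        "stt": round(stt),
        "tts": tts_monthly,
        "total": round(agent + stt + tts_monthly),
    }


# STT-only archive: 50K min/day on Deepgram Nova, no agent, no TTS.
archive = blended_monthly(50_000, 0.0, 0.0, 0.0043)
# Premium assistant: 5K min/day, 90% agent at $0.10/min, $1,800/mo TTS.
premium = blended_monthly(5_000, 0.90, 0.10, 0.006, tts_monthly=1_800)
print(archive)
print(premium)
```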
Cost isn't the only dimension. The constraints below also shape the recommendation.
STT pricing is competitive. Voice agent pricing is 10-50× higher. The cost decision is mostly: do you need real-time conversation quality? If yes, voice agent. If no, pipeline.
Transcription errors compound - wrong word in transcript = wrong AI summary = wrong action. Test on YOUR actual call audio (accent, background noise, multi-speaker).
Voice agents (especially newer ones) often don't yet offer a BAA (the Business Associate Agreement HIPAA requires). If you work in healthcare, check before integrating.
Voice recordings are biometric data that can identify a person. Confirm the vendor's retention policy and no-training terms. Some vendors retain audio for model improvement by default.
Latency is THE differentiator for voice. Sub-300ms feels natural. 800ms+ feels awkward. 1.5s+ feels broken. Don't ship voice agents that don't hit the threshold.
OpenAI Realtime API ≠ Gemini Live API. Different audio formats, websocket protocols, turn-taking semantics. Multi-vendor abstraction is harder than text or vision.
Audio pipelines need VAD (voice activity detection), format normalization (PCM, Opus, etc.), streaming infra. Not trivial - budget for it.
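To give a sense of what the VAD piece involves, here is a deliberately simplistic energy-threshold sketch over 16-bit little-endian mono PCM. The fixed threshold, the 20 ms frame size, and the function names are all assumptions for illustration; production systems typically use a trained or adaptive detector rather than a fixed RMS cutoff.

```python
import math
import struct


def frame_rms(frame: bytes) -> float:
    """RMS amplitude of one frame of 16-bit little-endian mono PCM."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def speech_frames(pcm: bytes, sample_rate=16_000, frame_ms=20, threshold=500.0):
    """Yield (start_ms, is_speech) per full frame, using a fixed energy threshold."""
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        start_ms = (i // 2) * 1000 // sample_rate
        yield start_ms, frame_rms(pcm[i:i + frame_bytes]) >= threshold


# Demo input: 20 ms of silence followed by 20 ms of a 440 Hz tone at 16 kHz.
silence = struct.pack("<320h", *([0] * 320))
tone = struct.pack(
    "<320h",
    *(int(8000 * math.sin(2 * math.pi * 440 * t / 16_000)) for t in range(320)),
)
print(list(speech_frames(silence + tone)))
```

Even this toy version shows why the work is not trivial: frame size, sample format, and threshold all have to match the audio actually arriving from the telephony layer.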
Scenarios for the most common applications, with realistic numbers:
Bulk podcast/audiobook transcription. Whisper or Deepgram batch. Pure STT, batch processing. ~$720/mo.
Healthy range: $700-1K/mo at 8K min/day
Real-time transcription + post-meeting summary. STT cost is the audio piece (~$540). Add LLM summarization separately (~$300/mo for 100 meetings). Total ~$850/mo.
Healthy range: $500-700/mo + LLM summarization
Full AI customer service voice. Real-time agent on every call. 8K min × $0.08 × 30 = $19.2K/mo. Cost-justified only when displacing $40K+ of human agents.
Healthy range: $18K-22K/mo at 8K min/day
TTS-heavy: audiobook narration, podcast generation. 5M chars/day × $0.30/1K chars × 30 days = $45K. ElevenLabs premium voices. Margin math is critical here - pricing per audiobook needs to clear $0.50-1 per minute of generated audio.
Healthy range: $45K/mo at 5M chars/day
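The TTS margin math can be sketched per minute of generated audio. The 900 characters per minute figure is an assumption (roughly 150 spoken words per minute at about six characters per word including spaces), so measure your own narration speed before relying on it. The function names are illustrative.

```python
def tts_cost_per_audio_minute(rate_per_1k_chars=0.30, chars_per_minute=900):
    """USD cost to synthesize one minute of audio. chars_per_minute is an assumed speaking rate."""
    return rate_per_1k_chars * chars_per_minute / 1000


def gross_margin(price_per_minute, rate_per_1k_chars=0.30, chars_per_minute=900):
    """Fraction of the per-minute price left after TTS cost."""
    cost = tts_cost_per_audio_minute(rate_per_1k_chars, chars_per_minute)
    return (price_per_minute - cost) / price_per_minute


monthly_tts = 5_000_000 / 1000 * 0.30 * 30  # the 5M chars/day scenario above
print(f"monthly TTS spend: ${monthly_tts:,.0f}")
for price in (0.50, 1.00):
    print(f"price ${price:.2f}/min -> margin {gross_margin(price):.0%}")
```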
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Voice Agent Stack for full architecture. Multimodal RAG if mixing audio + text retrieval.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
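One way to sanity-check the 30%+ figure is to assume the 24-month drop happened at a constant geometric rate, which the source does not claim and which real price cuts rarely follow. Under that assumption, the 12-month drift implied by a 40-90% drop over 24 months works out as follows; `drift_after` is an illustrative name.

```python
def drift_after(months, total_drop, window_months=24):
    """Fractional price drop after `months`, assuming a constant geometric decline
    that reaches `total_drop` over `window_months`."""
    remaining = (1 - total_drop) ** (months / window_months)
    return 1 - remaining


for total_drop in (0.40, 0.90):
    print(f"{total_drop:.0%} over 24 months -> {drift_after(12, total_drop):.0%} after 12 months")
```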
Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs).
10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic - Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic - Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic - Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount. |
| Anthropic - Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount. |
| Anthropic - Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention. |
| OpenAI - GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI - GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention. |
| OpenAI - GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention. |
| OpenAI - GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift. |
| OpenAI - GPT-5.2 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.2 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 | cachedInput | Derived at 10% of input. |
| OpenAI - GPT-5 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.5 Pro | cachedInput | Derived at 10% of input. OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI - GPT-5.5 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention. |
| OpenAI - GPT-5.2 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.2 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.1 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.1 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 Nano | cachedInput | Derived at 10% of input. |
| OpenAI - GPT-5 Nano | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 Nano | batchOutput | Derived at 50% of output. |
| Google - Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention ~90%. |
| Google - Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google - Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google - Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google - Gemini 2.5 Pro | cachedInput | Derived at 10% of input. |
| Google - Gemini 2.5 Flash | cachedInput | Derived at 10% of input. |
| Google - Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google - Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google - Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates. |
| Google - Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google - Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| xAI - Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base. |
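The derivation behind the table can be sketched in a few lines: cached input at 10% of the base input rate and Batch at 50% of base, with an override for the 25% exceptions (Gemini 2.0 Flash and Grok 4 legacy). The base prices below are hypothetical placeholders in USD per 1M tokens, not any vendor's published rates, and `infer_rates` is an illustrative name.

```python
def infer_rates(input_per_m, output_per_m, cache_ratio=0.10, batch_ratio=0.50):
    """Derive unpublished rates from base prices using the conventions in the table."""
    return {
        "cachedInput": round(input_per_m * cache_ratio, 4),
        "batchInput": round(input_per_m * batch_ratio, 4),
        "batchOutput": round(output_per_m * batch_ratio, 4),
    }


# Hypothetical base prices (USD per 1M tokens), for illustration only.
default_rates = infer_rates(2.00, 8.00)
exception_rates = infer_rates(2.00, 8.00, cache_ratio=0.25)  # the 25% cached-input cases
print(default_rates)
print(exception_rates)
```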
Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail.