Voice Agent Stack - Full Architecture from STT to TTS
Meet Aiyana Crow. Tech Lead at a voice-first customer service startup. "Voice agents have 5+ moving parts. What's the full stack cost per minute of conversation?"
🔥 Pricing pitch said '$0.04/min'. First production day cost $0.18/min average. Need to know why.
Voice agent cost is dominated by the model tier. A classic pipeline costs STT (~$0.005/min) + LLM (~$0.01-0.03/min) + TTS (~$0.01-0.04/min) = $0.025-0.075/min. Voice-native models (OpenAI Realtime, Gemini Live) run $0.06-0.30/min. The pipeline is cheaper but slower; voice-native is faster but pricier.
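The pipeline range is just the sum of the per-stage lows and highs. A minimal sketch of that arithmetic, using only the figures quoted above; the dictionary and function names are illustrative, not part of any calculator API:

```python
# Per-minute cost ranges in USD (low, high), copied from the text above.
PIPELINE_STAGES = {
    "stt": (0.005, 0.005),
    "llm": (0.01, 0.03),
    "tts": (0.01, 0.04),
}
VOICE_NATIVE = (0.06, 0.30)


def pipeline_range(stages=PIPELINE_STAGES):
    """Sum the low ends and the high ends of every pipeline stage."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return round(low, 4), round(high, 4)


low, high = pipeline_range()
print(f"pipeline:     ${low:.3f}-${high:.3f}/min")
print(f"voice-native: ${VOICE_NATIVE[0]:.2f}-${VOICE_NATIVE[1]:.2f}/min")
```

The ranges overlap slightly: a high-end pipeline ($0.075) costs more than a low-end voice-native model ($0.06).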
Aiyana's $0.18/min surprise: voice-native LLM ($0.10) + tool calls (3-5 per minute × ~$0.01 ≈ $0.04) + memory retrieval (vector DB ~$0.02) + telemetry ($0.005) + post-call summarization ($0.015) = $0.18. The pricing pitch only counted the voice-native LLM line.
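The breakdown can be checked line by line. A small sketch that sums the quoted components and shows how much of each minute sits outside the model line; the key names are illustrative labels only:

```python
# Aiyana's per-minute line items in USD, as quoted above.
line_items = {
    "voice_native_llm": 0.10,
    "tool_calls": 0.04,
    "memory_retrieval": 0.02,
    "telemetry": 0.005,
    "post_call_summary": 0.015,
}

total = sum(line_items.values())
non_llm = total - line_items["voice_native_llm"]

for name, cost in line_items.items():
    print(f"{name:18s} ${cost:.3f}  ({cost / total:.0%} of total)")
print(f"total ${total:.2f}/min, of which ${non_llm:.2f}/min is not the LLM")
```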
Three architecture choices. (1) Pipeline (STT → LLM → TTS) - cheapest, ~1-2s perceived latency, fine for non-emergency. (2) Voice-native - pricier, ~300-500ms latency, natural conversation. (3) Hybrid - voice-native for live, pipeline for non-realtime (post-call summary, transcription archival).
Voice agents combine STT + LLM + tools + memory + TTS or voice-native models. Real architecture math for production voice products.
Per-minute total = LLM tier + tools + memory + telemetry + post-call processing. Aiyana's $0.18 = $0.10 voice-native + $0.04 tools (3-5 calls × ~$0.01) + $0.02 memory + $0.005 telemetry + $0.015 post-call. Each line is small; the total is not.
Architecture mix is the biggest lever. 100% voice-native: Aiyana's case. Drop to 50/50 hybrid: ~$0.13/min. Pure pipeline: $0.07/min. Decide per use case - high-stakes calls voice-native, follow-ups pipeline.
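The hybrid figure is a weighted average of the two all-in rates. A minimal sketch, assuming the $0.18 (voice-native) and $0.07 (pipeline) totals above and a simple linear blend by share of minutes; `blended_cost` is a hypothetical helper:

```python
def blended_cost(voice_native_share, voice_native_rate=0.18, pipeline_rate=0.07):
    """Weighted per-minute cost for a given share of voice-native minutes."""
    if not 0.0 <= voice_native_share <= 1.0:
        raise ValueError("voice_native_share must be between 0 and 1")
    pipeline_share = 1.0 - voice_native_share
    return voice_native_share * voice_native_rate + pipeline_share * pipeline_rate


for share in (1.0, 0.5, 0.0):
    print(f"{share:.0%} voice-native: ${blended_cost(share):.3f}/min")
```

At a 50/50 split this gives $0.125/min, which the text rounds to ~$0.13.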
Tool calls compound fast. Each tool call adds $0.005-0.02. At 3-5 per minute over thousands of calls: meaningful share. Cache common lookups.
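To see what caching buys, a rough sketch under one loud assumption: a cache hit costs nothing, which real systems only approximate. The function name and the hit rate are hypothetical:

```python
def tool_cost_per_minute(calls_per_minute, cost_per_call, cache_hit_rate=0.0):
    """Per-minute tool spend, assuming cached lookups are free (an assumption)."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    billable_calls = calls_per_minute * (1.0 - cache_hit_rate)
    return billable_calls * cost_per_call


uncached = tool_cost_per_minute(4, 0.01)
cached = tool_cost_per_minute(4, 0.01, cache_hit_rate=0.5)
print(f"no cache:  ${uncached:.3f}/min")
print(f"50% cache: ${cached:.3f}/min")
```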
Post-call processing adds ~$0.01-0.03/min. Summarize, extract action items, generate transcripts. Often forgotten in initial budgeting.
The same cost model at three different team sizes:
Internal voice tool, 1K min/day. Pipeline architecture. ~$0.06/min total. $1,800/mo. Perfectly fine for low-stakes use.
Healthy range: $0.06/min × 1000 × 30 = $1,800/mo
Aiyana's customer service voice agent. 70% voice-native (live calls), 30% pipeline (escalations + post-processing). $0.16/min avg × 5K × 30 = $24K/mo.
Healthy range: ~$22-28K/mo at 5K min/day
Consumer voice scale. Voice-native everywhere. Negotiated lower per-min rates. Multi-vendor routing. ~$170K/mo. ROI from displaced human agents.
Healthy range: $150-200K/mo, multi-vendor mandatory
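The monthly figures for the first two scenarios follow from rate × minutes per day × 30. A short sketch of that formula; the 30-day month matches the arithmetic in the text, and `monthly_cost` is an illustrative name:

```python
def monthly_cost(rate_per_minute, minutes_per_day, days=30):
    """Monthly spend in USD from a blended per-minute rate and daily volume."""
    return rate_per_minute * minutes_per_day * days


print(f"internal tool:    ${monthly_cost(0.06, 1_000):,.0f}/mo")
print(f"customer service: ${monthly_cost(0.16, 5_000):,.0f}/mo")
```

The consumer-scale scenario does not state its daily minutes, so it is left out rather than guessed.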
Cost isn't the only dimension. Each constraint below changes the recommendation.
Voice-native isn't always worth it. For high-stakes, customer-facing live conversation: yes. For internal tools, post-processing, batch: pipeline wins.
Pipeline has more failure surfaces. Voice-native is more robust in continuous conversation but newer and less mature. Evaluate extensively.
Voice has the most compliance complexity of any AI modality. Biometric regulations + healthcare + recording laws. Verify per-vendor.
Audio recordings + transcripts are highly identifying. Tighter retention + access controls than text.
Sub-300ms = natural. 800ms = noticeable. 1.5s+ = broken. UX target dictates architecture.
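One way to turn the latency target into an architecture choice is sketched below. The thresholds come from the figures in this guide (pipeline ~1-2s, voice-native ~300-500ms); how latencies between the quoted points are labeled, and the 1000 ms pipeline floor, are assumptions made for illustration:

```python
PIPELINE_MIN_LATENCY_MS = 1000  # assumed floor, from the "~1-2s" pipeline figure


def perceived_quality(latency_ms):
    """Label a response latency; band edges between quoted points are assumed."""
    if latency_ms < 300:
        return "natural"
    if latency_ms < 1500:
        return "noticeable"
    return "broken"


def pick_architecture(target_latency_ms):
    """Pick the cheaper pipeline only when the target tolerates its latency."""
    if target_latency_ms >= PIPELINE_MIN_LATENCY_MS:
        return "pipeline"
    return "voice-native"


for target in (300, 800, 2000):
    print(target, perceived_quality(target), pick_architecture(target))
```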
Voice-native APIs (OpenAI Realtime, Gemini Live) differ structurally. Migration between voice-native vendors is a major project.
Voice MLOps is the hardest in AI. Sub-second latency monitoring, prosody quality, conversation completion rates, user dissatisfaction signals. Real platform investment.
Scenarios for the most common applications, with realistic numbers:
Outbound voice agent. Voice-native essential (natural conversation = higher conversion). Few tool calls. ~$10K/mo. Justified at $0.50+ revenue per call.
Healthy range: $8-12K/mo, ROI from conversion lift
Tier-1 customer service deflection. Mixed architecture. High tool calls (account lookups, status checks). ~$40K/mo. Displaces 5-8 human agents = $80K+/mo savings.
Healthy range: $35-45K/mo, displacing $80K/mo human
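The deflection case reduces to a net-savings calculation. A minimal sketch using the ~$40K/mo agent cost and $80K/mo displaced human cost quoted above; the function names are hypothetical and the figures ignore integration and oversight costs:

```python
def net_savings(displaced_human_cost, agent_cost):
    """Monthly savings after paying for the voice agent."""
    return displaced_human_cost - agent_cost


def roi(displaced_human_cost, agent_cost):
    """Net savings as a multiple of the voice agent spend."""
    return net_savings(displaced_human_cost, agent_cost) / agent_cost


print(f"net savings: ${net_savings(80_000, 40_000):,.0f}/mo")
print(f"ROI: {roi(80_000, 40_000):.0%}")
```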
Internal voice tool - schedule meetings, find info, send reminders. Mostly pipeline (latency tolerable). Productivity ROI vs absolute cost.
Healthy range: $700-1.2K/mo, productivity ROI
Patient triage. Voice-native essential (sub-300ms latency, natural conversation). Premium HIPAA-covered tier. Higher per-minute. Mandatory at this risk level.
Healthy range: $15-20K/mo (premium voice + HIPAA tier)
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use Audio Cost for STT/TTS detail and Agentic AI Stack for general agents.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
The fields below are derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic / Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic / Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic / Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount. |
| Anthropic / Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount. |
| Anthropic / Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention. |
| OpenAI / GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI / GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention. |
| OpenAI / GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount. |
| OpenAI / GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount. |
| OpenAI / GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention. |
| OpenAI / GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount. |
| OpenAI / GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount. |
| OpenAI / GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift. |
| OpenAI / GPT-5.2 | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5.2 | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5 | cachedInput | Derived at 10% of input. |
| OpenAI / GPT-5 | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5 | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5.5 Pro | cachedInput | Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models, so the family convention is used. |
| OpenAI / GPT-5.5 Pro | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5.5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention. |
| OpenAI / GPT-5.2 Pro | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5.2 Pro | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5.1 | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5.1 | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5 Pro | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5 Nano | cachedInput | Derived at 10% of input. |
| OpenAI / GPT-5 Nano | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5 Nano | batchOutput | Derived at 50% of output. |
| Google / Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention ~90%. |
| Google / Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google / Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google / Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google / Gemini 2.5 Pro | cachedInput | Derived at 10% of input. |
| Google / Gemini 2.5 Flash | cachedInput | Derived at 10% of input. |
| Google / Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google / Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google / Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google / Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates. |
| Google / Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google / Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google / Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google / Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google / Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| xAI / Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base. |
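The derivation rule in the table can be written as a small function. This is a sketch of the convention only, not the site's actual code; the field names match the table, while the base rates below are placeholder numbers, not any vendor's published price:

```python
def derive_rates(input_rate, output_rate, cache_factor=0.10, batch_factor=0.50):
    """Infer cached and batch rates from base rates using the table's conventions."""
    return {
        "cachedInput": input_rate * cache_factor,
        "batchInput": input_rate * batch_factor,
        "batchOutput": output_rate * batch_factor,
    }


# Placeholder base rates (USD per million tokens), for illustration only.
default = derive_rates(1.00, 4.00)
# Rows marked "25% of input" or "25% of base" override the cache factor.
exception = derive_rates(1.00, 4.00, cache_factor=0.25)

print(default)
print(exception["cachedInput"])
```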
Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old and new values for a full audit trail.
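A changelog with old and new values can be produced by diffing two consecutive snapshots. The sketch below is one plausible shape for that step, not the actual pipeline: the in-memory dict layout, the `diff_snapshots` name, and the example model and prices are all hypothetical.

```python
def diff_snapshots(old, new):
    """Return one changelog entry per (model, field) whose price changed."""
    changes = []
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            model, field = key
            changes.append({"model": model, "field": field,
                            "old": before, "new": after})
    return changes


# Hypothetical snapshots keyed by (model, field); values are placeholder prices.
yesterday = {("example-model", "cachedInput"): 0.10,
             ("example-model", "batchInput"): 0.50}
today = {("example-model", "cachedInput"): 0.08,
         ("example-model", "batchInput"): 0.50}

for entry in diff_snapshots(yesterday, today):
    print(entry)
```

Fields that appear or disappear between snapshots show up with `None` as the old or new value, so additions and removals are logged too.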