Voice Agent Stack - Full Architecture from STT to TTS

Meet Aiyana Crow, Tech Lead at a voice-first customer service startup. Her question: "Voice agents have 5+ moving parts. What's the full stack cost per minute of conversation?"

🔥 Pricing pitch said '$0.04/min'. First production day cost $0.18/min average. Need to know why.

The story

Voice agent cost is dominated by voice-native LLMs at $0.06-0.30/min. Old pipeline: STT (~$0.005/min) + LLM (~$0.01-0.03/min) + TTS (~$0.01-0.04/min) = $0.025-0.075/min. Voice-native (OpenAI Realtime, Gemini Live): $0.06-0.30/min. Pipeline is cheaper but slower; voice-native is faster but pricier.

Aiyana's $0.18/min surprise: voice-native LLM ($0.10) + tool calls (3-5 per minute × $0.01 ≈ $0.04) + memory retrieval (vector DB ~$0.02) + telemetry ($0.005) + post-call summarization ($0.015) = $0.18. The pricing pitch only counted the voice-native LLM line.
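The breakdown above is simple addition, and a short sketch makes it easy to see how much each line contributes. The dictionary keys are illustrative names, not fields from the calculator; the dollar values are the ones quoted in the story.

```python
# Per-minute line items quoted in the story (USD per conversation minute).
# Key names are illustrative, not calculator field names.
line_items = {
    "voice_native_llm": 0.10,
    "tool_calls": 0.04,         # roughly 3-5 calls at about $0.01 each
    "memory_retrieval": 0.02,
    "telemetry": 0.005,
    "post_call_summary": 0.015,
}

total = sum(line_items.values())
llm_share = line_items["voice_native_llm"] / total

print(f"total per minute: ${total:.3f}")
print(f"LLM line as share of total: {llm_share:.0%}")

# Largest contributors first.
for name, usd in sorted(line_items.items(), key=lambda kv: -kv[1]):
    print(f"  {name:<18} ${usd:.3f}  ({usd / total:.0%})")
```

Under these figures the LLM line is a little over half of the bill; the remaining items are what a headline per-minute price tends to leave out.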

Three architecture choices. (1) Pipeline (STT → LLM → TTS) - cheapest, ~1-2s perceived latency, fine for non-emergency. (2) Voice-native - pricier, ~300-500ms latency, natural conversation. (3) Hybrid - voice-native for live, pipeline for non-realtime (post-call summary, transcription archival).

About this calculator: Voice Agent Stack - Full Architecture from STT to TTS

Voice agents combine STT + LLM + tools + memory + TTS or voice-native models. Real architecture math for production voice products.

Inputs you control

Input	Impact on result	Range	Typical
Voice agent minutes per day	Total voice-agent conversation minutes. Aiyana: 5K. Mid call center: 50K-100K.	10 – 1M	5000
Voice-native architecture share (%)	% of minutes using voice-native (vs pipeline). Voice-native = better UX, ~3× cost.	0 – 100	70
Tool calls per voice-minute	Average tool calls per minute. Customer service: 1-2. Information lookup: 2-4. Action-taking: 3-5.	0 – 10	1.5

Outputs computed for you · model: `voice_stack`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12
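The page does not publish the `voice_stack` formula, so the following is a sketch inferred from the "inputs used" lists in the scenarios further down, where it reproduces both headline figures. Treat the function names and the structure as assumptions; the default values are copied from the first scenario's inputs. Post-call processing is not among the listed inputs, so it is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class VoiceStackInputs:
    call_minutes_per_day: float
    architecture_premium_pct: float          # % of minutes on voice-native
    tool_calls_per_minute: float
    # Defaults below mirror the first scenario's "inputs used" (assumed typical).
    voice_native_per_minute_usd: float = 0.10
    pipeline_per_minute_usd: float = 0.04
    avg_tool_cost_usd: float = 0.01
    memory_cost_per_min_usd: float = 0.02
    telemetry_cost_per_min_usd: float = 0.005
    working_days_per_month: int = 30


def per_minute_usd(i: VoiceStackInputs) -> float:
    """Blend the two LLM tiers by share, then add the flat per-minute lines."""
    share = i.architecture_premium_pct / 100.0
    llm = (share * i.voice_native_per_minute_usd
           + (1.0 - share) * i.pipeline_per_minute_usd)
    tools = i.tool_calls_per_minute * i.avg_tool_cost_usd
    return llm + tools + i.memory_cost_per_min_usd + i.telemetry_cost_per_min_usd


def monthly_usd(i: VoiceStackInputs) -> float:
    return per_minute_usd(i) * i.call_minutes_per_day * i.working_days_per_month


def annual_usd(i: VoiceStackInputs) -> float:
    return monthly_usd(i) * 12   # matches "monthlyUsd × 12" in the outputs table


# The "Typical" column from the inputs table: 5,000 min/day, 70% voice-native, 1.5 tool calls.
typical = VoiceStackInputs(5000, 70, 1.5)
print(f"per minute: ${per_minute_usd(typical):.3f}")
print(f"monthly:    ${monthly_usd(typical):,.0f}")
print(f"annual:     ${annual_usd(typical):,.0f}")
```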

Reading your result

Per-minute total = LLM tier + tools + memory + telemetry + post-call processing. Aiyana's $0.18 = $0.10 voice-native + $0.04 tools (about 4 × $0.01) + $0.02 memory + $0.005 telemetry + $0.015 post-call. Each line is small, but the total is real.

Architecture mix is the biggest lever. 100% voice-native: Aiyana's case. Drop to 50/50 hybrid: ~$0.13/min. Pure pipeline: $0.07/min. Decide per use case - high-stakes calls voice-native, follow-ups pipeline.
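The 50/50 figure follows from the two endpoints if the all-in cost is assumed to scale linearly with the voice-native share. That linearity is an assumption of this sketch, not something the page states; the two constants are the all-in figures quoted above.

```python
VOICE_NATIVE_ALL_IN = 0.18   # $/min at 100% voice-native (Aiyana's case)
PIPELINE_ALL_IN = 0.07       # $/min at pure pipeline


def blended_per_minute(voice_native_share: float) -> float:
    """Linear blend of the two all-in rates; share is a fraction from 0 to 1."""
    if not 0.0 <= voice_native_share <= 1.0:
        raise ValueError("voice_native_share must be between 0 and 1")
    return (voice_native_share * VOICE_NATIVE_ALL_IN
            + (1.0 - voice_native_share) * PIPELINE_ALL_IN)


for share in (1.0, 0.7, 0.5, 0.0):
    print(f"{share:.0%} voice-native -> ${blended_per_minute(share):.3f}/min")
```

At a 0.5 share this gives $0.125/min, which rounds to the ~$0.13 quoted for the hybrid case.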

Tool calls compound fast. Each tool call adds $0.005-0.02. At 3-5 per minute over thousands of calls: meaningful share. Cache common lookups.
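To get a rough feel for what caching is worth, the sketch below treats a cache hit as free and scales tool spend by the miss rate. The 40% hit rate, the 4 calls per minute, and the $0.01 per call are hypothetical inputs chosen from within the ranges above, not measured values.

```python
def tool_cost_per_minute(calls_per_minute: float,
                         cost_per_call_usd: float,
                         cache_hit_rate: float = 0.0) -> float:
    """Expected tool spend per minute, assuming cache hits cost nothing."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    return calls_per_minute * cost_per_call_usd * (1.0 - cache_hit_rate)


MINUTES_PER_DAY = 5_000   # Aiyana's volume from the inputs table
DAYS_PER_MONTH = 30

uncached = tool_cost_per_minute(4, 0.01)
cached = tool_cost_per_minute(4, 0.01, cache_hit_rate=0.4)   # hypothetical hit rate
saved = (uncached - cached) * MINUTES_PER_DAY * DAYS_PER_MONTH

print(f"uncached: ${uncached:.3f}/min, cached: ${cached:.3f}/min")
print(f"monthly saving at {MINUTES_PER_DAY:,} min/day: ${saved:,.0f}")
```

A real cache has its own storage and lookup cost, so the saving here is an upper bound.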

Post-call processing adds ~$0.01-0.03/min. Summarize, extract action items, generate transcripts. Often forgotten in initial budgeting.

What "good" looks like:

Pure pipeline architecture: $0.04-0.08/min, ~1-2s latency
Voice-native architecture: $0.10-0.30/min, <500ms latency
Hybrid (recommended): $0.08-0.18/min, mixed latency
Premium voice-native (high-stakes): $0.20-0.40/min

Voice-capable vendors right now

Verified 20 hours ago

1. GPT-5 Mini: $0.250 in · $2.00 out
2. Command: $1.00 in · $2.00 out
3. devstral-2: $0.400 in · $2.00 out

Three real scenarios

Same calculator, three different team sizes.

$2,250 / month ≈ $27,000 / year

Internal voice tool, 1K min/day. Pipeline architecture. The inputs below come to $0.075/min ($0.04 pipeline + $0.01 tools + $0.02 memory + $0.005 telemetry), or $2,250/mo. Perfectly fine for low-stakes use.

Healthy range: about $0.06/min × 1000 × 30 = $1,800/mo

Inputs used:

callMinutesPerDay: 1,000
architecturePremiumPct: 0
toolCallsPerMinute: 1
voiceNativePerMinuteUsd: 0.1
pipelinePerMinuteUsd: 0.04
avgToolCostUsd: 0.01
memoryCostPerMinUsd: 0.02
telemetryCostPerMinUsd: 0.005
workingDaysPerMonth: 30
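As a check, the listed inputs can be summed line by line. The formula is inferred from these inputs rather than published on the page, so treat it as an assumption; with it, the inputs land on the $2,250 headline figure.

```python
# Inputs copied from the list above.
inputs = {
    "callMinutesPerDay": 1_000,
    "architecturePremiumPct": 0,
    "toolCallsPerMinute": 1,
    "voiceNativePerMinuteUsd": 0.1,
    "pipelinePerMinuteUsd": 0.04,
    "avgToolCostUsd": 0.01,
    "memoryCostPerMinUsd": 0.02,
    "telemetryCostPerMinUsd": 0.005,
    "workingDaysPerMonth": 30,
}

share = inputs["architecturePremiumPct"] / 100
lines = {
    "llm": share * inputs["voiceNativePerMinuteUsd"]
           + (1 - share) * inputs["pipelinePerMinuteUsd"],
    "tools": inputs["toolCallsPerMinute"] * inputs["avgToolCostUsd"],
    "memory": inputs["memoryCostPerMinUsd"],
    "telemetry": inputs["telemetryCostPerMinUsd"],
}

per_minute = sum(lines.values())
monthly = per_minute * inputs["callMinutesPerDay"] * inputs["workingDaysPerMonth"]

for name, usd in lines.items():
    print(f"{name:<10} ${usd:.3f}/min")
print(f"per minute ${per_minute:.3f}, monthly ${monthly:,.0f}, annual ${monthly * 12:,.0f}")
```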

Trade-offs

Cost isn't the only dimension.

Best fit for "cost":

Pipeline architecture: Cheapest, latency tradeoff
Voice-native: Premium UX, premium cost
Hybrid (live + post-call pipeline): Best balance

Voice-native isn't always worth it. For high-stakes, customer-facing live conversation: yes. For internal tools, post-processing, batch: pipeline wins.

Use cases

Pre-loaded scenarios for the most common applications.

$7,920 / month ≈ $95,040 / year

Outbound voice agent. Voice-native essential (natural conversation = higher conversion). Few tool calls. About $8K/mo with the inputs below ($0.12/min × 3,000 min/day × 22 days). Justified at $0.50+ revenue per call.

Healthy range: $8-12K/mo, ROI from conversion lift

Inputs used:

callMinutesPerDay: 3,000
architecturePremiumPct: 100
toolCallsPerMinute: 1
voiceNativePerMinuteUsd: 0.1
pipelinePerMinuteUsd: 0.04
avgToolCostUsd: 0.005
memoryCostPerMinUsd: 0.01
telemetryCostPerMinUsd: 0.005
workingDaysPerMonth: 22
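The "$0.50+ revenue per call" threshold can be turned into a break-even call length. The sketch below uses the inputs listed above with the same inferred cost formula (an assumption, not a published one) and counts only stack cost; telephony and human escalation are excluded.

```python
# Inputs copied from the outbound scenario above.
minutes_per_day = 3_000
working_days = 22
voice_native_usd = 0.1        # 100% voice-native, so no pipeline share
tool_calls_per_minute = 1
avg_tool_cost_usd = 0.005
memory_usd = 0.01
telemetry_usd = 0.005

per_minute = (voice_native_usd
              + tool_calls_per_minute * avg_tool_cost_usd
              + memory_usd
              + telemetry_usd)
monthly = per_minute * minutes_per_day * working_days

# At $0.50 revenue per call, how long can a call run before stack cost exceeds revenue?
revenue_per_call_usd = 0.50
break_even_minutes = revenue_per_call_usd / per_minute

print(f"per minute ${per_minute:.3f}, monthly ${monthly:,.0f}")
print(f"break-even call length at ${revenue_per_call_usd:.2f}/call: {break_even_minutes:.1f} min")
```

At $0.12/min, a call earning $0.50 covers its stack cost only if it averages under roughly four minutes.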

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

Voice-native pricing varies by vendor - check current pricing pages.
Doesn't model network/connection costs for telephony integration (Twilio, Plivo, etc.).
Quality differences between vendors are workload-specific.
Doesn't include human-in-the-loop costs (escalations to human agents).

For these, use: Audio Cost for STT/TTS detail. Agentic AI Stack for general agent.

Where to go next

STT + TTS detail: Drill into audio components.
General agent architecture: Voice is one type of agent.
Voice vendor concentration: Higher lock-in than text.

Methodology

Source: https://platform.openai.com/docs/guides/realtime
Extraction: Voice agent stack costs from 4 production deployments (anonymized).
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
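One way to sanity-check the 12-month claim is to convert the 24-month range into an equivalent yearly rate. The sketch assumes prices fall at a constant compounding rate, which real price cuts (usually step changes) do not follow exactly.

```python
def annualized_drop(total_drop: float, months: float) -> float:
    """Equivalent 12-month price drop for a cumulative drop over `months`,
    assuming a constant compounding rate of decline."""
    if not 0.0 <= total_drop < 1.0:
        raise ValueError("total_drop must be in [0, 1)")
    remaining = 1.0 - total_drop
    return 1.0 - remaining ** (12.0 / months)


for drop in (0.40, 0.90):
    yearly = annualized_drop(drop, months=24)
    print(f"{drop:.0%} over 24 months ~ {yearly:.1%} per 12 months")
```

Under that assumption the 40-90% range works out to roughly 23-68% per year, so a budget that is a year old being off by 30% or more is plausible for many vendors.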

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).
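The caching and residency notes above combine into an effective input rate. This is a sketch under stated assumptions: the 90% cache discount is the upper end of the typical range, the surcharge is applied to the whole blended rate, and the 60% cached share is a hypothetical workload figure. Check each vendor's pricing page for how discounts and surcharges actually stack.

```python
def blended_input_price(base_usd_per_mtok: float,
                        cached_share: float,
                        cache_discount: float = 0.90,
                        residency_multiplier: float = 1.0) -> float:
    """Effective USD per 1M input tokens when `cached_share` of tokens hit the cache."""
    if not 0.0 <= cached_share <= 1.0:
        raise ValueError("cached_share must be between 0 and 1")
    cached_price = base_usd_per_mtok * (1.0 - cache_discount)
    blended = cached_share * cached_price + (1.0 - cached_share) * base_usd_per_mtok
    return blended * residency_multiplier


base = 0.25   # $ per 1M input tokens, the GPT-5 Mini rate listed earlier
print(f"no cache:          ${blended_input_price(base, 0.0):.4f}")
print(f"60% cached:        ${blended_input_price(base, 0.6):.4f}")
print(f"60% cached + 1.1x: ${blended_input_price(base, 0.6, residency_multiplier=1.1):.4f}")
```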

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Vendor	Last verified	Source	Snapshot history
Anthropic	2026-06-05	https://www.anthropic.com/pricing	Daily snapshot since Sep 2023 · 578 days captured
Anthropic Docs	2026-06-05	https://platform.claude.com/docs/en/about-claude/pricing	Daily snapshot since Sep 2023 · 578 days captured
OpenAI	2026-06-05	https://openai.com/api/pricing/	Daily snapshot since Sep 2023 · 579 days captured
Google AI	2026-06-05	https://ai.google.dev/gemini-api/docs/pricing	Daily snapshot since Dec 2023 · 554 days captured
Google Vertex	2026-06-05	https://cloud.google.com/vertex-ai/generative-ai/pricing	Daily snapshot since Dec 2023 · 554 days captured
DeepSeek	2026-06-05	https://api-docs.deepseek.com/quick_start/pricing	Daily snapshot since May 2024 · 493 days captured
xAI	2026-06-05	https://x.ai/api	Daily snapshot since Nov 2024 · 411 days captured
Mistral	2026-06-05	https://mistral.ai/pricing	Daily snapshot since Dec 2023 · 552 days captured
Cohere	2026-06-05	https://cohere.com/pricing	Daily snapshot since Sep 2023 · 578 days captured
Voyage AI	2026-06-05	https://docs.voyageai.com/docs/pricing	not listed

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.
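The conventions in the table reduce to a small rule: cached input at 10% of the input rate unless a family uses a different fraction, and batch rates at 50% of standard. The sketch below is one way to express that; the function name and the override mapping are illustrative, and the example prices are placeholders rather than vendor figures.

```python
CACHE_FRACTION_DEFAULT = 0.10   # cached input = 10% of base (90% off)
BATCH_FRACTION = 0.50           # Batch API = 50% of base

# Exceptions listed in the table above (25% instead of 10%).
CACHE_FRACTION_OVERRIDES = {
    "Gemini 2.0 Flash": 0.25,
    "Grok 4 (legacy)": 0.25,
}


def infer_fields(model: str, input_usd: float, output_usd: float) -> dict:
    """Derive the starred fields from published input/output rates (USD per 1M tokens)."""
    cache_fraction = CACHE_FRACTION_OVERRIDES.get(model, CACHE_FRACTION_DEFAULT)
    return {
        "cachedInput": round(input_usd * cache_fraction, 6),
        "batchInput": round(input_usd * BATCH_FRACTION, 6),
        "batchOutput": round(output_usd * BATCH_FRACTION, 6),
    }


# Placeholder prices for illustration only.
print(infer_fields("Example Model", input_usd=1.00, output_usd=2.00))
print(infer_fields("Gemini 2.0 Flash", input_usd=1.00, output_usd=2.00))
```

Not every model in the table has all three fields inferred, so a real pipeline would apply this only where the vendor leaves a field unpublished.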

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail.
