Agent Loop Cost

What does each agent run actually cost?

Multi-turn agents accumulate context every step. Cost grows quadratically, not linearly. Model it before you get a $20K overnight bill.

Pricing verified: 2026-06-05 🔴 Highest-risk AI cost pattern

📅 Schedule a meeting via AvatarVA ✉️ Email [email protected]

What this calculator does

Model the cost of a multi-turn agent loop — tool calls, context accumulation, retries — where a naive estimate misses 50-80%.

Why use it

Agents have a hidden cost explosion: context grows each turn, and every turn pays for the full accumulated context
Tool-use calls multiply: a 10-step agent loop is not 10x a single call cost — it's often 30x
Prompt caching is the single biggest lever for agent cost — model it here

Enter key values

Agent turns per task, tokens per turn, task completion rate, tasks/day.

Model context growth

Pick linear/quadratic context accumulation. Most agents are quadratic — watch out.

Read cost per task

Shows cost explosion from turn 1 → turn N. Total agent cost vs equivalent single-shot.

Act on it

If turns > 10, prompt caching saves dramatically. If context > 100K, switch to summarization pattern.

Enter agent parameters

Turns per task = average tool-call loop depth. Simple agents: 3-5 turns. ReAct agents: 10-20. Planning agents: can be 50+. Tokens per turn = new content added each turn (tool result + reasoning). Task completion rate = % of tasks that succeed (failed tasks retry, doubling cost). Tasks/day drives scale.

Model context growth

Linear growth = agent summarizes context each turn (rare). Quadratic growth = full history passed each turn (common). In quadratic mode, turn N pays for tokens from turns 1..N, so total cost scales as N². A 20-turn agent with 500 fresh tokens/turn ends up passing 10K tokens per call by turn 20 — and you paid for that 10K on every call from turn 1.

Interpret the cost breakdown

Per-task cost shows the progression: turn 1 is cheap, turn N is expensive. Total per-task shows the integral. Monthly is per-task × tasks/day × 30. Comparison to single-shot shows how much "multi-turn premium" you're paying. Prompt cache savings projection shows how much caching the stable prefix (system prompt + early turns) can reduce cost.

Reduce agent cost

Biggest wins: (1) enable prompt caching on the stable context — see Prompt Cache ROI. (2) summarize context every N turns instead of full history. (3) route turns to cheaper models when not reasoning-heavy (Multi-Model Router). (4) watch out for retries — a 20% failure rate doubles total cost if tasks auto-retry.

🤖 Your agent setup

Each "task" runs a multi-turn loop until done. Context accumulates across turns.

Model (the reasoning one) Flagship models (Opus, GPT-5.4) are 3-5x more expensive - is that necessary for every turn?

System prompt + tool definitions (tokens) Repeated across every turn. Cache this or pay for it N times.

Initial user task (tokens)

Avg turns per task 10 Each turn = 1 model call + 1 tool execution. Coding agents avg 8-20, research agents 15-40.

Avg tool result size (tokens) This gets added to context every turn. File reads, API responses, search results.

Avg assistant reply per turn (tokens)

Agent tasks per day

System prompt cached (90% off) Caches the big system prompt + tool defs. Often cuts agent cost 40-60%.

Cost per completed agent task

Daily spend

Monthly spend

Total tokens / task

Context at final turn

largest single call

📈 Cost accumulates per turn

Why agents get expensive: context grows every turn.

💡 Cost optimization levers

📊 Same workload, different reasoning model

Agent tasks amplify price differences - every turn pays the premium.

Model	$/task	$/day	$/month	$/month (cached)

Agent guardrails playbook → Cache savings calculator → Agent architecture audit →

🎯 Use this result to

🤖 Size your agent budget — Real per-task cost including context bloat from each turn. Most teams underestimate 3-5x.
📈 Spot turn-cost growth — Each turn carries forward all prior context. By turn 8 input cost is 3x turn 1.
🔧 Tune tool overhead — Bloated tool results are silent cost killers. See exactly how much they add per turn.
🔌 Integrate with your AI agents — MCP available for agentic workflow integration. Surface live cost intelligence.

📅 Schedule a call to apply this to your workload

Vendor / Model

Field

Why it’s inferred

Anthropic — Claude Sonnet 4.6

cachedInput

Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.

Anthropic — Claude Sonnet 4.5

cachedInput

Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.

Anthropic — Claude Sonnet 4.5

batchInput

Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.

Anthropic — Claude Sonnet 4.5

batchOutput

Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.

Anthropic — Claude Haiku 4.5

cachedInput

Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.

OpenAI — GPT-5.4 Mini

cachedInput

Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.

OpenAI — GPT-5.4 Nano

cachedInput

Derived at 10% of input — OpenAI 90% cache-hit convention.

OpenAI — GPT-5.4 Nano

batchInput

Derived at 50% of input — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.4 Nano

batchOutput

Derived at 50% of output — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.4 Pro

cachedInput

Derived at 10% of input — OpenAI 90% cache-hit convention.

OpenAI — GPT-5.4 Pro

batchInput

Derived at 50% of input — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.4 Pro

batchOutput

Derived at 50% of output — OpenAI Batch API uniform 50% discount.

OpenAI — GPT-5.2

cachedInput

Derived at 10% of input; no residency uplift.

OpenAI — GPT-5.2

batchInput

Derived at 50% of input.

OpenAI — GPT-5.2

batchOutput

Derived at 50% of output.

OpenAI — GPT-5

cachedInput

Derived at 10% of input.

OpenAI — GPT-5

batchInput

Derived at 50% of input.

OpenAI — GPT-5

batchOutput

Derived at 50% of output.

OpenAI — GPT-5.5 Pro

cachedInput

Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.

OpenAI — GPT-5.5 Pro

batchInput

Derived at 50% of input.

OpenAI — GPT-5.5 Pro

batchOutput

Derived at 50% of output.

OpenAI — GPT-5.2 Pro

cachedInput

Derived at 10% of input — pro-tier convention.

OpenAI — GPT-5.2 Pro

batchInput

Derived at 50% of input.

OpenAI — GPT-5.2 Pro

batchOutput

Derived at 50% of output.

OpenAI — GPT-5.1

batchInput

Derived at 50% of input.

OpenAI — GPT-5.1

batchOutput

Derived at 50% of output.

OpenAI — GPT-5 Pro

batchInput

Derived at 50% of input.

OpenAI — GPT-5 Pro

batchOutput

Derived at 50% of output.

OpenAI — GPT-5 Nano

cachedInput

Derived at 10% of input.

OpenAI — GPT-5 Nano

batchInput

Derived at 50% of input.

OpenAI — GPT-5 Nano

batchOutput

Derived at 50% of output.

Google — Gemini 3 Flash

cachedInput

Derived at 10% of input — Google caching discount convention ~90%.

Google — Gemini 3.1 Flash-Lite

cachedInput

Derived at 10% of input — Google caching convention.

Google — Gemini 3.1 Flash-Lite

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 3.1 Flash-Lite

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

Google — Gemini 2.5 Pro

cachedInput

Derived at 10% of input.

Google — Gemini 2.5 Flash

cachedInput

Derived at 10% of input.

Google — Gemini 2.5 Flash-Lite

cachedInput

Derived at 10% of input — Google caching convention.

Google — Gemini 2.5 Flash-Lite

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 2.5 Flash-Lite

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash

cachedInput

Derived at 25% of input per Google 2.0 family caching rates.

Google — Gemini 2.0 Flash

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash-Lite

cachedInput

Derived at 10% of input — Google caching convention.

Google — Gemini 2.0 Flash-Lite

batchInput

Derived at 50% of input — Google Batch API uniform 50% discount.

Google — Gemini 2.0 Flash-Lite

batchOutput

Derived at 50% of output — Google Batch API uniform 50% discount.

xAI — Grok 4 (legacy)

cachedInput

Extrapolated at 25% of base.

What does each agent run actually cost?

Go deeper

The calculator's an estimate. Want the real number?

Methodology

Primary sources

Inferred values (marked with * in calculator tables)