Self-Host vs API Break-even

At what volume does self-hosting beat the API?

Estimate the cost of running Llama 70B or Qwen 72B on rented GPUs. Factor in utilization, ops overhead, and the real costs most teams miss.

Pricing verified: 2026-06-05 (H100 / A100 / cloud GPU rates)

What this calculator does

Estimates the request volume at which self-hosting on a GPU cluster becomes cheaper than vendor API pricing.

Why use it

Self-hosting has fixed costs (GPUs, ops) plus a low marginal cost per request
Break-even is workload-specific: it depends on tokens per request and GPU utilization
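
The two points above can be sketched as a small break-even calculation. This is a minimal sketch, not the calculator's actual formula: it assumes a 30-day month, 730 GPU-hours per month, and zero marginal cost per self-hosted request. Every price and function name below is a hypothetical placeholder.

```python
DAYS_PER_MONTH = 30
HOURS_PER_MONTH = 730  # 8760 hours per year / 12


def api_cost_per_request(in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Dollar cost of one API request; prices are per million tokens."""
    return (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1_000_000


def self_host_fixed_monthly(gpu_hourly, gpu_count, ops_monthly):
    """Fixed monthly cost: rented GPUs running around the clock, plus ops."""
    return gpu_hourly * HOURS_PER_MONTH * gpu_count + ops_monthly


def break_even_requests_per_day(fixed_monthly, cost_per_request):
    """Daily request volume at which the API bill equals the fixed cost."""
    return fixed_monthly / (cost_per_request * DAYS_PER_MONTH)


# Hypothetical inputs: 1000 input + 500 output tokens per request,
# $3 / $15 per million tokens, 2 GPUs at $2.50/hr, $2000/mo ops.
per_request = api_cost_per_request(1000, 500, 3.00, 15.00)
fixed = self_host_fixed_monthly(2.50, 2, 2000.0)
break_even = break_even_requests_per_day(fixed, per_request)
print(f"${per_request:.4f}/request, ${fixed:.0f}/mo fixed, "
      f"break-even at {break_even:.0f} requests/day")
```

Below that daily volume the API is cheaper under these assumptions; above it, self-hosting wins as long as the chosen GPU count can actually serve the load.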

🎛 Your workload

Compare API inference against self-hosted on rented GPUs.

API model to compare against

Input / req

Output / req

Requests / day
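
As a rough illustration of how the three workload inputs combine, the sketch below turns them into monthly token totals and an average output rate. It assumes a 30-day month and perfectly even traffic over the day, which real workloads rarely have. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def monthly_tokens(requests_per_day, in_tokens, out_tokens, days=30):
    """Return (input, output) token totals for one month."""
    total_in = requests_per_day * in_tokens * days
    total_out = requests_per_day * out_tokens * days
    return total_in, total_out


def avg_output_tokens_per_sec(requests_per_day, out_tokens):
    """Average generation rate the serving stack must sustain."""
    return requests_per_day * out_tokens / SECONDS_PER_DAY


# Hypothetical workload: 10,000 requests/day, 1000 in / 500 out tokens each.
total_in, total_out = monthly_tokens(10_000, 1_000, 500)
rate = avg_output_tokens_per_sec(10_000, 500)
print(total_in, total_out, round(rate, 2))
```

Peak traffic is usually well above this average, so the rate is a floor for capacity planning rather than a target.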

🏗 Self-host setup

GPU instance type

* Throughput rates are approximations for 70B-class models. Smaller models typically run 2-4x faster.

Avg utilization: 60%

GPUs sit idle outside peak hours. Realistic production utilization is 40-70%. 100% means the GPUs are always saturated (rare).
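
To see why utilization matters, the sketch below computes the effective cost per million tokens on a rented GPU. It assumes the GPU is billed for every hour but only produces tokens for the utilized fraction. The hourly rate and throughput are hypothetical placeholders, not the calculator's figures.

```python
def cost_per_million_tokens(gpu_hourly, peak_tokens_per_sec, utilization):
    """Effective $ per million tokens when the GPU is busy only part of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_hourly / tokens_per_hour * 1_000_000


# Hypothetical GPU: $2.50/hr, 1000 tokens/sec when saturated.
cost_at_60 = cost_per_million_tokens(2.50, 1000, 0.60)
cost_at_100 = cost_per_million_tokens(2.50, 1000, 1.00)
print(f"60%: ${cost_at_60:.2f}/M tokens, 100%: ${cost_at_100:.2f}/M tokens")
```

Dropping from 100% to 60% utilization raises the effective token price by a factor of 1/0.6, roughly 1.67x, with no change in the hourly rate.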

Ops overhead per month

Engineering time spent on maintenance. Rough guide: 10% of one FTE's salary ($1,500-3,000/mo) for a basic setup, 0.25-0.5 FTE for anything production-grade.
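
The rough guide above is simple arithmetic, sketched here. The annual figures are assumptions back-solved from the $1,500-3,000/mo range (10% of an FTE at that monthly cost implies a fully loaded cost of about $180k-360k per year); they are not stated in the source.

```python
def ops_overhead_monthly(fte_fraction, annual_loaded_cost):
    """Monthly ops cost for a fraction of one engineer's fully loaded cost."""
    return fte_fraction * annual_loaded_cost / 12


# Assumed loaded annual cost range implied by the guide: $180k to $360k.
basic_low = ops_overhead_monthly(0.10, 180_000)
basic_high = ops_overhead_monthly(0.10, 360_000)
prod_low = ops_overhead_monthly(0.25, 180_000)
prod_high = ops_overhead_monthly(0.50, 360_000)
print(f"basic: ${basic_low:.0f}-{basic_high:.0f}/mo, "
      f"production: ${prod_low:.0f}-{prod_high:.0f}/mo")
```

Under the same assumed salary range, the production-grade guidance of 0.25-0.5 FTE is several times the basic-setup figure.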

GPUs needed (redundancy)

One GPU is a single point of failure. Use two or more for production high availability (HA). Multiple GPUs also boost throughput.
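
One way to pick a GPU count is to take the larger of what the load requires and what redundancy requires. The sketch below assumes throughput scales linearly with GPU count and that each GPU should run at or below a target utilization. Both are simplifications, and the names and numbers are hypothetical.

```python
import math


def gpus_needed(demand_tokens_per_sec, per_gpu_tokens_per_sec,
                target_utilization=0.6, min_for_ha=2):
    """GPU count covering both the load and a high-availability floor."""
    usable_per_gpu = per_gpu_tokens_per_sec * target_utilization
    for_load = math.ceil(demand_tokens_per_sec / usable_per_gpu)
    return max(for_load, min_for_ha)


light = gpus_needed(58, 1000)                    # load needs 1, HA floor wins
heavy = gpus_needed(2000, 1000)                  # load needs 4
demo = gpus_needed(58, 1000, min_for_ha=1)       # no redundancy
print(light, heavy, demo)
```

At low volume the HA floor, not the load, sets the GPU count, which is why small workloads rarely reach break-even.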

📡 API-based

per month

🏗 Self-hosted

per month

⚖ Break-even volume

⚠ Real costs most teams miss

Model quality gap. Open-weight Llama 70B is roughly comparable to Claude 3.5 Sonnet / GPT-4o in quality. The gap to the current frontier (Opus 4.7, GPT-5.4) is real and closing slowly. Don't self-host a worse model to save money if quality matters.
Engineering time compounds. Model serving, autoscaling, monitoring, inference framework upgrades, and security patching all need ongoing attention. Budget at least 0.25-0.5 FTE for production.
No automatic new models. API vendors ship new models monthly. Self-hosting means you're stuck on whatever you deployed until someone ports, tests, and redeploys a newer one.
Multi-region = multiply everything. One region works for demos. Production usually needs 2-3 regions for latency and disaster recovery (DR). Multiply GPU count accordingly.
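
The multi-region point can be illustrated with a short sketch. It assumes GPU cost scales linearly with region count while ops overhead stays flat; in practice ops effort usually grows with regions too. All figures are hypothetical.

```python
HOURS_PER_MONTH = 730  # 8760 hours per year / 12


def multi_region_monthly(gpu_hourly, gpus_per_region, regions, ops_monthly):
    """Monthly cost when the same GPU fleet is replicated in every region."""
    gpu_cost = gpu_hourly * HOURS_PER_MONTH * gpus_per_region * regions
    return gpu_cost + ops_monthly  # assumption: ops does not scale with regions


# Hypothetical: 2 GPUs per region at $2.50/hr, $2000/mo ops.
one_region = multi_region_monthly(2.50, 2, 1, 2000.0)
three_regions = multi_region_monthly(2.50, 2, 3, 2000.0)
print(f"1 region: ${one_region:.0f}/mo, 3 regions: ${three_regions:.0f}/mo")
```

Because the API bill does not change with region count, each added region pushes the break-even volume higher.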

💡 Recommendations

🖥 Monthly cost across GPU options

At your volume and utilization. "Total" includes ops overhead and reflects the selected GPU count.

GPU option	$/hr	Tokens/sec	Raw GPU cost	+ Ops	Total
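
The columns above could be filled in along the following lines. This is a sketch under assumptions: the GPU names, hourly rates, and throughput figures are hypothetical placeholders rather than the calculator's verified rates, and a month is taken as 730 hours.

```python
HOURS_PER_MONTH = 730  # 8760 hours per year / 12

# (name, $/hr, tokens/sec): hypothetical values for illustration only.
GPU_OPTIONS = [
    ("gpu-a (hypothetical)", 2.00, 800),
    ("gpu-b (hypothetical)", 3.50, 1500),
]


def cost_rows(options, gpu_count, ops_monthly):
    """Build one table row per GPU option."""
    rows = []
    for name, hourly, tokens_per_sec in options:
        raw = hourly * HOURS_PER_MONTH * gpu_count
        rows.append((name, hourly, tokens_per_sec, raw, ops_monthly, raw + ops_monthly))
    return rows


def format_table(rows):
    """Render rows as tab-separated text using the table's column headers."""
    lines = ["GPU option\t$/hr\tTokens/sec\tRaw GPU cost\t+ Ops\tTotal"]
    for name, hourly, tps, raw, ops, total in rows:
        lines.append(f"{name}\t{hourly:.2f}\t{tps}\t{raw:.0f}\t{ops:.0f}\t{total:.0f}")
    return "\n".join(lines)


rows = cost_rows(GPU_OPTIONS, gpu_count=2, ops_monthly=2000.0)
print(format_table(rows))
```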

Vendor / Model	Field	Why it’s inferred
Anthropic / Claude Sonnet 4.6	cachedInput	Derived at 10% of input rate; Anthropic publishes a 90% cache-hit discount on this tier.
Anthropic / Claude Sonnet 4.5	cachedInput	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic / Claude Sonnet 4.5	batchInput	Derived at 50% of standard input; Anthropic documents a uniform 50% Batch discount.
Anthropic / Claude Sonnet 4.5	batchOutput	Derived at 50% of standard output; Anthropic documents a uniform 50% Batch discount.
Anthropic / Claude Haiku 4.5	cachedInput	Derived at 10% of input rate; Anthropic 90% cache-hit discount convention.
OpenAI / GPT-5.4 Mini	cachedInput	Derived at 10% of input; OpenAI documents an automatic 90% discount on cache hits across the GPT-5.x tier.
OpenAI / GPT-5.4 Nano	cachedInput	Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Nano	batchInput	Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Nano	batchOutput	Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro	cachedInput	Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Pro	batchInput	Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro	batchOutput	Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.2	cachedInput	Derived at 10% of input; no residency uplift.
OpenAI / GPT-5.2	batchInput	Derived at 50% of input.
OpenAI / GPT-5.2	batchOutput	Derived at 50% of output.
OpenAI / GPT-5	cachedInput	Derived at 10% of input.
OpenAI / GPT-5	batchInput	Derived at 50% of input.
OpenAI / GPT-5	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.5 Pro	cachedInput	Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models, so the family convention is used.
OpenAI / GPT-5.5 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5.5 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.2 Pro	cachedInput	Derived at 10% of input; pro-tier convention.
OpenAI / GPT-5.2 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5.2 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5.1	batchInput	Derived at 50% of input.
OpenAI / GPT-5.1	batchOutput	Derived at 50% of output.
OpenAI / GPT-5 Pro	batchInput	Derived at 50% of input.
OpenAI / GPT-5 Pro	batchOutput	Derived at 50% of output.
OpenAI / GPT-5 Nano	cachedInput	Derived at 10% of input.
OpenAI / GPT-5 Nano	batchInput	Derived at 50% of input.
OpenAI / GPT-5 Nano	batchOutput	Derived at 50% of output.
Google / Gemini 3 Flash	cachedInput	Derived at 10% of input; Google caching discount convention ~90%.
Google / Gemini 3.1 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 3.1 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 3.1 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Pro	cachedInput	Derived at 10% of input.
Google / Gemini 2.5 Flash	cachedInput	Derived at 10% of input.
Google / Gemini 2.5 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 2.5 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash	cachedInput	Derived at 25% of input per Google 2.0 family caching rates.
Google / Gemini 2.0 Flash	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite	cachedInput	Derived at 10% of input; Google caching convention.
Google / Gemini 2.0 Flash-Lite	batchInput	Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite	batchOutput	Derived at 50% of output; Google Batch API uniform 50% discount.
xAI / Grok 4 (legacy)	cachedInput	Extrapolated at 25% of base.
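
The derivation rules listed above reduce to two multipliers. The sketch below applies them to a hypothetical price pair. The field names mirror the Field values in the list, but the function, its defaults, and the example prices are assumptions rather than the calculator's code.

```python
BATCH_FRACTION = 0.50           # uniform 50% batch discount
DEFAULT_CACHED_FRACTION = 0.10  # 90% cache-hit discount convention


def infer_rates(input_price, output_price, cached_fraction=DEFAULT_CACHED_FRACTION):
    """Derive unpublished rates ($ per million tokens) from standard prices."""
    return {
        "cachedInput": input_price * cached_fraction,
        "batchInput": input_price * BATCH_FRACTION,
        "batchOutput": output_price * BATCH_FRACTION,
    }


# Hypothetical standard prices: $3.00 input, $15.00 output per million tokens.
standard = infer_rates(3.00, 15.00)
# Entries listed at 25% of input pass a different cached fraction.
quarter_cached = infer_rates(3.00, 15.00, cached_fraction=0.25)
print(standard)
print(quarter_cached)
```

Keeping the cached fraction as a parameter makes the per-model exceptions explicit instead of hiding them in a single hard-coded constant.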

Go deeper

The calculator's an estimate. Want the real number?

Methodology

Primary sources

Inferred values (marked with * in calculator tables)