Self-Host vs API - Where the Break-Even Actually Is
Meet Wei Chen. VP Engineering at a 200-person Series C startup. "We spend $40K/mo on Anthropic. Should we self-host an open-source model on our own GPUs?"
🔥 CFO loves the math. CTO doesn't trust the math.
Self-hosting math is seductive and frequently wrong. A $40K/mo API bill compared to $5K/mo of GPU rental looks obvious. Then add: 1 ML engineer ($25K/mo loaded), 1 SRE ($20K/mo loaded), inference framework licenses, model serving infra, observability, drift monitoring, eval pipeline, version management, security patches. The 'savings' usually disappear once those line items are counted.
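As a quick sanity check on the paragraph above, here is a minimal sketch that adds up only the line items the text puts a price on. The function name is illustrative, and the unpriced items (licenses, observability, eval pipeline, and so on) are deliberately left out, so the result is a floor rather than a total.

```python
# Monthly figures quoted in the text, in USD.
API_BILL = 40_000
GPU_RENTAL = 5_000
ML_ENGINEER = 25_000  # fully loaded
SRE = 20_000          # fully loaded


def self_host_floor(gpu: int, *headcount: int) -> int:
    """Lower bound on self-host cost: GPU rental plus loaded headcount only."""
    return gpu + sum(headcount)


floor = self_host_floor(GPU_RENTAL, ML_ENGINEER, SRE)
print(f"self-host floor: ${floor:,}/mo vs API: ${API_BILL:,}/mo")
print(f"API is cheaper by at least ${floor - API_BILL:,}/mo")
```

Even before tooling is priced in, the two hires alone push the self-host floor above the API bill.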
Wei's $40K Anthropic bill sits in the gray zone where self-host might pencil out. Below $30K/mo: API always wins on cost. Above $80K/mo: self-host usually wins (if utilization is good). In the middle: it depends on workload predictability, privacy needs, and whether you're already paying for ML headcount for other reasons.
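The thresholds above can be written as a small lookup. This is a sketch of the rule of thumb only; the function name and return labels are made up for illustration, and the cutoffs are the $30K and $80K figures from the text.

```python
def cost_zone(api_monthly_usd: float) -> str:
    """Map a monthly API bill to the rule-of-thumb zone described in the text."""
    if api_monthly_usd < 30_000:
        return "api"
    if api_monthly_usd > 80_000:
        return "self-host (if utilization is good)"
    # Between the two thresholds: predictability, privacy, and existing
    # ML headcount decide the outcome, not the bill alone.
    return "depends"


for bill in (10_000, 40_000, 150_000):
    print(f"${bill:,}/mo -> {cost_zone(bill)}")
```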
The privacy multiplier matters. If your workload absolutely requires self-hosting (HIPAA + you can't trust a BAA, on-prem-only enterprise customers, training data you can't expose), the math changes. Privacy isn't a cost saving - it's a deal-blocker that justifies negative ROI on infra.
This calc walks through the real numbers - fully-loaded ops cost, GPU utilization assumptions, and where the line actually is for your scale.
Self-hosting Llama 3 / Mistral on GPUs vs API: where break-even hits. Includes ops cost, capacity utilization, and the privacy multiplier.
Self-host total monthly cost = GPU rental + ops + tooling. Compare to API spend. Break-even is when API ≥ self-host total.
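The break-even rule stated above (self-host total = GPU rental + ops + tooling, break-even when API spend is at least that total) can be sketched as follows. The class and function names are hypothetical, and the example figures are illustrative inputs, not numbers from the calculator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelfHostCosts:
    """Monthly self-host line items in USD."""
    gpu_rental: float
    ops: float
    tooling: float = 0.0

    @property
    def total(self) -> float:
        return self.gpu_rental + self.ops + self.tooling


def compare(api_spend: float, costs: SelfHostCosts) -> dict:
    """Compare monthly API spend against the self-host total."""
    return {
        "self_host_total": costs.total,
        # Positive delta: self-host is cheaper. Negative: API is cheaper.
        "monthly_delta": api_spend - costs.total,
        "breaks_even": api_spend >= costs.total,
    }


# Illustrative inputs only (assumed, not from the source).
example = SelfHostCosts(gpu_rental=10_000, ops=30_000, tooling=5_000)
result = compare(api_spend=40_000, costs=example)
print(result)
```

Keeping tooling as its own field makes it harder to leave out, which is the most common way the comparison ends up flattering self-host.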
Watch utilization carefully. A 1×A100 setup at 30% utilization costs the same as at 90% utilization but processes one-third of the work, so each unit of work costs 3× as much. Low-utilization self-host loses to API every time. Aim for 60%+ before committing.
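To make the utilization point concrete, the sketch below divides a fixed monthly GPU cost by utilization to get the cost of the capacity actually used. The $5,000 monthly figure and the function name are assumptions for illustration; only the 30% and 90% utilization levels come from the text.

```python
def cost_per_utilized_capacity(monthly_cost: float, utilization: float) -> float:
    """Effective monthly cost per fully used GPU-equivalent."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return monthly_cost / utilization


GPU_MONTHLY = 5_000  # hypothetical fixed rental cost, USD/mo

low = cost_per_utilized_capacity(GPU_MONTHLY, 0.30)
high = cost_per_utilized_capacity(GPU_MONTHLY, 0.90)
print(f"30% utilization: ${low:,.0f} per utilized GPU-month")
print(f"90% utilization: ${high:,.0f} per utilized GPU-month")
print(f"ratio: {low / high:.1f}x")
```

The ratio is independent of the rental price: the same hardware bill spread over a third of the work is three times the unit cost.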
Read the headcount line item honestly. 'We'll have an ML engineer manage it part-time' = 0.3 FTE = $8K-12K/mo loaded. Most teams underestimate this 3-5×. Add a real number to your model.
The migration cost is 6-12 months of pain. Beyond inference cost: prompt portability (your Anthropic prompts won't work as-is on Llama), eval pipeline rebuild, latency regressions, edge-case quality drops. Budget the engineering time.
Same calculator, three different team sizes:
$10K API spend vs $5K GPU + $15K ops = $20K self-host. API wins by half. At this scale, self-hosting is engineering theater. Stay on API; invest in prompt/cache optimization.
Healthy range: API wins by ~$10K/mo
$40K API vs $8K GPU + $25K ops = $33K self-host. On these two line items self-host comes out $7K/mo cheaper, but that margin excludes tooling and migration cost and only holds if utilization stays high (about 70% utilization is break-even). Too thin to act on. Wei's call: stay on API for now; revisit when the bill hits $60K+.
Healthy range: within $7K/mo either way - close enough to revisit
$150K API vs $35K GPU + $40K ops = $75K self-host. Saves $75K/mo = $900K/yr. Worth the migration pain. Stable workload + dedicated 2-person team. This is when self-hosting starts making sense.
Healthy range: Self-host saves $75K/mo
Cost isn't the only dimension. Each constraint below can change the recommendation.
API is variable cost; self-host is fixed cost. The break-even depends on your utilization. For predictable workloads >70% utilization at high scale, fixed cost wins. For bursty/unpredictable, variable cost wins every time.
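The fixed-versus-variable framing above implies a break-even volume: the point where per-unit API charges add up to the fixed self-host bill. The sketch below works that out. Every number in it (the fixed cost, the API price per million tokens, the monthly capacity) is a hypothetical input, and the function names are illustrative.

```python
def breakeven_units(fixed_monthly: float, api_price_per_unit: float) -> float:
    """Volume at which variable API cost equals the fixed self-host cost."""
    return fixed_monthly / api_price_per_unit


def required_utilization(
    fixed_monthly: float, api_price_per_unit: float, capacity_units: float
) -> float:
    """Fraction of self-host capacity that must be used to match API cost."""
    return breakeven_units(fixed_monthly, api_price_per_unit) / capacity_units


# Hypothetical inputs: units are millions of tokens per month.
FIXED = 75_000.0      # self-host total, USD/mo
API_PRICE = 10.0      # USD per million tokens
CAPACITY = 10_000.0   # million tokens/mo the cluster can serve

units = breakeven_units(FIXED, API_PRICE)
util = required_utilization(FIXED, API_PRICE, CAPACITY)
print(f"break-even volume: {units:,.0f}M tokens/mo")
print(f"required utilization: {util:.0%}")
```

Below the break-even volume the variable-cost API is cheaper; above it the fixed cost is, which is why bursty workloads that rarely reach that volume favor the API.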
Llama 3.3 and Mistral Large compete with GPT-4-class models but trail GPT-5.5 / Claude Opus 4.7 by 5-15 points on factual benchmarks. The quality gap is shrinking but real. Run evals in your own domain before committing.
SOC 2, HIPAA BAA, FedRAMP variants
Self-host is mandatory for some regulated workloads (FedRAMP High, certain healthcare configurations). API enterprise tier covers most others. Map your specific compliance asks before deciding.
Self-host means data never leaves your infrastructure. API enterprise tier with no-train + BAA + EU residency is sufficient for most workloads but not all. Know which you have.
Self-host can deliver lower latency for the same model size - no network round-trip to vendor. Matters for voice, real-time agents, latency-sensitive UX.
Self-host with Llama/Mistral is the most portable architecture - change cloud providers, change data centers, no vendor migration. API single-vendor is the highest lock-in posture. Worth pricing both.
API outsources MLOps to the vendor. Self-host means owning it. Most teams underestimate self-host MLOps by 2-3×. Budget honestly.
Scenarios for the most common applications, with realistic numbers:
Consumer-scale app with predictable load. High utilization (>70%) achievable. Self-host saves >$1M/year. Mandatory at this scale.
Healthy range: Self-host saves $105K/mo (53%)
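A back-of-envelope check on the scenario above: annualizing the $105K/mo saving, and backing out the monthly totals it implies. The implied API and self-host figures are inferred, not stated in the source, and they assume the 53% is measured against API spend.

```python
MONTHLY_SAVINGS = 105_000   # USD/mo, from the scenario
SAVINGS_FRACTION = 0.53     # assumed to be relative to API spend

annual_savings = MONTHLY_SAVINGS * 12

# Inferred, not stated in the source: totals implied by the 53% figure.
implied_api = MONTHLY_SAVINGS / SAVINGS_FRACTION
implied_self_host = implied_api - MONTHLY_SAVINGS

print(f"annual savings: ${annual_savings:,}")
print(f"implied API spend: ~${implied_api:,.0f}/mo")
print(f"implied self-host total: ~${implied_self_host:,.0f}/mo")
```

The annual figure is consistent with the ">$1M/year" claim in the scenario text.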
Healthcare or government workload requiring on-prem. API isn't an option. Self-host costs $42K vs $25K API - API would be cheaper, BUT compliance forbids it. Privacy multiplier = infinity. Self-host wins by default.
Healthy range: Privacy override - cost is moot
$50K bill but spiky usage - average 25% utilization. Self-host costs $40K but you're paying for idle capacity. API at $50K still wins because you only pay for actual use. Bursty workloads strongly favor API.
Healthy range: API wins despite high spend
Fine-tuning + privacy + capacity control. $60K API vs $53K self-host - slim margin, BUT: FT model on your own infra means full control over training data + checkpoint versions + retraining cadence. Worth the marginal cost for some teams.
Healthy range: Self-host enables FT control
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Fine-Tuning Cost for self-hosted FT detail. Scale Projection for break-even at growth.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs).
10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
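The last-verified fallback described above (latest snapshot first, latest crawler run otherwise) can be sketched like this. It is not the site's implementation: the function operates on plain lists of dates rather than the aicost_pricing_snapshots and aicost_crawler_runs tables, whose schemas are not described, and it assumes both lists already contain only successful entries.

```python
from datetime import date
from typing import Optional, Sequence


def last_verified(
    snapshot_dates: Sequence[date], crawler_run_dates: Sequence[date]
) -> Optional[date]:
    """Return the last-verified date using the snapshot-then-crawler fallback."""
    if snapshot_dates:
        return max(snapshot_dates)
    if crawler_run_dates:
        # No snapshot exists yet: fall back to the latest crawler run.
        return max(crawler_run_dates)
    return None


# Hypothetical dates for illustration.
print(last_verified([date(2025, 1, 1), date(2025, 1, 3)], [date(2025, 1, 5)]))
print(last_verified([], [date(2025, 1, 5)]))
```

Note that a newer crawler run does not override an existing snapshot; the crawler date is only used when there is no snapshot at all.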
The fields below are derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic - Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate - Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic - Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic - Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input - Anthropic documents uniform 50% Batch discount. |
| Anthropic - Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output - Anthropic documents uniform 50% Batch discount. |
| Anthropic - Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate - Anthropic 90% cache-hit discount convention. |
| OpenAI - GPT-5.4 Mini | cachedInput | Derived at 10% of input - OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI - GPT-5.4 Nano | cachedInput | Derived at 10% of input - OpenAI 90% cache-hit convention. |
| OpenAI - GPT-5.4 Nano | batchInput | Derived at 50% of input - OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Nano | batchOutput | Derived at 50% of output - OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Pro | cachedInput | Derived at 10% of input - OpenAI 90% cache-hit convention. |
| OpenAI - GPT-5.4 Pro | batchInput | Derived at 50% of input - OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.4 Pro | batchOutput | Derived at 50% of output - OpenAI Batch API uniform 50% discount. |
| OpenAI - GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift. |
| OpenAI - GPT-5.2 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.2 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 | cachedInput | Derived at 10% of input. |
| OpenAI - GPT-5 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.5 Pro | cachedInput | Derived at 10% of input - OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI - GPT-5.5 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.2 Pro | cachedInput | Derived at 10% of input - pro-tier convention. |
| OpenAI - GPT-5.2 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.2 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5.1 | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5.1 | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 Pro | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI - GPT-5 Nano | cachedInput | Derived at 10% of input. |
| OpenAI - GPT-5 Nano | batchInput | Derived at 50% of input. |
| OpenAI - GPT-5 Nano | batchOutput | Derived at 50% of output. |
| Google - Gemini 3 Flash | cachedInput | Derived at 10% of input - Google caching discount convention ~90%. |
| Google - Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input - Google caching convention. |
| Google - Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input - Google Batch API uniform 50% discount. |
| Google - Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output - Google Batch API uniform 50% discount. |
| Google - Gemini 2.5 Pro | cachedInput | Derived at 10% of input. |
| Google - Gemini 2.5 Flash | cachedInput | Derived at 10% of input. |
| Google - Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input - Google caching convention. |
| Google - Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input - Google Batch API uniform 50% discount. |
| Google - Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output - Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates. |
| Google - Gemini 2.0 Flash | batchInput | Derived at 50% of input - Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash | batchOutput | Derived at 50% of output - Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input - Google caching convention. |
| Google - Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input - Google Batch API uniform 50% discount. |
| Google - Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output - Google Batch API uniform 50% discount. |
| xAI - Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base. |
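The derivation conventions in the table above can be sketched as a small helper: cached input at 10% of the input rate, batch rates at 50%, with the two 25% exceptions the table lists. The function name, the override mapping, and the example rates are illustrative assumptions; the rates passed in are placeholders, not any vendor's published prices.

```python
CACHE_FRACTION_DEFAULT = 0.10  # cached input = 10% of base (90% off)
BATCH_FRACTION = 0.50          # Batch API = 50% of base (50% off)

# Rows in the table that use 25% instead of the 10% default.
CACHE_FRACTION_OVERRIDES = {
    "Gemini 2.0 Flash": 0.25,
    "Grok 4 (legacy)": 0.25,
}


def derive_rates(model: str, input_rate: float, output_rate: float) -> dict:
    """Derive cached and batch rates from base rates (USD per million tokens)."""
    cache_fraction = CACHE_FRACTION_OVERRIDES.get(model, CACHE_FRACTION_DEFAULT)
    return {
        "cachedInput": round(input_rate * cache_fraction, 6),
        "batchInput": round(input_rate * BATCH_FRACTION, 6),
        "batchOutput": round(output_rate * BATCH_FRACTION, 6),
    }


# Placeholder base rates, not real pricing.
print(derive_rates("example-model", input_rate=3.0, output_rate=15.0))
print(derive_rates("Gemini 2.0 Flash", input_rate=1.0, output_rate=4.0))
```

Keeping the exceptions in an explicit mapping makes it obvious which rows depart from the default convention.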
Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for a full audit trail.
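One way to picture the audit trail described above is as a diff between two consecutive daily snapshots, with one changelog entry per changed field. This is a sketch under assumed shapes: the source does not describe the schema of aicost_pricing_snapshots or aicost_price_changelog, so the flat "model/field" keys and the entry layout here are hypothetical.

```python
from typing import Dict, List, Optional


def diff_snapshots(
    old: Dict[str, float], new: Dict[str, float]
) -> List[Dict[str, Optional[object]]]:
    """Return one entry per field whose value differs between two snapshots."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            # Record both old and new values so the change can be audited.
            changes.append({"field": key, "old": before, "new": after})
    return changes


# Hypothetical snapshots keyed by "model/field".
yesterday = {"model-a/input": 3.0, "model-a/output": 15.0}
today = {"model-a/input": 2.5, "model-a/output": 15.0, "model-b/input": 1.0}

for entry in diff_snapshots(yesterday, today):
    print(entry)
```

Fields that appear or disappear show up with `None` on one side, so new and retired models are logged the same way as price changes.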