Self-Host vs API - Where the Break-Even Actually Is

Meet Wei Chen. VP Engineering at a 200-person Series C startup. "We spend $40K/mo on Anthropic. Should we self-host an open-source model on our own GPUs?"

🔥 CFO loves the math. CTO doesn't trust the math.

The story

Self-hosting math is seductive and frequently wrong. A $40K/mo API bill compared to $5K/mo of GPU rental looks obvious. Then add one ML engineer ($25K/mo loaded), one SRE ($20K/mo), inference framework licenses, model-serving infrastructure, observability, drift monitoring, an eval pipeline, version management, and security patches. GPU rental plus the two salaries already comes to $50K/mo, so the 'savings' usually disappear before the remaining line items are even counted.
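To make the arithmetic in the story concrete, here is a minimal sketch that sums only the line items the text actually prices. The variable names are illustrative, and the unpriced items (licenses, observability, and so on) are deliberately left out.

```python
# Monthly figures quoted in the story (USD).
api_bill = 40_000
gpu_rental = 5_000
ml_engineer = 25_000  # loaded cost
sre = 20_000          # loaded cost

# Licenses, serving infra, observability, etc. are not priced in the text,
# so they are omitted here; including them would only widen the gap.
self_host_total = gpu_rental + ml_engineer + sre
gap = self_host_total - api_bill

print(f"Self-host: ${self_host_total:,}/mo vs API: ${api_bill:,}/mo")
print(f"Self-host costs ${gap:,}/mo more before any tooling is counted")
```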

Wei's $40K Anthropic bill sits near the low end of the range where self-hosting might pencil out. Below $30K/mo: API always wins. Above $80K/mo: self-host usually wins (if utilization is good). In the middle: depends on workload predictability, privacy needs, and whether you're already paying for ML headcount for other reasons.

The privacy multiplier matters. If your workload absolutely requires self-hosting (HIPAA + you can't trust a BAA, on-prem-only enterprise customers, training data you can't expose), the math changes. Privacy isn't a cost saving - it's a deal-blocker that justifies negative ROI on infra.

This calc walks through the real numbers - fully-loaded ops cost, GPU utilization assumptions, and where the line actually is for your scale.

About this calculator: Self-Host vs API - Where the Break-Even Actually Is

Self-hosting Llama 3 / Mistral on GPUs vs API: where break-even hits. Includes ops cost, capacity utilization, and the privacy multiplier.

Inputs you control

Input	Impact on result	Range	Typical
Current API spend ($/mo)	Your current vendor bill. The number self-host would replace.	1K – 500K	40000
GPU rental cost ($/mo)	AWS p4d/p5, Lambda Labs, or similar. Includes networking + storage. ~$2-4/hr per A100/H100 × 24/7 utilization.	500 – 200K	8000
ML/SRE loaded cost ($/mo)	Time of ML engineer + SRE allocated to self-hosted infra. Ranges from 0.3 FTE (~$8K) for shared time to 2 FTE (~$50K) for dedicated team.	0 – 80K	25000
Expected GPU utilization (%)	How busy your GPUs will be on average. Most teams achieve 30-60% in production. Below 40% = wasted capacity. Above 70% = capacity ceiling risk.	10 – 95	50

Outputs computed for you · model: `self_host`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12

Reading your result

Self-host total monthly cost = GPU rental + ops + tooling. Compare to API spend. Break-even is when API ≥ self-host total.
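The comparison above can be sketched as a small function. This is an illustration under stated assumptions, not the calculator's actual `self_host` model: the function names are hypothetical, tooling defaults to zero because the calculator exposes no tooling input, and utilization and migration cost are not modeled.

```python
def self_host_monthly(gpu_rental: float, ops: float, tooling: float = 0.0) -> float:
    """Self-host total = GPU rental + ops + tooling (all USD per month)."""
    return gpu_rental + ops + tooling


def compare(api_spend: float, gpu_rental: float, ops: float, tooling: float = 0.0) -> dict:
    """Compare monthly API spend against the self-host total."""
    total = self_host_monthly(gpu_rental, ops, tooling)
    delta = api_spend - total  # positive means self-host is cheaper on paper
    if delta > 0:
        winner = "self-host"
    elif delta < 0:
        winner = "api"
    else:
        winner = "break-even"
    return {
        "self_host_monthly": total,
        "self_host_annual": total * 12,
        "monthly_delta": delta,
        "winner": winner,
    }


# "Typical" values from the inputs table: $40K API, $8K GPU, $25K ML/SRE.
result = compare(api_spend=40_000, gpu_rental=8_000, ops=25_000)
print(result)
```

With the typical inputs, self-host comes out ahead on paper. That is exactly the kind of result the utilization and headcount checks in this section are meant to stress-test.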

Watch utilization carefully. A 1×A100 setup at 30% utilization costs the same as at 90% utilization - but processes only one-third of the work. Low-utilization self-host loses to API every time. Aim for 60%+ before committing.
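One way to see the utilization effect is to divide the hourly GPU rate by the fraction of time the GPU is doing useful work. The sketch below assumes a $3/hr rate, chosen from the middle of the ~$2-4/hr range in the inputs table. The function name is hypothetical.

```python
def cost_per_utilized_hour(hourly_rate: float, utilization_pct: float) -> float:
    """Effective cost of one hour of useful GPU work at a given utilization."""
    if not 0 < utilization_pct <= 100:
        raise ValueError("utilization_pct must be in (0, 100]")
    return hourly_rate / (utilization_pct / 100)


rate = 3.0  # assumed $/hr, not a quoted price
for pct in (30, 60, 90):
    print(f"{pct}% utilization -> ${cost_per_utilized_hour(rate, pct):.2f} per useful hour")

# The same hardware is three times as expensive per unit of work at 30% as at 90%.
ratio = cost_per_utilized_hour(rate, 30) / cost_per_utilized_hour(rate, 90)
```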

Read the headcount line item honestly. 'We'll have an ML engineer manage it part-time' = 0.3 FTE = $8K-12K/mo loaded. Most teams underestimate this 3-5×. Add a real number to your model.
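A quick way to put "a real number" in the model is to price the planned FTE fraction and then apply the 3-5× underestimate as a stress range. This sketch reads the 3-5× figure as "actual cost is three to five times the planned cost", which is an interpretation, and uses the $25K/mo loaded figure from the story as the per-FTE cost. The function name is hypothetical.

```python
def headcount_cost_range(planned_fte: float, loaded_per_fte: float,
                         low_mult: float = 3.0, high_mult: float = 5.0):
    """Return (planned, low, high) monthly headcount cost in USD."""
    planned = planned_fte * loaded_per_fte
    return planned, planned * low_mult, planned * high_mult


# 0.3 FTE of a $25K/mo loaded engineer, stress-tested at 3x and 5x.
planned, low, high = headcount_cost_range(planned_fte=0.3, loaded_per_fte=25_000)
print(f"Planned: ${planned:,.0f}/mo, realistic range: ${low:,.0f}-${high:,.0f}/mo")
```

The upper end of that range lands in the same territory as the 1-2 FTE dedicated-team figures in the inputs table.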

The migration cost is 6-12 months of pain. Beyond inference cost: prompt portability (your Anthropic prompts won't work as-is on Llama), eval pipeline rebuild, latency regressions, edge-case quality drops. Budget the engineering time.

What "good" looks like:

API wins clearly: <$30K/mo API spend. Self-host fixed costs dominate.
Toss-up: $30K-80K/mo. Depends on utilization, privacy needs, existing ML headcount.
Self-host wins (if executed): >$80K/mo with stable workload + dedicated team.
Privacy override: regulated industries where API isn't an option, regardless of cost.
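The bands above can be written as a small classifier. This is a sketch only: the $30K and $80K cut-offs come from the list, while the function name and the `api_not_allowed` flag are illustrative.

```python
def classify(api_spend_monthly: float, api_not_allowed: bool = False) -> str:
    """Map monthly API spend (USD) to the rule-of-thumb bands in this guide."""
    if api_not_allowed:
        # Regulated or on-prem-only workloads: cost is not the deciding factor.
        return "privacy override"
    if api_spend_monthly < 30_000:
        return "API wins clearly"
    if api_spend_monthly <= 80_000:
        return "toss-up"
    return "self-host wins (if executed)"


for spend in (10_000, 40_000, 200_000):
    print(f"${spend:,}/mo -> {classify(spend)}")
```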

Cheapest API alternatives (compare to self-host)

Verified 20 hours ago

Rank	Model	Input ($ per 1M tokens)	Output ($ per 1M tokens)
1	GPT-5 Mini	$0.250	$2.00
2	Command	$1.00	$2.00
3	devstral-2	$0.400	$2.00

Three real scenarios

Same calculator, three different team sizes.

$10,000 / month ≈ $120,000 / year

$10K API spend vs $5K GPU + $15K ops = $20K self-host. The API costs half as much. At this scale, self-hosting is engineering theater. Stay on the API; invest in prompt and cache optimization.

Healthy range: API wins by ~$10K/mo

Inputs used:

currentApiSpendMonthlyUsd: 10,000
gpuMonthlyRentalUsd: 5,000
mlOpsLoadedCostMonthlyUsd: 15,000
gpuUtilizationPct: 40

Trade-offs

Cost isn't the only dimension.

Best fit for "cost":

API: pay-per-use. Optimal for variable workloads.
Self-host: fixed cost. Optimal for predictable high-volume workloads.

API is variable cost; self-host is fixed cost. The break-even depends on your utilization. For predictable workloads running above 70% utilization at high scale, fixed cost wins. For bursty or unpredictable workloads, variable cost wins every time.
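The fixed-versus-variable trade-off has a simple crossover point: the request volume at which pay-per-use spend equals the fixed self-host bill. The sketch below assumes a flat $0.01 per request (a hypothetical figure, not a quoted price) and uses the typical $8K GPU + $25K ops inputs as the fixed cost. It ignores the capacity ceiling, so it only holds while the fixed hardware can actually serve the volume.

```python
def break_even_requests(api_cost_per_request: float, self_host_fixed: float) -> float:
    """Monthly request volume at which API spend equals the fixed self-host cost."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    return self_host_fixed / api_cost_per_request


def cheaper_option(requests: float, api_cost_per_request: float, self_host_fixed: float) -> str:
    """Return which option is cheaper at a given monthly volume."""
    api_cost = requests * api_cost_per_request
    return "self-host" if api_cost > self_host_fixed else "api"


fixed = 8_000 + 25_000  # typical GPU rental + ML/SRE cost from the inputs table
per_request = 0.01      # assumed blended API cost per request (hypothetical)

volume = break_even_requests(per_request, fixed)
print(f"Break-even at roughly {volume:,.0f} requests/month")
print(cheaper_option(1_000_000, per_request, fixed))
print(cheaper_option(5_000_000, per_request, fixed))
```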

Use cases

Pre-loaded scenarios for the most common applications.

$110,000 / month ≈ $1,320,000 / year

Consumer-scale app with predictable load. High utilization (>70%) is achievable. Self-host saves >$1M/year. Mandatory at this scale.

Healthy range: Self-host saves $105K/mo (53%)

Inputs used:

currentApiSpendMonthlyUsd: 200,000
gpuMonthlyRentalUsd: 45,000
mlOpsLoadedCostMonthlyUsd: 50,000
gpuUtilizationPct: 75
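The savings figures for this scenario can be checked from the listed inputs. This minimal sketch treats self-host cost as GPU rental plus ops, with no tooling term. It reproduces the $105K/mo and roughly 53% savings figures, but makes no attempt to reproduce the headline monthly figure, whose formula the page does not spell out. The function name is hypothetical.

```python
def savings_summary(api_spend: float, gpu_rental: float, ops: float) -> dict:
    """Monthly/annual savings of self-host vs API, in USD and percent of API spend."""
    self_host = gpu_rental + ops
    monthly = api_spend - self_host
    return {
        "self_host_monthly": self_host,
        "monthly_savings": monthly,
        "annual_savings": monthly * 12,
        "savings_pct": monthly / api_spend * 100,
    }


summary = savings_summary(api_spend=200_000, gpu_rental=45_000, ops=50_000)
print(f"Saves ${summary['monthly_savings']:,.0f}/mo "
      f"({summary['savings_pct']:.1f}%), ${summary['annual_savings']:,.0f}/yr")
```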

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

GPU rental cost is heuristic - actual depends on cloud provider, term commitment, region.
Doesn't model GPU procurement timeline (capacity scarcity for H100s in some regions).
ML/SRE cost is fully-loaded average - actual headcount cost varies by location.
Doesn't model the quality gap between open-source and frontier models for your specific workload.
Migration cost (prompt portability, eval rebuild, edge-case fixes) isn't modeled - typically 3-6 months at this scale.

For these, use: Fine-Tuning Cost for self-hosted FT detail. Scale Projection for break-even at growth.

Where to go next

Will you cross break-even with growth? →

Project bill at 10×, see when self-host becomes the right call.

Self-host as lock-in hedge →

How much vendor exposure does self-host eliminate?

Full TCO including migration cost →

7-step wizard with sensitivity analysis.

Methodology

Source: https://aws.amazon.com/ec2/instance-types/p5/
Extraction: GPU pricing from major cloud providers (AWS, GCP, Azure, Lambda Labs) verified quarterly.
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).
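As an illustration of how per-1M-token rates and the discounts above turn into a bill, here is a hedged sketch. The function and parameter names are hypothetical, the token volumes are made up, and the rates are the GPT-5 Mini figures listed earlier on this page. Whether a given provider lets caching and Batch discounts combine is not stated here, so the batch discount defaults to off.

```python
PER_MILLION = 1_000_000


def token_cost(input_tokens: float, output_tokens: float,
               input_rate: float, output_rate: float, *,
               cached_share: float = 0.0, cache_discount: float = 0.90,
               batch_discount: float = 0.0) -> float:
    """USD cost for a token volume, given rates in USD per 1M tokens.

    cached_share: fraction of input tokens served from the prompt cache.
    cache_discount: discount on cached input (0.90 = 90% off, an assumption).
    batch_discount: discount applied to the whole bill (0.50 for Batch mode).
    """
    cached = input_tokens * cached_share
    fresh = input_tokens - cached
    cost = (fresh * input_rate
            + cached * input_rate * (1 - cache_discount)
            + output_tokens * output_rate) / PER_MILLION
    return cost * (1 - batch_discount)


# Hypothetical monthly volume: 100M input tokens, 20M output tokens.
base = token_cost(100e6, 20e6, input_rate=0.25, output_rate=2.00)
half_cached = token_cost(100e6, 20e6, input_rate=0.25, output_rate=2.00, cached_share=0.5)
batched = token_cost(100e6, 20e6, input_rate=0.25, output_rate=2.00, batch_discount=0.5)
print(f"standard ${base:.2f}, half cached ${half_cached:.2f}, batch ${batched:.2f}")
```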

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Vendor	Last verified	Source URL	Snapshot history
Anthropic	2026-06-05	https://www.anthropic.com/pricing	Daily snapshot since Sep 2023 · 578 days captured
Anthropic Docs	2026-06-05	https://platform.claude.com/docs/en/about-claude/pricing	Daily snapshot since Sep 2023 · 578 days captured
OpenAI	2026-06-05	https://openai.com/api/pricing/	Daily snapshot since Sep 2023 · 579 days captured
Google AI	2026-06-05	https://ai.google.dev/gemini-api/docs/pricing	Daily snapshot since Dec 2023 · 554 days captured
Google Vertex	2026-06-05	https://cloud.google.com/vertex-ai/generative-ai/pricing	Daily snapshot since Dec 2023 · 554 days captured
DeepSeek	2026-06-05	https://api-docs.deepseek.com/quick_start/pricing	Daily snapshot since May 2024 · 493 days captured
xAI	2026-06-05	https://x.ai/api	Daily snapshot since Nov 2024 · 411 days captured
Mistral	2026-06-05	https://mistral.ai/pricing	Daily snapshot since Dec 2023 · 552 days captured
Cohere	2026-06-05	https://cohere.com/pricing	Daily snapshot since Sep 2023 · 578 days captured
Voyage AI	2026-06-05	https://docs.voyageai.com/docs/pricing

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.
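The derivation rules in the table reduce to two multipliers on the published rates. The sketch below shows that convention with hypothetical base rates. The function name is illustrative, while the field names mirror those used in the table.

```python
def infer_rates(input_rate: float, output_rate: float,
                cache_fraction: float = 0.10, batch_fraction: float = 0.50) -> dict:
    """Derive cached and batch rates (USD per 1M tokens) from published base rates."""
    return {
        "cachedInput": input_rate * cache_fraction,
        "batchInput": input_rate * batch_fraction,
        "batchOutput": output_rate * batch_fraction,
    }


# Hypothetical base rates, for illustration only.
default_rates = infer_rates(input_rate=3.00, output_rate=15.00)

# Rows that use a 25% cached rate (e.g. Gemini 2.0 Flash) override the fraction.
quarter_cache = infer_rates(input_rate=3.00, output_rate=15.00, cache_fraction=0.25)
print(default_rates)
print(quarter_cache)
```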

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →
