Self-Host vs API Break-even

At what volume does self-hosting beat the API?

Compare running Llama 70B or Qwen 72B on rented GPUs against vendor API pricing. Factor in utilization, ops overhead, and the real costs most teams miss.

Pricing verified: 2026-06-05 · H100 / A100 / cloud GPU rates

What this calculator does

It estimates the monthly volume at which self-hosting on a GPU cluster becomes cheaper than vendor API pricing.

Why use it
  • Self-hosting has fixed costs (GPUs, ops) plus a low marginal cost per request
  • Break-even is workload-specific: it depends on tokens per request and GPU utilization
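The fixed-versus-marginal trade-off above can be sketched in a few lines. This is a minimal illustration, not the calculator's actual formula: it assumes a single blended API price per million tokens, and the function and parameter names are hypothetical.

```python
def break_even_tokens_per_month(
    fixed_monthly_usd: float,
    api_usd_per_million: float,
    selfhost_marginal_usd_per_million: float = 0.0,
) -> float:
    """Monthly token volume at which API spend equals self-host spend.

    Assumption: the API is billed at one blended rate per million tokens,
    and self-hosting costs a fixed monthly amount plus a small marginal rate.
    """
    saving_per_million = api_usd_per_million - selfhost_marginal_usd_per_million
    if saving_per_million <= 0:
        # Self-hosting never catches up if its marginal cost is not lower.
        return float("inf")
    return fixed_monthly_usd / saving_per_million * 1_000_000


def break_even_requests_per_month(break_even_tokens: float, tokens_per_request: float) -> float:
    """Convert a token volume into requests for a given workload shape."""
    return break_even_tokens / tokens_per_request


if __name__ == "__main__":
    # Hypothetical inputs: $5,000/month fixed cost, $5 per million API tokens.
    tokens = break_even_tokens_per_month(fixed_monthly_usd=5_000.0, api_usd_per_million=5.0)
    print(f"{tokens:,.0f} tokens/month")
    print(f"{break_even_requests_per_month(tokens, 2_000):,.0f} requests/month")
```

Because the break-even is expressed in tokens first, the same fixed cost yields very different request counts for short chat turns and long document workloads.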
🎛 Your workload

Compare API inference against self-hosting on rented GPUs.

🏗 Self-host setup
* Throughput rates are approximations for 70B-class models. Smaller models run 2-4x faster.
Utilization: GPUs sit idle outside peak hours. Realistic production utilization is 40-70%; 100% means always saturated (rare).
Ops overhead: engineering time to maintain the deployment. Rough guide: 10% of one FTE salary ($1,500-3,000/mo) for a basic setup, 0.25-0.5 FTE for anything production-grade.
GPU count: one GPU is a single point of failure. Use 2+ for production HA. Multiple GPUs also boost throughput.
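To see how these inputs interact, here is a rough sketch of the monthly cost and token capacity of a self-hosted setup. It assumes 730 billable hours per month and throughput that scales linearly with GPU count. Both are simplifications, and the class and field names are hypothetical rather than the calculator's own.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumption: average month (8,760 hours / 12)


@dataclass
class SelfHostSetup:
    gpu_usd_per_hour: float        # rental rate per GPU
    tokens_per_sec_per_gpu: float  # approximate throughput per GPU
    gpu_count: int
    utilization: float             # fraction of time doing useful work, 0.0-1.0
    ops_usd_per_month: float       # engineering time to maintain the setup

    def monthly_cost(self) -> float:
        # Rented GPUs are billed for every hour, whether busy or idle.
        gpu_cost = self.gpu_usd_per_hour * HOURS_PER_MONTH * self.gpu_count
        return gpu_cost + self.ops_usd_per_month

    def monthly_token_capacity(self) -> float:
        # Only the utilized share of the month produces tokens.
        seconds = HOURS_PER_MONTH * 3600
        return self.tokens_per_sec_per_gpu * seconds * self.gpu_count * self.utilization

    def usd_per_million_tokens(self) -> float:
        return self.monthly_cost() / self.monthly_token_capacity() * 1_000_000


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the arithmetic.
    setup = SelfHostSetup(2.0, 1000.0, 2, 0.5, 2000.0)
    print(f"${setup.monthly_cost():,.0f}/month")
    print(f"${setup.usd_per_million_tokens():.2f} per million tokens")
```

Halving utilization doubles the effective cost per token, which is why the 40-70% range matters more than the headline GPU rate.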
📡 API-based (per month)
🏗 Self-hosted (per month)
⚖ Break-even volume
⚠ Real costs most teams miss
  • Model quality gap. Open-weight Llama 70B is roughly at Claude 3.5 Sonnet / GPT-4o quality. The gap to the current frontier (Opus 4.7, GPT-5.4) is real and closing slowly. Don't self-host a worse model to save money if quality matters.
  • Engineering time compounds. It covers model serving, autoscaling, monitoring, inference framework upgrades, and security patching. Budget 0.25-0.5 FTE minimum for production.
  • No automatic new models. API vendors ship new models monthly. Self-hosting means you're stuck on whatever you deployed until someone ports, tests, and redeploys a newer one.
  • Multi-region = multiply everything. One region works for demos. Production usually needs 2-3 regions for latency and disaster recovery (DR). Multiply GPU count accordingly.
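As a worked illustration of the engineering-time and multi-region items, the sketch below scales a single-region GPU bill by region count and adds an FTE-based ops cost. The FTE monthly cost is a hypothetical input, and keeping ops constant across regions is a simplifying assumption that likely understates the real figure.

```python
def ops_cost_per_month(fte_fraction: float, fte_usd_per_month: float) -> float:
    """Ops overhead as a fraction of one engineer's fully loaded monthly cost."""
    return fte_fraction * fte_usd_per_month


def production_fixed_cost(
    gpu_usd_per_month_per_region: float,
    regions: int,
    fte_fraction: float,
    fte_usd_per_month: float,
) -> float:
    """Fixed monthly cost once regions and engineering time are included.

    Assumption: GPU spend multiplies with region count; ops cost does not.
    """
    gpu_total = gpu_usd_per_month_per_region * regions
    return gpu_total + ops_cost_per_month(fte_fraction, fte_usd_per_month)


if __name__ == "__main__":
    # Hypothetical: $2,920/month of GPUs per region, 0.25 FTE at $15,000/month.
    for regions in (1, 2, 3):
        total = production_fixed_cost(2920.0, regions, 0.25, 15_000.0)
        print(f"{regions} region(s): ${total:,.0f}/month")
```

The resulting figure is the fixed cost that a break-even calculation should start from, rather than the raw single-region GPU bill.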
💡 Recommendations
    🖥 Monthly cost across GPU options

    At your volume + utilization. "Total" includes ops overhead + GPU count.

    GPU option | $/hr | Tokens/sec | Raw GPU cost | + Ops | Total

    The calculator's an estimate. Want the real number?

    A 5-day Quickscan ($1,500) reviews your actual usage across every pillar — financial, reliability, governance, privacy, MLOps, observability — and returns a concrete savings plan.

    📖 Data sources & methodology
    161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

    Methodology

    • All prices are USD per 1 million tokens, current as of 2026-06-05.
    • Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
    • Batch API discounts are 50% off standard rates across providers that offer Batch mode.
    • Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
    • Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
    • Long-context pricing tiers apply when input exceeds the model's threshold.
    • Embedding prices are input-only (no output tokens generated).
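The pricing conventions listed above can be combined into a per-request cost estimate. The following is a sketch under stated assumptions: discounts are applied multiplicatively, a 90% cache discount and a 50% batch discount are used as defaults, and the residency surcharge is an optional multiplier. How vendors actually stack these discounts varies, and all names here are hypothetical.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_usd_per_million: float,
    output_usd_per_million: float,
    cached_fraction: float = 0.0,       # share of input tokens served from cache
    cache_discount: float = 0.9,        # assumption: 90% off cached input
    batch: bool = False,
    batch_discount: float = 0.5,        # 50% off when Batch mode is used
    residency_multiplier: float = 1.0,  # e.g. 1.1 for a regional surcharge
) -> float:
    """Estimate the USD cost of one request from per-million-token rates."""
    fresh = input_tokens * (1 - cached_fraction)
    cached = input_tokens * cached_fraction
    input_cost = (fresh + cached * (1 - cache_discount)) * input_usd_per_million / 1_000_000
    output_cost = output_tokens * output_usd_per_million / 1_000_000
    total = input_cost + output_cost
    if batch:
        total *= 1 - batch_discount
    return total * residency_multiplier


if __name__ == "__main__":
    # Hypothetical rates: $2 input and $10 output per million tokens.
    print(request_cost_usd(3_000, 500, 2.0, 10.0))
    print(request_cost_usd(3_000, 500, 2.0, 10.0, cached_fraction=0.8, batch=True))
```

Embedding workloads fit the same function with `output_tokens=0`, since embedding prices are input-only.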

    Primary sources

    Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
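The fallback rule described above might look like the sketch below. It assumes the two inputs have already been filtered down to successful snapshots and successful crawler runs. The schemas of aicost_pricing_snapshots and aicost_crawler_runs are not published here, so plain lists of dates stand in for them.

```python
from datetime import date
from typing import Optional, Sequence


def last_verified(
    snapshot_dates: Sequence[date],
    crawler_run_dates: Sequence[date],
) -> Optional[date]:
    """Pick the last-verified date for one vendor.

    Prefer the most recent successful daily snapshot; fall back to the
    latest successful crawler run only when no snapshot exists yet.
    """
    if snapshot_dates:
        return max(snapshot_dates)
    if crawler_run_dates:
        return max(crawler_run_dates)
    return None  # vendor not yet verified


if __name__ == "__main__":
    print(last_verified([date(2026, 6, 4), date(2026, 6, 5)], [date(2026, 6, 1)]))
    print(last_verified([], [date(2026, 6, 1)]))
```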

    Anthropic
    2026-06-05
    https://www.anthropic.com/pricing
    Daily snapshot since Sep 2023 · 578 days captured
    Anthropic Docs
    2026-06-05
    https://platform.claude.com/docs/en/about-claude/pricing
    Daily snapshot since Sep 2023 · 578 days captured
    OpenAI
    2026-06-05
    https://openai.com/api/pricing/
    Daily snapshot since Sep 2023 · 579 days captured
    Google AI
    2026-06-05
    https://ai.google.dev/gemini-api/docs/pricing
    Daily snapshot since Dec 2023 · 554 days captured
    Google Vertex
    2026-06-05
    https://cloud.google.com/vertex-ai/generative-ai/pricing
    Daily snapshot since Dec 2023 · 554 days captured
    DeepSeek
    2026-06-05
    https://api-docs.deepseek.com/quick_start/pricing
    Daily snapshot since May 2024 · 493 days captured
    xAI
    2026-06-05
    https://x.ai/api
    Daily snapshot since Nov 2024 · 411 days captured
    Mistral
    2026-06-05
    https://mistral.ai/pricing
    Daily snapshot since Dec 2023 · 552 days captured
    Cohere
    2026-06-05
    https://cohere.com/pricing
    Daily snapshot since Sep 2023 · 578 days captured

    Inferred values (marked with * in calculator tables)

    Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

    Vendor / Model | Field | Why it’s inferred
    Anthropic / Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier.
    Anthropic / Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
    Anthropic / Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount.
    Anthropic / Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount.
    Anthropic / Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention.
    OpenAI / GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
    OpenAI / GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
    OpenAI / GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
    OpenAI / GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
    OpenAI / GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
    OpenAI / GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
    OpenAI / GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
    OpenAI / GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift.
    OpenAI / GPT-5.2 | batchInput | Derived at 50% of input.
    OpenAI / GPT-5.2 | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5 | cachedInput | Derived at 10% of input.
    OpenAI / GPT-5 | batchInput | Derived at 50% of input.
    OpenAI / GPT-5 | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5.5 Pro | cachedInput | Derived at 10% of input, using the family convention; OpenAI does not publish a cached rate for *-pro models.
    OpenAI / GPT-5.5 Pro | batchInput | Derived at 50% of input.
    OpenAI / GPT-5.5 Pro | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention.
    OpenAI / GPT-5.2 Pro | batchInput | Derived at 50% of input.
    OpenAI / GPT-5.2 Pro | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5.1 | batchInput | Derived at 50% of input.
    OpenAI / GPT-5.1 | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5 Pro | batchInput | Derived at 50% of input.
    OpenAI / GPT-5 Pro | batchOutput | Derived at 50% of output.
    OpenAI / GPT-5 Nano | cachedInput | Derived at 10% of input.
    OpenAI / GPT-5 Nano | batchInput | Derived at 50% of input.
    OpenAI / GPT-5 Nano | batchOutput | Derived at 50% of output.
    Google / Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention ~90%.
    Google / Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
    Google / Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
    Google / Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
    Google / Gemini 2.5 Pro | cachedInput | Derived at 10% of input.
    Google / Gemini 2.5 Flash | cachedInput | Derived at 10% of input.
    Google / Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
    Google / Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
    Google / Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
    Google / Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates.
    Google / Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
    Google / Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
    Google / Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
    Google / Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
    Google / Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
    xAI / Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base.
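The derivation rules in the table can be expressed as a small fill-in step. This sketch assumes a price record is a dictionary with `input` and `output` keys (hypothetical names) alongside the `cachedInput`, `batchInput`, and `batchOutput` fields named in the table. The cached ratio is a parameter because some rows use 25% instead of the typical 10%.

```python
from typing import Dict, List, Optional, Tuple

BATCH_RATIO = 0.50  # Batch API = 50% of base


def fill_inferred(
    prices: Dict[str, Optional[float]],
    cached_ratio: float = 0.10,  # typical convention: cached input = 10% of base
) -> Tuple[Dict[str, Optional[float]], List[str]]:
    """Fill missing discount fields and report which ones were inferred.

    Vendor-published values are left untouched; only missing fields are
    derived, so the caller can mark them with * in the tables.
    """
    filled = dict(prices)
    inferred: List[str] = []
    defaults = {
        "cachedInput": prices["input"] * cached_ratio,
        "batchInput": prices["input"] * BATCH_RATIO,
        "batchOutput": prices["output"] * BATCH_RATIO,
    }
    for field, value in defaults.items():
        if filled.get(field) is None:
            filled[field] = value
            inferred.append(field)
    return filled, inferred


if __name__ == "__main__":
    # Hypothetical record: the vendor publishes batchInput but nothing else.
    record = {"input": 2.0, "output": 10.0, "batchInput": 1.0}
    filled, inferred = fill_inferred(record)
    for field in ("cachedInput", "batchInput", "batchOutput"):
        mark = "*" if field in inferred else ""
        print(f"{field}: {filled[field]}{mark}")
```

Returning the list of inferred fields separately keeps the published and derived values distinguishable.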

    Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →