Methodology: Self-Host Break-even

API vs GPU rental economics with real throughput data.

Citations last refreshed: 2026-06-03 · Pricing snapshot: 2026-06-05

How we keep this honest

Every number on aicost.ai is verified by 11 independent audit layers that run every day at 03:30 EDT — covering structural integrity, math correctness, source-side freshness, and cross-source agreement. We publish today's snapshot date and per-vendor verification timestamps below so you can verify any number yourself.

50 calculators · 31 vendors verified · 58 cited claims · 581 days of history · 11 audit layers
The 11 audit layers
  • Layer 1: Architecture (12 structural invariants)
  • Layer 2: Smoke test (every calc page renders)
  • Layer 3: Golden values (math correctness vs reference)
  • Layer 4: Source resilience (independent reference data sources reachable)
  • Layer 5: Math gotchas (static code analysis)
  • Layer 6: Hybrid reconciliation (cross-source agreement)
  • Layer 7: Drift detection (day-over-day price changes)
  • Layer 8: Vendor cache (per-vendor freshness wiring)
  • Layer 9: Cross-vendor reachability (live vendor pricing page probes)
  • Layer 10: Rendered HTML drift (calc page DOM contracts, 45 pages daily)
  • Layer 11: Pricing freshness (cron heartbeat + per-vendor age tracking)

All 11 layers must pass before any pricing data is considered fresh. The infrastructure runs daily and publishes results to an internal dashboard. If any layer flags an issue, it is treated as stop-the-line work.
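
The gating rule can be sketched in a few lines. This is a minimal illustration, assuming each layer reports a simple pass/fail result; `LayerResult`, `pricing_is_fresh`, and `failing_layers` are hypothetical names, not the site's actual pipeline code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerResult:
    layer: int    # 1..11, matching the numbered list above
    name: str
    passed: bool


def pricing_is_fresh(results: list, expected_layers: int = 11) -> bool:
    """Return True only if every expected layer reported and passed."""
    reported = {r.layer for r in results}
    if reported != set(range(1, expected_layers + 1)):
        return False  # a layer that did not report counts as a failure
    return all(r.passed for r in results)


def failing_layers(results: list) -> list:
    """Names of layers that flagged an issue (the stop-the-line items)."""
    return [r.name for r in results if not r.passed]
```

Treating a missing report as a failure is a design choice of this sketch: it keeps a crashed audit job from being mistaken for a clean run.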

How this calculator sources its numbers

Every value falls into one of five categories. Numbers without an asterisk are vendor-published, directly observable, or computed by arithmetic on published data. Numbers marked with * are typical best-target values — we state the working range and invite you to override with your own number.

Vendor-published: Directly from the vendor's pricing or docs page. No asterisk.
Published benchmark: Independent benchmark (e.g. Chatbot Arena, ANN-Benchmarks, vLLM). Cited with date.
Research paper: Peer-reviewed or widely-accepted research (e.g. LLMLingua, RAGAS).
Typical target *: No single canonical source exists. We state the working range and explain why.
Computed: Arithmetic on vendor-published values (e.g. batch discount × standard rate).

Any individual claim may also be tagged with * if its source has not yet been re-verified against the current vendor page — treat such claims as approximate until the next verification cycle resolves them.

Vendor verification freshness

Each vendor's pricing page is independently re-checked on a cadence ranging from daily to weekly. Below: when each relevant vendor was last verified by our automated pipeline.

anthropic · 2026-06-05 · verified today · by auto-pipeline
chatgpt-plus · 2026-06-05 · verified today · by auto-pipeline
chroma · 2026-06-05 · verified today · by auto-crawler
claude-pro · 2026-06-05 · verified today · by auto-pipeline
cohere · 2026-06-05 · verified today · by auto-pipeline

* Typical best-target values (defaults you may want to override)

These values have no single canonical source. We've stated the working range and its basis. Your actual numbers will likely differ — every starred field in the calculator has an override input.

* Typical production GPU utilization: 60%*

Default used: 60 percent
Typical range: 40–75 percent
Source: FinOps Foundation GPU allocation guidance · 2025-01-01

Utilization depends on load pattern. Steady traffic can hit 70-80%. Bursty traffic typically averages 30-50%. Most production deployments observe 40-75% sustained utilization; plan break-even math against the lower end if the traffic pattern is unknown.

Published benchmarks and research

Independent, reproducible, and cited.

Llama 3 8B on 1x L40S via vLLM: ~2200 output tokens/sec at batch=1*

Value: 2200 tokens per second
Source: vLLM benchmarks · 2024-07-01
Type: Benchmark

Smaller models on smaller GPUs can be very cost-effective for high-volume, latency-tolerant workloads. An L40S is significantly cheaper per hour than an H100.
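
The throughput and utilization defaults above can be combined into a rough cost-per-token estimate. The sketch below is illustrative only: the function names are hypothetical, and the GPU hourly price and API price in the usage lines are placeholders you would replace with real quotes. Only the 2200 tokens/sec and 60% utilization figures come from this page.

```python
def self_host_cost_per_million(gpu_usd_per_hour: float,
                               tokens_per_second: float,
                               utilization: float) -> float:
    """USD per 1M output tokens on a rented GPU at a given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    effective_tokens_per_hour = tokens_per_second * utilization * 3600
    return gpu_usd_per_hour / effective_tokens_per_hour * 1_000_000


def monthly_capacity_tokens(tokens_per_second: float,
                            utilization: float,
                            hours_per_month: float = 730.0) -> float:
    """Tokens one GPU can actually serve in a month at that utilization."""
    return tokens_per_second * utilization * 3600 * hours_per_month


def breakeven_tokens_per_month(gpu_usd_per_hour: float,
                               api_usd_per_million: float,
                               hours_per_month: float = 730.0) -> float:
    """Monthly volume at which API spend equals the fixed GPU rental bill."""
    monthly_gpu_cost = gpu_usd_per_hour * hours_per_month
    return monthly_gpu_cost / api_usd_per_million * 1_000_000


# Placeholder prices (NOT from this page): $1.00/hour GPU, $0.50 per 1M API tokens.
cost = self_host_cost_per_million(1.00, tokens_per_second=2200, utilization=0.60)
breakeven = breakeven_tokens_per_month(1.00, api_usd_per_million=0.50)
feasible = breakeven <= monthly_capacity_tokens(2200, 0.60)
```

Self-hosting only wins if the break-even volume fits within what the GPU can serve, which is why the capacity check matters. Re-running with `utilization=0.40` shows how sensitive the result is to the low end of the range.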

Vendor-published values

Directly from the vendor's own docs. See per-vendor verification dates in the panel above.

vLLM v0.6.0 achieves 1.8x higher throughput for Llama 3 70B on 4xH100 compared to v0.5.3.

“vLLM achieves 1.8x higher throughput and 2x less TPOT on Llama 70B model”
Value: 1.8x multiplier
Source: vLLM: Performance Benchmarks · 2026-06-03

Benchmark performed on ShareGPT dataset (500 prompts) with TPOT measured at 32 QPS.

vLLM v0.6.0 achieves 2.7x higher throughput for Llama 3 8B on 1xH100 compared to v0.5.3.

“vLLM achieves 2.7x higher throughput and 5x faster TPOT on Llama 8B model”
Value: 2.7x multiplier
Source: vLLM: Performance Benchmarks · 2026-06-03

Benchmark performed on ShareGPT dataset (500 prompts) with TPOT measured at 32 QPS.

Vendor pricing pages referenced

All vendor-published prices used by this calculator are sourced from the pages listed under "Primary sources" below. See the verification panel above for when each was last re-checked.

See an error or stale value?

We treat methodology as a living document. If a price is wrong, a benchmark is outdated, or you have a better citation, let us know and we will verify within 48 hours.

Email [email protected]
Data sources & methodology: 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

  • All prices are USD per 1 million tokens, current as of 2026-06-05.
  • Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
  • Batch API discounts are 50% off standard rates across providers that offer Batch mode.
  • Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
  • Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
  • Long-context pricing tiers apply when the input exceeds the model's long-context threshold.
  • Embedding prices are input-only (no output tokens generated).
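
As a rough illustration of how these conventions combine, the sketch below prices a single request from per-1M-token rates. The `Rates` class and `request_cost_usd` function are hypothetical. Applying the batch discount on top of cached-input pricing is an assumption made here for simplicity: whether the two stack differs by vendor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rates:
    input_per_million: float                          # USD per 1M input tokens
    output_per_million: float                         # USD per 1M output tokens
    cached_input_per_million: Optional[float] = None  # None: no cached rate


def request_cost_usd(rates: Rates,
                     input_tokens: int,
                     output_tokens: int,
                     cached_tokens: int = 0,
                     batch: bool = False,
                     residency_multiplier: float = 1.0) -> float:
    """Cost of one request. Embeddings simply pass output_tokens=0."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    cached_rate = (rates.cached_input_per_million
                   if rates.cached_input_per_million is not None
                   else rates.input_per_million)
    fresh_tokens = input_tokens - cached_tokens
    cost = (fresh_tokens * rates.input_per_million
            + cached_tokens * cached_rate
            + output_tokens * rates.output_per_million) / 1_000_000
    if batch:
        cost *= 0.5  # 50% Batch discount; ASSUMPTION: stacks with caching
    # Residency surcharge is not in base rates, so it is applied last.
    return cost * residency_multiplier
```

Passing `residency_multiplier=1.1` models the regional surcharge listed above. The default of 1.0 matches the base rates shown in the calculator.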

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
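
The fallback rule for the last-verified date can be expressed as a small function. This is a sketch under assumptions: the two table names appear above, but their schemas are not published, so the inputs are plain lists of dates for successful runs only, and `last_verified` is a hypothetical name.

```python
from datetime import date
from typing import Optional


def last_verified(snapshot_dates: list,
                  crawler_run_dates: list) -> Optional[date]:
    """Pick the last-verified date for one vendor.

    snapshot_dates:    dates of successful daily snapshots
    crawler_run_dates: dates of successful crawler runs
    """
    if snapshot_dates:
        return max(snapshot_dates)      # snapshots take precedence
    if crawler_run_dates:
        return max(crawler_run_dates)   # fallback when no snapshot exists yet
    return None                         # vendor not verified
```

Note that a snapshot wins even when a crawler run is more recent, which mirrors the "when no snapshot exists yet" wording above.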

Anthropic · 2026-06-05 · https://www.anthropic.com/pricing · Daily snapshot since Sep 2023 · 578 days captured
Anthropic Docs · 2026-06-05 · https://platform.claude.com/docs/en/about-claude/pricing · Daily snapshot since Sep 2023 · 578 days captured
OpenAI · 2026-06-05 · https://openai.com/api/pricing/ · Daily snapshot since Sep 2023 · 579 days captured
Google AI · 2026-06-05 · https://ai.google.dev/gemini-api/docs/pricing · Daily snapshot since Dec 2023 · 554 days captured
Google Vertex · 2026-06-05 · https://cloud.google.com/vertex-ai/generative-ai/pricing · Daily snapshot since Dec 2023 · 554 days captured
DeepSeek · 2026-06-05 · https://api-docs.deepseek.com/quick_start/pricing · Daily snapshot since May 2024 · 493 days captured
xAI · 2026-06-05 · https://x.ai/api · Daily snapshot since Nov 2024 · 411 days captured
Mistral · 2026-06-05 · https://mistral.ai/pricing · Daily snapshot since Dec 2023 · 552 days captured
Cohere · 2026-06-05 · https://cohere.com/pricing · Daily snapshot since Sep 2023 · 578 days captured

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model | Field | Why it's inferred
Anthropic / Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes a 90% cache-hit discount on this tier.
Anthropic / Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic / Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents a uniform 50% Batch discount.
Anthropic / Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents a uniform 50% Batch discount.
Anthropic / Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention.
OpenAI / GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents an automatic 90% discount on cache hits across the GPT-5.x tier.
OpenAI / GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift.
OpenAI / GPT-5.2 | batchInput | Derived at 50% of input.
OpenAI / GPT-5.2 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 | cachedInput | Derived at 10% of input.
OpenAI / GPT-5 | batchInput | Derived at 50% of input.
OpenAI / GPT-5 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.5 Pro | cachedInput | Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models, so the family convention is used.
OpenAI / GPT-5.5 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5.5 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention.
OpenAI / GPT-5.2 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5.2 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.1 | batchInput | Derived at 50% of input.
OpenAI / GPT-5.1 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 Nano | cachedInput | Derived at 10% of input.
OpenAI / GPT-5 Nano | batchInput | Derived at 50% of input.
OpenAI / GPT-5 Nano | batchOutput | Derived at 50% of output.
Google / Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention ~90%.
Google / Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Pro | cachedInput | Derived at 10% of input.
Google / Gemini 2.5 Flash | cachedInput | Derived at 10% of input.
Google / Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates.
Google / Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
xAI / Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base.
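
The derivations in this table follow one pattern, which might be sketched as below. The function names and the dictionary layout are hypothetical; only the field names (`cachedInput`, `batchInput`, `batchOutput`) and the 10% / 50% conventions come from this page. A published value is never overwritten, and each derived value carries a flag so it can be rendered with a `*`.

```python
def derive_inferred_fields(published: dict,
                           cache_fraction: float = 0.10,
                           batch_fraction: float = 0.50) -> dict:
    """Return {field: (usd_per_million, is_inferred)} for one model.

    `published` must contain "input" and "output" rates and may already
    contain any of the three derived fields.
    """
    derived = {
        "cachedInput": published["input"] * cache_fraction,
        "batchInput": published["input"] * batch_fraction,
        "batchOutput": published["output"] * batch_fraction,
    }
    # Vendor-published values are kept as-is and carry no mark.
    result = {field: (value, False) for field, value in published.items()}
    for field, value in derived.items():
        if field not in published:
            result[field] = (value, True)  # inferred, so starred
    return result


def format_price(value: float, inferred: bool) -> str:
    """Render a price, appending * when the value was inferred."""
    return f"{value:.4f}{'*' if inferred else ''}"
```

Rows that use a different convention, such as the 25% cached rate for Gemini 2.0 Flash or Grok 4 (legacy), would pass `cache_fraction=0.25`.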

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old and new values for a full audit trail.