Methodology: Self-Host Break-even

API vs GPU rental economics with real throughput data.

Citations last refreshed: 2026-06-03 · Pricing snapshot: 2026-06-05

How we keep this honest

Every number on aicost.ai is verified by 11 independent audit layers that run every day at 03:30 EDT — covering structural integrity, math correctness, source-side freshness, and cross-source agreement. We publish today's snapshot date and per-vendor verification timestamps below so you can verify any number yourself.

581 days of history

The 11 audit layers

Layer 1: Architecture (12 structural invariants)
Layer 2: Smoke test (every calc page renders)
Layer 3: Golden values (math correctness vs reference)
Layer 4: Source resilience (independent reference data sources reachable)
Layer 5: Math gotchas (static code analysis)
Layer 6: Hybrid reconciliation (cross-source agreement)
Layer 7: Drift detection (day-over-day price changes)
Layer 8: Vendor cache (per-vendor freshness wiring)
Layer 9: Cross-vendor reachability (live vendor pricing page probes)
Layer 10: Rendered HTML drift (calc page DOM contracts, 45 pages daily)
Layer 11: Pricing freshness (cron heartbeat + per-vendor age tracking)

All 11 layers must pass before any pricing data is considered fresh. The infrastructure runs daily and publishes results to an internal dashboard. If any layer flags an issue, it is treated as stop-the-line work.
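The all-layers-must-pass rule can be sketched as a simple gate. This is an illustrative sketch only, not the site's actual pipeline: the function names and the shape of `layer_results` are assumptions, and a layer that fails to report is treated here as a failure.

```python
from typing import List, Mapping


def pricing_is_fresh(layer_results: Mapping[str, bool], expected_layers: int = 11) -> bool:
    """Return True only if every expected audit layer reported and passed.

    `layer_results` maps a layer name to its pass/fail status (hypothetical shape).
    """
    if len(layer_results) != expected_layers:
        return False  # a layer that did not report counts as a failure
    return all(layer_results.values())


def failing_layers(layer_results: Mapping[str, bool]) -> List[str]:
    """List the layers that flagged an issue (the stop-the-line work)."""
    return sorted(name for name, passed in layer_results.items() if not passed)


if __name__ == "__main__":
    results = {f"layer_{i}": True for i in range(1, 12)}
    results["layer_7"] = False  # e.g. drift detection flags a price change
    print(pricing_is_fresh(results), failing_layers(results))
```

A single failing or missing layer is enough to keep the data from being labeled fresh, which matches the stop-the-line wording above.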

How this calculator sources its numbers

Every value falls into one of five categories. Numbers without an asterisk are vendor-published, directly observable, or computed by arithmetic on published data. Numbers marked with * are typical best-target values — we state the working range and invite you to override with your own number.

Vendor-published: Directly from the vendor's pricing or docs page. No asterisk.

Published benchmark: Independent benchmark (e.g. Chatbot Arena, ANN-Benchmarks, vLLM). Cited with date.

Research paper: Peer-reviewed or widely-accepted research (e.g. LLMLingua, RAGAS).

Typical target (*): No single canonical source exists. We state the working range and explain why.

Computed: Arithmetic on vendor-published values (e.g. batch discount × standard rate).

Any individual claim may also be tagged with ^* if its source has not yet been re-verified against the current vendor page — treat such claims as approximate until the next verification cycle resolves them.

Vendor verification freshness

Each vendor's pricing page is independently re-checked on a cadence ranging from daily to weekly. Below: when each relevant vendor was last verified by our automated pipeline.

anthropic · 2026-06-05 · verified today · by auto-pipeline
chatgpt-plus · 2026-06-05 · verified today · by auto-pipeline
chroma · 2026-06-05 · verified today · by auto-crawler
claude-pro · 2026-06-05 · verified today · by auto-pipeline
cohere · 2026-06-05 · verified today · by auto-pipeline

* Typical best-target values (defaults you may want to override)

These values have no single canonical source. We've stated the working range and its basis. Your actual numbers will likely differ — every starred field in the calculator has an override input.

* Typical production GPU utilization: 60%^*

Default used: 60%

Typical range: 40–75%

Source: FinOps Foundation GPU allocation guidance · 2025-01-01

Utilization depends on load pattern. Steady traffic can reach 70-80%; bursty traffic typically averages 30-50%. Most production deployments sustain 40-75% utilization. If your traffic pattern is unknown, plan break-even math against the lower end of that range.
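To see why the lower end of the range matters, here is a minimal sketch of how utilization feeds into effective cost per million tokens. The function names, the $1.00/hour GPU rate, and the 1,000 tokens/sec peak throughput are hypothetical placeholders, not values from this calculator.

```python
def effective_tokens_per_hour(peak_tokens_per_sec: float, utilization: float) -> float:
    """Tokens actually served per rented hour at a given average utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return peak_tokens_per_sec * 3600 * utilization


def cost_per_million_tokens(gpu_usd_per_hour: float,
                            peak_tokens_per_sec: float,
                            utilization: float) -> float:
    """Rental cost per 1M tokens; the GPU is billed whether or not it is busy."""
    tokens = effective_tokens_per_hour(peak_tokens_per_sec, utilization)
    return gpu_usd_per_hour / tokens * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs: $1.00/hour GPU, 1,000 tokens/sec peak throughput.
    for u in (0.40, 0.60, 0.75):
        print(f"{u:.0%}: ${cost_per_million_tokens(1.00, 1000, u):.4f} per 1M tokens")
```

Because the rental bill is fixed, cost per token scales with 1/utilization: the same GPU is nearly twice as expensive per token at 40% as at 75%.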

Published benchmarks and research

Independent, reproducible, and cited.

Llama 3 8B on 1x L40S via vLLM: ~2200 output tokens/sec at batch=1^*

Value: 2,200 tokens per second

Source: vLLM benchmarks · 2024-07-01

Type: Benchmark

Smaller models on smaller GPUs can be very cost-effective for high-volume, latency-tolerant workloads: an L40S rents for significantly less per hour than an H100.
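A rough break-even sketch using the 2,200 tokens/sec figure and the 60% default utilization from this page. Everything else is an assumption for illustration: the $1.00/hour rental rate, the $0.50 per 1M tokens API price, the 730-hour month, and the function names. It also ignores input-token pricing, engineering time, and redundancy.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def monthly_capacity_tokens(tokens_per_sec: float, utilization: float,
                            hours: float = HOURS_PER_MONTH) -> float:
    """Output tokens one GPU can serve in a month at the given utilization."""
    return tokens_per_sec * 3600 * hours * utilization


def breakeven_tokens_per_month(gpu_usd_per_hour: float, api_usd_per_million: float,
                               hours: float = HOURS_PER_MONTH) -> float:
    """Monthly token volume at which API spend equals the fixed rental bill."""
    return gpu_usd_per_hour * hours / api_usd_per_million * 1_000_000


def self_host_is_cheaper(monthly_tokens: float, gpu_usd_per_hour: float,
                         api_usd_per_million: float, tokens_per_sec: float,
                         utilization: float) -> bool:
    """True when volume is past break-even and still fits on a single GPU."""
    needed = breakeven_tokens_per_month(gpu_usd_per_hour, api_usd_per_million)
    capacity = monthly_capacity_tokens(tokens_per_sec, utilization)
    return needed <= monthly_tokens <= capacity


if __name__ == "__main__":
    # Hypothetical prices: $1.00/hour GPU rental, $0.50 per 1M tokens via API.
    print(breakeven_tokens_per_month(1.00, 0.50))
    print(monthly_capacity_tokens(2200, 0.60))
    print(self_host_is_cheaper(2_000_000_000, 1.00, 0.50, 2200, 0.60))
```

The capacity check matters as much as the break-even point: a volume that exceeds what one GPU can serve needs a second rental, which moves the break-even again.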

Vendor-published values

Directly from the vendor's own docs. See per-vendor verification dates in the panel above.

vLLM v0.6.0 achieves 1.8x higher throughput for Llama 3 70B on 4xH100 compared to v0.5.3.

“vLLM achieves 1.8x higher throughput and 2x less TPOT on Llama 70B model”

Value: 1.8x multiplier

Source: vLLM Performance Benchmarks · 2026-06-03

Benchmark performed on ShareGPT dataset (500 prompts) with TPOT measured at 32 QPS.

vLLM v0.6.0 achieves 2.7x higher throughput for Llama 3 8B on 1xH100 compared to v0.5.3.

“vLLM achieves 2.7x higher throughput and 5x faster TPOT on Llama 8B model”

Value: 2.7x multiplier

Source: vLLM Performance Benchmarks · 2026-06-03

Benchmark performed on ShareGPT dataset (500 prompts) with TPOT measured at 32 QPS.
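As a hedged illustration of what a throughput multiplier means for cost: if the hourly rental price is unchanged, cost per token falls by the inverse of the multiplier. The helper names and baseline values below are hypothetical, and the multipliers were measured under the specific benchmark conditions quoted above, so they may not carry over to other workloads.

```python
def scaled_throughput(baseline_tokens_per_sec: float, multiplier: float) -> float:
    """Throughput after applying a version-to-version speedup multiplier."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return baseline_tokens_per_sec * multiplier


def scaled_cost_per_million(baseline_usd_per_million: float, multiplier: float) -> float:
    """Cost per 1M tokens at the same hourly price once throughput is scaled."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return baseline_usd_per_million / multiplier


if __name__ == "__main__":
    # Hypothetical baseline of $1.00 per 1M tokens on the older version.
    for m in (1.8, 2.7):
        print(f"{m}x -> ${scaled_cost_per_million(1.00, m):.3f} per 1M tokens")
```

Under these assumptions, 1.8x brings cost per token to roughly 56% of the baseline and 2.7x to roughly 37%.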

Vendor pricing pages referenced

All vendor-published prices used by this calculator are sourced from the pages below. See the verification panel above for when each was last re-checked.

Anthropic https://www.anthropic.com/pricing · verified 2026-04-17
Anthropic Docs https://platform.claude.com/docs/en/about-claude/pricing · verified 2026-04-17
OpenAI https://openai.com/api/pricing/ · verified 2026-04-17
Google AI https://ai.google.dev/gemini-api/docs/pricing · verified 2026-04-17
Google Vertex https://cloud.google.com/vertex-ai/generative-ai/pricing · verified 2026-04-17
DeepSeek https://api-docs.deepseek.com/quick_start/pricing · verified 2026-04-17
xAI https://x.ai/api · verified 2026-04-17
Mistral https://mistral.ai/pricing · verified 2026-04-17
Cohere https://cohere.com/pricing · verified 2026-04-17
Voyage AI https://docs.voyageai.com/docs/pricing · verified 2026-04-17

See an error or stale value?

We treat methodology as a living document. If a price is wrong, a benchmark is outdated, or you have a better citation, let us know and we will verify within 48 hours.

Email [email protected] →

Data sources & methodology

161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds the model's threshold.
Embedding prices are input-only (no output tokens generated).
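The rules above can be combined into a per-request cost sketch. This is a minimal illustration under stated assumptions: the `Rates` fields and the example prices are hypothetical, the Batch discount is assumed to apply to the whole request, and whether Batch and caching discounts stack varies by provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per 1 million tokens (hypothetical example rates, not real prices)."""
    input_per_m: float
    output_per_m: float
    cached_input_per_m: float


def request_cost_usd(rates: Rates, input_tokens: int, cached_tokens: int,
                     output_tokens: int, batch: bool = False,
                     residency_multiplier: float = 1.0) -> float:
    """Cost of one request; cached_tokens is the cache-hit share of input_tokens."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    cost = (fresh_tokens * rates.input_per_m
            + cached_tokens * rates.cached_input_per_m
            + output_tokens * rates.output_per_m) / 1_000_000
    if batch:
        cost *= 0.5  # 50% Batch discount (assumed here to apply to the whole request)
    # Regional surcharges are not in base rates, so they are applied on top.
    return cost * residency_multiplier


if __name__ == "__main__":
    rates = Rates(input_per_m=2.00, output_per_m=10.00, cached_input_per_m=0.20)
    print(request_cost_usd(rates, 1_000_000, 500_000, 1_000_000))
```

For embeddings, passing `output_tokens=0` reflects the input-only pricing noted in the list.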

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
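The fallback rule for the last-verified date can be sketched as follows. The function name and the use of plain date lists are assumptions for illustration; the real values live in aicost_pricing_snapshots and aicost_crawler_runs, whose schemas are not described here.

```python
from datetime import date
from typing import Optional, Sequence


def last_verified(snapshot_dates: Sequence[date],
                  crawler_run_dates: Sequence[date]) -> Optional[date]:
    """Most recent successful snapshot, else the latest successful crawler run."""
    if snapshot_dates:
        return max(snapshot_dates)
    if crawler_run_dates:
        return max(crawler_run_dates)
    return None  # vendor has never been verified


if __name__ == "__main__":
    print(last_verified([date(2026, 6, 4), date(2026, 6, 5)], [date(2026, 6, 1)]))
    print(last_verified([], [date(2026, 6, 1)]))
```

Note that a snapshot always takes precedence over a crawler run in this sketch, even if the crawler run is newer, because that is how the rule above is worded.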

Anthropic · 2026-06-05 · https://www.anthropic.com/pricing · Daily snapshot since Sep 2023 · 578 days captured
Anthropic Docs · 2026-06-05 · https://platform.claude.com/docs/en/about-claude/pricing · Daily snapshot since Sep 2023 · 578 days captured
OpenAI · 2026-06-05 · https://openai.com/api/pricing/ · Daily snapshot since Sep 2023 · 579 days captured
Google AI · 2026-06-05 · https://ai.google.dev/gemini-api/docs/pricing · Daily snapshot since Dec 2023 · 554 days captured
Google Vertex · 2026-06-05 · https://cloud.google.com/vertex-ai/generative-ai/pricing · Daily snapshot since Dec 2023 · 554 days captured
DeepSeek · 2026-06-05 · https://api-docs.deepseek.com/quick_start/pricing · Daily snapshot since May 2024 · 493 days captured
xAI · 2026-06-05 · https://x.ai/api · Daily snapshot since Nov 2024 · 411 days captured
Mistral · 2026-06-05 · https://mistral.ai/pricing · Daily snapshot since Dec 2023 · 552 days captured
Cohere · 2026-06-05 · https://cohere.com/pricing · Daily snapshot since Sep 2023 · 578 days captured
Voyage AI · 2026-06-05 · https://docs.voyageai.com/docs/pricing

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.
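The derivations in the table follow one pattern, sketched below. The function name and the example rates are hypothetical; the field names (`cachedInput`, `batchInput`, `batchOutput`) and the default fractions come from the table, and `cache_fraction` can be overridden for rows that use 25% instead of 10%.

```python
from typing import Dict


def derive_inferred_fields(input_rate: float, output_rate: float,
                           cache_fraction: float = 0.10,
                           batch_fraction: float = 0.50) -> Dict[str, float]:
    """Compute the starred fields from published base rates (USD per 1M tokens)."""
    return {
        "cachedInput": input_rate * cache_fraction,   # 10% of input by default
        "batchInput": input_rate * batch_fraction,    # 50% of standard input
        "batchOutput": output_rate * batch_fraction,  # 50% of standard output
    }


if __name__ == "__main__":
    # Hypothetical base rates: $2.00 input, $8.00 output per 1M tokens.
    print(derive_inferred_fields(2.00, 8.00))
    # Rows that use a 25% cached rate pass a different fraction.
    print(derive_inferred_fields(2.00, 8.00, cache_fraction=0.25))
```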

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old and new values for a full audit trail.
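A changelog of this kind can be produced by diffing consecutive snapshots. The sketch below is an assumption about how that might look, not the actual implementation: the flat `"vendor/model/field"` keys, the record shape, and the example prices are all hypothetical.

```python
from typing import Dict, List, Optional


def diff_snapshots(old: Dict[str, float], new: Dict[str, float]) -> List[dict]:
    """Return one changelog record per price that was added, removed, or changed."""
    changes = []
    for key in sorted(set(old) | set(new)):
        before: Optional[float] = old.get(key)
        after: Optional[float] = new.get(key)
        if before != after:
            changes.append({"key": key, "old": before, "new": after})
    return changes


if __name__ == "__main__":
    # Hypothetical keys and prices, for illustration only.
    yesterday = {"vendor-a/model-x/input": 3.00, "vendor-a/model-x/output": 15.00}
    today = {"vendor-a/model-x/input": 2.50, "vendor-a/model-x/output": 15.00}
    for change in diff_snapshots(yesterday, today):
        print(change)
```

Recording both the old and new value in each record is what makes the trail auditable: any published price can be traced back to the day it changed.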
