Embedding Cost - Indexing + Query Math for RAG
Meet Olivia Garrett, a Solutions Engineer building a knowledge-base RAG system. "I have 500K docs to embed. Then ongoing query embedding. What does this actually cost?"
🔥 Spec said 'embeddings are basically free' - invoice came back $1,200.
Embeddings are 10-30× cheaper than chat - but volume hides bills. OpenAI text-embedding-3-large: $0.13/1M tokens. text-embedding-3-small: $0.02/1M. Voyage 3: $0.12/1M. Cohere v3: $0.10/1M. At million-doc scale, the small numbers compound.
Olivia's 500K docs at ~1,500 tokens each = 750M tokens to index. On text-embedding-3-small ($0.02/1M): $15 one-time. On text-embedding-3-large ($0.13/1M): about $97. The shock came from re-embedding when she upgraded models - the second pass was another $97. And ongoing query embeddings (1K queries/day × 100 tokens × 30 days × $0.02/1M): $0.06/mo. Negligible. So why $1,200? Because she embedded 4 fields per doc separately and tested 3 models: 4 × 3 = 12 passes, and 12 passes at roughly $97 each comes to about $1,170.
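The indexing arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not the calculator's implementation: the function name `index_cost` and the `passes` parameter are illustrative, and the 12-pass figure assumes every field-and-model combination costs about as much as one full pass at the large-model price.

```python
def index_cost(docs: int, tokens_per_doc: int, price_per_million: float, passes: int = 1) -> float:
    """Dollar cost of embedding a corpus `passes` times at a flat per-token price."""
    total_tokens = docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million * passes

# Prices in $ per 1M tokens, as quoted in this guide.
SMALL = 0.02  # text-embedding-3-small
LARGE = 0.13  # text-embedding-3-large

one_pass_small = index_cost(500_000, 1_500, SMALL)
one_pass_large = index_cost(500_000, 1_500, LARGE)
# Assumption: 4 fields x 3 models = 12 passes, each priced like a full large-model pass.
twelve_passes = index_cost(500_000, 1_500, LARGE, passes=12)

print(f"small, one pass:  ${one_pass_small:,.2f}")   # $15.00
print(f"large, one pass:  ${one_pass_large:,.2f}")   # $97.50
print(f"large, 12 passes: ${twelve_passes:,.2f}")    # $1,170.00
```

The per-pass cost is tiny, so the multiplier on the number of passes is the input worth watching.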
Three embedding cost levers. (1) Model choice - small vs large is a ~6.5× cost difference ($0.02 vs $0.13 per 1M tokens), often <5% quality difference. (2) Field strategy - embed one consolidated field, not 4 separate ones. (3) Re-embedding discipline - every model upgrade costs the full index again. Plan for it.
Each input shapes the cost. The explanations below go through the inputs one by one.
Indexing is one-time but real. 750M tokens × $0.02/1M = $15. Sounds free. At 10M docs (same ~1,500 tokens each), it becomes $300. At 100M docs, $3K. Plan for the magnitude.
Query embedding is essentially free. 5K queries/day × 100 tokens × 30 days = 15M tokens, and 15M tokens × $0.02/1M = $0.30/mo. Don't optimize query embedding - focus on storage + retrieval.
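A short sketch of the query-side arithmetic, using the inputs quoted in this guide (100 tokens per query, $0.02 per 1M tokens, 30-day month). The function name `monthly_query_cost` is illustrative and not taken from the calculator.

```python
def monthly_query_cost(queries_per_day: int, tokens_per_query: int,
                       price_per_million: float, days: int = 30) -> float:
    """Dollar cost of embedding every incoming query for one month."""
    tokens = queries_per_day * tokens_per_query * days
    return tokens / 1_000_000 * price_per_million

cost_1k = monthly_query_cost(1_000, 100, 0.02)
cost_5k = monthly_query_cost(5_000, 100, 0.02)

print(f"1K queries/day: ${cost_1k:.2f}/mo")  # $0.06/mo
print(f"5K queries/day: ${cost_5k:.2f}/mo")  # $0.30/mo
```

At these volumes a month of query embedding costs a small fraction of a single indexing pass.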
Re-embedding is the surprise cost. Every model upgrade = full re-index. Budget for at least 1 re-embedding per year (vendors release new models). Two re-embeddings per year is normal during the optimization phase.
The bigger cost is downstream. Embedding cost is small. Vector DB storage + queries usually cost 10-50× more. Don't optimize the cheap line.
The same math at three different team sizes.
Internal wiki, 10K docs. Indexing: $0.30. Ongoing: ~$0.50/mo. Annual re-index: $0.30. Total per year: <$10. Don't overthink.
Healthy range: <$5 indexing + ~$1/mo ongoing
500K docs, 5K queries/day, 2 model upgrades/year. Indexing × 3 = $45 (initial + 2 rebuilds). Ongoing query: $0.30/mo × 12 = $3.60. Total ~$50/yr. Olivia's $1,200 was 4 fields × 3 models = 12× - fair lesson.
Healthy range: $30-100/year typical
10M docs at scale. Indexing $300/year. Query embedding $30/mo ($360/year). Total ~$700/yr, split roughly evenly between index rebuilds and query embedding. Vector DB cost will be 10-20× higher. Optimize that, not embeddings.
Healthy range: $300-800/year
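The yearly totals above combine three terms: the initial index, one extra pass per rebuild, and twelve months of query embedding. A minimal sketch of that sum follows. The `Workload` class and `annual_cost` function are hypothetical names, and the run uses the mid-size inputs only, because the other two tiers do not state their query volumes.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    docs: int
    tokens_per_doc: int
    rebuilds_per_year: int
    queries_per_day: int
    tokens_per_query: int = 100      # per-query size used throughout this guide
    price_per_million: float = 0.02  # $/1M tokens, the small-model price quoted here

def annual_cost(w: Workload) -> dict[str, float]:
    """Yearly embedding spend: initial index + rebuilds + 12 thirty-day months of queries."""
    per_pass = w.docs * w.tokens_per_doc / 1_000_000 * w.price_per_million
    indexing = per_pass * (1 + w.rebuilds_per_year)
    monthly_queries = w.queries_per_day * w.tokens_per_query * 30 / 1_000_000 * w.price_per_million
    queries = monthly_queries * 12
    return {"indexing": indexing, "queries": queries, "total": indexing + queries}

mid_size = Workload(docs=500_000, tokens_per_doc=1_500, rebuilds_per_year=2, queries_per_day=5_000)
for name, dollars in annual_cost(mid_size).items():
    print(f"{name}: ${dollars:,.2f}")
# indexing: $45.00
# queries: $3.60
# total: $48.60
```

Changing `rebuilds_per_year` moves the total far more than changing `queries_per_day`, which matches the advice to budget for re-embedding first.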
Cost isn't the only dimension. Each constraint below changes the recommendation.
Default to small models for most workloads. Quality differences are modest (5-10% retrieval recall) for most use cases. Only go large/premium for specialized domains.
Embedding quality affects retrieval recall - bad embeddings = wrong docs retrieved = LLM hallucinates from wrong context. Worth investing in eval here.
Embeddings of regulated content are still regulated content. Confirm vendor BAA / SOC 2. Self-hosted (Sentence-Transformers, BGE) for highest compliance.
Modern research shows embeddings can leak ~30-50% of source text via inversion attacks. Treat embeddings as sensitive - same access controls as source content.
If you're embedding queries on every request, latency adds up. Self-hosted small models (BGE, Nomic) cut this in half if you have GPU capacity.
Embeddings from Vendor A are NOT compatible with Vendor B. Switching vendors means re-embedding the entire corpus. Plan for it as a multi-day operation at scale.
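To see why a full re-embed can turn into a multi-day operation, a rough lower bound on wall-clock time is total tokens divided by the throughput the vendor allows. The sketch below assumes a hypothetical limit of 1M tokens per minute. Real limits vary by vendor and account tier, so `ASSUMED_TPM` is a placeholder to replace with your own quota.

```python
def reembed_hours(total_tokens: int, tokens_per_minute: int) -> float:
    """Lower bound on wall-clock hours when throughput is capped by a tokens-per-minute limit."""
    return total_tokens / tokens_per_minute / 60

ASSUMED_TPM = 1_000_000  # hypothetical rate limit, not a published vendor figure

mid_corpus_hours = reembed_hours(500_000 * 1_500, ASSUMED_TPM)       # 750M tokens
large_corpus_hours = reembed_hours(10_000_000 * 1_500, ASSUMED_TPM)  # 15B tokens

print(f"500K docs: {mid_corpus_hours:.1f} hours")          # 12.5 hours
print(f"10M docs:  {large_corpus_hours / 24:.1f} days")    # 10.4 days
```

This ignores retries, chunking, and vector DB write time, all of which push the real duration higher.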
Re-embedding shouldn't be a manual scramble. Build automation. Eval harness ensures upgrades are quality-positive.
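One way an eval harness can decide whether an upgrade is quality-positive is to compare recall@k for the old and new embedding models on a labeled query set. The sketch below is an illustration under stated assumptions, not a prescribed method: the function names, the toy document IDs, and the 0.02 minimum gain are all hypothetical.

```python
def recall_at_k(retrieved: dict[str, list[str]], relevant: dict[str, set[str]], k: int) -> float:
    """Mean fraction of each query's relevant docs that appear in its top-k results."""
    scores = []
    for query_id, gold in relevant.items():
        if not gold:
            continue  # skip queries with no labeled relevant docs
        top_k = set(retrieved.get(query_id, [])[:k])
        scores.append(len(top_k & gold) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0

def upgrade_is_quality_positive(old_recall: float, new_recall: float, min_gain: float = 0.02) -> bool:
    """Gate a re-embed: require the new model to beat the old one by at least `min_gain`."""
    return new_recall - old_recall >= min_gain

# Toy labeled set: ranked results per query from the old and new embedding models.
relevant = {"q1": {"d1", "d2"}, "q2": {"d3"}}
old_results = {"q1": ["d1", "d9", "d8"], "q2": ["d7", "d3", "d6"]}
new_results = {"q1": ["d1", "d2", "d9"], "q2": ["d3", "d7", "d6"]}

old_recall = recall_at_k(old_results, relevant, k=2)  # 0.75
new_recall = recall_at_k(new_results, relevant, k=2)  # 1.0
print(upgrade_is_quality_positive(old_recall, new_recall))  # True
```

Running this gate on a sample of the corpus before paying for the full re-index keeps a failed upgrade cheap.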
Scenarios for the most common applications, with realistic numbers.
Codebase indexing - small docs (~300 tokens), high query volume. 4 rebuilds/year (model + chunking iteration). Total ~$60/year. Real cost is the vector DB.
Healthy range: $30-80/year (high re-embed rate)
3M papers × 8K tokens × $0.13/1M (large model for accuracy) = $3,120 indexing. Plus ongoing queries. Premium justified - accurate retrieval is the product.
Healthy range: $3K-5K/year (large + premium)
Historical tickets for similar-case suggestion. 2M tickets, ~800 tokens each = 1.6B tokens. Cheap model fine. ~$32 per indexing pass × 3 (with rebuilds) = $96, plus $90/year ongoing = about $186.
Healthy range: $70-120/year
Product catalog, short descriptions, frequent re-indexing (catalog churn). 5M × 200 tokens = 1B tokens × $0.02/1M = $20 per re-embed × 6/year = $120 + queries. Cheap relative to vector DB.
Healthy range: $50-150/year
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Vector DB Cost for storage. RAG Pipeline for full architecture.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs).
10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
The fields below are derived from industry conventions rather than published directly by the vendor. Typical conventions: cached input = 10% of the base rate (90% off), Batch API = 50% of the base rate (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic — Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic — Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic — Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate — Anthropic 90% cache-hit discount convention. |
| OpenAI — GPT-5.4 Mini | cachedInput | Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI — GPT-5.4 Nano | cachedInput | Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Nano | batchInput | Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Nano | batchOutput | Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | cachedInput | Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Pro | batchInput | Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | batchOutput | Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift. |
| OpenAI — GPT-5.2 | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.2 | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5 | cachedInput | Derived at 10% of input. |
| OpenAI — GPT-5 | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5 | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5.5 Pro | cachedInput | Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI — GPT-5.5 Pro | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5.2 Pro | cachedInput | Derived at 10% of input — pro-tier convention. |
| OpenAI — GPT-5.2 Pro | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.2 Pro | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5.1 | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.1 | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5 Pro | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5 Nano | cachedInput | Derived at 10% of input. |
| OpenAI — GPT-5 Nano | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5 Nano | batchOutput | Derived at 50% of output. |
| Google — Gemini 3 Flash | cachedInput | Derived at 10% of input — Google caching discount convention ~90%. |
| Google — Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input — Google caching convention. |
| Google — Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Pro | cachedInput | Derived at 10% of input. |
| Google — Gemini 2.5 Flash | cachedInput | Derived at 10% of input. |
| Google — Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates. |
| Google — Gemini 2.0 Flash | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| xAI — Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base. |
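The derivation rule behind the table can be sketched in a few lines. This is an illustration of the stated conventions (cached input at 10% of base, batch at 50% of base), not the site's actual pipeline. The function name `derive_rates` and the base prices in the example are hypothetical. The field names `cachedInput`, `batchInput`, and `batchOutput` come from the table.

```python
def derive_rates(input_price: float, output_price: float,
                 cache_fraction: float = 0.10, batch_fraction: float = 0.50) -> dict[str, float]:
    """Infer unpublished rates from base prices ($ per 1M tokens) using flat fractions."""
    return {
        "cachedInput": round(input_price * cache_fraction, 4),
        "batchInput": round(input_price * batch_fraction, 4),
        "batchOutput": round(output_price * batch_fraction, 4),
    }

# Hypothetical base prices, for illustration only.
default_rates = derive_rates(input_price=3.00, output_price=15.00)
print(default_rates)
# {'cachedInput': 0.3, 'batchInput': 1.5, 'batchOutput': 7.5}

# Rows that deviate from the 10% convention (Gemini 2.0 Flash and Grok 4 in the table use 25%).
exception_rates = derive_rates(input_price=3.00, output_price=15.00, cache_fraction=0.25)
print(exception_rates["cachedInput"])  # 0.75
```

Keeping the fractions as parameters makes the exceptions explicit instead of hiding them in per-model special cases.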
Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for a full audit trail.