
Embedding Cost - Indexing + Query Math for RAG

Meet Olivia Garrett. Solutions Engineer building a knowledge base RAG. "I have 500K docs to embed. Then ongoing query embedding. What does this actually cost?"

🔥 Spec said 'embeddings are basically free' - invoice came back $1,200.

The story

Embeddings are 10-30× cheaper than chat - but volume hides bills. OpenAI text-embedding-3-large: $0.13/1M tokens. text-embedding-3-small: $0.02/1M. Voyage 3: $0.12/1M. Cohere v3: $0.10/1M. At million-doc scale, the small numbers compound.

Olivia's 500K docs at ~1500 tokens each = 750M tokens to index. On text-embedding-3-small ($0.02/1M): $15 one-time. On text-embedding-3-large ($0.13/1M): $97. The shock came from re-embedding when she upgraded models - second pass was another $97. And ongoing query embeddings (1K queries/day × 100 tokens × $0.02/1M × 30 days): $0.06/mo. Negligible. So why $1,200? Because she had 4 fields per doc embedded separately and tested 3 models.
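The arithmetic in Olivia's story can be reproduced with a few lines. This is a minimal sketch using only the prices and volumes quoted above; the helper name `embedding_cost` is illustrative and not part of any vendor SDK.

```python
# Prices in USD per 1M tokens, as quoted in this guide (they may have changed since).
PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

def embedding_cost(tokens: int, price_per_1m: float) -> float:
    """Cost in USD of embedding `tokens` tokens at a per-1M-token price."""
    return tokens / 1_000_000 * price_per_1m

index_tokens = 500_000 * 1_500            # 750M tokens for the whole corpus
monthly_query_tokens = 1_000 * 100 * 30   # 1K queries/day, 100 tokens each, 30 days

index_small = embedding_cost(index_tokens, PRICE_PER_1M["text-embedding-3-small"])
index_large = embedding_cost(index_tokens, PRICE_PER_1M["text-embedding-3-large"])
query_monthly = embedding_cost(monthly_query_tokens, PRICE_PER_1M["text-embedding-3-small"])

print(f"index (small): ${index_small:.2f}")
print(f"index (large): ${index_large:.2f}")
print(f"queries/month: ${query_monthly:.2f}")
```

The three printed figures correspond to the $15, roughly $97, and $0.06/mo numbers in the story.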

Three embedding cost levers. (1) Model choice - small vs large is 6× cost difference, often <5% quality difference. (2) Field strategy - embed one consolidated field, not 4 separate. (3) Re-embedding discipline - every model upgrade costs the full index again. Plan for it.
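A rough sketch of how the levers multiply. The inputs here are assumptions for illustration only: each separately embedded field is taken to be about 1,500 tokens, and the three trial models are taken to be the small, large, and Voyage 3 prices quoted above. The story does not say which models or field lengths Olivia actually used, so this does not reconstruct her invoice.

```python
# Prices in USD per 1M tokens, as quoted in this guide.
TRIAL_MODELS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "Voyage 3": 0.12,
}

def total_index_spend(docs, tokens_per_field, fields_per_doc, prices_per_1m, passes_per_model=1):
    """Total USD spent embedding every field of every doc, once per pass, per model."""
    tokens_per_pass = docs * tokens_per_field * fields_per_doc
    return sum(tokens_per_pass / 1_000_000 * price * passes_per_model
               for price in prices_per_1m)

# Baseline: one consolidated 1,500-token field, one small model.
baseline = total_index_spend(500_000, 1_500, 1, [TRIAL_MODELS["text-embedding-3-small"]])

# Hypothetical sprawl: 4 fields of ~1,500 tokens each (an assumption), 3 models tried.
sprawl = total_index_spend(500_000, 1_500, 4, TRIAL_MODELS.values())

print(f"baseline: ${baseline:.2f}")
print(f"sprawl:   ${sprawl:.2f}")
print(f"multiplier: {sprawl / baseline:.0f}x")
```

Under these assumed inputs the sprawl case comes to about $810, 54 times the baseline. The real figure depends on actual field lengths and how many passes were rerun.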

🎛 Inputs you control

Each input shapes the cost. The explanations below match the calculator field by field.

Number of documents — Total documents in your corpus to embed and index.
How to choose: Count what you index today; re-indexing frequency handles ongoing growth.
Avg tokens / doc — Average length of one document in tokens (chunk size times chunk count).
How to choose: About 750 words is ~1,000 tokens; use your real average, not the longest doc.
Re-indexing frequency — How often you re-embed the entire corpus from scratch.
How to choose: Match content churn: static docs rarely, fast-changing knowledge bases monthly or more.
Queries / month — Monthly retrieval queries; each query is embedded once before search.
How to choose: Use real query volume. Query-side cost dominates for high-traffic RAG.
Avg tokens / query — Average length of a user query string in tokens.
How to choose: Search queries are short, 20 to 100 tokens is typical.
Embedding model — The embedding model priced for indexing and queries.
How to choose: Balance dimensions (storage), MTEB retrieval quality, and dollars per 1M tokens.
Batch API — Use the provider async batch tier for indexing jobs.
How to choose: Enable when indexing can wait hours; typically ~50% cheaper (OpenAI).
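One way the seven inputs above could combine into a monthly figure is sketched below. The calculator's actual formula is not published on this page, so the structure here is an assumption: re-indexing cost is spread evenly over 12 months, and the batch discount (assumed 50%) applies to indexing only, since queries need a synchronous response. The `EmbeddingInputs` and `monthly_cost` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EmbeddingInputs:
    documents: int
    tokens_per_doc: int
    reindexes_per_year: float     # full-corpus embedding passes per year
    queries_per_month: int
    tokens_per_query: int
    price_per_1m: float           # USD per 1M tokens for the chosen model
    use_batch: bool = False
    batch_discount: float = 0.5   # assumed ~50% off for batch indexing

def monthly_cost(inp: EmbeddingInputs) -> dict:
    """Break the monthly embedding bill into indexing and query parts (USD)."""
    index_multiplier = (1 - inp.batch_discount) if inp.use_batch else 1.0
    index_per_pass = (inp.documents * inp.tokens_per_doc / 1_000_000
                      * inp.price_per_1m * index_multiplier)
    index_monthly = index_per_pass * inp.reindexes_per_year / 12  # amortized
    query_monthly = (inp.queries_per_month * inp.tokens_per_query / 1_000_000
                     * inp.price_per_1m)
    return {
        "index_per_pass": index_per_pass,
        "index_monthly": index_monthly,
        "query_monthly": query_monthly,
        "total_monthly": index_monthly + query_monthly,
    }

# Olivia's corpus on text-embedding-3-small, one pass per year, batch enabled.
olivia = EmbeddingInputs(500_000, 1_500, 1, 30_000, 100, 0.02, use_batch=True)
for name, usd in monthly_cost(olivia).items():
    print(f"{name}: ${usd:.3f}")
```

Keeping indexing and query cost as separate entries makes it easy to see which side of the pipeline a change in inputs actually moves.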

About this calculator: Embedding Cost - Indexing + Query Math for RAG

Embeddings are 10-30× cheaper than chat - but volume adds up. Index cost + query cost + re-embedding triggers. Real RAG pipeline math.

Inputs you control

Input | Impact on result | Range | Typical
Total docs to embed (one-time) | Initial index size. Olivia: 500K docs. | 1K – 100M | 500,000
Avg tokens per doc | Most knowledge bases: 500-3000. Long PDFs: 5K-30K. Code repos: 100-500 per file. | 50 – 50K | 1,500
Query embeddings per day (ongoing) | Each query gets embedded to find similar docs. Tiny per-query cost, multiplied by query volume. | 10 – 1M | 5,000

Outputs computed for you · model: embedding

Output | How inputs affect it
Monthly cost ($) | computed from inputs
Annual cost ($) | monthlyUsd × 12



Reading your result

Indexing is one-time but real. 750M tokens × $0.02/1M = $15. Sounds free. At 100M docs (150B tokens), it becomes $3,000. At 1B docs, $30,000. Plan for the magnitude.

Query embedding is essentially free. 5K queries/day × 100 tokens × $0.02/1M × 30 days = $0.30/mo. Don't optimize query embedding - focus on storage + retrieval.

Re-embedding is the surprise cost. Every model upgrade = full re-index. Budget for at least 1 re-embedding per year (vendors release new models). Two re-embeddings per year is normal during the optimization phase.
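A small sketch of budgeting re-embedding as its own line item. It assumes each re-embedding pass covers the full corpus at the same per-token price as the initial index, which may not hold if the replacement model is priced differently. The function name is illustrative.

```python
def yearly_index_budget(docs, tokens_per_doc, price_per_1m, reembeds_per_year):
    """Initial index plus planned full re-embeddings, in USD for the first year."""
    one_pass = docs * tokens_per_doc / 1_000_000 * price_per_1m
    return one_pass * (1 + reembeds_per_year)

# Olivia's corpus on text-embedding-3-large ($0.13/1M tokens).
for n in (0, 1, 2):
    budget = yearly_index_budget(500_000, 1_500, 0.13, n)
    print(f"{n} re-embedding(s): ${budget:.2f}")
```

Each planned re-embedding adds one more full pass, so two re-embeddings in a year triple the indexing line compared with indexing once.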

The bigger cost is downstream. Embedding cost is small. Vector DB storage + queries usually cost 10-50× more. Don't optimize the cheap line.

What "good" looks like:
  • Small embedding (text-embedding-3-small): $0.02/1M tokens - best for most cases
  • Mid embedding (Cohere v3): $0.10/1M - strong multi-language
  • Large (text-embedding-3-large): $0.13/1M - marginal quality gain on most workloads
  • Voyage 3: $0.12/1M - purpose-built for retrieval, often best quality
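To see what the price tiers above mean for a concrete corpus, the following sketch prices one full indexing pass of Olivia's 750M tokens on each model. Prices are the ones quoted in this guide; the `compare_models` helper is illustrative.

```python
# Prices in USD per 1M tokens, as listed above.
PRICE_PER_1M = {
    "text-embedding-3-small": 0.02,
    "Cohere v3": 0.10,
    "Voyage 3": 0.12,
    "text-embedding-3-large": 0.13,
}

def compare_models(total_tokens, prices):
    """Return (model, cost_usd, ratio_to_cheapest) rows sorted by cost."""
    costs = {m: total_tokens / 1_000_000 * p for m, p in prices.items()}
    cheapest = min(costs.values())
    return sorted(((m, c, c / cheapest) for m, c in costs.items()),
                  key=lambda row: row[1])

rows = compare_models(500_000 * 1_500, PRICE_PER_1M)
for model, cost, ratio in rows:
    print(f"{model:<24} ${cost:>7.2f}  {ratio:.1f}x")
```

The ratio column only compares price; it says nothing about retrieval quality, which has to be measured on your own data.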


Three real scenarios

Same calculator, three different team sizes.

$0.40 / month ≈ $4.80 / year

Internal wiki, 10K docs. Indexing: $0.30. Ongoing: ~$0.50/mo. Annual re-index: $0.30. Total per year: <$10. Don't overthink.

Healthy range: <$5 indexing + ~$1/mo ongoing

Inputs used:
totalDocsToIndex: 10,000
tokensPerDoc: 1,500
queriesPerDay: 500
embeddingModelTier: balanced
rebuildsPerYear: 1

Trade-offs

Cost isn't the only dimension.

Best fit for "cost":

  1. text-embedding-3-small $0.02/1M - best default
  2. Cohere v3 $0.10/1M - multi-language strong
  3. Voyage 3 $0.12/1M - purpose-built retrieval

Default to small models for most workloads. Quality differences are modest (5-10% retrieval recall) for most use cases. Only go large/premium for specialized domains.

Use cases

Pre-loaded scenarios for the most common applications.

$36.25 / month ≈ $435.00 / year

Codebase indexing - small docs (~300 tokens), high query volume. 4 rebuilds/year (model + chunking iteration). Total ~$60/year. Real cost is the vector DB.

Healthy range: $30-80/year (high re-embed rate)

Inputs used:
totalDocsToIndex: 500,000
tokensPerDoc: 300
queriesPerDay: 100,000
embeddingModelTier: balanced
rebuildsPerYear: 4

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

For these, use: Vector DB Cost for storage. RAG Pipeline for full architecture.

Where to go next

Vector storage cost →

Pinecone, Weaviate, Qdrant, pgvector compared.

Full RAG architecture cost →

Embedding + storage + retrieval + LLM read.

RAG vs fine-tuning math →

When to fine-tune instead.

Methodology

Source: https://platform.openai.com/docs/guides/embeddings
Extraction: Per-vendor embedding pricing pulled daily.
Editorial gate: 8-layer defense; see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

  • All prices are USD per 1 million tokens, current as of 2026-06-05.
  • Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
  • Batch API discounts are 50% off standard rates across providers that offer Batch mode.
  • Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
  • Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
  • Long-context pricing tiers apply when input exceeds model threshold.
  • Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Vendor | Last verified | URL | Snapshot coverage
Anthropic | 2026-06-05 | https://www.anthropic.com/pricing | Daily snapshot since Sep 2023 · 578 days captured
Anthropic Docs | 2026-06-05 | https://platform.claude.com/docs/en/about-claude/pricing | Daily snapshot since Sep 2023 · 578 days captured
OpenAI | 2026-06-05 | https://openai.com/api/pricing/ | Daily snapshot since Sep 2023 · 579 days captured
Google AI | 2026-06-05 | https://ai.google.dev/gemini-api/docs/pricing | Daily snapshot since Dec 2023 · 554 days captured
Google Vertex | 2026-06-05 | https://cloud.google.com/vertex-ai/generative-ai/pricing | Daily snapshot since Dec 2023 · 554 days captured
DeepSeek | 2026-06-05 | https://api-docs.deepseek.com/quick_start/pricing | Daily snapshot since May 2024 · 493 days captured
xAI | 2026-06-05 | https://x.ai/api | Daily snapshot since Nov 2024 · 411 days captured
Mistral | 2026-06-05 | https://mistral.ai/pricing | Daily snapshot since Dec 2023 · 552 days captured
Cohere | 2026-06-05 | https://cohere.com/pricing | Daily snapshot since Sep 2023 · 578 days captured

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model | Field | Why it's inferred
Anthropic / Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier.
Anthropic / Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic / Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount.
Anthropic / Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount.
Anthropic / Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention.
OpenAI / GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI / GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention.
OpenAI / GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount.
OpenAI / GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift.
OpenAI / GPT-5.2 | batchInput | Derived at 50% of input.
OpenAI / GPT-5.2 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 | cachedInput | Derived at 10% of input.
OpenAI / GPT-5 | batchInput | Derived at 50% of input.
OpenAI / GPT-5 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.5 Pro | cachedInput | Derived at 10% of input; OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI / GPT-5.5 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5.5 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention.
OpenAI / GPT-5.2 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5.2 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5.1 | batchInput | Derived at 50% of input.
OpenAI / GPT-5.1 | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 Pro | batchInput | Derived at 50% of input.
OpenAI / GPT-5 Pro | batchOutput | Derived at 50% of output.
OpenAI / GPT-5 Nano | cachedInput | Derived at 10% of input.
OpenAI / GPT-5 Nano | batchInput | Derived at 50% of input.
OpenAI / GPT-5 Nano | batchOutput | Derived at 50% of output.
Google / Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention ~90%.
Google / Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Pro | cachedInput | Derived at 10% of input.
Google / Gemini 2.5 Flash | cachedInput | Derived at 10% of input.
Google / Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates.
Google / Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention.
Google / Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount.
Google / Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount.
xAI / Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base.

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old and new values for a full audit trail.