Embedding Cost - Indexing + Query Math for RAG

Meet Olivia Garrett. Solutions Engineer building a knowledge base RAG. "I have 500K docs to embed. Then ongoing query embedding. What does this actually cost?"

🔥 Spec said 'embeddings are basically free' - invoice came back $1,200.

The story

Embeddings are 10-30× cheaper than chat - but volume hides bills. OpenAI text-embedding-3-large: $0.13/1M tokens. text-embedding-3-small: $0.02/1M. Voyage 3: $0.12/1M. Cohere v3: $0.10/1M. At million-doc scale, the small numbers compound.

Olivia's 500K docs at ~1,500 tokens each = 750M tokens to index. On text-embedding-3-small ($0.02/1M): $15 one-time. On text-embedding-3-large ($0.13/1M): about $97. The shock came from re-embedding when she upgraded models - the second pass was another $97. Ongoing query embeddings (1K queries/day × 100 tokens × 30 days = 3M tokens/mo, at $0.02/1M) come to $0.06/mo. Negligible. So why $1,200? Because she had 4 fields per doc embedded separately and tested 3 models: roughly 4 × 3 = 12 full passes, which at about $97 each (assuming every field and model costs about as much as one large-model pass) comes to roughly $1,170.
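
The arithmetic in the story can be sketched in a few lines of Python. This is a minimal illustration that uses only the figures quoted above; the function name `embedding_cost_usd` is an assumption made for this example and is not part of any provider's SDK.

```python
def embedding_cost_usd(tokens: float, price_per_million_usd: float) -> float:
    """Cost in USD of embedding `tokens` tokens at a per-1M-token price."""
    return tokens / 1_000_000 * price_per_million_usd


# Figures taken from the story above.
DOCS = 500_000
TOKENS_PER_DOC = 1_500
index_tokens = DOCS * TOKENS_PER_DOC  # 750M tokens for one full pass

index_small = embedding_cost_usd(index_tokens, 0.02)  # text-embedding-3-small
index_large = embedding_cost_usd(index_tokens, 0.13)  # text-embedding-3-large

# Ongoing queries: 1K queries/day x 100 tokens x 30 days = 3M tokens/month.
query_tokens_per_month = 1_000 * 100 * 30
query_monthly = embedding_cost_usd(query_tokens_per_month, 0.02)

print(f"index (small): ${index_small:,.2f}")
print(f"index (large): ${index_large:,.2f}")
print(f"queries/month: ${query_monthly:,.2f}")
```

The same function covers indexing and queries because embedding prices are charged on input tokens only.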

Three embedding cost levers. (1) Model choice - small vs large is a ~6.5× cost difference ($0.02 vs $0.13 per 1M tokens), often <5% quality difference. (2) Field strategy - embed one consolidated field, not 4 separate. (3) Re-embedding discipline - every model upgrade costs the full index again. Plan for it.
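
A rough way to see how the three levers multiply: the sketch below counts each (field, model, rebuild) combination as one full pass over the index. The helper `passes_cost_usd` and its parameters are hypothetical, and it assumes every pass covers the full 750M tokens at a single price, which overstates the bill if the separately embedded fields are shorter than the whole document.

```python
def passes_cost_usd(
    index_tokens: int,
    price_per_million_usd: float,
    fields: int = 1,
    models: int = 1,
    rebuilds: int = 1,
) -> float:
    """Total cost when every field/model/rebuild combination is a full pass."""
    passes = fields * models * rebuilds
    return passes * index_tokens / 1_000_000 * price_per_million_usd


INDEX_TOKENS = 750_000_000  # 500K docs x 1,500 tokens

# One consolidated field, one model, one pass at the large-model price.
baseline_large = passes_cost_usd(INDEX_TOKENS, 0.13)

# Olivia's situation: 4 separate fields, 3 models tested (assumed same price).
olivia = passes_cost_usd(INDEX_TOKENS, 0.13, fields=4, models=3)

# Pulling lever 1 and lever 2 together: small model, one consolidated field.
consolidated_small = passes_cost_usd(INDEX_TOKENS, 0.02)

print(f"baseline (large):     ${baseline_large:,.2f}")
print(f"4 fields x 3 models:  ${olivia:,.2f}")
print(f"consolidated + small: ${consolidated_small:,.2f}")
```

Setting `rebuilds` to the number of model upgrades you expect in a year turns the same function into a re-embedding budget.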

📊 CALCULATOR AT A GLANCE

🎛 Inputs you control

Each input shapes the cost. Click an input on the calculator to set it — explanations below match the live calculator field by field.

▸ Number of documents — Total documents in your corpus to embed and index.

How to choose: Count what you index today; re-indexing frequency handles ongoing growth.

▸ Avg tokens / doc — Average length of one document in tokens (chunk size times chunk count).

How to choose: About 750 words is ~1,000 tokens; use your real average, not the longest doc.

▸ Re-indexing frequency — How often you re-embed the entire corpus from scratch.

How to choose: Match content churn: static docs rarely, fast-changing knowledge bases monthly or more.

▸ Queries / month — Monthly retrieval queries; each query is embedded once before search.

How to choose: Use real query volume. Query-side cost dominates for high-traffic RAG.

▸ Avg tokens / query — Average length of a user query string in tokens.

How to choose: Search queries are short, 20 to 100 tokens is typical.

▸ Embedding model — The embedding model priced for indexing and queries.

How to choose: Balance dimensions (storage), MTEB retrieval quality, and dollars per 1M tokens.

▸ Batch API — Use the provider async batch tier for indexing jobs.

How to choose: Enable when indexing can wait hours; typically ~50% cheaper (OpenAI).
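
One way to picture how these seven inputs could combine is the sketch below. It is not the calculator's actual formula: the `EmbeddingInputs` class, the field names, and the choice to spread re-indexing cost evenly across twelve months are assumptions for illustration. The ~50% batch discount is applied to indexing only, since queries are embedded in real time.

```python
from dataclasses import dataclass, replace

BATCH_DISCOUNT = 0.5  # ~50% off, the figure quoted for the Batch API input


@dataclass
class EmbeddingInputs:
    num_documents: int
    avg_tokens_per_doc: int
    reindexes_per_year: float      # re-indexing frequency
    queries_per_month: int
    avg_tokens_per_query: int
    price_per_million_usd: float   # embedding model price
    use_batch_api: bool = False


def monthly_cost_usd(inp: EmbeddingInputs) -> float:
    """Amortized monthly re-indexing cost plus monthly query cost."""
    index_tokens = inp.num_documents * inp.avg_tokens_per_doc
    index_price = inp.price_per_million_usd
    if inp.use_batch_api:
        index_price *= BATCH_DISCOUNT  # discount applies to indexing jobs only
    one_pass = index_tokens / 1_000_000 * index_price
    reindex_monthly = one_pass * inp.reindexes_per_year / 12
    query_tokens = inp.queries_per_month * inp.avg_tokens_per_query
    query_monthly = query_tokens / 1_000_000 * inp.price_per_million_usd
    return reindex_monthly + query_monthly


# Hypothetical workload: monthly full re-index, 150K queries/month.
example = EmbeddingInputs(
    num_documents=500_000,
    avg_tokens_per_doc=1_500,
    reindexes_per_year=12,
    queries_per_month=150_000,
    avg_tokens_per_query=100,
    price_per_million_usd=0.02,
)
batched = replace(example, use_batch_api=True)

print(f"standard: ${monthly_cost_usd(example):.2f}/month")
print(f"batch:    ${monthly_cost_usd(batched):.2f}/month")
```

With numbers like these, re-indexing frequency is the input that moves the total most, which is why the batch toggle matters more than query length.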

About this calculator: Embedding Cost - Indexing + Query Math for RAG

Embeddings are 10-30× cheaper than chat, but volume adds up. This calculator covers index cost, query cost, and re-embedding triggers: the real math of a RAG pipeline.

Inputs you control

Input	Impact on result	Range	Typical
Total docs to embed (one-time)	Initial index size. Olivia: 500K docs.	1K – 100M	500000
Avg tokens per doc	Most knowledge bases: 500-3000. Long PDFs: 5K-30K. Code repos: 100-500 per file.	50 – 50K	1500
Query embeddings per day (ongoing)	Each query gets embedded to find similar docs. Tiny per-query cost, multiplied by query volume.	10 – 1M	5000

Outputs computed for you · model: `embedding`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12

Reading your result

Indexing is one-time but real. 750M tokens × $0.02/1M = $15. Sounds free. At the same 1,500 tokens per doc, 10M docs becomes $300 and 100M docs becomes $3K. Plan for the magnitude.

Query embedding is essentially free. 5K queries/day × 100 tokens × 30 days = 15M tokens/mo; at $0.02/1M that is $0.30/mo. Don't optimize query embedding - focus on storage + retrieval.

Re-embedding is the surprise cost. Every model upgrade = full re-index. Budget for at least 1 re-embedding per year (vendors release new models). Two re-embeddings per year is normal during the optimization phase.

The bigger cost is downstream. Embedding cost is small. Vector DB storage + queries usually 10-50× more. Don't optimize the cheap line.
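
To check the magnitudes in this section, here is a small sketch that sweeps corpus size at a fixed 1,500 tokens per doc and the text-embedding-3-small price. The helper names are assumptions for this example; the inputs are the ones quoted above.

```python
def index_cost_usd(docs: int, tokens_per_doc: int, price_per_million_usd: float) -> float:
    """One full indexing pass over `docs` documents."""
    return docs * tokens_per_doc / 1_000_000 * price_per_million_usd


def monthly_query_cost_usd(
    queries_per_day: int,
    tokens_per_query: int,
    price_per_million_usd: float,
    days: int = 30,
) -> float:
    """Query embedding cost over one month of `days` days."""
    tokens = queries_per_day * tokens_per_query * days
    return tokens / 1_000_000 * price_per_million_usd


PRICE_SMALL = 0.02  # text-embedding-3-small, USD per 1M tokens

# Indexing cost grows linearly with corpus size.
for docs in (500_000, 10_000_000, 100_000_000, 1_000_000_000):
    cost = index_cost_usd(docs, 1_500, PRICE_SMALL)
    print(f"{docs:>13,} docs -> ${cost:>10,.2f}")

# Ongoing queries: 5K/day at 100 tokens each.
query_monthly = monthly_query_cost_usd(5_000, 100, PRICE_SMALL)

# Budgeting two re-embeddings per year for the 500K-doc corpus.
reembed_budget = 2 * index_cost_usd(500_000, 1_500, PRICE_SMALL)

print(f"queries: ${query_monthly:.2f}/month")
print(f"re-embedding budget: ${reembed_budget:.2f}/year")
```

Because every term is linear, a 10× larger corpus or a 10× pricier model scales the bill by exactly 10×.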

What "good" looks like:

Small embedding (text-embedding-3-small): $0.02/1M tokens - best for most cases
Mid embedding (Cohere v3): $0.10/1M - strong multi-language
Large (text-embedding-3-large): $0.13/1M - marginal quality gain on most workloads
Voyage 3: $0.12/1M - purpose-built for retrieval, often best quality
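
For a side-by-side view, the sketch below prices one full pass over Olivia's 750M-token corpus on each of the four models listed. The prices are the ones quoted in this guide and may have changed; the dictionary keys are display labels, not API model identifiers.

```python
# USD per 1M tokens, as quoted in this guide (check current vendor pricing).
PRICES_PER_MILLION_USD = {
    "text-embedding-3-small": 0.02,
    "Cohere v3": 0.10,
    "Voyage 3": 0.12,
    "text-embedding-3-large": 0.13,
}

INDEX_TOKENS = 500_000 * 1_500  # 750M tokens


def index_costs(index_tokens: int, prices: dict[str, float]) -> dict[str, float]:
    """Cost of one full indexing pass for each model."""
    return {model: index_tokens / 1_000_000 * p for model, p in prices.items()}


costs = index_costs(INDEX_TOKENS, PRICES_PER_MILLION_USD)
cheapest = min(costs, key=costs.get)

# Print cheapest first, with each model's multiple of the cheapest option.
for model, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    ratio = cost / costs[cheapest]
    print(f"{model:<24} ${cost:>7.2f}  ({ratio:.1f}x cheapest)")
```

The absolute gap is under $100 per pass at this corpus size, so retrieval quality on your own data is usually the better tiebreaker.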

Embedding model providers right now

Verified 20 hours ago

1. GPT-5 Mini: $0.250 in · $2.00 out
2. Command: $1.00 in · $2.00 out
3. devstral-2: $0.400 in · $2.00 out

Three real scenarios

Same calculator, three different team sizes. Click a tab to see how the numbers shift.

$0.40 / month ≈ $4.80 / year

Internal wiki, 10K docs. Indexing: $0.30. Ongoing: ~$0.50/mo. Annual re-index: $0.30. Total per year: <$10. Don't overthink.

Healthy range: <$5 indexing + ~$1/mo ongoing

Inputs used:

totalDocsToIndex: 10,000
tokensPerDoc: 1,500
queriesPerDay: 500
embeddingModelTier: balanced
rebuildsPerYear: 1

Trade-offs

Cost isn't the only dimension. Recommendations change with the constraint you prioritize.

Best fit for "cost":

text-embedding-3-small $0.02/1M - best default
Cohere v3 $0.10/1M - multi-language strong
Voyage 3 $0.12/1M - purpose-built retrieval

Default to small models for most workloads. Quality differences are modest (5-10% retrieval recall) for most use cases. Only go large/premium for specialized domains.

Use cases

Pre-loaded scenarios for the most common applications, with realistic numbers.

$36.25 / month ≈ $435.00 / year

Codebase indexing - small docs (~300 tokens), high query volume. 4 rebuilds/year (model + chunking iteration). Total ~$60/year. Real cost is the vector DB.

Healthy range: $30-80/year (high re-embed rate)

Inputs used:

totalDocsToIndex: 500,000
tokensPerDoc: 300
queriesPerDay: 100,000
embeddingModelTier: balanced
rebuildsPerYear: 4

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

Doesn't include vector DB storage cost (often 10-50× embedding cost).
Doesn't model embedding rate limits (some providers throttle below claimed rate).
Doesn't model batch discounts (some embedding providers offer 50% off batch).
Quality differences vary by domain - test on your actual retrieval task.

For these, use: Vector DB Cost for storage. RAG Pipeline for full architecture.

Where to go next

Vector storage cost → Pinecone, Weaviate, Qdrant, pgvector compared.
Full RAG architecture cost → Embedding + storage + retrieval + LLM read.
RAG vs fine-tuning math → When to fine-tune instead.

Methodology

Source: https://platform.openai.com/docs/guides/embeddings
Extraction: Per-vendor embedding pricing pulled daily.
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Vendor	Last verified	URL	Snapshot history
Anthropic	2026-06-05	https://www.anthropic.com/pricing	Daily snapshot since Sep 2023 · 578 days captured
Anthropic Docs	2026-06-05	https://platform.claude.com/docs/en/about-claude/pricing	Daily snapshot since Sep 2023 · 578 days captured
OpenAI	2026-06-05	https://openai.com/api/pricing/	Daily snapshot since Sep 2023 · 579 days captured
Google AI	2026-06-05	https://ai.google.dev/gemini-api/docs/pricing	Daily snapshot since Dec 2023 · 554 days captured
Google Vertex	2026-06-05	https://cloud.google.com/vertex-ai/generative-ai/pricing	Daily snapshot since Dec 2023 · 554 days captured
DeepSeek	2026-06-05	https://api-docs.deepseek.com/quick_start/pricing	Daily snapshot since May 2024 · 493 days captured
xAI	2026-06-05	https://x.ai/api	Daily snapshot since Nov 2024 · 411 days captured
Mistral	2026-06-05	https://mistral.ai/pricing	Daily snapshot since Dec 2023 · 552 days captured
Cohere	2026-06-05	https://cohere.com/pricing	Daily snapshot since Sep 2023 · 578 days captured
Voyage AI	2026-06-05	https://docs.voyageai.com/docs/pricing	

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.
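
The derivation rule behind most rows in this table can be sketched as follows. The function name `infer_rates` is hypothetical, and the $1.00 / $4.00 base prices are illustrative placeholders, not any vendor's published rates. The `cached_fraction` parameter exists because a few rows above use 25% instead of the usual 10%.

```python
def infer_rates(
    input_price: float,
    output_price: float,
    cached_fraction: float = 0.10,  # cached input = 10% of base (90% off)
    batch_fraction: float = 0.50,   # Batch API = 50% of base (50% off)
) -> dict[str, float]:
    """Derive the inferred fields from published input/output prices."""
    return {
        "cachedInput": input_price * cached_fraction,
        "batchInput": input_price * batch_fraction,
        "batchOutput": output_price * batch_fraction,
    }


# Illustrative base prices in USD per 1M tokens (not real vendor rates).
typical = infer_rates(input_price=1.00, output_price=4.00)

# Rows that note a 25% cached rate override the default fraction.
exception = infer_rates(input_price=1.00, output_price=4.00, cached_fraction=0.25)

print(typical)
print(exception)
```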

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →
