Guides → Playground & Guide → RAG Pipeline Cost - Full Stack from Index to Answer

RAG Pipeline Cost - Full Stack from Index to Answer

Meet Krishna Iyer. Tech Lead designing a customer-facing knowledge bot. "What does a 1M-doc RAG pipeline actually cost to run end-to-end?"

🔥 Board approved RAG project. CFO wants total cost. I have 5 vendor invoices to combine.

The story

RAG cost has 5 line items, not 1. (1) Embedding indexing - one-time per re-embed. (2) Vector DB storage - monthly. (3) Query embedding - per query. (4) Vector DB query - per query. (5) LLM read with retrieved context - per query, dominates cost. Most teams only model line 5 and miss the rest.

Krishna's 1M doc RAG: 1.5B tokens to embed (one-time, $30 with small model). Vector DB: 1M × 1536 dims = 6GB storage = $50/mo on hosted. 5K queries/day = 150K/mo. Each query: 100-token query embed ($0.0001), 1 vector search ($0.0001 hosted), then LLM read with 6K of retrieved context + 500-token answer (~$0.025 on Sonnet). Per query: $0.025. Monthly: $3,750.

The LLM read dominates at 95%+ of monthly cost. Embedding/retrieval are negligible. Optimization focus: reduce tokens-per-LLM-read (smarter retrieval, better chunking, output compression). Don't over-engineer the cheap parts.

📊 CALCULATOR AT A GLANCE

🚀 Open the full calculator ✉️ Email [email protected]

🎛 Inputs you control

Each input shapes the cost. Click an input on the calculator to set it — explanations below match the live calculator field by field.

▸ Monthly queries — User queries that hit the RAG pipeline each month.

How to choose: Use real traffic; generation cost scales directly with this.

▸ Retrievals per query — Retrieval calls per query (multi-hop or sub-query fan-out).

How to choose: Simple Q&A is 1; agentic or multi-hop is 2 to 3.

▸ Corpus size (docs) — Total documents in the knowledge base to index.

How to choose: Count what you index now; affects one-time indexing cost.

▸ Tokens per doc — Average document length in tokens.

How to choose: About 750 words is ~1,000 tokens; use your real average.

▸ Chunk size — Tokens per chunk when splitting documents for embedding.

How to choose: Smaller chunks improve precision but multiply vector count; 512 is a common default.

▸ Chunk overlap — Tokens shared between adjacent chunks to preserve context.

How to choose: More overlap helps recall slightly but raises chunk count and storage.

▸ Retrieval top-K — Number of chunks retrieved and fed to the generator per query.

How to choose: Higher K improves recall but adds input tokens (and cost) every query.

▸ Full re-indexes / year — How many times per year you re-embed the whole corpus.

How to choose: Match content churn; static corpora rarely need more than 1 to 2.

▸ Prompt cache hit rate — Share of requests reusing cached context (system prompt, retrieved chunks).

How to choose: Higher hit rate cuts generation input cost; set 0 if context is always fresh.

▸ Embedding model — Model used to embed chunks and queries.

How to choose: Balance retrieval quality (MTEB), dimensions/storage, and dollars per 1M tokens.

▸ Vector DB — Vector database storing and serving the embeddings.

How to choose: Managed is lower ops; self-hosted is cheaper at scale if you have the team.

▸ Generation model — LLM that writes the answer from retrieved context.

How to choose: Usually the dominant cost; route cheaper models for simple answers.

About this calculator: RAG Pipeline Cost - Full Stack from Index to Answer

RAG isn't one cost - it's five. Embedding indexing + storage + query embedding + retrieval + LLM read. Real architecture math for production RAG.

Inputs you control

Input	Impact on result	Range	Typical
Total docs indexed	Krishna: 1M. Most production RAG: 100K-10M.	10K – 100M	1000000
Queries per day	Each user search/question = 1 query.	100 – 1M	5000
Retrieved tokens per query	Sum of retrieved chunks. 4 chunks × 1500 tokens each = 6K. More chunks = better recall, higher cost.	1K – 50K	6000
Output tokens per query	Generated answer length. Q&A: 200-600. Detailed analysis: 1500+.	50 – 4K	500

Outputs computed for you · model: `token`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12

Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.

What you're looking at

Each input shapes your cost. Move the slider — see the impact.

Total docs indexed 1,000,000

Krishna: 1M. Most production RAG: 100K-10M.

Estimated: —

Queries per day 5,000

Each user search/question = 1 query.

Estimated: —

Retrieved tokens per query 6,000

Sum of retrieved chunks. 4 chunks × 1500 tokens each = 6K. More chunks = better recall, higher cost.

Estimated: —

Output tokens per query 500

Generated answer length. Q&A: 200-600. Detailed analysis: 1500+.

Estimated: —

Ready to run the numbers?

Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.

🚀 Open the full calculator →

Reading your result

5 line items, but LLM dominates. At Krishna's scale: $30 indexing (one-time), $50/mo storage, ~$5/mo query embed + retrieval, ~$3,700/mo LLM read. LLM is 96% of monthly.

Optimization order: reduce LLM cost first. Caching (system prompt + repeated chunks) cuts 20-40%. Smarter retrieval (fewer, more relevant chunks) cuts 15-30%. Tighter outputs cut 10-15%. Stack these for 50%+ savings.

Don't over-optimize embedding/storage. They're already <5% of monthly. Saving 50% of $50 = $25/mo. Eng time better spent on LLM-side optimization.

Watch chunking strategy. Too small = lost context = more retrieved chunks = more cost. Too big = less relevant = lower quality. Sweet spot is workload-dependent.

What "good" looks like:

Healthy production RAG (1M docs, 5K queries/day): $2-4K/mo
Lean RAG (small + heavy caching): $500-1.5K/mo
Premium RAG (medical/legal, premium tier): $5-15K/mo
Consumer-scale (1M+ queries/day): $20-100K/mo, optimization mandatory

LLM tier matters most for RAG cost

Verified 20 hours ago

1

GPT-5 Mini

$0.250 in · $2.00 out ·
2

Command

$1.00 in · $2.00 out ·
3

devstral-2

$0.400 in · $2.00 out ·

Three real scenarios

Same calculator, three different team sizes. Click a tab to see how the numbers shift.

$299.95 / month ≈ $3,599 / year

Internal knowledge base, 100K docs, 1K queries/day. pgvector storage marginal cost. Standard 4-chunk retrieval. ~$500/mo.

Healthy range: $300-700/mo

See inputs used

totalDocsIndexed: 100,000
queriesPerDay: 1,000
retrievedTokensPerQuery: 4,000
outputTokensPerQuery: 400
avgTokensPerDoc: 1,500
embeddingModelTier: small
storageProvider: pgvector
llmTier: balanced
workingDaysPerMonth: 30

Trade-offs

Cost isn't the only dimension. Click any constraint — see how recommendations change.

What matters most to you? Click any dimension — recommendations update.

Best fit for "cost":

Optimize LLM tier first 95% of bill
Caching is highest leverage 20-40% savings
Self-host vector DB at >1M vectors 10-30× cheaper than hosted

RAG cost optimization order: LLM tier > caching > retrieval reduction > vector DB > embeddings. Don't reverse this - it's where your money actually goes.

Use cases

Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.

$109.98 / month ≈ $1,320 / year

Engineering wiki RAG. Modest scale. pgvector + balanced LLM. ~$220/mo. Don't over-engineer.

Healthy range: $150-350/mo

See inputs used

totalDocsIndexed: 50,000
queriesPerDay: 500
retrievedTokensPerQuery: 4,000
outputTokensPerQuery: 400
avgTokensPerDoc: 1,200
embeddingModelTier: small
storageProvider: pgvector
llmTier: balanced
workingDaysPerMonth: 22

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

Doesn't model embedding quality on your specific corpus (recall@k varies).
Reranker stage isn't included - adds 10-30% cost if used, often quality-positive.
Hybrid search (dense + sparse) costs differently - see hybrid-search-cost guide.
Doesn't model retraining triggers (model upgrades, chunking iteration).

For these, use: Embedding Cost for index detail. Vector DB for storage. Hybrid Search for retrieval upgrade.

Where to go next

Drill into embedding cost →

Indexing + query embeddings.

Drill into vector storage cost →

Pinecone, Weaviate, Qdrant, pgvector.

Add sparse retrieval (hybrid) →

Dense + sparse beats dense alone for many use cases.

Methodology

Source: /ai-cost-economics
Extraction: Per-component pricing pulled daily. Architecture patterns from production case studies.
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.

View 3-year history for →

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Anthropic

2026-06-05

https://www.anthropic.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Anthropic Docs

2026-06-05

https://platform.claude.com/docs/en/about-claude/pricing

Daily snapshot since Sep 2023 · 578 days captured

OpenAI

2026-06-05

https://openai.com/api/pricing/

Daily snapshot since Sep 2023 · 579 days captured

Google AI

2026-06-05

https://ai.google.dev/gemini-api/docs/pricing

Daily snapshot since Dec 2023 · 554 days captured

Google Vertex

2026-06-05

https://cloud.google.com/vertex-ai/generative-ai/pricing

Daily snapshot since Dec 2023 · 554 days captured

DeepSeek

2026-06-05

https://api-docs.deepseek.com/quick_start/pricing

Daily snapshot since May 2024 · 493 days captured

xAI

2026-06-05

https://x.ai/api

Daily snapshot since Nov 2024 · 411 days captured

Mistral

2026-06-05

https://mistral.ai/pricing

Daily snapshot since Dec 2023 · 552 days captured

Cohere

2026-06-05

https://cohere.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Voyage AI

2026-06-05

https://docs.voyageai.com/docs/pricing

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →

RAG Pipeline Cost - Full Stack from Index to Answer

The story

🎛 Inputs you control

About this calculator: RAG Pipeline Cost - Full Stack from Index to Answer

Inputs you control

Outputs computed for you · model: `token`

What you're looking at

Ready to run the numbers?

Reading your result

LLM tier matters most for RAG cost

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

The story

🎛 Inputs you control

About this calculator: RAG Pipeline Cost - Full Stack from Index to Answer

Inputs you control

Outputs computed for you · model: token

What you're looking at

Ready to run the numbers?

Reading your result

LLM tier matters most for RAG cost

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

Outputs computed for you · model: `token`