RAG Pipeline Cost · for RAG builders & CTOs
Full RAG stack cost - in one calculator
Embeddings + vector DB + rerank + generation, with prompt-cache savings modeled. Pick a preset or build your own stack across 9 embedding models, 8 vector DBs, and 101 generation models.
Pricing verified: 2026-06-03
5-stage cost model
Cache-aware
What this calculator does
End-to-end cost of a RAG stack — embeddings + vector DB + rerank + generation + prompt cache — in one place.
Why use it
- See which of the 5 stages actually dominates your bill (usually generation)
- Compare 4 pre-built vendor stacks (Budget, Balanced, Premium, Self-host) at your workload
- Avoid the classic RAG cost traps: top-K bloat, missing prompt cache, aggressive re-indexing
- Get a shareable URL that captures every input — send to your team as a decision artifact
1
Enter key values
Enter monthly queries and chunk size × top-K (together these drive generation cost), then pick a generation model. Leave the rest at defaults.
2
Pick a preset
Click Budget / Balanced / Premium / Self-host. Each preset swaps the embedding, vector DB, and generation model in one click.
3
Read the big number
Top card = total monthly cost. Green bar below shows which of the 5 stages eats the most.
4
Act on it
If generation is more than 60% of the total, try a cheaper generation model or enable prompt cache. If storage costs more than generation, check whether you need a managed vector DB or could self-host.
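The two rules of thumb in step 4 can be sketched as a small check over a monthly cost breakdown. This is a minimal illustration, not the calculator's own logic: the function name `advise`, the stage keys, and the mapping of "storage" to the vector DB stage are assumptions, and the dollar figures are made up.

```python
def advise(stage_costs: dict[str, float]) -> list[str]:
    """Apply the step-4 rules of thumb to a monthly cost breakdown (USD)."""
    total = sum(stage_costs.values())
    tips: list[str] = []
    if total <= 0:
        return tips
    generation = stage_costs.get("generation", 0.0)
    # Assumption: "storage" on the page corresponds to the vector DB stage.
    storage = stage_costs.get("vector_db", 0.0)
    if generation / total > 0.60:
        tips.append("Generation is over 60% of total: try a cheaper "
                    "generation model or enable prompt cache.")
    if storage > generation:
        tips.append("Storage exceeds generation: check managed vs "
                    "self-hosted vector DB.")
    return tips


# Made-up monthly figures, for illustration only.
example = {"ingest": 5.0, "vector_db": 70.0, "query_embed": 2.0,
           "rerank": 20.0, "generation": 300.0}
tips = advise(example)
for tip in tips:
    print(tip)
```

The thresholds are kept as plain comparisons so they are easy to swap for whatever cut-offs a team prefers.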
1
Enter every field
Start with monthly queries (the single biggest driver). Then corpus docs × tokens-per-doc for storage. Chunk size (256/512/1024) affects both storage and generation — 512 is the safe default. Top-K retrieved chunks are shoved into every generation call, so higher K = higher bill. Re-indexes/year only matters if you change embedding models; leave at 2 unless you know otherwise.
2
Pick a preset then tune
Presets swap 3 things together: embedding model, vector DB, generation model. Budget = OpenAI small + pgvector + GPT-5-mini. Premium = Voyage + Pinecone + Claude Sonnet 4.6. Then adjust the prompt cache hit rate slider — 40% is typical for RAG with stable docs; 0% if your context changes every call. Toggle reranker for an extra ~$0.002/query in exchange for better retrieval quality.
3
Read every panel
Top card = monthly total + per-query cost. 5-stage breakdown shows ingest / vector DB / query embedding / rerank / generation — the cost-share bar visualizes proportions. Comparison table runs all 4 presets at your workload; green row wins on cost. Recommendations panel auto-flags top-K bloat, cache misuse, and vendor mismatches.
4
Next actions
If a single stage dominates (>60%), optimize there first. If generation is biggest, see Multi-Model Router to route easy queries to a cheaper model. If you have a cache-capable model at a 0% hit rate, visit Prompt Cache ROI; caching could take another 30-50% off. At >500K queries/mo, consider whether fine-tuning beats RAG entirely (RAG vs Fine-Tune).
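To make the 5-stage breakdown concrete, here is a rough sketch of how the inputs described in these steps could combine into monthly costs. It is not the calculator's actual formula. Every price is a placeholder except the ~$0.002/query rerank figure quoted above, and the class names, the flat vector DB fee, the assumed question and answer lengths, and the treatment of re-indexing (whole corpus re-embedded, spread over 12 months) are all assumptions. Prompt cache is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    monthly_queries: int
    corpus_docs: int
    tokens_per_doc: int
    chunk_size: int = 512          # the page's "safe default"
    top_k: int = 5                 # hypothetical default
    reindexes_per_year: int = 2    # the page's suggested default
    use_reranker: bool = True


@dataclass
class Prices:
    # Placeholder USD prices, NOT the calculator's price table.
    embed_per_mtok: float = 0.02
    vector_db_monthly: float = 70.0    # flat fee; ignores chunk-size effects
    gen_input_per_mtok: float = 3.00
    gen_output_per_mtok: float = 15.00
    rerank_per_query: float = 0.002    # the page quotes ~$0.002/query
    query_tokens: int = 50             # assumed question length
    output_tokens: int = 300           # assumed answer length


def monthly_costs(w: Workload, p: Prices) -> dict[str, float]:
    corpus_mtok = w.corpus_docs * w.tokens_per_doc / 1e6
    # Assumption: each re-index re-embeds the whole corpus, amortized monthly.
    ingest = corpus_mtok * p.embed_per_mtok * w.reindexes_per_year / 12
    query_embed = w.monthly_queries * p.query_tokens / 1e6 * p.embed_per_mtok
    rerank = w.monthly_queries * p.rerank_per_query if w.use_reranker else 0.0
    # Every generation call carries top_k chunks plus the question itself.
    input_tokens = w.chunk_size * w.top_k + p.query_tokens
    per_call = (input_tokens * p.gen_input_per_mtok
                + p.output_tokens * p.gen_output_per_mtok) / 1e6
    generation = w.monthly_queries * per_call
    return {"ingest": ingest, "vector_db": p.vector_db_monthly,
            "query_embed": query_embed, "rerank": rerank,
            "generation": generation}


workload = Workload(monthly_queries=100_000, corpus_docs=10_000,
                    tokens_per_doc=2_000)
costs = monthly_costs(workload, Prices())
total = sum(costs.values())
for stage, cost in costs.items():
    print(f"{stage:12s} ${cost:10.2f}  {cost / total:6.1%}")
```

With these placeholder prices, generation is the largest stage, which matches the page's "usually generation" remark, but the ranking depends entirely on the prices and workload plugged in.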
📊 Calculator at a glance
🎛 CALCULATOR
🧩 Your RAG workload
Start with a preset, then tweak.
User queries that hit your RAG pipeline.
Multi-hop or sub-query fan-out. Simple Q&A = 1; agent-style = 2-3.
Context overlap across requests. 40% is typical for stable docs; set to 0 if every query has fresh context.
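One way to picture how the fan-out and cache hit rate inputs interact is a blended input-token price. The sketch below is a simplification under stated assumptions: fan-out is treated as a multiplier on generation calls, cached input tokens are assumed to be billed at a 90% discount (the `cached_discount` parameter is hypothetical and varies by vendor), and cache-write surcharges are ignored. The token count and price are placeholders.

```python
def generation_input_cost(monthly_queries: int,
                          fan_out: float,
                          context_tokens: int,
                          input_price_per_mtok: float,
                          cache_hit_rate: float,
                          cached_discount: float = 0.9) -> float:
    """Monthly input-token cost (USD) for the generation stage."""
    # Assumption: each sub-query triggers its own generation call.
    calls = monthly_queries * fan_out
    # Tokens that hit the cache are billed at (1 - cached_discount) of list price.
    blended_price = input_price_per_mtok * (1 - cache_hit_rate * cached_discount)
    return calls * context_tokens / 1e6 * blended_price


# Placeholder workload: 100K queries, 2,610 input tokens per call, $3 per 1M tokens.
no_cache = generation_input_cost(100_000, 1, 2_610, 3.0, cache_hit_rate=0.0)
with_cache = generation_input_cost(100_000, 1, 2_610, 3.0, cache_hit_rate=0.4)
agent_style = generation_input_cost(100_000, 2, 2_610, 3.0, cache_hit_rate=0.4)
print(f"no cache:      ${no_cache:.2f}")
print(f"40% hit rate:  ${with_cache:.2f}")
print(f"fan-out of 2:  ${agent_style:.2f}")
```

Under these assumptions, fan-out scales cost linearly while the hit rate only discounts the input side, so output tokens are unaffected.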
📈 RESULTS
Monthly cost for this stack
Ingest
Vector DB
Query embed
Rerank
Generation
💡 Recommendations
📋 Compare all 4 preset stacks at your workload
Same queries, same corpus - vendor mix varies. Green row = cheapest, gold = your current config.
| Stack | Components | Per query | Monthly | Annual |
|---|---|---|---|---|
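The columns of the comparison table can be derived from one monthly total per stack. The sketch below is illustrative only: the monthly figures are invented, the function name `comparison_rows` is hypothetical, and annual cost is assumed to be simply 12 times the monthly figure.

```python
def comparison_rows(monthly_by_stack: dict[str, float],
                    monthly_queries: int) -> list[dict]:
    """Build per-query / monthly / annual rows, cheapest stack first."""
    rows = [
        {
            "stack": name,
            "per_query": monthly / monthly_queries,
            "monthly": monthly,
            "annual": monthly * 12,  # assumption: flat workload all year
        }
        for name, monthly in monthly_by_stack.items()
    ]
    rows.sort(key=lambda row: row["monthly"])
    return rows


# Invented monthly totals for the four presets at 100K queries/month.
totals = {"Budget": 120.0, "Balanced": 340.0, "Premium": 900.0, "Self-host": 260.0}
rows = comparison_rows(totals, monthly_queries=100_000)
for row in rows:
    print(f"{row['stack']:10s} ${row['per_query']:.4f}/query  "
          f"${row['monthly']:8.2f}/mo  ${row['annual']:9.2f}/yr")
```

The first row after sorting plays the role of the green "cheapest" row.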
🎯 Use this result to
- 📚 Budget your RAG pipeline: embedding + retrieval + completion + reranking. Full TCO, not just LLM cost.
- 🔍 Find your cost bottleneck: usually one stage dominates. The calculator surfaces which one and how much it costs.
- 📈 Project at scale: linear at first, non-linear above 10K queries/day. See your scaling curve.
- 🔌 Integrate with your AI agents: MCP is available for agentic workflow integration and cost-aware RAG routing.
📋 What now?
- Generation usually dominates — at scale the LLM answer call is most of the bill, so routing the generation model (or raising cache hit rate) moves cost far more than tweaking chunking.
- Embeddings are mostly one-time — indexing is paid once; only re-indexing and query embeddings recur, so a big corpus is cheaper to run than it looks.
- Top-K and rerank are quality/cost dials — higher top-K and a reranker improve recall but add per-query cost; tune against real answer quality.
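The last point can be framed as a marginal cost per query: each extra retrieved chunk adds `chunk_size` input tokens to the generation call, and a reranker adds a flat fee. The sketch below uses the ~$0.002/query rerank figure quoted on this page. The input price is a placeholder and the function name is hypothetical. It covers only the cost side, so any change still needs to be checked against real answer quality.

```python
RERANK_PER_QUERY = 0.002  # USD, the approximate figure quoted on this page


def per_query_delta(chunk_size: int,
                    top_k_from: int,
                    top_k_to: int,
                    input_price_per_mtok: float,
                    add_reranker: bool = False) -> float:
    """Extra cost per query (USD) from changing top-K and optionally adding a reranker."""
    extra_tokens = (top_k_to - top_k_from) * chunk_size
    delta = extra_tokens / 1e6 * input_price_per_mtok
    if add_reranker:
        delta += RERANK_PER_QUERY
    return delta


# Placeholder: 512-token chunks, $3 per 1M input tokens, top-K raised from 5 to 10.
k_only = per_query_delta(512, 5, 10, 3.0)
k_plus_rerank = per_query_delta(512, 5, 10, 3.0, add_reranker=True)
print(f"top-K 5 -> 10:           +${k_only:.5f}/query")
print(f"top-K 5 -> 10 + rerank:  +${k_plus_rerank:.5f}/query")
```

Multiplying the delta by monthly queries gives the monthly impact, and a negative delta (lowering top-K) shows the saving instead.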