Hybrid Search Cost - Dense + Sparse Retrieval
Meet Andre Williams. Senior Engineer building product search. "Pure semantic search misses exact-match queries (SKUs, product names). Hybrid search adds complexity. Worth it?"
🔥 A user search for 'AirPods Pro 2' returns AirPods Max. Recall is broken.
Pure semantic search has a known weakness: exact-match queries. User searches 'GPT-5-Pro pricing' - semantic retrieval finds 'GPT-4 cost', 'pricing models', 'Anthropic Pro plans'. Misses the exact thing they asked for. BM25 (sparse keyword search) excels at this. Hybrid combines both.
Andre's product search returns ~20% irrelevant results when users type exact SKUs or product names. Hybrid search (BM25 + dense, weighted ensemble) drops this to ~5%. Cost increase: ~15-25% (running two indexes + reranking). Quality lift: 4-15% on recall@5. Worth it for most product/code/document search workloads.
Two architectures. (1) Two-index hybrid: dense vector + BM25 separately, score-fused. (2) Late-interaction (ColBERT, hybrid embedding models): single index with dense+sparse properties. Two-index is simpler to set up; late-interaction is more elegant but newer.
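A minimal sketch of the score fusion step in a two-index setup, assuming each index returns a `{doc_id: score}` map. The function names, the `dense_weight` parameter, the min-max normalization, and the sample scores are illustrative assumptions, not any vendor's API. Rank-based fusion is a common alternative.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and cosine scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    bm25: Dict[str, float],
    dense: Dict[str, float],
    dense_weight: float = 0.5,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a doc missing from one index scores 0 there."""
    bm25_n = min_max_normalize(bm25)
    dense_n = min_max_normalize(dense)
    fused = {
        doc_id: (1 - dense_weight) * bm25_n.get(doc_id, 0.0)
        + dense_weight * dense_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(dense_n)
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical scores for the query 'AirPods Pro 2'.
bm25 = {"airpods-pro-2": 12.0, "airpods-pro-case": 7.5, "airpods-max": 3.0}
dense = {"airpods-max": 0.83, "airpods-pro-2": 0.80, "airpods-3": 0.78}

ranked = fuse_scores(bm25, dense, dense_weight=0.5)
print(ranked[0][0])
```

In this toy data the dense index alone ranks AirPods Max first, while the fused ranking puts the exact product name on top. The weight is a tuning knob and should be chosen against labeled queries.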
Hybrid retrieval (BM25 + dense) beats pure semantic for most workloads. Real cost vs recall math, and when the extra complexity pays back.
Two-index hybrid roughly doubles vector DB cost. The same documents are indexed twice (dense vectors + a BM25 index). Storage and query time are both ~2× pure dense. Often an acceptable trade-off.
Reranking is the quality lever and the cost driver. A cross-encoder (e.g., Cohere Rerank, BGE Reranker) on top-20 candidates costs ~$0.001-0.005 per query. At 50K queries/day that is $50-250/day, or roughly $1,500-7,500/mo at those per-query prices. Quality lift is typically 5-15% recall@5.
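The reranking arithmetic can be worked through directly from the per-query price range and 50K queries/day. This is a sketch; the function name and the 30-day month are assumptions made for illustration.

```python
def monthly_rerank_cost(
    queries_per_day: int,
    cost_per_query: float,
    days_per_month: int = 30,  # assumption: flat 30-day month
) -> float:
    """Reranker spend per month for a fixed per-query price."""
    return queries_per_day * cost_per_query * days_per_month


low = monthly_rerank_cost(50_000, 0.001)
high = monthly_rerank_cost(50_000, 0.005)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

A self-hosted reranker or a smaller rerank window changes the per-query figure, so the per-query price is the input worth measuring first.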
Quality vs cost depends on query mix. Pure exact-match queries (SKUs): hybrid wins decisively. Pure semantic queries (paraphrases): pure dense wins slightly. Mixed real-world queries: hybrid wins by 4-10% on average.
Three scenarios at different scales:
Andre's e-commerce product search. 1M products, 50K queries/day. Hybrid + reranker adds ~$300/mo on top of dense-only. Recall@5 jumps from ~75% to ~85%. Conversion impact justifies cost.
Healthy range: +15-25% cost, +4-15% recall
Code search is heavy on exact-match queries (function names, type signatures). BM25 alone is good; dense alone misses. Hybrid wins decisively. ~$1.5K/mo extra cost vs dense-only at this scale.
Healthy range: Hybrid mandatory for code search
Internal knowledge base, 100K docs, paraphrase-heavy queries. Pure semantic search at 75-80% recall is good enough. Hybrid would add complexity for marginal gain. Skip it.
Healthy range: Pure dense fine, skip hybrid
Cost isn't the only dimension.
Hybrid cost is real but typically <30% of total RAG bill. The recall lift usually justifies it for product/code/technical search.
Hybrid retrieval finds exactly what the user asked for → the LLM doesn't have to compensate → fewer made-up answers. Particularly valuable for fact-heavy domains.
Hybrid doesn't change compliance - same data, two storage formats. All under same access controls.
BM25 retrievals are explainable (matched these specific keywords). Dense retrievals are opaque (high cosine similarity, but why?). Hybrid offers some interpretability.
Reranking is slow. For sub-second UX (chat, autocomplete), use a small reranker or skip it. For Q&A and search, the added latency is acceptable.
Hybrid search APIs vary. Pinecone has hybrid score; Weaviate has BM25 + dense fusion; Qdrant has named vectors. Migration is more painful than pure dense.
Hybrid adds tuning surface: BM25 vs dense weight, top-N for each, rerank top-N. Build an eval pipeline so you can optimize without breaking quality.
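One possible shape for such an eval pipeline: compute recall@k over a labeled query set and sweep the dense weight. The `search` callable, the labeled data, and `toy_search` are hypothetical stand-ins for a real retrieval stack.

```python
from typing import Callable, Dict, List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def sweep_dense_weight(
    search: Callable[[str, float], List[str]],  # hypothetical: (query, dense_weight) -> ranked ids
    labeled_queries: Dict[str, Set[str]],
    weights: Sequence[float],
    k: int = 5,
) -> Dict[float, float]:
    """Mean recall@k for each candidate dense weight."""
    results = {}
    for w in weights:
        per_query = [recall_at_k(search(q, w), rel, k) for q, rel in labeled_queries.items()]
        results[w] = sum(per_query) / len(per_query)
    return results


def toy_search(query: str, dense_weight: float) -> List[str]:
    """Stand-in retriever: loses the exact SKU once the dense weight gets too high."""
    return ["sku-1", "x"] if dense_weight < 0.8 else ["x", "y"]


labeled = {"SKU 1": {"sku-1"}}
scores = sweep_dense_weight(toy_search, labeled, weights=[0.5, 1.0])
best_weight = max(scores, key=scores.get)
print(best_weight, scores)
```

The same harness can be rerun whenever top-N or rerank settings change, so a regression on exact-match queries shows up before it ships.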
Scenarios for the most common applications:
Developer docs have lots of exact-match content (API names, error messages, config keys). Hybrid + small reranker. ~$50/mo extra. Quality lift on technical queries is dramatic.
Healthy range: Hybrid wins on API names, error codes
Academic search needs both: exact-match author/title queries AND semantic concept queries. Hybrid + larger rerank window. Premium but justified.
Healthy range: Hybrid + reranker for citation accuracy
Consumer recommendation engine. Users describe preferences semantically ('cozy mystery novels', 'hip-hop dance music'). Exact-match rare. Pure dense is fine.
Healthy range: Pure dense - exact match rare
Support agents search both by ticket ID (exact) and by similar issues (semantic). Hybrid required for both query types. Reranker boosts similar-ticket relevance.
Healthy range: Hybrid for ticket-ID lookup + similarity
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: RAG Pipeline for full architecture. Chunking Optimizer for upstream.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs).
10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic — Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic — Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic — Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate — Anthropic 90% cache-hit discount convention. |
| OpenAI — GPT-5.4 Mini | cachedInput | Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI — GPT-5.4 Nano | cachedInput | Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Nano | batchInput | Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Nano | batchOutput | Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | cachedInput | Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Pro | batchInput | Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | batchOutput | Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift. |
| OpenAI — GPT-5.2 | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.2 | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5 | cachedInput | Derived at 10% of input. |
| OpenAI — GPT-5 | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5 | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5.5 Pro | cachedInput | Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI — GPT-5.5 Pro | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5.2 Pro | cachedInput | Derived at 10% of input — pro-tier convention. |
| OpenAI — GPT-5.2 Pro | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.2 Pro | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5.1 | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.1 | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5 Pro | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5 Nano | cachedInput | Derived at 10% of input. |
| OpenAI — GPT-5 Nano | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5 Nano | batchOutput | Derived at 50% of output. |
| Google — Gemini 3 Flash | cachedInput | Derived at 10% of input — Google caching discount convention ~90%. |
| Google — Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input — Google caching convention. |
| Google — Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Pro | cachedInput | Derived at 10% of input. |
| Google — Gemini 2.5 Flash | cachedInput | Derived at 10% of input. |
| Google — Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates. |
| Google — Gemini 2.0 Flash | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| xAI — Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base. |
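The derivation convention (cached input at 10% of base, Batch at 50% of base) amounts to a pair of multipliers. A sketch, with `derive_rates` and the example base rates being hypothetical placeholders rather than any vendor's published prices:

```python
from typing import Dict


def derive_rates(
    input_rate: float,
    output_rate: float,
    cached_ratio: float = 0.10,  # typical convention: cached input = 10% of base
    batch_ratio: float = 0.50,   # typical convention: Batch API = 50% of base
) -> Dict[str, float]:
    """Infer unpublished fields from base rates (same units in, same units out)."""
    return {
        "cachedInput": input_rate * cached_ratio,
        "batchInput": input_rate * batch_ratio,
        "batchOutput": output_rate * batch_ratio,
    }


# Hypothetical base rates in $ per million tokens, for illustration only.
rates = derive_rates(input_rate=3.00, output_rate=15.00)
print(rates)

# Entries derived at 25% of input pass a different cached ratio.
legacy = derive_rates(input_rate=3.00, output_rate=15.00, cached_ratio=0.25)
```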
Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old and new values for a full audit trail.