Hybrid Search Cost - Dense + Sparse Retrieval

Meet Andre Williams. Senior Engineer building product search. "Pure semantic search misses exact-match queries (SKUs, product names). Hybrid search adds complexity. Worth it?"

🔥 A user search for 'AirPods Pro 2' returns AirPods Max. Recall is broken.

The story

Pure semantic search has a known weakness: exact-match queries. A user searches for 'GPT-5-Pro pricing' and semantic retrieval returns 'GPT-4 cost', 'pricing models', and 'Anthropic Pro plans', missing the exact thing they asked for. BM25 (sparse keyword search) excels at this kind of query. Hybrid search combines both.

Andre's product search returns ~20% irrelevant results when users type exact SKUs or product names. Hybrid search (BM25 + dense, weighted ensemble) drops this to ~5%. Cost increase: ~15-25% on the total bill (running two indexes plus reranking). Quality lift: 4-15% on recall@5. That trade is worth it for most product, code, and document search workloads.

Two architectures. (1) Two-index hybrid: a dense vector index and a BM25 index queried separately, then score-fused. (2) Single-index models: late-interaction (ColBERT) or hybrid embedding models that give one index both dense and sparse properties. Two-index is simpler to set up; the single-index route is more elegant but newer.
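To make "score-fused" concrete, here is a minimal sketch of a weighted ensemble over two result sets. It assumes min-max normalization and a single weight `alpha`; the function names, the document IDs, and the scores are all hypothetical, and real vector DBs may fuse differently.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw scores to [0, 1] so BM25 and cosine scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every hit as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    dense: Dict[str, float], sparse: Dict[str, float], alpha: float = 0.5
) -> List[Tuple[str, float]]:
    """Weighted fusion: alpha * dense + (1 - alpha) * sparse, best first."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    fused = {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for the query "AirPods Pro 2" (illustrative only).
dense_hits = {"airpods-max": 0.82, "airpods-pro-2": 0.80, "airpods-3": 0.70}
sparse_hits = {"airpods-pro-2": 12.4, "airpods-pro-1": 9.1}
ranked = fuse_scores(dense_hits, sparse_hits, alpha=0.5)
```

With these made-up numbers the dense index alone ranks AirPods Max first, while the fused list puts the exact match on top because BM25 scores it highest. The right `alpha` depends on your query mix and is worth tuning on real traffic.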

About this calculator: Hybrid Search Cost - Dense + Sparse Retrieval

Hybrid retrieval (BM25 + dense) beats pure semantic for most workloads. Real cost vs recall math, and when the extra complexity pays back.

Inputs you control

Input	Impact on result	Range	Typical
Total docs indexed	Both indexes will store this many docs.	10K – 100M	1,000,000
Queries per day	Each query hits both indexes if hybrid.	100 – 1M	50,000
Rerank top-N (combined results)	Number of results to rerank with cross-encoder. 0 = no reranking. 20 = standard. More = better quality, higher cost.	0 – 100	20

Outputs computed for you · model: `hybrid_search`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12

Reading your result

Two-index hybrid roughly doubles vector DB cost. The same documents are indexed twice (dense vectors plus a BM25 inverted index), so storage and query time are both ~2× pure dense. Often an acceptable trade-off.

Reranking is the quality lever and the cost driver. A cross-encoder (e.g., Cohere Rerank, BGE Reranker) on the top-20 candidates costs ~$0.001-0.005 per query. At 50K queries/day that is $50-250 per day, or roughly $1,500-7,500/mo. Quality lift is typically 5-15% recall@5.
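A quick way to sanity-check reranker spend is to multiply it out. The sketch below assumes a flat 30-day month and a per-query price that does not vary with top-N; both are simplifications, and `monthly_rerank_cost` is a hypothetical helper, not part of the calculator.

```python
DAYS_PER_MONTH = 30  # assumption: flat 30-day billing month


def monthly_rerank_cost(queries_per_day: int, cost_per_query_usd: float) -> float:
    """Monthly reranker spend = queries/day * $/query * days/month."""
    return queries_per_day * cost_per_query_usd * DAYS_PER_MONTH


# Per-query range quoted above for reranking the top-20 candidates.
low = monthly_rerank_cost(50_000, 0.001)
high = monthly_rerank_cost(50_000, 0.005)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

At 50K queries/day the quoted per-query range works out to about $1,500 to $7,500 per month, so per-query pricing dominates quickly at this volume.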

Quality vs cost depends on query mix. Pure exact-match queries (SKUs): hybrid wins decisively. Pure semantic queries (paraphrases): pure dense wins slightly. Mixed real-world queries: hybrid wins by 4-10% on average.

What "good" looks like:

Strong fit for hybrid: Product search, code search, technical docs (exact-match common)
Moderate fit: General knowledge bases, articles
Limited fit: Conversational Q&A, paraphrase-heavy (pure dense fine)
Reranker mandatory: Top-N candidates >10 - without reranker, hybrid scores can be inconsistent
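Hybrid scores are inconsistent largely because BM25 and dense similarity live on different scales. One common workaround, not covered by this calculator, is reciprocal rank fusion (RRF), which combines ranks instead of raw scores. The sketch below uses the conventional constant k = 60; the document IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical ranked results from each index for "AirPods Pro 2".
dense_ranking = ["airpods-max", "airpods-pro-2", "airpods-3"]
sparse_ranking = ["airpods-pro-1", "airpods-pro-2"]
fused_ranking = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Because only ranks matter, a document that appears in both lists rises to the top even when neither index ranks it first. RRF needs no score normalization, but it cannot replace a cross-encoder when fine-grained ordering of the top results matters.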

Three real scenarios

Same calculator, three different team sizes.

$3,200 / month ≈ $38,400 / year

Andre's e-commerce product search. 1M products, 50K queries/day. Hybrid + reranker adds ~$300/mo on top of dense-only. Recall@5 jumps from ~75% to ~85%. Conversion impact justifies cost.

Healthy range: +15-25% cost, +4-15% recall

Inputs used:

totalDocs: 1,000,000
queriesPerDay: 50,000
rerankTopN: 20
architecture: two-index
storageProvider: qdrant-cluster

Trade-offs

Cost isn't the only dimension.

Best fit for "cost":

Two-index hybrid: +50-100% vector DB cost
Reranker: +$50-500/mo at modest scale
Late-interaction (ColBERT): more compute; storage depends on how the per-token vectors are compressed

Hybrid cost is real but typically <30% of total RAG bill. The recall lift usually justifies it for product/code/technical search.

Use cases

Pre-loaded scenarios for the most common applications.

$680.00 / month ≈ $8,160 / year

Developer docs have lots of exact-match content (API names, error messages, config keys). Hybrid + small reranker. ~$50/mo extra. Quality lift on technical queries is dramatic.

Healthy range: Hybrid wins on API names, error codes

Inputs used:

totalDocs: 50,000
queriesPerDay: 8,000
rerankTopN: 15
architecture: two-index
storageProvider: qdrant-self-hosted

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

Doesn't model alternative architectures (ColBERT, SPLADE, hybrid embeddings) which may beat traditional hybrid.
Reranker latency may push response time past UX threshold for some applications.
Recall improvements are workload-specific - measure on your actual queries.
Some vendors charge separately for hybrid features.

For these, use: RAG Pipeline for full architecture. Chunking Optimizer for upstream.
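Since recall gains are workload-specific, a small evaluation loop over your own labeled queries is the most direct check. The sketch below assumes recall@k means the fraction of a query's relevant documents that appear in the top k results (some teams use hit rate instead); the query IDs, document IDs, and function names are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], judgments: Dict[str, Set[str]], k: int = 5
) -> float:
    """Average recall@k over every query that has relevance judgments."""
    per_query = [recall_at_k(runs.get(q, []), rel, k) for q, rel in judgments.items()]
    return sum(per_query) / len(per_query) if per_query else 0.0


# Hypothetical relevance labels and retrieval runs (illustrative only).
judgments = {"q1": {"sku-123"}, "q2": {"doc-a", "doc-b"}}
dense_runs = {"q1": ["sku-999", "sku-456"], "q2": ["doc-a", "doc-x", "doc-b"]}
hybrid_runs = {"q1": ["sku-123", "sku-999"], "q2": ["doc-a", "doc-b", "doc-x"]}

dense_recall = mean_recall_at_k(dense_runs, judgments)
hybrid_recall = mean_recall_at_k(hybrid_runs, judgments)
```

Running the same labeled queries through dense-only and hybrid retrieval gives a like-for-like recall@5 comparison before you commit to the extra index.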

Where to go next

Full RAG pipeline cost →

Hybrid is one piece.

Vector DB hybrid support →

Pinecone, Weaviate, Qdrant differ.

Chunking strategy →

Hybrid retrieval is sensitive to chunk size.

Methodology

Source: https://qdrant.tech/articles/hybrid-search/
Extraction: Hybrid recall benchmarks from BEIR, MS MARCO.
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Anthropic

2026-06-05

https://www.anthropic.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Anthropic Docs

2026-06-05

https://platform.claude.com/docs/en/about-claude/pricing

Daily snapshot since Sep 2023 · 578 days captured

OpenAI

2026-06-05

https://openai.com/api/pricing/

Daily snapshot since Sep 2023 · 579 days captured

Google AI

2026-06-05

https://ai.google.dev/gemini-api/docs/pricing

Daily snapshot since Dec 2023 · 554 days captured

Google Vertex

2026-06-05

https://cloud.google.com/vertex-ai/generative-ai/pricing

Daily snapshot since Dec 2023 · 554 days captured

DeepSeek

2026-06-05

https://api-docs.deepseek.com/quick_start/pricing

Daily snapshot since May 2024 · 493 days captured

xAI

2026-06-05

https://x.ai/api

Daily snapshot since Nov 2024 · 411 days captured

Mistral

2026-06-05

https://mistral.ai/pricing

Daily snapshot since Dec 2023 · 552 days captured

Cohere

2026-06-05

https://cohere.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Voyage AI

2026-06-05

https://docs.voyageai.com/docs/pricing

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →
