Guides → Playground & Guide → Chunking Optimizer - Chunk Size vs Cost vs Recall

Chunking Optimizer - Chunk Size vs Cost vs Recall

Meet Tomoko Sato. ML Engineer iterating on a RAG retrieval system. "Should chunks be 200, 500, 1000, or 2000 tokens? What's the cost vs recall tradeoff?"

🔥 Three weeks tuning retrieval. Quality up 4%, costs up 18%.

The story

Chunk size shapes everything downstream. Smaller chunks = better precision (each chunk is focused) but more chunks per query = more retrieved tokens = more LLM cost. Larger chunks = lower precision (chunk has multiple ideas) but fewer chunks per query = cheaper LLM read. The optimal isn't obvious.

Tomoko's experiment: 500-token chunks at top-4 retrieval = 2K context, 78% recall@5. 1000-token chunks at top-3 = 3K context, 82% recall@5. 200-token chunks at top-8 = 1.6K context, 71% recall@5. Different chunking strategies hit different cost/quality points.

Three rules from production data. (1) Smaller chunks (200-400 tokens) win for narrow factual lookups. (2) Mid chunks (500-1000) win for explanations and context. (3) Larger chunks (1500+) win for procedural/sequential content. Match chunk size to query type - don't pick one for everything.

📊 CALCULATOR AT A GLANCE

🚀 Open the full calculator ✉️ Email [email protected]

About this calculator: Chunking Optimizer - Chunk Size vs Cost vs Recall

Chunk size is the most-tweaked, least-understood RAG parameter. Find the size that maximizes recall while controlling cost - workload-specific.

Inputs you control

Input	Impact on result	Range	Typical
Avg tokens per source doc	Source documents you're chunking. Articles: 1-3K. Long PDFs: 10-30K. Code files: 200-500.	200 – 50K	5000
Chunk size (tokens)	Tokens per chunk. Smaller = more chunks per doc = more index size.	100 – 4K	500
Retrieval top-k	How many chunks to retrieve per query. More chunks = more LLM cost.	1 – 20	4

Outputs computed for you · model: `chunking`

Output	How inputs affect it
Monthly cost ($)	computed from inputs
Annual cost ($)	monthlyUsd × 12

Below: live sliders. Move them to see numbers change in real time. * Output uses the generic compute model — for precise numbers use the full calculator below.

What you're looking at

Each input shapes your cost. Move the slider — see the impact.

Avg tokens per source doc 5,000

Source documents you're chunking. Articles: 1-3K. Long PDFs: 10-30K. Code files: 200-500.

Estimated: —

Chunk size (tokens) 500

Tokens per chunk. Smaller = more chunks per doc = more index size.

Estimated: —

Retrieval top-k 4

How many chunks to retrieve per query. More chunks = more LLM cost.

Estimated: —

Ready to run the numbers?

Open the full calculator — pick a model, enter your tokens, see per-call, daily, monthly, and annual cost.

🚀 Open the full calculator →

Reading your result

Total retrieved tokens = chunk size × top-k. 500 × 4 = 2K. 1000 × 3 = 3K. Same total chunks, different precision. Fewer-but-bigger chunks tend to win on coherent topics.

Index size scales inverse-linear with chunk size. 5K-token doc at 500-token chunks = 10 chunks. At 200-token chunks = 25 chunks. Bigger index = bigger storage cost + more potential noise per query.

Recall@k peaks at workload-specific size. Run an eval set with varying chunk sizes. The optimum surprises most teams - usually 800-1200 tokens for general knowledge, 200-400 for code, 1500-2500 for procedural docs.

What "good" looks like:

Tight chunks (200-400): Code search, factual lookup, FAQ
Standard (500-800): General knowledge, articles, technical docs
Mid (1000-1500): Explanations, multi-paragraph answers, contextual
Large (2000+): Procedural docs, long contracts, sequential content

Embedding models for chunked retrieval

Verified 20 hours ago

1

GPT-5 Mini

$0.250 in · $2.00 out ·
2

Command

$1.00 in · $2.00 out ·
3

devstral-2

$0.400 in · $2.00 out ·

Three real scenarios

Same calculator, three different team sizes. Click a tab to see how the numbers shift.

$6,539 / month ≈ $78,473 / year

1M code files. Small chunks (200 tokens) match code structure. Top-6 retrieval gets multiple relevant fragments. Total context: 1.2K tokens. Cheaper than larger chunks would be.

Healthy range: Tight chunks win for code

See inputs used

avgTokensPerDoc: 400
chunkSizeTokens: 200
topK: 6
docsIndexed: 1,000,000
queriesPerDay: 50,000
overlapPct: 0
llmTier: balanced
workingDaysPerMonth: 22

Trade-offs

Cost isn't the only dimension. Click any constraint — see how recommendations change.

What matters most to you? Click any dimension — recommendations update.

Best fit for "cost":

Larger chunks + smaller k Often cheaper at same recall
Tight chunks + larger k Better precision, similar cost

Total cost = chunk size × top-k. The same total can be reached two ways: 500×4 or 1000×2 or 2000×1. Pick based on recall measurement, not chunk-size dogma.

Use cases

Pre-loaded scenarios for the most common applications. Click a tab to see realistic numbers — then the "Try this scenario" button to load it into the calculator above.

$169.30 / month ≈ $2,032 / year

FAQ retrieval. Each entry is naturally bounded (~200-500 tokens). Small chunks match this. Top-3 retrieval. Cheap LLM since queries are simple.

Healthy range: Tight chunks for Q-A pairs

See inputs used

avgTokensPerDoc: 800
chunkSizeTokens: 300
topK: 3
docsIndexed: 10,000
queriesPerDay: 10,000
overlapPct: 20
llmTier: cheap
workingDaysPerMonth: 30

What this calculator can't tell you

Honest limitations — every model is wrong; some are useful. Where this one falls short:

Recall depends on embedding model quality, not just chunk size.
Doesn't model semantic chunking (e.g., LangChain semantic splitter) - may beat fixed-size.
Doesn't model document-level metadata (titles, sections) which can boost recall regardless of chunk size.
Optimal chunk size is highly workload-specific - measure, don't theorize.

For these, use: Embedding Cost for chunk impact on indexing. RAG Pipeline for end-to-end.

Where to go next

Full RAG pipeline cost →

Chunking is one piece of the bill.

Hybrid retrieval (dense + sparse) →

Beats pure dense for many workloads.

Embedding cost (smaller chunks = larger index) →

Index size scales inverse-linear with chunk size.

Methodology

Source: /ai-cost-economics
Extraction: Recall@k benchmarks on standardized test corpora across chunk sizes.
Editorial gate: 8-layer defense — see aicost.ai/ai-cost-economics
Last verified: 6/4/2026, 8:00:00 PM

Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.

3 years of pricing history

Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.

View 3-year history for →

📖 Data sources & methodology 161 text models · 9 embeddings · 24 vision · 41 audio · 8 vector DBs across 10 vendor pages · last verified 2026-06-05

Methodology

All prices are USD per 1 million tokens, current as of 2026-06-05.
Vendor-published values have no mark. Inferred/extrapolated values are marked with * and listed below.
Batch API discounts are 50% off standard rates across providers that offer Batch mode.
Prompt caching discounts vary by provider (typically 80-90% off cached input tokens).
Regional data-residency surcharges (Anthropic 1.1x, OpenAI 1.1x, Google regional tiers) are NOT included in base rates.
Long-context pricing tiers apply when input exceeds model threshold.
Embedding prices are input-only (no output tokens generated).

Primary sources

Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs). 10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.

Anthropic

2026-06-05

https://www.anthropic.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Anthropic Docs

2026-06-05

https://platform.claude.com/docs/en/about-claude/pricing

Daily snapshot since Sep 2023 · 578 days captured

OpenAI

2026-06-05

https://openai.com/api/pricing/

Daily snapshot since Sep 2023 · 579 days captured

Google AI

2026-06-05

https://ai.google.dev/gemini-api/docs/pricing

Daily snapshot since Dec 2023 · 554 days captured

Google Vertex

2026-06-05

https://cloud.google.com/vertex-ai/generative-ai/pricing

Daily snapshot since Dec 2023 · 554 days captured

DeepSeek

2026-06-05

https://api-docs.deepseek.com/quick_start/pricing

Daily snapshot since May 2024 · 493 days captured

xAI

2026-06-05

https://x.ai/api

Daily snapshot since Nov 2024 · 411 days captured

Mistral

2026-06-05

https://mistral.ai/pricing

Daily snapshot since Dec 2023 · 552 days captured

Cohere

2026-06-05

https://cohere.com/pricing

Daily snapshot since Sep 2023 · 578 days captured

Voyage AI

2026-06-05

https://docs.voyageai.com/docs/pricing

Inferred values (marked with * in calculator tables)

Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).

Vendor / Model	Field	Why it’s inferred
Anthropic — Claude Sonnet 4.6	`cachedInput`	Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier.
Anthropic — Claude Sonnet 4.5	`cachedInput`	Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6.
Anthropic — Claude Sonnet 4.5	`batchInput`	Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Sonnet 4.5	`batchOutput`	Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount.
Anthropic — Claude Haiku 4.5	`cachedInput`	Derived at 10% of input rate — Anthropic 90% cache-hit discount convention.
OpenAI — GPT-5.4 Mini	`cachedInput`	Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier.
OpenAI — GPT-5.4 Nano	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Nano	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Nano	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`cachedInput`	Derived at 10% of input — OpenAI 90% cache-hit convention.
OpenAI — GPT-5.4 Pro	`batchInput`	Derived at 50% of input — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.4 Pro	`batchOutput`	Derived at 50% of output — OpenAI Batch API uniform 50% discount.
OpenAI — GPT-5.2	`cachedInput`	Derived at 10% of input; no residency uplift.
OpenAI — GPT-5.2	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.5 Pro	`cachedInput`	Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention.
OpenAI — GPT-5.5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.2 Pro	`cachedInput`	Derived at 10% of input — pro-tier convention.
OpenAI — GPT-5.2 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.2 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5.1	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5.1	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Pro	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Pro	`batchOutput`	Derived at 50% of output.
OpenAI — GPT-5 Nano	`cachedInput`	Derived at 10% of input.
OpenAI — GPT-5 Nano	`batchInput`	Derived at 50% of input.
OpenAI — GPT-5 Nano	`batchOutput`	Derived at 50% of output.
Google — Gemini 3 Flash	`cachedInput`	Derived at 10% of input — Google caching discount convention ~90%.
Google — Gemini 3.1 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 3.1 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 3.1 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Pro	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash	`cachedInput`	Derived at 10% of input.
Google — Gemini 2.5 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.5 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.5 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`cachedInput`	Derived at 25% of input per Google 2.0 family caching rates.
Google — Gemini 2.0 Flash	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`cachedInput`	Derived at 10% of input — Google caching convention.
Google — Gemini 2.0 Flash-Lite	`batchInput`	Derived at 50% of input — Google Batch API uniform 50% discount.
Google — Gemini 2.0 Flash-Lite	`batchOutput`	Derived at 50% of output — Google Batch API uniform 50% discount.
xAI — Grok 4 (legacy)	`cachedInput`	Extrapolated at 25% of base.

Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →

Chunking Optimizer - Chunk Size vs Cost vs Recall

The story

About this calculator: Chunking Optimizer - Chunk Size vs Cost vs Recall

Inputs you control

Outputs computed for you · model: `chunking`

What you're looking at

Ready to run the numbers?

Reading your result

Embedding models for chunked retrieval

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

The story

About this calculator: Chunking Optimizer - Chunk Size vs Cost vs Recall

Inputs you control

Outputs computed for you · model: chunking

What you're looking at

Ready to run the numbers?

Reading your result

Embedding models for chunked retrieval

Three real scenarios

Trade-offs

Best fit for "cost":

Best fit for "hallucination":

Best fit for "compliance":

Best fit for "privacy":

Best fit for "latency":

Best fit for "vendor lock-in":

Best fit for "mlops overhead":

Use cases

What this calculator can't tell you

Where to go next

Methodology

3 years of pricing history

Methodology

Primary sources

Inferred values (marked with * in calculator tables)

Outputs computed for you · model: `chunking`