Chunking Optimizer - Chunk Size vs Cost vs Recall
Meet Tomoko Sato. ML Engineer iterating on a RAG retrieval system. "Should chunks be 200, 500, 1000, or 2000 tokens? What's the cost vs recall tradeoff?"
🔥 Three weeks tuning retrieval. Quality up 4%, costs up 18%.
Chunk size shapes everything downstream. Smaller chunks = better precision (each chunk is focused), but you usually need more chunks per query to cover the same ground, which can mean more retrieved tokens and more LLM cost. Larger chunks = lower precision (a chunk holds multiple ideas) but fewer chunks per query, which can mean a cheaper LLM read. The optimum isn't obvious.
Tomoko's experiment: 500-token chunks at top-4 retrieval = 2K context, 78% recall@5. 1000-token chunks at top-3 = 3K context, 82% recall@5. 200-token chunks at top-8 = 1.6K context, 71% recall@5. Different chunking strategies hit different cost/quality points.
Three rules from production data. (1) Smaller chunks (200-400 tokens) win for narrow factual lookups. (2) Mid chunks (500-1000) win for explanations and context. (3) Larger chunks (1500+) win for procedural/sequential content. Match chunk size to query type - don't pick one for everything.
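As a rough illustration, the three rules can be written down as a lookup from query type to a chunk-size range. This is a minimal sketch: the names `CHUNK_RULES` and `chunk_range_for` and the query-type labels are hypothetical, and only the token ranges come from the rules above.

```python
from typing import Dict, Optional, Tuple

# Hypothetical routing table; the token ranges are the three rules above.
CHUNK_RULES: Dict[str, Tuple[int, Optional[int]]] = {
    "factual_lookup": (200, 400),   # rule 1: narrow factual lookups
    "explanation": (500, 1000),     # rule 2: explanations and context
    "procedural": (1500, None),     # rule 3: "1500+" has no stated upper bound
}


def chunk_range_for(query_type: str) -> Tuple[int, Optional[int]]:
    """Return the (min, max) chunk size in tokens for a query type."""
    if query_type not in CHUNK_RULES:
        raise ValueError(f"unknown query type: {query_type!r}")
    return CHUNK_RULES[query_type]
```

A table like this is only a starting point. The ranges still need to be checked against an eval set for your own workload.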
Chunk size is the most-tweaked, least-understood RAG parameter. Find the size that maximizes recall while controlling cost; the answer is workload-specific.
Total retrieved tokens = chunk size × top-k. 500 × 4 = 2K. 1000 × 3 = 3K. Similar context budgets, different precision. Fewer-but-bigger chunks tend to win on coherent topics.
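The arithmetic is simple enough to script. The sketch below computes retrieved tokens per query for Tomoko's three configurations and converts them to an input cost. The function names and the per-million-token price parameter are assumptions for illustration, not values from any vendor.

```python
def retrieved_tokens(chunk_size: int, top_k: int) -> int:
    """Tokens of retrieved context sent to the LLM per query."""
    return chunk_size * top_k


def input_cost_usd(chunk_size: int, top_k: int, usd_per_million_input_tokens: float) -> float:
    """Per-query cost of reading the retrieved context (the price is a placeholder)."""
    return retrieved_tokens(chunk_size, top_k) * usd_per_million_input_tokens / 1_000_000


# The three configurations from Tomoko's experiment: (chunk size, top-k).
for size, k in [(500, 4), (1000, 3), (200, 8)]:
    print(f"{size} x {k} = {retrieved_tokens(size, k)} tokens")
```

This counts only retrieved context. The system prompt, the user question, and output tokens add to the real per-call cost.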
Index size scales inversely with chunk size. 5K-token doc at 500-token chunks = 10 chunks. At 200-token chunks = 25 chunks. Bigger index = bigger storage cost + more potential noise per query.
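The chunk counts above can be reproduced with a single ceiling division. This is a sketch that assumes non-overlapping chunks; `chunk_count` is a hypothetical helper name.

```python
import math


def chunk_count(doc_tokens: int, chunk_size: int) -> int:
    """Number of non-overlapping chunks needed to cover a document."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(doc_tokens / chunk_size)


print(chunk_count(5000, 500))  # 10
print(chunk_count(5000, 200))  # 25
```

With overlap between chunks the count goes up, because each chunk advances by less than its full length.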
Recall@k peaks at workload-specific size. Run an eval set with varying chunk sizes. The optimum surprises most teams - usually 800-1200 tokens for general knowledge, 200-400 for code, 1500-2500 for procedural docs.
Same calculator, three different workloads. Here is how the numbers shift.
1M code files. Small chunks (200 tokens) match code structure. Top-6 retrieval gets multiple relevant fragments. Total context: 1.2K tokens. Cheaper than larger chunks would be.
Healthy range: Tight chunks win for code
Tomoko's setup. 500-token chunks, top-4. 2K retrieved tokens per query. Standard recall ~78%. Try 1000 × 3 - often better.
Healthy range: Standard config - good baseline
Legal contracts have multi-paragraph clauses that lose meaning when split. 2000-token chunks preserve clause structure. Top-3 retrieval. ~6K context per query. Premium LLM for accuracy.
Healthy range: Large chunks preserve clause coherence
Cost isn't the only dimension. Each constraint below changes the recommendation.
Retrieved tokens per query = chunk size × top-k, and that product drives LLM input cost. The same 2K total can be reached several ways: 500×4, 1000×2, or 2000×1. Pick based on recall measurement, not chunk-size dogma.
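To see which combinations land on the same context budget, you can enumerate them. This is a minimal sketch; `configs_for_budget` is a hypothetical helper that keeps only the chunk sizes dividing the budget evenly.

```python
from typing import List, Tuple


def configs_for_budget(token_budget: int, chunk_sizes: List[int]) -> List[Tuple[int, int]]:
    """Return (chunk_size, top_k) pairs whose product equals the token budget."""
    configs = []
    for size in chunk_sizes:
        if size > 0 and token_budget % size == 0:
            configs.append((size, token_budget // size))
    return configs


print(configs_for_budget(2000, [200, 500, 1000, 2000]))
# [(200, 10), (500, 4), (1000, 2), (2000, 1)]
```

Every pair costs the same to read, so the choice between them comes down to measured recall.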
Bad chunking → wrong context retrieved → LLM hallucinates from wrong context. Chunk strategy is a hallucination lever, not just cost.
If you cite chunks, smaller chunks make citations more precise. Compliance teams appreciate granular source attribution.
Chunking doesn't change privacy. All chunks live in same vector DB with same access controls.
Latency is dominated by the LLM read. Larger chunks at lower top-k mean less context and faster TTFT only when chunk size × top-k actually shrinks (1000 × 3 is more context than 500 × 4). Worth measuring.
Changing chunk strategy means re-embedding the corpus. Cheap at small scale, painful at million-doc scale. Get chunking right early or budget for periodic re-embeds.
Don't tune chunking on intuition. Build an eval set with known correct answers, measure recall@k for each strategy, pick the winner.
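A recall@k measurement along these lines might look like the sketch below. It assumes each eval question is labeled with the IDs of its relevant source passages, and that the retriever returns IDs that stay comparable across chunk sizes (for example, source-passage IDs rather than chunk IDs). The function names and the `retrieve` callable are hypothetical.

```python
from typing import Callable, List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("each eval question needs at least one relevant ID")
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(
    eval_set: List[Tuple[str, Set[str]]],
    retrieve: Callable[[str], List[str]],
    k: int,
) -> float:
    """Average recall@k over (question, relevant_ids) pairs."""
    scores = [recall_at_k(retrieve(question), relevant, k) for question, relevant in eval_set]
    return sum(scores) / len(scores)
```

Run `mean_recall_at_k` once per chunking strategy with the same eval set and the same k, then compare the averages next to retrieved tokens per query.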
Scenarios for the most common applications, with realistic numbers.
FAQ retrieval. Each entry is naturally bounded (~200-500 tokens). Small chunks match this. Top-3 retrieval. Cheap LLM since queries are simple.
Healthy range: Tight chunks for Q-A pairs
Papers have natural section structure. 1200-token chunks ~match paragraph clusters. 15% overlap preserves cross-section context. Top-5 for thoroughness.
Healthy range: Mid chunks with section overlap
Step-by-step procedures lose meaning when split mid-sequence. Larger chunks (1500 tokens) preserve sequential ordering. Higher overlap (20%) maintains transitions between chunks.
Healthy range: Large chunks for sequential content
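The overlap settings mentioned in these scenarios (15%, 20%) can be pictured as a sliding window over the token sequence. This is a minimal sketch that assumes the document is already tokenized into a list of token IDs; `chunk_with_overlap` is a hypothetical helper, not a specific library's splitter.

```python
from typing import List, Sequence


def chunk_with_overlap(tokens: Sequence[int], chunk_size: int, overlap: float) -> List[List[int]]:
    """Split tokens into windows of chunk_size that share a fraction `overlap` with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    overlap_tokens = round(chunk_size * overlap)
    step = max(1, chunk_size - overlap_tokens)  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks
```

With 1500-token chunks and 20% overlap, each window advances by 1200 tokens, so the last 300 tokens of one chunk repeat at the start of the next.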
Medical guidelines: many distinct facts per doc. Smaller chunks (800), more retrieved (top-6) for thoroughness. Premium LLM essential.
Healthy range: Smaller chunks, more retrieved
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Embedding Cost for chunk impact on indexing. RAG Pipeline for end-to-end.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs).
10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic — Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate — Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic — Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic — Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output — Anthropic documents uniform 50% Batch discount. |
| Anthropic — Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate — Anthropic 90% cache-hit discount convention. |
| OpenAI — GPT-5.4 Mini | cachedInput | Derived at 10% of input — OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI — GPT-5.4 Nano | cachedInput | Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Nano | batchInput | Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Nano | batchOutput | Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | cachedInput | Derived at 10% of input — OpenAI 90% cache-hit convention. |
| OpenAI — GPT-5.4 Pro | batchInput | Derived at 50% of input — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.4 Pro | batchOutput | Derived at 50% of output — OpenAI Batch API uniform 50% discount. |
| OpenAI — GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift. |
| OpenAI — GPT-5.2 | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.2 | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5 | cachedInput | Derived at 10% of input. |
| OpenAI — GPT-5 | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5 | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5.5 Pro | cachedInput | Derived at 10% of input — OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI — GPT-5.5 Pro | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5.2 Pro | cachedInput | Derived at 10% of input — pro-tier convention. |
| OpenAI — GPT-5.2 Pro | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.2 Pro | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5.1 | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5.1 | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5 Pro | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI — GPT-5 Nano | cachedInput | Derived at 10% of input. |
| OpenAI — GPT-5 Nano | batchInput | Derived at 50% of input. |
| OpenAI — GPT-5 Nano | batchOutput | Derived at 50% of output. |
| Google — Gemini 3 Flash | cachedInput | Derived at 10% of input — Google caching discount convention ~90%. |
| Google — Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input — Google caching convention. |
| Google — Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Pro | cachedInput | Derived at 10% of input. |
| Google — Gemini 2.5 Flash | cachedInput | Derived at 10% of input. |
| Google — Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates. |
| Google — Gemini 2.0 Flash | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input — Google caching convention. |
| Google — Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input — Google Batch API uniform 50% discount. |
| Google — Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output — Google Batch API uniform 50% discount. |
| xAI — Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base. |
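The derivation conventions in the table reduce to two multipliers. The sketch below applies them to a pair of base rates. `derive_rates` and the example rates are hypothetical and do not represent any vendor's published pricing. The cached fraction is a parameter because a few rows use 25% instead of 10%.

```python
from typing import Dict


def derive_rates(
    input_rate: float,
    output_rate: float,
    cached_fraction: float = 0.10,  # typical convention: cached input = 10% of base
    batch_fraction: float = 0.50,   # typical convention: batch = 50% of base
) -> Dict[str, float]:
    """Derive cached and batch rates from base per-token rates."""
    return {
        "cachedInput": input_rate * cached_fraction,
        "batchInput": input_rate * batch_fraction,
        "batchOutput": output_rate * batch_fraction,
    }


# Placeholder base rates in USD per million tokens, for illustration only.
print(derive_rates(2.0, 8.0))
print(derive_rates(2.0, 8.0, cached_fraction=0.25))
```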
Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old & new values for full audit trail. Read the full methodology →