Multimodal RAG Stack - Vision + Audio + Text Retrieval Cost
Meet Esme Vasquez. ML Engineer building a video-and-document Q&A product. "Users upload videos + PDFs + images + ask questions. How do we cost-out the full multimodal RAG?"
Pricing models exist for each modality, but they are rarely combined cleanly. Esme needs an architecture-level estimate.
Multimodal RAG combines 3+ pipelines that have different cost models. Image: per-image vision embedding + storage + retrieval. Audio: STT to text → embedding (or specialized audio embedding) + transcript storage. Text: standard chunking + embedding + retrieval. Plus the LLM read at the end with multimodal context.
Esme's product processes 1,000 videos/day (avg 10 min each), 5K PDFs/day, and 100K image queries/day. Audio transcription: 10K min/day → ~$60/day STT, plus embedding storage. Vision: 100K images at a per-image rate that varies with detail level = $200-500/day. Text: 5K PDFs × 30 pages of embedding and storage (~$5/day at ~1,500 tokens/page). LLM read with mixed-modality context: ~$300/day. Total: ~$700-900/day = $21-27K/mo.
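The roll-up above can be sketched as a small script. This is a minimal sketch, not the calculator's actual model: the line-item names are made up for illustration, the text figure uses the ~$5/day from the text-RAG breakdown, and a flat 30-day month is assumed.

```python
# Daily line items (USD/day) as (low, high) ranges, taken from the text.
DAILY_LINE_ITEMS = {
    "audio_stt": (60.0, 60.0),
    "vision": (200.0, 500.0),
    "text_embedding": (5.0, 5.0),  # ~$5/day, from the text-RAG breakdown
    "llm_read": (300.0, 300.0),
}
DAYS_PER_MONTH = 30  # assumption: flat 30-day month


def total_range(items):
    """Sum the low ends and the high ends of every line item."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


daily_low, daily_high = total_range(DAILY_LINE_ITEMS)
monthly_low = daily_low * DAYS_PER_MONTH
monthly_high = daily_high * DAYS_PER_MONTH

print(f"daily:   ${daily_low:,.0f} - ${daily_high:,.0f}")
print(f"monthly: ${monthly_low:,.0f} - ${monthly_high:,.0f}")
```

The itemized figures sum to $565-865/day, a little under the rounded ~$700-900/day headline. The difference is presumably the storage and retrieval overhead that the paragraph mentions but does not price.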
Three multimodal architectures. (1) Convert-everything-to-text (transcribe audio, OCR images, then text-only RAG). (2) Native multimodal embeddings (CLIP, multimodal embedding models). (3) Hybrid (image embeddings + text embeddings + audio transcripts). Each has a different cost profile.
Each input shapes the cost. The explanations below cover the calculator's inputs field by field.
Audio costs are per-minute, dominated by transcription. $0.003-0.006/min STT × 10K min/day × 30 days = $900-1,800/mo. Embedding storage adds only a marginal amount.
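As a quick check of the audio arithmetic, here is a minimal sketch. The function name is hypothetical, and a 30-day month is assumed.

```python
def monthly_stt_cost(minutes_per_day, rate_per_minute, days=30):
    """Monthly speech-to-text cost in USD for a given per-minute rate."""
    return minutes_per_day * rate_per_minute * days


MINUTES_PER_DAY = 1_000 * 10  # 1,000 videos/day at ~10 min each

stt_low = monthly_stt_cost(MINUTES_PER_DAY, 0.003)
stt_high = monthly_stt_cost(MINUTES_PER_DAY, 0.006)
print(f"STT: ${stt_low:,.0f} - ${stt_high:,.0f} per month")
```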
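Vision costs are per-image and vary widely with detail level. Low-detail classification: $0.001/image. High-detail OCR: $0.01-0.05/image. 100K images/day at OCR quality = $1-5K/day ($30-150K/mo). Esme's $200-500/day estimate implies a blended rate of $0.002-0.005/image, which means most images stay on the low-detail tier.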
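The per-image rates can be turned into daily and monthly ranges with a short sketch. The tier names and the helper are hypothetical; the rates are the ones quoted above. Note that multiplying a per-image rate by a per-day volume gives a per-day figure, which is then scaled to a month.

```python
# (low, high) USD per image for each detail tier, from the text.
VISION_RATES = {
    "low_detail_classification": (0.001, 0.001),
    "high_detail_ocr": (0.01, 0.05),
}


def vision_cost_range(images_per_day, tier, days=1):
    """Return (low, high) USD for `days` days at the given tier."""
    low_rate, high_rate = VISION_RATES[tier]
    return images_per_day * low_rate * days, images_per_day * high_rate * days


IMAGES_PER_DAY = 100_000
for tier in VISION_RATES:
    day_lo, day_hi = vision_cost_range(IMAGES_PER_DAY, tier)
    mo_lo, mo_hi = vision_cost_range(IMAGES_PER_DAY, tier, days=30)
    print(f"{tier}: ${day_lo:,.0f}-{day_hi:,.0f}/day, ${mo_lo:,.0f}-{mo_hi:,.0f}/mo")
```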
Text RAG is the cheapest line per unit. Standard pipeline. 5K docs × 30 pages × ~1500 tokens/page = 225M tokens/day to embed. ~$5/day = $150/mo. Storage: marginal.
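The token count and embedding cost can be reproduced as follows. This is a sketch under one stated assumption: an embedding price of $0.02 per million tokens, which is not given in the text but is consistent with the ~$5/day figure.

```python
def daily_embedding_tokens(docs_per_day, pages_per_doc, tokens_per_page):
    """Total tokens that must be embedded per day."""
    return docs_per_day * pages_per_doc * tokens_per_page


def embedding_cost(tokens, usd_per_million_tokens):
    """Embedding cost in USD for a token count."""
    return tokens / 1_000_000 * usd_per_million_tokens


ASSUMED_USD_PER_MILLION = 0.02  # assumption, not stated in the text

tokens_per_day = daily_embedding_tokens(5_000, 30, 1_500)
cost_per_day = embedding_cost(tokens_per_day, ASSUMED_USD_PER_MILLION)
print(f"{tokens_per_day:,} tokens/day -> ${cost_per_day:.2f}/day, ${cost_per_day * 30:.0f}/mo")
```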
LLM read with multimodal context dominates if not optimized. Vision tokens count toward LLM input - passing images costs $0.02-0.10 per query. At 50K queries/day, this is the highest single line item.
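To see why this line dominates, the per-query image cost can be multiplied out. A minimal sketch: `image_share` is a hypothetical knob for the fraction of queries that actually pass images to the LLM, and it defaults to every query.

```python
def daily_llm_image_cost(queries_per_day, image_cost_per_query, image_share=1.0):
    """Daily USD spent on image tokens in the LLM read step.

    image_share is the fraction of queries that include images
    (hypothetical parameter; 1.0 means every query does).
    """
    return queries_per_day * image_share * image_cost_per_query


QUERIES_PER_DAY = 50_000

all_low = daily_llm_image_cost(QUERIES_PER_DAY, 0.02)
all_high = daily_llm_image_cost(QUERIES_PER_DAY, 0.10)
partial_low = daily_llm_image_cost(QUERIES_PER_DAY, 0.02, image_share=0.3)

print(f"every query with images: ${all_low:,.0f} - ${all_high:,.0f} per day")
print(f"30% of queries, low rate: ${partial_low:,.0f} per day")
```

With every query carrying images, the stated rates give $1,000-5,000/day. Passing images on only 30% of queries at the low rate gives $300/day, so reducing how often raw images reach the LLM is the main lever.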
Same calculator, three different team sizes.
Image-heavy small app. 10K images/day + small text corpus. Native multimodal LLM. Modest scale. ~$800/mo.
Healthy range: $500-1.2K/mo
Mixed multimodal product. Convert-to-text architecture (cheaper). $20K/mo across all line items. Audio + LLM dominate.
Healthy range: $15-25K/mo total
Consumer-scale multimodal. Cheap-tier LLM mandatory. Multi-vendor routing for vision (Gemini Flash for simple, Claude for complex). Self-hosted vector DB.
Healthy range: $200-500K/mo, multi-vendor mandatory
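Multi-vendor routing for vision could look roughly like the sketch below. Everything here is hypothetical: the task labels, the set of tasks treated as complex, and the tier names. It only illustrates the idea of sending simple requests to a cheap tier and complex ones to a premium tier.

```python
from collections import Counter

# Hypothetical task labels that justify the premium vision model.
COMPLEX_TASKS = {"chart_reading", "dense_ocr", "multi_image_reasoning"}


def route_vision_request(task_type):
    """Pick a pricing tier for one vision request."""
    return "premium" if task_type in COMPLEX_TASKS else "cheap"


def tally_routes(task_types):
    """Count how many requests land on each tier."""
    return Counter(route_vision_request(t) for t in task_types)


sample = ["thumbnail_tagging", "chart_reading", "thumbnail_tagging", "dense_ocr"]
print(tally_routes(sample))
```

In practice the routing signal would come from an eval on real traffic rather than a fixed label set.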
Cost isn't the only dimension. Each constraint below changes the recommendation.
Convert-to-text is the cost-effective default. Move to native multimodal only if eval shows quality benefit on your workload.
Multimodal hallucinations are sneakier. User uploads a chart; agent misreads axis values; downstream advice is wrong. Eval each modality.
Compliance is per-modality. Vision-API BAA may differ from text-API BAA. Voice biometrics has its own regulations. Check each.
Audio + images are the highest-PII modalities. Strip metadata, get explicit consent, and set a retention policy.
Multimodal queries are slower. Streaming UI helps. Pre-process modalities in parallel where possible.
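Parallel pre-processing can be sketched with `asyncio.gather`. The three stage functions are placeholders that only sleep; in a real pipeline they would call the STT, vision, and text-extraction services. All names here are hypothetical.

```python
import asyncio


async def transcribe_audio(path):
    await asyncio.sleep(0.05)  # stands in for an STT call
    return f"transcript of {path}"


async def describe_frames(path):
    await asyncio.sleep(0.05)  # stands in for a vision call
    return f"frame captions for {path}"


async def extract_text(path):
    await asyncio.sleep(0.05)  # stands in for PDF text extraction
    return f"text of {path}"


async def preprocess(upload):
    """Run the three modality stages concurrently instead of one after another."""
    transcript, captions, text = await asyncio.gather(
        transcribe_audio(upload["video"]),
        describe_frames(upload["video"]),
        extract_text(upload["pdf"]),
    )
    return {"transcript": transcript, "captions": captions, "text": text}


result = asyncio.run(preprocess({"video": "demo.mp4", "pdf": "manual.pdf"}))
print(result)
```

Because the stages wait on I/O concurrently, the wall-clock time is close to the slowest stage rather than the sum of all three.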
Vision and audio APIs across vendors are less standardized than text. Multi-vendor multimodal is harder than multi-vendor text.
Multimodal MLOps is genuinely harder. Eval frameworks for vision-RAG and audio-RAG are less mature. Custom eval is often required.
Pre-loaded scenarios for the most common applications, with realistic numbers for each.
Video tutorial Q&A. Transcribe + chunk + embed transcripts. Skip vision (visual content not needed for Q&A). ~$20K/mo.
Healthy range: $15-25K/mo for video-only
Papers with figures. Vision for figures + text for body. Premium LLM for reasoning. Modest query volume. ~$7K/mo.
Healthy range: $5-10K/mo for academic depth
Image-similarity product search. CLIP-style embeddings, no LLM read needed for retrieval. Cheap-tier LLM only for query disambiguation. ~$22K/mo.
Healthy range: $15-30K/mo with cheap multimodal
Medical images + clinical guidelines RAG. Premium tier + HIPAA + self-hosted vector DB. Modest scale. ~$5K/mo.
Healthy range: $3-6K/mo (premium tier mandatory)
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Vision Cost for image detail. Audio Cost for audio detail. RAG Pipeline for text.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
Last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs).
10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
The fields listed below are derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor | Model | Field | Why it’s inferred |
|---|---|---|---|
| Anthropic | Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic | Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic | Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount. |
| Anthropic | Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount. |
| Anthropic | Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention. |
| OpenAI | GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI | GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention. |
| OpenAI | GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount. |
| OpenAI | GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount. |
| OpenAI | GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention. |
| OpenAI | GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount. |
| OpenAI | GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount. |
| OpenAI | GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift. |
| OpenAI | GPT-5.2 | batchInput | Derived at 50% of input. |
| OpenAI | GPT-5.2 | batchOutput | Derived at 50% of output. |
| OpenAI | GPT-5 | cachedInput | Derived at 10% of input. |
| OpenAI | GPT-5 | batchInput | Derived at 50% of input. |
| OpenAI | GPT-5 | batchOutput | Derived at 50% of output. |
| OpenAI | GPT-5.5 Pro | cachedInput | Derived at 10% of input; OpenAI does not publish a cached rate for `*-pro` models, so the family convention is used. |
| OpenAI | GPT-5.5 Pro | batchInput | Derived at 50% of input. |
| OpenAI | GPT-5.5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI | GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention. |
| OpenAI | GPT-5.2 Pro | batchInput | Derived at 50% of input. |
| OpenAI | GPT-5.2 Pro | batchOutput | Derived at 50% of output. |
| OpenAI | GPT-5.1 | batchInput | Derived at 50% of input. |
| OpenAI | GPT-5.1 | batchOutput | Derived at 50% of output. |
| OpenAI | GPT-5 Pro | batchInput | Derived at 50% of input. |
| OpenAI | GPT-5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI | GPT-5 Nano | cachedInput | Derived at 10% of input. |
| OpenAI | GPT-5 Nano | batchInput | Derived at 50% of input. |
| OpenAI | GPT-5 Nano | batchOutput | Derived at 50% of output. |
| Google | Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention ~90%. |
| Google | Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google | Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google | Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google | Gemini 2.5 Pro | cachedInput | Derived at 10% of input. |
| Google | Gemini 2.5 Flash | cachedInput | Derived at 10% of input. |
| Google | Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google | Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google | Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google | Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates. |
| Google | Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google | Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google | Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google | Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google | Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| xAI | Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base. |
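The inference conventions in the table can be sketched as a fill-in function. This is an illustration, not the site's actual code: the function name and the `overrides` argument are hypothetical, the prices in the example are made up, and only the field names (`cachedInput`, `batchInput`, `batchOutput`) and the 10% / 50% / 25% fractions come from the table.

```python
# field -> (base field it is derived from, fraction of that base)
DEFAULT_FRACTIONS = {
    "cachedInput": ("input", 0.10),
    "batchInput": ("input", 0.50),
    "batchOutput": ("output", 0.50),
}


def infer_missing_rates(published, overrides=None):
    """Fill in unpublished fields and report which ones were inferred."""
    fractions = {**DEFAULT_FRACTIONS, **(overrides or {})}
    rates = dict(published)
    inferred = []
    for field, (base_field, fraction) in fractions.items():
        if field not in rates:  # never overwrite a vendor-published value
            rates[field] = rates[base_field] * fraction
            inferred.append(field)
    return rates, inferred


# Hypothetical prices in USD per million tokens.
published = {"input": 3.00, "output": 15.00}
rates, inferred = infer_missing_rates(published)
print(rates, inferred)

# A model family that uses 25% for cached input, as two rows in the table do.
rates_25, _ = infer_missing_rates(published, {"cachedInput": ("input", 0.25)})
print(rates_25["cachedInput"])
```

Returning the list of inferred fields alongside the rates keeps derived numbers distinguishable from published ones, which is what the table is for.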
Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old and new values for a full audit trail.
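A changelog of this kind can be produced by diffing consecutive snapshots. The sketch below is an assumption about the general approach, not the site's schema: snapshots are modeled as plain dictionaries, and the model names and prices are made up.

```python
def diff_snapshots(old, new):
    """Return one changelog entry per (model, field) whose value changed."""
    entries = []
    for model, new_fields in new.items():
        old_fields = old.get(model, {})
        for field, new_value in new_fields.items():
            old_value = old_fields.get(field)  # None if the field is new
            if old_value != new_value:
                entries.append(
                    {"model": model, "field": field, "old": old_value, "new": new_value}
                )
    return entries


# Hypothetical snapshots: USD per million tokens.
yesterday = {"model-a": {"input": 3.0, "output": 15.0}}
today = {"model-a": {"input": 2.5, "output": 15.0}, "model-b": {"input": 1.0}}

changes = diff_snapshots(yesterday, today)
for entry in changes:
    print(entry)
```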