Vision Cost - How Multimodal Pricing Actually Works
Meet Mei Lin. Product Engineer launching a receipt-OCR feature. "Vision pricing is confusing - per image? Per token? Tiles? What does my actual feature cost?"
🔥 Spec says 50K receipts/day. CFO needs a number by tomorrow.
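Vision pricing isn't text pricing with a sticker tax. Vendors price images differently. OpenAI's documented tile scheme for GPT-4o-class models charges a flat 85 tokens per image at low detail and, at high detail, 170 tokens per 512×512 tile plus an 85-token base (765 tokens for a 1024×1024 image). Anthropic charges tokens in proportion to pixel dimensions (roughly width × height / 750). Google's Gemini models bill a fixed token count per image or per tile. Comparison isn't apples-to-apples until you normalize to cost per image.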
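To make the normalization concrete, here is a minimal sketch of two token estimators. It assumes the tile scheme OpenAI documented for GPT-4o-class models (flat 85 tokens at low detail; at high detail, fit within 2048×2048, shrink the shortest side to 768 px, then 170 tokens per 512 px tile plus 85) and Anthropic's published approximation of width × height / 750. The function names are illustrative, the downscale-only behavior is an assumption, and newer models may count differently, so treat the results as estimates.

```python
import math


def openai_tile_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens under the GPT-4o-era tile scheme (assumed here)."""
    if detail == "low":
        return 85  # flat cost, independent of image size

    # Step 1: fit inside a 2048 x 2048 square, keeping aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Step 2: shrink so the shortest side is at most 768 px
    # (assumption: smaller images are not upscaled).
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale

    # Step 3: count 512 px tiles; each costs 170 tokens, plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85


def anthropic_image_tokens(width: int, height: int) -> int:
    """Approximate tokens as width * height / 750 (ignores vendor-side resizing)."""
    return math.ceil(width * height / 750)


if __name__ == "__main__":
    print(openai_tile_tokens(1024, 1024, "high"))
    print(openai_tile_tokens(1024, 1024, "low"))
    print(anthropic_image_tokens(1000, 1000))
```

Multiplying either estimate by the model's input-token price gives a per-image cost that can be compared across vendors.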
Mei's receipt-OCR scanning 50K images/day at high detail on GPT-5.5 Vision: ~$2,800/mo. Same workload on Claude Vision: ~$2,200/mo. On Gemini 3 Pro Vision: ~$1,400/mo. Quality differences are real but small for OCR tasks - the spread is mostly about pricing model, not capability.
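A quick way to compare these quotes is to back out the per-image cost each one implies. The sketch below uses only the monthly figures and the 50K images/day volume from the paragraph above; the 30-day month is an assumption.

```python
IMAGES_PER_DAY = 50_000
DAYS_PER_MONTH = 30  # assumption: a flat 30-day month

# Monthly figures quoted above for the receipt-OCR workload (USD).
MONTHLY_QUOTES = {
    "GPT-5.5 Vision": 2800,
    "Claude Vision": 2200,
    "Gemini 3 Pro Vision": 1400,
}


def per_image_cost(monthly_usd: float,
                   images_per_day: int = IMAGES_PER_DAY,
                   days: int = DAYS_PER_MONTH) -> float:
    """Unit cost implied by a monthly bill at a steady daily volume."""
    return monthly_usd / (images_per_day * days)


if __name__ == "__main__":
    for vendor, monthly in MONTHLY_QUOTES.items():
        print(f"{vendor}: ${per_image_cost(monthly):.5f} per image")
```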
Three levers cut vision costs 50-70% if you use them. (1) Resolution tier - 'low detail' is 5× cheaper and fine for most non-detailed tasks. (2) Image preprocessing - resize before upload (vendor downsizes anyway, you may as well control the tier). (3) Cheap-tier vision (Gemini Flash, Haiku Vision) for simple classification.
Per-image cost is the unit. Total cost / images = unit economics. OCR at $0.005-0.02/image is healthy. Above $0.05/image, you're using too-premium a tier.
Watch resolution waste. If you upload 4K images for a 'is this a receipt yes/no?' task, you're paying 5× too much. Resize to thumbnails for classification.
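Controlling the upload size starts with picking target dimensions before any image library touches the file. The sketch below computes aspect-preserving dimensions for a maximum side length; the 768 px target and the function names are illustrative assumptions, not vendor guidance.

```python
def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Return dimensions scaled down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def pixel_reduction(width: int, height: int, max_side: int) -> float:
    """Fraction of pixels removed by the resize (0.0 means unchanged)."""
    new_w, new_h = fit_within(width, height, max_side)
    return 1 - (new_w * new_h) / (width * height)


if __name__ == "__main__":
    # A 4K frame shrunk for a yes/no classification task.
    print(fit_within(3840, 2160, 768))
    print(f"{pixel_reduction(3840, 2160, 768):.0%} fewer pixels")
```

The returned size can be passed to whatever resize call your image pipeline already uses.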
The vendor spread is bigger for vision than text. Up to 50% between vendors at equivalent quality. Worth shopping more aggressively here than for text.
The same math at three different scales shows how the numbers shift.
Is-this-a-receipt classifier. Low detail (200 tokens), short output (50 tokens), cheap tier (Haiku Vision / Gemini Flash). 100K/day = $300-450/mo.
Healthy range: <$500/mo at 100K/day
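The classifier estimate can be reproduced with a token-based formula. The token counts and daily volume come from the scenario above; the per-million-token prices are placeholder values chosen only for illustration, not quotes from any vendor, so substitute the current rates for the model you pick.

```python
def monthly_cost(calls_per_day: int,
                 input_tokens: int,
                 output_tokens: int,
                 usd_per_m_input: float,
                 usd_per_m_output: float,
                 days: int = 30) -> float:
    """Monthly spend for a steady workload priced per million tokens."""
    per_call = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
    return calls_per_day * days * per_call


if __name__ == "__main__":
    # 100K calls/day, 200 image tokens in, 50 tokens out.
    # Prices below are hypothetical placeholders for a cheap tier.
    cost = monthly_cost(100_000, 200, 50,
                        usd_per_m_input=0.25, usd_per_m_output=1.25)
    print(f"${cost:,.2f} per month")
```

With these placeholder rates the result lands inside the $300-450/mo range quoted for this scenario.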
Standard receipt OCR. High detail needed (line items). Balanced tier. Vendor spread: Gemini ~$1,400, Claude ~$2,200, GPT-5.5 ~$2,800. Pick by quality requirement.
Healthy range: $1,400-2,800/mo across vendors
Contract review, multi-page form processing. Premium tier (Claude Opus 4.7, GPT-5.5 Pro). High detail mandatory. ~$3.5K/mo for 5K docs.
Healthy range: $2K-5K/mo (premium tier essential)
Cost isn't the only dimension. Each constraint below changes the recommendation.
Vision pricing varies more than text. Gemini Flash is 3-5× cheaper than GPT-5.5 Vision for similar quality on simple tasks. Worth multi-vendor testing on your actual images.
Vision hallucination is worse than text hallucination - wrong number on a receipt = wrong invoice. Test cheap vs premium on your actual edge cases before committing.
Healthcare imaging needs HIPAA + BAA. Verify before piping production images. Some vendors don't offer BAA on vision - check first.
Vision data is highly identifying - strip EXIF (geolocation, device IDs) before upload. Use enterprise no-train tier for any user-content workflow.
Vision API calls take longer than text - first byte latency typically 300-800ms vs 100-300ms for text. Streaming helps perceived latency.
Vendor APIs for image upload differ (URL vs base64, multipart, formats supported). Multi-vendor abstraction is harder than text. LiteLLM supports it; custom code needs more work.
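As an illustration of how the upload formats diverge, the sketch below builds the base64 image content part for two APIs from the same bytes. The shapes follow my understanding of OpenAI's Chat Completions `image_url` part and Anthropic's Messages `image` part; the helper names are hypothetical, and the field layout should be checked against each vendor's current API reference before use.

```python
import base64


def to_base64(image_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(image_bytes).decode("ascii")


def openai_image_part(image_bytes: bytes,
                      media_type: str = "image/jpeg",
                      detail: str = "low") -> dict:
    """Content part using a data URL plus a detail hint (assumed shape)."""
    data_url = f"data:{media_type};base64,{to_base64(image_bytes)}"
    return {"type": "image_url",
            "image_url": {"url": data_url, "detail": detail}}


def anthropic_image_part(image_bytes: bytes,
                         media_type: str = "image/jpeg") -> dict:
    """Content part using a base64 source object (assumed shape)."""
    return {"type": "image",
            "source": {"type": "base64",
                       "media_type": media_type,
                       "data": to_base64(image_bytes)}}


if __name__ == "__main__":
    sample = b"abc"  # stand-in for real JPEG bytes
    print(openai_image_part(sample))
    print(anthropic_image_part(sample))
```

Keeping one builder per vendor behind a common function signature is the smallest form of the abstraction the paragraph describes.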
Pre-upload preprocessing (resize, format normalization) cuts cost 30-50%. Worth a small pipeline.
Realistic numbers for three of the most common applications:
Content moderation classifier. Low detail sufficient (just need yes/no). Cheap tier. ~$1K/mo at 200K daily.
Healthy range: <$1.5K/mo at 200K/day
Tutoring app: explain math/science diagrams. High detail (need to read text in image). Balanced tier good enough.
Healthy range: $300-800/mo
Triage assistance only (not diagnosis). Premium tier + HIPAA + no-train mandatory. Lower volume but higher per-image cost. Compliance dominates pricing decision.
Healthy range: $500-1,200/mo (compliance tier mandatory)
Honest limitations — every model is wrong; some are useful. Where this one falls short:
For these, use: Cost Calculator for full bill. Audio Cost for voice + vision combo apps.
Author: Subu Vdaygiri, Founder & CEO of CloudIntelligence.ai. 17 years Fortune 100 (Ingram Micro, Siemens). Wharton CTO program · Kellogg CPO program · 10× AWS+Azure certified.
Why this matters: pricing for major vendors has dropped 40-90% in the last 24 months. A budget set 12 months ago is probably wrong by 30%+.
The last-verified date is the most recent successful daily snapshot (aicost_pricing_snapshots) or, when no snapshot exists yet, the latest successful crawler run (aicost_crawler_runs).
10 of 10 vendors are currently verified. Aggregator services (TokenCost, AI Pricing Guru, etc.) are not listed.
Derived from industry conventions, not directly published by the vendor. Typical conventions: cached input = 10% of base (90% off), Batch API = 50% of base (50% off).
| Vendor / Model | Field | Why it’s inferred |
|---|---|---|
| Anthropic / Claude Sonnet 4.6 | cachedInput | Derived at 10% of input rate; Anthropic publishes 90% cache-hit discount on this tier. |
| Anthropic / Claude Sonnet 4.5 | cachedInput | Derived at 10% of input rate; same 90% cache-hit convention as Sonnet 4.6. |
| Anthropic / Claude Sonnet 4.5 | batchInput | Derived at 50% of standard input; Anthropic documents uniform 50% Batch discount. |
| Anthropic / Claude Sonnet 4.5 | batchOutput | Derived at 50% of standard output; Anthropic documents uniform 50% Batch discount. |
| Anthropic / Claude Haiku 4.5 | cachedInput | Derived at 10% of input rate; Anthropic 90% cache-hit discount convention. |
| OpenAI / GPT-5.4 Mini | cachedInput | Derived at 10% of input; OpenAI documents automatic 90% discount on cache hits across GPT-5.x tier. |
| OpenAI / GPT-5.4 Nano | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention. |
| OpenAI / GPT-5.4 Nano | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount. |
| OpenAI / GPT-5.4 Nano | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount. |
| OpenAI / GPT-5.4 Pro | cachedInput | Derived at 10% of input; OpenAI 90% cache-hit convention. |
| OpenAI / GPT-5.4 Pro | batchInput | Derived at 50% of input; OpenAI Batch API uniform 50% discount. |
| OpenAI / GPT-5.4 Pro | batchOutput | Derived at 50% of output; OpenAI Batch API uniform 50% discount. |
| OpenAI / GPT-5.2 | cachedInput | Derived at 10% of input; no residency uplift. |
| OpenAI / GPT-5.2 | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5.2 | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5 | cachedInput | Derived at 10% of input. |
| OpenAI / GPT-5 | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5 | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5.5 Pro | cachedInput | Derived at 10% of input: OpenAI does not publish a cached rate for *-pro models; using the family convention. |
| OpenAI / GPT-5.5 Pro | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5.5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5.2 Pro | cachedInput | Derived at 10% of input; pro-tier convention. |
| OpenAI / GPT-5.2 Pro | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5.2 Pro | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5.1 | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5.1 | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5 Pro | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5 Pro | batchOutput | Derived at 50% of output. |
| OpenAI / GPT-5 Nano | cachedInput | Derived at 10% of input. |
| OpenAI / GPT-5 Nano | batchInput | Derived at 50% of input. |
| OpenAI / GPT-5 Nano | batchOutput | Derived at 50% of output. |
| Google / Gemini 3 Flash | cachedInput | Derived at 10% of input; Google caching discount convention ~90%. |
| Google / Gemini 3.1 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google / Gemini 3.1 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google / Gemini 3.1 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google / Gemini 2.5 Pro | cachedInput | Derived at 10% of input. |
| Google / Gemini 2.5 Flash | cachedInput | Derived at 10% of input. |
| Google / Gemini 2.5 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google / Gemini 2.5 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google / Gemini 2.5 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google / Gemini 2.0 Flash | cachedInput | Derived at 25% of input per Google 2.0 family caching rates. |
| Google / Gemini 2.0 Flash | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google / Gemini 2.0 Flash | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| Google / Gemini 2.0 Flash-Lite | cachedInput | Derived at 10% of input; Google caching convention. |
| Google / Gemini 2.0 Flash-Lite | batchInput | Derived at 50% of input; Google Batch API uniform 50% discount. |
| Google / Gemini 2.0 Flash-Lite | batchOutput | Derived at 50% of output; Google Batch API uniform 50% discount. |
| xAI / Grok 4 (legacy) | cachedInput | Extrapolated at 25% of base. |
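The derivation rules in the table reduce to two multipliers. A minimal sketch, using the field names from the table and the stated conventions (cached input at 10% of base, batch at 50%); the base rates in the example are hypothetical placeholders, and the cached fraction is a parameter because some rows use 25% instead.

```python
def derive_rates(input_rate: float,
                 output_rate: float,
                 cached_fraction: float = 0.10,
                 batch_fraction: float = 0.50) -> dict:
    """Infer cached and batch rates from published base rates (USD per 1M tokens)."""
    return {
        "cachedInput": input_rate * cached_fraction,
        "batchInput": input_rate * batch_fraction,
        "batchOutput": output_rate * batch_fraction,
    }


if __name__ == "__main__":
    # Hypothetical base rates, not taken from any vendor price list.
    print(derive_rates(3.00, 15.00))
    # A row that uses the 25% caching convention instead of 10%.
    print(derive_rates(3.00, 15.00, cached_fraction=0.25))
```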
Pricing is cross-verified against the LiteLLM community registry when available. Daily snapshots are kept in aicost_pricing_snapshots; every change is logged to aicost_price_changelog with old and new values for a full audit trail.
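One way such a changelog could be produced is by diffing consecutive snapshots. The sketch below is illustrative only: the record shape, the `(model, field)` keys, and the function name are assumptions, since the actual schema of aicost_price_changelog is not described here.

```python
from datetime import date


def diff_snapshots(old: dict, new: dict, on: date) -> list[dict]:
    """Return one changelog entry per (model, field) whose price changed."""
    changes = []
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            model, field = key
            changes.append({
                "model": model,
                "field": field,
                "old": old[key],
                "new": new[key],
                "date": on.isoformat(),
            })
    return changes


if __name__ == "__main__":
    # Placeholder model name and prices, for illustration only.
    yesterday = {("model-a", "input"): 3.0, ("model-a", "output"): 15.0}
    today = {("model-a", "input"): 2.5, ("model-a", "output"): 15.0}
    for entry in diff_snapshots(yesterday, today, date(2025, 1, 1)):
        print(entry)
```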