โ† Back to crawler health

๐Ÿ”ง Crawler troubleshooting playbook

Symptom-driven reference for the aicost.ai crawler stack. Captures lessons from the 2026-05-01 production-hardening session. Use this when something looks wrong on /crawler-health or after a crawler run.

๐Ÿ“– This page is the quick-reference. For the full operational guide โ€” system architecture, all 5 crawler tracks in depth, complete bug class library (Aโ€“L), 7-layer defense, hybrid pricing failure modes, post-incident discipline โ€” see docs/AICOST-OPERATIONS-COMPLETE.md in the repo.
โš ๏ธ New entities are auto-detected. Promotion stays manual.

Token API new models โ€” auto-incorporated. Crawler writes to aicost_pending_models and exposes them via getAllModels() on next server restart. Calculators show them with autoDetected:true. Manual verification recommended but not required:

SELECT vendor_slug, model_slug, detection_count, last_detected_at
  FROM aicost_pending_models WHERE verification_status = 'auto'
 ORDER BY first_detected_at DESC;

Hybrid new plans (added 2026-05-03) โ€” auto-detected via the hybrid crawler, queued in aicost_pending_hybrid_plans. Plans do NOT auto-expose to consumers โ€” manual promotion required (audience tagging, capability flags, INFERRED price judgment). Plans flagged needs_inferred_review=1 when seat is null or extraction confidence is low:

SELECT vendor_slug, plan_slug, plan_name, category,
       extracted_seat_price_monthly, extraction_confidence,
       needs_inferred_review, inferred_reason, detection_count
  FROM aicost_pending_hybrid_plans WHERE verification_status = 'auto'
 ORDER BY needs_inferred_review DESC, first_detected_at DESC;

Promotion path: Move entry into lib/pricing-data.js (token) or one of the lib/hybrid-pricing-data*.js files (hybrid), then UPDATE ... SET verification_status='verified'. Reject hallucinations with 'rejected' + rejection_reason. Either action drops from cache/queue at next crawler run.

Jump to:

Daily routine โ€” what to check, when

30 seconds โ€” every day

  1. Open /crawler-health
  2. If TLDR shows โœ… All crawlers healthy โ†’ done, move on
  3. If anything red โ†’ click into the failing track, read the "why" line, follow the matching section below

5 minutes โ€” every Monday after scheduled crawls fire

  1. Run node scripts\aicost-changelog-triage.js --since=24h
  2. Check SUSPICIOUS bucket count โ€” should be 0 or near-0 with the deployed sanity rules
  3. For real changes (REVIEW + SAFE), manually edit lib/pricing-data.js with new values โ€” this is the editorial gate (see "How pricing data flows" below)
  4. Then regenerate V2 guides for affected vendors (see regen section)
  5. For NEW models โ€” verify they're real (vendor announcement) before adding to expected_models

30 minutes โ€” every quarter (last week of Q1/Q2/Q3/Q4)

  1. Open docs/MICROSOFT-MANUAL-UPDATES.md
  2. Verify all 7 Microsoft plans against vendor pages in browser (not curl)
  3. Update last_verified in lib/hybrid-pricing-data-verified-2026q2.js AND in routes/crawler-health.js MANUAL_VENDORS array
  4. Regen Microsoft V2: node scripts\generate-vendor-pricing-guides-batch.js --vendor=microsoft --force

How pricing data flows โ€” the editorial gate

This is the most important architectural fact about aicost.ai.

lib/pricing-data.js is not crawler output. It's the human-curated source of truth โ€” what aicost.ai claims is true about vendor prices. The crawler reads this file for diff comparison but never writes to it.

Data flow

Vendor page  โ”€โ”€CFโ”€โ”€โ–ถ  Mistral extracts JSON  โ”€โ”€diff vsโ”€โ”€โ–ถ  lib/pricing-data.js
                                                                       โ”‚
                                                                       โ”‚ (read-only)
                                                                       โ–ผ
                                                              aicost_price_changelog
                                                                       โ”‚
                                                              [HUMAN REVIEWS] โ—€โ”€โ”€ triage script
                                                                       โ”‚
                                                                       โ–ผ
                                                              Operator edits
                                                              lib/pricing-data.js
                                                                       โ”‚
                                                                       โ–ผ
                                                              V2 guide regen
                                                                       โ”‚
                                                                       โ–ผ
                                                              User-facing pages

Why this matters

What this means in practice

If the dashboard shows Track 1 as "stale (30+ days)", it means YOU haven't updated lib/pricing-data.js in a month. It does NOT mean the crawler failed. The fix is editorial:

  1. Run triage: node scripts\aicost-changelog-triage.js --since=30d
  2. Review accumulated REVIEW + SAFE rows
  3. Edit lib/pricing-data.js for confirmed changes
  4. Save (mtime updates)
  5. Regen V2 for affected vendors

This is the moat. Other AI cost comparison sites ship whatever an LLM extracts. aicost.ai has an editorial gate.

Understanding the 5 crawler tracks

TrackWhat it doesWrites toCadenceRun command
1. Token API Per-token rates for LLM, embedding, TTS APIs lib/pricing-data.js (static file) Nightly node scripts\aicost-pricing-crawler.js
2. Product Subscription tiers (Pro/Team/Enterprise) aicost_product_pricing + aicost_crawler_runs with product:* prefix Weekly node scripts\aicost-product-crawler.js
3. Hybrid Seat + overage plans (Anthropic Team, GitHub Copilot, etc.) aicost_price_changelog with hybrid:* prefix Weekly Mon 5 AM node lib\aicost-hybrid-crawler.js
4. Manual Microsoft + future anti-bot blocked vendors lib/hybrid-pricing-data-verified-2026q2.js Quarterly audit Edit file directly, see procedure doc
5. Benchmark Mechanism facts: cache rates, batch discounts, latencies aicost_benchmarks table Weekly node scripts\aicost-benchmark-crawler.js

Triage workflow โ€” bad data detected, what now?

After any crawler run, the triage script classifies all changes into 4 buckets:

node scripts\aicost-changelog-triage.js --since=1h
node scripts\aicost-changelog-triage.js --since=24h --vendor=google
BucketMeaningAction
๐Ÿšซ SUSPICIOUS >200% change โ€” almost certainly extraction error DO NOT ship. Investigate prompt rules + sanity checks.
๐ŸŸก REVIEW 30-200% change โ€” could be real (price war) or noise Verify against vendor page in browser. If real โ†’ ship. If wrong โ†’ revert.
โœ… SAFE โ‰ค30% change โ€” typical price drift Ship โ€” regenerate V2 for affected vendor.
๐Ÿ†• NEW First time we've seen this slug Verify it's a real new model (vendor announcement). If yes, add to expected_models in pricing-data.js.

"SUSPICIOUS" rows in changelog (cachedInput >$1, etc.)

Symptom: Triage shows cachedInput: 0.05 โ†’ 1 (+1900%) โ€” uniform $1.00 or $4.50 across multiple Google models
Cause: Mistral confused the $1/M-tokens-per-hour STORAGE fee with the per-token cache READ fee. Common Google trap.
Fix: Sanity rule in aicost-pricing-crawler.js auto-catches this โ€” if it's not, file timestamp may be older than patch:
findstr "cached_input_price >= input_price" D:\aicost\scripts\aicost-pricing-crawler.js
:: Should return โ‰ฅ1 line. If empty, the patched crawler isn't deployed.

dir D:\aicost\scripts\aicost-pricing-crawler.js
:: Should be timestamped 2026-05-01 or later.

If the corruption already wrote to pricing-data.js

:: Surgical revert: any cachedInput row from this run with >200% change
node scripts\aicost-revert-suspicious.js --field=cachedInput --pct-min=200 --since=2h --dry-run
node scripts\aicost-revert-suspicious.js --field=cachedInput --pct-min=200 --since=2h
:: (Removes --dry-run to apply. Backup auto-saved to pricing-data.js.backup-<timestamp>)

:: Then re-run google with patched extractor
node scripts\aicost-pricing-crawler.js --vendor=google
:: Verify clean: changelog should have 0 suspicious rows for this run
node scripts\aicost-changelog-triage.js --since=10m --vendor=google

New models detected โ€” real or hallucination?

Symptom: Crawler logs ๐Ÿ†• openai/gpt-5-5 โ€” NEW MODEL DETECTED
Cause: Either (a) vendor genuinely launched a new model, or (b) Mistral hallucinated a slug from page text noise.

Verification checklist

  1. Open vendor's pricing page in browser. Search for the model name.
  2. Cross-check vendor's announcements (OpenAI DevDay, Anthropic Build, Google I/O).
  3. Look at the prices Mistral extracted โ€” do they match the page?
  4. If real โ†’ add slug to expected_models in scripts/aicost-pricing-crawler.js for that vendor's config block.
  5. If hallucinated โ†’ ignore. Next crawl will likely not re-extract it. If persists, check Mistral's prompt for ambiguity.

Slug-noise patterns we've seen

Mistral outputReal slug
anthropic-proanthropic-claude-pro
openai-plusopenai-chatgpt-plus
mistral-enterprisemistral-le-chat-enterprise
veo-3-1-generate-preview (dashes)veo-3.1-generate-preview (dots)

"Expected but missing" warnings

Symptom: Crawler logs โš ๏ธ google/imagen-4.0-fast-generate-001 โ€” expected but not extracted
Cause: Either (a) vendor renamed/removed the model, (b) extraction missed it due to page layout change, or (c) slug-comparison normalization mismatch (dot vs dash).
Fix:
  1. Open vendor page in browser, confirm model still exists.
  2. If renamed โ†’ update expected_models in crawler config to new name.
  3. If removed โ†’ delete from expected_models.
  4. If still on page but Mistral missed it โ†’ look at scripts\aicost-pricing-crawler.js EXTRACTION_PROMPT and add example/hint for that model class.

Vendor shows "Never run"

Symptom: /crawler-health TLDR lists vendor as Never run
Cause: Three possibilities:
  1. Scheduled task not registered โ†’ run powershell -ExecutionPolicy Bypass -File D:\aicost\powershell\Run-AicostCrawler.ps1
  2. Vendor not in this crawler's VENDORS list โ†’ check the script's vendor array
  3. Crawler writes to a different prefix than dashboard expects โ†’ verify dashboard's HYBRID_VENDORS / PRODUCT_VENDORS arrays match actual data
Fix:
:: Step 1: confirm scheduled tasks exist
schtasks /query /tn "aicost*"

:: Step 2: trigger one manually to test
schtasks /run /tn "aicost-pricing-crawler"

:: Step 3: check actual changelog content
mysql -u root -p toolsinfo -e "SELECT vendor_slug, MAX(started_at) FROM aicost_crawler_runs GROUP BY vendor_slug ORDER BY 2 DESC LIMIT 20;"

Track shows STALE

Symptom: Track 1/2/3/5 shows ! stale badge
Cause: Last run is older than the staleness threshold:
  • Token API: 36h
  • Product: 36h
  • Hybrid: 14d (336h)
  • Benchmark: 14d (336h)
Fix: Run the crawler manually using commands from the dashboard's "Manual run commands" section. Or check Windows Task Scheduler for failed scheduled runs:
powershell "Get-WinEvent -LogName 'Microsoft-Windows-TaskScheduler/Operational' -MaxEvents 50 | Where-Object {$_.Message -like '*aicost*'} | Format-Table TimeCreated, Id, LevelDisplayName"

Mistral 401 Unauthorized

Symptom: All 13-15 vendors fail with Mistral extract failed: 401 {"detail":"Unauthorized"}
Cause: API key issue โ€” expired, rate-limit hit, billing block, or rotated.
Fix:
:: Test the key directly
curl -H "Authorization: Bearer %MISTRAL_API_KEY%" https://api.mistral.ai/v1/models

:: If 401 โ†’ log into Mistral console, check key status / billing
:: If new key needed, update D:\aicost\.env then restart pm2:
notepad D:\aicost\.env
pm2 reload aicost
Note: 401 is usually transient (rate-limit window). Retry in 30 min before assuming key is dead.

Cloudflare /markdown 422 / 429 / timeouts

Symptoms:
  • CF /markdown 422: timeout was reached โ€” page took too long to render
  • CF rate limited (429), retry 1/4 in 8s โ€” hitting CF's request rate cap
  • (no content) โ€” page rendered empty (CSR-only site)
Causes & fixes:
  • 422 timeout: Increase the actionTimeout in CF call, OR add the page to JS-wait retry list
  • 429: Built-in retry handles it (4 attempts at 8s/16s/32s/60s). If failing all 4 โ†’ CF account quota hit. Check Cloudflare dashboard.
  • No content: Crawler auto-retries with networkidle0 JS-wait. Usually works on second pass.

Vendor anti-bot blocks (Microsoft pattern)

Symptom: CF returns 9-10 KB of "Your User-Agent appears to be from an automated process" decoy content. Looks normal in size but contains no real pricing.
Cause: Vendor (currently: Microsoft) fingerprints CF's browser-rendering API and serves anti-bot content.
Fix: Move vendor to manual maintenance procedure:
  1. Add plan records to lib/hybrid-pricing-data-verified-2026q2.js with verified_by: 'manual' and inferred_fields._blocker_note
  2. Add entry to MANUAL_VENDORS array in routes/crawler-health.js
  3. Create docs/<VENDOR>-MANUAL-UPDATES.md patterned on Microsoft procedure
  4. Future: try Firecrawl with residential proxies (~$0.40/week) โ€” bypasses most anti-bot

V2 narrative generation fails (Gemini non-JSON)

Symptom: โŒ Gemini non-JSON after 2 attempts (length=NNNN) for one vendor in batch
Cause: Either (a) Gemini non-determinism, (b) sparse context (no api_models for vendor like Microsoft), or (c) bad character in prompt content.
Fix sequence:
:: 1. Check what's in the context for this vendor
node scripts\debug-builder-microsoft.js
:: (or copy this script and edit vendor slug for other vendors)

:: 2. If ctx looks right, try regen alone (often succeeds on retry)
node scripts\generate-vendor-pricing-guides-batch.js --vendor=<slug> --force

:: 3. If it persists, capture raw Gemini output
node scripts\debug-microsoft-gemini.js > D:\aicost\logs\gemini-debug.log 2>&1
:: Look at "Char at position N" in the parse error โ€” usually a smart quote, em-dash, or unescaped quote
Common patterns: vendor without api_models needs the V1-bridge defensive fallbacks (these are already deployed in the patched aicost-guide-builder-v2.js + aicost-guide-prompts-v2.js).

SQL: "Data truncated for column"

Symptom: ERROR 1265 (01000): Data truncated for column 'primary_category' on INSERT
Cause: Column is an ENUM, value not in allowed list. Length isn't the issue.
Fix:
:: Step 1: see what values are allowed
mysql -u root -p toolsinfo -e "SELECT DISTINCT primary_category FROM aicost_guide_vendor_directory ORDER BY 1;"

:: Step 2: pick existing value OR extend the ENUM
:: To extend, see the pattern in sql/2026-05-01-extend-category-enum.sql
:: Current valid values: llm-api, embedding, vector-db, transcription, tts,
::   consumer-ide, consumer-image, consumer-search, consumer-video,
::   office-ai, sales-ai, developer-assistant

MySQL DECIMAL โ†’ JS string gotcha

Symptom: TypeError: r.change_pct.toFixed is not a function in any custom script reading changelog
Cause: mysql2 returns DECIMAL columns as STRINGS (precision preservation). Math operators on them silently fail or behave unexpectedly.
Fix: Always coerce to Number after query:
const [rows] = await pool.execute('SELECT change_pct FROM aicost_price_changelog ...');
rows.forEach(r => {
    r.change_pct = r.change_pct == null ? null : Number(r.change_pct);
    r.old_value  = r.old_value  == null ? null : Number(r.old_value);
    r.new_value  = r.new_value  == null ? null : Number(r.new_value);
});

After fixing data โ€” regenerate V2 guides

Once the changelog looks clean (triage shows 0 SUSPICIOUS, sane REVIEW values), regenerate V2 guides for affected vendors so users see fresh data:

:: Single vendor (~30s, ~$0.02)
node scripts\generate-vendor-pricing-guides-batch.js --vendor=google --force

:: Multiple vendors  
node scripts\generate-vendor-pricing-guides-batch.js --vendor=google --vendor=openai --force

:: All 16 V2 vendors (~6 min, ~$0.30)
node scripts\generate-vendor-pricing-guides-batch.js --all --force

:: Verify the regen landed
mysql -u root -p toolsinfo -e "SELECT slug, last_refreshed_at FROM aicost_guides WHERE type='vendor-pricing' AND schema_version='v2' ORDER BY last_refreshed_at DESC LIMIT 5;"

:: Smoke-test a persona URL
curl -I http://localhost:3231/ai-cost-guides/pricing/google/for-developer

Adding a new vendor to the system

To add a new vendor (e.g., new LLM provider) end-to-end:

  1. Add to crawler config: edit scripts/aicost-pricing-crawler.js VENDORS array with vendor entry: slug, name, pricing URLs, expected_models
  2. Add to vendor directory:
    INSERT INTO aicost_guide_vendor_directory (vendor_slug, vendor_name, vendor_url, primary_category, ...) VALUES (...);
  3. Run pricing crawler: node scripts\aicost-pricing-crawler.js --vendor=<new-slug> โ€” verify clean extraction
  4. Add to dashboard arrays: in routes/crawler-health.js โ€” TOKEN_API_VENDORS if applicable
  5. Generate V2 guide: node scripts\generate-vendor-pricing-guides-batch.js --vendor=<new-slug> --force
  6. Smoke-test 5 persona URLs for the new vendor

Adding a manual-only (anti-bot) vendor

When a vendor blocks scraping (Salesforce, Adobe, etc.):

  1. Add plan records to lib/hybrid-pricing-data-verified-2026q2.js with verified_by: 'manual' + inferred_fields._blocker_note
  2. Add slug to DEPRECATED_BASE_SLUGS if it overlaps with base file
  3. Add entry to MANUAL_VENDORS array in routes/crawler-health.js with cadence + docs path
  4. Copy docs/MICROSOFT-MANUAL-UPDATES.md โ†’ docs/<VENDOR>-MANUAL-UPDATES.md and customize
  5. Add vendor to aicost_guide_vendor_directory if not already
  6. Test V2 generation: node scripts\generate-vendor-pricing-guides-batch.js --vendor=<slug> --force
  7. Set first quarterly audit date in MANUAL_VENDORS entry's next_due

Documentation index

VISION-AND-ARCHITECTURE.md Why V2 exists, sibling-file pattern, system architecture overview
MASTER-DEPLOYMENT-RUNBOOK.md Step-by-step deploy from scratch (Steps 1-20)
TESTING-CHECKLIST.md Post-deploy validation tests M1-M15
NEXT-STEPS.md Phased roadmap (Phase 0-7), success metrics
MICROSOFT-MANUAL-UPDATES.md Quarterly Microsoft audit procedure with audit log
2026-05-01-SESSION-LEARNINGS.md Bug retrospective: 17 bugs, fixes, patterns codified

Last updated 2026-05-01 ยท โ† Back to crawler health