← Back to crawler health

🔧 Crawler troubleshooting playbook

Symptom-driven reference for the aicost.ai crawler stack. Captures lessons from the 2026-05-01 production-hardening session. Use this when something looks wrong on /crawler-health or after a crawler run.

📖 This page is the quick-reference. For the full operational guide — system architecture, all 5 crawler tracks in depth, complete bug class library (A–L), 7-layer defense, hybrid pricing failure modes, post-incident discipline — see docs/AICOST-OPERATIONS-COMPLETE.md in the repo.

⚠️ New entities are auto-detected. Promotion stays manual.

Token API new models — auto-incorporated. Crawler writes to aicost_pending_models and exposes them via getAllModels() on next server restart. Calculators show them with autoDetected:true. Manual verification recommended but not required:

SELECT vendor_slug, model_slug, detection_count, last_detected_at
  FROM aicost_pending_models WHERE verification_status = 'auto'
 ORDER BY first_detected_at DESC;

Hybrid new plans (added 2026-05-03) — auto-detected via the hybrid crawler, queued in aicost_pending_hybrid_plans. Plans do NOT auto-expose to consumers — manual promotion required (audience tagging, capability flags, INFERRED price judgment). Plans flagged needs_inferred_review=1 when seat is null or extraction confidence is low:

SELECT vendor_slug, plan_slug, plan_name, category,
       extracted_seat_price_monthly, extraction_confidence,
       needs_inferred_review, inferred_reason, detection_count
  FROM aicost_pending_hybrid_plans WHERE verification_status = 'auto'
 ORDER BY needs_inferred_review DESC, first_detected_at DESC;

Promotion path: Move entry into lib/pricing-data.js (token) or one of the lib/hybrid-pricing-data*.js files (hybrid), then UPDATE ... SET verification_status='verified'. Reject hallucinations with 'rejected' + rejection_reason. Either action drops from cache/queue at next crawler run.

Jump to:

Daily routine — what to check, when
Understanding the 5 crawler tracks
⭐ How pricing data flows — the editorial gate
Triage workflow — bad data detected, what now?
"SUSPICIOUS" rows in changelog (cachedInput >$1, etc.)
New models detected — real or hallucination?
"Expected but missing" warnings
Vendor shows "Never run"
Track shows STALE
Mistral 401 Unauthorized
Cloudflare /markdown 422 / 429 / timeouts
Vendor anti-bot blocks (Microsoft pattern)
V2 narrative generation fails (Gemini non-JSON)
SQL: "Data truncated for column"
MySQL DECIMAL → JS string gotcha
After fixing data — regenerate V2 guides
Adding a new vendor to the system
Adding a manual-only (anti-bot) vendor
Documentation index

Daily routine — what to check, when

30 seconds — every day

Open /crawler-health
If TLDR shows ✅ All crawlers healthy → done, move on
If anything red → click into the failing track, read the "why" line, follow the matching section below

5 minutes — every Monday after scheduled crawls fire

Run node scripts\aicost-changelog-triage.js --since=24h
Check SUSPICIOUS bucket count — should be 0 or near-0 with the deployed sanity rules
For real changes (REVIEW + SAFE), manually edit lib/pricing-data.js with new values — this is the editorial gate (see "How pricing data flows" below)
Then regenerate V2 guides for affected vendors (see regen section)
For NEW models — verify they're real (vendor announcement) before adding to expected_models

30 minutes — every quarter (last week of Q1/Q2/Q3/Q4)

Open docs/MICROSOFT-MANUAL-UPDATES.md
Verify all 7 Microsoft plans against vendor pages in browser (not curl)
Update last_verified in lib/hybrid-pricing-data-verified-2026q2.js AND in routes/crawler-health.js MANUAL_VENDORS array
Regen Microsoft V2: node scripts\generate-vendor-pricing-guides-batch.js --vendor=microsoft --force

How pricing data flows — the editorial gate

This is the most important architectural fact about aicost.ai.

lib/pricing-data.js is not crawler output. It's the human-curated source of truth — what aicost.ai claims is true about vendor prices. The crawler reads this file for diff comparison but never writes to it.

Data flow

Vendor page  ──CF──▶  Mistral extracts JSON  ──diff vs──▶  lib/pricing-data.js
                                                                       │
                                                                       │ (read-only)
                                                                       ▼
                                                              aicost_price_changelog
                                                                       │
                                                              [HUMAN REVIEWS] ◀── triage script
                                                                       │
                                                                       ▼
                                                              Operator edits
                                                              lib/pricing-data.js
                                                                       │
                                                                       ▼
                                                              V2 guide regen
                                                                       │
                                                                       ▼
                                                              User-facing pages

Why this matters

LLM hallucinations cannot reach users — bad extractions land in changelog, not pricing-data.js
Trust badge is honest — V2 guides reflect when prices were human-verified, not when crawler last ran
Audit trail — every change in lib/pricing-data.js is preceded by triage-approved changelog rows
Track 1 stale threshold = 30 days — this is human-edit cadence, not crawler-run cadence

What this means in practice

If the dashboard shows Track 1 as "stale (30+ days)", it means YOU haven't updated lib/pricing-data.js in a month. It does NOT mean the crawler failed. The fix is editorial:

Run triage: node scripts\aicost-changelog-triage.js --since=30d
Review accumulated REVIEW + SAFE rows
Edit lib/pricing-data.js for confirmed changes
Save (mtime updates)
Regen V2 for affected vendors

This is the moat. Other AI cost comparison sites ship whatever an LLM extracts. aicost.ai has an editorial gate.

Understanding the 5 crawler tracks

Track	What it does	Writes to	Cadence	Run command
1. Token API	Per-token rates for LLM, embedding, TTS APIs	`lib/pricing-data.js` (static file)	Nightly	`node scripts\aicost-pricing-crawler.js`
2. Product	Subscription tiers (Pro/Team/Enterprise)	`aicost_product_pricing` + `aicost_crawler_runs` with `product:*` prefix	Weekly	`node scripts\aicost-product-crawler.js`
3. Hybrid	Seat + overage plans (Anthropic Team, GitHub Copilot, etc.)	`aicost_price_changelog` with `hybrid:*` prefix	Weekly Mon 5 AM	`node lib\aicost-hybrid-crawler.js`
4. Manual	Microsoft + future anti-bot blocked vendors	`lib/hybrid-pricing-data-verified-2026q2.js`	Quarterly audit	Edit file directly, see procedure doc
5. Benchmark	Mechanism facts: cache rates, batch discounts, latencies	`aicost_benchmarks` table	Weekly	`node scripts\aicost-benchmark-crawler.js`

Triage workflow — bad data detected, what now?

After any crawler run, the triage script classifies all changes into 4 buckets:

node scripts\aicost-changelog-triage.js --since=1h
node scripts\aicost-changelog-triage.js --since=24h --vendor=google

Bucket	Meaning	Action
🚫 SUSPICIOUS	>200% change — almost certainly extraction error	DO NOT ship. Investigate prompt rules + sanity checks.
🟡 REVIEW	30-200% change — could be real (price war) or noise	Verify against vendor page in browser. If real → ship. If wrong → revert.
✅ SAFE	≤30% change — typical price drift	Ship — regenerate V2 for affected vendor.
🆕 NEW	First time we've seen this slug	Verify it's a real new model (vendor announcement). If yes, add to `expected_models` in pricing-data.js.

"SUSPICIOUS" rows in changelog (cachedInput >$1, etc.)

Symptom: Triage shows cachedInput: 0.05 → 1 (+1900%) — uniform $1.00 or $4.50 across multiple Google models

Cause: Mistral confused the $1/M-tokens-per-hour STORAGE fee with the per-token cache READ fee. Common Google trap.

Fix: Sanity rule in aicost-pricing-crawler.js auto-catches this — if it's not, file timestamp may be older than patch:

findstr "cached_input_price >= input_price" D:\aicost\scripts\aicost-pricing-crawler.js
:: Should return ≥1 line. If empty, the patched crawler isn't deployed.

dir D:\aicost\scripts\aicost-pricing-crawler.js
:: Should be timestamped 2026-05-01 or later.

If the corruption already wrote to pricing-data.js

:: Surgical revert: any cachedInput row from this run with >200% change
node scripts\aicost-revert-suspicious.js --field=cachedInput --pct-min=200 --since=2h --dry-run
node scripts\aicost-revert-suspicious.js --field=cachedInput --pct-min=200 --since=2h
:: (Removes --dry-run to apply. Backup auto-saved to pricing-data.js.backup-<timestamp>)

:: Then re-run google with patched extractor
node scripts\aicost-pricing-crawler.js --vendor=google
:: Verify clean: changelog should have 0 suspicious rows for this run
node scripts\aicost-changelog-triage.js --since=10m --vendor=google

New models detected — real or hallucination?

Symptom: Crawler logs 🆕 openai/gpt-5-5 — NEW MODEL DETECTED

Cause: Either (a) vendor genuinely launched a new model, or (b) Mistral hallucinated a slug from page text noise.

Verification checklist

Open vendor's pricing page in browser. Search for the model name.
Cross-check vendor's announcements (OpenAI DevDay, Anthropic Build, Google I/O).
Look at the prices Mistral extracted — do they match the page?
If real → add slug to expected_models in scripts/aicost-pricing-crawler.js for that vendor's config block.
If hallucinated → ignore. Next crawl will likely not re-extract it. If persists, check Mistral's prompt for ambiguity.

Slug-noise patterns we've seen

Mistral output	Real slug
`anthropic-pro`	`anthropic-claude-pro`
`openai-plus`	`openai-chatgpt-plus`
`mistral-enterprise`	`mistral-le-chat-enterprise`
`veo-3-1-generate-preview` (dashes)	`veo-3.1-generate-preview` (dots)

"Expected but missing" warnings

Symptom: Crawler logs ⚠️ google/imagen-4.0-fast-generate-001 — expected but not extracted

Cause: Either (a) vendor renamed/removed the model, (b) extraction missed it due to page layout change, or (c) slug-comparison normalization mismatch (dot vs dash).

Fix:

Open vendor page in browser, confirm model still exists.
If renamed → update expected_models in crawler config to new name.
If removed → delete from expected_models.
If still on page but Mistral missed it → look at scripts\aicost-pricing-crawler.js EXTRACTION_PROMPT and add example/hint for that model class.

Vendor shows "Never run"

Symptom: /crawler-health TLDR lists vendor as Never run

Cause: Three possibilities:

Scheduled task not registered → run powershell -ExecutionPolicy Bypass -File D:\aicost\powershell\Run-AicostCrawler.ps1
Vendor not in this crawler's VENDORS list → check the script's vendor array
Crawler writes to a different prefix than dashboard expects → verify dashboard's HYBRID_VENDORS / PRODUCT_VENDORS arrays match actual data

Fix:

:: Step 1: confirm scheduled tasks exist
schtasks /query /tn "aicost*"

:: Step 2: trigger one manually to test
schtasks /run /tn "aicost-pricing-crawler"

:: Step 3: check actual changelog content
mysql -u root -p toolsinfo -e "SELECT vendor_slug, MAX(started_at) FROM aicost_crawler_runs GROUP BY vendor_slug ORDER BY 2 DESC LIMIT 20;"

Track shows STALE

Symptom: Track 1/2/3/5 shows ! stale badge

Cause: Last run is older than the staleness threshold:

Token API: 36h
Product: 36h
Hybrid: 14d (336h)
Benchmark: 14d (336h)

Fix: Run the crawler manually using commands from the dashboard's "Manual run commands" section. Or check Windows Task Scheduler for failed scheduled runs:

powershell "Get-WinEvent -LogName 'Microsoft-Windows-TaskScheduler/Operational' -MaxEvents 50 | Where-Object {$_.Message -like '*aicost*'} | Format-Table TimeCreated, Id, LevelDisplayName"

Mistral 401 Unauthorized

Symptom: All 13-15 vendors fail with Mistral extract failed: 401 {"detail":"Unauthorized"}

Cause: API key issue — expired, rate-limit hit, billing block, or rotated.

Fix:

:: Test the key directly
curl -H "Authorization: Bearer %MISTRAL_API_KEY%" https://api.mistral.ai/v1/models

:: If 401 → log into Mistral console, check key status / billing
:: If new key needed, update D:\aicost\.env then restart pm2:
notepad D:\aicost\.env
pm2 reload aicost

Note: 401 is usually transient (rate-limit window). Retry in 30 min before assuming key is dead.

Cloudflare /markdown 422 / 429 / timeouts

Symptoms:

CF /markdown 422: timeout was reached — page took too long to render
CF rate limited (429), retry 1/4 in 8s — hitting CF's request rate cap
(no content) — page rendered empty (CSR-only site)

Causes & fixes:

422 timeout: Increase the actionTimeout in CF call, OR add the page to JS-wait retry list
429: Built-in retry handles it (4 attempts at 8s/16s/32s/60s). If failing all 4 → CF account quota hit. Check Cloudflare dashboard.
No content: Crawler auto-retries with networkidle0 JS-wait. Usually works on second pass.

Vendor anti-bot blocks (Microsoft pattern)

Symptom: CF returns 9-10 KB of "Your User-Agent appears to be from an automated process" decoy content. Looks normal in size but contains no real pricing.

Cause: Vendor (currently: Microsoft) fingerprints CF's browser-rendering API and serves anti-bot content.

Fix: Move vendor to manual maintenance procedure:

Add plan records to lib/hybrid-pricing-data-verified-2026q2.js with verified_by: 'manual' and inferred_fields._blocker_note
Add entry to MANUAL_VENDORS array in routes/crawler-health.js
Create docs/<VENDOR>-MANUAL-UPDATES.md patterned on Microsoft procedure
Future: try Firecrawl with residential proxies (~$0.40/week) — bypasses most anti-bot

V2 narrative generation fails (Gemini non-JSON)

Symptom: ❌ Gemini non-JSON after 2 attempts (length=NNNN) for one vendor in batch

Cause: Either (a) Gemini non-determinism, (b) sparse context (no api_models for vendor like Microsoft), or (c) bad character in prompt content.

Fix sequence:

:: 1. Check what's in the context for this vendor
node scripts\debug-builder-microsoft.js
:: (or copy this script and edit vendor slug for other vendors)

:: 2. If ctx looks right, try regen alone (often succeeds on retry)
node scripts\generate-vendor-pricing-guides-batch.js --vendor=<slug> --force

:: 3. If it persists, capture raw Gemini output
node scripts\debug-microsoft-gemini.js > D:\aicost\logs\gemini-debug.log 2>&1
:: Look at "Char at position N" in the parse error — usually a smart quote, em-dash, or unescaped quote

Common patterns: vendor without api_models needs the V1-bridge defensive fallbacks (these are already deployed in the patched aicost-guide-builder-v2.js + aicost-guide-prompts-v2.js).

SQL: "Data truncated for column"

Symptom: ERROR 1265 (01000): Data truncated for column 'primary_category' on INSERT

Cause: Column is an ENUM, value not in allowed list. Length isn't the issue.

Fix:

:: Step 1: see what values are allowed
mysql -u root -p toolsinfo -e "SELECT DISTINCT primary_category FROM aicost_guide_vendor_directory ORDER BY 1;"

:: Step 2: pick existing value OR extend the ENUM
:: To extend, see the pattern in sql/2026-05-01-extend-category-enum.sql
:: Current valid values: llm-api, embedding, vector-db, transcription, tts,
::   consumer-ide, consumer-image, consumer-search, consumer-video,
::   office-ai, sales-ai, developer-assistant

MySQL DECIMAL → JS string gotcha

Symptom: TypeError: r.change_pct.toFixed is not a function in any custom script reading changelog

Cause: mysql2 returns DECIMAL columns as STRINGS (precision preservation). Math operators on them silently fail or behave unexpectedly.

Fix: Always coerce to Number after query:

const [rows] = await pool.execute('SELECT change_pct FROM aicost_price_changelog ...');
rows.forEach(r => {
    r.change_pct = r.change_pct == null ? null : Number(r.change_pct);
    r.old_value  = r.old_value  == null ? null : Number(r.old_value);
    r.new_value  = r.new_value  == null ? null : Number(r.new_value);
});

After fixing data — regenerate V2 guides

Once the changelog looks clean (triage shows 0 SUSPICIOUS, sane REVIEW values), regenerate V2 guides for affected vendors so users see fresh data:

:: Single vendor (~30s, ~$0.02)
node scripts\generate-vendor-pricing-guides-batch.js --vendor=google --force

:: Multiple vendors  
node scripts\generate-vendor-pricing-guides-batch.js --vendor=google --vendor=openai --force

:: All 16 V2 vendors (~6 min, ~$0.30)
node scripts\generate-vendor-pricing-guides-batch.js --all --force

:: Verify the regen landed
mysql -u root -p toolsinfo -e "SELECT slug, last_refreshed_at FROM aicost_guides WHERE type='vendor-pricing' AND schema_version='v2' ORDER BY last_refreshed_at DESC LIMIT 5;"

:: Smoke-test a persona URL
curl -I http://localhost:3231/ai-cost-guides/pricing/google/for-developer

Adding a new vendor to the system

To add a new vendor (e.g., new LLM provider) end-to-end:

Add to crawler config: edit scripts/aicost-pricing-crawler.js VENDORS array with vendor entry: slug, name, pricing URLs, expected_models

Add to vendor directory:

INSERT INTO aicost_guide_vendor_directory (vendor_slug, vendor_name, vendor_url, primary_category, ...) VALUES (...);

Run pricing crawler: node scripts\aicost-pricing-crawler.js --vendor=<new-slug> — verify clean extraction
Add to dashboard arrays: in routes/crawler-health.js — TOKEN_API_VENDORS if applicable
Generate V2 guide: node scripts\generate-vendor-pricing-guides-batch.js --vendor=<new-slug> --force
Smoke-test 5 persona URLs for the new vendor

Adding a manual-only (anti-bot) vendor

When a vendor blocks scraping (Salesforce, Adobe, etc.):

Add plan records to lib/hybrid-pricing-data-verified-2026q2.js with verified_by: 'manual' + inferred_fields._blocker_note
Add slug to DEPRECATED_BASE_SLUGS if it overlaps with base file
Add entry to MANUAL_VENDORS array in routes/crawler-health.js with cadence + docs path
Copy docs/MICROSOFT-MANUAL-UPDATES.md → docs/<VENDOR>-MANUAL-UPDATES.md and customize
Add vendor to aicost_guide_vendor_directory if not already
Test V2 generation: node scripts\generate-vendor-pricing-guides-batch.js --vendor=<slug> --force
Set first quarterly audit date in MANUAL_VENDORS entry's next_due

Documentation index

VISION-AND-ARCHITECTURE.md Why V2 exists, sibling-file pattern, system architecture overview

MASTER-DEPLOYMENT-RUNBOOK.md Step-by-step deploy from scratch (Steps 1-20)

TESTING-CHECKLIST.md Post-deploy validation tests M1-M15

NEXT-STEPS.md Phased roadmap (Phase 0-7), success metrics

MICROSOFT-MANUAL-UPDATES.md Quarterly Microsoft audit procedure with audit log

2026-05-01-SESSION-LEARNINGS.md Bug retrospective: 17 bugs, fixes, patterns codified

Last updated 2026-05-01 · ← Back to crawler health