¶ Overview
Scout is an AI-augmented data ingestion platform. It picks the right scraping engine per source, runs the job, normalizes the output, and exposes the result through a single REST API and a Next.js admin UI. Multi-tenant by project, with cost tracking, circuit breakers, a cron scheduler, and a Supervisor Agent that audits data quality on a loop. Used internally to feed gaming intel, UAE freezone data, and crypto/web3 sources into downstream apps.
¶ Important info
Five scraping engines (Apify, Firecrawl, Crawlee, Playwright, Scrapling) sit behind a common interface with per-service circuit breakers, AbortController timeouts, and retry-with-jitter. AI discovery and extraction run on DeepSeek V3 via OpenRouter, with model fallbacks. Every data entity is project-scoped, including API keys. Boot recovers crawl runs left in 'queued' state by a previous process: the DB row is the queue's source of truth, so a deploy can't strand jobs. Graceful shutdown drains the crawl and actor queues, closes Playwright, Crawlee, Redis, and the Gameloom pool, and stops heartbeats before exiting. Trace retention is capped at 1 day with auto-purge. ~300 commits over three months across backend, dashboard, and Apify actors.
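A minimal sketch of the per-service circuit breaker and retry-with-jitter ideas mentioned above. The names (`CircuitBreaker`, `breakerFor`, `backoffDelayMs`), the thresholds, and the "full jitter" formula are illustrative assumptions, not Scout's actual code; the AbortController timeout is left out to keep the example synchronous.

```typescript
// Sketch only: names, thresholds, and the jitter formula are assumptions.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0; // consecutive failures since the last success
  private openedAt = 0; // timestamp (ms) when the breaker last opened
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    cooldownMs: number,
    now: () => number = () => Date.now(), // injectable clock for testing
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // True if a call may proceed. Once the cooldown has elapsed, an open
  // breaker moves to half-open and lets a probe request through.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe in half-open reopens the breaker immediately.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}

// One breaker per external service, so failures stay isolated.
const breakers: Record<string, CircuitBreaker> = {};

function breakerFor(service: string): CircuitBreaker {
  let breaker = breakers[service];
  if (!breaker) {
    breaker = new CircuitBreaker(3, 30000); // assumed defaults
    breakers[service] = breaker;
  }
  return breaker;
}

// "Full jitter" backoff: a random delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}
```

A caller would check `breakerFor("firecrawl").canRequest()` before each attempt, report the outcome with `recordSuccess()` or `recordFailure()`, and sleep for `backoffDelayMs(attempt, ...)` between retries. Because each service has its own breaker instance, tripping the Firecrawl breaker leaves the Apify one closed.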
¶ Problem faced
Five scraping stacks, each with its own quirks: Apify is cloud and bills per run, Firecrawl is great for general content but flaky on JS-heavy pages, Crawlee gives you stealth but needs local infra, Playwright is heavy, Scrapling is light but limited. Picking one per source isn't the hard part; the hard part is making them feel like one system. Failures have to be isolated so a Firecrawl outage doesn't take down Apify jobs. Costs have to roll up per job, per project, across providers. AI extraction has to ground on real page content, not hallucinate plausible-looking fields. And queue state has to survive deploys: a crawl marked 'queued' in Postgres must resume rather than be stranded. Multi-project isolation has to be enforced at every layer, not just the dashboard.
¶ How it was solved
Layered backend: thin Fastify controllers → services → repositories → Prisma. Each external service gets its own circuit breaker, plus a shared resilient-call wrapper for timeout/retry/jitter. AI extraction uses content-anchor verification: the LLM gets the actual scraped text and a strict JSON schema, and any field that can't be traced back to the source is nullified. Cost tracking writes a row per LLM call, attributed to the job and project. Crawl and actor queues persist to Postgres; a recovery pass on boot picks up runs left 'queued' by a previous process. Shutdown drains queues with a bounded timeout, then, in parallel, closes Playwright, Crawlee, Redis, and the Gameloom pool and stops heartbeats. Project context is enforced via a Fastify preHandler on every project-scoped route, so isolation is structural rather than a matter of convention. Trade-off: five engines means more surface area to maintain, but there is no single-vendor lock-in and we route work to the cheapest engine that handles the site.
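The content-anchor step can be sketched roughly as follows. This is an illustration under stated assumptions, not Scout's implementation: `verifyAgainstSource`, `normalize`, and the `Extracted` type are hypothetical names, and "traceable to the source" is approximated here by a case- and whitespace-insensitive substring match.

```typescript
// Sketch only: hypothetical names; substring matching stands in for
// whatever tracing rule the real verifier applies.
type Extracted = Record<string, string | number | null>;

// Lowercase and collapse whitespace so layout differences in the scraped
// page do not cause false mismatches.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Returns a copy of `extracted` in which every field whose value cannot be
// found in the scraped text is set to null.
function verifyAgainstSource(sourceText: string, extracted: Extracted): Extracted {
  const haystack = normalize(sourceText);
  const verified: Extracted = {};
  for (const field of Object.keys(extracted)) {
    const value = extracted[field];
    if (value === null || value === undefined) {
      verified[field] = null;
      continue;
    }
    const needle = normalize(String(value));
    const anchored = needle.length > 0 && haystack.indexOf(needle) !== -1;
    verified[field] = anchored ? value : null;
  }
  return verified;
}

// Example: `founded` does not appear in the page text, so it is nullified.
const page = "Acme DMCC  Freezone\nLicense fee: 12,500 AED per year";
const llmOutput: Extracted = {
  name: "Acme DMCC Freezone",
  fee: "12,500 AED",
  founded: 2015,
  contact: null,
};
const checked = verifyAgainstSource(page, llmOutput);
```

Plain substring matching is deliberately naive: a short value such as `5` would match inside `12,500`, so a production check would likely need token boundaries or per-field rules.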