¶ Overview
Scout is an AI-augmented data ingestion platform that selects the most suitable scraping engine per source, executes jobs, normalizes output, and exposes results through a unified REST API and a Next.js admin UI. It is multi-tenant by design and includes cost tracking, circuit breakers, a cron scheduler, and a Supervisor Agent that continuously audits data quality. It is used internally to feed gaming intelligence, UAE freezone data, and crypto/web3 sources into downstream applications.
¶ Important info
Five scraping engines (Apify, Firecrawl, Crawlee, Playwright, and Scrapling) sit behind a common interface with per-service circuit breakers, AbortController timeouts, and retry-with-jitter logic. AI discovery and extraction run on DeepSeek V3 via OpenRouter with model fallbacks. Every data entity, including API keys, is scoped to a project. On boot, the system recovers crawl runs left in a queued state by a previous process, treating the database row as the authoritative queue source so that deploys cannot strand jobs. Graceful shutdown drains the crawl and actor queues, closes Playwright, Crawlee, and Redis, and halts heartbeats before exit. Trace retention is capped at 24 hours with automatic purge.
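The combination of a per-service circuit breaker, an AbortController timeout, and retry with jitter can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Scout's actual code: the names `CircuitBreaker`, `resilientCall`, and `jitteredDelay`, the thresholds, and the full-jitter backoff formula are all hypothetical.

```typescript
// Hypothetical sketch: every name and default below is an illustrative assumption.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 3, cooldownMs = 30_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, useful for tests
  }

  /** True if a call may proceed; moves open -> half-open once the cooldown has elapsed. */
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  /** Opens the breaker on a failed half-open probe or once the threshold is reached. */
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}

interface ResilientOptions {
  timeoutMs: number;
  maxAttempts: number;
  baseDelayMs: number;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/** Full-jitter exponential backoff: a random delay in [0, baseDelayMs * 2^attempt). */
function jitteredDelay(baseDelayMs: number, attempt: number): number {
  return Math.random() * baseDelayMs * 2 ** attempt;
}

/** Runs fn behind the breaker. The timeout only takes effect if fn honors the AbortSignal. */
async function resilientCall<T>(
  breaker: CircuitBreaker,
  fn: (signal: AbortSignal) => Promise<T>,
  opts: ResilientOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    if (!breaker.canRequest()) throw new Error("circuit open");
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), opts.timeoutMs);
    try {
      const result = await fn(controller.signal);
      breaker.recordSuccess();
      return result;
    } catch (err) {
      lastError = err;
      breaker.recordFailure();
    } finally {
      clearTimeout(timer);
    }
    // Back off before the next attempt, but not after the final one.
    if (attempt < opts.maxAttempts - 1) await sleep(jitteredDelay(opts.baseDelayMs, attempt));
  }
  throw lastError;
}
```

In this sketch each external service would hold its own `CircuitBreaker` instance, so repeated failures in one provider open only that provider's breaker while calls to the others proceed normally.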
¶ Problem faced
The platform spans five scraping stacks, each with distinct characteristics: Apify runs in the cloud and bills per execution; Firecrawl handles general content well but struggles with JavaScript-heavy pages; Crawlee provides stealth crawling but requires local infrastructure; Playwright is powerful but resource-intensive; Scrapling is lightweight but limited in scope. Selecting the right engine per source is not the challenge. The challenge is making them behave as a single cohesive system. Failures must be isolated so that a Firecrawl outage cannot cascade into Apify jobs. Costs must aggregate per job and per project across all providers. AI extraction must be grounded in real page content rather than generating plausible-looking but fabricated fields. Queue state must survive deploys, so that any run marked queued in Postgres resumes rather than stalls. Multi-project isolation must be enforced at every layer, not just at the dashboard level.
¶ How it was solved
The backend is layered: thin Fastify controllers call services, which call repositories backed by Prisma. Every external service has a dedicated circuit breaker, backed by a shared resilient-call wrapper that handles timeouts, retries, and jitter. AI extraction uses content-anchor verification: the LLM receives the actual scraped text alongside a strict JSON schema, and any field that cannot be traced back to the source is set to null. Cost tracking writes one row per LLM call, attributed to the job and project. Crawl and actor queues persist to Postgres, with a boot-time recovery pass that picks up runs left queued by a previous process. Shutdown drains the queues within a bounded timeout, then closes Playwright, Crawlee, Redis, the Gameloom pool, and heartbeats in parallel. Project context is enforced via a Fastify preHandler on every project-scoped route, making isolation structural rather than conventional. The tradeoff of running five engines is a larger surface area, offset by the absence of vendor lock-in and the ability to route each job to the most cost-effective engine that can handle the target site.
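The content-anchor verification step can be illustrated with a small sketch. It assumes the simplest possible matching rule, a case-insensitive and whitespace-normalized substring check, which may well differ from the rule Scout actually applies. The function name `anchorFields`, the `Extracted` type, and the sample page and fields are hypothetical and made up for illustration.

```typescript
// Hypothetical sketch: anchorFields and its matching rule are illustrative assumptions.
type Extracted = Record<string, string | number | null>;

/** Lowercase and collapse whitespace so formatting differences do not break matching. */
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

/** Keeps a field only if its normalized value occurs in the normalized source text. */
function anchorFields(extracted: Extracted, sourceText: string): Extracted {
  const haystack = normalize(sourceText);
  const result: Extracted = {};
  for (const [key, value] of Object.entries(extracted)) {
    if (value === null) {
      result[key] = null;
      continue;
    }
    const needle = normalize(String(value));
    // A value that cannot be found in the scraped text is replaced with null.
    result[key] = needle.length > 0 && haystack.includes(needle) ? value : null;
  }
  return result;
}

// Made-up sample data for illustration only.
const page = "Nova Arena   launches on 12 March 2025. Price: 29 USD.";
const llmOutput: Extracted = {
  title: "Nova Arena",
  releaseDate: "12 March 2025",
  price: 29,
  publisher: "Orbit Games", // does not appear anywhere in the page text
};

const verified = anchorFields(llmOutput, page);
console.log(verified);
```

Here `publisher` comes back as `null` because the string never occurs in the page, while the other three fields survive. A plain substring check is deliberately crude: short numeric values can match by accident inside longer numbers, so a production rule would likely need stricter, type-aware matching.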