Haris Ahmed
2026

FileScout — Intelligence Data Platform

Centralized ingestion platform that orchestrates Apify actors, Firecrawl, Crawlee, Playwright and Scrapling behind one unified API. Discovers sources with an LLM ReAct loop, extracts structured records with content-anchored prompts to prevent hallucination, tracks cost per job, and isolates everything by project. Ships with a Next.js admin dashboard.

¶ Overview
Scout is an AI-augmented data ingestion platform. It picks the right scraping engine per source, runs the job, normalizes the output, and exposes the result through a single REST API and a Next.js admin UI. Multi-tenant by project, with cost tracking, circuit breakers, a cron scheduler, and a Supervisor Agent that audits data quality on a loop. Used internally to feed gaming intel, UAE freezone data, and crypto/web3 sources into downstream apps.
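A minimal sketch of what per-source engine selection could look like, assuming each engine is described by a small capability profile and the cheapest capable engine wins. The profile fields, the cost ordering, and the `pickEngine` helper are all hypothetical and only loosely based on the engine trade-offs described on this page; they are not the platform's actual routing rules.

```typescript
// Hypothetical engine profiles. Costs and capability flags are illustrative only.
type EngineName = "apify" | "firecrawl" | "crawlee" | "playwright" | "scrapling";

interface EngineProfile {
  name: EngineName;
  relativeCost: number; // lower is cheaper; an assumed ordering, not measured data
  rendersJs: boolean;   // treated as reliable on JS-heavy pages
  stealth: boolean;     // treated as suitable for bot-protected sites
}

interface SourceNeeds {
  needsJs: boolean;
  needsStealth: boolean;
}

const ENGINES: EngineProfile[] = [
  { name: "scrapling", relativeCost: 1, rendersJs: false, stealth: false },
  { name: "firecrawl", relativeCost: 2, rendersJs: false, stealth: false },
  { name: "crawlee", relativeCost: 3, rendersJs: true, stealth: true },
  { name: "playwright", relativeCost: 4, rendersJs: true, stealth: false },
  { name: "apify", relativeCost: 5, rendersJs: true, stealth: true },
];

// Returns the cheapest engine that satisfies the source's needs,
// skipping engines that are currently unavailable (e.g. breaker open).
function pickEngine(
  needs: SourceNeeds,
  unavailable: ReadonlySet<EngineName> = new Set<EngineName>(),
): EngineProfile | undefined {
  return ENGINES
    .filter((e) => !unavailable.has(e.name))
    .filter((e) => (!needs.needsJs || e.rendersJs) && (!needs.needsStealth || e.stealth))
    .sort((a, b) => a.relativeCost - b.relativeCost)[0];
}
```

Passing the set of unavailable engines in from outside keeps routing a pure function, so an outage in one provider only changes which engine is picked rather than failing the job.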
¶ Important info
Five scraping engines (Apify, Firecrawl, Crawlee, Playwright, Scrapling) sit behind a common interface with per-service circuit breakers, AbortController timeouts, and retry-with-jitter. AI discovery and extraction run on DeepSeek V3 via OpenRouter, with model fallbacks. Every data entity is project-scoped, including API keys. Boot recovers crawl runs left in 'queued' state by a previous process: the DB row is the queue's source of truth, so a deploy can't strand jobs. Graceful shutdown drains the crawl and actor queues, closes Playwright, Crawlee, Redis, and the Gameloom pool, and stops heartbeats before exiting. Trace retention is capped at 1 day with auto-purge. ~300 commits over three months across backend, dashboard, and Apify actors.
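The resilience pieces named above (circuit breaker, AbortController timeout, retry with jitter) could be combined roughly as follows. This is a sketch under assumptions: the class and function names, the thresholds, and the full-jitter backoff formula are illustrative choices, not the project's actual code or settings.

```typescript
// Minimal sketch. Names, thresholds, and the full-jitter formula are assumptions.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 3, cooldownMs = 30_000, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Open after N consecutive failures; half-open once the cooldown has elapsed.
  get state(): BreakerState {
    if (this.failures < this.failureThreshold) return "closed";
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  recordSuccess(): void {
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }
}

// Full jitter: a random delay in [0, min(cap, base * 2^attempt)).
function backoffWithJitter(
  attempt: number,
  baseMs = 200,
  capMs = 5_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Runs `task` with a timeout, retries, and breaker bookkeeping.
// The timeout only takes effect if `task` honours the AbortSignal it receives.
async function resilientCall<T>(
  breaker: CircuitBreaker,
  task: (signal: AbortSignal) => Promise<T>,
  opts: { timeoutMs?: number; retries?: number } = {},
): Promise<T> {
  const { timeoutMs = 10_000, retries = 2 } = opts;
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    if (breaker.state === "open") throw new Error("circuit open");
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const result = await task(controller.signal);
      breaker.recordSuccess();
      return result;
    } catch (err) {
      breaker.recordFailure();
      lastError = err;
    } finally {
      clearTimeout(timer);
    }
    if (attempt < retries) {
      await new Promise((resolve) => setTimeout(resolve, backoffWithJitter(attempt)));
    }
  }
  throw lastError;
}
```

With one `CircuitBreaker` instance per external service, repeated failures in one provider trip only that provider's breaker, which is the isolation property the paragraph above describes.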
¶ Problem faced
Five scraping stacks, each with its own quirks: Apify is cloud and bills per run, Firecrawl is great for general content but flaky on JS-heavy pages, Crawlee gives you stealth but needs local infra, Playwright is heavy, Scrapling is light but limited. Picking one per source isn't the hard part; the hard part is making them feel like one system. Failures have to be isolated so a Firecrawl outage doesn't take down Apify jobs. Costs have to roll up per job, per project, across providers. AI extraction has to ground on real page content, not hallucinate plausible-looking fields. And queue state has to survive deploys: a crawl marked 'queued' in Postgres must resume, not strand. Multi-project isolation has to be enforced at every layer, not just the dashboard.
¶ How it was solved
Layered backend: thin Fastify controllers → services → repositories → Prisma. Each external service gets its own circuit breaker, plus a shared resilient-call wrapper for timeout/retry/jitter. AI extraction uses content-anchor verification: the LLM gets the actual scraped text and a strict JSON schema, and any field that can't be traced back to the source is nullified. Cost tracking writes a row per LLM call, attributed to the job and project. Crawl and actor queues persist to Postgres; a recovery pass on boot picks up runs left 'queued' by a previous process. Shutdown drains queues with a bounded timeout, then closes Playwright, Crawlee, Redis, the Gameloom pool, and heartbeats in parallel. Project context is enforced via a Fastify preHandler on every project-scoped route, so isolation is structural, not convention. Trade-off: five engines means more surface area to maintain, but no single-vendor lock-in, and we route work to the cheapest engine that handles the site.
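To make the content-anchor step concrete, here is a minimal sketch of nullifying fields that cannot be traced back to the scraped text. It assumes a flat record and a simple rule: a value counts as anchored if it appears in the source after case and whitespace normalization. The matching rule, the `Extracted` type, and the `anchorToSource` helper are assumptions for illustration; the platform's real verification logic is not shown on this page.

```typescript
// Hypothetical flat record shape produced by the LLM extraction step.
type Extracted = Record<string, string | number | null>;

// Lowercase and collapse whitespace so layout differences don't break matching.
const normalize = (s: string): string => s.toLowerCase().replace(/\s+/g, " ").trim();

// Keeps a field only if its value can be found in the scraped text;
// anything that cannot be traced back to the source becomes null.
function anchorToSource(record: Extracted, sourceText: string): Extracted {
  const haystack = normalize(sourceText);
  const anchored: Extracted = {};
  for (const [field, value] of Object.entries(record)) {
    if (value === null) {
      anchored[field] = null;
      continue;
    }
    const needle = normalize(String(value));
    anchored[field] = needle.length > 0 && haystack.includes(needle) ? value : null;
  }
  return anchored;
}

// Usage with made-up data: "ceo" is not in the page text, so it is nullified.
const page = "Acme Games   FZ-LLC was founded in 2019 in Dubai.";
const checked = anchorToSource(
  { name: "Acme Games FZ-LLC", founded: 2019, ceo: "Jane Doe" },
  page,
);
```

A substring check like this is deliberately strict: it will also drop correct values that the model paraphrased, which trades some recall for not storing plausible-looking but unsupported fields.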
¶ Stack
  • TypeScript
  • Fastify
  • Node.js
  • Prisma
  • PostgreSQL
  • Next.js
  • Apify
  • Firecrawl
  • Zod
  • Redis
  • Railway