Low-Latency LLM Orchestration: Real-Time Adversarial Personas in Hard Talk
How I engineered Hard Talk — an AI rehearsal platform where adversarial LLM personas interrupt and challenge users in real time over WebSockets. The latency problems nobody warns you about, and how I solved them.
A chatbot can take three seconds to answer and nobody minds. A simulated boss who's supposed to interrupt you mid-sentence cannot. Hard Talk is an AI rehearsal platform for high-stakes corporate conversations — salary negotiations, performance reviews, investor pushback — where adversarial personas challenge the user's arguments in real time. The entire value proposition lives or dies on latency. A persona that pauses politely to "think" breaks the illusion instantly.
That latency requirement is what made Hard Talk a genuine LLM orchestration problem rather than a prompt problem.
The constraint that drove every decision
Real conversation has a turn-taking rhythm of roughly 200–500ms. The moment an LLM persona blows past that, the user feels they're talking to software, and the rehearsal stops being useful. So the design goal wasn't "good answers" — it was good answers fast enough to feel adversarial.
Built on Python, FastAPI, WebSockets, and React/TypeScript, the system had to treat latency as a first-class architectural concern, not a tuning afterthought.
Architecture
- WebSockets, not request/response. Every HTTP request/response turn carries its own header overhead (plus a fresh connection handshake whenever the connection isn't reused), and plain request/response can't push a persona's interruption to the user unprompted. A persistent WebSocket connection lets the server initiate: the persona interrupts you, not the other way around.
- Token streaming end-to-end. The persona's response streams token-by-token to the UI the instant generation starts. Perceived latency is time-to-first-token, not time-to-completion — streaming collapses the felt delay dramatically even when total generation time is unchanged.
- Orchestration layer for persona state. Each persona carries its own behavioral profile (how aggressive, when to interrupt, what to attack). The orchestration layer decides when a persona should cut in based on conversation state — that's the difference between a turn-based chatbot and something that feels live.
- Backpressure handling. When a user types fast or generation lags, the system has to decide whether to queue, drop, or preempt. Getting this wrong produces either dropped interruptions or a pile-up of stale persona turns.
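To make the streaming and backpressure points concrete, here is a minimal sketch of the "preempt" option using plain asyncio. It is not Hard Talk's actual code: `fake_persona_stream`, `PersonaSession`, and the `send` callable are hypothetical stand-ins for a streaming model client and a WebSocket's send method. The idea is that each session holds at most one in-flight persona turn, and a newer user message cancels the stale turn instead of queueing behind it.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Optional

Send = Callable[[str], Awaitable[None]]


async def fake_persona_stream(prompt: str, delay: float = 0.01) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM call; yields one token at a time."""
    for token in f"Why should I accept '{prompt}'?".split():
        await asyncio.sleep(delay)  # simulated per-token generation time
        yield token


class PersonaSession:
    """Holds at most one in-flight persona turn; a newer user message preempts it."""

    def __init__(self, send: Send) -> None:
        self._send = send  # assumed to wrap something like a WebSocket text send
        self._turn: Optional[asyncio.Task] = None

    async def on_user_message(self, text: str) -> None:
        # Preempt rather than queue: cancel the stale turn and wait for it to unwind.
        if self._turn is not None and not self._turn.done():
            self._turn.cancel()
            await asyncio.gather(self._turn, return_exceptions=True)
        self._turn = asyncio.create_task(self._stream_turn(text))

    async def _stream_turn(self, text: str) -> None:
        async for token in fake_persona_stream(text):
            await self._send(token)  # forward each token as soon as it exists
        await self._send("[end]")  # marker so the client knows the turn finished

    async def wait_idle(self) -> None:
        if self._turn is not None:
            await asyncio.gather(self._turn, return_exceptions=True)


async def demo_preemption() -> list[str]:
    sent: list[str] = []

    async def send(token: str) -> None:
        sent.append(token)

    session = PersonaSession(send)
    await session.on_user_message("a 20% raise")
    await asyncio.sleep(0.025)  # let the first turn stream a couple of tokens
    await session.on_user_message("a 10% raise")  # preempts the first turn
    await session.wait_idle()
    return sent


if __name__ == "__main__":
    print(asyncio.run(demo_preemption()))
```

Preemption suits an interrupting persona because a stale reply is worse than no reply. Queueing or dropping would be a different policy in `on_user_message`, with the rest of the structure unchanged.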
The latency problems nobody warns you about
- Time-to-first-token dominates UX, not tokens-per-second. Optimizing total throughput while ignoring TTFT is optimizing the wrong number.
- The WebSocket is a stateful liability. Connections drop on flaky networks mid-conversation. You need reconnect logic that restores conversation state, or the rehearsal resets and the user rage-quits.
- Concurrency is where it actually breaks. One persona streaming is easy. Multiple concurrent sessions, each holding a live model stream, are where naive implementations fall over. The orchestration layer has to manage these streams as bounded, cancellable resources.
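One way to read "bounded, cancellable resources" is sketched below, again with plain asyncio and under stated assumptions: `fake_model_stream`, `StreamPool`, and the `MAX_LIVE_STREAMS` value are illustrative names, not part of Hard Talk or any library. A semaphore caps how many model streams are live at once, a `finally` block guarantees the bookkeeping is undone when a session is cancelled, and time-to-first-token is measured from the moment the request arrives, so time spent waiting for a slot counts against it.

```python
import asyncio
import time
from typing import AsyncIterator

MAX_LIVE_STREAMS = 2  # assumed cap; tune to what the model backend tolerates


async def fake_model_stream(n_tokens: int = 3, delay: float = 0.01) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming model client."""
    for i in range(n_tokens):
        await asyncio.sleep(delay)  # simulated per-token generation time
        yield f"tok{i}"


class StreamPool:
    """Caps concurrent model streams and records time-to-first-token per session."""

    def __init__(self, limit: int) -> None:
        self._slots = asyncio.Semaphore(limit)
        self.live = 0  # streams currently holding a slot
        self.peak = 0  # highest value `live` ever reached
        self.ttft: dict[str, float] = {}  # session id -> seconds to first token

    async def run(self, session_id: str) -> list[str]:
        # Start the clock before waiting for a slot: queueing delay is felt by the user too.
        started = time.perf_counter()
        async with self._slots:
            self.live += 1
            self.peak = max(self.peak, self.live)
            tokens: list[str] = []
            try:
                async for token in fake_model_stream():
                    if not tokens:
                        self.ttft[session_id] = time.perf_counter() - started
                    tokens.append(token)
                return tokens
            finally:
                # Runs on cancellation as well, so the live count stays accurate;
                # leaving the `async with` block then releases the slot.
                self.live -= 1


async def demo_pool():
    pool = StreamPool(MAX_LIVE_STREAMS)  # created inside the running event loop
    tasks = [asyncio.create_task(pool.run(f"s{i}")) for i in range(5)]
    await asyncio.sleep(0.015)
    tasks[0].cancel()  # e.g. the user disconnected mid-stream
    await asyncio.gather(*tasks, return_exceptions=True)
    return pool, tasks


if __name__ == "__main__":
    pool, _ = asyncio.run(demo_pool())
    print(f"peak live streams: {pool.peak}")
    print({sid: round(seconds, 3) for sid, seconds in pool.ttft.items()})
```

In this sketch, sessions that had to wait for a slot show a visibly higher TTFT than the first two, which is the number to watch when choosing the cap.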
What this demonstrates
Real-time AI is a distributed-systems problem wearing an LLM costume. The model is the easy part. The engineering is in the transport (WebSockets), the perceived-latency strategy (streaming + TTFT), and the orchestration that makes the system feel alive under concurrency.
I'm Haris Ahmed, an AI engineer and full-stack software engineer specializing in production LLM systems and real-time AI. More of my work is at harisahmed.dev.