¶ Overview
Production deployment on Railway and Supabase with 23 edge functions and a persona library of over 50 roles. Dual-mode AI architecture: development runs against Google AI Studio using API-key tokens with approximately 3 concurrent sessions; production runs against Vertex AI using OAuth2 service-account tokens supporting 1,000 or more concurrent sessions. The frontend is mode-agnostic because the token endpoint returns a pre-built WebSocket URL. The standout engineering detail is the multi-persona session model: three distinct voices, accents, and personalities all served from a single Gemini Live connection, with handoffs driven by bracket-prefixed persona names in the AI stream and a cap of three consecutive turns per persona. Supporting systems include real-time detection of hedging, filler, and vague claims that fires coaching hints mid-session; per-persona frustration tracking on a scale of -100 to +100 that escalates on weak answers; an AudioWorklet fallback that keeps the microphone working on iOS Safari; Stripe subscriptions with plan-based feature gating; GDPR-compliant account deletion and data export; and VirusTotal scanning of uploaded preparation documents.
¶ Important info
Production app on Railway + Supabase with 23 edge functions and a 50+ role persona library. Dual-mode AI: dev runs against Google AI Studio (API-key tokens, ~3 concurrent sessions); prod runs against Vertex AI (OAuth2 service-account tokens, 1,000+ concurrent sessions), and the frontend is mode-agnostic because the token endpoint returns a pre-built WebSocket URL. The standout detail is the multi-persona session model: three distinct voices, accents, and personalities all served from a single Gemini Live connection, with handoffs driven by `[PersonaName]` bracket prefixes in the AI stream and a turn-routing cap of three consecutive turns per persona. Supporting systems include real-time hedging/filler/vague-claim detection that fires coaching hints mid-session, per-persona frustration tracking (-100..+100) that escalates on weak answers, an AudioWorklet fallback so the mic works on iOS Safari, Stripe subscriptions with plan-based feature gates, GDPR account deletion + data export, and VirusTotal scanning on uploaded prep documents.
¶ Problem faced
Gemini Live provides only one voice per WebSocket connection. The product requires three personas in the same room with different voices, accents, and temperaments, capable of speaking to each other and to the user, with full interrupt support in both directions. Opening three parallel sessions wastes quota, fragments the conversation, and forces the client to route audio across three separate contexts. Naively closing and reconnecting on every persona handoff drops the conversational state the next persona needs to push back coherently. On top of that, the platform must run across two cleanly separated AI backends, Google AI Studio in development and Vertex AI in production, each with an entirely different authentication model, without exposing that distinction to the browser.
¶ How it was solved
The session keeps one WebSocket connection open at a time and switches voices via Gemini session-resumption tokens. On a handoff, the orchestrator closes the socket with custom code 4010 and reconnects with the new voice and the resumption token, so the incoming persona resumes the same conversational context within one to three seconds. Turn routing caps consecutive turns per persona at three and uses a bracket-name regex on the model stream to detect handoffs. If resumption fails, a fallback opens a fresh connection and replays the transcript to preserve context. The dual-mode auth split is abstracted behind a single getAIConfig() helper and a gemini-token edge function that returns an access token, WebSocket endpoint, and model path, so the browser has no awareness of which backend it is communicating with. The key tradeoff is that every voice switch carries this perceptible reconnect latency, so the engine favors longer turns to minimize switching frequency. The bracket-prefix handoff is a prompt-engineering contract rather than a structured field, so it is defended with a regex guard and a round-robin fallback if the model fails to emit the expected prefix.
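The turn-routing rules can be illustrated with a minimal sketch. It is not the project's actual code: the names `routeTurn`, `parseHandoff`, `nextInRotation`, and `RouterState` are hypothetical, the regex is one plausible guard, and it assumes a roster of at least two personas. It also assumes that "round-robin fallback" means moving to the next persona in roster order when no valid prefix is found.

```typescript
type Persona = string;

// Cap stated in the text: at most three consecutive turns per persona.
const MAX_CONSECUTIVE_TURNS = 3;

// Assumed guard: a leading "[Name]" prefix, e.g. "[Dana] Let's talk numbers."
const HANDOFF_PREFIX = /^\s*\[([^\[\]\n]{1,40})\]/;

interface RouterState {
  current: Persona;
  consecutiveTurns: number;
}

/** Returns the roster persona named in the bracket prefix, or null if absent or unknown. */
function parseHandoff(text: string, roster: readonly Persona[]): Persona | null {
  const match = HANDOFF_PREFIX.exec(text);
  if (!match) return null;
  const name = (match[1] ?? "").trim().toLowerCase();
  return roster.find((p) => p.toLowerCase() === name) ?? null;
}

/** Round-robin: the persona that follows `current` in roster order. */
function nextInRotation(current: Persona, roster: readonly Persona[]): Persona {
  const i = roster.indexOf(current);
  return roster[(i + 1) % roster.length] ?? current;
}

/** Decides who speaks next and updates the consecutive-turn counter. */
function routeTurn(
  state: RouterState,
  modelText: string,
  roster: readonly Persona[],
): RouterState {
  // Missing or unrecognized prefix: fall back to round-robin (assumption).
  let speaker = parseHandoff(modelText, roster) ?? nextInRotation(state.current, roster);

  // Enforce the cap: a persona that has used its three turns must yield.
  if (speaker === state.current && state.consecutiveTurns >= MAX_CONSECUTIVE_TURNS) {
    speaker = nextInRotation(state.current, roster);
  }

  const consecutiveTurns = speaker === state.current ? state.consecutiveTurns + 1 : 1;
  return { current: speaker, consecutiveTurns };
}
```

In this sketch a change in `current` between the old and new state is the signal that a voice switch, and therefore a reconnect, is needed. Keeping the router a pure function of state and text makes the prompt contract easy to unit-test without a live connection.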
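The dual-mode abstraction might look roughly like the following sketch. Only the `getAIConfig()` name and the three returned fields (access token, WebSocket endpoint, model path) come from the description above. Everything else is an assumption for illustration: the `AIEnv` shape, the `buildAIConfig` and `TokenMinter` helpers, the placeholder hosts (which are not the real Google endpoints), and the choice to carry the token as an `access_token` query parameter. The model-path formats follow the usual AI Studio and Vertex AI resource-name conventions and should be checked against the current documentation.

```typescript
type AIMode = "studio" | "vertex";

// Hypothetical environment shape; the real helper may read env vars directly.
interface AIEnv {
  mode: AIMode;
  model: string;
  project?: string;  // needed for Vertex only
  location?: string; // needed for Vertex only
}

// The three fields the text says the token endpoint returns.
interface AIConfig {
  accessToken: string;
  wsUrl: string;
  modelPath: string;
}

// Placeholder hosts for illustration, NOT the real Google endpoints.
const WS_BASE: Record<AIMode, string> = {
  studio: "wss://studio.example.invalid/live",
  vertex: "wss://vertex.example.invalid/live",
};

/** Pure builder: turns a mode plus an already-minted token into a client-ready config. */
function buildAIConfig(env: AIEnv, accessToken: string): AIConfig {
  let modelPath: string;
  if (env.mode === "vertex") {
    if (!env.project || !env.location) {
      throw new Error("vertex mode requires project and location");
    }
    modelPath =
      `projects/${env.project}/locations/${env.location}` +
      `/publishers/google/models/${env.model}`;
  } else {
    modelPath = `models/${env.model}`;
  }

  // Assumption: the token travels in the query string of the pre-built URL.
  const url = new URL(WS_BASE[env.mode]);
  url.searchParams.set("access_token", accessToken);

  return { accessToken, wsUrl: url.toString(), modelPath };
}

// Hypothetical token source: API-key-derived token in dev, OAuth2 access token in prod.
type TokenMinter = (mode: AIMode) => Promise<string>;

/** Mints a token for the active mode and returns the mode-agnostic config. */
async function getAIConfig(env: AIEnv, mintToken: TokenMinter): Promise<AIConfig> {
  const accessToken = await mintToken(env.mode);
  return buildAIConfig(env, accessToken);
}
```

Because the response has the same shape in both modes, the browser only needs to open `wsUrl` and reference `modelPath` in its setup message, and it never branches on which backend is active.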