The architecture below is the production shape. The audio capture path is owned by the Flutter widget: on-device VAD via a Silero-style filter, a mic-permission flow that respects iOS AVAudioSession backgrounding, and cellular-aware bitrate control with a fallback to chunked HTTP if the WebRTC handshake degrades twice. We did not build a custom STT model; the gpt-realtime-2 endpoint accepts raw audio frames directly over the WebRTC PeerConnection. Whisper-large-v3 exists in the stack only as the HTTP fallback when WebRTC fails, which it does about 1.4% of the time on US cellular, mostly on subway-tunnel transitions.
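A minimal sketch of that fallback policy, in TypeScript for illustration (the production path lives in the Flutter widget); `TransportPolicy` and its method names are hypothetical:

```ts
// Sketch of the two-strikes transport fallback described above.
type TransportMode = "webrtc" | "chunked-http";

class TransportPolicy {
  private degradedHandshakes = 0;
  private mode: TransportMode = "webrtc";

  // Called when a WebRTC handshake times out or renegotiates downward.
  onHandshakeDegraded(): TransportMode {
    this.degradedHandshakes += 1;
    // Two degraded handshakes in one session: stop fighting the network
    // and fall back to chunked HTTP (Whisper-large-v3 on the server side).
    if (this.degradedHandshakes >= 2) {
      this.mode = "chunked-http";
    }
    return this.mode;
  }
}
```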
The signalling and ephemeral-key path runs on Cloudflare Workers at the edge. The Flutter app hits a Worker endpoint at session start; the Worker checks the user's anonymous device id, mints a short-TTL ephemeral token against the OpenAI Realtime API, and returns it to the Flutter client. The token is never persisted on the device, never logged, and rotates per session. This was non-negotiable for the retailer's security team: no OpenAI secret ships in the Flutter binary at any point, and no long-lived token sits in client storage. The Cloudflare Worker path is BAA-eligible if the retailer wants to scope it under HIPAA later (they don't need it today, but the security review asked).
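A sketch of the minting Worker, assuming the OpenAI `/v1/realtime/sessions` shape for ephemeral client secrets; the device-id check and route are simplified placeholders, not the production code:

```ts
// Ephemeral-key Worker: the OpenAI secret lives only in the Worker binding.
export interface Env {
  OPENAI_API_KEY: string; // bound as a Worker secret, never shipped to Flutter
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Anonymous device id check, simplified to a presence test here.
    const deviceId = request.headers.get("x-device-id");
    if (!deviceId) return new Response("missing device id", { status: 400 });

    // Mint a short-TTL client token against the Realtime API.
    const upstream = await fetch("https://api.openai.com/v1/realtime/sessions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "gpt-realtime-2" }),
    });
    if (!upstream.ok) return new Response("upstream error", { status: 502 });

    const session = (await upstream.json()) as { client_secret: { value: string } };
    // Return only the ephemeral secret; nothing is persisted or logged.
    return Response.json({ token: session.client_secret.value });
  },
};
```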
Function-calling is where the integration cost was lowest. The retailer's existing Algolia index is three years old and tuned by their merchandising team: synonyms, redirect rules, faceted boost configurations, the works. We did not rebuild it. The Realtime API function-call surface gets four read tools: `search_catalog`, `narrow_facets`, `cart_status`, `account_summary`. Each tool is a thin Cloudflare Worker that proxies into the retailer's existing internal API; nothing in the catalog data pipeline changed. When the model calls `search_catalog`, the result streams back into the model context for narration, and the Flutter widget receives a parallel callback on `onSuggestions`, so the product grid re-renders without waiting for the model's spoken response.
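One of the tool proxies, sketched as a Worker; the internal search URL, argument shape, and facet encoding are illustrative, not the retailer's actual API:

```ts
// `search_catalog` as a thin proxy: reshape the function-call arguments
// into the existing Algolia-backed internal API, touch nothing else.
interface SearchCatalogArgs {
  query: string;
  facets?: Record<string, string>;
}

export default {
  async fetch(request: Request): Promise<Response> {
    const args = (await request.json()) as SearchCatalogArgs;

    // Forward to the retailer's existing internal search API (placeholder URL).
    const url = new URL("https://internal.retailer.example/v1/search");
    url.searchParams.set("q", args.query);
    for (const [facet, value] of Object.entries(args.facets ?? {})) {
      url.searchParams.append("facet", `${facet}:${value}`);
    }

    // The same payload streams into the model context for narration and
    // fires the widget's onSuggestions callback in parallel.
    const results = await fetch(url).then((r) => r.json());
    return Response.json(results);
  },
};
```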
The audio output path streams the model's voice response back over the same WebRTC channel. We did not add a separate TTS provider; gpt-realtime-2's native voice quality clears the bar the product team set for the apparel cohort (the brand voice is intentionally conversational, not corporate). Barge-in is handled in the widget: if the user taps the button again mid-response, the widget sends a `response.cancel` event over the WebRTC data channel and the audio pipeline flushes cleanly. The transcript chip overlay shows the model's last words at the cancel boundary, so the user knows what they interrupted.
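The barge-in path, sketched at the protocol level in TypeScript; the production handler is Dart, but the `response.cancel` event shape is the same, and `flushPlayback` stands in for the widget's audio-drain step:

```ts
// Barge-in over the Realtime data channel: cancel the in-flight response,
// then drain local playback so audio stops at the cancel boundary.
function bargeIn(dc: RTCDataChannel, flushPlayback: () => void): void {
  // Tell the model to stop generating the current response.
  dc.send(JSON.stringify({ type: "response.cancel" }));
  // Drop buffered audio locally; the transcript chip keeps the last
  // narrated words so the user sees what they interrupted.
  flushPlayback();
}
```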