Caller dials a Twilio number. Twilio Programmable Voice opens a bidirectional media stream that Cloudflare Workers proxies to the OpenAI Realtime API endpoint — the entire audio path stays on Cloudflare's edge, which buys back ~28ms median round-trip vs. routing the audio through our origin. gpt-realtime-2 streams audio in (native speech-to-speech, no separate STT call on the happy path) and emits tokens back to the same stream as it generates them. Whisper-large-v3 sits behind a runtime fallback gate for cases where the Realtime audio path returns an empty transcript — accent, heavy background noise, or low-bandwidth callers. Whisper runs on a single g5.xlarge in the customer's AWS account.
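The fallback gate can be sketched as a single predicate: route a turn to Whisper only when the Realtime path heard real audio but produced nothing usable. The interface and threshold below are illustrative assumptions, not the production Worker code.

```typescript
// Hypothetical shape of one turn's result from the Realtime audio path.
interface TurnResult {
  transcript: string;      // what the native speech-to-speech path heard
  audioDurationMs: number; // how much caller audio the turn consumed
}

// Gate: fall back to Whisper-large-v3 only when the caller clearly spoke
// (accent, heavy noise, low bandwidth) but the transcript came back empty.
function needsWhisperFallback(turn: TurnResult): boolean {
  const heardAudio = turn.audioDurationMs > 250; // assumed floor to skip blips
  const emptyTranscript = turn.transcript.trim().length === 0;
  return heardAudio && emptyTranscript;
}
```

Keeping the gate a pure function of the turn result means the happy path never pays for the fallback: the g5.xlarge is only touched when the predicate fires.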
Retrieval is hybrid. pgvector 0.7 on Postgres 16 is the primary store, holding the 8,200 help-center articles chunked to 480 tokens with 80-token overlap. Pinecone serverless holds a mirror of the same corpus, fed by the same ingest pipeline, and takes a 50/50 A/B split in production — both lanes serve real queries, both lanes are measured. Cross-encoder rerank uses BAAI bge-reranker-large, self-hosted on the same g5.xlarge as Whisper (the GPU is idle when Whisper isn't being called). The model receives the top-12 reranked chunks plus the rolling conversation state every turn.
Tool-calling is the part that earned its complexity. The Realtime API supports OpenAI's function-calling JSON schema; we ship three tools on the session: lookup_article, handoff_to_human, and schedule_callback. The agent has zero write tools — it cannot mutate a customer record, it cannot escalate without a human in the loop, it cannot promise a refund. The handoff_to_human schema is the load-bearing one; we walk through it in the code-block section below. PagerDuty wires the warm-transfer leg: when the model calls handoff, the call SID + rolling transcript summary + retrieved chunk IDs go straight onto a PagerDuty incident; the on-call rep picks up with the agent's reasoning already on their screen.
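As a rough sketch of what the session's tool inventory might look like in OpenAI's function-calling JSON schema — the field names inside each tool, especially handoff_to_human, are illustrative assumptions, not the production schema:

```typescript
// Hypothetical session tool declarations. All three tools are read-only or
// escalation-only: no tool here can mutate a customer record.
const tools = [
  {
    type: "function",
    name: "lookup_article",
    description: "Retrieve help-center chunks relevant to the caller's question.",
    parameters: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
  },
  {
    type: "function",
    name: "handoff_to_human",
    description: "Warm-transfer the call to an on-call rep via PagerDuty.",
    parameters: {
      type: "object",
      properties: {
        reason: { type: "string", description: "Why the agent is escalating." },
        transcript_summary: { type: "string" },
        retrieved_chunk_ids: { type: "array", items: { type: "string" } },
      },
      required: ["reason", "transcript_summary"],
    },
  },
  {
    type: "function",
    name: "schedule_callback",
    description: "Offer the caller a callback window instead of holding.",
    parameters: {
      type: "object",
      properties: {
        window_start: { type: "string" },
        window_end: { type: "string" },
      },
      required: ["window_start"],
    },
  },
];
```

The point of the shape: everything the on-call rep needs (reason, summary, chunk IDs) arrives as structured arguments on the handoff call, so the PagerDuty incident can be populated without re-parsing the transcript.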
The architecture diagram below is the production shape. Hover any node for its tool inventory and per-stage latency budget. The streaming dots between the decide step and TTS visualise the real behaviour: gpt-realtime-2 does not wait until it has finished generating before TTS starts speaking — the tokens flow through. That single property is what gets the p95 first-token under 600ms.
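The pass-through behaviour those streaming dots depict can be reduced to a small mapping from Realtime events to Twilio media frames: forward each audio delta the moment it arrives, never buffer until the response completes. The event and frame shapes follow the documented OpenAI Realtime and Twilio Media Streams message formats, but the function and wiring are a sketch, not the production Worker.

```typescript
// Outbound frame shape Twilio Media Streams accepts over the WebSocket.
type TwilioMediaFrame = {
  event: "media";
  streamSid: string;
  media: { payload: string }; // base64-encoded audio
};

// Map one Realtime event to a Twilio frame. Only "response.audio.delta"
// events carry synthesized audio; everything else is dropped (returns null).
function frameForDelta(
  streamSid: string,
  realtimeEvent: { type: string; delta?: string }
): TwilioMediaFrame | null {
  if (realtimeEvent.type !== "response.audio.delta" || !realtimeEvent.delta) {
    return null;
  }
  return { event: "media", streamSid, media: { payload: realtimeEvent.delta } };
}
```

Because each delta becomes a frame immediately, the caller hears the first syllable while the model is still generating the rest of the sentence — which is the whole mechanism behind the sub-600ms p95 first token.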