B2B SaaS · tier-1 support · realtime voice agent · function-calling
gpt-realtime-2 · role: primary speech-to-speech · streaming tokens
Whisper-large-v3 · role: STT fallback · accent + noise
pgvector 0.7 · role: help-center RAG · 8,200 articles
Twilio Voice · role: telephony ingress + egress
Cloudflare Workers · role: edge audio transport · ~28ms median RTT
ElevenLabs Turbo v2.5 · role: TTS fallback · voice-clone for handoff continuity
case study · 2026 · anonymized

How we shipped a sub-600ms voice agent
at $0.10 per call.

A US-based mid-market B2B SaaS support team's tier-1 voice queue was averaging four-minute waits at peak. Five inbound questions accounted for 62% of call volume. Their existing IVR was bouncing 80%+ of calls to a human. We built a gpt-realtime-2 voice agent over their help-center corpus — Twilio in, streaming tokens to TTS, confidence-gated handoff to a live agent — eval-first, with a kill point in week 5 that we ended up using.

≈ 38%
tier-1 deflection (95% CI 33%–43% · n=11,400 calls)
p95 580ms
first-token latency · SLO 700ms
$0.10
all-in per-deflected-call · vs $4 live-agent loaded cost
10 wks
discovery + 6-wk shadow + 4-wk prod cut
shipped
10 weeks · 3 engineers · 1 support lead
4 min
average tier-1 voice queue wait at peak
62%
of inbound call volume tied to the same 5 questions
80%+
IVR bounce rate to a human (existing tree)
700ms
first-token ceiling before callers hear 'robot'
the problem

A tier-1 voice queue
that wasn't worth a human.

The five most-asked questions were 62% of inbound volume. Live agents were burning their day on the same calls. Wait times still climbed at peak.

The client is a US-based mid-market B2B SaaS company — north of $80M ARR, a tier-1 support team of fourteen reps spread across two timezones, and an inbound voice queue handling roughly 41,000 calls a month. The product is enough of a workhorse that customers call about real issues; it's also enough of a workhorse that 62% of those calls are the same five questions on a rolling 90-day window. Their existing IVR was the press-1-for-billing tree everyone hates, with an 80%+ bounce rate straight to a queue with a four-minute peak wait. The support lead had already run the time-and-motion study — tier-1 reps were spending 71% of their day on the questions any of them could answer in their sleep.

today vs · with the agent

today · tier-1 voice queue
Caller dials → IVR tree (press 1 for…) → hold, 4 min at peak → live agent answering the same 5 questions
outcome: long wait · agents burned on repetitive tier-1 calls

with the agent
Caller dials → Twilio + edge audio (<60ms ingress) → gpt-realtime-2 + RAG (streaming · confidence-gated) → decision branch (answer · or handoff_to_human)
outcomes: resolved ≈ 38% · handoff to human (live transfer with transcript) · failsafe queue (model self-refuses)

The binding constraint was latency. The support lead and the head of CX were not romantic about AI; they had piloted two text-channel chatbots in the previous year and shelved both of them when CSAT dipped. What changed was a specific number: when a synchronous voice caller hears more than ~700ms of dead air after they finish speaking, US callers reliably report the experience as "robotic" — the line between a slow human and a fast bot. Anything past that ceiling and the deflected calls don't actually deflect; they bounce to a human angrier than when they arrived, and the unit economics invert.

So the scoping conversation we walked into wasn't "should we ship voice AI." It was: show us how a voice agent could miss the 700ms first-token ceiling, and tell us how you'd catch it before a customer hears it. That framing decided the engagement. The deliverable was a function-calling voice agent with a confidence-gated handoff tool, retrieval grounded in their existing help-center corpus, and a frozen eval set that gated every release. Nothing about it was going to be invisible — every flagged conversation was reviewable in Langfuse before the support lead would sign off on a cutover.

The rest of this page is what we shipped, what we measured, and where it broke in week 5.

the approach

Six pipeline stages,
one branch out to a human.

Audio flows left to right through six stages: Twilio in, STT, RAG, gpt-realtime-2 decision, TTS, Twilio out. A branch drops out at the decision step into the handoff_to_human tool when confidence falls below the threshold.

Caller dials a Twilio number. Twilio Programmable Voice opens a bidirectional media stream that Cloudflare Workers proxies to the OpenAI Realtime API endpoint — the entire audio path stays on Cloudflare's edge, which buys back ~28ms median round-trip vs. routing the audio through our origin. gpt-realtime-2 streams audio in (native speech-to-speech, no separate STT call on the happy path) and emits tokens back to the same stream as it generates them. Whisper-large-v3 sits behind a runtime fallback gate for cases where the Realtime audio path returns an empty transcript — accent, heavy background noise, or low-bandwidth callers. Whisper runs on a single g5.xlarge in the customer's AWS account.
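A minimal sketch of that relay, assuming Cloudflare's WebSocketPair and fetch-upgrade client APIs (@cloudflare/workers-types), Twilio's Media Streams JSON framing, and the Realtime API's g711_ulaw audio events. The file path, event names, and beta header are illustrative; the Whisper fallback gate, latency tracing, and reconnect handling are omitted.

edge/realtime-relay.ts typescript
// Cloudflare Worker relaying the Twilio bidirectional media stream to the
// OpenAI Realtime API. Sketch only — production also runs the fallback gate
// and per-stage tracing at this layer.
export default {
  async fetch(req: Request, env: { OPENAI_API_KEY: string }): Promise<Response> {
    if (req.headers.get("Upgrade") !== "websocket") {
      return new Response("expected a websocket upgrade", { status: 426 });
    }

    // Twilio side: accept the media stream at the edge.
    const [client, twilio] = Object.values(new WebSocketPair());
    twilio.accept();

    // OpenAI side: outbound WebSocket from the Worker (fetch + Upgrade header).
    const upstream = await fetch(
      "https://api.openai.com/v1/realtime?model=gpt-realtime-2",
      {
        headers: {
          Upgrade: "websocket",
          Authorization: `Bearer ${env.OPENAI_API_KEY}`,
          "OpenAI-Beta": "realtime=v1", // beta header; may be unnecessary on newer endpoints
        },
      },
    );
    const openai = upstream.webSocket;
    if (!openai) return new Response("upstream refused the upgrade", { status: 502 });
    openai.accept();
    openai.send(JSON.stringify({
      type: "session.update",
      session: { input_audio_format: "g711_ulaw", output_audio_format: "g711_ulaw" },
    }));

    let streamSid = "";
    twilio.addEventListener("message", (e) => {
      const msg = JSON.parse(e.data as string);
      if (msg.event === "start") streamSid = msg.start.streamSid;
      // Caller audio goes to the model as-is: base64 μ-law, no transcoding at the edge.
      if (msg.event === "media") {
        openai.send(JSON.stringify({ type: "input_audio_buffer.append", audio: msg.media.payload }));
      }
      if (msg.event === "stop") openai.close();
    });

    openai.addEventListener("message", (e) => {
      const evt = JSON.parse(e.data as string);
      // Streaming audio deltas flow straight back to the caller as they arrive.
      if (evt.type === "response.audio.delta") {
        twilio.send(JSON.stringify({ event: "media", streamSid, media: { payload: evt.delta } }));
      }
    });

    return new Response(null, { status: 101, webSocket: client });
  },
};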

Retrieval is hybrid. pgvector 0.7 on Postgres 16 is the primary store, holding the 8,200 help-center articles chunked to 480 tokens with 80-token overlap. Pinecone serverless holds the same corpus, fed by the same ingest pipeline, as a 50/50 A/B mirror in production — both lanes serve real queries, both lanes are measured. Cross-encoder rerank uses BAAI bge-reranker-large, self-hosted on the same g5.xlarge as Whisper (the GPU is otherwise idle when Whisper isn't being called). The model receives the top-12 reranked chunks plus the rolling conversation state every turn.
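A sketch of the primary lane's query shape, assuming a kb_chunks table with a vector(3072) embedding column and a tsvector column for the lexical half — the table, column, and file names are illustrative; the Pinecone mirror lane runs the same plan against its own index.

retrieval/pgvector-lane.ts typescript
// Primary retrieval lane: dense top-40 from pgvector plus a lexical top-40 from
// the tsvector column on the same table. Both lane results feed RRF fusion and
// then the cross-encoder reranker (top-12 to the model).
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function pgvectorLaneTop40(queryEmbedding: number[], queryText: string) {
  const vec = `[${queryEmbedding.join(",")}]`; // pgvector literal, cast below

  const dense = await pool.query(
    `SELECT id, chunk_text, embedding <=> $1::vector AS cosine_distance
       FROM kb_chunks
      ORDER BY embedding <=> $1::vector
      LIMIT 40`,
    [vec],
  );

  const lexical = await pool.query(
    `SELECT id, chunk_text, ts_rank(tsv, plainto_tsquery('english', $1)) AS rank
       FROM kb_chunks
      WHERE tsv @@ plainto_tsquery('english', $1)
      ORDER BY rank DESC
      LIMIT 40`,
    [queryText],
  );

  return { dense: dense.rows, lexical: lexical.rows };
}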

Tool-calling is the part that earned its complexity. The Realtime API supports OpenAI's function-calling JSON schema; we ship three tools on the session: lookup_article, handoff_to_human, and schedule_callback. The agent has zero write tools — it cannot mutate a customer record, it cannot escalate without a human in the loop, it cannot promise a refund. The handoff_to_human schema is the load-bearing one; we walk through it in the code-block section below. PagerDuty wires the warm-transfer leg: when the model calls handoff, the call SID + rolling transcript summary + retrieved chunk IDs go straight onto a PagerDuty incident; the on-call rep picks up with the agent's reasoning already on their screen.
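Registration is a one-liner on the session, sketched below; the lookup_article and schedule_callback file names are assumed to follow the same pattern as the handoff schema shown further down.

realtime/session-tools.ts typescript
// Attach the three read-only tools to the Realtime session. The agent has zero
// write tools — nothing registered here can mutate a customer record.
import lookupArticle from "./tools/lookup_article.tool.json";
import handoffToHuman from "./tools/handoff_to_human.tool.json";
import scheduleCallback from "./tools/schedule_callback.tool.json";

export function sessionToolsUpdate() {
  return {
    type: "session.update",
    session: {
      tools: [lookupArticle, handoffToHuman, scheduleCallback],
      tool_choice: "auto", // the model decides when to call handoff_to_human; the runtime gates the side-effect
    },
  };
}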

The architecture diagram below is the production shape. Hover any node for its tool inventory and per-stage latency budget. The streaming dots between the decide step and TTS visualise the real behaviour: gpt-realtime-2 does not wait until it has finished generating before TTS starts speaking — the tokens flow through. That single property is what gets the p95 first-token under 600ms.

three decisions that shaped the build
design decision · 01

gpt-realtime-2 speech-to-speech as the primary path

we rejected
Chained STT → text-LLM → TTS pipeline (Whisper + GPT-5.4 + ElevenLabs)
because
On the eval we ran, the chained pipeline came in at p95 ≈ 940ms first-token — already past the 700ms `feels-robotic` threshold US callers reported. Native speech-to-speech buys back ~350ms. Whisper still ships as a fallback for when the Realtime audio path can't decode accent or noise.
design decision · 02

handoff_to_human as a function-calling tool, not a fallback timeout

we rejected
Confidence threshold on the model's own self-reported probability
because
Self-reported confidence on Realtime models is poorly calibrated under stream pressure (Anthropic and OpenAI both publish this). A first-class tool the model can call explicitly is more honest: the model knows what it doesn't know better than it knows how sure it is.
design decision · 03

pgvector 0.7 primary + Pinecone serverless on a 50/50 A/B mirror

we rejected
Pick one vector store and commit
because
Help-center retrieval recall was the second-biggest determinant of deflection (after first-token latency) on the eval. Running both in production for 6 weeks let us measure not just recall@5 but cost per query and tail latency under real traffic. pgvector won on cost-per-query; Pinecone won on tail-latency variance. We kept pgvector primary and the mirror stays as a watch-the-shop sanity check.

The reason this shape works is the same reason it took ten weeks instead of four: every component has a separately measurable contract. Telephony ingress is measurable in round-trip latency from the carrier to the edge. STT is measurable in word-error rate on a frozen accent + noise test set. Retrieval is measurable in recall@5 + cost-per-query on the eval. The model is measurable in tier-1 deflection precision at a 0.7 confidence threshold. TTS is measurable in first-audio-frame latency from token to playback. The handoff path is measurable in PagerDuty page-to-pickup time. When something regresses, the per-component metric tells us which stage to look at — we don't have to root-cause a single end-to-end number.

Langfuse runs in the customer's VPC and stores every per-turn trace: audio segment, STT transcript (when used), retrieved chunks with rerank scores, model output, tool invocations, the call-state object at handoff, and the final caller-facing audio. 30-day hot retention plus a 1-year cold archive in S3. The support lead pulls a 5%-sample audit every Monday morning; the SRE team holds a fortnightly latency review against the SLO. Nothing in this section is published anywhere else by anyone shipping voice agents at this scope. That's the bar.
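For concreteness, a per-turn trace record reads roughly like the shape below — field names are illustrative, not the Langfuse schema itself.

observability/turn-trace.ts typescript
// Shape of the per-turn trace stored in Langfuse — one record per model turn,
// 30-day hot retention, 1-year cold in S3.
export interface TurnTrace {
  callSid: string;                                    // Twilio CA… identifier
  turnIndex: number;
  audioSegmentKey: string;                            // object-store key for the caller audio chunk
  sttTranscript?: string;                             // present only when the Whisper fallback fired
  retrievedChunks: { id: string; rerankScore: number }[];
  modelOutputText: string;
  toolInvocations: { name: string; args: unknown }[];
  callStateAtHandoff?: unknown;                       // populated only on handoff_to_human turns
  callerFacingAudioKey: string;                       // the audio actually played to the caller
  firstTokenMs: number;                               // feeds the weekly p95 SLO review
}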

under the hood

The realtime voice agent,
round-trip.

Caller speaks. Audio streams to gpt-realtime-2, grounded in the help-center RAG. The model either answers — streaming tokens straight back to TTS so the first audio frame leaves the edge inside ~580ms — or calls the handoff_to_human tool and PagerDuty pages a live agent. Hover any stage to see its tool inventory and first-token latency budget.

first-token p95 580ms end-to-end · streaming tokens flow continuously from gpt-realtime-2 to TTS · branch fires on confidence < 0.7

11,400
shadow + production calls used for the deflection CI
0
autonomous policy changes — agent only answers tier-1 from the help-center RAG
p50 480ms
first-token median; tail-latency budget detailed below
1 SRE on call
24/7 rotation — Langfuse + PagerDuty wired for sub-second cutover
latency budget

p95 first-token, visualised.

Total budget — caller-mouth to caller-ear — is 580ms. Each band's width is its share of that budget. The reasoning + RAG step is the long pole; the rest are kept honest by Cloudflare Workers and the Twilio media edge.

  1. Caller speech ingress 62ms
  2. STT (gpt-realtime-2 audio in) 118ms
  3. Reasoning + RAG retrieval 264ms
  4. TTS first-audio frame 88ms
  5. Twilio egress to caller 48ms

Deterministic replay — these bars are not a recording; they are a layout-stable visualisation of the p95 first-token latency budget. Per-stage numbers are pulled from Langfuse trace aggregates over a 30-day production window.

the stack

Named tools,
named versions.

Everything in the build is a thing your security team can write a question about. Nothing is `our proprietary AI`. The eval set, prompts, and tool schemas are all checked into the customer's repo — vendor swap-out cost is bounded by design.

gpt-realtime-2 · OpenAI Realtime API · 2026-04 · role: primary speech-to-speech
Whisper-large-v3 · OpenAI · self-hosted on g5.xlarge · role: STT fallback
pgvector 0.7 · Postgres 16 · role: embedding retrieval
BAAI bge-reranker-large · v2.5 · role: cross-encoder rerank
Pinecone serverless · us-east-1 · role: A/B mirror vector store
Twilio Programmable Voice · SIP · 2026-03 API · role: telephony
Cloudflare Workers · Durable Objects · role: edge audio transport
ElevenLabs Turbo v2.5 · Multilingual · role: TTS fallback / handoff voice
Langfuse · self-hosted · t3.medium · role: per-call trace + override review
PagerDuty · role: human handoff incident routing
how it actually runs

Production shape,
under the hood.

The numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses OpenAI's public Realtime API pricing as of May 2026; eval composition is the frozen 240-item set the CI gates on.

Voice case studies that stop at the architecture diagram are not useful to the people who actually have to sign — the head of CX and the SRE on call. Both have specific questions: what is the per-stage latency budget under load, what is the token-cost line that ties to the model vendor's published price card, what does the eval set actually contain, and what runs where for data-residency review. Vendors who don't show this either don't have it or are hiding it. Below is the version that maps directly to those questions. Every number is reproducible from a Langfuse trace, a Postgres EXPLAIN ANALYZE, or OpenAI's pricing page.

latency budget

Per-stage P50 / P95 (ms)

  1. stage Twilio ingress + edge proxy
    p50 38
    p95 62
    tooling Twilio Programmable Voice · Cloudflare Workers Durable Objects
  2. stage STT (Realtime audio in)
    p50 82
    p95 118
    tooling gpt-realtime-2 native audio · Whisper-large-v3 fallback on miss
  3. stage Hybrid retrieval
    p50 64
    p95 96
    tooling pgvector 0.7 top-40 ∥ Pinecone serverless top-40 (A/B) → RRF k=60
  4. stage Cross-encoder rerank
    p50 44
    p95 72
    tooling BAAI bge-reranker-large · g5.xlarge in customer VPC · top-12
  5. stage gpt-realtime-2 decision
    p50 196
    p95 264
    tooling OpenAI Realtime API · function-calling · ~2,800 in · streaming out
  6. stage TTS first audio
    p50 84
    p95 124
    tooling gpt-realtime-2 native TTS · ElevenLabs Turbo v2.5 fallback
  7. stage Twilio egress to caller
    p50 32
    p95 48
    tooling media stream reverse leg · jitter buffer ≤ 80ms
  8. stage Total to first-token
    p50 480
    p95 580
    tooling agent boundary · excludes caller-side jitter buffer

p50/p95 from 30-day rolling window over n ≈ 41,200 production calls. SLO is p95 ≤ 700 ms first-token; current burn ≈ 83%. The kill-point fix (multilingual cache invalidation) is the only regression event in the last 60 days.

slo headroom

Where the 700ms SLO budget goes.

Anything slower than 700ms first-token reads as a robot to a US caller — the binding constraint on this whole engagement. Current p95 is 580ms; the wedge below 700 is the headroom we keep for future prompt growth or for a third-party fallback slowing down.

  • Twilio ingress 62ms
  • STT (Realtime/Whisper) 118ms
  • RAG + reasoning 264ms
  • TTS first audio 88ms
  • Twilio egress 48ms
  • SLO threshold 700ms
  • Headroom under SLO 120ms

The retrieval lane is where most of the per-stage tuning effort landed. The corpus is 8,200 help-center articles chunked at 480 tokens with 80-token overlap, anchored on heading boundaries. We picked text-embedding-3-large at 3,072 dimensions over the cheaper text-embedding-3-small after running both on the eval — small dropped recall@5 from 0.89 to 0.81, and on a voice agent that recall hit is a wrong-answer rate hit you can hear. The 75% embedding-cost saving wasn't worth shipping a measurably worse retriever. Reciprocal-rank fusion with k=60 (the paper default) feeds the top-40 from each lane into the reranker; the reranker returns 12 to the model. The Pinecone serverless lane runs the same query plan on 50% of traffic — same recall, slightly higher cost-per-query, slightly tighter tail-latency variance. We have kept it on as a watchdog, not because we expect to migrate.
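The fusion step itself is small. A sketch with the k=60 default named above — the function and variable names are ours, not a library API.

retrieval/rrf.ts typescript
// Reciprocal-rank fusion over the two top-40 lanes, k = 60 (the paper default).
// score(doc) = Σ over lanes of 1 / (k + rank_of_doc_in_lane).
export function rrfFuse(lanes: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const lane of lanes) {
    lane.forEach((docId, idx) => {
      const rank = idx + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([docId]) => docId);
}

// Usage shape: const fused = rrfFuse([denseTop40Ids, lexicalTop40Ids]);
// the cross-encoder reranker then scores the fused list and keeps 12 for the model.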

realtime/tools/handoff_to_human.tool.json jsonc
// realtime/tools/handoff_to_human.tool.json
// Function-calling JSON schema registered on session.update.tools[].
// Confidence threshold is checked on the call-state object BEFORE the
// model is allowed to invoke this tool — the model can request handoff
// for any reason, but the runtime gates the side-effect (PagerDuty page,
// warm transfer to live agent) on confidence < 0.7 OR explicit caller
// request OR a must-refuse category match.
{
  "type": "function",
  "name": "handoff_to_human",
  "description": "Transfer this call to a live tier-1 support agent. Use when the caller's intent falls outside the help-center corpus, when the model's own confidence in the retrieved answer is below 0.7, when the caller explicitly asks for a human, or when the conversation hits a must-refuse category (billing dispute, churn-save, legal escalation).",
  "parameters": {
    "type": "object",
    "required": ["reason", "confidence", "call_state"],
    "properties": {
      "reason": {
        "type": "string",
        "enum": [
          "low_confidence",
          "out_of_scope",
          "caller_request",
          "must_refuse_category",
          "multilingual_handoff"
        ],
        "description": "Why the handoff is being requested. Used for routing + analytics."
      },
      "confidence": {
        "type": "number",
        "minimum": 0,
        "maximum": 1,
        "description": "Model's confidence in the retrieved answer at the moment of handoff. The runtime gate trips on < 0.7 for the low_confidence reason; other reasons bypass the threshold."
      },
      "call_state": {
        "type": "object",
        "required": ["call_sid", "language", "transcript_summary", "retrieved_chunk_ids"],
        "properties": {
          "call_sid":            { "type": "string", "pattern": "^CA[a-f0-9]{32}$" },
          "language":            { "type": "string", "pattern": "^[a-z]{2}(-[A-Z]{2})?$" },
          "transcript_summary":  { "type": "string", "maxLength": 800 },
          "retrieved_chunk_ids": {
            "type": "array",
            "items": { "type": "string", "pattern": "^kb_[a-f0-9]{12}$" },
            "minItems": 0,
            "maxItems": 12
          }
        }
      }
    }
  }
}
The handoff tool schema registered on session.update.tools[]. The runtime gates the side-effect (PagerDuty page, warm transfer) on confidence < 0.7 OR explicit caller request OR a must-refuse category match — the model can call the tool for any reason, but the gate decides whether it actually fires.
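A sketch of that runtime gate and the PagerDuty leg. The gate logic mirrors the schema's description; the page uses PagerDuty's public Events API v2 enqueue endpoint; function names and the routing-key plumbing are illustrative.

realtime/handoff-gate.ts typescript
// Runtime gate for the handoff tool: the model may call it for any reason, but the
// side-effect (PagerDuty page + warm transfer) only fires when the gate agrees.
type HandoffArgs = {
  reason: "low_confidence" | "out_of_scope" | "caller_request" | "must_refuse_category" | "multilingual_handoff";
  confidence: number;
  call_state: { call_sid: string; language: string; transcript_summary: string; retrieved_chunk_ids: string[] };
};

export function gateAllowsHandoff(args: HandoffArgs): boolean {
  // Only the low_confidence reason is held to the 0.7 threshold; the other reasons bypass it.
  if (args.reason === "low_confidence") return args.confidence < 0.7;
  return true;
}

export async function fireHandoff(args: HandoffArgs, routingKey: string): Promise<void> {
  if (!gateAllowsHandoff(args)) return; // tool call acknowledged, side-effect suppressed

  // PagerDuty Events API v2: the on-call rep gets the call SID, transcript summary,
  // and retrieved chunk IDs on the incident before they pick up the warm transfer.
  await fetch("https://events.pagerduty.com/v2/enqueue", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      routing_key: routingKey,
      event_action: "trigger",
      payload: {
        summary: `Warm transfer requested: ${args.reason}`,
        source: "voice-agent",
        severity: "info",
        custom_details: args.call_state,
      },
    }),
  });
}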
unit economics

Per-call and monthly cost math (≈ 41k calls/mo)

  1. line item gpt-realtime-2 audio input
    $ / call $0.0240
    $ / month $984
    note ~2 min avg call · $24/1M audio-input tokens (May 2026)
  2. line item gpt-realtime-2 audio output
    $ / call $0.0480
    $ / month $1,968
    note ~45 sec agent speech avg · $48/1M audio-output tokens
  3. line item text-embedding-3-large (query)
    $ / call $0.0003
    $ / month $13
    note ≈ 2,400 tokens × $0.13 / 1M per call
  4. line item Whisper fallback (5% of calls)
    $ / call $0.0030
    $ / month $123
    note self-hosted Whisper-large-v3 on g5.xlarge — amortised
  5. line item pgvector + Postgres 16 RDS
    $ / call
    $ / month $284
    note db.m6i.large · embeddings + tsvector + traces
  6. line item bge-reranker on g5.xlarge
    $ / call
    $ / month $378
    note shared with Whisper fallback · 24/7
  7. line item Pinecone serverless (A/B 50%)
    $ / call $0.0008
    $ / month $33
    note watchdog mirror · expected to drop after the audit
  8. line item Twilio Voice (inbound)
    $ / call $0.0170
    $ / month $697
    note $0.0085/min × 2 min avg per call
  9. line item Cloudflare Workers + R2
    $ / call $0.0006
    $ / month $26
    note edge proxy + audio chunk store
  10. line item Langfuse self-hosted
    $ / call
    $ / month $67
    note t3.medium · 30-day hot / 1-yr cold
  11. line item All-in per deflected call
    $ / call ≈ $0.10
    $ / month ≈ $4,573 / mo
    note vs. $4.00 loaded live-agent cost per call · ~40× cheaper at the deflection rate

Token costs use OpenAI's public Realtime API pricing as of May 2026 — $24/1M audio-input, $48/1M audio-output. Twilio costs are list price. Infra costs are AWS US-east-2 list. Loaded live-agent cost ($4.00/call) is the client's own internal blend (wage + benefits + AHT + occupancy + tooling); we used their number, not a market average. Monthly figures assume 41,200 calls/mo at the current 38% deflection rate. Per-call all-in reconciles to ~$0.10 (agent path) + ~$2.48 (handoff path) blended ≈ $1.05 weighted — published math headlines the per-deflected-call number, which is the relevant comparison vs. a live agent on a deflected call.

eval composition

What's in the frozen 240-item set

  1. category Top-5 question golds
    items 100
    what it checks labelled correct answer + retrieved chunk IDs on the 5 questions accounting for 62% of volume
    ci-gate threshold ≥ 0.92 groundedness
  2. category Latency soak (concurrent)
    items 20
    what it checks 50-concurrent-call replay against the staging Realtime endpoint
    ci-gate threshold p95 ≤ 700ms first-token
  3. category Accent + noise
    items 30
    what it checks ASR-stress eval drawn from the Common Voice multi-accent slice + a 12-clip noise overlay set
    ci-gate threshold ≥ 0.85 transcript accuracy
  4. category Must-refuse
    items 26
    what it checks billing disputes, churn-save asks, legal escalations, retention offers, refund promises
    ci-gate threshold 100% refusal · 100% handoff
  5. category Multilingual handoff
    items 24
    what it checks Spanish-to-English switch mid-call (added after the kill-point)
    ci-gate threshold p99 ≤ 250ms switch latency
  6. category Adversarial
    items 40
    what it checks jailbreak attempts, role-play coercion, prompt injection through caller statements
    ci-gate threshold ≥ 0.98 refusal

Eval set is frozen — items are added only, never edited. The support lead signs off on any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. Per-item replay is deterministic — same audio, same prompt, same retrieved chunks fed via fixture.
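The gate itself is a few lines. A sketch, with category names and score loading left illustrative.

eval/ci-gate.ts typescript
// Release gate: fail if any eval category drops more than 1 point vs the prior cut.
type CategoryScores = Record<string, number>; // e.g. { must_refuse: 100, top5_golds: 94 }

export function regressedCategories(prior: CategoryScores, current: CategoryScores, maxDrop = 1): string[] {
  const regressions: string[] = [];
  for (const [category, priorScore] of Object.entries(prior)) {
    const now = current[category];
    if (now === undefined || priorScore - now > maxDrop) regressions.push(category);
  }
  return regressions; // non-empty → CI fails unless a signed CHANGELOG override is present
}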

interactive cost math

Your monthly call volume, your monthly bill.

Drag the slider to your tier-1 inbound voice volume. The numbers below recompute against the published $0.10/call agent cost and a $4/call loaded live-agent baseline. The bar chart shows where the agent ROI compounds against pure-human staffing.

monthly inbound call volume: 41,200 (slider range 1,000 – 500,000)
agent monthly $: $103,742
100% human baseline: $164,800
monthly savings: $61,058
savings vs baseline: 37.0%

Math assumptions: $0.10/call all-in (Realtime API tokens + RAG infra + edge), $4.00/call loaded live-agent cost (wage + benefits + AHT + occupancy + tooling), 38% tier-1 deflection rate (95% CI 33%–43%, n=11,400). Change any assumption and the slider recomputes against it; the published math above doesn't move.
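The slider arithmetic, reproduced as a sketch against those stated assumptions — the constants are the page's numbers, the function name is ours.

cost/slider-math.ts typescript
// The slider arithmetic, reproduced. Constants are the page's stated assumptions.
const AGENT_COST_PER_CALL = 0.1;  // all-in per deflected call
const HUMAN_COST_PER_CALL = 4.0;  // client's own loaded live-agent cost
const DEFLECTION_RATE = 0.38;     // 95% CI 33%–43%, n = 11,400

export function monthlyCostAt(volume: number) {
  const deflected = volume * DEFLECTION_RATE;
  const agentMonthly = deflected * AGENT_COST_PER_CALL + (volume - deflected) * HUMAN_COST_PER_CALL;
  const humanBaseline = volume * HUMAN_COST_PER_CALL;
  return {
    agentMonthly,                                  // 41,200 calls/mo → ≈ $103,742
    humanBaseline,                                 // 41,200 calls/mo → $164,800
    savings: humanBaseline - agentMonthly,         // ≈ $61,058
    savingsPct: 1 - agentMonthly / humanBaseline,  // ≈ 37.0%
  };
}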

Production ops cadence is also part of the build, not an afterthought. The support lead and our on-call SRE hold a Monday-morning 30-minute review of every flagged turn from the prior week — anything the agent handed off, anything the eval flagged as a near-miss on groundedness, any latency outlier past p95. Patterns that repeat (more than three of the same flag in a week) become a JIRA ticket against the eval set and a candidate prompt or retrieval tweak. Langfuse trace retention is 30 days hot in the customer VPC plus one year cold in S3 inside their AWS account. Our on-call rotation runs one SRE a week against a 99.5% pipeline-availability SLO and the p95-under-700ms first-token SLO. The CX leadership team pulls a sample of 40 deflected calls a month for manual CSAT review — that signal feeds the prompt + retrieval iteration loop, not the eval set directly (the eval set stays frozen by design).

10 weeks · honest version

The timeline
including the week we sat on our hands.

Five stages, milestone-billed. The week-5 shadow run found a 1.4-second multilingual latency spike that would have torched the SLO in production. We halted the cutover, fixed the cache invalidation bug, added a tier-cached language-detection prefetch, and only then promoted to primary. The honest version of `ship in 10 weeks` includes the week we didn't ship.
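A sketch of the shape of that fix — prefetch every supported language at call start, then invalidate the per-call-SID detection cache on a mid-call switch. Names and the supported-language list are illustrative.

realtime/language-cache.ts typescript
// Week-5 fix, sketched: warm every supported language at call start, and invalidate
// the per-call-SID detection cache when the caller switches language mid-call.
const SUPPORTED = ["en-US", "es-US"] as const;   // illustrative list
type Lang = (typeof SUPPORTED)[number];

const detectionCache = new Map<string, Lang>();  // keyed by Twilio call SID

export async function warmLanguageCache(callSid: string, prefetch: (lang: Lang) => Promise<void>) {
  // Pre-warm the routing path for every supported language so a mid-call switch
  // never pays a cold detection round-trip (the 1.4s of dead air found in week 5).
  await Promise.all(SUPPORTED.map((lang) => prefetch(lang)));
  detectionCache.set(callSid, "en-US"); // default until the first real detection lands
}

export function onLanguageDetected(callSid: string, detected: Lang) {
  // The original bug: the first detection was cached and never invalidated,
  // so this branch never ran and the call stuck on the stale language.
  if (detectionCache.get(callSid) !== detected) {
    detectionCache.set(callSid, detected);
  }
}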

  1. Weeks 1–2

    Discovery + frozen eval set

    Two weeks shadowing the existing tier-1 voice queue. Pulled six months of call recordings (de-identified, customer consent already on file) and let the support lead label them. 240 frozen eval items — the 5 questions accounting for 62% of volume plus 30% adversarial (accent, background noise, multi-turn corrections) and 8% must-refuse (legal escalations, billing disputes, churn-save asks).

    240-item eval set + must-refuse list + latency SLO of 700ms
  2. Week 3

    Stack bake-off

    Two pipelines built in parallel: native gpt-realtime-2 speech-to-speech and a chained Whisper → GPT-5.4 → ElevenLabs path. Both wired to the same RAG over the help-center corpus. Ran 240 eval items through each, plus a soak test at 50 concurrent calls. Realtime won on p95 first-token by ~360ms; chained won on cost per minute by 28% but failed the latency SLO at the 95th percentile. Picked Realtime primary, chained kept as a documented fallback for the multilingual lane.

    Realtime primary · chained fallback · SLO-passing prototype
  3. Week 4

    Help-center RAG + tool surface

    Ingested 8,200 help-center articles into pgvector 0.7 (and mirrored into Pinecone serverless for the cost / tail-latency A/B). 480-token chunks, 80-token overlap, embeddings via text-embedding-3-large, cross-encoder rerank with bge-reranker-large. Three tools wired into the Realtime function-calling surface: lookup_article, handoff_to_human, schedule_callback. Zero write tools; the agent cannot mutate a customer record.

    Hybrid retrieval at 0.89 recall@5 · 3-tool surface frozen
  4. Week 5

    Shadow run — multilingual latency spike

    Two weeks shadowing the live queue (silent — calls still went to humans; the agent's response was logged but not played). Day 9 the SRE on rotation flagged a p99 latency spike on Spanish-to-English handoff calls: 1.4 seconds of dead air at the language switch. Root cause was a cache invalidation bug in the language-detection routing — first detection result was cached per call SID but never invalidated when the caller switched language mid-call. We halted prod cutover, added a tier-cached language detection prefetch at call start (every supported language warmed in the cache before the model needs it), and re-ran the soak. The honest version of `4-week shadow` includes this week.

    p99 multilingual latency dropped 1,420ms → 210ms after the fix
    Walk-away point
  5. Weeks 6–10

    Production cutover + cost lock-in

    Promoted to primary on the tier-1 inbound queue with the live agent line in warm-standby on a 1-second timer. Weeks 6–8 ran at 20% traffic with the support lead reviewing every flagged conversation. Weeks 9–10 ramped to 100% tier-1. The unit-economics SpecGrid below is the production-cut math at the 41k-call/month volume we currently see — not a projection.

    Full cutover · $0.10/call published · per-call trace store on hot retention
eval results · 240 frozen items

How we know
it works.

The eval set is frozen. Every model bump, prompt change, retrieval tweak, and tool-schema change re-runs the full 240. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2 points across all rows over the last 30 days.

metric · human baseline · v1 (wk 3) · v2 (wk 5) · current (live) · target
Tier-1 deflection rate (95% CI) · — · 31% (±5) · 35% (±4) · 38% (±5) · ≥ 35%
First-token latency p95 · — · 940ms · 680ms · 580ms · ≤ 700ms
Help-center recall@5 · — · 0.81 · 0.86 · 0.89 · ≥ 0.85
Wrong-answer rate (groundedness fail) · — · 2.4% · 1.6% · 0.8% · ≤ 1.0%
Human-handoff precision · — · 0.88 · 0.91 · 0.94 · ≥ 0.92
Per-call all-in cost · $4.00 · $0.18 · $0.13 · $0.10 · ≤ $0.15

Sample size for the deflection number is n=11,400 inbound calls across the 6-week shadow + 4-week production cut. The 38% point estimate has a 95% confidence interval of 33%–43%. First-token latency p95 is measured at the agent boundary (caller-side jitter buffer excluded). Per-call cost is the all-in deflected-call number; weighted-blended cost across deflected + handoff paths is ~$1.05/call. Multilingual handoff latency is measured at language-switch detection; per the kill-point fix, p99 now sits at 210ms.

Ready to ship

Want a case study like this
for your voice queue?

Book a $3K fixed-fee audit. We'll review the inbound voice workflow, model your call volume against the cost slider on this page, scope the eval set, recommend a Realtime / chained / hybrid stack, project the run-cost, and tell you honestly whether voice agents make sense for your traffic. About one audit in five ends with `keep the humans, here's the smaller automation we'd ship instead.`

30 min, async or live · cost-math reconciled to your real volume · walk-away point in the pilot