B2B SaaS · tier-1 support · realtime voice agent · function-calling
gpt-realtime-2 · role: primary speech-to-speech · streaming tokens
Whisper-large-v3 · role: STT fallback · accent + noise
pgvector 0.7 · role: help-center RAG · 8,200 articles
Twilio Voice · role: telephony ingress + egress
Cloudflare Workers · role: edge audio transport · ~28ms median RTT
ElevenLabs Turbo v2.5 · role: TTS fallback · voice-clone for handoff continuity
case study · 2026 · anonymized

How we shipped a sub-600ms voice agent
at $0.10 per call.

A US-based mid-market B2B SaaS support team's tier-1 voice queue was averaging four-minute waits at peak. Five inbound questions accounted for 62% of call volume. Their existing IVR was bouncing 80%+ of calls to a human. We built a gpt-realtime-2 voice agent over their help-center corpus — Twilio in, streaming tokens to TTS, confidence-gated handoff to a live agent — eval-first, with a kill point in week 5 that we ended up using.

≈ 38%
tier-1 deflection (95% CI 33%–43% · n=11,400 calls)
p95 580ms
first-token latency · SLO 700ms
$0.10
all-in per-deflected-call · vs $4 live-agent loaded cost
10 wks
discovery + 6-wk shadow + 4-wk prod cut
shipped
10 weeks · 3 engineers · 1 support lead
4 min
average tier-1 voice queue wait at peak
62%
of inbound call volume tied to the same 5 questions
80%+
IVR bounce rate to a human (existing tree)
700ms
first-token ceiling before callers hear 'robot'
the problem

A tier-1 voice queue
that wasn't worth a human.

The five most-asked questions were 62% of inbound volume. Live agents were burning their day on the same calls. Wait times still climbed at peak.

The client is a US-based mid-market B2B SaaS company — north of $80M ARR, a tier-1 support team of fourteen reps spread across two timezones, and an inbound voice queue handling roughly 41,000 calls a month. The product is enough of a workhorse that customers call about real issues; it's also enough of a workhorse that 62% of those calls are the same five questions on a rolling 90-day window. Their existing IVR was the press-1-for-billing tree everyone hates, with an 80%+ bounce rate straight to a queue with a four-minute peak wait. The support lead had already run the time-and-motion study — tier-1 reps were spending 71% of their day on the questions any of them could answer in their sleep.

today vs · with the agent

today · tier-1 voice queue
Caller dials → IVR tree (press 1 for…) → hold, 4 min at peak → live agent answering the same 5 questions
outcome: long wait · agents burned on repetitive tier-1 calls

with the agent
Caller dials → Twilio + edge audio (<60ms ingress) → gpt-realtime-2 + RAG (streaming · confidence-gated) → decision branch (answer · or handoff_to_human)
outcomes: resolved ≈ 38% · handoff to human (live transfer with transcript) · failsafe queue (model self-refuses)

The binding constraint was latency. The support lead and the head of CX were not romantic about AI; they had piloted two text-channel chatbots in the previous year and shelved both of them when CSAT dipped. What changed was a specific number: when a synchronous voice caller hears more than ~700ms of dead air after they finish speaking, US callers reliably report the experience as "robotic" — the line between a slow human and a fast bot. Anything past that ceiling and the deflected calls don't actually deflect; they bounce to a human angrier than when they arrived, and the unit economics invert.

So the scoping conversation we walked into wasn't "should we ship voice AI." It was: show us how a voice agent could miss the 700ms first-token ceiling, and tell us how you'd catch it before a customer hears it. That framing decided the engagement. The deliverable was a function-calling voice agent with a confidence-gated handoff tool, retrieval grounded in their existing help-center corpus, and a frozen eval set that gated every release. Nothing about it was going to be invisible — every flagged conversation was reviewable in Langfuse before the support lead would sign off on a cutover.

The rest of this page is what we shipped, what we measured, and where it broke in week 5.

the approach

Six pipeline stages,
one branch out to a human.

Audio flows left to right through six stages: Twilio in, STT, RAG, gpt-realtime-2 decision, TTS, Twilio out. A branch drops out at the decision step into the handoff_to_human tool when confidence falls below the threshold.

Caller dials a Twilio number. Twilio Programmable Voice opens a bidirectional media stream that Cloudflare Workers proxies to the OpenAI Realtime API endpoint — the entire audio path stays on Cloudflare's edge, which buys back ~28ms median round-trip vs. routing the audio through our origin. gpt-realtime-2 streams audio in (native speech-to-speech, no separate STT call on the happy path) and emits tokens back to the same stream as it generates them. Whisper-large-v3 sits behind a runtime fallback gate for cases where the Realtime audio path returns an empty transcript — accent, heavy background noise, or low-bandwidth callers. Whisper runs on a single g5.xlarge in the customer's AWS account.
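A minimal sketch of that relay, assuming Cloudflare's WebSocketPair and fetch-upgrade client APIs (@cloudflare/workers-types), Twilio's Media Streams JSON framing, and the Realtime API's g711_ulaw audio events. The file path, event names, and beta header are illustrative; the Whisper fallback gate, latency tracing, and reconnect handling are omitted.

edge/realtime-relay.ts typescript
// Cloudflare Worker relaying the Twilio bidirectional media stream to the
// OpenAI Realtime API. Sketch only — production also runs the fallback gate
// and per-stage tracing at this layer.
export default {
  async fetch(req: Request, env: { OPENAI_API_KEY: string }): Promise<Response> {
    if (req.headers.get("Upgrade") !== "websocket") {
      return new Response("expected a websocket upgrade", { status: 426 });
    }

    // Twilio side: accept the media stream at the edge.
    const [client, twilio] = Object.values(new WebSocketPair());
    twilio.accept();

    // OpenAI side: outbound WebSocket from the Worker (fetch + Upgrade header).
    const upstream = await fetch(
      "https://api.openai.com/v1/realtime?model=gpt-realtime-2",
      {
        headers: {
          Upgrade: "websocket",
          Authorization: `Bearer ${env.OPENAI_API_KEY}`,
          "OpenAI-Beta": "realtime=v1", // beta header; may be unnecessary on newer endpoints
        },
      },
    );
    const openai = upstream.webSocket;
    if (!openai) return new Response("upstream refused the upgrade", { status: 502 });
    openai.accept();
    openai.send(JSON.stringify({
      type: "session.update",
      session: { input_audio_format: "g711_ulaw", output_audio_format: "g711_ulaw" },
    }));

    let streamSid = "";
    twilio.addEventListener("message", (e) => {
      const msg = JSON.parse(e.data as string);
      if (msg.event === "start") streamSid = msg.start.streamSid;
      // Caller audio goes to the model as-is: base64 μ-law, no transcoding at the edge.
      if (msg.event === "media") {
        openai.send(JSON.stringify({ type: "input_audio_buffer.append", audio: msg.media.payload }));
      }
      if (msg.event === "stop") openai.close();
    });

    openai.addEventListener("message", (e) => {
      const evt = JSON.parse(e.data as string);
      // Streaming audio deltas flow straight back to the caller as they arrive.
      if (evt.type === "response.audio.delta") {
        twilio.send(JSON.stringify({ event: "media", streamSid, media: { payload: evt.delta } }));
      }
    });

    return new Response(null, { status: 101, webSocket: client });
  },
};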

Retrieval is hybrid. pgvector 0.7 on Postgres 16 is the primary store, holding the 8,200 help-center articles chunked to 480 tokens with 80-token overlap. Pinecone serverless holds the same corpus, fed by the same ingest pipeline, as a 50/50 A/B mirror in production — both lanes serve real queries, both lanes are measured. Cross-encoder rerank uses BAAI bge-reranker-large, self-hosted on the same g5.xlarge as Whisper (the GPU is otherwise idle when Whisper isn't being called). The model receives the top-12 reranked chunks plus the rolling conversation state every turn.
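A sketch of the primary lane's query shape, assuming a kb_chunks table with a vector(3072) embedding column and a tsvector column for the lexical half — the table, column, and file names are illustrative; the Pinecone mirror lane runs the same plan against its own index.

retrieval/pgvector-lane.ts typescript
// Primary retrieval lane: dense top-40 from pgvector plus a lexical top-40 from
// the tsvector column on the same table. Both lane results feed RRF fusion and
// then the cross-encoder reranker (top-12 to the model).
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function pgvectorLaneTop40(queryEmbedding: number[], queryText: string) {
  const vec = `[${queryEmbedding.join(",")}]`; // pgvector literal, cast below

  const dense = await pool.query(
    `SELECT id, chunk_text, embedding <=> $1::vector AS cosine_distance
       FROM kb_chunks
      ORDER BY embedding <=> $1::vector
      LIMIT 40`,
    [vec],
  );

  const lexical = await pool.query(
    `SELECT id, chunk_text, ts_rank(tsv, plainto_tsquery('english', $1)) AS rank
       FROM kb_chunks
      WHERE tsv @@ plainto_tsquery('english', $1)
      ORDER BY rank DESC
      LIMIT 40`,
    [queryText],
  );

  return { dense: dense.rows, lexical: lexical.rows };
}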

Tool-calling is the part that earned its complexity. The Realtime API supports OpenAI's function-calling JSON schema; we ship three tools on the session: lookup_article, handoff_to_human, and schedule_callback. The agent has zero write tools — it cannot mutate a customer record, it cannot escalate without a human in the loop, it cannot promise a refund. The handoff_to_human schema is the load-bearing one; we walk through it in the code-block section below. PagerDuty wires the warm-transfer leg: when the model calls handoff, the call SID + rolling transcript summary + retrieved chunk IDs go straight onto a PagerDuty incident; the on-call rep picks up with the agent's reasoning already on their screen.
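Registration is a one-liner on the session, sketched below; the lookup_article and schedule_callback file names are assumed to follow the same pattern as the handoff schema shown further down.

realtime/session-tools.ts typescript
// Attach the three read-only tools to the Realtime session. The agent has zero
// write tools — nothing registered here can mutate a customer record.
import lookupArticle from "./tools/lookup_article.tool.json";
import handoffToHuman from "./tools/handoff_to_human.tool.json";
import scheduleCallback from "./tools/schedule_callback.tool.json";

export function sessionToolsUpdate() {
  return {
    type: "session.update",
    session: {
      tools: [lookupArticle, handoffToHuman, scheduleCallback],
      tool_choice: "auto", // the model decides when to call handoff_to_human; the runtime gates the side-effect
    },
  };
}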

The architecture diagram below is the production shape. Hover any node for its tool inventory and per-stage latency budget. The streaming dots between the decide step and TTS visualise the real behaviour: gpt-realtime-2 does not wait until it has finished generating before TTS starts speaking — the tokens flow through. That single property is what gets the p95 first-token under 600ms.

three decisions that shaped the build
design decision · 01

gpt-realtime-2 speech-to-speech as the primary path

we rejected
Chained STT → text-LLM → TTS pipeline (Whisper + GPT-5.4 + ElevenLabs)
because
On the eval we ran, the chained pipeline came in at p95 ≈ 940ms first-token — already past the 700ms `feels-robotic` threshold US callers reported. Native speech-to-speech buys back ~350ms. Whisper still ships as a fallback for when the Realtime audio path can't decode accent or noise.
design decision · 02

handoff_to_human as a function-calling tool, not a fallback timeout

we rejected
Confidence threshold on the model's own self-reported probability
because
Self-reported confidence on Realtime models is poorly calibrated under stream pressure (Anthropic and OpenAI both publish this). A first-class tool the model can call explicitly is more honest: the model knows what it doesn't know better than it knows how sure it is.
design decision · 03

pgvector 0.7 primary + Pinecone serverless on a 50/50 A/B mirror

we rejected
Pick one vector store and commit
because
Help-center retrieval recall was the second-biggest determinant of deflection (after first-token latency) on the eval. Running both in production for 6 weeks let us measure not just recall@5 but cost per query and tail latency under real traffic. pgvector won on cost-per-query; Pinecone won on tail-latency variance. We kept pgvector primary and the mirror stays as a watch-the-shop sanity check.

The reason this shape works is the same reason it took ten weeks instead of four: every component has a separately measurable contract. Telephony ingress is measurable in round-trip latency from the carrier to the edge. STT is measurable in word-error rate on a frozen accent + noise test set. Retrieval is measurable in recall@5 + cost-per-query on the eval. The model is measurable in tier-1 deflection precision at a 0.7 confidence threshold. TTS is measurable in first-audio-frame latency from token to playback. The handoff path is measurable in PagerDuty page-to-pickup time. When something regresses, the per-component metric tells us which stage to look at — we don't have to root-cause a single end-to-end number.

Langfuse runs in the customer's VPC and stores every per-turn trace: audio segment, STT transcript (when used), retrieved chunks with rerank scores, model output, tool invocations, the call-state object at handoff, and the final caller-facing audio. 30-day hot retention plus a 1-year cold archive in S3. The support lead pulls a 5%-sample audit every Monday morning; the SRE team holds a fortnightly latency review against the SLO. Nothing in this section is published anywhere else by anyone shipping voice agents at this scope. That's the bar.
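For concreteness, a per-turn trace record reads roughly like the shape below — field names are illustrative, not the Langfuse schema itself.

observability/turn-trace.ts typescript
// Shape of the per-turn trace stored in Langfuse — one record per model turn,
// 30-day hot retention, 1-year cold in S3.
export interface TurnTrace {
  callSid: string;                                    // Twilio CA… identifier
  turnIndex: number;
  audioSegmentKey: string;                            // object-store key for the caller audio chunk
  sttTranscript?: string;                             // present only when the Whisper fallback fired
  retrievedChunks: { id: string; rerankScore: number }[];
  modelOutputText: string;
  toolInvocations: { name: string; args: unknown }[];
  callStateAtHandoff?: unknown;                       // populated only on handoff_to_human turns
  callerFacingAudioKey: string;                       // the audio actually played to the caller
  firstTokenMs: number;                               // feeds the weekly p95 SLO review
}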

under the hood

The realtime voice agent,
round-trip.

Caller speaks. Audio streams to gpt-realtime-2, grounded in the help-center RAG. The model either answers — streaming tokens straight back to TTS so the first audio frame leaves the edge inside ~580ms — or calls the handoff_to_human tool and PagerDuty pages a live agent. Hover any stage to see its tool inventory and first-token latency budget.

first-token p95 580ms end-to-end · streaming tokens flow continuously from gpt-realtime-2 to TTS · branch fires on confidence < 0.7

11,400
shadow + production calls used for the deflection CI
0
autonomous policy changes — agent only answers tier-1 from the help-center RAG
p50 480ms
first-token median; tail-latency budget detailed below
1 SRE on call
24/7 rotation — Langfuse + PagerDuty wired for sub-second cutover
latency budget

p95 first-token, visualised.

Total budget — caller-mouth to caller-ear — is 580ms. Each band's width is its share of that budget. The reasoning + RAG step is the long pole; the rest are kept honest by Cloudflare Workers and the Twilio media edge.

  1. Caller speech ingress 62ms
  2. STT (gpt-realtime-2 audio in) 118ms
  3. Reasoning + RAG retrieval 264ms
  4. TTS first-audio frame 88ms
  5. Twilio egress to caller 48ms

Deterministic replay — these bars are not a recording; they are a layout-stable visualisation of the p95 first-token latency budget. Per-stage numbers are pulled from Langfuse trace aggregates over a 30-day production window.

the stack

Named tools,
named versions.

Everything in the build is a thing your security team can write a question about. Nothing is `our proprietary AI`. The eval set, prompts, and tool schemas are all checked into the customer's repo — vendor swap-out cost is bounded by design.

gpt-realtime-2 · OpenAI Realtime API · 2026-04 · role: primary speech-to-speech
Whisper-large-v3 · OpenAI · self-hosted on g5.xlarge · role: STT fallback
pgvector 0.7 · Postgres 16 · role: embedding retrieval
BAAI bge-reranker-large · v2.5 · role: cross-encoder rerank
Pinecone serverless · us-east-1 · role: A/B mirror vector store
Twilio Programmable Voice · SIP · 2026-03 API · role: telephony
Cloudflare Workers · Durable Objects · role: edge audio transport
ElevenLabs Turbo v2.5 · Multilingual · role: TTS fallback / handoff voice
Langfuse · self-hosted · t3.medium · role: per-call trace + override review
PagerDuty · role: human handoff incident routing
how it actually runs

Production shape,
under the hood.

The numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses OpenAI's public Realtime API pricing as of May 2026; eval composition is the frozen 240-item set the CI gates on.

Voice case studies that stop at the architecture diagram are not useful to the people who actually have to sign — the head of CX and the SRE on call. Both have specific questions: what is the per-stage latency budget under load, what is the token-cost line that ties to the model vendor's published price card, what does the eval set actually contain, and what runs where for data-residency review. Vendors who don't show this either don't have it or are hiding it. Below is the version that maps directly to those questions. Every number is reproducible from a Langfuse trace, a Postgres EXPLAIN ANALYZE, or OpenAI's pricing page.

latency budget

Per-stage P50 / P95 (ms)

  1. stage Twilio ingress + edge proxy
    p50 38
    p95 62
    tooling Twilio Programmable Voice · Cloudflare Workers Durable Objects
  2. stage STT (Realtime audio in)
    p50 82
    p95 118
    tooling gpt-realtime-2 native audio · Whisper-large-v3 fallback on miss
  3. stage Hybrid retrieval
    p50 64
    p95 96
    tooling pgvector 0.7 top-40 ∥ Pinecone serverless top-40 (A/B) → RRF k=60
  4. stage Cross-encoder rerank
    p50 44
    p95 72
    tooling BAAI bge-reranker-large · g5.xlarge in customer VPC · top-12
  5. stage gpt-realtime-2 decision
    p50 196
    p95 264
    tooling OpenAI Realtime API · function-calling · ~2,800 in · streaming out
  6. stage TTS first audio
    p50 84
    p95 124
    tooling gpt-realtime-2 native TTS · ElevenLabs Turbo v2.5 fallback
  7. stage Twilio egress to caller
    p50 32
    p95 48
    tooling media stream reverse leg · jitter buffer ≤ 80ms
  8. stage Total to first-token
    p50 480
    p95 580
    tooling agent boundary · excludes caller-side jitter buffer

p50/p95 from 30-day rolling window over n ≈ 41,200 production calls. SLO is p95 ≤ 700 ms first-token; current burn ≈ 83%. The kill-point fix (multilingual cache invalidation) is the only regression event in the last 60 days.

slo headroom

Where the 700ms SLO budget goes.

Anything slower than 700ms first-token reads as a robot to a US caller — the binding constraint on this whole engagement. Current p95 is 580ms; the wedge below 700 is the headroom we keep for future prompt growth or for a third-party fallback slowing down.

  • Twilio ingress 62ms
  • STT (Realtime/Whisper) 118ms
  • RAG + reasoning 264ms
  • TTS first audio 88ms
  • Twilio egress 48ms
  • SLO threshold 700ms
  • Headroom under SLO 120ms

The retrieval lane is where most of the per-stage tuning effort landed. The corpus is 8,200 help-center articles chunked at 480 tokens with 80-token overlap, anchored on heading boundaries. We picked text-embedding-3-large at 3,072 dimensions over the cheaper text-embedding-3-small after running both on the eval — small dropped recall@5 from 0.89 to 0.81, and on a voice agent that recall hit is a wrong-answer rate hit you can hear. The 75% embedding-cost saving wasn't worth shipping a measurably worse retriever. Reciprocal-rank fusion with k=60 (the paper default) feeds the top-40 from each lane into the reranker; the reranker returns 12 to the model. The Pinecone serverless lane runs the same query plan on 50% of traffic — same recall, slightly higher cost-per-query, slightly tighter tail-latency variance. We have kept it on as a watchdog, not because we expect to migrate.
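The fusion step itself is small. A sketch with the k=60 default named above — the function and variable names are ours, not a library API.

retrieval/rrf.ts typescript
// Reciprocal-rank fusion over the two top-40 lanes, k = 60 (the paper default).
// score(doc) = Σ over lanes of 1 / (k + rank_of_doc_in_lane).
export function rrfFuse(lanes: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const lane of lanes) {
    lane.forEach((docId, idx) => {
      const rank = idx + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([docId]) => docId);
}

// Usage shape: const fused = rrfFuse([denseTop40Ids, lexicalTop40Ids]);
// the cross-encoder reranker then scores the fused list and keeps 12 for the model.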

realtime/tools/handoff_to_human.tool.json jsonc
// realtime/tools/handoff_to_human.tool.json
// Function-calling JSON schema registered on session.update.tools[].
// Confidence threshold is checked on the call-state object BEFORE the
// model is allowed to invoke this tool — the model can request handoff
// for any reason, but the runtime gates the side-effect (PagerDuty page,
// warm transfer to live agent) on confidence < 0.7 OR explicit caller
// request OR a must-refuse category match.
{
  "type": "function",
  "name": "handoff_to_human",
  "description": "Transfer this call to a live tier-1 support agent. Use when the caller's intent falls outside the help-center corpus, when the model's own confidence in the retrieved answer is below 0.7, when the caller explicitly asks for a human, or when the conversation hits a must-refuse category (billing dispute, churn-save, legal escalation).",
  "parameters": {
    "type": "object",
    "required": ["reason", "confidence", "call_state"],
    "properties": {
      "reason": {
        "type": "string",
        "enum": [
          "low_confidence",
          "out_of_scope",
          "caller_request",
          "must_refuse_category",
          "multilingual_handoff"
        ],
        "description": "Why the handoff is being requested. Used for routing + analytics."
      },
      "confidence": {
        "type": "number",
        "minimum": 0,
        "maximum": 1,
        "description": "Model's confidence in the retrieved answer at the moment of handoff. The runtime gate trips on < 0.7 for the low_confidence reason; other reasons bypass the threshold."
      },
      "call_state": {
        "type": "object",
        "required": ["call_sid", "language", "transcript_summary", "retrieved_chunk_ids"],
        "properties": {
          "call_sid":            { "type": "string", "pattern": "^CA[a-f0-9]{32}$" },
          "language":            { "type": "string", "pattern": "^[a-z]{2}(-[A-Z]{2})?$" },
          "transcript_summary":  { "type": "string", "maxLength": 800 },
          "retrieved_chunk_ids": {
            "type": "array",
            "items": { "type": "string", "pattern": "^kb_[a-f0-9]{12}$" },
            "minItems": 0,
            "maxItems": 12
          }
        }
      }
    }
  }
}
The handoff tool schema registered on session.update.tools[]. The runtime gates the side-effect (PagerDuty page, warm transfer) on confidence < 0.7 OR explicit caller request OR a must-refuse category match — the model can call the tool for any reason, but the gate decides whether it actually fires.
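A sketch of that runtime gate and the PagerDuty leg. The gate logic mirrors the schema's description; the page uses PagerDuty's public Events API v2 enqueue endpoint; function names and the routing-key plumbing are illustrative.

realtime/handoff-gate.ts typescript
// Runtime gate for the handoff tool: the model may call it for any reason, but the
// side-effect (PagerDuty page + warm transfer) only fires when the gate agrees.
type HandoffArgs = {
  reason: "low_confidence" | "out_of_scope" | "caller_request" | "must_refuse_category" | "multilingual_handoff";
  confidence: number;
  call_state: { call_sid: string; language: string; transcript_summary: string; retrieved_chunk_ids: string[] };
};

export function gateAllowsHandoff(args: HandoffArgs): boolean {
  // Only the low_confidence reason is held to the 0.7 threshold; the other reasons bypass it.
  if (args.reason === "low_confidence") return args.confidence < 0.7;
  return true;
}

export async function fireHandoff(args: HandoffArgs, routingKey: string): Promise<void> {
  if (!gateAllowsHandoff(args)) return; // tool call acknowledged, side-effect suppressed

  // PagerDuty Events API v2: the on-call rep gets the call SID, transcript summary,
  // and retrieved chunk IDs on the incident before they pick up the warm transfer.
  await fetch("https://events.pagerduty.com/v2/enqueue", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      routing_key: routingKey,
      event_action: "trigger",
      payload: {
        summary: `Warm transfer requested: ${args.reason}`,
        source: "voice-agent",
        severity: "info",
        custom_details: args.call_state,
      },
    }),
  });
}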
unit economics

Per-call and monthly cost math (≈ 41k calls/mo)

  1. line item gpt-realtime-2 audio input
    $ / call $0.0240
    $ / month $984
    note ~2 min avg call · $24/1M audio-input tokens (May 2026)
  2. line item gpt-realtime-2 audio output
    $ / call $0.0480
    $ / month $1,968
    note ~45 sec agent speech avg · $48/1M audio-output tokens
  3. line item text-embedding-3-large (query)
    $ / call $0.0003
    $ / month $13
    note ≈ 2,400 tokens × $0.13 / 1M per call
  4. line item Whisper fallback (5% of calls)
    $ / call $0.0030
    $ / month $123
    note self-hosted Whisper-large-v3 on g5.xlarge — amortised
  5. line item pgvector + Postgres 16 RDS
    $ / call
    $ / month $284
    note db.m6i.large · embeddings + tsvector + traces
  6. line item bge-reranker on g5.xlarge
    $ / call
    $ / month $378
    note shared with Whisper fallback · 24/7
  7. line item Pinecone serverless (A/B 50%)
    $ / call $0.0008
    $ / month $33
    note watchdog mirror · expected to drop after the audit
  8. line item Twilio Voice (inbound)
    $ / call $0.0170
    $ / month $697
    note $0.0085/min × 2 min avg per call
  9. line item Cloudflare Workers + R2
    $ / call $0.0006
    $ / month $26
    note edge proxy + audio chunk store
  10. line item Langfuse self-hosted
    $ / call
    $ / month $67
    note t3.medium · 30-day hot / 1-yr cold
  11. line item All-in per deflected call
    $ / call ≈ $0.10
    $ / month ≈ $4,573 / mo
    note vs. $4.00 loaded live-agent cost per call · ~40× cheaper at the deflection rate

Token costs use OpenAI's public Realtime API pricing as of May 2026 — $24/1M audio-input, $48/1M audio-output. Twilio costs are list price. Infra costs are AWS US-east-2 list. Loaded live-agent cost ($4.00/call) is the client's own internal blend (wage + benefits + AHT + occupancy + tooling); we used their number, not a market average. Monthly figures assume 41,200 calls/mo at the current 38% deflection rate. Per-call all-in reconciles to ~$0.10 (agent path) + ~$2.48 (handoff path) blended ≈ $1.05 weighted — published math headlines the per-deflected-call number, which is the relevant comparison vs. a live agent on a deflected call.

eval composition

What's in the frozen 240-item set

  1. category Top-5 question golds
    items 100
    what it checks labelled correct answer + retrieved chunk IDs on the 5 questions accounting for 62% of volume
    ci-gate threshold ≥ 0.92 groundedness
  2. category Latency soak (concurrent)
    items 20
    what it checks 50-concurrent-call replay against the staging Realtime endpoint
    ci-gate threshold p95 ≤ 700ms first-token
  3. category Accent + noise
    items 30
    what it checks ASR-stress eval drawn from the Common Voice multi-accent slice + a 12-clip noise overlay set
    ci-gate threshold ≥ 0.85 transcript accuracy
  4. category Must-refuse
    items 26
    what it checks billing disputes, churn-save asks, legal escalations, retention offers, refund promises
    ci-gate threshold 100% refusal · 100% handoff
  5. category Multilingual handoff
    items 24
    what it checks Spanish-to-English switch mid-call (added after the kill-point)
    ci-gate threshold p99 ≤ 250ms switch latency
  6. category Adversarial
    items 40
    what it checks jailbreak attempts, role-play coercion, prompt injection through caller statements
    ci-gate threshold ≥ 0.98 refusal

Eval set is frozen — items are added only, never edited. The support lead signs off on any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. Per-item replay is deterministic — same audio, same prompt, same retrieved chunks fed via fixture.
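The gate itself is a few lines. A sketch, with category names and score loading left illustrative.

eval/ci-gate.ts typescript
// Release gate: fail if any eval category drops more than 1 point vs the prior cut.
type CategoryScores = Record<string, number>; // e.g. { must_refuse: 100, top5_golds: 94 }

export function regressedCategories(prior: CategoryScores, current: CategoryScores, maxDrop = 1): string[] {
  const regressions: string[] = [];
  for (const [category, priorScore] of Object.entries(prior)) {
    const now = current[category];
    if (now === undefined || priorScore - now > maxDrop) regressions.push(category);
  }
  return regressions; // non-empty → CI fails unless a signed CHANGELOG override is present
}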

interactive cost math

Your monthly call volume, your monthly bill.

Drag the slider to your tier-1 inbound voice volume. The numbers below recompute against the published $0.10/call agent cost and a $4/call loaded live-agent baseline. The bar chart shows where the agent ROI compounds against pure-human staffing.

monthly inbound call volume: 41,200 (slider range 1,000 – 500,000)
agent monthly $: $103,742
100% human baseline: $164,800
monthly savings: $61,058
savings vs baseline: 37.0%

Math assumptions: $0.10/call all-in (Realtime API tokens + RAG infra + edge), $4.00/call loaded live-agent cost (wage + benefits + AHT + occupancy + tooling), 38% tier-1 deflection rate (95% CI 33%–43%, n=11,400). Change any assumption and the slider recomputes against it; the published math above doesn't move.
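The slider arithmetic, reproduced as a sketch against those stated assumptions — the constants are the page's numbers, the function name is ours.

cost/slider-math.ts typescript
// The slider arithmetic, reproduced. Constants are the page's stated assumptions.
const AGENT_COST_PER_CALL = 0.1;  // all-in per deflected call
const HUMAN_COST_PER_CALL = 4.0;  // client's own loaded live-agent cost
const DEFLECTION_RATE = 0.38;     // 95% CI 33%–43%, n = 11,400

export function monthlyCostAt(volume: number) {
  const deflected = volume * DEFLECTION_RATE;
  const agentMonthly = deflected * AGENT_COST_PER_CALL + (volume - deflected) * HUMAN_COST_PER_CALL;
  const humanBaseline = volume * HUMAN_COST_PER_CALL;
  return {
    agentMonthly,                                  // 41,200 calls/mo → ≈ $103,742
    humanBaseline,                                 // 41,200 calls/mo → $164,800
    savings: humanBaseline - agentMonthly,         // ≈ $61,058
    savingsPct: 1 - agentMonthly / humanBaseline,  // ≈ 37.0%
  };
}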

Production ops cadence is also part of the build, not an afterthought. The support lead and our on-call SRE hold a Monday-morning 30-minute review of every flagged turn from the prior week — anything the agent handed off, anything the eval flagged as a near-miss on groundedness, any latency outlier past p95. Patterns that repeat (more than three of the same flag in a week) become a JIRA ticket against the eval set and a candidate prompt or retrieval tweak. Langfuse trace retention is 30 days hot in the customer VPC plus one year cold in S3 inside their AWS account. Our on-call rotation runs one SRE a week against a 99.5% pipeline-availability SLO and the p95-under-700ms first-token SLO. The CX leadership team pulls a sample of 40 deflected calls a month for manual CSAT review — that signal feeds the prompt + retrieval iteration loop, not the eval set directly (the eval set stays frozen by design).

10 weeks · honest version

The timeline
including the week we sat on our hands.

Five stages, milestone-billed. The week-5 shadow run found a 1.4-second multilingual latency spike that would have torched the SLO in production. We halted the cutover, fixed the cache invalidation bug, added a tier-cached language-detection prefetch, and only then promoted to primary. The honest version of `ship in 10 weeks` includes the week we didn't ship.
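A sketch of the shape of that fix — prefetch every supported language at call start, then invalidate the per-call-SID detection cache on a mid-call switch. Names and the supported-language list are illustrative.

realtime/language-cache.ts typescript
// Week-5 fix, sketched: warm every supported language at call start, and invalidate
// the per-call-SID detection cache when the caller switches language mid-call.
const SUPPORTED = ["en-US", "es-US"] as const;   // illustrative list
type Lang = (typeof SUPPORTED)[number];

const detectionCache = new Map<string, Lang>();  // keyed by Twilio call SID

export async function warmLanguageCache(callSid: string, prefetch: (lang: Lang) => Promise<void>) {
  // Pre-warm the routing path for every supported language so a mid-call switch
  // never pays a cold detection round-trip (the 1.4s of dead air found in week 5).
  await Promise.all(SUPPORTED.map((lang) => prefetch(lang)));
  detectionCache.set(callSid, "en-US"); // default until the first real detection lands
}

export function onLanguageDetected(callSid: string, detected: Lang) {
  // The original bug: the first detection was cached and never invalidated,
  // so this branch never ran and the call stuck on the stale language.
  if (detectionCache.get(callSid) !== detected) {
    detectionCache.set(callSid, detected);
  }
}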

  1. Weeks 1–2

    Discovery + frozen eval set

    Two weeks shadowing the existing tier-1 voice queue. Pulled six months of call recordings (de-identified, customer consent already on file) and let the support lead label them. 240 frozen eval items — the 5 questions accounting for 62% of volume plus 30% adversarial (accent, background noise, multi-turn corrections) and 8% must-refuse (legal escalations, billing disputes, churn-save asks).

    240-item eval set + must-refuse list + latency SLO of 700ms
  2. Week 3

    Stack bake-off

    Two pipelines built in parallel: native gpt-realtime-2 speech-to-speech and a chained Whisper → GPT-5.4 → ElevenLabs path. Both wired to the same RAG over the help-center corpus. Ran 240 eval items through each, plus a soak test at 50 concurrent calls. Realtime won on p95 first-token by ~360ms; chained won on cost per minute by 28% but failed the latency SLO at the 95th percentile. Picked Realtime primary, chained kept as a documented fallback for the multilingual lane.

    Realtime primary · chained fallback · SLO-passing prototype
  3. Week 4

    Help-center RAG + tool surface

    Ingested 8,200 help-center articles into pgvector 0.7 (and mirrored into Pinecone serverless for the cost / tail-latency A/B). 480-token chunks, 80-token overlap, embeddings via text-embedding-3-large, cross-encoder rerank with bge-reranker-large. Three tools wired into the Realtime function-calling surface: lookup_article, handoff_to_human, schedule_callback. Zero write tools; the agent cannot mutate a customer record.

    Hybrid retrieval at 0.89 recall@5 · 3-tool surface frozen
  4. Week 5

    Shadow run — multilingual latency spike

    Two weeks shadowing the live queue (silent — calls still went to humans; the agent's response was logged but not played). Day 9 the SRE on rotation flagged a p99 latency spike on Spanish-to-English handoff calls: 1.4 seconds of dead air at the language switch. Root cause was a cache invalidation bug in the language-detection routing — first detection result was cached per call SID but never invalidated when the caller switched language mid-call. We halted prod cutover, added a tier-cached language detection prefetch at call start (every supported language warmed in the cache before the model needs it), and re-ran the soak. The honest version of `4-week shadow` includes this week.

    p99 multilingual latency dropped 1,420ms → 210ms after the fix
    Walk-away point
  5. Weeks 6–10

    Production cutover + cost lock-in

    Promoted to primary on the tier-1 inbound queue with the live agent line in warm-standby on a 1-second timer. Weeks 6–8 ran at 20% traffic with the support lead reviewing every flagged conversation. Weeks 9–10 ramped to 100% tier-1. The unit-economics SpecGrid below is the production-cut math at the 41k-call/month volume we currently see — not a projection.

    Full cutover · $0.10/call published · per-call trace store on hot retention
eval results · 240 frozen items

How we know
it works.

The eval set is frozen. Every model bump, prompt change, retrieval tweak, and tool-schema change re-runs the full 240. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2 points across all rows over the last 30 days.

metric · human baseline · v1 (wk 3) · v2 (wk 5) · current (live) · target
Tier-1 deflection rate (95% CI) · — · 31% (±5) · 35% (±4) · 38% (±5) · ≥ 35%
First-token latency p95 · — · 940ms · 680ms · 580ms · ≤ 700ms
Help-center recall@5 · — · 0.81 · 0.86 · 0.89 · ≥ 0.85
Wrong-answer rate (groundedness fail) · — · 2.4% · 1.6% · 0.8% · ≤ 1.0%
Human-handoff precision · — · 0.88 · 0.91 · 0.94 · ≥ 0.92
Per-call all-in cost · $4.00 · $0.18 · $0.13 · $0.10 · ≤ $0.15

Sample size for the deflection number is n=11,400 inbound calls across the 6-week shadow + 4-week production cut. The 38% point estimate has a 95% confidence interval of 33%–43%. First-token latency p95 is measured at the agent boundary (caller-side jitter buffer excluded). Per-call cost is the all-in deflected-call number; weighted-blended cost across deflected + handoff paths is ~$1.05/call. Multilingual handoff latency is measured at language-switch detection; per the kill-point fix, p99 now sits at 210ms.

Ready to ship

Want a case study like this
for your voice queue?

Book a $3K fixed-fee audit. We'll review the inbound voice workflow, model your call volume against the cost slider on this page, scope the eval set, recommend a Realtime / chained / hybrid stack, project the run-cost, and tell you honestly whether voice agents make sense for your traffic. About one audit in five ends with `keep the humans, here's the smaller automation we'd ship instead.`

30 min, async or live · cost-math reconciled to your real volume · walk-away point in the pilot