OpenAI Realtime API
Speech-to-speech · sub-600ms TTFT · barge-in · in-app voice copilots
AI voice agent development company shipping production conversational voice on three stacks — OpenAI Realtime API, chained STT+LLM+TTS, and Deepgram Voice Agent. Sub-600ms first-token latency, $0.005–$0.06/min honest cost range, eval-gated against your real call recordings. Telephony (Twilio · Telnyx · LiveKit · Daily) and mobile-native voice inside Flutter / iOS / Android / web. First pilot live in 5–7 weeks, behind a feature flag, with a walk-away point if the metric won't move.
Every voice ai agent pattern below has been shipped from this exact playbook — telephony inbound and outbound, in-app voice copilots, IVR replacement, multilingual support, voice-first ecommerce. Each one comes with an eval suite, audit logging, barge-in tuning, and a per-minute cost target — not a demo reel.
The crown-jewel use case. A real phone number (Twilio, Telnyx, or your existing SIP trunk) answered by a voice ai agent that handles tier-1 calls — appointment changes, order status, billing balance, store hours, password resets. Confidence gate at 0.7; anything below escalates to a human queue with a structured handoff summary, not a re-asked greeting. We ship these with barge-in, hold detection, and DTMF fallback for callers who hit a frustration ceiling.
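A minimal sketch of that confidence gate, assuming a per-turn intent classifier that returns a score; the threshold constant and the handoff payload shape are illustrative, not our production schema.

ESCALATION_THRESHOLD = 0.7  # below this, a human takes the call

def route_turn(intent: str, confidence: float, transcript: list[str]):
    """Handle in-agent above the gate; otherwise escalate with a summary."""
    if confidence >= ESCALATION_THRESHOLD:
        return {"action": "handle", "intent": intent}
    # Structured handoff so the human never re-asks the greeting.
    return {
        "action": "escalate",
        "summary": {
            "best_guess_intent": intent,
            "confidence": round(confidence, 2),
            "last_turns": transcript[-6:],  # recent context for the agent desk
        },
    }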
Outbound voice agents for appointment reminders, payment-due nudges, NPS surveys, and lead qualification. Per-call cost lands at $0.05–$0.15 (vs $4–$8 for a human callback) and you control the cadence + script + escalation rules. Compliance baked in — TCPA-aware time windows, opt-out detection on every turn, full call recording with consent prompts where required. Built on a chained stack to keep cost low at outbound scale.
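A sketch of the two compliance checks named above, assuming US-style calling windows; the window hours and opt-out phrases are illustrative, and the actual TCPA rules are a question for counsel, not a code snippet.

from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib, Python 3.9+

CALL_WINDOW = (8, 21)  # 8am–9pm in the callee's local time (illustrative)
OPT_OUT_PHRASES = ("stop calling", "take me off", "unsubscribe", "do not call")

def may_dial(callee_timezone: str) -> bool:
    """Gate each outbound dial on the callee's local clock."""
    local_hour = datetime.now(ZoneInfo(callee_timezone)).hour
    return CALL_WINDOW[0] <= local_hour < CALL_WINDOW[1]

def is_opt_out(turn_text: str) -> bool:
    """Check every transcribed turn for an opt-out signal."""
    text = turn_text.lower()
    return any(phrase in text for phrase in OPT_OUT_PHRASES)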
Voice agents that live inside your iOS / Android / Flutter / web app — not a phone call. Conversational ai voice for hands-free interfaces in food delivery, automotive, field service, and accessibility surfaces. Mobile-native delivery is a differentiator: we wrote the UI kit a lot of Flutter screens run on, so the voice surface ships with proper interrupt handling and mic-permission UX, not a half-baked WebRTC bolt-on.
Direct replacement of legacy IVR menus with a single open-ended voice ai agent that routes by intent instead of by keypad. Average call time drops by 40–90 seconds when callers can just say what they want, and the same agent can answer the question the caller would have been routed to anyway. Plays cleanly inside existing contact-center platforms (Genesys, Five9, Amazon Connect) via SIP REFER or media streaming.
Voice agents that operate across English, Spanish, Hindi, Portuguese, Arabic, Mandarin, French, and German out of the box. Chained stacks shine here — Deepgram or Whisper for STT in the caller's language, GPT-5 or Sonnet 4.6 for the reasoning step (both genuinely multilingual without separate models), and ElevenLabs or Cartesia for TTS that doesn't sound like a 2018 phone tree. Language detection on the first turn, no menu required.
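First-turn language detection can be as simple as reading the language field Whisper already returns; a sketch below using the open-source whisper package, with a made-up TTS voice map standing in for the real vendor catalog.

import whisper  # open-source openai-whisper package

TTS_VOICE_BY_LANG = {"en": "voice-en", "es": "voice-es", "hi": "voice-hi"}  # illustrative

model = whisper.load_model("large-v3")

def detect_and_route(first_turn_wav: str):
    """Transcribe the first turn, pick a TTS voice from the detected language."""
    result = model.transcribe(first_turn_wav)
    lang = result["language"]  # ISO-639-1 code Whisper detected
    return lang, TTS_VOICE_BY_LANG.get(lang, "voice-en")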
Voice agents for hands-busy commerce — kitchen-side reordering, drive-thru, in-car, accessibility-first browsing. Grounded in your Shopify / WooCommerce / commercetools catalog with function calls into order-management, payment-tokenization, and 3PL APIs. Realtime API is the right choice here when TTFT under 600ms is non-negotiable for the natural conversational feel customers expect.
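What a catalog-grounded function call looks like on the wire: a sketch of an order-lookup tool in the flat tool shape the Realtime API accepts (chat-completions nests the same fields under a function key). The function name and parameters are hypothetical, not a real OMS contract.

LOOKUP_ORDER_TOOL = {
    "type": "function",
    "name": "lookup_order",  # hypothetical; your OMS defines the real contract
    "description": "Fetch live order status from the order-management API.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "OMS order id"},
        },
        "required": ["order_id"],
    },
}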
The single biggest decision in voice ai development is which architecture you pick, and most listicles dodge it. OpenAI Realtime API, chained STT+LLM+TTS, and unified-vendor stacks each win different workloads. An honest per-dimension comparison follows; the audit picks the winner for your workload.
Numbers are typical production traces from shipped pilots. Per-workload picks vary with your eval data and call volume.
Five tactics stacked, ordered by impact on chained voice stacks. Most voice pilots see per-minute cost drop 60–80% at the same eval-suite quality after this pass. Realtime API has a tighter optimization surface (~$0.06/min floor) — chained is where the real cost engineering lives. This optimization pass is included in every voice agent pilot, post-cutover.
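The waterfall as arithmetic: starting from the ~$0.18/min naive chained baseline quoted below, multiplicative savings of roughly this shape land near the published $0.05/min figure. The per-tactic percentages are illustrative assumptions for the worked example, not measured guarantees.

baseline = 0.18  # $/min, naive Whisper + GPT-5 + ElevenLabs chained stack
tactics = {  # illustrative per-tactic savings, applied to the remaining cost
    "route cheap turns to mini models": 0.35,
    "cache the system prompt": 0.20,
    "idle-trim silence": 0.15,
    "swap Whisper for Deepgram Nova-3": 0.20,
    "shorten grounded prompts": 0.10,
}
cost = baseline
for tactic, saving in tactics.items():
    cost *= 1 - saving
print(f"${cost:.3f}/min")  # ≈ $0.057/min, inside the quoted range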
The voice agent is half the story; the telephony layer underneath is the other half. Twilio for ubiquity, Telnyx for margin on high-volume outbound, LiveKit for in-app voice copilots that share a codebase across mobile + web + phone, Daily for the most opinionated open-source stack via Pipecat. Pick a platform to see the integration code, auth model, and timeline.
from twilio.twiml.voice_response import VoiceResponse, Connect

# 1. Incoming call → TwiML returns <Connect><Stream/> to our WS endpoint
def incoming_call(request):
    response = VoiceResponse()
    connect = Connect()
    connect.stream(url="wss://voice.api.example.com/media")
    response.append(connect)
    return str(response)

# 2. Media WS receives 8kHz µ-law audio frames every 20ms
# 3. We re-encode to PCM, send into OpenAI Realtime or Whisper
# 4. TTS audio (ElevenLabs Flash) streams back to the Twilio media WS
# 5. Barge-in cancels in-flight TTS within ~280ms via a VAD signal
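A sketch of steps 2–5 on the media leg, assuming a websockets-style server, the documented Twilio Media Streams frame shapes, and the stdlib audioop codec (deprecated in 3.11 and removed in 3.13; the audioop-lts backport covers newer Pythons). The send_to_stt and cancel_tts hooks are placeholders for the downstream pipeline.

import audioop  # or the audioop-lts backport on Python 3.13+
import base64
import json

async def media_ws(ws, send_to_stt):
    """Consume Twilio Media Streams frames from a websocket connection."""
    async for message in ws:
        frame = json.loads(message)
        if frame["event"] == "media":
            mulaw = base64.b64decode(frame["media"]["payload"])
            pcm16 = audioop.ulaw2lin(mulaw, 2)  # 8kHz µ-law → 16-bit PCM
            await send_to_stt(pcm16)  # placeholder downstream hook

async def on_barge_in(ws, stream_sid: str, cancel_tts):
    """VAD fired mid-playback: stop TTS upstream, flush Twilio's audio buffer."""
    await cancel_tts()  # placeholder: abort in-flight TTS generation
    await ws.send(json.dumps({"event": "clear", "streamSid": stream_sid}))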
A chained voice ai agent is three vendors stacked — STT in, LLM in the middle, TTS out. Each layer has 2–3 production-grade choices in 2026 with meaningful tradeoffs on cost, latency, and language coverage. Here's the default stack we ship; we re-pick per workload when the eval data demands it.
Deepgram Nova-3 is our default for telephony (8kHz µ-law, streaming, sub-200ms partial transcripts, multilingual). Whisper-large-v3 still wins on rare-language coverage and noisy environments — we pick per workload. OpenAI Realtime API rolls its own STT internally so this layer disappears entirely when you go speech-to-speech. We benchmark on your real call audio before recommending; vendor benchmarks lie about phone-quality audio constantly.
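The benchmark itself is small; a sketch using the jiwer package for WER, where each transcribe callable is a placeholder wrapping a real vendor client.

from jiwer import wer

def pick_stt(eval_set, candidates):
    """eval_set: [(wav_path, human_reference_transcript), ...]
    candidates: {"deepgram": transcribe_fn, "whisper": transcribe_fn, ...}"""
    scores = {}
    for name, transcribe in candidates.items():
        references = [ref for _, ref in eval_set]
        hypotheses = [transcribe(wav) for wav, _ in eval_set]
        scores[name] = wer(references, hypotheses)  # lower is better
    return min(scores, key=scores.get), scores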
GPT-5 / Sonnet 4.6 for grounded multi-turn reasoning, function calling, and tool use (the part that turns a voice ai agent into something that does work, not just talks). GPT-5-mini or Haiku 4.5 for cheap classify + routing layers in front. Realtime API ties you to OpenAI's reasoning model — fine when latency dominates the requirement, less fine when your eval data says Claude wins for your domain. Model-agnostic by default, locked-in only when you ask for it.
ElevenLabs Flash v2.5 for sub-200ms TTFT and natural prosody (the default for in-app voice copilots and premium telephony). Cartesia Sonic for the cheapest-per-character production-grade voice (winning on long outbound calls where per-minute cost dominates). OpenAI TTS (gpt-4o-mini-tts) when you want a single-vendor stack. The choice is a $/min + brand-voice decision, not a quality one — all three sound human in 2026.
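The whole chained anatomy fits in one turn of code; a heavily simplified sketch where each stage function is a stub standing in for a streaming vendor client. Production code streams partial STT and TTS rather than blocking per stage.

def stt(pcm_audio: bytes) -> str:
    raise NotImplementedError("swap in Deepgram Nova-3 or Whisper here")

def llm(history: list[dict]) -> str:
    raise NotImplementedError("swap in GPT-5 or Sonnet with tool use here")

def tts(reply_text: str) -> bytes:
    raise NotImplementedError("swap in ElevenLabs Flash or Cartesia here")

def chained_turn(pcm_audio: bytes, history: list[dict]) -> bytes:
    """One caller turn through the chain: audio in, audio out."""
    text = stt(pcm_audio)
    history.append({"role": "user", "content": text})
    reply = llm(history)
    history.append({"role": "assistant", "content": reply})
    return tts(reply)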
The disqualifying signal when shortlisting any voice ai company is asking how they measure quality and getting marketing-speak back. Here are the four numbers we publish on every voice agent pilot, measured on a held-out eval set built from your real call recordings. If a pilot lands below the targets, we don't roll out; we tune. A minimal eval sketch follows the four definitions below.
On a held-out eval set of 200+ real call snippets from your domain, what fraction does the agent classify and respond to correctly? Most production voice ai agents we see land at 84–93% on tier-1 intents after tuning. Below 80% on a frozen eval set means the prompt or retrieval needs work before rollout, not after.
How often does the agent invent a policy, a balance, an hours-of-operation answer, or a product feature that doesn't exist? Measured by sampled human review (200 calls/week minimum during pilot). Target is <2 per 100 calls before production rollout. Anything above 5 means your RAG grounding is missing the source documents the agent needs.
When the caller interrupts the agent mid-sentence, how fast does the agent stop talking? Target: <300ms p50, <500ms p95. Above 500ms and the conversation feels robotic — callers start over-talking instead of waiting. This is the metric that decides whether a voice agent feels alive or feels like 2015 IVR.
From call start to the moment the agent does the thing the caller wanted — booked the appointment, looked up the order, queued the refund. Target depends on the workflow, but we publish the baseline you'll be measured against during the audit. Most replacements of human tier-1 calls cut MTTA by 40–70% just by removing the queue wait.
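The promised sketch: a minimal harness for two of the four metrics, assuming an illustrative eval-row shape; statistics.quantiles is stdlib.

from statistics import quantiles

def intent_accuracy(rows) -> float:
    """rows: [(predicted_intent, gold_intent), ...] from the held-out set."""
    hits = sum(pred == gold for pred, gold in rows)
    return hits / len(rows)  # target: >0.85 on tier-1 intents

def barge_in_p50_p95(samples_ms):
    """samples_ms: measured interrupt-to-silence gaps across test calls."""
    cuts = quantiles(samples_ms, n=100)  # 99 percentile cut points
    return cuts[49], cuts[94]  # p50 target <300ms, p95 target <500ms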
Same pricing as our other engagements. Most voice ai companies on a typical shortlist hide pricing; we publish it. Audit first to scope architecture + telephony, run a 5–7 week pilot on the highest-ROI workflow, then move to the continuous tier if you want to ship the next 2–3.
Find the voice workflow worth shipping, in the right architecture, before any build commitment.
One voice ai agent shipped end-to-end on your chosen architecture and telephony stack, with eval data, not a demo reel.
Embedded squad shipping new voice workflows + tuning the live ones.
Four stages, milestone-billed, with a walk-away point at the vendor-pick decision. Most voice agent failures happen because the team picked the architecture by ideology, not eval — both are in week 1 and week 2 here, not after the pilot bill arrives.
Harvest 50–200 real call snippets from your call recordings (or run a structured intake if you're greenfield). Build the eval set the voice ai agent will be measured against — intent-match accuracy on each call type, expected MTTA per workflow. Scope locked: which call types in, which out, who escalates to whom.
Wire the chosen telephony stack (Twilio · Telnyx · LiveKit · Daily) into a sandbox number. Run STT options against your real call audio — Whisper vs Deepgram vs Realtime — and pick per WER on YOUR audio, not vendor benchmark audio. TTS A/B for brand voice. RAG corpus ingested if grounding is in scope.
Wire the full anatomy: telephony → STT → LLM (with tool use + RAG) → TTS → guardrails → log. Tune the hard parts: barge-in thresholds (<300ms p50), endpointing (when has the caller stopped talking?), hold-music handling, DTMF fallback, escalation handoff with structured summary. Behind a feature flag on a sub-100-call cohort.
Run the 4-metric eval against shadow traffic for 2 weeks. Hallucination review by human sample. Roll out 10% → 50% → 100% gated by eval movement. Token-optimization pass post-cutover: route to cheaper models per turn, cache the system prompt, idle-trim. Most voice pilots see per-minute cost drop 60–80% at the same eval-suite quality.
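A sketch of the per-turn routing from stage 4's optimization pass: a cheap classifier decides whether the mini model can take the turn. The classify hook, intent set, and model names are placeholders for whatever your eval data picks.

CHEAP_INTENTS = {"store_hours", "order_status", "balance", "opt_out"}  # illustrative

def pick_model(turn_text: str, classify) -> str:
    """Route each turn to the cheapest model the eval data says can handle it."""
    intent = classify(turn_text)  # cheap classify/routing layer in front
    if intent in CHEAP_INTENTS:
        return "gpt-5-mini"  # the cheap tier for simple, grounded turns
    return "gpt-5"  # full model for multi-step reasoning and tool use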
Three anonymized capability patterns drawn from real voice ai agent engagements — one on the Realtime API (mobile-native), one chained (outbound telephony), one chained-multilingual (IVR replacement). Named references shared under NDA once we know what you're building.
Tier-1 voice queue averaging a 4-minute wait at peak. Five inbound questions accounted for 62% of call volume, and the existing IVR bounced 80%+ of calls to a human. The binding constraint was sub-700ms first token; anything slower and US callers report a robot.
gpt-realtime-2 speech-to-speech over a help-center pgvector RAG, Twilio + Cloudflare Workers for sub-60ms ingress, handoff_to_human as a function-calling tool with confidence gating at 0.7. Whisper fallback for accent + noise. Kill point caught + fixed at week 5 (multilingual cache invalidation).
Customers re-ordering favourites mid-commute or hands-busy in the kitchen had a 7-tap nested-menu reorder flow. Drop-off at step 3 was 38%. Voice was the obvious primitive — but every hosted voice SDK added 200–400ms of round-trip that broke the conversational feel.
OpenAI Realtime API integrated directly into the Flutter shell via a thin Dart-to-WebRTC bridge. Function calls into the existing order-management API (no rewrite required). Sub-600ms TTFT measured end-to-end on the user's actual device, not on a wired desktop. Barge-in handled natively. Mic-permission UX and on-screen visual feedback shipped in the same release.
Multi-location clinic running 4,000 outbound reminder calls weekly through a human contact-center, each call $4–6 fully-loaded. Reschedule rate captured at the call was <30% because reps couldn't see the calendar live.
Chained stack (Telnyx + Deepgram Nova-3 + GPT-5-mini + ElevenLabs Flash) for outbound cost economics. Function calls into the EHR's scheduling API to offer real open slots live during the call. TCPA-aware calling windows, opt-out detection per turn, audit logging of every consent prompt. Escalation to a human queue when the caller asked anything outside reschedule scope.
Legacy IVR for a multi-country retail brand routed calls across 6 languages with 4-level keypad menus. Average pre-agent wait was 90 seconds; 22% of callers abandoned before reaching a human; non-English callers fared worst because the menu was English-first.
Single open-ended voice agent answers in the caller's language (detected on first turn) and routes by intent, not by keypad. Chained stack — Whisper-large-v3 for STT (best rare-language coverage), Sonnet 4.6 for multilingual reasoning, ElevenLabs Multilingual v2 for TTS. Integrated into the existing Genesys contact center via SIP REFER so the human-handoff queues didn't change.
Five things, in this order. (1) An audit — 1–2 weeks, $3K — that picks the voice workflow worth shipping, the architecture (Realtime · chained · unified vendor), and the telephony stack (Twilio · Telnyx · LiveKit · Daily). (2) A pilot — 5–7 weeks, $10–25K fixed — that ships one voice ai agent end-to-end on your chosen stack with the full anatomy wired (telephony → STT → LLM → TTS → guardrails → log) and gated by eval data, not a demo. (3) The 4-metric eval pass — intent-match accuracy, hallucination per 100 calls, barge-in latency, mean-time-to-action — run on a held-out set from your real call recordings. (4) Pilot rollout 10% → 50% → 100% gated by metric movement, with a walk-away point. (5) Continuous tier from $5K/month if you want to ship the next workflow or tune the live one. We're a voice ai company that publishes pricing and architecture choices openly — most listicle SERPs hide both.
Neither wins universally. Here's the honest decision. <strong>OpenAI Realtime API</strong> wins when first-token latency is the dominant requirement (in-app voice copilots, premium telephony, anything where conversational feel breaks above 800ms) and you can pay ~$0.06/min in production. <strong>Chained STT+LLM+TTS</strong> wins on cost (10–30× cheaper at scale), model flexibility (swap Claude ⇄ GPT per turn), and multilingual coverage — the right pick for high-volume outbound, multilingual support, and any workflow where per-minute economics dominate. <strong>Deepgram Voice Agent</strong> (unified vendor) wins when you want the fastest time-to-pilot with a single SLA and don't need to swap models. Our audit picks per workload, not per ideology — the openai realtime api development path is the right answer about a third of the time in our shipped pilots.
Honest numbers from shipped pilots. <strong>OpenAI Realtime API</strong>: ~$0.06/min all-in (model + audio in/out + telephony). <strong>Chained Whisper + GPT-5 + ElevenLabs Multilingual</strong>: ~$0.18/min naively, down to $0.05/min after the standard optimization pass (route to mini models per turn, cache the system prompt, idle-trim silence, swap Whisper for Deepgram Nova-3). <strong>Chained Deepgram + GPT-5-mini + Cartesia</strong>: ~$0.005–$0.02/min — the cheapest stack we ship, used on high-volume outbound. <strong>Deepgram Voice Agent</strong>: ~$0.04/min bundled. Per-minute cost is published in every audit deliverable in $/min terms, not vague "depends on usage" ranges. The waterfall above shows the optimization pass we run on every pilot.
Yes — that's the modern definition. A voice ai agent that just answers FAQs is a 2018 IVR with better TTS. Agentic ai development for voice means the agent can call your APIs (refund a charge, book an appointment, look up an order), retrieve from your RAG corpus per turn (catalog, policy docs, ticket history), make multi-step decisions (verify identity → check policy → execute action), and hand off to a human with a structured summary when policy says it should. We build voice agents on the same agentic-workflows stack as our <a href="/services/ai-agent-development/">AI agent development</a> work — the only difference is the interface is voice, not chat. Most agentic process automation projects benefit from a voice surface for a subset of the workflow; we'll tell you which during the audit.
Four metrics, evaluated on a held-out set drawn from your real call recordings. (1) <strong>Intent-match accuracy</strong> — does the agent classify and respond to each call type correctly? Target >85% on tier-1 intents. (2) <strong>Hallucination rate per 100 calls</strong> — how often does the agent invent a policy, balance, or feature that doesn't exist? Target <2 per 100 calls before rollout. (3) <strong>Barge-in latency</strong> — when the caller interrupts, how fast does the agent stop? Target <300ms p50, <500ms p95. (4) <strong>Mean-time-to-action (MTTA)</strong> — from call start to the moment the agent does the thing the caller wanted. Baseline measured during the audit, target movement defined per workflow. We sample 200 calls/week minimum during pilot for human review. Listicle competitors don't publish eval methodology because they don't run it — this is the disqualifying signal when shortlisting any voice agent platform vendor.
Five honest scenarios where we'll tell you not to build a voice agent. (1) <strong>Low-bandwidth or noisy environments</strong> — factory floors, transit, anywhere the STT layer can't get clean audio reliably. Text or a structured form wins. (2) <strong>Regulated read-backs</strong> — when the caller must read or confirm specific text verbatim (financial disclosures, prescription instructions), voice is slower and error-prone vs a screen with a checkbox. (3) <strong>Complex UI flows</strong> — flight selection, multi-product configuration, anything that needs a comparison grid. Voice forces serial enumeration of options. (4) <strong>Cost-sensitive workflows at massive scale</strong> — if your per-interaction budget is sub-$0.01 and volume is in the tens of millions, even the cheapest chained stack will outprice a chat or async channel. (5) <strong>Users who don't want to talk</strong> — most under-35 cohorts in B2C SaaS, anyone in an open-plan office, anyone who's already on a video call. The right voice strategy is often a voice surface for some of your traffic, not all of it.
Yes — and that's a meaningful differentiator. Most voice ai company SERPs are telephony-centric (Vapi, Retell, Synthflow all start at a phone number). We ship voice copilots inside iOS, Android, Flutter, and web apps via LiveKit Agents SDK or a direct Realtime-API WebRTC bridge. We wrote the UI kit that a lot of Flutter screens run on, so the voice surface ships with proper mic-permission UX, on-screen visual feedback during turn-taking, and interrupt handling that doesn't fight the OS. See the food-delivery capability pattern above for a shipped example — sub-600ms TTFT measured on real devices, +18% order-completion lift on the voice cohort. For mobile-specific work, the cross-link is <a href="/flutter-app-development-company/">Flutter app development</a>.
Honest answer — sometimes yes. <strong>Buy a platform</strong> (Vapi, Retell, Synthflow, Bland, Air) when your workflow is standard (inbound deflection, outbound reminders, appointment scheduling), your call volume is moderate (under ~100K minutes/month), and you don't have a unique RAG corpus or unusual tool surface. You'll be live in days, not weeks, and the per-minute cost premium (typically 30–60% above a custom build) is worth it for the time-to-value. <strong>Build custom</strong> when you need (a) deep RAG over a proprietary corpus, (b) tool-use against internal systems no platform integrates with, (c) a mobile-native or in-app voice surface, (d) per-minute economics that platform pricing can't hit at your volume, or (e) a vendor-agnostic stack you can swap as the model landscape moves. Our audit will tell you which side you're on — we've recommended "go buy Vapi" to about a quarter of voice audit clients and that's the right answer for them. <a href="/services/ai-integration-services/">AI integration services</a> often picks up the platform integration work when the answer is buy.
Voice is rarely the only AI surface you're building. These sibling pages cover the adjacent decisions — text channels, multi-step agents, the Realtime API itself, and mobile-native delivery.
01 The text-channel sibling: when async beats voice, or voice is the wrong primitive.
02 Multi-step agents with tool use: voice is one interface to the same agentic stack.
03 Realtime API depth and openai realtime api development: the speech-to-speech path.
04 Plug your voice agent into Salesforce, Zendesk, HubSpot, EHR, OMS.
05 Mobile-native voice copilots inside your Flutter / iOS / Android app.
06 When the voice turn is one step in a larger agentic process automation pipeline.
Book a free AI voice agent audit. We'll review your call recordings or use-case shortlist, recommend the architecture (Realtime · chained · unified) and telephony stack (Twilio · Telnyx · LiveKit · Daily), project per-minute cost in $/min terms, design the 4-metric eval set, and give you a 90-day voice agent roadmap. No deck, no obligation to build.