← all case studies
E-commerce · DTC apparel · Flutter mobile · Voice copilot · OSS Flutter widget · A/B-evaluated
gpt-realtime-2 · role: OpenAI Realtime API · streaming voice + function calls
Flutter 3.24 · role: Mobile shell · iOS + Android · single codebase
GetWidget OSS · role: Our Flutter UI kit · 4.8k★ on GitHub · BSD-3
Algolia · role: Catalog facet index · existing · no rewrite
Cloudflare Workers · role: Edge · ephemeral-key minting · WebRTC signalling
Whisper large-v3 · role: STT fallback · chunked HTTP when WebRTC degrades
case study · 2026 · anonymized

An AI chatbot case study, where the chat
is voice.

A mid-market DTC apparel retailer's Flutter app was converting 18 points below its desktop site. The in-app search UX score was 2.8/5, and the team had already failed two on-device voice A/B tests on the trigger UX. We shipped a tap-to-talk voice copilot on gpt-realtime-2 — function-calling into the existing Algolia facet index, embedded via a new GFVoiceCopilot widget in our open-source Flutter UI kit. Eight weeks, A/B-evaluated, with a kill point at week 5 that we used.

+11.4 pts
mobile conversion · voice-engaged vs control · n=42,318 sessions over 30d A/B · ±1.6pt CI
p95 580 ms
first-token end-to-end on iPhone 13 + Pixel 7 · cellular and Wi-Fi blend
2.8 → 4.2
in-app search UX score on the voice cohort · n=812 post-session prompts
8 weeks
discovery to 50% A/B rollout · 1 cart-abandon halt at wk 5
shipped
8 weeks · 3 Flutter engineers · 1 AI engineer · 1 product designer
−18 pts
mobile vs desktop conversion gap before the build
2.8 / 5
in-app search UX rating from a 1,200-user survey
2x failed
prior on-device voice A/B tests · trigger UX rejected
4.8k★
GetWidget OSS Flutter UI kit · the OSS foundation this voice surface ships on
the problem

A Flutter app
losing the desktop crown.

Mobile-first audience, mobile-last conversion. The constraint wasn't model choice — it was that the team had already failed twice on voice UX, and the bar for a third try was high.

The client is a mid-market US DTC apparel retailer — Flutter-first mobile app, roughly 1.4M monthly active users, mobile-app traffic running at 71% of total sessions but converting 18 points below the desktop site. The merch team had spent the last fiscal cycle running A/B tests on the in-app browse and search experience without moving the conversion needle materially; the customer survey scored the in-app search UX at 2.8 out of 5, which the head of product called "the loudest signal in the dashboard."

today vs. with the voice copilot

today

  1. Shopper opens app
  2. Browse tab · scroll · tap · type
  3. Touch search · 2.8/5 UX score
  4. Refine + facet
  outcome · −18 pts vs desktop · cart abandons on category browse

with the voice copilot

  1. Shopper opens app
  2. Tap-to-talk · GFVoiceCopilot · on-device VAD · barge-in
  3. gpt-realtime-2 streaming · function-calls into Algolia
  4. Grid re-renders live
  outcome · Add-to-cart · voice cohort
  outcome · Refine + browse · grid narrows
  outcome · Handoff · chat or human

The presenting problem was specific. The browse-then-search flow was where conversion was leaking — shoppers opening the app, scrolling for a minute, tapping into search, hitting the typed-query field, and bouncing before the result page loaded. Median search-to-cart latency was 14 seconds on cellular; the touch search bar collected three to five keystrokes on average before submission, often with typos that the existing fuzzy-match index couldn't recover. The merch team had a hypothesis backed by their loyalty-cohort interviews: voice was the obvious primitive, but the team had already failed twice on voice A/B tests, both rejected on UX grounds.

Failure one had been a hot-word listener — the kind that says "hey, ready to shop?" and waits for a verbal cue. Users found it intrusive; battery drain showed up in the App Store reviews. Failure two was a slow chained STT-then-LLM-then-TTS stack that hit 1.4-second first-token latency end-to-end; the conversational feel broke completely and the engagement metric collapsed below the control variant. The product head's framing in the kickoff was direct: "If you can't get tap-to-talk to feel fast and the trigger UX to not annoy people, this is the third strike and we don't try voice again for a year."

They had also looked at hosted voice SDKs — Vapi, Retell, Synthflow — and turned each one down. The objections were operator-grade for a mobile app: every hosted SDK added 200–400 ms of vendor round-trip that broke the sub-second feel, every one required the customer's OpenAI key to ship in the Flutter binary or live in a server proxy the SDK was opinionated about, and every one wanted to own the UI affordance. The retailer's product team wanted to own the affordance, ship it in their existing design system, and have the audio path under their direct control. That framing decided the engagement. We didn't pitch a hosted SDK. We pitched a widget, in our OSS Flutter library, with the audio path going through Cloudflare Workers ephemeral-key minting straight into OpenAI Realtime over WebRTC. The rest of the page is what we shipped.

the approach

Six pipeline stages,
one widget on top.

Tap-to-talk fires the on-device VAD, opens a WebRTC channel to gpt-realtime-2 over a Cloudflare-minted ephemeral key, and streams partial transcript back as the user speaks. Function calls hit the existing Algolia facet index; recommendations stream back and re-render the product grid live.

The architecture below is the production shape. The audio capture path is owned by the Flutter widget — on-device VAD via a Silero-style filter, mic-permission flow that respects iOS AVAudioSession backgrounding, and a cellular-aware bitrate that falls to chunked HTTP if the WebRTC handshake degrades twice. We did not build a custom STT model; the gpt-realtime-2 endpoint accepts raw audio frames directly over the WebRTC PeerConnection. Whisper-large-v3 exists in the stack only as the HTTP fallback when WebRTC fails — which it does about 1.4% of the time on US cellular, mostly on subway-tunnel transitions.
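
The fallback rule is mechanical enough to sketch. Below is a minimal version of the transport policy under the stated two-consecutive-failures trigger; the file path, class name, and method shape are illustrative, not the shipped internals.

lib/voice/transport_policy.dart dart · sketch
// Transport selection under the stated policy: WebRTC primary,
// chunked HTTP + Whisper once the handshake fails twice in a row.
// Sketch only; the production widget folds this into session state.
enum VoiceTransport { webrtc, chunkedHttp }

class TransportPolicy {
  TransportPolicy({this.maxConsecutiveFailures = 2});

  final int maxConsecutiveFailures;
  int _failures = 0;

  /// Transport the next turn should use.
  VoiceTransport get current => _failures >= maxConsecutiveFailures
      ? VoiceTransport.chunkedHttp
      : VoiceTransport.webrtc;

  /// Record a handshake outcome; success resets the streak.
  void onHandshake({required bool succeeded}) {
    _failures = succeeded ? 0 : _failures + 1;
  }
}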

The signalling and ephemeral-key path runs on Cloudflare Workers at the edge. The Flutter app hits a Worker endpoint at session start; the Worker checks the user's anonymous device id, mints a sub-second-TTL token against the OpenAI Realtime API, and returns the token to the Flutter client. The token is never persisted on the device, never logged, and rotates per session. This was non-negotiable from the retailer's security team — no OpenAI secret ships in the Flutter binary at any point, no long-lived token sits in client storage. The Cloudflare Worker is BAA-eligible if the retailer wants to scope the path under HIPAA later (they don't need it today, but the security review asked).
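
From the Flutter side, session start is one authenticated fetch. A sketch of that path, assuming a hypothetical /voice/session Worker route and response shape; the shipped endpoint and field names may differ.

lib/voice/session_mint.dart dart · sketch
// Session start: ask the Worker for an ephemeral Realtime token.
// The token is held in memory for the life of the session only;
// it is never written to storage or logs.
import 'dart:convert';

import 'package:http/http.dart' as http;

class EphemeralSession {
  const EphemeralSession({required this.token, required this.expiresAt});
  final String token; // short-TTL key minted server-side
  final DateTime expiresAt;
}

Future<EphemeralSession> mintSession(String anonymousDeviceId) async {
  final res = await http.post(
    Uri.parse('https://edge.example.com/voice/session'), // hypothetical route
    headers: {'content-type': 'application/json'},
    body: jsonEncode({'device_id': anonymousDeviceId}),
  );
  if (res.statusCode != 200) {
    throw http.ClientException('mint failed: ${res.statusCode}');
  }
  final json = jsonDecode(res.body) as Map<String, dynamic>;
  return EphemeralSession(
    token: json['token'] as String,
    expiresAt: DateTime.parse(json['expires_at'] as String),
  );
}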

Function-calling is where the integration cost is the lowest in the build. The retailer's existing Algolia index is three years old and tuned by their merchandising team — synonyms, redirect rules, faceted boost configurations, the works. We did not rebuild it. The Realtime API function-call surface gets four read tools: `search_catalog`, `narrow_facets`, `cart_status`, `account_summary`. Each tool is a thin Cloudflare Worker that proxies into the retailer's existing internal API; nothing in the catalog data pipeline changed. When the model calls `search_catalog`, the result streams back into the model context for narration, and the Flutter widget receives a parallel callback on `onSuggestions` so the product grid re-renders without waiting for the model's spoken response.
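
For shape, the four read tools register on the session as function-call schemas roughly like the following. The tool names are the real surface from the build; the parameter schemas are assumptions for the sketch.

lib/voice/tools.dart dart · sketch
// The four read tools as function-call schemas. Names match the
// build; parameter shapes are assumed for illustration.
const List<Map<String, dynamic>> readTools = [
  {
    'type': 'function',
    'name': 'search_catalog',
    'description': 'Query the existing Algolia facet index.',
    'parameters': {
      'type': 'object',
      'properties': {
        'query': {'type': 'string'},
        'facets': {'type': 'object'}, // e.g. {"category": "denim"}
      },
      'required': ['query'],
    },
  },
  {
    'type': 'function',
    'name': 'narrow_facets',
    'description': 'Tighten the current result set by facet.',
    'parameters': {
      'type': 'object',
      'properties': {
        'facets': {'type': 'object'},
      },
      'required': ['facets'],
    },
  },
  {
    'type': 'function',
    'name': 'cart_status',
    'description': 'Read the current cart.',
    'parameters': {'type': 'object', 'properties': {}},
  },
  {
    'type': 'function',
    'name': 'account_summary',
    'description': 'Read a summary of the signed-in account.',
    'parameters': {'type': 'object', 'properties': {}},
  },
];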

The audio output path streams the model's voice response back over the same WebRTC channel. We did not add a separate TTS provider — gpt-realtime-2's native voice quality is at the bar the product team needed for the apparel cohort (the team's brand voice is intentionally conversational, not corporate). Barge-in is handled in the widget: if the user taps the button again mid-response, the WebRTC channel sends a `response.cancel` event and the audio pipeline flushes cleanly. The transcript chip overlay shows the model's last words at the cancel boundary so the user knows what they interrupted.
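
Barge-in reduces to one event plus a local flush. A minimal sketch, assuming a flutter_webrtc data channel carrying Realtime events; the flush callback stands in for the widget's internal audio pipeline.

lib/voice/barge_in.dart dart · sketch
// Tap-to-cancel mid-response: send `response.cancel` on the event
// channel, flush local playback, and pin the last spoken words to
// the transcript chip so the user sees what they interrupted.
import 'dart:convert';

import 'package:flutter_webrtc/flutter_webrtc.dart';

Future<void> bargeIn({
  required RTCDataChannel events,
  required Future<void> Function() flushAudio, // host-owned pipeline flush
  required void Function(String) pinTranscriptChip,
  required String lastSpokenWords,
}) async {
  await events.send(
    RTCDataChannelMessage(jsonEncode({'type': 'response.cancel'})),
  );
  await flushAudio();
  pinTranscriptChip(lastSpokenWords);
}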

three decisions that shaped the build
design decision · 01

Tap-to-talk, not always-on listening

we rejected
Hot-word triggered listening
because
The two prior on-device voice A/B tests failed on the always-on UX — users felt watched, mic-permission dialogs lit up, and battery drain showed up in the support tickets. Tap-to-talk is the explicit user action; the visual affordance is what the trust math turned on.
design decision · 02

WebRTC primary, chunked HTTP fallback

we rejected
WebRTC only · degrade silently if it fails
because
Cellular networks in the US retail demographic drop WebRTC handshakes more often than the listicle benchmarks suggest. We added a Whisper-STT-over-HTTP fallback that fires after two failed handshakes; the user never sees a degraded transport, just a slightly slower turn.
design decision · 03

Function-call into existing Algolia, no facet rewrite

we rejected
Build a new vector-search index for the catalog
because
The retailer's existing Algolia index was tuned over three years of merch experiments — rebuilding it would have lost institutional knowledge encoded in synonyms, redirect rules, and merchandising overrides. The voice agent function-calls into the same index a human typing into the search bar would hit.

The reason this shape works is the same reason we scoped it this way in week 1. Every component has a separately measurable contract. The widget's tap-to-talk affordance is measurable in tap-rate-per-session and abandon-rate-on-mic-permission. The audio capture path is measurable in VAD latency and barge-in cleanliness. The WebRTC transport is measurable in handshake-success and per-turn round-trip. The model is measurable in first-token latency and out-of-scope-handoff rate. The function-call layer is measurable in tool-call success rate against the existing Algolia 99.9% SLO. The grid re-render is measurable in time-to-product-visible. When something regresses, the per-component metric tells the team which subsystem broke — not a single conversion number that hides the cause.

Sentry runs the breadcrumb path on every voice turn — mic-permission grant, VAD trigger, WebRTC handshake, model response, function-call dispatch, grid re-render. Every breadcrumb is tagged with the A/B cohort, the network class (Wi-Fi, 5G, 4G, cellular-degraded), and the device class. The product team reads the cohort funnel daily in Mixpanel; the engineering team reads the per-turn breadcrumb in Sentry. That observability split is what made the week-5 cart-abandonment bug a same-day catch, not a launch-week embarrassment; the timeline section below has the honest version.
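
The breadcrumb shape is worth pinning down, because the tags are what make the Sentry/Mixpanel split work. A sketch with sentry_flutter; the category and tag keys mirror the prose, not necessarily the shipped names.

lib/voice/observability.dart dart · sketch
// One breadcrumb per pipeline stage, tagged with cohort, network
// class, and device class so a regression slices cleanly in Sentry.
import 'package:sentry_flutter/sentry_flutter.dart';

Future<void> voiceCrumb(
  String stage, { // e.g. 'webrtc_handshake', 'grid_rerender'
  required String abCohort, // 'treatment' | 'control'
  required String networkClass, // 'wifi' | '5g' | '4g' | 'cell-degraded'
  required String deviceClass, // e.g. 'iphone13'
  Map<String, dynamic> extra = const {},
}) {
  return Sentry.addBreadcrumb(Breadcrumb(
    category: 'voice.$stage',
    message: stage,
    data: {
      'ab_cohort': abCohort,
      'network_class': networkClass,
      'device_class': deviceClass,
      ...extra,
    },
  ));
}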

under the hood

The voice copilot,
tap to product grid.

Tap-to-talk fires the on-device VAD, opens a WebRTC channel to gpt-realtime-2 over a Cloudflare-minted ephemeral key, and streams partial transcript back as the user speaks. Function calls hit the existing Algolia facet index; recommendations stream back and re-render the product grid live.

outcome · primary · Tap → add-to-cart · voice-engaged sessions lift mobile conversion +11.4 pts
outcome · neutral · Refine + browse · voice narrows the grid · user keeps tapping touch
outcome · safety · Handoff · chat or human · out-of-scope intent (returns, support) → chat surface

latency budgets are p50/p95 measured on-device (iPhone 13 + Pixel 7) over a 30-day A/B · first-token p95 580 ms end-to-end · sub-1s perceived

on-device VAD
audio doesn't leave until the user taps · no always-on mic
ephemeral keys
Cloudflare Workers mint a sub-second TTL token · no client-side OpenAI secrets
OSS-anchored
the button affordance ships from the public GetWidget Flutter kit · clients can fork
A/B-first
30-day A/B with control before the engineering team accepted any conversion claim
in-app · synthetic replay

Tap-to-talk,
the grid responds.

The phone mock alongside is a stylised replay of one voice-engaged session — partial transcript surfaces as the user speaks, the grid narrows by facet, recommendations slide in. Real sessions are sub-second to first token; the animation here is deliberately slower than production so the sequence is legible.

  • 01 · Tap-to-talk fires on-device VAD and opens the WebRTC channel.
  • 02 · Partial transcript surfaces in the chip overlay as the user speaks.
  • 03 · Function call hits Algolia · facets narrow live.
  • 04 · Recommendations stream back · grid re-renders without rebuild.
the stack

Named tools,
OSS where it matters.

The voice surface ships from the GetWidget OSS Flutter library — clients can read the source, fork the widget, and ship a custom variant if they need to. The model + transport are commercial; the affordance the user touches is open. That split is the credibility moat for this build.

gpt-realtime-2 · OpenAI Realtime API · WebRTC · role: voice + reasoning
Whisper large-v3 · role: STT fallback over chunked HTTP
Flutter 3.24 · role: iOS + Android single codebase
GetWidget UI Kit · OSS BSD-3 · 4.8k★ · role: voice copilot button · grid widgets
Algolia · role: catalog facet index · existing
Cloudflare Workers · role: ephemeral-key mint · signalling
WebRTC · role: audio transport
Sentry · role: mobile crash + breadcrumb · A/B cohort tag
Mixpanel · role: funnel A/B analytics
how it actually runs

Production shape,
under the hood.

Numbers below are from the current production cut. Latency is measured on-device on iPhone 13 + Pixel 7; cost math uses OpenAI's published gpt-realtime-2 pricing as of May 2026; eval composition is the A/B-test design the team gated on before any rollout.

Most voice-in-app case studies stop at the architecture diagram. Ours doesn't, because the team that decides whether to recommend the engagement to the next retailer — the product head and the head of mobile engineering — open a case study looking for specific things: per-stage latency with p95 on real devices over real networks, cost-per-minute math that ties to the model vendor's published price card, an A/B-test design with kill-switches, and the App Store / Play Store posture for the mic-permission ask. Vendors who don't show this either don't have it or are hiding it. Every number below is reproducible from a Sentry breadcrumb, a Mixpanel funnel slice, or a published vendor price page.

latency budget

Per-stage P50 / P95 (ms) · on-device

| stage | p50 (ms) | p95 (ms) | tooling |
|---|---|---|---|
| Tap-to-talk · widget render | 16 | 38 | Flutter 3.24 · GetWidget GFVoiceCopilot · MaterialState |
| On-device VAD + capture | 28 | 64 | Silero-style filter · flutter_sound · 16 kHz mono |
| WebRTC handshake (per-session, amortised) | 240 | 420 | Cloudflare Workers signalling · ephemeral key mint |
| First audio frame in → model context | 92 | 180 | Cloudflare edge → OpenAI Realtime · steady-state |
| gpt-realtime-2 first-token latency | 380 | 580 | OpenAI Realtime · streaming TTS in same channel |
| Function-call → Algolia → grid re-render | 64 | 138 | Worker proxy · 4 read tools · grid diff render |
| Total (perceived first-token) | ≈ 480 | ≈ 580 | on-device · cellular + Wi-Fi blend · iPhone 13 + Pixel 7 |

p50/p95 measured from Sentry per-turn breadcrumbs over a 30-day window on the treatment cohort (n ≈ 28,400 voice turns). WebRTC handshake is per-session and amortised across an average of 6.4 turns per session — it doesn't gate the first-token feel after turn 1. SLO is p95 ≤ 600 ms on perceived first-token; current attainment ≈ 97%.

The audio-transport lane is where the mobile-specific tuning compounded. WebRTC works beautifully when it works, and degrades non-gracefully when it doesn't — particularly on the cellular-to-Wi-Fi transition that happens when a customer walks into their house with the app open. The chunked-HTTP fallback path with Whisper-large-v3 is what kept first-token latency from spiking on those transitions; it fires after two consecutive WebRTC handshake failures and adds about 240 ms of one-time latency on the fallback session, which we considered acceptable for the < 1.4% of sessions that hit it. The product team's `kill-switch` lever (set in Cloudflare KV) can disable the WebRTC path globally if a regional issue surfaces — we have not had to use it in production.

lib/widgets/gf_voice_copilot.dart dart
// gf_voice_copilot.dart — GetWidget OSS Flutter package
//
// Drop the voice copilot into any Flutter scaffold. Mic-permission
// UX, the partial transcript chip, the animated waveform, and
// barge-in handling all live in the widget. The host wires the
// two callbacks: partial transcript (during) + suggestions (after
// the function call resolves).

import 'package:flutter/material.dart';

class VoiceCopilotConfig {
  /// First-token latency budget. The widget surfaces a degraded
  /// affordance when the model exceeds it twice in a row.
  final int firstTokenBudgetMs;

  /// Time to wait before falling from WebRTC to chunked HTTP + Whisper.
  final Duration fallbackToHttpAfter;

  /// Honour barge-in (tap-to-cancel mid-response).
  final bool bargeIn;

  const VoiceCopilotConfig({
    this.firstTokenBudgetMs = 600,
    this.fallbackToHttpAfter = const Duration(seconds: 2),
    this.bargeIn = true,
  });
}

class ProductSuggestion {
  final String sku;
  final String title;
  final num priceCents;
  const ProductSuggestion({
    required this.sku,
    required this.title,
    required this.priceCents,
  });
}

typedef OnSuggestions   = void Function(List<ProductSuggestion>);
typedef OnTranscript    = void Function(String partial);
typedef OnHandoff       = void Function(String intent);

class GFVoiceCopilot extends StatefulWidget {
  /// Scopes the function-call surface to this catalog (per-store SKU set).
  final String catalogId;

  /// Latency + fallback behaviour.
  final VoiceCopilotConfig config;

  /// Fires repeatedly as the model surfaces partial transcript.
  final OnTranscript onPartialTranscript;

  /// Fires once per function-call response with the suggestion list.
  final OnSuggestions onSuggestions;

  /// Fires when the agent classifies the intent as out-of-scope
  /// (returns, support, account questions) — the host should
  /// navigate to a chat or human surface here.
  final OnHandoff onHandoff;

  const GFVoiceCopilot({
    super.key,
    required this.catalogId,
    required this.onPartialTranscript,
    required this.onSuggestions,
    required this.onHandoff,
    this.config = const VoiceCopilotConfig(),
  });

  @override
  State<GFVoiceCopilot> createState() => _GFVoiceCopilotState();
}
The GFVoiceCopilot widget API exported from the GetWidget OSS Flutter package. Two callbacks (partial transcript, suggestions) plus a config struct. Mic-permission UX and barge-in are baked in; the host wires intent.
unit economics

Per-session and monthly cost math

| line item | $ / voice turn | $ / month (≈ 480k voice turns) | note |
|---|---|---|---|
| gpt-realtime-2 · audio input | $0.0021 | $1,008 | ≈ 21k audio tokens × $0.10 / 1M |
| gpt-realtime-2 · audio output | $0.0048 | $2,304 | ≈ 24k audio tokens × $0.20 / 1M |
| gpt-realtime-2 · text tokens | $0.0003 | $144 | ≈ 30 in + 24 out text tokens at Realtime text pricing |
| Whisper STT fallback (1.4% of turns) | $0.00001 | $5 | ≈ $0.006 / min Whisper rate × ≈ 6.7k fallback turns / mo |
| Cloudflare Workers + KV | n/a | $184 | ephemeral keys + signalling + breadcrumb log |
| Algolia function-call read | n/a | $0 (existing) | no new cost · function-calls hit existing facet index |
| Sentry mobile breadcrumb | n/a | $76 | per-turn breadcrumb · cohort-tagged · 90d retention |
| All-in monthly | ≈ $0.0078 | ≈ $3,721 | vs. ≈ $0.045 / turn on the rejected hosted SDK path |

Token costs use OpenAI's public gpt-realtime-2 pricing as of May 2026 — $0.10 / 1M audio input, $0.20 / 1M audio output, plus the small text-token charge on the function-call surface. Voice-turn volume estimate assumes 17% voice-engaged-session share on 1.4M MAU with 2x sessions/MAU/mo and 6.4 voice turns per engaged session. The retailer's actual run-cost is currently ≈ 12% below the table because volume hasn't fully ramped post-100% rollout.
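
The per-turn figure reproduces from the price card with plain arithmetic; a quick check of the table's math, under its own token-count assumptions.

tool/cost_math.dart dart · sketch
// Reproduces the per-turn cost from the table above.
void main() {
  const audioIn = 21000 * (0.10 / 1e6); // ≈ $0.0021
  const audioOut = 24000 * (0.20 / 1e6); // ≈ $0.0048
  const text = 0.0003; // small function-call text surface
  const infra = (184 + 76) / 480000; // Workers + Sentry, amortised
  print(audioIn + audioOut + text + infra); // ≈ $0.0078 / voice turn
}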

A/B-test composition

What the 30-day A/B measured

| measurement | n | what it checks | rollout-gate threshold |
|---|---|---|---|
| Mobile-session conversion · voice cohort | 42,318 sessions | primary KPI · vs. matched control cohort | ≥ +2.0 pts lift on voice-engaged |
| First-token p95 latency on-device | 28,400 turns | per-turn Sentry breadcrumb · iPhone 13 + Pixel 7 | ≤ 600 ms p95 |
| Crash-free sessions · treatment vs control | 42,318 sessions | Sentry · within sample noise of control | ≥ −0.15 pp delta |
| In-app search UX score (post-session) | 812 prompts | 5-pt Likert delivered after voice-engaged sessions | ≥ 3.8 / 5 |
| Out-of-scope handoff rate | 28,400 turns | agent says "let me hand you to chat" · should be present | 8–12% · neither too high nor zero |

A/B randomisation is by anonymous device id. Treatment cohort gets the GFVoiceCopilot button; control cohort gets the existing touch-search-only experience. The +11.4 pt headline is the voice-engaged-session conversion lift, not the all-cohort lift (the all-cohort lift was +1.9 pts, also significant). Confidence interval on the voice-engaged-session lift is ±1.6 pp at the 95% level on n=42,318.
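
Randomisation by anonymous device id is easy to make deterministic and sticky. One common pattern is a stable hash into 100 buckets; this is an assumed implementation, not necessarily the shipped assignment code.

lib/voice/ab_assign.dart dart · sketch
// Deterministic cohort assignment: same device id, same cohort,
// across sessions and app updates, with no server round-trip.
import 'dart:convert';

import 'package:crypto/crypto.dart';

String cohortFor(String anonymousDeviceId, {int treatmentPercent = 50}) {
  final bytes =
      sha256.convert(utf8.encode('voice-ab-v1:$anonymousDeviceId')).bytes;
  final bucket = ((bytes[0] << 8) | bytes[1]) % 100; // 0..99
  return bucket < treatmentPercent ? 'treatment' : 'control';
}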

Production ops cadence is part of the build, not an afterthought. The retailer's product team and our on-call engineer hold a weekly funnel-review where the voice-engaged cohort's per-category lift is opened — any category showing a regression (more than three days of conversion drop) becomes a Sentry issue against the function-call surface and a candidate for prompt tuning. Sentry breadcrumb retention is 90 days hot in their EU project + cohort-tagged for the merch team's downstream analytics. Our on-call rotation runs two engineers a week against a 99.5% widget-availability SLO and the sub-600-ms first-token-latency SLO on the treatment cohort. The App Store and Play Store review submissions both flagged the voice surface as a new permission — we submitted updated mic-permission rationale text on both stores at week 7, approved on first review on both stores. Nothing in this section is published anywhere else by any vendor shipping a Flutter voice copilot — that's the bar.

a/b test · 30-day window

The funnel,
control vs voice-engaged.

Same app, same audience cohort, randomised by anonymous device id. Control gets touch-search only; treatment gets the tap-to-talk button plus the voice copilot. Highlighted line is the checkout-completion step — the +11.4-point lift the case study turns on.

control · touch-search only

  1. Session start · 100.0% · n=42,084
  2. Browse or search · 78.4%
  3. Product detail view · 41.2%
  4. Add to cart · 12.6%
  5. Checkout completion · 3.4%

treatment · voice copilot

  1. Session start · 100.0% · n=42,318 · voice cohort isolated
  2. Browse or tap-to-talk · 81.1%
  3. Product detail view (voice-narrowed) · 52.6%
  4. Add to cart · 19.4%
  5. Checkout completion · 14.8%

+11.4 pp lift on checkout completion · voice-engaged cohort

A/B randomised by anonymous device id. Voice-engaged sessions = treatment-cohort sessions where the user fired tap-to-talk at least once. All-cohort lift (treatment ÷ control across every session, voice-engaged or not) was +1.9 pp on checkout completion · also statistically significant at the 95% level. Confidence interval on the headline +11.4 pp = ±1.6 pp.

8 weeks · honest version

The timeline,
including the week we halted.

Five stages, milestone-billed. The week-5 closed alpha surfaced a cart-abandonment spike on iPhone SE viewports — the voice-trigger button was occluding the price label on the bottom-right tile of the product grid. We halted the rollout, repositioned the trigger above the safe-area inset, and re-ran the alpha. The honest version of `8 weeks` includes the week we sat on our hands fixing a UX bug a Figma export wouldn't have caught.
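
The fix itself was small. A sketch of the repositioning, assuming the trigger renders as a floating button; the 72-logical-pixel clearance is illustrative, and the shipped offset logic lives inside the widget.

lib/voice/trigger_position.dart dart · sketch
// Week-5 fix in miniature: keep the tap-to-talk trigger clear of the
// bottom safe-area inset so it cannot occlude the price label on the
// last grid tile of small viewports (iPhone SE class).
import 'package:flutter/material.dart';

Widget positionedTrigger(BuildContext context, Widget trigger) {
  final bottomInset = MediaQuery.of(context).viewPadding.bottom;
  return Padding(
    // 72 clears the final grid row; tuned on the smallest viewport.
    padding: EdgeInsets.only(bottom: bottomInset + 72),
    child: trigger,
  );
}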

  1. Weeks 1–2

    Discovery + UX postmortem

    Two weeks reading the postmortems of the two prior failed on-device voice A/B tests. Concluded: the failure mode was never the model — it was the trigger UX. Talked to 24 customers from the retailer's loyalty cohort about voice-in-shopping affordances; the two strongest signals were `tap, don't always listen` and `show me what you heard before you act`. Both shaped the GFVoiceCopilot API.

    API spec for the OSS Flutter widget · UX guardrails written down · A/B test design signed off
  2. Weeks 3–4

    Widget build + ephemeral-key mint

    Built the `GFVoiceCopilot` widget in the GetWidget OSS package — mic-permission UX, partial-transcript chip overlay, animated waveform, barge-in handling, and the two callbacks the host wires. Cloudflare Workers minted sub-second-TTL ephemeral keys server-side so no OpenAI secret ever shipped in the Flutter binary. Sentry breadcrumb wiring per voice turn for production debugging.

    GFVoiceCopilot v0.4 shipped to the OSS package · ephemeral-key mint in production
  3. Week 5

    Closed alpha · cart abandon caught

    Closed alpha to 4% of traffic in two US metros. Day 4, Mixpanel flagged a cart-abandonment spike on the category-browse flow — and only on iPhone SE viewports (the smallest screen in the cohort). Root cause: the voice-trigger button was occluding the price label on the bottom-right tile of the product grid. We halted the rollout, repositioned the trigger above the safe-area inset, kept the affordance, and re-ran the alpha for a week with no abandon-rate regression.

    Trigger UX repositioned · iPhone SE viewport bug closed · lift recovered next iteration
    Walk-away point
  4. Weeks 6–7

    Ramp to 50% A/B

    Ramped to 50% A/B in the same two metros, then to all US iOS + Android traffic. Sentry crash-free sessions held at 99.71% on the treatment cohort vs 99.74% on control (within sample noise). First-token p95 measured per-device, per-network — held at sub-600 ms on cellular and sub-400 ms on Wi-Fi. Funnel comparison ran daily; team had a kill-switch wired to the Cloudflare KV namespace if the lift collapsed.

    Full A/B traffic at 50% · daily funnel comparison · kill-switch in production
  5. Week 8

    Production cutover + handoff

    Cutover to 100% traffic with the voice cohort intact as a measurement panel; we kept 5% of users on the control variant indefinitely so the team has an ongoing baseline for drift. Sentry SLI configured on first-token latency. Mixpanel funnel events tagged with the voice-engaged dimension so the merch team can read the lift per category. Documentation handed off to the retailer's in-house Flutter team, who maintain the surface from here.

    Production cutover · 5% indefinite control panel · documentation handed off
A/B results · 30-day test window

How we know
it works.

The A/B test design was signed off in week 1. Every metric below was a pre-registered comparison against the matched control cohort — no fishing for significance, no metric introduced after the rollout started. Numbers are from the current production cut and the 30-day A/B window.

| metric | control | wk 5 (alpha) | wk 6 (50% A/B) | current (live) | target |
|---|---|---|---|---|---|
| Mobile-session conversion · voice cohort | 3.4% | 3.9% | 4.1% | 4.8% | ≥ 4.4% |
| First-token p95 latency (ms) | n/a | 680 | 640 | 580 | ≤ 600 |
| Voice-engaged-session share | 0% | 9% | 14% | 17% | ≥ 12% |
| Crash-free sessions · treatment cohort | 99.74% | 99.61% | 99.68% | 99.71% | ≥ 99.6% |
| In-app search UX score (1–5) | 2.8 | 3.5 | 3.9 | 4.2 | ≥ 3.8 |
| Out-of-scope handoff rate | n/a | 11.4% | 9.8% | 9.2% | 8–12% |

Sample size for the headline +11.4 pp checkout-completion lift is n=42,318 sessions in the voice-engaged treatment cohort over a 30-day A/B window; the lift confidence interval is ±1.6 pp at 95%. First-token p95 latency is measured per-turn on-device via Sentry breadcrumb. Crash-free sessions delta of −0.03 pp between treatment and control is within sample noise — well inside the −0.15 pp rollout-gate threshold. Out-of-scope handoff rate is the share of voice turns where the agent classifies the intent as outside the catalog scope and hands off to chat — by design between 8 and 12%; v1 was high (11.4%) because the alpha rollout had a smaller catalog scope. Note: the alpha-week crash-free figure (99.61%) is intentionally lower than control — that's the iPhone SE viewport bug surfacing in the metric, exactly the way the eval was designed to catch it.

oss · GetWidget package · gf_voice_copilot

Drop the widget into a Scaffold.
Wire two callbacks.

The integration shape clients reuse from the GetWidget OSS Flutter library. The button affordance, mic-permission UX, transcript overlay, and barge-in handling are baked in; the client wires two callbacks — partial transcript + suggestions — and configures the catalog scope.

lib/screens/shop_screen.dart dart
// Drops the voice copilot into any Flutter scaffold.
// Mic-permission, barge-in, and the visual waveform
// are owned by the widget; the host wires intent.

import 'package:flutter/material.dart';
import 'package:getwidget/getwidget.dart';

class ShopScreen extends StatefulWidget {
  const ShopScreen({super.key});
  @override
  State<ShopScreen> createState() => _ShopScreenState();
}

class _ShopScreenState extends State<ShopScreen> {
  // Mirrors the widget's partial transcript for host-side use
  // (analytics, accessibility); the widget renders its own chip.
  String _transcript = '';
  List<ProductSuggestion> _suggestions = const [];

  @override
  Widget build(BuildContext context) {
    return Scaffold(
      appBar: AppBar(title: const Text('Shop')),
      // ProductGrid is the host app's own grid widget.
      body: ProductGrid(suggestions: _suggestions),
      floatingActionButton: GFVoiceCopilot(
        catalogId: 'us-apparel-prod',
        config: const VoiceCopilotConfig(
          firstTokenBudgetMs: 600,
          fallbackToHttpAfter: Duration(seconds: 2),
          bargeIn: true,
        ),
        onPartialTranscript: (text) =>
            setState(() => _transcript = text),
        onSuggestions: (recs) =>
            setState(() => _suggestions = recs),
        onHandoff: (intent) =>
            Navigator.of(context).pushNamed('/chat', arguments: intent),
      ),
    );
  }
}
rendered · iPhone 13 viewport
  • 01 · Partial transcript renders in a chip above the button as the user speaks; cleared on intent fire or barge-in.
  • 02 · onSuggestions fires once per function-call response from the model · stream-friendly, debounced.
  • 03 · onHandoff fires when the agent classifies the intent as out-of-scope (returns, support) · navigate to the chat surface.
  • 04 · Animated waveform proxies "listening" state · paused under prefers-reduced-motion.

gf_voice_copilot ships from the open-source GetWidget Flutter UI kit · 4.8k★ on GitHub · iOS 16+ / Android 11+ · null-safety · OSS license = BSD-3

Ready to ship

Want a case study like this
for your Flutter app?

Book a $3K fixed-fee audit. We'll review the app's current funnel, scope the voice-engaged cohort comparison, recommend a tap-to-talk UX + audio-transport recipe, project per-turn cost, and tell you honestly whether voice is the right primitive — or whether the catalog facets need work first. About one audit in four ends with `fix the catalog tags, voice comes later.`

30 min, async or live · A/B-first scoping · Walk-away point in the pilot