The architecture below is the production shape. The audio capture path is owned by the Flutter widget: on-device VAD via a Silero-style filter, a mic-permission flow that respects iOS AVAudioSession backgrounding, and cellular-aware bitrate control with a fallback to chunked HTTP if the WebRTC handshake degrades twice. We did not build a custom STT model; the gpt-realtime-2 endpoint accepts raw audio frames directly over the WebRTC PeerConnection. Whisper-large-v3 exists in the stack only as the HTTP fallback when WebRTC fails, which it does about 1.4% of the time on US cellular, mostly on subway-tunnel transitions.
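A minimal sketch of that fallback policy, in TypeScript for illustration (the production path lives in the Flutter widget); `TransportPolicy` and its method names are hypothetical:

```ts
// Sketch of the two-strikes transport fallback described above.
type TransportMode = "webrtc" | "chunked-http";

class TransportPolicy {
  private degradedHandshakes = 0;
  private mode: TransportMode = "webrtc";

  // Called when a WebRTC handshake times out or renegotiates downward.
  onHandshakeDegraded(): TransportMode {
    this.degradedHandshakes += 1;
    // Two degraded handshakes in one session: stop fighting the network
    // and fall back to chunked HTTP (Whisper-large-v3 on the server side).
    if (this.degradedHandshakes >= 2) {
      this.mode = "chunked-http";
    }
    return this.mode;
  }
}
```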
The signalling and ephemeral-key path runs on Cloudflare Workers at the edge. The Flutter app hits a Worker endpoint at session start; the Worker checks the user's anonymous device id, mints a short-TTL ephemeral token against the OpenAI Realtime API, and returns it to the Flutter client. The token is never persisted on the device, never logged, and rotates per session. This was non-negotiable for the retailer's security team: no OpenAI secret ships in the Flutter binary at any point, and no long-lived token sits in client storage. The Cloudflare Worker path is BAA-eligible if the retailer wants to scope it under HIPAA later (they don't need it today, but the security review asked).
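A sketch of the minting Worker, assuming the OpenAI `/v1/realtime/sessions` shape for ephemeral client secrets; the device-id check and route are simplified placeholders, not the production code:

```ts
// Ephemeral-key Worker: the OpenAI secret lives only in the Worker binding.
export interface Env {
  OPENAI_API_KEY: string; // bound as a Worker secret, never shipped to Flutter
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Anonymous device id check, simplified to a presence test here.
    const deviceId = request.headers.get("x-device-id");
    if (!deviceId) return new Response("missing device id", { status: 400 });

    // Mint a short-TTL client token against the Realtime API.
    const upstream = await fetch("https://api.openai.com/v1/realtime/sessions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${env.OPENAI_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model: "gpt-realtime-2" }),
    });
    if (!upstream.ok) return new Response("upstream error", { status: 502 });

    const session = (await upstream.json()) as { client_secret: { value: string } };
    // Return only the ephemeral secret; nothing is persisted or logged.
    return Response.json({ token: session.client_secret.value });
  },
};
```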
Function-calling is where the integration cost was lowest. The retailer's existing Algolia index is three years old and tuned by their merchandising team: synonyms, redirect rules, faceted boost configurations, the works. We did not rebuild it. The Realtime API function-call surface gets four read tools: `search_catalog`, `narrow_facets`, `cart_status`, `account_summary`. Each tool is a thin Cloudflare Worker that proxies into the retailer's existing internal API; nothing in the catalog data pipeline changed. When the model calls `search_catalog`, the result streams back into the model context for narration, and the Flutter widget receives a parallel callback on `onSuggestions`, so the product grid re-renders without waiting for the model's spoken response.
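One of the tool proxies, sketched as a Worker; the internal search URL, argument shape, and facet encoding are illustrative, not the retailer's actual API:

```ts
// `search_catalog` as a thin proxy: reshape the function-call arguments
// into the existing Algolia-backed internal API, touch nothing else.
interface SearchCatalogArgs {
  query: string;
  facets?: Record<string, string>;
}

export default {
  async fetch(request: Request): Promise<Response> {
    const args = (await request.json()) as SearchCatalogArgs;

    // Forward to the retailer's existing internal search API (placeholder URL).
    const url = new URL("https://internal.retailer.example/v1/search");
    url.searchParams.set("q", args.query);
    for (const [facet, value] of Object.entries(args.facets ?? {})) {
      url.searchParams.append("facet", `${facet}:${value}`);
    }

    // The same payload streams into the model context for narration and
    // fires the widget's onSuggestions callback in parallel.
    const results = await fetch(url).then((r) => r.json());
    return Response.json(results);
  },
};
```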
The audio output path streams the model's voice response back over the same WebRTC channel. We did not add a separate TTS provider; gpt-realtime-2's native voice quality clears the bar the product team set for the apparel cohort (the brand voice is intentionally conversational, not corporate). Barge-in is handled in the widget: if the user taps the button again mid-response, the widget sends a `response.cancel` event over the WebRTC data channel and the audio pipeline flushes cleanly. The transcript chip overlay shows the model's last words at the cancel boundary, so the user knows what they interrupted.
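The barge-in path, sketched at the protocol level in TypeScript; the production handler is Dart, but the `response.cancel` event shape is the same, and `flushPlayback` stands in for the widget's audio-drain step:

```ts
// Barge-in over the Realtime data channel: cancel the in-flight response,
// then drain local playback so audio stops at the cancel boundary.
function bargeIn(dc: RTCDataChannel, flushPlayback: () => void): void {
  // Tell the model to stop generating the current response.
  dc.send(JSON.stringify({ type: "response.cancel" }));
  // Drop buffered audio locally; the transcript chip keeps the last
  // narrated words so the user sees what they interrupted.
  flushPlayback();
}
```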