
AI case studies.
Receipts, not slideware.

Six anonymized engagements: clinical triage on Claude Sonnet, RAG over 12k product docs, Realtime API voice agents, a bank fraud agent, contract review, and a Flutter voice copilot. Each one ships with an eval set, latency budget, kill points, and the math behind the metric. Client names are changed at their request; numbers are drawn from shadow-mode logs and frozen eval sets unless explicitly noted as published cost math.

Industry
Healthcare · SaaS · Legal · E-commerce
Capability
RAG · AI agents · Voice agents · Chatbots
Stack
Claude · OpenAI · LangChain · LangGraph · pgvector
Outcome
Time saved · Deflection · Conversion lift · Compliance
six engagements · live + in-flight

Six AI success stories,
anonymized at the clients' request.

Click through to the live pilot below for the full operator-detail write-up: eval table, signature architecture diagram, kill-point section, the works. The other five render in Phase 2 (same template, different signature SVG per page).

Healthcare · Regional health system · Case study

HIPAA-safe clinical triage agent — shipped in 9 weeks

Problem

Pre-triage queue averaging 38–62 min wait at peak. Nurse triage line overflow routing the wrong-acuity patients to ER. PHI-safe AI never piloted.

Approach

FHIR-pulled chart context → PHI redaction → hybrid pgvector + BM25 retrieval over the clinical-pathway corpus → Claude Sonnet 4.6 forced-JSON decision → policy + 2-eye guardrails. Three outcome lanes.
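
The hybrid retrieval step has to merge two ranked lists, one from pgvector similarity search and one from BM25. The write-up does not say how the lists are fused; the sketch below assumes reciprocal rank fusion, a common choice, with hypothetical pathway IDs and the conventional constant k=60.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a value
    taken from the case study.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical pathway IDs returned by each retriever, best match first.
vector_hits = ["chest-pain", "dyspnea", "syncope"]
bm25_hits = ["dyspnea", "abdominal-pain", "chest-pain"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, and neither retriever's raw scores need to be calibrated against the other's.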

Claude Sonnet 4.6 · pgvector 0.7 · FHIR R4 · LangGraph 0.2 · Langfuse
Outcome
38–62% pre-triage wait reduction (n=14,200 shadow encounters)

B2B SaaS · Developer tooling · Case study

Claude case study — RAG over 12,000 product-docs pages

Problem

Documentation search rated 2.3/5 by users; support ticket volume 41% docs-recoverable (n=1,200). Existing keyword search couldn't reason across nested module hierarchies; old answer-bot hallucinated on synonyms.

Approach

Claude Sonnet 4.6 + Haiku 4.5 router over a hybrid pgvector + Algolia index. voyage-3-large embeddings, bge-reranker-large self-hosted. Forced-JSON answer schema with regex-enforced anchor citations — every claim links to a doc anchor or the validator rejects.
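
A citation validator of this kind can be small. The sketch below is one possible shape, not the production validator: the answer schema (`claims` with `text` and `anchor`), the anchor pattern, and the example anchor are all assumptions made for illustration.

```python
import json
import re

# Hypothetical anchor format: a docs path plus a fragment, for example
# "/docs/modules/auth#token-refresh". The real pattern is not given in the source.
ANCHOR_RE = re.compile(r"/docs/[a-z0-9/_-]+#[a-z0-9_-]+")


def validate_answer(raw_json, known_anchors):
    """Return (ok, errors) for a model answer in the assumed schema.

    Assumed schema: {"claims": [{"text": str, "anchor": str}, ...]}.
    """
    try:
        answer = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc.msg}"]

    claims = answer.get("claims") if isinstance(answer, dict) else None
    if not isinstance(claims, list) or not claims:
        return False, ["answer must contain a non-empty 'claims' list"]

    errors = []
    for i, claim in enumerate(claims):
        anchor = claim.get("anchor", "") if isinstance(claim, dict) else ""
        if not isinstance(anchor, str) or not ANCHOR_RE.fullmatch(anchor):
            errors.append(f"claim {i}: anchor missing or malformed")
        elif anchor not in known_anchors:
            errors.append(f"claim {i}: anchor not found in docs index")
    return not errors, errors


known = {"/docs/modules/auth#token-refresh"}
good = json.dumps({"claims": [{"text": "Example claim.",
                               "anchor": "/docs/modules/auth#token-refresh"}]})
bad = json.dumps({"claims": [{"text": "Example claim with no citation."}]})

ok_good, errs_good = validate_answer(good, known)
ok_bad, errs_bad = validate_answer(bad, known)
```

The regex only checks the shape of an anchor; the lookup against the known anchor set is what stops a well-formed but invented citation from passing.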

Claude Sonnet 4.6 · Claude Haiku 4.5 · pgvector 0.7 · BAAI bge-reranker
Outcome
≈ 64% docs-recoverable tickets deflected at conf ≥ 0.8 (95% CI · n=3,400)
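
For readers who want to sanity-check an interval like this one, here is a minimal sketch of a 95% Wilson score interval. It assumes n=3,400 is the denominator of the deflection rate and that 64% of those tickets (2,176) were deflected; the source does not state which interval method was used.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: 64% of 3,400 tickets.
low, high = wilson_interval(2176, 3400)
print(f"{low:.3f} to {high:.3f}")
```

Under those assumptions the interval spans roughly 62% to 66%, about ±1.6 points around the headline figure.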

SaaS · Customer support · Case study

OpenAI case study — Realtime API voice agent at $0.10/call

Problem

Tier-1 voice queue averaging 4-minute wait at peak; 5 inbound questions accounted for 62% of call volume. Existing IVR bouncing 80%+ to a human.

Approach

gpt-realtime-2 voice agent over the help-center RAG corpus. p95 580ms first-token, function-calling handoff_to_human when confidence < 0.7, Twilio + Cloudflare edge audio transport. Published $0.10/call cost math vs $4 live-agent baseline.
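
The handoff rule and the cost comparison reduce to a few lines of arithmetic. The sketch below uses the published $0.10 and $4 figures and the ≈ 38% deflection reported for this engagement. How those combine into a blended cost is an assumption: here every call pays the AI cost, and calls that are not deflected also pay the full live-agent cost.

```python
HANDOFF_THRESHOLD = 0.7      # from the case study: hand off when confidence < 0.7
AI_COST_PER_CALL = 0.10      # published figure, USD
HUMAN_COST_PER_CALL = 4.00   # published live-agent baseline, USD


def should_handoff(confidence):
    """True when the agent should call handoff_to_human."""
    return confidence < HANDOFF_THRESHOLD


def blended_cost_per_call(deflection_rate):
    """Average cost per inbound call under the stated assumption.

    Assumption (not from the source): every call incurs the AI cost, and
    non-deflected calls additionally incur the live-agent cost.
    """
    return AI_COST_PER_CALL + (1.0 - deflection_rate) * HUMAN_COST_PER_CALL


cost = blended_cost_per_call(0.38)
savings = HUMAN_COST_PER_CALL - cost
print(f"blended ${cost:.2f}/call, saving ${savings:.2f}/call vs all-human")
```

Under this model the blended cost is about $2.58 per call, a saving of about $1.42 against the all-human baseline. A different cost model, such as shorter handling time on handed-off calls, would shift both numbers.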

gpt-realtime-2 · Whisper-large-v3 · pgvector 0.7 · Twilio Voice
Outcome
≈ 38% tier-1 voice deflection (95% CI · n=11,400 calls)

Fintech · Mid-market US bank · Case study

Anthropic case study — Claude Sonnet 4.6 fraud agent at a US mid-market bank

Problem

Rules-engine bleeding 18% false-positive rate on 1.2B/yr transactions across card · wire · ACH · RTP. Median analyst review-prep at 8 minutes per flagged case at $14 fully-loaded. Every flag needed a regulator-audit-defensible case note — the binding constraint.

Approach

XGBoost velocity score short-circuits the LLM on the auto-clear band → hybrid pgvector + BM25 retrieval over a 4-yr KYC + case-note corpus → bge-reranker-large self-host → Claude Sonnet 4.6 forced-JSON disposition over AWS PrivateLink → policy-as-code + 2-eye gate → 3 outcome lanes (clear / case-note / regulatory escalate).
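
The short-circuit matters for cost: transactions in the auto-clear band never reach the LLM. The sketch below shows one way to route them. The threshold value, the fail-closed fallback, and the rule that escalations need a second reviewer are illustrative assumptions; only the three lane names come from the write-up.

```python
from dataclasses import dataclass

AUTO_CLEAR_MAX_SCORE = 0.05  # hypothetical upper bound of the auto-clear band
LANES = {"clear", "case-note", "regulatory-escalate"}


@dataclass
class Disposition:
    lane: str
    used_llm: bool
    needs_second_reviewer: bool


def route(velocity_score, llm_disposition):
    """Route one flagged transaction.

    velocity_score: risk score from the upstream model, assumed in [0, 1].
    llm_disposition: zero-argument callable returning the LLM's lane; it is
    only invoked when the score falls outside the auto-clear band.
    """
    if velocity_score <= AUTO_CLEAR_MAX_SCORE:
        return Disposition("clear", used_llm=False, needs_second_reviewer=False)

    lane = llm_disposition()
    if lane not in LANES:
        # Assumed fail-closed behavior: unexpected output goes to the strictest lane.
        lane = "regulatory-escalate"
    # Assumed 2-eye rule: escalations always need a second human reviewer.
    return Disposition(lane, used_llm=True,
                       needs_second_reviewer=(lane == "regulatory-escalate"))


calls = []


def fake_llm():
    """Stand-in for the model call; records that it was invoked."""
    calls.append(1)
    return "case-note"


low = route(0.01, fake_llm)   # inside the auto-clear band, LLM skipped
high = route(0.80, fake_llm)  # outside the band, LLM consulted
```

Passing the model call as a callable keeps the gate testable without network access and makes the skipped call explicit.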

Claude Sonnet 4.6 · Claude Haiku 4.5 · pgvector 0.7 · XGBoost 2.0 · LangGraph 0.2
Outcome
≥ 0.96 precision @ 1% FPR (n=412 eval + 1,840 production · ±0.012 CI)

Legal · Mid-market firm · Case study

RAG case study — first-pass MSA review for a mid-market law firm

Problem

Partners spending 6–9 hours per MSA on first-pass review; clause-library drift across 4 practice groups producing inconsistent calls; 11% of post-execution disputes traced to first-pass drift.

Approach

LangChain 0.3 + LangGraph 0.2 orchestrator over a reconciled clause library (1,420 clauses post-reconciliation, down from 1,840). Hybrid pgvector + tsvector BM25 retrieval, bge-reranker-large (Cohere Rerank A/B'd, kept as fallback). Forced-JSON clause-risk schema with regex-enforced policy_id citations.

Claude Sonnet 4.6 · LangChain 0.3 · LangGraph 0.2 · pgvector 0.7
Outcome
≈ 71% first-pass MSA review time saved · partner-signed-off (95% CI · n=180 MSAs)
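
In partner hours, the headline works out as follows. This is a back-of-envelope sketch that assumes the ≈ 71% saving applies uniformly across the 6–9 hour baseline range, which the source does not state.

```python
def remaining_hours(baseline_hours, fraction_saved):
    """Hours of partner time left after the saving is applied."""
    return baseline_hours * (1.0 - fraction_saved)


FRACTION_SAVED = 0.71        # reported outcome
BASELINE_RANGE = (6.0, 9.0)  # reported first-pass hours per MSA before the pilot

low, high = (remaining_hours(h, FRACTION_SAVED) for h in BASELINE_RANGE)
print(f"{low:.2f} to {high:.2f} hours per MSA after the pilot")
```

Under that assumption, first-pass review drops from 6–9 hours to roughly 1.7–2.6 hours per MSA.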

E-commerce · DTC apparel · Flutter mobile · Case study

AI chatbot case study — Flutter voice copilot in a DTC apparel app

Problem

Mobile-app conversion lagging desktop by 18 points across a 1.4M-MAU Flutter app. In-app search UX rated 2.8/5 (n=1,200). Team had failed two prior on-device voice A/B tests — both rejected on trigger UX, third strike on the line.

Approach

Tap-to-talk on-device VAD → WebRTC over Cloudflare-minted ephemeral keys → gpt-realtime-2 streaming with function-calls into the existing Algolia facet index → product grid re-renders live. Surface shipped as a new GFVoiceCopilot widget in the open-source GetWidget Flutter UI kit (4.8k★). 30-day A/B with matched control.
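
The function-call step needs a translation layer between what the model emits and what the facet index accepts. The sketch below assumes the model supplies a flat JSON object of facet to value and builds an Algolia-style `filters` string from it. The facet names and the allow-list are hypothetical, and no Algolia client is called.

```python
import json


def build_facet_filters(arguments_json, allowed_facets):
    """Turn a function-call argument payload into an Algolia-style filter string.

    arguments_json: JSON object string mapping facet name to value, as a
    model function call might supply it (assumed shape).
    Facets outside allowed_facets are dropped rather than passed to search.
    """
    args = json.loads(arguments_json)
    clauses = []
    for facet in sorted(args):
        if facet not in allowed_facets:
            continue
        value = str(args[facet])
        if '"' in value:
            # Skip values that would break the quoted filter syntax.
            continue
        clauses.append(f'{facet}:"{value}"')
    return " AND ".join(clauses)


# Hypothetical arguments from a voice request such as "navy, size medium".
call_args = '{"color": "navy", "size": "M", "mood": "cozy"}'
filters = build_facet_filters(call_args, allowed_facets={"color", "size", "category"})
print(filters)
```

The allow-list is the design point: the model can invent a facet like `mood`, but only attributes that exist in the index reach the search call.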

gpt-realtime-2 · Flutter 3.24 · GetWidget OSS · Algolia · Cloudflare Workers
Outcome
+11.4 pts mobile conversion · voice-engaged sessions (n=42,318 · ±1.6pt CI · 30d A/B)
Ready to ship

Want a case study like this
for your stack?

Book a free audit. We review your highest-ROI candidate workflow, recommend a model + retrieval recipe, project token + run-cost, and tell you whether it's case-study-shaped (or whether you should buy an off-the-shelf platform). No deck, no obligation to build.

See pricing
30 min, async or live · Eval-first scoping · Walk-away point in the pilot