Healthcare · Regional health system RAG + AI agent · forced JSON
Claude Sonnet 4.6 · role: decision model · forced JSON output
pgvector 0.7 · role: embedding retrieval over clinical-pathway chunks
FHIR R4 · role: chart context · Epic + athenahealth
LangGraph 0.2.x · role: agent orchestrator · read-only tools
Langfuse · role: per-decision trace · self-hosted in customer VPC
case study · 2026 · anonymized

How we shipped a HIPAA-safe clinical triage agent
in 9 weeks.

A US-based regional health system needed a pre-triage layer that could clear low-acuity self-care, queue everything else for a clinician with evidence attached, and page on-call instantly when a red-flag symptom set fired. We built it on Claude Sonnet 4.6, pgvector, and FHIR R4 — eval-first, BAA-scoped, with a kill point at week 7 that we used.

38–62%
pre-triage wait reduction (95% CI · n=14,200 shadow encounters)
p95 3.1s
end-to-end decision latency · meets <3.5s service target
412
frozen eval items · re-run on every release
9 weeks
discovery to shadow-mode go-live
shipped
9 weeks · 4 engineers · 1 clinical lead
9,500/wk
patient messages across portal · SMS · post-visit
38–62 min
peak pre-triage wait · fat tail to 2 hrs
50–70%
low-acuity addressable by documented self-care pathways
6 + 40 + 24/7
hospitals · clinics · nurse triage line
the problem

A nurse triage line
under load.

Wait times during evening surges were routing wrong-acuity patients to the ER. Clinicians wanted help, not a chatbot.

The client is a US-based regional health system — six hospitals, around forty ambulatory clinics, and a 24/7 nurse triage line that handles roughly 9,500 inbound patient messages per week across portal, SMS, and post-visit follow-up. Like most regional systems, they sit at the awkward middle of the market: too small to staff a 60-seat triage center during evening peaks, too large for the nurse manager to keep eyes on every queue.

today vs. with the agent

today
Patient → SMS / Portal / Post-visit → Pre-triage queue (38–62 min peak) → Nurse triage
outcome: long wait · sometimes wrong-acuity routing to ER

with the agent
Patient → SMS / Portal / Post-visit → Triage agent (Claude Sonnet 4.6) · forced JSON · cited evidence → Policy + 2-eye guardrail
outcome: Clear · self-care
outcome: Queue · for clinician
outcome: Escalate · stat

The presenting problem in the discovery shadow was specific. Pre-triage queue wait was averaging 38–62 minutes at peak, with a fat tail past two hours. The clinical lead was confident that 50–70% of inbound messages mapped to documented low-acuity self-care pathways — sore throat with no red flags, medication-refill questions, post-op wound checks that were healing fine. Patients waited anyway because there was no triage layer in front of the nurse. Worse, the nurse line was occasionally routing borderline patients home when an ER visit was indicated — not often, but enough that the medical director had named it the binding constraint on the whole program.

They had looked at generic patient-facing chatbots and turned every one of them down. The objections were operator-grade: no autonomous routing on acuity, no PHI leaving the BAA perimeter, no advice generated without grounding in their own clinical-pathway corpus, no metric that wasn't measurable on a frozen eval set. The conversation we walked into was not "should we ship AI" — it was "show us how a triage agent could fail, and tell us how you'd catch it before a patient gets hurt."

That framing decided the engagement. We refused to scope this as a chatbot build. The deliverable was a structured-output triage agent with a clinician override on every non-trivial routing decision, evidence chunks attached to every claim, and a frozen eval set that gated every release. The rest of the page is what we shipped.

the approach

Six pipeline stages,
three outcome lanes.

Every patient message runs the same six-stage pipeline. The agent is allowed to clear, queue, or escalate — nothing else. Forced-JSON output with cited evidence is the only legal shape of a decision.

The architecture below is the production shape, not a marketing diagram. FHIR R4 pulls a scoped slice of the patient's chart (Patient + Encounter + recent Observation resources, scoped via an on-behalf-of JWT for the on-call clinician). The PHI redaction pass strips identifiers using a regex pre-pass plus a clinical NER model fine-tuned on the i2b2 corpus; a reversible token map sits in BAA-scoped Postgres so the reasoning trace stays auditable without ever shipping raw PHI to the model.
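The reversible-redaction idea reduces to a small sketch. This is an illustration, not the production code: the patterns below are ours, and the real pipeline adds the clinical-NER pass and persists the token map to Postgres rather than holding it in memory.

```typescript
// Sketch: swap identifiers for opaque tokens before text reaches the
// model; keep the token -> value map server-side so traces stay
// auditable. Illustrative regexes only; production adds an NER pass.
const PATTERNS: Array<[string, RegExp]> = [
  ["MRN",   /\bMRN[- ]?\d{6,10}\b/g],
  ["PHONE", /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g],
];

export function redact(text: string): { redacted: string; tokenMap: Map<string, string> } {
  const tokenMap = new Map<string, string>();
  let redacted = text;
  let n = 0;
  for (const [kind, re] of PATTERNS) {
    redacted = redacted.replace(re, (match) => {
      const token = `[[${kind}_${n++}]]`;
      tokenMap.set(token, match); // reversible: token -> original value
      return token;
    });
  }
  return { redacted, tokenMap };
}

export function rehydrate(redacted: string, tokenMap: Map<string, string>): string {
  let out = redacted;
  for (const [token, value] of tokenMap) out = out.split(token).join(value);
  return out;
}
```

The point of the reversible map is that the audit trail can show a clinician the original text while the model only ever saw tokens.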

Retrieval is hybrid. pgvector 0.7 sits on the embedding side; a BM25 index built from Postgres `tsvector` sits on the lexical side. We fuse them with reciprocal-rank fusion, take the top-40, and rerank with BAAI's bge-reranker-large self-hosted on a single g5.xlarge inside the customer VPC. Every chunk in the corpus carries a pathway-id, so when the model cites a chunk we can trace the citation back to the source pathway document the clinical team actually maintains.

Retrieval params are tuned, not defaulted. Chunks are 480 tokens with 80-token overlap, anchored on sentence boundaries to keep clinical claims intact. Embeddings come from voyage-3-large at 1,024 dimensions, chosen because Voyage offered a BAA at the same price tier as the cheaper voyage-3-lite — and the lite variant dropped recall@5 by four points on our eval set. The BM25 lane uses Postgres tsvector with English stemming over the same chunks. Fusion is RRF with k=60 (the paper default), top-40 from each lane, deduplicated by chunk id, reranked, top-12 to the model.
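The fusion step itself is a few lines. A minimal RRF sketch (function name is ours), where each lane contributes 1 / (k + rank) per candidate and deduplication falls out of the score map:

```typescript
// Reciprocal-rank fusion over ranked candidate lists, k = 60 as above.
// Lane inputs are chunk ids ordered best-first; ranks are 1-based.
export function rrfFuse(lanes: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const lane of lanes) {
    lane.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  // Sort fused candidates by descending RRF score. Dedup is implicit:
  // each chunk id appears once in the score map.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

In the pipeline the fused list is cut to the top-40 before the cross-encoder rerank; a chunk ranked by both lanes outscores a chunk ranked by only one, which is exactly the behavior the rare-term problem needs.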

The decision step is Claude Sonnet 4.6 with `response_format: json_schema`. The model has three read-only tools and zero write tools — it cannot write to the chart, it cannot escalate, it cannot send a message. All it produces is a JSON object: routing decision, acuity band, cited evidence-chunk ids, and a structured rationale. Every claim in the rationale has to point to a chunk id or the schema validator rejects the output and the request retries with a stricter prompt.

three decisions that shaped the build
design decision · 01

Zero write tools for the agent

we rejected
Write back to the chart directly
because
Chart writes are a clinician privilege. The agent surfaces evidence, the clinician owns the action.
design decision · 02

Forced JSON · response_format schema

we rejected
Free-text answer with downstream parser
because
Every claim has to cite an evidence chunk id. The schema validator is the contract; the model can't hand-wave.
design decision · 03

Hybrid pgvector + BM25 retrieval

we rejected
Pure embedding search
because
Clinical pathway docs over-index on rare terms (drug names, ICD codes) that lexical match wins on. Embeddings miss them. Fusion is empirically better on the eval set.

Guardrails live as TypeScript code checked into the same repo as the agent. The policy layer enforces a two-eye rule on anything routed to a clinician queue; refuses to act on chart slices flagged as active pregnancy without obstetric context or pediatric encounters under three years (both route straight to a human); and audit-logs, per decision, the evidence chunks, model version, redaction map, and clinician override (if any).
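The shape of that policy layer, reduced to a sketch. Names and fields here are hypothetical, not the production API; the real version runs after schema validation and can only make a decision more conservative, never less:

```typescript
type Routing = "clear" | "queue" | "escalate";

interface ChartFlags {
  ageYears: number;
  pregnantWithoutObContext: boolean;
}

interface PolicyVerdict {
  routing: Routing;
  requiresSecondEye: boolean;  // two-eye rule on clinician-queue entries
  overriddenByPolicy: boolean; // true when the agent's decision was discarded
}

export function applyPolicy(agentRouting: Routing, flags: ChartFlags): PolicyVerdict {
  // Hard human-routing rules fire regardless of the model output.
  if (flags.ageYears < 3 || flags.pregnantWithoutObContext) {
    return { routing: "queue", requiresSecondEye: true, overriddenByPolicy: true };
  }
  return {
    routing: agentRouting,
    requiresSecondEye: agentRouting === "queue",
    overriddenByPolicy: false,
  };
}
```

Because the policy is plain code in the same repo, a rule change goes through the same review, CI, and eval gate as any other change to the agent.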

The reason this shape works is the same reason it was scoped this way at week 1: every component has a separately measurable contract. Retrieval is measurable in top-k recall on the eval set. The reranker is measurable in top-1 precision on the held-out slice. The model decision is measurable on the labelled acuity-band correctness. The calibration head is measurable in ECE. The guardrails are measurable in policy-rejection rate vs. clinician-override rate. When something regresses, the per-component metric tells us which stage to look at — not a single end-to-end number that hides which subsystem broke.

We also use Langfuse for per-decision tracing in the customer VPC. Every production decision retains its retrieval candidates, reranker scores, raw model output, parsed JSON, policy-check result, and the final routing — searchable by clinician override status. That trace store is what the clinical lead reviews weekly. It is also what we used to find the week-7 calibration bug; we will get to that in the timeline section below.

under the hood

The triage agent,
end to end.

Every patient message enters at the top. It either clears to a self-care pathway, lands in a clinician queue with structured evidence attached, or escalates stat.

outcome · Clear: low-acuity self-care path · ≈ 62% of pre-triage volume
outcome · Queue for clinician: structured packet · evidence chunks attached · ≈ 33%
outcome · Escalate stat: red-flag symptom set · pages on-call · ≈ 5%

latency budgets above are p50/p95 on the production traffic mix · end-to-end p95 of 3.1s, inside the 3.5s target

BAA-scoped
no PHI leaves the customer VPC at any point in the pipeline
0
autonomous escalations · clinician sign-off on every queue entry
8 clinicians
in the design council · 3 of them flagged the calibration bug
shadow-first
two weeks running silently next to the existing nurse triage line
the stack

Named tools,
named versions.

Everything in the build is a thing your security team can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, and policies are all checked into the customer's repo — not ours.

Claude Sonnet 4.6 · Anthropic API · forced JSON · role: decision
Claude Haiku 4.5 · role: routing fallback
pgvector 0.7 · role: embedding retrieval
BM25 (Postgres tsvector) · role: lexical retrieval
BAAI bge-reranker-large · role: rerank
LangGraph 0.2.x · role: agent orchestrator
FHIR R4 · role: chart context · Epic + athenahealth
Langfuse · role: per-decision trace
Cloudflare Workers · role: edge · BAA-eligible
how it actually runs

Production shape,
under the hood.

The numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses Anthropic's published Sonnet 4.6 pricing as of May 2026; eval composition is the frozen 412-item set the CI gates on.

Most clinical-AI case studies stop at the architecture diagram. Ours doesn't, because our buyers don't. The two people who decide whether to sign — the clinical informatics lead and the head of security — open a case study and look for specific things: per-stage latency with p95 not just p50, a token-cost line that ties to the model vendor's published price card, a frozen eval set with category-level thresholds, and an honest accounting of what runs where for BAA scope. Vendors who don't show this either don't have it or are hiding it. The section below is the version of our pilot that maps directly to those questions. Every number is reproducible from a Langfuse trace, a Postgres `EXPLAIN ANALYZE`, or a published vendor price page.

latency budget

Per-stage P50 / P95 (ms)

stage | p50 (ms) | p95 (ms) | tooling
FHIR resource pull | 92 | 140 | Epic on-FHIR + athenahealth APIs · cached Patient + scoped Encounter
PHI redaction | 78 | 120 | regex pre-pass + i2b2-fine-tuned clinical NER (DistilBERT base)
Hybrid retrieval | 112 | 180 | pgvector cosine top-40 ∥ Postgres tsvector BM25 top-40 → RRF k=60
Cross-encoder rerank | 240 | 340 | BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-12
Claude Sonnet 4.6 decision | 1,740 | 2,180 | Anthropic API · response_format json_schema · ~3,400 in / ~480 out tokens
Policy + 2-eye validation | 14 | 22 | TypeScript runtime · Zod schema · audit-log write
Total (end-to-end) | 2,280 | 3,098 | agent boundary · excludes clinician-side queue render

p50/p95 from 30-day rolling window over n ≈ 41,200 production decisions. SLO is p95 ≤ 3,500 ms; current burn ≈ 88%.

The retrieval lane is where most of the per-stage tuning effort went. The corpus is ≈ 1,400 pathway pages chunked to 480 tokens with 80-token overlap, anchored on sentence boundaries — short enough that the reranker score is meaningful, long enough that a clinical claim doesn't get cut in half. We picked voyage-3-large at 1,024 dimensions specifically because Voyage signs a BAA at the same price tier as voyage-3-lite; we tried the lite variant first and recall@5 dropped four points on the eval. The 35% embeddings cost saving wasn't worth shipping a measurably worse retriever. Fusion is reciprocal-rank with k=60 (the paper default; we did not find a better value on the held-out slice), top-40 from each lane, deduplicated by chunk id, reranked with bge-reranker-large, top-12 to the model. Eval-set recall@5 after fusion + rerank is 0.91. Recall@1 is 0.78 — high enough that the model's first cited chunk is almost always load-bearing.
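For reference, the recall numbers above are computed the standard way; a minimal sketch (function name is ours): recall@k is the share of eval items whose labelled gold chunk appears in the top-k retrieved list.

```typescript
// recall@k over a labelled eval slice.
export function recallAtK(
  retrieved: string[][], // per-item ranked chunk ids, best first
  gold: string[],        // per-item labelled correct chunk id
  k: number,
): number {
  let hits = 0;
  for (let i = 0; i < gold.length; i++) {
    if (retrieved[i].slice(0, k).includes(gold[i])) hits++;
  }
  return hits / gold.length;
}
```

Recall@1 at 0.78 is why the rationale's first citation is usually the load-bearing one; recall@5 at 0.91 is the number the CI gate watches.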

triage/schema/decision.ts · TypeScript
// triage/schema/decision.ts
// Forced-JSON decision schema. Validated client-side too; if the
// model produces something that doesn't parse, we retry once with
// a stricter system prompt, then fail closed (queue for clinician).

import { z } from "zod";

export const TriageDecision = z.object({
  routing: z.enum([
    "clear",      // safe for documented self-care; no clinician needed
    "queue",      // route to nurse queue with this agent's reasoning attached
    "escalate",   // page on-call clinician now (stat criteria)
  ]),
  acuity_band: z.enum(["1-self-care", "2-routine", "3-same-day", "4-urgent", "5-stat"]),
  confidence: z.number().min(0).max(1),
  rationale: z.array(z.object({
    claim:       z.string().min(40).max(420),
    evidence_id: z.string().regex(/^chunk_[a-f0-9]{12}$/),
    pathway_id:  z.string(),
  })).min(1).max(8),
  refused: z.boolean().describe(
    "True if the agent decided it cannot decide — pediatric < 3y, " +
    "active pregnancy without OB context, or any rationale failed to ground."
  ),
});

export type TriageDecision = z.infer<typeof TriageDecision>;
The structured-output schema. Claude Sonnet 4.6 with response_format: json_schema can't return anything that doesn't conform — every claim has to cite a retrieved chunk id.
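The retry-then-fail-closed behavior around that schema can be sketched like this. Names here are ours: the production version calls the Anthropic API asynchronously and validates with the Zod schema's safeParse; this dependency-free sketch keeps only the control flow.

```typescript
// Sketch: one retry with a stricter prompt, then fail CLOSED — the
// fallback routes the message to a clinician queue, never "clear".
export function decideFailClosed<T>(
  callModel: (strict: boolean) => string, // production: async Anthropic call
  validate: (raw: string) => T | null,    // production: Zod safeParse
  fallback: T,                            // production: { routing: "queue", ... }
): T {
  for (const strict of [false, true]) {
    const parsed = validate(callModel(strict));
    if (parsed !== null) return parsed;
  }
  return fallback; // fail closed: a human sees the message
}
```

The key property is the direction of failure: a schema violation can only ever cost latency or a clinician's time, never a wrong autonomous "clear".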
unit economics

Per-decision and monthly cost math

line item | $ / decision | $ / month (≈ 41k decisions) | note
Claude Sonnet 4.6 · input tokens | $0.0102 | $418 | 3,400 tokens × $3.00 / 1M
Claude Sonnet 4.6 · output tokens | $0.0072 | $294 | 480 tokens × $15.00 / 1M
voyage-3-large embeddings (avg query) | $0.0004 | $16 | ≈ 3,300 tokens × $0.12 / 1M
pgvector + RDS db.m6i.large | | $284 | BAA-scoped Postgres; embeddings + tsvector
g5.xlarge reranker (24/7) | | $378 | BAAI bge-reranker-large self-host
Cloudflare Workers (BAA-eligible) | | $128 | edge + audit-log shipping
Langfuse self-hosted (t3.medium) | | $67 | trace store; 90-day hot / 7-yr cold
All-in monthly | ≈ $0.0411 | ≈ $1,585 | vs. ≈ $7,900 / mo to add one triage nurse

Token costs use Anthropic's public Sonnet 4.6 pricing as of May 2026 — $3 / 1M input, $15 / 1M output. Infra costs are AWS US-east-2 list price; client paid less under EDP. Payback period from go-live (including the 9-week build at $185k) was ≈ 6.2 months.
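The token line items reduce to one formula. A sketch reproducing the per-decision figure from the published per-million-token prices (function name is ours):

```typescript
// Per-decision token cost in USD from per-1M-token prices.
export function tokenCostPerDecision(
  inTokens: number, outTokens: number,
  inPricePer1M: number, outPricePer1M: number,
): number {
  return (inTokens * inPricePer1M + outTokens * outPricePer1M) / 1_000_000;
}
// 3,400 in × $3 / 1M + 480 out × $15 / 1M = $0.0102 + $0.0072 = $0.0174
```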

eval composition

What's in the frozen 412-item set

category | items | what it checks | ci-gate threshold
Acuity-decision golds | 80 | labelled routing + correct acuity band on real (de-identified) encounters | ≥ 0.90 precision @ 1% FPR
PHI redaction | 60 | spans of PHI correctly redacted; reversible-token map intact | ≥ 0.99 token recall
Retrieval recall | 120 | correct pathway chunk in top-5 after RRF + rerank | ≥ 0.90 recall@5
Groundedness | 100 | every rationale claim points to a retrieved chunk id that supports it | ≥ 0.93 groundedness
Refusal / adversarial | 52 | pediatric < 3y, active pregnancy w/o OB, jailbreak attempts, OOD cases | 100% refusal on listed must-refuse

Eval set is frozen — items only added, never edited. The clinical lead signs off on every addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry.
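That release gate is a comparison against the prior cut; a sketch (names are ours, and "1 point" is read as 0.01 on the 0–1 metrics):

```typescript
// Returns the categories that regressed more than maxDrop vs the prior
// cut. An empty array means the release may proceed.
export function gateRelease(
  prior: Record<string, number>,
  current: Record<string, number>,
  maxDrop = 0.01,
): string[] {
  const failures: string[] = [];
  for (const [category, prev] of Object.entries(prior)) {
    if ((current[category] ?? 0) < prev - maxDrop) failures.push(category);
  }
  return failures;
}
```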

Production ops cadence is also part of the build, not an afterthought. The clinical lead and our on-call engineer hold a weekly override-review meeting where every queued case in which the agent's recommendation differed from the nurse's gets opened — drift that looks systematic (more than three of the same pattern in a week) becomes a JIRA ticket against the eval set and a candidate fine-tune slice. Langfuse trace retention is 90 days hot in the customer VPC plus seven years cold in BAA-scoped S3, matching their HIPAA retention policy. Our on-call rotation runs two engineers a week against a 99.5% pipeline-availability SLO and the 95th-percentile-under-3.5s decision SLO. The security team pulls an audit-log sample every month — model version, retrieval candidates, redaction map, policy-check verdict, clinician override. Nothing in this section is published anywhere else by anyone shipping clinical agents. That's the bar.

9 weeks · honest version

The timeline
including the week we almost cut.

Five stages, milestone-billed. The week-7 shadow run found a calibration bug on borderline-acuity cases that would have hurt patients in production. We halted cutover, re-fit the calibration head, re-ran the eval, and only then promoted to primary. The honest version of `9 weeks` includes the week we sat on our hands.

  1. Weeks 1–2

    Discovery + eval set

    Two weeks shadowing the nurse triage line. 412 frozen eval items written by the clinical lead from real (de-identified) past encounters. Each item carries a labelled correct routing decision and the clinical reasoning behind it. We wrote the harness; clinicians wrote the answers.

    Frozen eval set + acuity-band scoring rubric
  2. Weeks 3–4

    Pathway corpus + retrieval

    Ingested the existing clinical-pathway document set (≈ 1,400 chunked pages) into pgvector 0.7 inside the customer VPC. Built the BM25 sidecar over the same chunks. Reciprocal-rank fusion tuned on a held-out eval slice; cross-encoder rerank added when top-1 recall plateaued.

    Hybrid retrieval at 0.91 top-5 recall on the eval set
  3. Weeks 5–6

    Agent skeleton + guardrails

    LangGraph 0.2.x agent with three read-only tools. Zero write tools by design. Forced-JSON decision via Anthropic's response_format. Policy-as-code in TypeScript shipping next to the agent — every routing decision is gated and audit-logged before it touches a clinician queue.

    End-to-end pipeline behind a feature flag
  4. Week 7

    Shadow run — calibration bug found

Two weeks of silent shadow against the live nurse triage line. On day 4 the clinical lead flagged calibration drift on borderline-acuity cases: the model was confident on cases where the correct answer was 'queue for clinician', not 'clear'. We halted cutover, re-fit the calibration head on a fresh slice, and re-ran the eval. The honest version of `shipped on time` includes this step.

    ECE recalibrated from 0.061 → 0.029 on a fresh eval slice
    Walk-away point
  5. Weeks 8–9

    Cutover + clinician training

    Promoted to primary triage with the nurse line in active-standby. Four clinician training sessions on the override flow and the audit-log viewer. PagerDuty wired to the stat-escalation lane. Old nurse line stays on for 30 days post-cutover by policy — every diff between agent + human is logged for review.

    Production cutover with documented metrics + override flow
eval results · 412 frozen items

How we know
it works.

The eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 412. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2% across all rows over the last 30 days.

metric | baseline (wk 2) | v1 (wk 5) | v2 (wk 6) · current (live) | target
Triage-acuity precision @ 1% FPR | 0.821 | 0.879 | 0.904 | ≥ 0.90
Recall on high-acuity escalations | 0.918 | 0.946 | 0.962 | ≥ 0.95
Calibration (ECE) | 0.073 | 0.061 | 0.029 | ≤ 0.04
Note groundedness | 0.88 | 0.92 | 0.95 | ≥ 0.93
Refusal rate | 14.8% | 11.2% | 9.4% | 8–12%
P95 time-to-decision | 4.2s | 3.4s | 3.1s | ≤ 3.5s

Sample size for the production wait-time number is n=14,200 patient encounters across the two-week shadow window; the 38–62% reduction range is the 95% confidence interval, not a point estimate. ECE is expected calibration error on the labelled 412-item set. P95 latency is end-to-end from FHIR pull to JSON decision, measured at the agent boundary (excludes clinician-side queue render). Refusal rate is the share of inputs where the agent legally cannot decide and routes straight to a clinician — by design, not by failure.
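ECE as reported here is the standard binned definition; a sketch of how it can be computed (function name is ours): bucket decisions by stated confidence, then take the size-weighted average gap between each bucket's mean confidence and its empirical accuracy.

```typescript
// Expected calibration error over (confidence, correctness) pairs.
export function expectedCalibrationError(
  confidences: number[], // model's stated confidence per decision, in [0, 1]
  correct: boolean[],    // whether the routing matched the label
  bins = 10,
): number {
  const n = confidences.length;
  const sumConf = new Array(bins).fill(0);
  const sumAcc = new Array(bins).fill(0);
  const count = new Array(bins).fill(0);
  confidences.forEach((c, i) => {
    const b = Math.min(bins - 1, Math.floor(c * bins)); // 1.0 lands in top bin
    sumConf[b] += c;
    sumAcc[b] += correct[i] ? 1 : 0;
    count[b] += 1;
  });
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    if (count[b] === 0) continue;
    ece += (count[b] / n) * Math.abs(sumConf[b] / count[b] - sumAcc[b] / count[b]);
  }
  return ece;
}
```

This is the metric the week-7 recalibration moved from 0.061 to 0.029: the model's stated confidence on borderline cases had drifted away from its actual hit rate.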

Ready to ship

Want a case study like this
for your stack?

Book a $3K fixed-fee audit. We'll review the workflow, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether it's case-study-shaped. We'll also tell you if it isn't — about one audit in five ends with `buy the platform, here's the SOW for integration.`

30 min, async or live Eval-first scoping Walk-away point in the pilot