Fintech · Mid-market US bank AI agent · forced JSON · policy-as-code
Claude Sonnet 4.6 · role: Decision model · forced JSON · severity-bounded enum
Claude Haiku 4.5 · role: Velocity routing · case-note narration
XGBoost 2.0 · role: Velocity score · 142 features · auto-clear band
pgvector 0.7 · role: Hybrid retrieval over a 4-yr KYC + case-note corpus
LangGraph 0.2.x · role: Agent orchestrator · 3 read tools · 2 write tools
Langfuse · role: Per-decision trace · BAA-scoped in customer VPC
case study · 2026 · anonymized

An Anthropic case study, with the audit trail
a regulator would accept.

A US mid-market bank's fraud-ops team needed a decision layer that could clear the auto-pass band silently, produce a defensible case note on every queue entry, and escalate regulatory-severity cases to a senior analyst with a two-eye signature on the dispatch. We built it on Claude Sonnet 4.6 and Haiku 4.5 with XGBoost on the velocity short-circuit, hybrid retrieval over four years of KYC and case-note corpus, and a policy-as-code layer that gates every write tool. Eleven weeks, BAA-scoped over AWS PrivateLink, with a kill point at week 7 that we used.

≥ 0.96
precision at 1% FPR · n=412 frozen eval · ±0.012 CI
8.0 → 0.8 min
case-note review-prep per case · n=204 timed sessions
~92%
of routed cases include the right evidence on first read · n=1,840
11 weeks
discovery to shadow-mode go-live · 1 calibration halt at wk 7
shipped
11 weeks · 5 engineers · 1 senior analyst lead · 1 model-risk lead
1.2B / yr
transactions across card · wire · ACH · RTP
18%
false-positive rate on the legacy rules engine
$14 / case
fully-loaded analyst review-prep cost on flagged cases
8 min / case
median analyst review-prep before the agent shipped
the problem

A rules engine
under load.

An 18% false-positive rate means the analyst queue is mostly noise. The binding constraint wasn't fraud loss — it was the audit trail behind every cleared case.

The client is a US mid-market bank's fraud-operations team — roughly 1.2 billion auth-boundary transactions per year across card, ACH, wire, and RTP rails, with a 50-seat analyst floor running a hybrid rules + ML overlay that had last been tuned in 2023. Like most mid-market fraud-ops teams, they sit at an awkward operational point: too small to staff the kind of feature-engineering bench that keeps a fully-custom fraud model fresh, too large to outsource the SAR-track decision to a vendor's default library.

today vs. with the agent

today

Auth-boundary stream
Rules engine
300+ static rules · last tuned 2023
Analyst queue
≈ 8 min/case manual write-up
SAR triage
outcome
18% FPR · analyst burnout · audit prep fragile

with the agent

Auth-boundary stream
XGBoost velocity score
skip-LLM band for score < 0.18
Claude Sonnet 4.6 · forced JSON
evidence-cited disposition
Policy + 2-eye + audit log
outcome
Clear · silent · audit row
outcome
Case-note · queue · 0.8 min
outcome
Escalate · 2-eye · SAR-track

The presenting problem was specific. The legacy rules engine was clearing roughly 92% of auth-boundary traffic silently, flagging the rest into an analyst queue with an 18% false-positive rate. The fully-loaded analyst cost per flagged case — including review-prep time, second-look, and the regulator-audit packet write-up — was averaging $14. Median review-prep time was 8 minutes per case. The senior analyst lead had named the binding constraint twice in the discovery shadow: every flag needs a defensible case note in the regulator audit trail. Not eventually. Not on the cases that escalate. Every flag.

That framing matters because most fraud-agent vendors pitch the false-positive number as the win. The bank's compliance officer told us in week one that the false-positive number was a top-line goal but not the binding constraint — the binding constraint was that the team's existing rules engine produced flags the regulator-audit team could only defend by reconstructing the analyst's manual notes from the case-management system. If the model's output had to be reconstructed the same way, the engagement was indefensible.

They had looked at generic fraud-detection vendors and turned every one of them down. The objections were operator-grade. No autonomous dispositions on regulatory severity. No PAN, SSN, or beneficiary-graph data leaving the BAA perimeter. No "explainability score" without a chunk-cited evidence trail. No metric that wasn't measurable on a frozen eval set the senior analyst lead labelled. The conversation we walked into was the same one every regulated-industry AI engagement starts with: show us how the model fails, tell us how the audit packet defends every disposition, and we'll talk about the precision number after that.

the approach

Seven pipeline stages,
three outcome lanes.

Every transaction enters at the top. The XGBoost velocity score short-circuits the LLM on the auto-clear band; everything else runs hybrid retrieval over four years of KYC and case-note corpus, gets reranked, and lets Claude produce a forced-JSON disposition that policy-as-code gates before the runtime dispatches any tool call.

The architecture below is the production shape. The Kafka topic at the top is the bank's existing auth-boundary stream — we did not move it, did not re-process historical data, did not write a sidecar event store. Every transaction enters in flight, runs the XGBoost velocity score in < 9 ms p95, and exits to one of three lanes within the 2.6-second decision SLA. Roughly 92% of traffic short-circuits the LLM entirely on the auto-clear band (velocity score below 0.18) — the math behind that decision is in the unit-economics SpecGrid further down. The remaining ~8% is what the LLM sees.
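
To make the lane split concrete, here is a minimal sketch of the short-circuit described above. The 0.18 cut-off and the ~92% / ~8% split are the figures from this page; the type and function names are illustrative, not the production identifiers.

lane-split sketch · typescript · illustrative, not the shipped code

type Lane = "clear" | "llm_route";

const AUTO_CLEAR_THRESHOLD = 0.18; // auto-clear band cut-off quoted above

interface AuthEvent {
  caseId: string;
  amountCents: number;
  features: number[]; // the 142-feature vector the velocity model scores
}

function writeAuditRow(row: { caseId: string; lane: Lane; velocityScore: number }): void {
  // placeholder: production writes a WORM-equivalent audit row instead
  console.log(JSON.stringify(row));
}

// ~92% of traffic scores below the threshold and never reaches the LLM;
// a silent clear still writes an audit row. The rest goes to retrieval + Sonnet.
function routeTransaction(evt: AuthEvent, velocityScore: number): Lane {
  if (velocityScore < AUTO_CLEAR_THRESHOLD) {
    writeAuditRow({ caseId: evt.caseId, lane: "clear", velocityScore });
    return "clear";
  }
  return "llm_route";
}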

Retrieval is hybrid. pgvector 0.7 sits on the embedding side over a 4-year KYC + case-note corpus (~ 2.4M chunks at 512 tokens, anchored on sentence boundaries); a BM25 sidecar built on Postgres `tsvector` sits on the lexical side. We fuse with reciprocal-rank fusion at k=60, take the top-40, and rerank with BAAI's bge-reranker-large self-hosted on a single g5.xlarge inside the customer VPC. Every chunk carries a case-id and a redaction-map id, so the audit packet can reconstruct the exact evidence the model saw at decision time.
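
As a rough shape, the two lanes reduce to a pair of parameterized queries like the ones below. The table and column names (`kyc_chunks`, `embedding`, `ts`) are assumptions for illustration; the fusion and rerank steps are sketched a little further down.

retrieval-lane sketch · typescript · table and column names are assumptions

// pgvector lane: cosine distance over the chunked KYC + case-note corpus, top-40.
// `<=>` is pgvector's cosine-distance operator when the index uses vector_cosine_ops.
const vectorLaneSql = `
  SELECT chunk_id, case_id, redaction_map_id,
         1 - (embedding <=> $1::vector) AS score
  FROM kyc_chunks
  ORDER BY embedding <=> $1::vector
  LIMIT 40`;

// Lexical lane: Postgres tsvector with English stemming, top-40.
const lexicalLaneSql = `
  SELECT chunk_id, case_id, redaction_map_id,
         ts_rank_cd(ts, plainto_tsquery('english', $1)) AS score
  FROM kyc_chunks
  WHERE ts @@ plainto_tsquery('english', $1)
  ORDER BY score DESC
  LIMIT 40`;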

The model is Claude Sonnet 4.6, called via Anthropic's API over AWS PrivateLink — the BAA is what unlocked PrivateLink, which is what kept the PAN-bearing prompts inside the customer's VPC. The output is forced to a JSON schema so the disposition is bounded to a severity enum, the evidence array is regex-pinned to chunk ids, and the rationale field can't exceed an audit-friendly length. Every claim in the rationale has to point to an evidence-chunk id from the retrieved set or the schema validator rejects the output. Claude Haiku 4.5 drafts the case-note narrative downstream of the Sonnet disposition — same evidence ids, templated by severity.
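
A hedged sketch of what that schema can look like as a Zod definition, mirroring the field names in the policy file and the sample packet further down the page. The exact production schema isn't published, so treat this as illustrative.

disposition schema sketch · typescript · illustrative, mirrors the policy file

import { z } from "zod";

export const Disposition = z.object({
  case_id: z.string(),
  severity: z.enum(["low", "med", "high", "regulatory"]), // bounded enum: no fifth severity
  decision: z.enum(["clear", "case-note", "escalate"]),
  confidence: z.number().min(0).max(1), // required, no default
  evidence: z
    .array(
      z.object({
        claim: z.string(),
        evidence_id: z.string().regex(/^chunk_[0-9a-f]{12}$/), // pinned to retrieved chunk ids
        source: z.string(),
      })
    )
    .min(1),
  reasoning: z.string().min(80).max(600), // audit-friendly length, per the policy file
});

export type DispositionT = z.infer<typeof Disposition>;

// The schema alone can't know what was retrieved; the runtime additionally
// rejects any evidence_id that isn't in the chunk set passed to the model.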

Retrieval params are tuned, not defaulted. Chunks at 512 tokens with 96-token overlap, anchored on sentence boundaries because case-note narrative has long-form sentences that lose grounding when cut mid-clause. Embeddings are voyage-3-large at 1,024 dimensions; we tested voyage-3-lite first and lost five points of recall@5 on the eval, which wasn't worth the 35% embeddings cost saving. BM25 uses Postgres tsvector with English stemming. RRF at k=60 (the paper default; we did not find a better value on the held-out slice), top-40 from each lane, deduplicated by chunk id, rerank to top-12 going into the model.
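
A minimal reciprocal-rank-fusion sketch with those parameters; the cross-encoder rerank that trims the fused list to 12 is left out, and the names are illustrative.

rrf-fusion sketch · typescript · illustrative

interface RankedChunk { chunkId: string }

// Reciprocal-rank fusion over the two lanes: score = sum of 1 / (k + rank),
// k = 60 (paper default), deduplicated by chunk id, top-40 kept for the rerank.
function rrfFuse(lanes: RankedChunk[][], k = 60, keep = 40): string[] {
  const scores = new Map<string, number>();
  for (const lane of lanes) {
    lane.forEach((chunk, i) => {
      const rank = i + 1; // 1-indexed rank within the lane
      scores.set(chunk.chunkId, (scores.get(chunk.chunkId) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, keep)
    .map(([chunkId]) => chunkId);
}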

three decisions that shaped the build
design decision · 01

Skip the LLM on the auto-clear band

we rejected
Run Claude on every transaction
because
92% of auth-boundary traffic is below the velocity-score threshold. Burning Sonnet tokens on cases the rules engine already cleared is the indefensible line in the cost math; the XGBoost short-circuit is what makes the unit economics work.
design decision · 02

Forced JSON with a severity enum

we rejected
Free-text disposition + downstream parser
because
The regulator audit needs the disposition packet to be reproducible from the trace. A schema-bounded severity enum (low | med | high | regulatory) is what makes the SAR-track decision deterministic; the model can't smuggle a fifth severity into the output.
design decision · 03

Two-eye gate on regulatory severity

we rejected
Auto-route regulatory severity to SAR queue
because
Anything that touches the FinCEN clock starts on a human signature, not a model output. The senior-analyst approval row is checked into the policy file — the runtime refuses to dispatch the escalation tool without it. We accepted a slower escalate path for an audit-defensible one.

Guardrails live as TypeScript policy files checked into the same repo as the agent — one per tool the agent can reach for. The runtime imports the policy at startup and refuses to dispatch any tool call that doesn't pass. The `escalate_case.policy.ts` file below is the actual shape we shipped; it gates the regulatory-severity branch behind a senior-analyst approval, rate-limits per-case to one dispatch, and writes a 7-year retention audit row on every call regardless of outcome. Per-claim evidence-chunk ids ride through the audit log alongside the model version, the retrieved chunks, the velocity score, the reasoning JSON, and the senior analyst's signature when applicable.

The reason this shape works is the same reason we scoped it this way at week 1. Every component has a separately measurable contract. The XGBoost velocity model is measurable in ROC-AUC + ECE on the auto-clear band. Retrieval is measurable in top-k recall on the eval set. The reranker is measurable in top-1 precision on the held-out slice. The decision model is measurable in labelled severity-correctness + groundedness. The case-note generator is measurable in regulator-audit acceptance (the senior analyst lead signs off on a 10% sample weekly). The guardrails are measurable in policy-rejection rate vs. senior-analyst-override rate. When something regresses, the per-component metric tells us which stage to look at — not a single end-to-end number that hides which subsystem broke.

Langfuse runs the trace store inside the customer VPC. Every production decision retains its velocity score, retrieval candidates, reranker scores, raw model output, parsed JSON, policy-check result, case-note draft, and final disposition. The trace store is searchable by senior-analyst-override status and is what the model-risk lead reviews weekly. It is also what we used to find the Black Friday calibration bug at week 7; the timeline section below has the honest version of how that week played out.
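
Reduced to a type, the per-decision record looks roughly like this. Field names are illustrative; Langfuse stores these as traces and observations, and this is only the set of fields the paragraph above says every decision retains.

per-decision trace sketch · typescript · field names are illustrative

interface DecisionTrace {
  traceId: string;
  caseId: string;
  modelVersion: string;
  velocityScore: number;
  retrievalCandidates: { chunkId: string; fusedScore: number }[];
  rerankerScores: { chunkId: string; score: number }[];
  rawModelOutput: string;          // unparsed Sonnet response
  parsedDisposition: unknown;      // schema-validated JSON packet
  policyVerdict: "pass" | "reject";
  caseNoteDraft?: string;          // Haiku narration, when the case is routed
  finalDisposition: "clear" | "case-note" | "escalate";
  seniorAnalystOverride: boolean;  // what the weekly model-risk review filters on
}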

under the hood

The fraud agent,
auth to audit.

Every transaction enters at the top. The XGBoost velocity score skips the LLM on the auto-clear band; anything above the threshold runs hybrid retrieval over four years of KYC and case-note corpus, reranks the evidence, and lets Claude Sonnet 4.6 produce a forced-JSON disposition. Hover any stage for its tool surface and latency budget.

outcome · ~93.4% Clear (silent) auto-pass band · audit row written · no analyst touch
outcome · ~5.1% Case-note · queue structured case file · evidence cited · 0.8 min review
outcome · ~1.5% Escalate · regulatory senior-analyst 2-eye gate · SAR-track on confirm

latency budgets are p50/p95 from a 30-day production window · end-to-end p95 inside the 2.6s decision SLA

BAA-scoped
Anthropic over AWS PrivateLink · no PAN leaves the customer VPC
0
autonomous regulatory escalations · senior-analyst signs every one
7-year
audit retention · WORM-equivalent S3 object lock · per BSA / SAR rules
shadow-first
three weeks in silent shadow against the rules engine before any cutover
deterministic replay · synthetic data

A 0.32-second window
at the auth boundary.

Eight rows from a synthetic replay tape — the same shape the production stream sees at ~38k transactions/sec peak. The agent fans out into three lanes per row: silent clear, queued case-note, or senior-analyst escalation. No real PAN, no real merchant; this is a replay viewer, not a live feed.

ts | card | merchant | amount | v-score | decision | reason
14:02:18.041 | •••• 4019 | Grocery · POS | $42.18 | 0.09 | clear | low-risk merchant · habitual
14:02:18.092 | •••• 7124 | Online retail | $1,840.00 | 0.71 | case-note | amount p99 · novel beneficiary
14:02:18.137 | •••• 3055 | Fuel · CRIND | $58.40 | 0.14 | clear | in-pattern · velocity normal
14:02:18.184 | •••• 8801 | Wire · cross-border | $9,250.00 | 0.92 | escalate | structured-pattern hit · senior-analyst 2-eye
14:02:18.226 | •••• 2236 | Streaming · sub | $14.99 | 0.04 | clear | recurring · pre-allow
14:02:18.271 | •••• 6498 | Electronics | $612.00 | 0.48 | case-note | ip-geo drift · low-confidence
14:02:18.318 | •••• 5712 | Restaurant | $78.25 | 0.11 | clear | habitual · merchant in cohort
14:02:18.366 | •••• 9043 | Crypto on-ramp | $4,500.00 | 0.86 | escalate | first-seen on-ramp · regulatory routing

replay clock advances 41 ms per row · in production, roughly 92% of rows fall in the auto-allow band (v-score < 0.18); the replay over-samples flagged rows for legibility

the stack

Named tools,
named versions.

Everything in the build is a thing the model-risk committee can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, policies, and feature definitions are all checked into the bank's repo — not ours.

Claude Sonnet 4.6 · Anthropic API · forced JSON · role: decision
Claude Haiku 4.5 · role: routing + case-note narrative
XGBoost 2.0 · role: velocity score · 142 features
pgvector 0.7 · role: embedding retrieval · KYC corpus
BM25 (Postgres tsvector) · role: lexical retrieval
BAAI bge-reranker-large · role: cross-encoder rerank · g5.xlarge in-VPC
LangGraph 0.2.x · role: agent orchestrator
Langfuse · role: per-decision trace · 90d hot / 7yr cold
AWS PrivateLink · role: in-VPC Anthropic inference · zero egress
how it actually runs

Production shape,
under the hood.

Numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses Anthropic's published Sonnet 4.6 + Haiku 4.5 pricing as of May 2026; eval composition is the frozen 412-item set the CI gates on.

Most fraud-agent case studies stop at the architecture diagram. Ours doesn't, because the two people who decide whether to sign — the model-risk lead and the head of compliance — open a case study and look for specific things: per-stage latency with p95 not just p50, a token-cost line that ties to the model vendor's published price card, a frozen eval with category-level thresholds, an honest accounting of what runs where for BAA scope, and a regulator-audit retention story. Vendors who don't show this either don't have it or are hiding it. Every number below is reproducible from a Langfuse trace, a Postgres `EXPLAIN ANALYZE`, or a published vendor price page.

latency budget

Per-stage P50 / P95 (ms)

stage | p50 | p95 | tooling
Kafka consumer + parse | 8 | 18 | Confluent · ISO 8583 superset · per-tenant partition key
XGBoost velocity score | 4 | 9 | XGBoost 2.0 · 142 features · auto-clear band short-circuit
Hybrid retrieval | 38 | 92 | pgvector cosine top-40 ∥ tsvector BM25 top-40 → RRF k=60
Cross-encoder rerank | 62 | 138 | BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-12
Claude Sonnet 4.6 decision | 1,420 | 2,080 | Anthropic API over AWS PrivateLink · ~3,200 in / ~420 out tokens
Claude Haiku 4.5 case-note | 780 | 1,180 | narrates from Sonnet's evidence ids · ~1,100 in / ~340 out
Policy + 2-eye + audit log | 11 | 22 | TypeScript runtime · Zod schema · WORM-equivalent audit row
Total (LLM-routed path) | 2,323 | 3,537 | agent boundary · ~8% of traffic; auto-clear path < 50ms total

p50/p95 from a 30-day rolling window over n ≈ 28,400 LLM-routed decisions / mo (~92% of traffic short-circuits before the LLM call). SLO is p95 ≤ 3,500 ms on the LLM-routed path. Current burn ≈ 101% — we're in active tuning on the reranker timeout to bring the tail in; the SpecGrid above doesn't lie about a number we haven't shipped yet.

The retrieval lane is where the per-stage tuning compounded. The corpus is roughly 2.4 million chunks at 512 tokens each over four years of KYC artefacts and historical case notes; chunk anchoring on sentence boundaries is what kept recall@5 stable as the corpus grew. We picked voyage-3-large at 1,024 dimensions because Voyage offered a BAA at parity pricing to voyage-3-lite, and the lite variant lost five points of recall@5 on the eval. RRF at k=60 (paper default; the held-out slice didn't move on alternatives we tested), top-40 from each lane, deduplicated by chunk id, reranked to top-12 going into the model. Eval-set recall@5 after fusion + rerank is 0.93; recall@1 is 0.81 — high enough that the model's first cited chunk almost always reflects the load-bearing evidence in the disposition.
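
Recall@k on the frozen eval is the straightforward check below; a sketch that assumes each eval item carries the gold chunk ids the senior analyst lead labelled.

recall-at-k sketch · typescript · data shapes are assumptions

interface RetrievalEvalItem {
  goldChunkIds: string[];      // chunks the senior analyst lead says should ground the call
  retrievedChunkIds: string[]; // post-fusion, post-rerank order
}

// Share of eval items where at least one gold chunk appears in the top k.
function recallAtK(items: RetrievalEvalItem[], k: number): number {
  const hits = items.filter((item) => {
    const topK = item.retrievedChunkIds.slice(0, k);
    return item.goldChunkIds.some((gold) => topK.includes(gold));
  }).length;
  return hits / items.length;
}

// On the current cut: recallAtK(items, 5) ≈ 0.93 and recallAtK(items, 1) ≈ 0.81.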

triage/tools/escalate_case.policy.ts typescript
// triage/tools/escalate_case.policy.ts
//
// Every write tool the agent can reach for has a policy file.
// The runtime imports these at startup and refuses to dispatch
// any tool call that doesn't pass. Regulatory severity needs a
// senior-analyst signature before the SAR-track is touched.

import { Policy } from "@gw/agent-runtime";

export const escalate_case: Policy = {
  description: "Send a flagged transaction to the human-review queue.",
  inputs: {
    case_id:     "uuid, exists in cases table, has not been escalated",
    severity:    "enum: low | med | high | regulatory",
    evidence:    "array, min 1, items have {claim, evidence_id, source}",
    confidence:  "number in [0,1]; required, no default",
    reasoning:   "string, 80–600 chars, grounded in evidence",
  },
  preconditions: [
    "agent.confidence_calibrated === true",
    "transaction.amount > 0",
    "no_pending_escalation_for(case_id)",
    "every(evidence, e => retrieval.contains(e.evidence_id))",
  ],

  rate_limits: { perAgent: "30/min", perCase: "1" },

  audit: {
    redact:    ["pan", "ssn", "iban", "routing"],
    retain:    "7y",
    store:     "s3:bsa-audit-log/worm",  // WORM-equivalent object lock
    log_shape: ["case_id", "severity", "evidence", "model_version",
                "retrieval_chunks", "policy_verdict", "approver"],
  },

  // Two-eye rule. Regulatory severity needs a senior-analyst sign-off
  // before the runtime dispatches the SAR-track integration.
  approval: {
    required: ({ severity }) => severity === "regulatory",
    approver: "role:senior-analyst",
    deadline_mins: 30, // ages back to the queue with an "aged out" tag
  },
};
The policy file for the regulatory-escalation tool. The runtime imports it at startup and refuses to dispatch a tool call that doesn't pass. Two-eye rule is enforced in code, not a config flag — the same pattern ships on every write tool in the agent.
unit economics

Per-decision and monthly cost math

line item | $ / decision | $ / month (≈ 28k LLM-routed decisions) | note
Claude Sonnet 4.6 — input | $0.0096 | $269 | 3,200 tokens × $3.00 / 1M
Claude Sonnet 4.6 — output | $0.0063 | $176 | 420 tokens × $15.00 / 1M
Claude Haiku 4.5 — case-note | $0.0008 | $22 | 1,100 in + 340 out at Haiku pricing
voyage-3-large embeddings | $0.0006 | $17 | ≈ 5,000 tokens × $0.12 / 1M
pgvector + RDS db.r6i.xlarge | | $612 | BAA-scoped Postgres · pgvector + tsvector
g5.xlarge reranker (24/7) | | $378 | BAAI bge-reranker-large self-host
AWS PrivateLink + endpoints | | $96 | Anthropic in-VPC inference
Langfuse self-hosted (t3.large) | | $104 | trace store · 90d hot / 7yr cold
All-in monthly | ≈ $0.061 | ≈ $1,674 | vs. ≈ $14 × 6k cases/mo = $84k legacy review-prep

Token costs use Anthropic's public Sonnet 4.6 + Haiku 4.5 pricing as of May 2026 — $3 / 1M input, $15 / 1M output on Sonnet; $0.80 / 1M input, $4 / 1M output on Haiku. Infra costs are AWS US-east-2 list price; the bank paid less under an EDP. The legacy comparison line is the bank's own per-case review-prep cost × the routed-cases volume — the agent doesn't replace analyst time at the decision boundary, it compresses the review-prep half of the case workload.
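
The two Sonnet lines reproduce directly from that price card; a quick arithmetic check, using only the token counts and prices already quoted on this page.

unit-economics check · typescript · reproduces the Sonnet lines from the table above

const SONNET_IN_PER_MTOK = 3.0;    // $ per 1M input tokens
const SONNET_OUT_PER_MTOK = 15.0;  // $ per 1M output tokens
const DECISIONS_PER_MONTH = 28_000;

const inPerDecision = (3_200 / 1_000_000) * SONNET_IN_PER_MTOK;  // ≈ $0.0096
const outPerDecision = (420 / 1_000_000) * SONNET_OUT_PER_MTOK;  // ≈ $0.0063

console.log(inPerDecision.toFixed(4), (inPerDecision * DECISIONS_PER_MONTH).toFixed(0));   // 0.0096 269
console.log(outPerDecision.toFixed(4), (outPerDecision * DECISIONS_PER_MONTH).toFixed(0)); // 0.0063 176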

eval composition

What's in the frozen 412-item set

category | items | what it checks | ci-gate threshold
Severity-decision golds | 100 | labelled disposition + severity band on real (de-id) past cases | ≥ 0.95 precision @ 1% FPR
Evidence groundedness | 120 | every rationale claim points to a retrieved chunk id that supports it | ≥ 0.93 groundedness
Retrieval recall | 80 | correct case + policy chunks in top-5 after RRF + rerank | ≥ 0.90 recall@5
Refusal / adversarial | 60 | structured-pattern hits, jailbreak attempts, OOD merchant categories | 100% refusal on must-refuse
Calibration golds | 52 | confidence-vs-correctness on held-out cases · ECE check | ECE ≤ 0.04

Eval set is frozen — items only added, never edited. The senior analyst lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. The Black Friday holiday-window slice (added at week 7) became a 5th fold and is now permanent.
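
The category gate reduces to a small check in CI; a sketch that assumes `1 point` means 0.01 on the 0-1 metric scale, with data shapes that are illustrative rather than the shipped harness.

ci-gate sketch · typescript · illustrative

interface CategoryScore { category: string; score: number }

// Fails the release if any eval category drops more than a point from the
// prior cut; the signed-CHANGELOG override path is a human step, not code.
function gateRelease(current: CategoryScore[], prior: CategoryScore[]) {
  const priorByCat = new Map(prior.map((c) => [c.category, c.score]));
  const regressions = current.filter((c) => {
    const before = priorByCat.get(c.category);
    return before !== undefined && before - c.score > 0.01;
  });
  return { pass: regressions.length === 0, regressions: regressions.map((c) => c.category) };
}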

Production ops cadence is part of the build, not an afterthought. The senior analyst lead and our on-call engineer hold a weekly override-review meeting where every case in which the agent's disposition differed from the analyst's override gets opened — any drift that looks systematic (more than three of the same pattern in a week) becomes a JIRA ticket against the eval set and a candidate fine-tune slice. Langfuse trace retention is 90 days hot in the customer VPC plus seven years cold in BAA-scoped S3 with WORM-equivalent object lock, matching the BSA's seven-year record-retention requirement and the bank's internal SAR documentation policy. Our on-call rotation runs two engineers a week against a 99.5% pipeline-availability SLO and the 95th-percentile-under-3.5s decision SLO on the LLM-routed path. The model-risk lead pulls a 50-row audit-log sample every month covering velocity score, retrieval candidates, reranker scores, raw model output, parsed JSON, policy verdict, and senior-analyst approval. Nothing in this section is published anywhere else by any vendor shipping fraud agents on Claude — that's the bar.

interactive · drag the threshold

Precision vs FPR,
where you stand the line.

Where the team stands the threshold is a policy choice, not a model property. Drag the marker to see how precision, recall, and per-month false-positive volume move together. We anchor production at 1% FPR — the ops team's documented ceiling for analyst review-prep load.
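
What the slider computes is the standard precision / recall / FPR triplet at a chosen threshold; a sketch, assuming the scored eval cases carry the senior analyst's fraud label.

op-point sketch · typescript · data shapes are assumptions

interface ScoredCase { score: number; isFraud: boolean }

// Precision / recall / FPR at a given threshold; production anchors FPR at 1%.
function opPoint(cases: ScoredCase[], threshold: number) {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (const c of cases) {
    const flagged = c.score >= threshold;
    if (flagged && c.isFraud) tp++;
    else if (flagged && !c.isFraud) fp++;
    else if (!flagged && c.isFraud) fn++;
    else tn++;
  }
  return {
    precision: tp + fp > 0 ? tp / (tp + fp) : 0,
    recall: tp + fn > 0 ? tp / (tp + fn) : 0,
    fpr: fp + tn > 0 ? fp / (fp + tn) : 0,
  };
}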

precision recall

curve fitted from the frozen 412-item eval set · production op-point at 1% FPR · move the thumb with mouse, touch, or arrow keys

11 weeks · honest version

The timeline,
including the week we halted.

Five stages, milestone-billed. The week-7 Black Friday shadow surfaced a calibration drift on a holiday-shopping velocity pattern that the eval set hadn't seen. We halted cutover, ingested the holiday-window slice as a new eval fold, re-fit the calibration head, and only then promoted to primary screen. The honest version of `11 weeks` includes the week we ran the sweep.

  1. Weeks 1–2

    Discovery + frozen eval set

    Two weeks shadowing the fraud-ops team. The senior analyst lead labelled 412 frozen eval items drawn from 18 months of (de-identified) past cases — each carrying a labelled correct disposition + the rationale + the evidence chunks that should ground the call. We wrote the harness; the ops team wrote the answers. Scoping decision: the deliverable is a structured-output agent, not a chatbot, and the eval gate is non-negotiable.

    412-item frozen eval + severity-band rubric · scope memo signed
  2. Weeks 3–4

    Corpus + velocity score + retrieval

    Ingested four years of KYC artifacts and historical case notes into pgvector 0.7 inside the customer VPC. BM25 sidecar over the same chunks. XGBoost 2.0 velocity model trained against the labelled fraud history with 142 features; calibrated with isotonic regression on a held-out slice. Reciprocal-rank fusion tuned on the eval slice; cross-encoder rerank wired in when top-1 recall plateaued.

    Hybrid retrieval at 0.93 top-5 recall · velocity ECE 0.041
  3. Weeks 5–6

    Agent skeleton + policy-as-code

    LangGraph 0.2.x agent with three read-only tools (case lookup, KYC pull, structured-pattern check) and two write tools (case-note write, escalation dispatch). Every tool carries a policy file in `triage/tools/`. Forced-JSON disposition bound to the severity schema. Two-eye gate baked into the runtime; senior-analyst approval is what unblocks the regulatory-severity branch.

    End-to-end pipeline behind a feature flag · BAA + PrivateLink wired
  4. Week 7

    Black Friday shadow — calibration drift caught

    Three weeks of silent shadow against the live rules engine. Day 9 was Black Friday, and a holiday-shopping pattern surfaced that nobody had labelled in the eval set — a structurally novel velocity pattern from gift-card top-ups that the model was over-confidently clearing as legit. We halted cutover, ingested the holiday-window slice as a fresh-data fold, re-fit the calibration head, and re-ran the full eval. The honest version of `shipped on time` includes the week we sat on our hands and ran the calibration sweep.

    ECE recalibrated from 0.067 → 0.028 on the Black Friday-augmented eval slice
    Walk-away point
  5. Weeks 8–11

    Cutover + SAR-track integration

    Promoted to primary screen with the rules engine in active-standby. Compliance reviewed the audit-log packet end-to-end; FinCEN SAR-track integration tested against the bank's e-filing path. Four ops-team training sessions on the case-note acceptance flow + the two-eye approval surface. PagerDuty wired to the regulatory-severity lane. Old rules engine stays in active-standby for 60 days post-cutover; every diff between agent + rules logged for the model-risk lead's weekly review.

    Production cutover · SAR-track audited · model-risk committee sign-off
eval results · 412 frozen items

How we know
it works.

The eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 412. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2% across all rows over the last 30 days.

metric | baseline (wk 2) | v1 (wk 5) | v2 (wk 6) | current (live) | target
Precision @ 1% FPR | 0.812 | 0.928 | 0.952 | 0.962 | ≥ 0.95
Recall on labelled fraud | 0.681 | 0.812 | 0.864 | 0.881 | ≥ 0.85
Calibration (ECE) | 0.094 | 0.067 | 0.039 | 0.028 | ≤ 0.04
Case-note groundedness | | 0.88 | 0.94 | 0.96 | ≥ 0.93
Refusal rate | | 12.4% | 10.1% | 8.6% | 8–12%
P95 time-to-disposition | | 3.4s | 2.9s | 2.6s | ≤ 3.0s

Sample size for the ≥ 0.96 precision figure is n=412 frozen eval items + the production confirmation slice n ≈ 1,840 cases reviewed by the senior analyst over the first 30 days post-cutover. Confidence interval is ±0.012 on the precision at the 1% FPR op-point. ECE is expected calibration error on the labelled set. P95 is end-to-end on the LLM-routed path (auto-clear path is under 50ms total). Refusal rate is the share of inputs where the agent legally cannot decide and routes straight to a senior analyst — by design, not by failure. Note: refusal rate v1 → v2 → current is not monotone-improving by design; we tuned the refusal-threshold up in v2 after the senior analyst lead flagged that v1 was clearing borderline cases that should have routed for review.
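
ECE here is the usual binned gap between stated confidence and observed correctness; a sketch with ten equal-width bins, which is an assumption rather than the production binning.

ece sketch · typescript · binning is an assumption

interface CalItem { confidence: number; correct: boolean }

// Expected calibration error: per-bin |mean confidence - observed accuracy|,
// weighted by bin size. The CI gate requires ECE <= 0.04 on the calibration golds.
function expectedCalibrationError(items: CalItem[], bins = 10): number {
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    const lo = b / bins;
    const hi = (b + 1) / bins;
    const inBin = items.filter(
      (i) => i.confidence >= lo && (i.confidence < hi || (b === bins - 1 && i.confidence === 1))
    );
    if (inBin.length === 0) continue;
    const avgConf = inBin.reduce((s, i) => s + i.confidence, 0) / inBin.length;
    const accuracy = inBin.filter((i) => i.correct).length / inBin.length;
    ece += (inBin.length / items.length) * Math.abs(avgConf - accuracy);
  }
  return ece;
}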

artefact diff · synthetic case

What the ops team reads
when a case routes for review.

The 8.0 → 0.8 minute delta isn't a tooling story. It's an artefact story. On the left: what an analyst produced manually — narrative-heavy, hard to skim, citations buried in the prose. On the right: the agent's structured packet — same evidence, surfaced as fields the regulator-audit reviewer reads in seconds.

before · manual write-up ≈ 8 min / case

Case # FA-2026-04-11-0917
Reviewer: J. Reyes · 11 Apr 2026 14:22 UTC

Customer (PAN ending 8801) initiated a wire transfer of $9,250.00 USD to a beneficiary account first seen on the platform on 09 Apr 2026. Reviewing prior 90-day activity for this customer shows wire activity concentrated to two prior beneficiaries (relatives, KYC-verified, historical pattern stable). The new beneficiary is an LLC registered in a jurisdiction with elevated SAR-correlation per our internal scoring (referenced in policy doc PL-WIRE-2024 §4.2).

Velocity score on this transaction was 0.92 (model output, see ML-FRAUD-3.4 dashboard). Cross-checking against our case-history corpus, three structurally similar cases have been reviewed in the past 18 months; two were confirmed-fraud, one cleared after additional context. The originating IP geo (Newark, NJ) is consistent with the cardholder's historical pattern.

Recommendation: escalate for regulatory review. Senior analyst sign-off required per the two-eye policy on structured-pattern hits. Note: cardholder not yet contacted — pending compliance lead approval for outbound.

after · agent-generated ≈ 0.8 min / case
{
  "case_id":   "FA-2026-04-11-0917",
  "severity":  "regulatory",
  "decision":  "escalate",
  "confidence": 0.91,

  "evidence": [
    {
      "claim": "Beneficiary first-seen 09 Apr 2026; not in cardholder's 90d graph.",
      "evidence_id": "chunk_a4f0c12b9e44",
      "source": "ledger.beneficiary_first_seen"
    },
    {
      "claim": "LLC jurisdiction matches PL-WIRE-2024 §4.2 elevated-SAR list.",
      "evidence_id": "chunk_71d33e0a4c8b",
      "source": "policy.PL-WIRE-2024"
    },
    {
      "claim": "Velocity score 0.92; 3 structurally similar cases in the corpus.",
      "evidence_id": "chunk_e8b290745f01",
      "source": "ml.velocity + case-history"
    }
  ],

  "two_eye_required": true,
  "approver_role":    "role:senior-analyst",
  "sar_track":        true,
  "audit_retain_yrs": 7
}

both artefacts are synthetic · case-id, beneficiary, and PAN-last-4 are illustrative · the agent packet is what the regulator-audit reviewer reads, not the prose

Ready to ship

Want a case study like this
for your fraud-ops floor?

Book a $3K fixed-fee audit. We'll review the workflow, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether it's case-study-shaped — and whether the regulatory posture is ready to support a build. About one audit in five ends with `the legal posture isn't ready yet, here's the 90-day prep plan.`

Read the fintech pillar
30 min, async or live · Eval-first scoping · Walk-away point in the pilot