Fintech · Mid-market US bank AI agent · forced JSON · policy-as-code
Claude Sonnet 4.6 · role: Decision model · forced JSON · severity-bounded enum
Claude Haiku 4.5 · role: Velocity routing · case-note narration
XGBoost 2.0 · role: Velocity score · 142 features · auto-clear band
pgvector 0.7 · role: Hybrid retrieval over a 4-yr KYC + case-note corpus
LangGraph 0.2.x · role: Agent orchestrator · 3 read tools · 2 write tools
Langfuse · role: Per-decision trace · BAA-scoped in customer VPC
case study · 2026 · anonymized

An Anthropic case study, with the audit trail
a regulator would accept.

A US mid-market bank's fraud-ops team needed a decision layer that could clear the auto-pass band silently, produce a defensible case note on every queue entry, and escalate regulatory-severity cases to a senior analyst with a two-eye signature on the dispatch. We built it on Claude Sonnet 4.6 and Haiku 4.5 with XGBoost on the velocity short-circuit, hybrid retrieval over four years of KYC and case-note corpus, and a policy-as-code layer that gates every write tool. Eleven weeks, BAA-scoped over AWS PrivateLink, with a kill point at week 7 that we used.

≥ 0.96
precision at 1% FPR · n=412 frozen eval · ±0.012 CI
8.0 → 0.8 min
case-note review-prep per case · n=204 timed sessions
~92%
of routed cases include the right evidence on first read · n=1,840
11 weeks
discovery to shadow-mode go-live · 1 calibration halt at wk 7
shipped
11 weeks · 5 engineers · 1 senior analyst lead · 1 model-risk lead
1.2B / yr
transactions across card · wire · ACH · RTP
18%
false-positive rate on the legacy rules engine
$14 / case
fully-loaded analyst review-prep cost on flagged cases
8 min / case
median analyst review-prep before the agent shipped
the problem

A rules engine
under load.

An 18% false-positive rate means the analyst queue is mostly noise. The binding constraint wasn't fraud loss — it was the audit trail behind every cleared case.

The client is a US mid-market bank's fraud-operations team — roughly 1.2 billion auth-boundary transactions per year across card, ACH, wire, and RTP rails, with a 50-seat analyst floor running a hybrid rules + ML overlay that had last been tuned in 2023. Like most mid-market fraud-ops teams, they sit at an awkward operational point: too small to staff the kind of feature-engineering bench that keeps a fully-custom fraud model fresh, too large to outsource the SAR-track decision to a vendor's default library.

today vs. with the agent

today

Auth-boundary stream
Rules engine
300+ static rules · last tuned 2023
Analyst queue
≈ 8 min/case manual write-up
SAR triage
outcome
18% FPR · analyst burnout · audit prep fragile

with the agent

Auth-boundary stream
XGBoost velocity score
skip-LLM band for score < 0.18
Claude Sonnet 4.6 · forced JSON
evidence-cited disposition
Policy + 2-eye + audit log
outcome
Clear · silent · audit row
outcome
Case-note · queue · 0.8 min
outcome
Escalate · 2-eye · SAR-track

The presenting problem was specific. The legacy rules engine was clearing roughly 92% of auth-boundary traffic silently, flagging the rest into an analyst queue with an 18% false-positive rate. The fully-loaded analyst cost per flagged case — including review-prep time, second-look, and the regulator-audit packet write-up — was averaging $14. Median review-prep time was 8 minutes per case. The senior analyst lead had named the binding constraint twice in the discovery shadow: every flag needs a defensible case note in the regulator audit trail. Not eventually. Not on the cases that escalate. Every flag.

That framing matters because most fraud-agent vendors pitch the false-positive number as the win. The bank's compliance officer told us in week one that the false-positive number was a top-line goal but not the binding constraint — the binding constraint was that the team's existing rules engine produced flags the regulator-audit team could only defend by reconstructing the analyst's manual notes from the case-management system. If the model's output had to be reconstructed the same way, the engagement was indefensible.

They had looked at generic fraud-detection vendors and turned every one of them down. The objections were operator-grade. No autonomous dispositions on regulatory severity. No PAN, SSN, or beneficiary-graph data leaving the BAA perimeter. No "explainability score" without a chunk-cited evidence trail. No metric that wasn't measurable on a frozen eval set the senior analyst lead labelled. The conversation we walked into was the same one every regulated-industry AI engagement starts with: show us how the model fails, tell us how the audit packet defends every disposition, and we'll talk about the precision number after that.

the approach

Seven pipeline stages,
three outcome lanes.

Every transaction enters at the top. The XGBoost velocity score short-circuits the LLM on the auto-clear band; everything else runs hybrid retrieval over four years of KYC and case-note corpus, gets reranked, and lets Claude produce a forced-JSON disposition that policy-as-code gates before the runtime dispatches any tool call.

The architecture below is the production shape. The Kafka topic at the top is the bank's existing auth-boundary stream — we did not move it, did not re-process historical data, did not write a sidecar event store. Every transaction enters in flight, runs the XGBoost velocity score in < 9 ms p95, and exits to one of three lanes within the 2.6-second decision SLA. Roughly 92% of traffic short-circuits the LLM entirely on the auto-clear band (velocity score below 0.18) — the math behind that decision is in the unit-economics SpecGrid further down. The remaining ~8% is what the LLM sees.
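
To make the lane split concrete, here is a minimal sketch of the short-circuit described above. The 0.18 cut-off and the ~92% / ~8% split are the figures from this page; the type and function names are illustrative, not the production identifiers.

lane-split sketch · typescript · illustrative, not the shipped code

type Lane = "clear" | "llm_route";

const AUTO_CLEAR_THRESHOLD = 0.18; // auto-clear band cut-off quoted above

interface AuthEvent {
  caseId: string;
  amountCents: number;
  features: number[]; // the 142-feature vector the velocity model scores
}

function writeAuditRow(row: { caseId: string; lane: Lane; velocityScore: number }): void {
  // placeholder: production writes a WORM-equivalent audit row instead
  console.log(JSON.stringify(row));
}

// ~92% of traffic scores below the threshold and never reaches the LLM;
// a silent clear still writes an audit row. The rest goes to retrieval + Sonnet.
function routeTransaction(evt: AuthEvent, velocityScore: number): Lane {
  if (velocityScore < AUTO_CLEAR_THRESHOLD) {
    writeAuditRow({ caseId: evt.caseId, lane: "clear", velocityScore });
    return "clear";
  }
  return "llm_route";
}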

Retrieval is hybrid. pgvector 0.7 sits on the embedding side over a 4-year KYC + case-note corpus (~ 2.4M chunks at 512 tokens, anchored on sentence boundaries); a BM25 sidecar built on Postgres `tsvector` sits on the lexical side. We fuse with reciprocal-rank fusion at k=60, take the top-40, and rerank with BAAI's bge-reranker-large self-hosted on a single g5.xlarge inside the customer VPC. Every chunk carries a case-id and a redaction-map id, so the audit packet can reconstruct the exact evidence the model saw at decision time.
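
As a rough shape, the two lanes reduce to a pair of parameterized queries like the ones below. The table and column names (`kyc_chunks`, `embedding`, `ts`) are assumptions for illustration; the fusion and rerank steps are sketched a little further down.

retrieval-lane sketch · typescript · table and column names are assumptions

// pgvector lane: cosine distance over the chunked KYC + case-note corpus, top-40.
// `<=>` is pgvector's cosine-distance operator when the index uses vector_cosine_ops.
const vectorLaneSql = `
  SELECT chunk_id, case_id, redaction_map_id,
         1 - (embedding <=> $1::vector) AS score
  FROM kyc_chunks
  ORDER BY embedding <=> $1::vector
  LIMIT 40`;

// Lexical lane: Postgres tsvector with English stemming, top-40.
const lexicalLaneSql = `
  SELECT chunk_id, case_id, redaction_map_id,
         ts_rank_cd(ts, plainto_tsquery('english', $1)) AS score
  FROM kyc_chunks
  WHERE ts @@ plainto_tsquery('english', $1)
  ORDER BY score DESC
  LIMIT 40`;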

The model is Claude Sonnet 4.6, called via Anthropic's API over AWS PrivateLink — the BAA is what unlocked PrivateLink, which is what kept the PAN-bearing prompts inside the customer's VPC. The output is forced to a JSON schema so the disposition is bounded to a severity enum, the evidence array is regex-pinned to chunk ids, and the rationale field can't exceed an audit-friendly length. Every claim in the rationale has to point to an evidence-chunk id from the retrieved set or the schema validator rejects the output. Claude Haiku 4.5 drafts the case-note narrative downstream of the Sonnet disposition — same evidence ids, templated by severity.
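
A hedged sketch of what that schema can look like as a Zod definition, mirroring the field names in the policy file and the sample packet further down the page. The exact production schema isn't published, so treat this as illustrative.

disposition schema sketch · typescript · illustrative, mirrors the policy file

import { z } from "zod";

export const Disposition = z.object({
  case_id: z.string(),
  severity: z.enum(["low", "med", "high", "regulatory"]), // bounded enum: no fifth severity
  decision: z.enum(["clear", "case-note", "escalate"]),
  confidence: z.number().min(0).max(1), // required, no default
  evidence: z
    .array(
      z.object({
        claim: z.string(),
        evidence_id: z.string().regex(/^chunk_[0-9a-f]{12}$/), // pinned to retrieved chunk ids
        source: z.string(),
      })
    )
    .min(1),
  reasoning: z.string().min(80).max(600), // audit-friendly length, per the policy file
});

export type DispositionT = z.infer<typeof Disposition>;

// The schema alone can't know what was retrieved; the runtime additionally
// rejects any evidence_id that isn't in the chunk set passed to the model.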

Retrieval params are tuned, not defaulted. Chunks at 512 tokens with 96-token overlap, anchored on sentence boundaries because case-note narrative has long-form sentences that lose grounding when cut mid-clause. Embeddings are voyage-3-large at 1,024 dimensions; we tested voyage-3-lite first and lost five points of recall@5 on the eval, which wasn't worth the 35% embeddings cost saving. BM25 uses Postgres tsvector with English stemming. RRF at k=60 (the paper default; we did not find a better value on the held-out slice), top-40 from each lane, deduplicated by chunk id, rerank to top-12 going into the model.
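
A minimal reciprocal-rank-fusion sketch with those parameters; the cross-encoder rerank that trims the fused list to 12 is left out, and the names are illustrative.

rrf-fusion sketch · typescript · illustrative

interface RankedChunk { chunkId: string }

// Reciprocal-rank fusion over the two lanes: score = sum of 1 / (k + rank),
// k = 60 (paper default), deduplicated by chunk id, top-40 kept for the rerank.
function rrfFuse(lanes: RankedChunk[][], k = 60, keep = 40): string[] {
  const scores = new Map<string, number>();
  for (const lane of lanes) {
    lane.forEach((chunk, i) => {
      const rank = i + 1; // 1-indexed rank within the lane
      scores.set(chunk.chunkId, (scores.get(chunk.chunkId) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, keep)
    .map(([chunkId]) => chunkId);
}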

three decisions that shaped the build
design decision · 01

Skip the LLM on the auto-clear band

we rejected
Run Claude on every transaction
because
92% of auth-boundary traffic is below the velocity-score threshold. Burning Sonnet tokens on cases the rules engine already cleared is the indefensible line in the cost math; the XGBoost short-circuit is what makes the unit economics work.
design decision · 02

Forced JSON with a severity enum

we rejected
Free-text disposition + downstream parser
because
The regulator audit needs the disposition packet to be reproducible from the trace. A schema-bounded severity enum (low | med | high | regulatory) is what makes the SAR-track decision deterministic; the model can't smuggle a fifth severity into the output.
design decision · 03

Two-eye gate on regulatory severity

we rejected
Auto-route regulatory severity to SAR queue
because
Anything that touches the FinCEN clock starts on a human signature, not a model output. The senior-analyst approval row is checked into the policy file — the runtime refuses to dispatch the escalation tool without it. We accepted a slower escalate path for an audit-defensible one.

Guardrails live as TypeScript policy files checked into the same repo as the agent — one per tool the agent can reach for. The runtime imports the policy at startup and refuses to dispatch any tool call that doesn't pass. The `escalate_case.policy.ts` file below is the actual shape we shipped; it gates the regulatory-severity branch behind a senior-analyst approval, rate-limits per-case to one dispatch, and writes a 7-year retention audit row on every call regardless of outcome. Per-claim evidence-chunk ids ride through the audit log alongside the model version, the retrieved chunks, the velocity score, the reasoning JSON, and the senior analyst's signature when applicable.

The reason this shape works is the same reason we scoped it this way at week 1. Every component has a separately measurable contract. The XGBoost velocity model is measurable in ROC-AUC + ECE on the auto-clear band. Retrieval is measurable in top-k recall on the eval set. The reranker is measurable in top-1 precision on the held-out slice. The decision model is measurable in labelled severity-correctness + groundedness. The case-note generator is measurable in regulator-audit acceptance (the senior analyst lead signs off on a 10% sample weekly). The guardrails are measurable in policy-rejection rate vs. senior-analyst-override rate. When something regresses, the per-component metric tells us which stage to look at — not a single end-to-end number that hides which subsystem broke.

Langfuse runs the trace store inside the customer VPC. Every production decision retains its velocity score, retrieval candidates, reranker scores, raw model output, parsed JSON, policy-check result, case-note draft, and final disposition. The trace store is searchable by senior-analyst-override status and is what the model-risk lead reviews weekly. It is also what we used to find the Black Friday calibration bug at week 7; the timeline section below has the honest version of how that week played out.
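
Reduced to a type, the per-decision record looks roughly like this. Field names are illustrative; Langfuse stores these as traces and observations, and this is only the set of fields the paragraph above says every decision retains.

per-decision trace sketch · typescript · field names are illustrative

interface DecisionTrace {
  traceId: string;
  caseId: string;
  modelVersion: string;
  velocityScore: number;
  retrievalCandidates: { chunkId: string; fusedScore: number }[];
  rerankerScores: { chunkId: string; score: number }[];
  rawModelOutput: string;          // unparsed Sonnet response
  parsedDisposition: unknown;      // schema-validated JSON packet
  policyVerdict: "pass" | "reject";
  caseNoteDraft?: string;          // Haiku narration, when the case is routed
  finalDisposition: "clear" | "case-note" | "escalate";
  seniorAnalystOverride: boolean;  // what the weekly model-risk review filters on
}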

under the hood

The fraud agent,
auth to audit.

Every transaction enters at the top. The XGBoost velocity score skips the LLM on the auto-clear band; anything above the threshold runs hybrid retrieval over four years of KYC and case-note corpus, reranks the evidence, and lets Claude Sonnet 4.6 produce a forced-JSON disposition. Hover any stage for its tool surface and latency budget.

outcome · ~93.4% Clear (silent) auto-pass band · audit row written · no analyst touch
outcome · ~5.1% Case-note · queue structured case file · evidence cited · 0.8 min review
outcome · ~1.5% Escalate · regulatory senior-analyst 2-eye gate · SAR-track on confirm

latency budgets are p50/p95 from a 30-day production window · end-to-end p95 inside the 2.6s decision SLA

BAA-scoped
Anthropic over AWS PrivateLink · no PAN leaves the customer VPC
0
autonomous regulatory escalations · senior-analyst signs every one
7-year
audit retention · WORM-equivalent S3 object lock · per BSA / SAR rules
shadow-first
three weeks in silent shadow against the rules engine before any cutover
deterministic replay · synthetic data

A 0.32-second window
at the auth boundary.

Eight rows from a synthetic replay tape — the same shape the production stream sees at ~38k transactions/sec peak. The agent fans out into three lanes per row: silent clear, queued case-note, or senior-analyst escalation. No real PAN, no real merchant; this is a replay viewer, not a live feed.

ts | card | merchant | amount | v-score | decision | reason
14:02:18.041 | •••• 4019 | Grocery · POS | $42.18 | 0.09 | clear | low-risk merchant · habitual
14:02:18.092 | •••• 7124 | Online retail | $1,840.00 | 0.71 | case-note | amount p99 · novel beneficiary
14:02:18.137 | •••• 3055 | Fuel · CRIND | $58.40 | 0.14 | clear | in-pattern · velocity normal
14:02:18.184 | •••• 8801 | Wire · cross-border | $9,250.00 | 0.92 | escalate | structured-pattern hit · senior-analyst 2-eye
14:02:18.226 | •••• 2236 | Streaming · sub | $14.99 | 0.04 | clear | recurring · pre-allow
14:02:18.271 | •••• 6498 | Electronics | $612.00 | 0.48 | case-note | ip-geo drift · low-confidence
14:02:18.318 | •••• 5712 | Restaurant | $78.25 | 0.11 | clear | habitual · merchant in cohort
14:02:18.366 | •••• 9043 | Crypto on-ramp | $4,500.00 | 0.86 | escalate | first-seen on-ramp · regulatory routing

replay clock advances 41 ms per row · in production, roughly 92% of rows fall in the auto-allow band (v-score < 0.18); the replay over-samples flagged rows for legibility

the stack

Named tools,
named versions.

Everything in the build is a thing the model-risk committee can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, policies, and feature definitions are all checked into the bank's repo — not ours.

Claude Sonnet 4.6 · Anthropic API · forced JSON · role: decision
Claude Haiku 4.5 · role: routing + case-note narrative
XGBoost 2.0 · role: velocity score · 142 features
pgvector 0.7 · role: embedding retrieval · KYC corpus
BM25 (Postgres tsvector) · role: lexical retrieval
BAAI bge-reranker-large · role: cross-encoder rerank · g5.xlarge in-VPC
LangGraph 0.2.x · role: agent orchestrator
Langfuse · role: per-decision trace · 90d hot / 7yr cold
AWS PrivateLink · role: in-VPC Anthropic inference · zero egress
how it actually runs

Production shape,
under the hood.

Numbers below are from the current production cut. Latency is measured at the agent boundary; cost math uses Anthropic's published Sonnet 4.6 + Haiku 4.5 pricing as of May 2026; eval composition is the frozen 412-item set the CI gates on.

Most fraud-agent case studies stop at the architecture diagram. Ours doesn't, because the two people who decide whether to sign — the model-risk lead and the head of compliance — open a case study and look for specific things: per-stage latency with p95 not just p50, a token-cost line that ties to the model vendor's published price card, a frozen eval with category-level thresholds, an honest accounting of what runs where for BAA scope, and a regulator-audit retention story. Vendors who don't show this either don't have it or are hiding it. Every number below is reproducible from a Langfuse trace, a Postgres `EXPLAIN ANALYZE`, or a published vendor price page.

latency budget

Per-stage P50 / P95 (ms)

stage | p50 | p95 | tooling
Kafka consumer + parse | 8 | 18 | Confluent · ISO 8583 superset · per-tenant partition key
XGBoost velocity score | 4 | 9 | XGBoost 2.0 · 142 features · auto-clear band short-circuit
Hybrid retrieval | 38 | 92 | pgvector cosine top-40 ∥ tsvector BM25 top-40 → RRF k=60
Cross-encoder rerank | 62 | 138 | BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-12
Claude Sonnet 4.6 decision | 1,420 | 2,080 | Anthropic API over AWS PrivateLink · ~3,200 in / ~420 out tokens
Claude Haiku 4.5 case-note | 780 | 1,180 | narrates from Sonnet's evidence ids · ~1,100 in / ~340 out
Policy + 2-eye + audit log | 11 | 22 | TypeScript runtime · Zod schema · WORM-equivalent audit row
Total (LLM-routed path) | 2,323 | 3,537 | agent boundary · ~8% of traffic; auto-clear path < 50ms total

p50/p95 from a 30-day rolling window over n ≈ 28,400 LLM-routed decisions / mo (~92% of traffic short-circuits before the LLM call). SLO is p95 ≤ 3,500 ms on the LLM-routed path. Current burn ≈ 101% — we're in active tuning on the reranker timeout to bring the tail in; the SpecGrid above doesn't lie about a number we haven't shipped yet.

The retrieval lane is where the per-stage tuning compounded. The corpus is roughly 2.4 million chunks at 512 tokens each over four years of KYC artefacts and historical case notes; chunk anchoring on sentence boundaries is what kept recall@5 stable as the corpus grew. We picked voyage-3-large at 1,024 dimensions because Voyage offered a BAA at parity pricing to voyage-3-lite, and the lite variant lost five points of recall@5 on the eval. RRF at k=60 (paper default; the held-out slice didn't move on alternatives we tested), top-40 from each lane, deduplicated by chunk id, reranked to top-12 going into the model. Eval-set recall@5 after fusion + rerank is 0.93; recall@1 is 0.81 — high enough that the model's first cited chunk almost always reflects the load-bearing evidence in the disposition.
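
Recall@k on the frozen eval is the straightforward check below; a sketch that assumes each eval item carries the gold chunk ids the senior analyst lead labelled.

recall-at-k sketch · typescript · data shapes are assumptions

interface RetrievalEvalItem {
  goldChunkIds: string[];      // chunks the senior analyst lead says should ground the call
  retrievedChunkIds: string[]; // post-fusion, post-rerank order
}

// Share of eval items where at least one gold chunk appears in the top k.
function recallAtK(items: RetrievalEvalItem[], k: number): number {
  const hits = items.filter((item) => {
    const topK = item.retrievedChunkIds.slice(0, k);
    return item.goldChunkIds.some((gold) => topK.includes(gold));
  }).length;
  return hits / items.length;
}

// On the current cut: recallAtK(items, 5) ≈ 0.93 and recallAtK(items, 1) ≈ 0.81.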

triage/tools/escalate_case.policy.ts typescript
// triage/tools/escalate_case.policy.ts
//
// Every write tool the agent can reach for has a policy file.
// The runtime imports these at startup and refuses to dispatch
// any tool call that doesn't pass. Regulatory severity needs a
// senior-analyst signature before the SAR-track is touched.

import { Policy } from "@gw/agent-runtime";

export const escalate_case: Policy = {
  description: "Send a flagged transaction to the human-review queue.",
  inputs: {
    case_id:     "uuid, exists in cases table, has not been escalated",
    severity:    "enum: low | med | high | regulatory",
    evidence:    "array, min 1, items have {claim, evidence_id, source}",
    confidence:  "number in [0,1]; required, no default",
    reasoning:   "string, 80–600 chars, grounded in evidence",
  },
  preconditions: [
    "agent.confidence_calibrated === true",
    "transaction.amount > 0",
    "no_pending_escalation_for(case_id)",
    "every(evidence, e => retrieval.contains(e.evidence_id))",
  ],

  rate_limits: { perAgent: "30/min", perCase: "1" },

  audit: {
    redact:    ["pan", "ssn", "iban", "routing"],
    retain:    "7y",
    store:     "s3:bsa-audit-log/worm",  // WORM-equivalent object lock
    log_shape: ["case_id", "severity", "evidence", "model_version",
                "retrieval_chunks", "policy_verdict", "approver"],
  },

  // Two-eye rule. Regulatory severity needs a senior-analyst sign-off
  // before the runtime dispatches the SAR-track integration.
  approval: {
    required: ({ severity }) => severity === "regulatory",
    approver: "role:senior-analyst",
    deadline_mins: 30, // ages back to the queue with an "aged out" tag
  },
};
The policy file for the regulatory-escalation tool. The runtime imports it at startup and refuses to dispatch a tool call that doesn't pass. Two-eye rule is enforced in code, not a config flag — the same pattern ships on every write tool in the agent.
unit economics

Per-decision and monthly cost math

line item | $ / decision | $ / month (≈ 28k LLM-routed decisions) | note
Claude Sonnet 4.6 — input | $0.0096 | $269 | 3,200 tokens × $3.00 / 1M
Claude Sonnet 4.6 — output | $0.0063 | $176 | 420 tokens × $15.00 / 1M
Claude Haiku 4.5 — case-note | $0.0008 | $22 | 1,100 in + 340 out at Haiku pricing
voyage-3-large embeddings | $0.0006 | $17 | ≈ 5,000 tokens × $0.12 / 1M
pgvector + RDS db.r6i.xlarge | | $612 | BAA-scoped Postgres · pgvector + tsvector
g5.xlarge reranker (24/7) | | $378 | BAAI bge-reranker-large self-host
AWS PrivateLink + endpoints | | $96 | Anthropic in-VPC inference
Langfuse self-hosted (t3.large) | | $104 | trace store · 90d hot / 7yr cold
All-in monthly | ≈ $0.061 | ≈ $1,674 | vs. ≈ $14 × 6k cases/mo = $84k legacy review-prep

Token costs use Anthropic's public Sonnet 4.6 + Haiku 4.5 pricing as of May 2026 — $3 / 1M input, $15 / 1M output on Sonnet; $0.80 / 1M input, $4 / 1M output on Haiku. Infra costs are AWS US-east-2 list price; the bank paid less under an EDP. The legacy comparison line is the bank's own per-case review-prep cost × the routed-cases volume — the agent doesn't replace analyst time at the decision boundary, it compresses the review-prep half of the case workload.
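
The two Sonnet lines reproduce directly from that price card; a quick arithmetic check, using only the token counts and prices already quoted on this page.

unit-economics check · typescript · reproduces the Sonnet lines from the table above

const SONNET_IN_PER_MTOK = 3.0;    // $ per 1M input tokens
const SONNET_OUT_PER_MTOK = 15.0;  // $ per 1M output tokens
const DECISIONS_PER_MONTH = 28_000;

const inPerDecision = (3_200 / 1_000_000) * SONNET_IN_PER_MTOK;  // ≈ $0.0096
const outPerDecision = (420 / 1_000_000) * SONNET_OUT_PER_MTOK;  // ≈ $0.0063

console.log(inPerDecision.toFixed(4), (inPerDecision * DECISIONS_PER_MONTH).toFixed(0));   // 0.0096 269
console.log(outPerDecision.toFixed(4), (outPerDecision * DECISIONS_PER_MONTH).toFixed(0)); // 0.0063 176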

eval composition

What's in the frozen 412-item set

category | items | what it checks | ci-gate threshold
Severity-decision golds | 100 | labelled disposition + severity band on real (de-id) past cases | ≥ 0.95 precision @ 1% FPR
Evidence groundedness | 120 | every rationale claim points to a retrieved chunk id that supports it | ≥ 0.93 groundedness
Retrieval recall | 80 | correct case + policy chunks in top-5 after RRF + rerank | ≥ 0.90 recall@5
Refusal / adversarial | 60 | structured-pattern hits, jailbreak attempts, OOD merchant categories | 100% refusal on must-refuse
Calibration golds | 52 | confidence-vs-correctness on held-out cases · ECE check | ECE ≤ 0.04

Eval set is frozen — items only added, never edited. The senior analyst lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. The Black Friday holiday-window slice (added at week 7) became a 5th fold and is now permanent.
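
The category gate reduces to a small check in CI; a sketch that assumes `1 point` means 0.01 on the 0-1 metric scale, with data shapes that are illustrative rather than the shipped harness.

ci-gate sketch · typescript · illustrative

interface CategoryScore { category: string; score: number }

// Fails the release if any eval category drops more than a point from the
// prior cut; the signed-CHANGELOG override path is a human step, not code.
function gateRelease(current: CategoryScore[], prior: CategoryScore[]) {
  const priorByCat = new Map(prior.map((c) => [c.category, c.score]));
  const regressions = current.filter((c) => {
    const before = priorByCat.get(c.category);
    return before !== undefined && before - c.score > 0.01;
  });
  return { pass: regressions.length === 0, regressions: regressions.map((c) => c.category) };
}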

Production ops cadence is part of the build, not an afterthought. The senior analyst lead and our on-call engineer hold a weekly override-review meeting where every case in which the agent's disposition differed from the analyst's override gets opened — any drift that looks systematic (more than three of the same pattern in a week) becomes a JIRA ticket against the eval set and a candidate fine-tune slice. Langfuse trace retention is 90 days hot in the customer VPC plus seven years cold in BAA-scoped S3 with WORM-equivalent object lock, matching the BSA's seven-year record-retention requirement and the bank's internal SAR documentation policy. Our on-call rotation runs two engineers a week against a 99.5% pipeline-availability SLO and the 95th-percentile-under-3.5s decision SLO on the LLM-routed path. The model-risk lead pulls a 50-row audit-log sample every month covering velocity score, retrieval candidates, reranker scores, raw model output, parsed JSON, policy verdict, and senior-analyst approval. Nothing in this section is published anywhere else by any vendor shipping fraud agents on Claude — that's the bar.

interactive · drag the threshold

Precision vs FPR,
where you stand the line.

Where the team stands the threshold is a policy choice, not a model property. Drag the marker to see how precision, recall, and per-month false-positive volume move together. We anchor production at 1% FPR — the ops team's documented ceiling for analyst review-prep load.
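
What the slider computes is the standard precision / recall / FPR triplet at a chosen threshold; a sketch, assuming the scored eval cases carry the senior analyst's fraud label.

op-point sketch · typescript · data shapes are assumptions

interface ScoredCase { score: number; isFraud: boolean }

// Precision / recall / FPR at a given threshold; production anchors FPR at 1%.
function opPoint(cases: ScoredCase[], threshold: number) {
  let tp = 0, fp = 0, fn = 0, tn = 0;
  for (const c of cases) {
    const flagged = c.score >= threshold;
    if (flagged && c.isFraud) tp++;
    else if (flagged && !c.isFraud) fp++;
    else if (!flagged && c.isFraud) fn++;
    else tn++;
  }
  return {
    precision: tp + fp > 0 ? tp / (tp + fp) : 0,
    recall: tp + fn > 0 ? tp / (tp + fn) : 0,
    fpr: fp + tn > 0 ? fp / (fp + tn) : 0,
  };
}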

precision recall

curve fitted from the frozen 412-item eval set · production op-point at 1% FPR · move the thumb with mouse, touch, or arrow keys

11 weeks · honest version

The timeline,
including the week we halted.

Five stages, milestone-billed. The week-7 Black Friday shadow surfaced a calibration drift on a holiday-shopping velocity pattern that the eval set hadn't seen. We halted cutover, ingested the holiday-window slice as a new eval fold, re-fit the calibration head, and only then promoted to primary screen. The honest version of `11 weeks` includes the week we ran the sweep.

  1. Weeks 1–2

    Discovery + frozen eval set

    Two weeks shadowing the fraud-ops team. The senior analyst lead labelled 412 frozen eval items drawn from 18 months of (de-identified) past cases — each carrying a labelled correct disposition + the rationale + the evidence chunks that should ground the call. We wrote the harness; the ops team wrote the answers. Scoping decision: the deliverable is a structured-output agent, not a chatbot, and the eval gate is non-negotiable.

    412-item frozen eval + severity-band rubric · scope memo signed
  2. Weeks 3–4

    Corpus + velocity score + retrieval

    Ingested four years of KYC artifacts and historical case notes into pgvector 0.7 inside the customer VPC. BM25 sidecar over the same chunks. XGBoost 2.0 velocity model trained against the labelled fraud history with 142 features; calibrated with isotonic regression on a held-out slice. Reciprocal-rank fusion tuned on the eval slice; cross-encoder rerank wired in when top-1 recall plateaued.

    Hybrid retrieval at 0.93 top-5 recall · velocity ECE 0.041
  3. Weeks 5–6

    Agent skeleton + policy-as-code

    LangGraph 0.2.x agent with three read-only tools (case lookup, KYC pull, structured-pattern check) and two write tools (case-note write, escalation dispatch). Every tool carries a policy file in `triage/tools/`. Forced-JSON disposition bound to the severity schema. Two-eye gate baked into the runtime; senior-analyst approval is what unblocks the regulatory-severity branch.

    End-to-end pipeline behind a feature flag · BAA + PrivateLink wired
  4. Week 7

    Black Friday shadow — calibration drift caught

    Three weeks of silent shadow against the live rules engine. Day 9 was Black Friday, and a holiday-shopping pattern surfaced that nobody had labelled in the eval set — a structurally novel velocity pattern from gift-card top-ups that the model was over-confidently clearing as legit. We halted cutover, ingested the holiday-window slice as a fresh-data fold, re-fit the calibration head, and re-ran the full eval. The honest version of `shipped on time` includes the week we sat on our hands and ran the calibration sweep.

    ECE recalibrated from 0.067 → 0.028 on the Black Friday-augmented eval slice
    Walk-away point
  5. Weeks 8–11

    Cutover + SAR-track integration

    Promoted to primary screen with the rules engine in active-standby. Compliance reviewed the audit-log packet end-to-end; FinCEN SAR-track integration tested against the bank's e-filing path. Four ops-team training sessions on the case-note acceptance flow + the two-eye approval surface. PagerDuty wired to the regulatory-severity lane. Old rules engine stays in active-standby for 60 days post-cutover; every diff between agent + rules logged for the model-risk lead's weekly review.

    Production cutover · SAR-track audited · model-risk committee sign-off
eval results · 412 frozen items

How we know
it works.

The eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 412. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2% across all rows over the last 30 days.

metric | baseline (wk 2) | v1 (wk 5) | v2 (wk 6) | current (live) | target
Precision @ 1% FPR | 0.812 | 0.928 | 0.952 | 0.962 | ≥ 0.95
Recall on labelled fraud | 0.681 | 0.812 | 0.864 | 0.881 | ≥ 0.85
Calibration (ECE) | 0.094 | 0.067 | 0.039 | 0.028 | ≤ 0.04
Case-note groundedness | | 0.88 | 0.94 | 0.96 | ≥ 0.93
Refusal rate | | 12.4% | 10.1% | 8.6% | 8–12%
P95 time-to-disposition | | 3.4s | 2.9s | 2.6s | ≤ 3.0s

Sample size for the ≥ 0.96 precision figure is n=412 frozen eval items + the production confirmation slice n ≈ 1,840 cases reviewed by the senior analyst over the first 30 days post-cutover. Confidence interval is ±0.012 on the precision at the 1% FPR op-point. ECE is expected calibration error on the labelled set. P95 is end-to-end on the LLM-routed path (auto-clear path is under 50ms total). Refusal rate is the share of inputs where the agent legally cannot decide and routes straight to a senior analyst — by design, not by failure. Note: refusal rate v1 → v2 → current is not monotone-improving by design; we tuned the refusal-threshold up in v2 after the senior analyst lead flagged that v1 was clearing borderline cases that should have routed for review.
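
ECE here is the usual binned gap between stated confidence and observed correctness; a sketch with ten equal-width bins, which is an assumption rather than the production binning.

ece sketch · typescript · binning is an assumption

interface CalItem { confidence: number; correct: boolean }

// Expected calibration error: per-bin |mean confidence - observed accuracy|,
// weighted by bin size. The CI gate requires ECE <= 0.04 on the calibration golds.
function expectedCalibrationError(items: CalItem[], bins = 10): number {
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    const lo = b / bins;
    const hi = (b + 1) / bins;
    const inBin = items.filter(
      (i) => i.confidence >= lo && (i.confidence < hi || (b === bins - 1 && i.confidence === 1))
    );
    if (inBin.length === 0) continue;
    const avgConf = inBin.reduce((s, i) => s + i.confidence, 0) / inBin.length;
    const accuracy = inBin.filter((i) => i.correct).length / inBin.length;
    ece += (inBin.length / items.length) * Math.abs(avgConf - accuracy);
  }
  return ece;
}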

artefact diff · synthetic case

What the ops team reads
when a case routes for review.

The 8.0 → 0.8 minute delta isn't a tooling story. It's an artefact story. On the left: what an analyst produced manually — narrative-heavy, hard to skim, citations buried in the prose. On the right: the agent's structured packet — same evidence, surfaced as fields the regulator-audit reviewer reads in seconds.

before · manual write-up ≈ 8 min / case

Case # FA-2026-04-11-0917
Reviewer: J. Reyes · 11 Apr 2026 14:22 UTC

Customer (PAN ending 8801) initiated a wire transfer of $9,250.00 USD to a beneficiary account first seen on the platform on 09 Apr 2026. Reviewing prior 90-day activity for this customer shows wire activity concentrated to two prior beneficiaries (relatives, KYC-verified, historical pattern stable). The new beneficiary is an LLC registered in a jurisdiction with elevated SAR-correlation per our internal scoring (referenced in policy doc PL-WIRE-2024 §4.2).

Velocity score on this transaction was 0.92 (model output, see ML-FRAUD-3.4 dashboard). Cross-checking against our case-history corpus, three structurally similar cases have been reviewed in the past 18 months; two were confirmed-fraud, one cleared after additional context. The originating IP geo (Newark, NJ) is consistent with the cardholder's historical pattern.

Recommendation: escalate for regulatory review. Senior analyst sign-off required per the two-eye policy on structured-pattern hits. Note: cardholder not yet contacted — pending compliance lead approval for outbound.

after · agent-generated ≈ 0.8 min / case
{
  "case_id":   "FA-2026-04-11-0917",
  "severity":  "regulatory",
  "decision":  "escalate",
  "confidence": 0.91,

  "evidence": [
    {
      "claim": "Beneficiary first-seen 09 Apr 2026; not in cardholder's 90d graph.",
      "evidence_id": "chunk_a4f0c12b9e44",
      "source": "ledger.beneficiary_first_seen"
    },
    {
      "claim": "LLC jurisdiction matches PL-WIRE-2024 §4.2 elevated-SAR list.",
      "evidence_id": "chunk_71d33e0a4c8b",
      "source": "policy.PL-WIRE-2024"
    },
    {
      "claim": "Velocity score 0.92; 3 structurally similar cases in the corpus.",
      "evidence_id": "chunk_e8b290745f01",
      "source": "ml.velocity + case-history"
    }
  ],

  "two_eye_required": true,
  "approver_role":    "role:senior-analyst",
  "sar_track":        true,
  "audit_retain_yrs": 7
}

both artefacts are synthetic · case-id, beneficiary, and PAN-last-4 are illustrative · the agent packet is what the regulator-audit reviewer reads, not the prose

Ready to ship

Want a case study like this
for your fraud-ops floor?

Book a $3K fixed-fee audit. We'll review the workflow, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether it's case-study-shaped — and whether the regulatory posture is ready to support a build. About one audit in five ends with `the legal posture isn't ready yet, here's the 90-day prep plan.`

Read the fintech pillar
30 min, async or live · Eval-first scoping · Walk-away point in the pilot