The architecture below is the production shape. The Kafka topic at the top is the bank's existing auth-boundary stream — we did not move it, did not re-process historical data, did not write a sidecar event store. Every transaction enters in flight, gets an XGBoost velocity score in under 9 ms at p95, and exits to one of three lanes within the 2.6-second decision SLA. Roughly 92% of traffic short-circuits the LLM entirely on the auto-clear band (velocity score below 0.18) — the math behind that decision is in the unit-economics SpecGrid further down. The remaining ~8% is what the LLM sees.
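The routing itself reduces to a threshold check on the velocity score. A minimal sketch — only the 0.18 auto-clear band comes from this article; the lane names and the existence of a single catch-all LLM lane here are illustrative simplifications of the three-lane split:

```python
# Auto-clear band from the article; lane names are hypothetical.
AUTO_CLEAR_MAX = 0.18  # ~92% of traffic lands below this

def route(velocity_score: float) -> str:
    """Map an XGBoost velocity score to a decision lane."""
    if velocity_score < AUTO_CLEAR_MAX:
        return "auto_clear"   # short-circuits the LLM entirely
    return "llm_review"       # the ~8% tail the LLM actually sees
```

The point of the band is economic, not architectural: the LLM path only pays for itself on the ambiguous tail.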
Retrieval is hybrid. pgvector 0.7 sits on the embedding side over a 4-year KYC + case-note corpus (~ 2.4M chunks at 512 tokens, anchored on sentence boundaries); a BM25 sidecar built on Postgres `tsvector` sits on the lexical side. We fuse with reciprocal-rank fusion at k=60, take the top-40, and rerank with BAAI's bge-reranker-large self-hosted on a single g5.xlarge inside the customer VPC. Every chunk carries a case-id and a redaction-map id, so the audit packet can reconstruct the exact evidence the model saw at decision time.
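Reciprocal-rank fusion is simple enough to show inline. A sketch of the fusion step, assuming each lane returns an ordered list of chunk ids (the k=60 constant and top-40 cut are from the text; everything else is generic RRF):

```python
def rrf_fuse(vector_ranked: list, bm25_ranked: list,
             k: int = 60, top_n: int = 40) -> list:
    """Reciprocal-rank fusion: score(d) = sum over lists of 1 / (k + rank).

    A chunk that appears in both lanes accumulates two terms, so agreement
    between the embedding and lexical sides pushes it up the fused list.
    """
    scores: dict = {}
    for ranked in (vector_ranked, bm25_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]  # this fused top-40 then goes to the reranker
```

Fusion on ids rather than scores also handles the unit mismatch between cosine similarity and BM25 weights for free, which is why RRF needs no per-lane normalization.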
The model is Claude Sonnet 4.6, called via Anthropic's API over AWS PrivateLink — the BAA is what unlocked PrivateLink, which is what kept the PAN-bearing prompts inside the customer's VPC. We use `response_format: json_schema` so the disposition is bounded to a severity enum, the evidence array is regex-pinned to chunk ids, and the rationale field can't exceed an audit-friendly length. Every claim in the rationale has to point to an evidence-chunk id from the retrieved set or the schema validator rejects the output. Claude Haiku 4.5 drafts the case-note narrative downstream of the Sonnet disposition — same evidence ids, templated by severity.
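The schema-plus-validator pattern is worth making concrete. A sketch of what the bounded output and the evidence check could look like — the field names, severity levels, chunk-id pattern, and length cap are all illustrative assumptions, not the production schema:

```python
import re

# Hypothetical schema shape; enum values, pattern, and maxLength are
# illustrative, not the production values.
DISPOSITION_SCHEMA = {
    "type": "object",
    "properties": {
        "severity": {"enum": ["clear", "review", "escalate"]},
        "evidence": {
            "type": "array",
            "items": {"type": "string", "pattern": "^chunk-[0-9a-f]{8}$"},
            "minItems": 1,
        },
        "rationale": {"type": "string", "maxLength": 600},
    },
    "required": ["severity", "evidence", "rationale"],
}

CHUNK_ID = re.compile(r"^chunk-[0-9a-f]{8}$")

def validate_evidence(output: dict, retrieved_ids: set) -> bool:
    """Reject any disposition citing a chunk the model was not shown.

    The schema pins the id *format*; this check pins ids to the actual
    retrieved set, which is the part a JSON-schema validator cannot do.
    """
    evidence = output.get("evidence", [])
    return bool(evidence) and all(
        CHUNK_ID.match(cid) and cid in retrieved_ids for cid in evidence
    )
```

The split matters: the schema validator enforces shape at generation time, while the set-membership check runs post-hoc against the retrieval log, and a failure on either side rejects the output before it reaches the case file.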
Retrieval params are tuned, not defaulted. Chunks at 512 tokens with 96-token overlap, anchored on sentence boundaries because case-note narrative has long-form sentences that lose grounding when cut mid-clause. Embeddings are voyage-3-large at 1,024 dimensions; we tested voyage-3-lite first and lost five points of recall@5 on the eval, which wasn't worth the 35% embeddings cost saving. BM25 uses Postgres tsvector with English stemming. RRF at k=60 (the paper default; we did not find a better value on the held-out slice), top-40 from each lane, deduplicated by chunk id, rerank to top-12 going into the model.
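Sentence-boundary anchoring with token overlap can be sketched as a greedy packer. A minimal version, assuming sentences are already split upstream and using whitespace word counts as a stand-in for a real tokenizer (the 512/96 figures are from the text; the algorithm itself is an illustrative reconstruction, not the production chunker):

```python
def chunk_sentences(sentences: list, max_tokens: int = 512,
                    overlap_tokens: int = 96) -> list:
    """Greedy sentence-anchored chunker.

    Fills each chunk up to max_tokens whole sentences at a time, then seeds
    the next chunk with the trailing sentences covering ~overlap_tokens, so
    no sentence is ever cut mid-clause.
    """
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # whitespace proxy for a real tokenizer
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing whole sentences back as overlap.
            carry, carry_len = [], 0
            for s in reversed(current):
                carry_len += len(s.split())
                carry.insert(0, s)
                if carry_len >= overlap_tokens:
                    break
            current, current_len = carry, carry_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the overlap is counted in tokens but paid out in whole sentences, the actual overlap floats slightly above 96 — the cost of never losing a clause at a boundary.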