Legal · Mid-market US law firm · RAG + LangChain · forced-JSON clause risk
Claude Sonnet 4.6 · role: clause-risk model · forced JSON
LangChain 0.3 · role: orchestrator + retrieval glue
LangGraph 0.2 · role: clause-by-clause chain · shared scratchpad
pgvector 0.7 · role: embedding retrieval over reconciled clause library
bge-reranker-large · role: cross-encoder rerank · A/B'd vs Cohere
iManage Work · role: document management · matter-scoped pull
case study · 2026 · anonymized

A RAG case study: first-pass MSA review
for a mid-market law firm.

A US-based mid-market law firm with four practice groups needed a first-pass MSA review layer that could split a contract into clauses, retrieve the matching policies and precedents from a reconciled clause library, flag clause risk against the firm's playbook with cited policy ids, and refuse — out loud — on novel patterns. We built it on Claude Sonnet 4.6, LangChain 0.3, LangGraph 0.2, a hybrid pgvector + BM25 index, and a forced-JSON clause-risk schema. Nine weeks, partner-shadow-first, with a clause-library drift kill point at week 4 that we paused the build for.

≈ 71%
first-pass MSA review time saved · partner-signed-off (95% CI · n=180 MSAs across 6 months)
p95 62s
MSA wall-clock to full clause-risk report · meets <90s service target
740
frozen clause-eval items · re-run on every release
9 weeks
discovery to production cutover
shipped
9 weeks · 4 engineers · 4 senior counsel (one per practice group)
6–9 hrs
partner first-pass review per MSA · pre-build
180 / yr
MSAs flowing through the 4 practice groups
4
practice groups · M&A · employment · real estate · IP
11%
post-execution disputes traced to inconsistent first-pass clause calls
the problem

Four practice groups,
four contradictory playbooks.

Partners spending 6–9 hours per MSA on first-pass review. Clause-library drift across practice groups producing inconsistent calls. Downstream dispute rate tracing back to the drift.

The client is a US-based mid-market law firm — roughly seventy attorneys across four practice groups (M&A, employment, real estate, IP), with a vendor-side MSA review pipeline that handled roughly 180 contracts a year across the four groups combined. Like most mid-market firms, they sit at the painful middle of the legal-tech market: too small to fund an Ironclad/Spellbook seat for every partner, too large for the managing partner to hand-review every MSA. The pre-build process was partner-by-partner, playbook-by-playbook, and the playbooks had quietly drifted apart.

today vs · with the agent

today

MSA arrives
Partner reads cover-to-cover
Cross-checks against 4 playbooks
drift across practice groups
Marks up redline by hand
outcome
6–9 hr first-pass per MSA · inconsistent calls across partners · 11% downstream dispute rate

with the agent

MSA upload
Semantic clause split
Hybrid clause-RAG + policy lookup
reconciled clause library
Sonnet 4.6 clause-risk JSON
policy_id + precedent_ids enforced
outcomes
Acceptable · partner skim
Negotiate · redline drafted
Block / manual review

The presenting symptoms broke down two ways. First, the wall-clock: partners were spending six to nine hours per MSA on first-pass review — reading cover-to-cover, cross-checking each clause against their practice-group playbook, then producing a hand-marked redline. The senior partners doing the most reviews were also the ones billing the highest rates; the math was untenable and getting worse. Second, the quality cost of the drift: a post-engagement audit by the firm's general counsel had traced eleven percent of post-execution disputes back to inconsistent first-pass clause calls — same fact pattern, opposite playbook calls between practice groups, sometimes opposite calls between partners in the same group. The general counsel's exact phrasing in the discovery call was: "the partners trust their own work and trust each other's; what they don't trust is that we're all working from the same playbook anymore."

They had looked at Ironclad, Spellbook, and three smaller LegalTech vendors and turned every one of them down. The objections were operator-grade: no vendor could load the firm's reconciled clause library because there wasn't one to load; no vendor surfaced a `policy_id` against each redline suggestion that partners could verify; no vendor would publicly show the eval data behind their headline numbers; no vendor had a refusal lane that named "this is a novel pattern, do not let the agent decide." The conversation we walked into was not "should we ship LangChain" — it was "show us how a clause-risk agent could miss a hard-no, and tell us how you'd catch it before a partner sends the redline to opposing counsel."

That framing decided the engagement. We refused to scope this as a vendor-tool integration. The deliverable was a structured-output clause-risk agent with a policy-citation contract, a four-band risk output where the bands were defensible to a client, a manual-review refusal lane that was first-class rather than an error case, and a reconciled clause library built by the firm's own senior counsel — not by us. The rest of the page is what we shipped.

the approach

Six stages,
four risk bands.

Every clause runs the same six-stage chain. Stage 4 fans out into two parallel retrieval lanes — the reconciled clause library and the house-policy index — before converging on a forced-JSON clause-risk model. The four risk bands at the bottom are the only legal outputs.

The architecture below is the production shape, not a marketing diagram. MSA uploads route through iManage Work via the firm's existing matter-scoped OAuth flow — the agent reads only documents the responsible attorney has matter-level access to, and every read is logged. The semantic clause splitter is the load-bearing detail: each clause is bounded at its section heading (§N.M, with sub-clause merge logic that respects cross-references like `subject to Section 8.3`), tagged with a clause_id that survives editorial reshuffling, and carries the surrounding context window the model needs to reason about it without being polluted by neighboring clauses.
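As a concrete illustration of that boundary rule, here is a minimal splitter sketch. The file name, heading regex, and hash-derived clause_id are illustrative assumptions, not the production implementation.

ingest/split-clauses.ts typescript
// Illustrative sketch of the heading-aware splitter, not the production module.
// Boundary = a §N or §N.M heading at the start of a line. Deeper §N.M.K
// headings merge into their parent clause, and inline cross-references like
// "subject to Section 8.3" never match the start-of-line pattern, so they stay
// inside the clause that cites them.
import { createHash } from "node:crypto";

const HEADING = /^§?\s*(\d+(?:\.\d+)*)[.\s]/; // "§ 7.2 Indemnification", "3.1 Term"

export interface Clause {
  clause_id: string; // content-derived, so reordering clauses does not change it
  section: string;   // "7.2"
  text: string;
}

export function splitClauses(msaText: string): Clause[] {
  const clauses: Clause[] = [];
  let current: { section: string; lines: string[] } | null = null;

  for (const line of msaText.split("\n")) {
    const m = line.match(HEADING);
    if (m && m[1].split(".").length <= 2) {
      // a real top-level §N.M heading opens a new clause
      if (current) clauses.push(finish(current));
      current = { section: m[1], lines: [line] };
    } else if (current) {
      // sub-clauses (§N.M.K) and continuation lines merge into the open clause
      current.lines.push(line);
    }
  }
  if (current) clauses.push(finish(current));
  return clauses;
}

function finish(c: { section: string; lines: string[] }): Clause {
  const text = c.lines.join("\n").trim();
  const hash = createHash("sha256").update(`${c.section}\n${text}`).digest("hex");
  return { clause_id: `cl_${hash.slice(0, 12)}`, section: c.section, text };
}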

Retrieval is hybrid and fans out into two parallel lanes at stage 4. The clause-RAG lane runs pgvector 0.7 + Postgres tsvector BM25 over the reconciled clause library (1,420 unique reference clauses post-reconciliation, down from 1,840 pre-reconciliation), fuses with reciprocal-rank fusion at k=60, dedupes by clause_id, and reranks with BAAI's bge-reranker-large self-hosted on a single g5.xlarge in the firm's tenant. The policy-lookup lane runs a separate regex-validated index over house policy documents — every policy carries an id of shape `policy_(practice-group)-(NNN)` (e.g. policy_IP-014, policy_MA-203) — and is practice-group-aware: IP clauses route to IP policies first, real estate clauses route to real estate policies first, with cross-practice retrieval as a deliberate fallback rather than a default. Running the two lanes as parallel branches rather than a single fused retrieval is the thing the senior counsel specifically asked for during reconciliation — they wanted the policy citation and the precedent retrieval to be visibly independent, not blended.
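The fusion step in the clause-RAG lane is plain reciprocal-rank fusion at k=60, deduped by clause_id before the reranker sees the candidates. A minimal sketch, assuming an illustrative candidate type and a top-24 handoff to the cross-encoder (the handoff size is not stated in the build notes):

retrieval/rrf-fuse.ts typescript
// Reciprocal-rank fusion over the two clause-RAG candidate lists:
// score(d) = sum over lanes of 1 / (k + rank_lane(d)), with k = 60.
// The Map keyed by clause_id doubles as the dedupe step.
interface Candidate {
  clause_id: string;
}

export function rrfFuse<T extends Candidate>(
  lanes: T[][],    // e.g. [pgvectorTopK, tsvectorTopK], each already rank-ordered
  k = 60,          // paper default; the build kept it after tuning found nothing better
  topN = 24,       // handoff size to the cross-encoder (assumed, not from the build notes)
): T[] {
  const fused = new Map<string, { item: T; score: number }>();

  for (const lane of lanes) {
    lane.forEach((item, idx) => {
      const entry = fused.get(item.clause_id) ?? { item, score: 0 };
      entry.score += 1 / (k + (idx + 1));
      fused.set(item.clause_id, entry);
    });
  }

  return [...fused.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((e) => e.item);
}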

During build, we A/B-tested two rerankers — BAAI's bge-reranker-large and Cohere Rerank v3 — on the held-out clause-eval slice. bge won by roughly three points on top-1 precision over the legal corpus; we shipped bge as primary and kept Cohere wired as a runtime fallback so the firm has a swap-out path if bge ever degrades. Both rerankers' top-1 precisions are logged in Langfuse per-decision so the comparison can be re-run on a fresh slice at any time.

The decision step is Claude Sonnet 4.6 with `response_format: json_schema` set to the ClauseRisk shape. The model has zero write tools — it cannot send anything to counterparties, modify documents in iManage, or finalize a redline. All it produces is a JSON object: the clause_id, the risk band (one of four enum values), an array of rationale entries each tied to a policy_id (regex-enforced via Zod) plus up to eight precedent_ids, and an optional suggested_redline string. Every rationale claim has to cite a policy_id whose regex matches and whose pathway resolves to a live policy document, or the validator rejects. Confidence below 0.8 routes the clause to manual-review regardless of band — partner sees the manual-review marker on the redline draft and reads the clause themselves.

three decisions that shaped the build
design decision · 01

Forced-JSON clause-risk schema with policy-id regex

we rejected
Free-text redline summary
because
Every flagged clause has to cite a policy_id matching the regex policy_(practice-group)-(NNN). The Zod validator is the contract; the model can't suggest a redline without naming the policy it traces to. Partners check the policy, not the model.
design decision · 02

Four risk bands · acceptable / negotiate / block / manual-review

we rejected
Continuous risk score 0–1
because
Partners read bands, not scores. A 0.71 risk score is harder to defend in front of a client than `negotiate per policy IP-014`. The band-based output also gates queue routing — block clauses page the supervising partner; manual-review clauses leave a human-only marker on the redline.
design decision · 03

Reconcile the 4 clause libraries before any agent build

we rejected
Train per-practice-group models on each library as-is
because
Discovery week 4 found that M&A and real estate had contradictory standard indemnification clauses — same fact pattern, opposite playbook calls. An agent trained on the drift would inherit it. We paused the build for a 2-week reconciliation pass with senior counsel from each group, then resumed.

Guardrails live as Zod schemas and TypeScript runtime checks checked into the same monorepo as the agent. The policy layer enforces a never-autonomous rule on any clause routed to a counterparty (the agent produces redlines; partners send them), a block-band notification rule that pages the supervising partner before the partner-final-pass step, and a per-decision audit log that retains the retrieved candidates from both lanes, the reranker scores, the model's raw output, the parsed JSON, the schema-validation verdict, and the partner override (if any) — searchable in Langfuse by clause_id and by partner-override status. The override-review meeting is the cadence that keeps the agent honest: senior counsel from each practice group sits with our on-call engineer once a week, walks the disagreements, and patterns that show up three times become eval-set additions.
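A minimal sketch of the band-routing rule as a runtime check over the ClauseRisk shape shown further down the page. The queue names, the notifier, the import path, and the confidence parameter (which is not a field of the schema itself) are illustrative assumptions:

guardrails/route-clause.ts typescript
// Illustrative sketch of the band-routing rule, not the production policy layer.
import type { ClauseRisk } from "../contracts/schema/clause-risk";

type Queue = "partner-skim" | "redline-draft" | "blocked" | "manual-review";

export function routeClause(
  decision: ClauseRisk,
  confidence: number,                      // chain confidence; not a field of the schema
  pageSupervisingPartner: (msg: string) => Promise<void>,
): Queue {
  // Confidence below 0.8 overrides the band: the partner reads the clause themselves.
  if (confidence < 0.8) return "manual-review";

  switch (decision.risk) {
    case "block":
      // Block clauses page the supervising partner before the partner-final-pass step.
      void pageSupervisingPartner(
        `block: ${decision.clause_id} · ${decision.rationale[0].policy_citation}`,
      );
      return "blocked";
    case "manual-review":
      return "manual-review";              // human-only marker on the redline draft
    case "negotiate":
      return "redline-draft";              // agent drafts; the partner sends
    case "acceptable":
      return "partner-skim";
  }
}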

Hover any node in the diagram for the tool inventory and per-stage latency budget. Every component has a separately measurable contract — retrieval is measurable in recall@5 on the clause library, policy lookup is measurable in citation accuracy, the reranker is measurable in top-1 precision on the held-out slice, the model decision is measurable on labelled clause-risk-band correctness, the policy-citation accuracy is measurable against the senior-counsel ground-truth, and the manual-review lane is measurable as a rate rather than a failure. When something regresses, the per-component metric tells us which stage to look at — not a single end-to-end number that hides which subsystem moved.

under the hood

The contract-review chain,
clause by clause.

Every clause runs the same six-stage chain. Two retrieval lanes fan out at stage 4 — house policy and the reconciled clause library — then converge on a forced-JSON clause-risk model. The four risk bands at the bottom are the only legal outputs. Hover any stage for its tool inventory.

risk · band 1 Acceptable matches house playbook · partner skim only · ≈ 41% of clauses
risk · band 2 Negotiate redline drafted with policy citation · ≈ 38%
risk · band 3 Block violates a hard policy · partner notified before send · ≈ 9%
risk · band 4 Manual review novel pattern · agent refuses · ≈ 12%

stage latencies above are per-clause p50 / p95 · a typical MSA runs ≈ 80–140 clauses in parallel batches of 12 · end-to-end MSA wall-clock 38–62s

policy-cited
every flagged clause carries a policy_id (regex-enforced)
0
autonomous redlines · partner approves every send to the counterparty
4 senior counsel
in the reconciliation council · one per practice group
shadow-first
MSAs reviewed in parallel by agent + partner for the first 6 weeks post-cutover
the stack

Named tools,
named versions.

Everything in the build is a thing your IT director can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, schemas, and policies are all checked into the firm's repo — not ours. Cohere Rerank stays wired as a fallback so the rerank stage has a documented swap-out path.

Claude Sonnet 4.6 · Anthropic API · forced JSON · role: clause-risk decision model
LangChain 0.3 · Python · role: orchestrator + retrieval glue
LangGraph 0.2.x · Python · role: clause-by-clause chain · shared scratchpad
voyage-3-large · 1,024 dim · role: embeddings · clause library + policy index
pgvector 0.7 · Postgres 16 · role: embedding retrieval
BM25 (Postgres tsvector) · Postgres 16 · role: lexical retrieval · RRF fusion
BAAI bge-reranker-large · self-hosted g5.xlarge · role: cross-encoder rerank · shipped as primary
Cohere Rerank v3 · A/B alternative · role: rerank · lost on the legal corpus (kept as fallback)
Langfuse · v3 · self-hosted · role: per-decision trace · partner-override review
iManage Work · Cloud · matter-scoped · role: DMS · OAuth-on-behalf access
how it actually runs

Production shape,
under the hood.

Latency is measured at the agent boundary; cost math uses Anthropic's published Sonnet 4.6 pricing as of May 2026; eval composition is the frozen 740-item clause-eval set the CI gates on.

Most legal-AI case studies stop at the architecture diagram. Ours doesn't, because our buyers don't. The two people who decide whether to sign — the firm's general counsel and the managing partner of the busiest practice group — open a case study and look for specific things: per-stage latency with p95 not just p50, a token-cost line that ties to the model vendor's published price card, a frozen eval set with category-level thresholds, and an honest accounting of what runs where for privilege scope. Vendors who don't show this either don't have it or are hiding it. The section below maps directly to those questions. Every number is reproducible from a Langfuse trace, a Postgres `EXPLAIN ANALYZE`, or a published vendor price page.

latency budget

Per-clause P50 / P95 (ms)

  1. stage MSA intake + iManage pull
    p50 280
    p95 720
    tooling iManage Work · matter-scoped OAuth · document hashed + version-pinned
  2. stage Semantic clause split
    p50 410
    p95 880
    tooling Heading-aware splitter · §N.M boundary + sub-clause merge
  3. stage LangChain orchestrator step
    p50 36
    p95 88
    tooling LangGraph 0.2 chain · shared scratchpad · clause-by-clause
  4. stage Clause-RAG retrieval + rerank
    p50 372
    p95 620
    tooling pgvector + tsvector RRF k=60 → bge-reranker-large top-6
  5. stage Policy lookup (parallel)
    p50 22
    p95 58
    tooling Practice-group-aware regex-validated policy index
  6. stage Sonnet 4.6 clause-risk decision
    p50 1600
    p95 2400
    tooling Anthropic API · response_format json_schema · ~2,800 in / ~520 out tokens
  7. stage Validator + audit log
    p50 18
    p95 32
    tooling Zod schema · policy_id regex check · Langfuse trace write
  8. stage Total (per-clause end-to-end)
    p50 2738
    p95 4798
    tooling agent boundary — MSAs batch 12 clauses in parallel for ≈ 62s wall-clock

p50/p95 from a 6-month rolling window over n ≈ 18,400 per-clause decisions (180 MSAs × ≈ 102 clauses average). SLO is p95 ≤ 90s wall-clock per MSA; current burn ≈ 69%.
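The MSA wall-clock comes from per-clause parallelism rather than a faster chain: clauses run twelve at a time, so a median 102-clause MSA is nine rounds of parallel per-clause calls. A minimal sketch of that batching loop, with the six-stage chain stubbed as a callback and all names illustrative:

runtime/review-msa.ts typescript
// Illustrative sketch of the per-clause parallelism, not the production runtime.
// Batches of 12 run fully in parallel; batches run serially, so a 102-clause MSA
// is ceil(102 / 12) = 9 rounds at the per-clause latencies in the table above.
export async function reviewMsa<TClause, TDecision>(
  clauses: TClause[],
  reviewClause: (c: TClause) => Promise<TDecision>,  // the six-stage per-clause chain
  batchSize = 12,
): Promise<TDecision[]> {
  const decisions: TDecision[] = [];
  for (let i = 0; i < clauses.length; i += batchSize) {
    const batch = clauses.slice(i, i + batchSize);
    decisions.push(...(await Promise.all(batch.map(reviewClause))));
  }
  return decisions;
}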

The retrieval lanes are where the most per-stage tuning effort went. The reconciled clause library carries 1,420 reference clauses; each is chunked at the clause boundary (no sub-clause splits — a clause is the atomic unit of legal reasoning here) and embedded with voyage-3-large at 1,024 dimensions. We chose voyage-3-large after running a four-way bake-off on the held-out eval slice against OpenAI text-embedding-3-large, Cohere embed-multilingual-v3, and a self-hosted bge-large fine-tuned on the firm's prior MSAs — voyage was Pareto-best on recall@5 and matched on price tier. The lexical lane is Postgres tsvector with English stemming over the same clauses; fusion is RRF with k=60 (paper default; we did not find a better value on the held-out slice). For policy lookup, the index is much smaller (340 policy documents across all four practice groups post-reconciliation) but the retrieval is shape-validated — policy_id has to match the regex `^policy_[A-Z]{2,4}-\d{3,5}$` or the agent fails closed.

contracts/schema/clause-risk.ts typescript
// contracts/schema/clause-risk.ts
// Forced-JSON clause-risk schema. Validated on every model output;
// if Sonnet produces something that doesn't parse, we retry once
// with a stricter system prompt, then fail closed (manual-review).

import { z } from "zod";

export const ClauseRisk = z.object({
  clause_id: z.string(),
  risk: z.enum(["acceptable", "negotiate", "block", "manual-review"]),
  rationale: z.array(z.object({
    claim: z.string().min(20).max(420),
    policy_citation: z.string().regex(/^policy_[A-Z]{2,4}-\d{3,5}$/),
    precedent_ids: z.array(z.string()).min(0).max(8),
  })).min(1),
  suggested_redline: z.string().optional(),
});

export type ClauseRisk = z.infer<typeof ClauseRisk>;
The clause-risk schema. Claude Sonnet 4.6 with response_format: json_schema can't return anything that doesn't conform — every flagged clause has to name a policy_id matching the regex, or the validator rejects and the agent retries once with a stricter prompt, then routes the clause to manual review.
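The parse, retry-once, fail-closed flow the caption describes, sketched against the schema above. `callClauseRiskModel` stands in for the LangChain structured-output call, and the fallback `policy_GEN-000` id is a placeholder rather than a real policy in the firm's index:

contracts/validate-clause-risk.ts typescript
// Illustrative sketch of the parse / retry-once / fail-closed flow, not the
// production binding. callClauseRiskModel stands in for the structured-output call.
import { ClauseRisk } from "./schema/clause-risk";

type ModelCall = (opts: { stricterPrompt: boolean }) => Promise<unknown>;

export async function decideClause(
  clauseId: string,
  callClauseRiskModel: ModelCall,
): Promise<ClauseRisk> {
  for (const stricterPrompt of [false, true]) {
    const raw = await callClauseRiskModel({ stricterPrompt });
    const parsed = ClauseRisk.safeParse(raw);
    // Zod enforces the band enum, the policy_id regex, and the 0–8 precedent bound.
    if (parsed.success) return parsed.data;
  }
  // Fail closed: never silently band a clause the schema could not validate.
  return {
    clause_id: clauseId,
    risk: "manual-review",
    rationale: [
      {
        claim: "Model output failed schema validation twice; clause routed to manual review.",
        policy_citation: "policy_GEN-000",   // placeholder id, illustrative only
        precedent_ids: [],
      },
    ],
  };
}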
clause-risk output · sample MSA

What the partner sees

Six clauses from a sample MSA, rendered with the same risk-band tinting and expandable rationale that partners see in the firm's tooling. Click any clause to expand the policy citation, precedent ids, and (where present) the agent's suggested redline.

sample MSA · vendor agreement · 6 of 102 clauses shown
  • 2 acceptable
  • 2 negotiate
  • 1 block
  • 1 manual review
  • 102 total clauses
  1. § 3.1 Term & renewal — auto-renewal with 60-day notice acceptable

    Auto-renewal with 60-day opt-out notice matches the firm's standard MSA playbook for the vendor practice. No deviation flagged.

    policy citation
    policy_MA-104
    precedents
    prec_msa-2024-118 · prec_msa-2024-217
  2. § 7.2 Indemnification — IP infringement (third-party claims) negotiate

    Counterparty's draft caps IP-infringement indemnity at 1× annual fees and excludes injunctive relief. House playbook requires uncapped IP-infringement indemnity plus injunctive coverage for the licensed product.

    policy citation
    policy_IP-014
    precedents
    prec_msa-2024-088 · prec_msa-2025-031 · prec_msa-2025-104
    suggested redline
    Replace cap with: `Vendor shall indemnify, defend and hold harmless Customer from any third-party IP-infringement claim arising from the Service, without cap. Injunctive relief is included within the scope of this indemnity.`
  3. § 9.4 Limitation of liability — consequential-damages waiver block

    Mutual consequential-damages waiver is acceptable as a category; however, this draft also waives consequential damages for breach of confidentiality and data-protection obligations — both of which the firm's policy carves out from any waiver. Partner notification required before send.

    policy citation
    policy_MA-203
    precedents
    prec_msa-2024-141 · prec_msa-2025-052
    suggested redline
    Add carve-out: `Nothing in this Section limits liability for breach of Sections 11 (Confidentiality) or 12 (Data Protection), or for indemnification obligations under Section 7.`
  4. § 14.5 Governing law — Delaware (with international arbitration) acceptable

    Delaware governing law + ICC international arbitration seat in Singapore matches the firm's cross-border SaaS playbook. No deviation flagged.

    policy citation
    policy_MA-307
    precedents
    prec_msa-2024-202
  5. § 16.2 Data residency — multi-region with carve-out for AI training manual review

    Clause contains a sub-paragraph permitting vendor to use customer data for `model improvement` outside the named regions. This is a novel clause shape not present in any of the 1,420 reconciled-library reference clauses; agent refuses to issue a band and routes to manual review.

    policy citation
    policy_IP-021
    precedents
  6. § 19.1 Assignment — change-of-control consent negotiate

    Draft permits assignment to any affiliate without consent. House playbook requires written consent for any assignment outside the same parent group, plus a 30-day cure period for change-of-control events.

    policy citation
    policy_MA-118
    precedents
    prec_msa-2025-019 · prec_msa-2025-077
    suggested redline
    Replace with: `Neither party may assign this Agreement without the other party's prior written consent, which shall not be unreasonably withheld. Change of control triggers a 30-day cure window.`

sample only — clause_ids, policy_ids, and precedent_ids are illustrative and follow the shape the schema enforces · live surface renders all clauses, not just the 6 shown

unit economics

Per-MSA and monthly cost math

  1. line item Claude Sonnet 4.6 — input tokens
    $ / MSA $0.857
    $ / month (≈ 30 MSAs) $25.71
    note 102 clauses × 2,800 tokens × $3.00 / 1M
  2. line item Claude Sonnet 4.6 — output tokens
    $ / MSA $0.795
    $ / month (≈ 30 MSAs) $23.86
    note 102 clauses × 520 tokens × $15.00 / 1M
  3. line item voyage-3-large embeddings (per clause)
    $ / MSA $0.037
    $ / month (≈ 30 MSAs) $1.10
    note 102 clauses × ≈ 3,000 tokens × $0.12 / 1M
  4. line item pgvector + RDS db.m6i.large
    $ / MSA
    $ / month (≈ 30 MSAs) $284
    note Postgres 16 in firm tenant · clause library + policy index
  5. line item g5.xlarge reranker (24/7)
    $ / MSA
    $ / month (≈ 30 MSAs) $378
    note BAAI bge-reranker-large self-host · Cohere fallback wired
  6. line item LangChain · LangGraph runtime
    $ / MSA
    $ / month (≈ 30 MSAs) $94
    note Python on Fargate · 2 vCPU · per-clause parallelism = 12
  7. line item Langfuse self-hosted (t3.medium)
    $ / MSA
    $ / month (≈ 30 MSAs) $67
    note trace store · 90-day hot / 7-yr cold
  8. line item iManage Work connector
    $ / MSA
    $ / month (≈ 30 MSAs) $0
    note uses firm's existing iManage Cloud seat
  9. line item All-in monthly (≈ 30 MSAs)
    $ / MSA ≈ $1.69
    $ / month (≈ 30 MSAs) ≈ $874
    note vs. ≈ 200 partner hours saved at firm rates

Token costs use Anthropic's published May-2026 Sonnet 4.6 pricing — $3 / 1M input, $15 / 1M output. Infra costs are AWS US-east-2 list price (firm's tenant). Per-MSA token cost assumes the median 102-clause MSA observed in the eval set; range across the 180 production MSAs in the 6-month sample is $0.94 (62 clauses) to $2.83 (164 clauses). Payback period from go-live, including the 9-week build at $215k, is ≈ 4.4 months at the firm's published blended partner rate against the ≈ 71% time saved on partner-signed-off MSAs.
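The per-MSA token line is reproducible from the published price card and the observed medians. A few lines of arithmetic using only numbers that already appear in the table above:

cost/per-msa.ts typescript
// Reproduces the token line items from Anthropic's published Sonnet 4.6 prices
// ($3 / 1M input, $15 / 1M output) and the observed medians for a 102-clause MSA.
const CLAUSES = 102;
const IN_TOKENS_PER_CLAUSE = 2_800;
const OUT_TOKENS_PER_CLAUSE = 520;
const EMBED_TOKENS_PER_CLAUSE = 3_000;

const inputCost = (CLAUSES * IN_TOKENS_PER_CLAUSE * 3) / 1_000_000;        // ≈ $0.857
const outputCost = (CLAUSES * OUT_TOKENS_PER_CLAUSE * 15) / 1_000_000;     // ≈ $0.796
const embedCost = (CLAUSES * EMBED_TOKENS_PER_CLAUSE * 0.12) / 1_000_000;  // ≈ $0.037

console.log((inputCost + outputCost + embedCost).toFixed(2));              // ≈ 1.69 per MSA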

eval composition

What's in the frozen 740-item clause-eval set

  1. category M&A clauses (golds)
    items 320
    what it checks labelled risk band + policy_id + suggested redline · senior-counsel signed
    ci-gate threshold ≥ 0.88 band precision · ≥ 0.90 policy accuracy
  2. category Employment clauses (golds)
    items 180
    what it checks labelled risk band + policy_id · employment senior-counsel signed
    ci-gate threshold ≥ 0.88 band precision
  3. category Real estate clauses (golds)
    items 140
    what it checks labelled risk band + policy_id · real estate senior-counsel signed
    ci-gate threshold ≥ 0.88 band precision
  4. category IP clauses (golds)
    items 100
    what it checks labelled risk band + policy_id · IP senior-counsel signed
    ci-gate threshold ≥ 0.88 band precision
  5. category Block-clause must-catch
    items
    what it checks subset across all 4 groups · catch every hard-no clause (must)
    ci-gate threshold ≥ 0.95 block recall
  6. category Manual-review (novel patterns)
    items
    what it checks deliberately novel clauses · agent must refuse, not guess
    ci-gate threshold 100% refusal on listed must-refuse

Eval set is frozen — items only added, never edited. Senior counsel from the relevant practice group signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. Block-catch and manual-review subsets are sub-categories overlapping the 740-item count.
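A minimal sketch of that release gate: absolute per-category thresholds plus the one-point regression rule against the prior cut. The report shape is an assumption, `1 point` is read as 0.01 on the 0–1 metrics, and the signed-override path is deliberately left outside the sketch:

ci/eval-gate.ts typescript
// Illustrative sketch of the release gate, not the production CI job.
interface CategoryResult {
  category: string;   // e.g. "M&A clauses (golds)"
  score: number;      // current run, 0–1
  threshold: number;  // absolute ci-gate threshold from the table above
}

export function evalGate(
  current: CategoryResult[],
  priorCut: Record<string, number>,   // per-category scores from the previous release
  maxDrop = 0.01,                     // "drops more than 1 point" read as 0.01
): { pass: boolean; failures: string[] } {
  const failures: string[] = [];
  for (const r of current) {
    if (r.score < r.threshold) {
      failures.push(`${r.category}: ${r.score.toFixed(3)} below threshold ${r.threshold}`);
    }
    const prior = priorCut[r.category];
    if (prior !== undefined && prior - r.score > maxDrop) {
      failures.push(`${r.category}: regressed ${(prior - r.score).toFixed(3)} vs prior cut`);
    }
  }
  // The signed-CHANGELOG override path lives outside this check.
  return { pass: failures.length === 0, failures };
}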

Production ops cadence is part of the build, not an afterthought. The firm's general counsel and our on-call engineer hold a weekly override-review meeting where every clause in which the partner overrode the agent's call gets opened — drift that looks systematic becomes a JIRA ticket against the eval set and a candidate prompt or retrieval tweak; sometimes it becomes a policy reconciliation pass for the practice group involved. Langfuse trace retention is 90 days hot in the firm's tenant plus seven years cold in tenant-scoped S3, matching the firm's privilege retention policy. The IT director pulls an audit-log sample every month — model version, retrieved candidates from both lanes, reranker scores, policy-id used, partner override. None of this is published anywhere else by anyone shipping legal agents. That's the bar.

9 weeks · honest version

The timeline
including the two weeks we paused.

Five stages, milestone-billed. The week-4 build turned up a clause-library drift problem — the M&A and real estate playbooks contradicted each other on the same fact pattern, and an agent trained on the drift would inherit it. We halted the build for a 2-week reconciliation pass with senior counsel from each practice group, then resumed. The honest version of `9 weeks` is the 7 weeks of build plus the 2 weeks of pause.

  1. Weeks 1–2

    Discovery + eval set

    Two weeks shadowing partners across the four practice groups. The managing partner of each group sat in the design council. We sampled 60 MSAs from the prior 18 months, anonymized them, and the four senior partners labelled each clause with the correct risk band + policy citation + suggested redline. That sample became the frozen 740-item clause-eval set: 320 from M&A, 180 from employment, 140 from real estate, and 100 from IP.

    Frozen 740-item eval set + per-practice-group rubric
  2. Week 3

    Clause library + dual-index build

    Ingested each practice group's existing clause library — 1,840 reference clauses across the four groups — into pgvector 0.7 with embedding via voyage-3-large at 1,024 dimensions. Built the Postgres tsvector BM25 sidecar over the same corpus. RRF fusion tuned on a held-out eval slice; cross-encoder rerank A/B-tested between bge-reranker-large and Cohere Rerank v3. bge won on the legal corpus by ≈ 3 points top-1 precision; Cohere stayed wired as a fallback.

    Hybrid retrieval at 0.92 recall@5 across all four practice corpora
  3. Week 4

    Clause-library drift — paused for reconciliation

    Building the per-clause review chain in LangGraph turned up a structural problem: M&A's standard indemnification language and real estate's contradicted each other on the same fact pattern (joint-and-several vs several-only for sub-tenancy indemnities). Employment and IP had two more such contradictions. An agent trained on the drift would inherit it. We halted the build, convened a 2-week reconciliation pass with senior counsel from each practice group, and produced a single reconciled clause library — 1,420 clauses after dedupe and reconciliation, down from 1,840. Cost two weeks of wall-clock; bought eighteen months of build-on-firm-ground.

    Reconciled clause library · 1,420 unique reference clauses · sign-off from all 4 practice groups
    Walk-away point
  4. Weeks 5–7

    LangChain agent + forced-JSON clause-risk model

    LangChain 0.3 orchestrator wraps the LangGraph clause-by-clause chain. Claude Sonnet 4.6 with `response_format: json_schema` set to the ClauseRisk shape. The schema is the contract — every flagged clause has to name a policy_id matching the regex; precedent_ids are bounded 0–8; suggested_redline is optional and only renders if the partner expands the clause card. Confidence < 0.8 routes to manual-review; the agent never produces an autonomous redline.

    End-to-end review pipeline behind a partner-only beta flag
  5. Weeks 8–9

    Shadow cutover + partner-override review

    Promoted to first-pass review with partners running in parallel for the first 6 weeks (every MSA reviewed both by the agent and by the partner; outputs compared in Langfuse). After week 6 of shadow, the metric held — partner-override rate fell to 9.2% from a baseline 14% in week 1 — and the firm cut over to agent-first first-pass with partner final-pass. The override-review meeting runs weekly with senior counsel from each group; patterns that show up three times become eval-set additions.

    Production cutover · partner-override-review cadence locked
eval results · 740 frozen clause items

How we know
it works.

The eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 740. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live partner-shadow numbers are within ±1.5 points across all rows over the last 30 days.

  1. metric Clause-risk band precision (4-class)
    v1 (wk 5) 0.84
    v2 (wk 7 post-reconciliation) 0.86
    current (live, wk 36) 0.91
    target ≥ 0.88
  2. metric Block-clause recall (catch all hard-no)
    v1 (wk 5) 0.92
    v2 (wk 7 post-reconciliation) 0.95
    current (live, wk 36) 0.97
    target ≥ 0.95
  3. metric Policy-citation accuracy (cited the right policy)
    v1 (wk 5) 0.79
    v2 (wk 7 post-reconciliation) 0.88
    current (live, wk 36) 0.93
    target ≥ 0.90
  4. metric Partner-override rate (live shadow)
    baseline (pre-build) 14.0%
    v1 (wk 5) 12.4%
    v2 (wk 7 post-reconciliation) 10.1%
    current (live, wk 36) 9.2%
    target ≤ 12%
  5. metric Manual-review refusal rate (by design)
    v1 (wk 5) 8.4%
    v2 (wk 7 post-reconciliation) 11.6%
    current (live, wk 36) 12.0%
    target 10–14%
  6. metric P95 wall-clock per MSA (full report)
    v1 (wk 5) 78s
    v2 (wk 7 post-reconciliation) 68s
    current (live, wk 36) 62s
    target ≤ 90s

Sample size for the headline time-saved number (≈ 71% first-pass MSA review time saved) is n=180 partner-signed-off MSAs across a 6-month rolling window; the figure is reported with a 95% confidence interval, not as a bare point estimate. Partner-override rate is the share of clauses where the partner overrode the agent's risk band on the live shadow slice — by-design, not by failure. Manual-review refusal rate is the share of clauses the agent refuses to band (novel patterns, score-margin failures, off-corpus clauses) and routes straight to a partner — also by-design. Latency is end-to-end MSA wall-clock from upload to full clause-risk report, measured at the agent boundary.

Ready to ship

Want a case study like this
for your firm's stack?

Book a $3K fixed-fee audit. We'll review the clause library, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether it's case-study-shaped. About one audit in five ends with `you need a reconciliation pass before any agent build — here's the SOW for that.`

Read the legal pillar
30 min, async or live · Eval-first scoping · Walk-away point in the pilot