Legal · Mid-market US law firm · RAG + LangChain · forced-JSON clause risk
Claude Sonnet 4.6 · role: clause-risk model · forced JSON
LangChain 0.3 · role: orchestrator + retrieval glue
LangGraph 0.2 · role: clause-by-clause chain · shared scratchpad
pgvector 0.7 · role: embedding retrieval over reconciled clause library
bge-reranker-large · role: cross-encoder rerank · A/B'd vs Cohere
iManage Work · role: document management · matter-scoped pull
case study · 2026 · anonymized

A RAG case study: first-pass MSA review
for a mid-market law firm.

A US-based mid-market law firm with four practice groups needed a first-pass MSA review layer that could split a contract into clauses, retrieve the matching policies and precedents from a reconciled clause library, flag clause risk against the firm's playbook with cited policy ids, and refuse — out loud — on novel patterns. We built it on Claude Sonnet 4.6, LangChain 0.3, LangGraph 0.2, a hybrid pgvector + BM25 index, and a forced-JSON clause-risk schema. Nine weeks, partner-shadow-first, with a clause-library drift kill point at week 4 that we paused the build for.

≈ 71%
first-pass MSA review time saved · partner-signed-off (95% CI · n=180 MSAs across 6 months)
p95 62s
MSA wall-clock to full clause-risk report · meets <90s service target
740
frozen clause-eval items · re-run on every release
9 weeks
discovery to production cutover
shipped
9 weeks · 4 engineers · 4 senior counsel (one per practice group)
6–9 hrs
partner first-pass review per MSA · pre-build
180 / yr
MSAs flowing through the 4 practice groups
4
practice groups · M&A · employment · real estate · IP
11%
post-execution disputes traced to inconsistent first-pass clause calls
the problem

Four practice groups,
four contradictory playbooks.

Partners spending 6–9 hours per MSA on first-pass review. Clause-library drift across practice groups producing inconsistent calls. Downstream dispute rate tracing back to the drift.

The client is a US-based mid-market law firm — roughly seventy attorneys across four practice groups (M&A, employment, real estate, IP), with a vendor-side MSA review pipeline that handled roughly 180 contracts a year across the four groups combined. Like most mid-market firms, they sit at the painful middle of the legal-tech market: too small to fund an Ironclad/Spellbook seat for every partner, too large for the managing partner to hand-review every MSA. The pre-build process was partner-by-partner, playbook-by-playbook, and the playbooks had quietly drifted apart.

today vs · with the agent

today

MSA arrives
Partner reads cover-to-cover
Cross-checks against 4 playbooks
drift across practice groups
Marks up redline by hand
outcome
6–9 hr first-pass per MSA · inconsistent calls across partners · 11% downstream dispute rate

with the agent

MSA upload
Semantic clause split
Hybrid clause-RAG + policy lookup
reconciled clause library
Sonnet 4.6 clause-risk JSON
policy_id + precedent_ids enforced
outcomes
Acceptable · partner skim
Negotiate · redline drafted
Block / manual review

The presenting symptoms broke down two ways. First, the wall-clock: partners were spending six to nine hours per MSA on first-pass review — reading cover-to-cover, cross-checking each clause against their practice-group playbook, then producing a hand-marked redline. The senior partners doing the most reviews were also the ones billing the highest rates; the math was untenable and getting worse. Second, the quality cost of the drift: a post-engagement audit by the firm's general counsel had traced eleven percent of post-execution disputes back to inconsistent first-pass clause calls — same fact pattern, opposite playbook calls between practice groups, sometimes opposite calls between partners in the same group. The general counsel's exact phrasing in the discovery call was: "the partners trust their own work and trust each other's; what they don't trust is that we're all working from the same playbook anymore."

They had looked at Ironclad, Spellbook, and three smaller LegalTech vendors and turned every one of them down. The objections were operator-grade: no vendor could load the firm's reconciled clause library because there wasn't one to load; no vendor surfaced a `policy_id` against each redline suggestion that partners could verify; no vendor would publicly show the eval data behind their headline numbers; no vendor had a refusal lane that named "this is a novel pattern, do not let the agent decide." The conversation we walked into was not "should we ship LangChain" — it was "show us how a clause-risk agent could miss a hard-no, and tell us how you'd catch it before a partner sends the redline to opposing counsel."

That framing decided the engagement. We refused to scope this as a vendor-tool integration. The deliverable was a structured-output clause-risk agent with a policy-citation contract, a four-band risk output where the bands were defensible to a client, a manual-review refusal lane that was first-class rather than an error case, and a reconciled clause library built by the firm's own senior counsel — not by us. The rest of the page is what we shipped.

the approach

Six stages,
four risk bands.

Every clause runs the same six-stage chain. Stage 4 fans out into two parallel retrieval lanes — the reconciled clause library and the house-policy index — before converging on a forced-JSON clause-risk model. The four risk bands at the bottom are the only legal outputs.

The architecture below is the production shape, not a marketing diagram. MSA uploads route through iManage Work via the firm's existing matter-scoped OAuth flow — the agent reads only documents the responsible attorney has matter-level access to, and every read is logged. The semantic clause splitter is the load-bearing detail: each clause is bounded at its section heading (§N.M, with sub-clause merge logic that respects cross-references like `subject to Section 8.3`), tagged with a clause_id that survives editorial reshuffling, and carries the surrounding context window the model needs to reason about it without being polluted by neighboring clauses.
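As a concrete illustration of that boundary rule, here is a minimal splitter sketch. The file name, heading regex, and hash-derived clause_id are illustrative assumptions, not the production implementation.

ingest/split-clauses.ts typescript
// Illustrative sketch of the heading-aware splitter, not the production module.
// Boundary = a §N or §N.M heading at the start of a line. Deeper §N.M.K
// headings merge into their parent clause, and inline cross-references like
// "subject to Section 8.3" never match the start-of-line pattern, so they stay
// inside the clause that cites them.
import { createHash } from "node:crypto";

const HEADING = /^§?\s*(\d+(?:\.\d+)*)[.\s]/; // "§ 7.2 Indemnification", "3.1 Term"

export interface Clause {
  clause_id: string; // content-derived, so reordering clauses does not change it
  section: string;   // "7.2"
  text: string;
}

export function splitClauses(msaText: string): Clause[] {
  const clauses: Clause[] = [];
  let current: { section: string; lines: string[] } | null = null;

  for (const line of msaText.split("\n")) {
    const m = line.match(HEADING);
    if (m && m[1].split(".").length <= 2) {
      // a real top-level §N.M heading opens a new clause
      if (current) clauses.push(finish(current));
      current = { section: m[1], lines: [line] };
    } else if (current) {
      // sub-clauses (§N.M.K) and continuation lines merge into the open clause
      current.lines.push(line);
    }
  }
  if (current) clauses.push(finish(current));
  return clauses;
}

function finish(c: { section: string; lines: string[] }): Clause {
  const text = c.lines.join("\n").trim();
  const hash = createHash("sha256").update(`${c.section}\n${text}`).digest("hex");
  return { clause_id: `cl_${hash.slice(0, 12)}`, section: c.section, text };
}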

Retrieval is hybrid and fans out into two parallel lanes at stage 4. The clause-RAG lane runs pgvector 0.7 + Postgres tsvector BM25 over the reconciled clause library (1,420 unique reference clauses post-reconciliation, down from 1,840 pre-reconciliation), fuses with reciprocal-rank fusion at k=60, dedupes by clause_id, and reranks with BAAI's bge-reranker-large self-hosted on a single g5.xlarge in the firm's tenant. The policy-lookup lane runs a separate regex-validated index over house policy documents — every policy carries an id of shape `policy_(practice-group)-(NNN)` (e.g. policy_IP-014, policy_MA-203) — and is practice-group-aware: IP clauses route to IP policies first, real estate clauses route to real estate policies first, with cross-practice retrieval as a deliberate fallback rather than a default. Running the two lanes as parallel branches rather than a single fused retrieval is the thing the senior counsel specifically asked for during reconciliation — they wanted the policy citation and the precedent retrieval to be visibly independent, not blended.
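The fusion step in the clause-RAG lane is plain reciprocal-rank fusion at k=60, deduped by clause_id before the reranker sees the candidates. A minimal sketch, assuming an illustrative candidate type and a top-24 handoff to the cross-encoder (the handoff size is not stated in the build notes):

retrieval/rrf-fuse.ts typescript
// Reciprocal-rank fusion over the two clause-RAG candidate lists:
// score(d) = sum over lanes of 1 / (k + rank_lane(d)), with k = 60.
// The Map keyed by clause_id doubles as the dedupe step.
interface Candidate {
  clause_id: string;
}

export function rrfFuse<T extends Candidate>(
  lanes: T[][],    // e.g. [pgvectorTopK, tsvectorTopK], each already rank-ordered
  k = 60,          // paper default; the build kept it after tuning found nothing better
  topN = 24,       // handoff size to the cross-encoder (assumed, not from the build notes)
): T[] {
  const fused = new Map<string, { item: T; score: number }>();

  for (const lane of lanes) {
    lane.forEach((item, idx) => {
      const entry = fused.get(item.clause_id) ?? { item, score: 0 };
      entry.score += 1 / (k + (idx + 1));
      fused.set(item.clause_id, entry);
    });
  }

  return [...fused.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((e) => e.item);
}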

During build, we A/B-tested two rerankers — BAAI's bge-reranker-large and Cohere Rerank v3 — on the held-out clause-eval slice. bge won by roughly three points on top-1 precision over the legal corpus; we shipped bge as primary and kept Cohere wired as a runtime fallback so the firm has a swap-out path if bge ever degrades. Both rerankers' top-1 precisions are logged in Langfuse per-decision so the comparison can be re-run on a fresh slice at any time.

The decision step is Claude Sonnet 4.6 with `response_format: json_schema` set to the ClauseRisk shape. The model has zero write tools — it cannot send anything to counterparties, modify documents in iManage, or finalize a redline. All it produces is a JSON object: the clause_id, the risk band (one of four enum values), an array of rationale entries each tied to a policy_id (regex-enforced via Zod) plus up to eight precedent_ids, and an optional suggested_redline string. Every rationale claim has to cite a policy_id whose regex matches and whose pathway resolves to a live policy document, or the validator rejects. Confidence below 0.8 routes the clause to manual-review regardless of band — partner sees the manual-review marker on the redline draft and reads the clause themselves.

three decisions that shaped the build
design decision · 01

Forced-JSON clause-risk schema with policy-id regex

we rejected
Free-text redline summary
because
Every flagged clause has to cite a policy_id matching the regex policy_(practice-group)-(NNN). The Zod validator is the contract; the model can't suggest a redline without naming the policy it traces to. Partners check the policy, not the model.
design decision · 02

Four risk bands · acceptable / negotiate / block / manual-review

we rejected
Continuous risk score 0–1
because
Partners read bands, not scores. A 0.71 risk score is harder to defend in front of a client than `negotiate per policy IP-014`. The band-based output also gates queue routing — block clauses page the supervising partner; manual-review clauses leave a human-only marker on the redline.
design decision · 03

Reconcile the 4 clause libraries before any agent build

we rejected
Train per-practice-group models on each library as-is
because
Discovery week 4 found that M&A and real estate had contradictory standard indemnification clauses — same fact pattern, opposite playbook calls. An agent trained on the drift would inherit it. We paused the build for a 2-week reconciliation pass with senior counsel from each group, then resumed.

Guardrails live as Zod schemas and TypeScript runtime checks checked into the same monorepo as the agent. The policy layer enforces a never-autonomous rule on any clause routed to a counterparty (the agent produces redlines; partners send them), a block-band notification rule that pages the supervising partner before the partner-final-pass step, and a per-decision audit log that retains the retrieved candidates from both lanes, the reranker scores, the model's raw output, the parsed JSON, the schema-validation verdict, and the partner override (if any) — searchable in Langfuse by clause_id and by partner-override status. The override-review meeting is the cadence that keeps the agent honest: senior counsel from each practice group sits with our on-call engineer once a week, walks the disagreements, and patterns that show up three times become eval-set additions.
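A minimal sketch of the band-routing rule as a runtime check over the ClauseRisk shape shown further down the page. The queue names, the notifier, the import path, and the confidence parameter (which is not a field of the schema itself) are illustrative assumptions:

guardrails/route-clause.ts typescript
// Illustrative sketch of the band-routing rule, not the production policy layer.
import type { ClauseRisk } from "../contracts/schema/clause-risk";

type Queue = "partner-skim" | "redline-draft" | "blocked" | "manual-review";

export function routeClause(
  decision: ClauseRisk,
  confidence: number,                      // chain confidence; not a field of the schema
  pageSupervisingPartner: (msg: string) => Promise<void>,
): Queue {
  // Confidence below 0.8 overrides the band: the partner reads the clause themselves.
  if (confidence < 0.8) return "manual-review";

  switch (decision.risk) {
    case "block":
      // Block clauses page the supervising partner before the partner-final-pass step.
      void pageSupervisingPartner(
        `block: ${decision.clause_id} · ${decision.rationale[0].policy_citation}`,
      );
      return "blocked";
    case "manual-review":
      return "manual-review";              // human-only marker on the redline draft
    case "negotiate":
      return "redline-draft";              // agent drafts; the partner sends
    case "acceptable":
      return "partner-skim";
  }
}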

Hover any node in the diagram for the tool inventory and per-stage latency budget. Every component has a separately measurable contract — retrieval is measurable in recall@5 on the clause library, policy lookup is measurable in citation accuracy, the reranker is measurable in top-1 precision on the held-out slice, the model decision is measurable on labelled clause-risk-band correctness, the policy-citation accuracy is measurable against the senior-counsel ground-truth, and the manual-review lane is measurable as a rate rather than a failure. When something regresses, the per-component metric tells us which stage to look at — not a single end-to-end number that hides which subsystem moved.

under the hood

The contract-review chain,
clause by clause.

Every clause runs the same six-stage chain. Two retrieval lanes fan out at stage 4 — house policy and the reconciled clause library — then converge on a forced-JSON clause-risk model. The four risk bands at the bottom are the only legal outputs. Hover any stage for its tool inventory.

risk · band 1 Acceptable matches house playbook · partner skim only · ≈ 41% of clauses
risk · band 2 Negotiate redline drafted with policy citation · ≈ 38%
risk · band 3 Block violates a hard policy · partner notified before send · ≈ 9%
risk · band 4 Manual review novel pattern · agent refuses · ≈ 12%

stage latencies above are per-clause p50 / p95 · a typical MSA runs ≈ 80–140 clauses in parallel batches of 12 · end-to-end MSA wall-clock 38–62s

policy-cited
every flagged clause carries a policy_id (regex-enforced)
0
autonomous redlines · partner approves every send to the counterparty
4 senior counsel
in the reconciliation council · one per practice group
shadow-first
MSAs reviewed in parallel by agent + partner for the first 6 weeks post-cutover
the stack

Named tools,
named versions.

Everything in the build is a thing your IT director can write a question about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, schemas, and policies are all checked into the firm's repo — not ours. Cohere Rerank stays wired as a fallback so the rerank stage has a documented swap-out path.

Claude Sonnet 4.6 · Anthropic API · forced JSON · role: clause-risk decision model
LangChain 0.3 · Python · role: orchestrator + retrieval glue
LangGraph 0.2.x · Python · role: clause-by-clause chain · shared scratchpad
voyage-3-large · 1,024 dim · role: embeddings · clause library + policy index
pgvector 0.7 · Postgres 16 · role: embedding retrieval
BM25 (Postgres tsvector) · Postgres 16 · role: lexical retrieval · RRF fusion
BAAI bge-reranker-large · self-hosted g5.xlarge · role: cross-encoder rerank · shipped as primary
Cohere Rerank v3 · A/B alternative · role: rerank · lost on the legal corpus (kept as fallback)
Langfuse · v3 · self-hosted · role: per-decision trace · partner-override review
iManage Work · Cloud · matter-scoped · role: DMS · OAuth-on-behalf access
how it actually runs

Production shape,
under the hood.

Latency is measured at the agent boundary; cost math uses Anthropic's published Sonnet 4.6 pricing as of May 2026; eval composition is the frozen 740-item clause-eval set the CI gates on.

Most legal-AI case studies stop at the architecture diagram. Ours doesn't, because our buyers don't. The two people who decide whether to sign — the firm's general counsel and the managing partner of the busiest practice group — open a case study and look for specific things: per-stage latency with p95 not just p50, a token-cost line that ties to the model vendor's published price card, a frozen eval set with category-level thresholds, and an honest accounting of what runs where for privilege scope. Vendors who don't show this either don't have it or are hiding it. The section below maps directly to those questions. Every number is reproducible from a Langfuse trace, a Postgres `EXPLAIN ANALYZE`, or a published vendor price page.

latency budget

Per-clause P50 / P95 (ms)

  1. stage MSA intake + iManage pull
    p50 280
    p95 720
    tooling iManage Work · matter-scoped OAuth · document hashed + version-pinned
  2. stage Semantic clause split
    p50 410
    p95 880
    tooling Heading-aware splitter · §N.M boundary + sub-clause merge
  3. stage LangChain orchestrator step
    p50 36
    p95 88
    tooling LangGraph 0.2 chain · shared scratchpad · clause-by-clause
  4. stage Clause-RAG retrieval + rerank
    p50 372
    p95 620
    tooling pgvector + tsvector RRF k=60 → bge-reranker-large top-6
  5. stage Policy lookup (parallel)
    p50 22
    p95 58
    tooling Practice-group-aware regex-validated policy index
  6. stage Sonnet 4.6 clause-risk decision
    p50 1600
    p95 2400
    tooling Anthropic API · response_format json_schema · ~2,800 in / ~520 out tokens
  7. stage Validator + audit log
    p50 18
    p95 32
    tooling Zod schema · policy_id regex check · Langfuse trace write
  8. stage Total (per-clause end-to-end)
    p50 2738
    p95 4798
    tooling agent boundary — MSAs batch 12 clauses in parallel for ≈ 62s wall-clock

p50/p95 from a 6-month rolling window over n ≈ 18,400 per-clause decisions (180 MSAs × ≈ 102 clauses average). SLO is p95 ≤ 90s wall-clock per MSA; current burn ≈ 69%.
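The MSA wall-clock comes from per-clause parallelism rather than a faster chain: clauses run twelve at a time, so a median 102-clause MSA is nine rounds of parallel per-clause calls. A minimal sketch of that batching loop, with the six-stage chain stubbed as a callback and all names illustrative:

runtime/review-msa.ts typescript
// Illustrative sketch of the per-clause parallelism, not the production runtime.
// Batches of 12 run fully in parallel; batches run serially, so a 102-clause MSA
// is ceil(102 / 12) = 9 rounds at the per-clause latencies in the table above.
export async function reviewMsa<TClause, TDecision>(
  clauses: TClause[],
  reviewClause: (c: TClause) => Promise<TDecision>,  // the six-stage per-clause chain
  batchSize = 12,
): Promise<TDecision[]> {
  const decisions: TDecision[] = [];
  for (let i = 0; i < clauses.length; i += batchSize) {
    const batch = clauses.slice(i, i + batchSize);
    decisions.push(...(await Promise.all(batch.map(reviewClause))));
  }
  return decisions;
}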

The retrieval lanes are where the most per-stage tuning effort went. The reconciled clause library carries 1,420 reference clauses; each is chunked at the clause boundary (no sub-clause splits — a clause is the atomic unit of legal reasoning here) and embedded with voyage-3-large at 1,024 dimensions. We chose voyage-3-large after running a four-way bake-off on the held-out eval slice against OpenAI text-embedding-3-large, Cohere embed-multilingual-v3, and a self-hosted bge-large fine-tuned on the firm's prior MSAs — voyage was Pareto-best on recall@5 and matched on price tier. The lexical lane is Postgres tsvector with English stemming over the same clauses; fusion is RRF with k=60 (paper default; we did not find a better value on the held-out slice). For policy lookup, the index is much smaller (340 policy documents across all four practice groups post-reconciliation) but the retrieval is shape-validated — policy_id has to match the regex `^policy_[A-Z]{2,4}-\d{3,5}$` or the agent fails closed.

contracts/schema/clause-risk.ts typescript
// contracts/schema/clause-risk.ts
// Forced-JSON clause-risk schema. Validated on every model output;
// if Sonnet produces something that doesn't parse, we retry once
// with a stricter system prompt, then fail closed (manual-review).

import { z } from "zod";

export const ClauseRisk = z.object({
  clause_id: z.string(),
  risk: z.enum(["acceptable", "negotiate", "block", "manual-review"]),
  rationale: z.array(z.object({
    claim: z.string().min(20).max(420),
    policy_citation: z.string().regex(/^policy_[A-Z]{2,4}-\d{3,5}$/),
    precedent_ids: z.array(z.string()).min(0).max(8),
  })).min(1),
  suggested_redline: z.string().optional(),
});

export type ClauseRisk = z.infer<typeof ClauseRisk>;
The clause-risk schema. Claude Sonnet 4.6 with response_format: json_schema can't return anything that doesn't conform — every flagged clause has to name a policy_id matching the regex, or the validator rejects and the agent retries once with a stricter prompt, then routes the clause to manual review.
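The parse, retry-once, fail-closed flow the caption describes, sketched against the schema above. `callClauseRiskModel` stands in for the LangChain structured-output call, and the fallback `policy_GEN-000` id is a placeholder rather than a real policy in the firm's index:

contracts/validate-clause-risk.ts typescript
// Illustrative sketch of the parse / retry-once / fail-closed flow, not the
// production binding. callClauseRiskModel stands in for the structured-output call.
import { ClauseRisk } from "./schema/clause-risk";

type ModelCall = (opts: { stricterPrompt: boolean }) => Promise<unknown>;

export async function decideClause(
  clauseId: string,
  callClauseRiskModel: ModelCall,
): Promise<ClauseRisk> {
  for (const stricterPrompt of [false, true]) {
    const raw = await callClauseRiskModel({ stricterPrompt });
    const parsed = ClauseRisk.safeParse(raw);
    // Zod enforces the band enum, the policy_id regex, and the 0–8 precedent bound.
    if (parsed.success) return parsed.data;
  }
  // Fail closed: never silently band a clause the schema could not validate.
  return {
    clause_id: clauseId,
    risk: "manual-review",
    rationale: [
      {
        claim: "Model output failed schema validation twice; clause routed to manual review.",
        policy_citation: "policy_GEN-000",   // placeholder id, illustrative only
        precedent_ids: [],
      },
    ],
  };
}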
clause-risk output · sample MSA

What the partner sees

Six clauses from a sample MSA, rendered with the same risk-band tinting and expandable rationale that partners see in the firm's tooling. Click any clause to expand the policy citation, precedent ids, and (where present) the agent's suggested redline.

sample MSA · vendor agreement · 6 of 102 clauses shown
  • 2 acceptable
  • 2 negotiate
  • 1 block
  • 1 manual review
  • 102 total clauses
  1. § 3.1 Term & renewal — auto-renewal with 60-day notice acceptable

    Auto-renewal with 60-day opt-out notice matches the firm's standard MSA playbook for the vendor practice. No deviation flagged.

    policy citation
    policy_MA-104
    precedents
    prec_msa-2024-118 · prec_msa-2024-217
  2. § 7.2 Indemnification — IP infringement (third-party claims) negotiate

    Counterparty's draft caps IP-infringement indemnity at 1× annual fees and excludes injunctive relief. House playbook requires uncapped IP-infringement indemnity plus injunctive coverage for the licensed product.

    policy citation
    policy_IP-014
    precedents
    prec_msa-2024-088 · prec_msa-2025-031 · prec_msa-2025-104
    suggested redline
    Replace cap with: `Vendor shall indemnify, defend and hold harmless Customer from any third-party IP-infringement claim arising from the Service, without cap. Injunctive relief is included within the scope of this indemnity.`
  3. § 9.4 Limitation of liability — consequential-damages waiver block

    Mutual consequential-damages waiver is acceptable as a category; however, this draft also waives consequential damages for breach of confidentiality and data-protection obligations — both of which the firm's policy carves out from any waiver. Partner notification required before send.

    policy citation
    policy_MA-203
    precedents
    prec_msa-2024-141 · prec_msa-2025-052
    suggested redline
    Add carve-out: `Nothing in this Section limits liability for breach of Sections 11 (Confidentiality) or 12 (Data Protection), or for indemnification obligations under Section 7.`
  4. § 14.5 Governing law — Delaware (with international arbitration) acceptable

    Delaware governing law + ICC international arbitration seat in Singapore matches the firm's cross-border SaaS playbook. No deviation flagged.

    policy citation
    policy_MA-307
    precedents
    prec_msa-2024-202
  5. § 16.2 Data residency — multi-region with carve-out for AI training manual review

    Clause contains a sub-paragraph permitting vendor to use customer data for `model improvement` outside the named regions. This is a novel clause shape not present in any of the 1,420 reconciled-library reference clauses; agent refuses to issue a band and routes to manual review.

    policy citation
    policy_IP-021
    precedents
  6. § 19.1 Assignment — change-of-control consent negotiate

    Draft permits assignment to any affiliate without consent. House playbook requires written consent for any assignment outside the same parent group, plus a 30-day cure period for change-of-control events.

    policy citation
    policy_MA-118
    precedents
    prec_msa-2025-019 · prec_msa-2025-077
    suggested redline
    Replace with: `Neither party may assign this Agreement without the other party's prior written consent, which shall not be unreasonably withheld. Change of control triggers a 30-day cure window.`

sample only — clause_ids, policy_ids, and precedent_ids are illustrative and follow the shape the schema enforces · live surface renders all clauses, not just the 6 shown

unit economics

Per-MSA and monthly cost math

  1. line item Claude Sonnet 4.6 — input tokens
    $ / MSA $0.857
    $ / month (≈ 30 MSAs) $25.71
    note 102 clauses × 2,800 tokens × $3.00 / 1M
  2. line item Claude Sonnet 4.6 — output tokens
    $ / MSA $0.795
    $ / month (≈ 30 MSAs) $23.86
    note 102 clauses × 520 tokens × $15.00 / 1M
  3. line item voyage-3-large embeddings (per clause)
    $ / MSA $0.037
    $ / month (≈ 30 MSAs) $1.10
    note 102 clauses × ≈ 3,000 tokens × $0.12 / 1M
  4. line item pgvector + RDS db.m6i.large
    $ / MSA
    $ / month (≈ 30 MSAs) $284
    note Postgres 16 in firm tenant · clause library + policy index
  5. line item g5.xlarge reranker (24/7)
    $ / MSA
    $ / month (≈ 30 MSAs) $378
    note BAAI bge-reranker-large self-host · Cohere fallback wired
  6. line item LangChain · LangGraph runtime
    $ / MSA
    $ / month (≈ 30 MSAs) $94
    note Python on Fargate · 2 vCPU · per-clause parallelism = 12
  7. line item Langfuse self-hosted (t3.medium)
    $ / MSA
    $ / month (≈ 30 MSAs) $67
    note trace store · 90-day hot / 7-yr cold
  8. line item iManage Work connector
    $ / MSA
    $ / month (≈ 30 MSAs) $0
    note uses firm's existing iManage Cloud seat
  9. line item All-in monthly (≈ 30 MSAs)
    $ / MSA ≈ $1.69
    $ / month (≈ 30 MSAs) ≈ $874
    note vs. ≈ 200 partner hours saved at firm rates

Token costs use Anthropic's published May-2026 Sonnet 4.6 pricing — $3 / 1M input, $15 / 1M output. Infra costs are AWS US-east-2 list price (firm's tenant). Per-MSA token cost assumes the median 102-clause MSA observed in the eval set; range across the 180 production MSAs in the 6-month sample is $0.94 (62 clauses) to $2.83 (164 clauses). Payback period from go-live, including the 9-week build at $215k, is ≈ 4.4 months at the firm's published blended partner rate against the ≈ 71% time saved on partner-signed-off MSAs.
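The per-MSA token line is reproducible from the published price card and the observed medians. A few lines of arithmetic using only numbers that already appear in the table above:

cost/per-msa.ts typescript
// Reproduces the token line items from Anthropic's published Sonnet 4.6 prices
// ($3 / 1M input, $15 / 1M output) and the observed medians for a 102-clause MSA.
const CLAUSES = 102;
const IN_TOKENS_PER_CLAUSE = 2_800;
const OUT_TOKENS_PER_CLAUSE = 520;
const EMBED_TOKENS_PER_CLAUSE = 3_000;

const inputCost = (CLAUSES * IN_TOKENS_PER_CLAUSE * 3) / 1_000_000;        // ≈ $0.857
const outputCost = (CLAUSES * OUT_TOKENS_PER_CLAUSE * 15) / 1_000_000;     // ≈ $0.796
const embedCost = (CLAUSES * EMBED_TOKENS_PER_CLAUSE * 0.12) / 1_000_000;  // ≈ $0.037

console.log((inputCost + outputCost + embedCost).toFixed(2));              // ≈ 1.69 per MSA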

eval composition

What's in the frozen 740-item clause-eval set

  1. category M&A clauses (golds)
    items 320
    what it checks labelled risk band + policy_id + suggested redline · senior-counsel signed
    ci-gate threshold ≥ 0.88 band precision · ≥ 0.90 policy accuracy
  2. category Employment clauses (golds)
    items 180
    what it checks labelled risk band + policy_id · employment senior-counsel signed
    ci-gate threshold ≥ 0.88 band precision
  3. category Real estate clauses (golds)
    items 140
    what it checks labelled risk band + policy_id · real estate senior-counsel signed
    ci-gate threshold ≥ 0.88 band precision
  4. category IP clauses (golds)
    items 100
    what it checks labelled risk band + policy_id · IP senior-counsel signed
    ci-gate threshold ≥ 0.88 band precision
  5. category Block-clause must-catch
    items
    what it checks subset across all 4 groups · catch every hard-no clause (must)
    ci-gate threshold ≥ 0.95 block recall
  6. category Manual-review (novel patterns)
    items
    what it checks deliberately novel clauses · agent must refuse, not guess
    ci-gate threshold 100% refusal on listed must-refuse

Eval set is frozen — items only added, never edited. Senior counsel from the relevant practice group signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry. Block-catch and manual-review subsets are sub-categories overlapping the 740-item count.
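A minimal sketch of that release gate: absolute per-category thresholds plus the one-point regression rule against the prior cut. The report shape is an assumption, `1 point` is read as 0.01 on the 0–1 metrics, and the signed-override path is deliberately left outside the sketch:

ci/eval-gate.ts typescript
// Illustrative sketch of the release gate, not the production CI job.
interface CategoryResult {
  category: string;   // e.g. "M&A clauses (golds)"
  score: number;      // current run, 0–1
  threshold: number;  // absolute ci-gate threshold from the table above
}

export function evalGate(
  current: CategoryResult[],
  priorCut: Record<string, number>,   // per-category scores from the previous release
  maxDrop = 0.01,                     // "drops more than 1 point" read as 0.01
): { pass: boolean; failures: string[] } {
  const failures: string[] = [];
  for (const r of current) {
    if (r.score < r.threshold) {
      failures.push(`${r.category}: ${r.score.toFixed(3)} below threshold ${r.threshold}`);
    }
    const prior = priorCut[r.category];
    if (prior !== undefined && prior - r.score > maxDrop) {
      failures.push(`${r.category}: regressed ${(prior - r.score).toFixed(3)} vs prior cut`);
    }
  }
  // The signed-CHANGELOG override path lives outside this check.
  return { pass: failures.length === 0, failures };
}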

Production ops cadence is part of the build, not an afterthought. The firm's general counsel and our on-call engineer hold a weekly override-review meeting where every clause in which the partner overrode the agent's call gets opened — drift that looks systematic becomes a JIRA ticket against the eval set and a candidate prompt or retrieval tweak; sometimes it becomes a policy reconciliation pass for the practice group involved. Langfuse trace retention is 90 days hot in the firm's tenant plus seven years cold in tenant-scoped S3, matching the firm's privilege retention policy. The IT director pulls an audit-log sample every month — model version, retrieved candidates from both lanes, reranker scores, policy-id used, partner override. None of this is published anywhere else by anyone shipping legal agents. That's the bar.

9 weeks · honest version

The timeline
including the two weeks we paused.

Five stages, milestone-billed. The week-4 build turned up a clause-library drift problem — the M&A and real estate playbooks contradicted each other on the same fact pattern, and an agent trained on the drift would inherit it. We halted the build for a 2-week reconciliation pass with senior counsel from each practice group, then resumed. The honest version of `9 weeks` is the 7 weeks of build plus the 2 weeks of pause.

  1. Weeks 1–2

    Discovery + eval set

    Two weeks shadowing partners across the four practice groups. The managing partner of each group sat in the design council. We sampled 60 MSAs from the prior 18 months, anonymized them, and the four senior partners labelled each clause with the correct risk band + policy citation + suggested redline. That sample became the frozen 740-item clause-eval set: 320 from M&A, 180 from employment, 140 from real estate, and 100 from IP.

    Frozen 740-item eval set + per-practice-group rubric
  2. Week 3

    Clause library + dual-index build

    Ingested each practice group's existing clause library — 1,840 reference clauses across the four groups — into pgvector 0.7 with embedding via voyage-3-large at 1,024 dimensions. Built the Postgres tsvector BM25 sidecar over the same corpus. RRF fusion tuned on a held-out eval slice; cross-encoder rerank A/B-tested between bge-reranker-large and Cohere Rerank v3. bge won on the legal corpus by ≈ 3 points top-1 precision; Cohere stayed wired as a fallback.

    Hybrid retrieval at 0.92 recall@5 across all four practice corpora
  3. Week 4

    Clause-library drift — paused for reconciliation

    Building the per-clause review chain in LangGraph turned up a structural problem: M&A's standard indemnification language and real estate's contradicted each other on the same fact pattern (joint-and-several vs several-only for sub-tenancy indemnities). Employment and IP had two more such contradictions. An agent trained on the drift would inherit it. We halted the build, convened a 2-week reconciliation pass with senior counsel from each practice group, and produced a single reconciled clause library — 1,420 clauses after dedupe and reconciliation, down from 1,840. Cost two weeks of wall-clock; bought eighteen months of build-on-firm-ground.

    Reconciled clause library · 1,420 unique reference clauses · sign-off from all 4 practice groups
    Walk-away point
  4. Weeks 5–7

    LangChain agent + forced-JSON clause-risk model

    LangChain 0.3 orchestrator wraps the LangGraph clause-by-clause chain. Claude Sonnet 4.6 with `response_format: json_schema` set to the ClauseRisk shape. The schema is the contract — every flagged clause has to name a policy_id matching the regex; precedent_ids are bounded 0–8; suggested_redline is optional and only renders if the partner expands the clause card. Confidence < 0.8 routes to manual-review; the agent never produces an autonomous redline.

    End-to-end review pipeline behind a partner-only beta flag
  5. Weeks 8–9

    Shadow cutover + partner-override review

    Promoted to first-pass review with partners running in parallel for the first 6 weeks (every MSA reviewed both by the agent and by the partner; outputs compared in Langfuse). After week 6 of shadow, the metric held — partner-override rate fell to 9.2% from a baseline 14% in week 1 — and the firm cut over to agent-first first-pass with partner final-pass. The override-review meeting runs weekly with senior counsel from each group; patterns that show up three times become eval-set additions.

    Production cutover · partner-override-review cadence locked
eval results · 740 frozen clause items

How we know
it works.

The eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 740. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live partner-shadow numbers are within ±1.5 points across all rows over the last 30 days.

  1. metric Clause-risk band precision (4-class)
    v1 (wk 5) 0.84
    v2 (wk 7 post-reconciliation) 0.86
    current (live, wk 36) 0.91
    target ≥ 0.88
  2. metric Block-clause recall (catch all hard-no)
    v1 (wk 5) 0.92
    v2 (wk 7 post-reconciliation) 0.95
    current (live, wk 36) 0.97
    target ≥ 0.95
  3. metric Policy-citation accuracy (cited the right policy)
    v1 (wk 5) 0.79
    v2 (wk 7 post-reconciliation) 0.88
    current (live, wk 36) 0.93
    target ≥ 0.90
  4. metric Partner-override rate (live shadow)
    baseline (pre-build) 14.0%
    v1 (wk 5) 12.4%
    v2 (wk 7 post-reconciliation) 10.1%
    current (live, wk 36) 9.2%
    target ≤ 12%
  5. metric Manual-review refusal rate (by design)
    v1 (wk 5) 8.4%
    v2 (wk 7 post-reconciliation) 11.6%
    current (live, wk 36) 12.0%
    target 10–14%
  6. metric P95 wall-clock per MSA (full report)
    v1 (wk 5) 78s
    v2 (wk 7 post-reconciliation) 68s
    current (live, wk 36) 62s
    target ≤ 90s

Sample size for the headline time-saved number (≈ 71% first-pass MSA review time saved) is n=180 partner-signed-off MSAs across a 6-month rolling window; the figure is reported with a 95% confidence interval, not as a bare point estimate. Partner-override rate is the share of clauses where the partner overrode the agent's risk band on the live shadow slice — by-design, not by failure. Manual-review refusal rate is the share of clauses the agent refuses to band (novel patterns, score-margin failures, off-corpus clauses) and routes straight to a partner — also by-design. Latency is end-to-end MSA wall-clock from upload to full clause-risk report, measured at the agent boundary.

Ready to ship

Want a case study like this
for your firm's stack?

Book a $3K fixed-fee audit. We'll review the clause library, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether it's case-study-shaped. About one audit in five ends with `you need a reconciliation pass before any agent build — here's the SOW for that.`

Read the legal pillar
30 min, async or live · Eval-first scoping · Walk-away point in the pilot