Developer-tooling SaaS · B2B Claude RAG · cited-answer surface
Claude Sonnet 4.6 · role: Cited-answer model · forced JSON
Claude Haiku 4.5 · role: Intent router · simple-query fast path
pgvector 0.7 · role: Embedding index over 12,400 docs pages
bge-reranker-large · role: Cross-encoder rerank · self-hosted
Cloudflare Workers · role: Edge · streaming + audit shipping
Langfuse · role: Per-decision trace · groundedness sampling
case study · 2026 · anonymized

A Claude case study: cited answers
over 12,000 product-docs pages.

A developer-tooling SaaS needed an in-product answer surface that could reason across nine product surfaces and twelve thousand docs pages, return cited answers users could click to verify, and refuse — out loud — when it didn't know. We built it on Claude Sonnet 4.6 + Haiku 4.5, a hybrid pgvector + Algolia index, and a forced-JSON schema that rejects any answer without grounded citations. Seven weeks, shadow-first, with a synonym-hallucination kill point at week 5 that we actually had to use.

≈ 64%
docs-recoverable tickets deflected at conf ≥ 0.8 (95% CI · n=3,400 over 30-day shadow)
p95 2.1s
first-token streamed answer latency · meets <2.5s service target
1,260
frozen eval queries · re-run on every release
7 weeks
discovery to production cutover
shipped
7 weeks · 3 engineers · 1 docs lead · 1 support lead
41%
of support tickets recoverable from public docs (sampled n=1,200)
2.3 / 5
user-rated docs-search satisfaction · pre-build
12,400
docs pages across 9 product surfaces
18%
of answer-bot replies failed groundedness on shadow eval
the problem

A docs-search bot
that lied with confidence.

Keyword search couldn't reason across nested module hierarchies. The existing answer-bot hallucinated on synonyms. The support queue paid the bill.

The client is a developer-tooling SaaS — nine product surfaces (an SDK, a CLI, a hosted control plane, three integrations, two open-source packages, and a sandbox) with roughly twelve thousand four hundred docs pages across them, all maintained in MDX in a public monorepo. Like most growth-stage developer tools, the docs are the front door, the support funnel, and the deciding factor for self-serve conversion in roughly the same breath. They were also the binding constraint: the support team's own audit showed that forty-one percent of inbound tickets, sampled across a randomized n=1,200 over the prior quarter, could have been answered from a public docs page that already existed. The user just hadn't found it.

today vs · with the agent

today
User asks question → Keyword search (no synonym handling) → Old answer-bot (hallucinates on synonyms) → User opens ticket
outcome · Ticket queue inflated · 41% recoverable from public docs · groundedness 0.82

with the agent
User asks question → Haiku 4.5 routes intent → Hybrid retrieval + rerank (voyage-3-large + bge) → Sonnet 4.6 cited answer (forced-JSON schema · anchor citations)
outcomes · Answer in product (≥ 0.8) · Suggest + ticket (0.5–0.8) · Refuse + route (< 0.5)

The presenting symptoms broke down two ways. First, the keyword search couldn't reason across the product's nested module hierarchies — a question about `transaction.refund.signature` would surface generic transaction pages but miss the specific signature-mismatch troubleshooting note three levels deeper that actually held the answer. Second, the team had previously tried an off-the-shelf answer-bot that ran a single embedding lookup + a generic prompt — and it was confidently wrong on synonyms. Their corpus uses `API key`, `secret token`, and `signing secret` in different products for related-but-distinct concepts, and the old bot blurred them. Internal eval put its groundedness at 0.82, with an 18% rate of answers that cited a real doc page but the wrong one. The support lead's exact phrasing in the discovery call was: "I'd rather it refuse than answer wrong; right now it answers wrong about one time in five."

They had looked at every vendor in the docs-search space and turned every one of them down. The objections were technical, not commercial: no honest evaluation methodology in any vendor pitch, no audit log on the model's claims, no clear refusal path when retrieval was thin, no way to require that the citation be load-bearing rather than decorative. The conversation we walked into was not "should we ship Claude" — it was "show us how a docs agent could lie, and tell us how you'd catch it before a user trusted it."

That framing decided the engagement. We refused to scope this as a chatbot build. The deliverable was a structured-output answer surface with a citation-first contract, evidence-chunk-id enforcement on every claim, a frozen eval set that gated every release, and a refusal lane that was first-class rather than an error case. The rest of the page is what we shipped.

the approach

Six stages,
three outcome lanes.

Every question runs the same six-stage pipeline. The agent is allowed to answer-in-product, suggest-and-open-ticket, or refuse-and-route. Forced-JSON output with grounded citations is the only legal shape of an answer.

The architecture below is the production shape, not a marketing diagram. The docs corpus ingests from the customer's MDX monorepo via a GitHub Action that triggers on PR-merge — only changed pages reindex, the rest hot-reload from cache. The anchor-preserving chunker is the load-bearing detail: each chunk carries a stable anchor_id of shape `doc_[slug]#[anchor-slug]`, minted at chunk-time and never regenerated, so a citation issued today resolves to the same passage tomorrow even after editorial reshuffling. The shape of the id is also exactly the shape the answer schema enforces downstream with a regex.
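
A minimal sketch of that anchor-minting step, assuming a slug scheme derived from the MDX path and the nearest heading. The file name, `mintAnchorId`, and the `DocChunk` shape are illustrative, not the production chunker, which additionally applies the 512-token / 80-token-overlap and never-split-mid-code-fence rules described further down.

// docs-ingest/chunker/anchor.ts (illustrative sketch, not the production chunker)
// Mints a stable anchor_id at chunk time: doc_[slug]#[anchor-slug].
// The id is derived only from the doc path and the nearest heading, so
// re-chunking an edited page reproduces the same id for unchanged anchors.

const ANCHOR_ID = /^doc_[a-z0-9-]+#[a-z0-9-]+$/; // same regex the DocsAnswer schema enforces

function slugify(s: string): string {
  return s
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")   // collapse anything non-alphanumeric to a dash
    .replace(/^-+|-+$/g, "");      // trim leading/trailing dashes
}

export function mintAnchorId(docPath: string, heading: string): string {
  // docPath like "sdk/webhooks/signature-mismatch.mdx", heading like "Verifying the raw body"
  const docSlug = slugify(docPath.replace(/\.mdx?$/, ""));
  const anchorSlug = slugify(heading);
  const id = `doc_${docSlug}#${anchorSlug}`;
  if (!ANCHOR_ID.test(id)) {
    // fail the ingest job rather than mint an id the answer schema would later reject
    throw new Error(`unmintable anchor_id for ${docPath} / ${heading}`);
  }
  return id;
}

export interface DocChunk {
  anchor_id: string;   // stable across reindexes; a citation issued today resolves tomorrow
  pathway_id: string;  // product surface the chunk belongs to (sdk, cli, control-plane, ...)
  text: string;        // ≤ 512 tokens, 80-token overlap, never split mid-code-fence
}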

Retrieval is hybrid. pgvector 0.7 sits on the embedding side with voyage-3-large at 1,024 dimensions. Algolia sits on the lexical side over the same corpus, reusing the synonym table the docs team has maintained for four years. We fuse with reciprocal-rank fusion at k=60, take the top-40 from each lane, dedupe by anchor_id, and rerank with BAAI's bge-reranker-large self-hosted on a single g5.xlarge inside the customer's VPC. The reranker carries a score-margin gate: if the best candidate doesn't beat the second by at least 0.04, the agent refuses on principle and routes to a human ticket. That gate is part of why the false-anchor rate moved from 6.2% to 0.9% — most synonym-hallucination cases fail it before the model ever sees them.
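
A sketch of the fusion and gate logic under the parameters named above (k = 60, top-40 per lane, 0.04 margin); the function names and the `Candidate` shape are illustrative.

// docs-answer/retrieval/fuse.ts (illustrative sketch of the RRF + margin-gate step)

interface Candidate { anchor_id: string; }

// Reciprocal-rank fusion: score(d) = sum over lanes of 1 / (k + rank), k = 60, rank 1-based.
export function rrfFuse(
  vectorLane: Candidate[],   // pgvector / voyage-3-large results, best first
  lexicalLane: Candidate[],  // Algolia results, best first
  k = 60,
  perLane = 40,
): Candidate[] {
  const scores = new Map<string, { c: Candidate; s: number }>();
  for (const lane of [vectorLane.slice(0, perLane), lexicalLane.slice(0, perLane)]) {
    lane.forEach((c, i) => {
      const prev = scores.get(c.anchor_id);            // dedupe by anchor_id across lanes
      const s = (prev?.s ?? 0) + 1 / (k + i + 1);      // i is 0-based, RRF rank is 1-based
      scores.set(c.anchor_id, { c, s });
    });
  }
  return [...scores.values()].sort((a, b) => b.s - a.s).map((x) => x.c);
}

// Score-margin gate on the reranker output: if the best candidate doesn't beat
// the runner-up by at least 0.04, refuse and route to a human ticket instead of answering.
export function passesMarginGate(rerankScores: number[], margin = 0.04): boolean {
  if (rerankScores.length < 2) return rerankScores.length === 1; // no candidates → refuse
  const [best, second] = [...rerankScores].sort((a, b) => b - a);
  return best - second >= margin;
}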

The decision step is two-model. Claude Haiku 4.5 routes intent first: greeting, product-list lookup, docs-question, or out-of-scope. Roughly 28% of inbound queries terminate at Haiku — they don't need Sonnet's reasoning, and paying for it would be a waste. The other 72% pass through to Claude Sonnet 4.6 with `response_format: json_schema` set to the DocsAnswer shape. The model has zero write tools — it cannot open tickets, send emails, or modify docs. All it produces is a JSON object: the prose answer, a 0–1 confidence float, an array of cited claims (each tied to an anchor_id), and an explicit refusal boolean. Every claim has to point to an anchor_id whose pathway-id matches the routed product context, or the validator rejects.
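
A sketch of the validation step that sits between Sonnet's raw output and the surface, including the pathway cross-check added in week 5. `validateAnswer`, the `Verdict` union, and the `anchorPathways` lookup are illustrative names; `DocsAnswer` is the Zod schema shown further down the page.

// docs-answer/validate.ts (illustrative sketch, not the production module)
import { DocsAnswer } from "./schema/answer"; // Zod schema shown further down this page

type Verdict =
  | { kind: "answer"; parsed: DocsAnswer }  // grounded answer, surfaced per confidence tier
  | { kind: "refuse"; parsed: DocsAnswer }  // model refused on purpose, the first-class lane
  | { kind: "retry" }                       // schema or pathway violation → one stricter retry
  | { kind: "fail_closed" };                // second failure → draft ticket, human routing

export function validateAnswer(
  raw: string,
  routedPathway: string,                // product surface Haiku routed the query to
  anchorPathways: Map<string, string>,  // anchor_id → pathway_id, from the index
  attempt: number,                      // 0 on the first pass, 1 on the stricter retry
): Verdict {
  let json: unknown;
  try { json = JSON.parse(raw); } catch { json = null; }

  const result = DocsAnswer.safeParse(json);
  if (!result.success) return attempt === 0 ? { kind: "retry" } : { kind: "fail_closed" };
  if (result.data.refused) return { kind: "refuse", parsed: result.data };

  // Week-5 tightening: a citation whose chunk belongs to a different product surface
  // than the routed intent is a false anchor, even when the anchor_id itself is real.
  const crossProduct = result.data.citations.some(
    (c) => anchorPathways.get(c.anchor_id) !== routedPathway,
  );
  if (crossProduct) return attempt === 0 ? { kind: "retry" } : { kind: "fail_closed" };

  return { kind: "answer", parsed: result.data };
}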

The third outcome lane — refusal — is first-class on purpose. Below confidence 0.5, the surface doesn't answer; it tells the user it doesn't know and opens a draft ticket with the query, the retrieved candidates, and the refusal reason pre-attached. Between 0.5 and 0.8, it shows the answer as a draft with a "verify with support" banner. At ≥ 0.8 it renders the cited answer inline. This three-tier surface is the thing that buys the support lead's sign-off. A confident-wrong answer is the failure mode that scares the team; the architecture makes confident-wrong structurally hard to ship.
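
The tier dispatch itself reduces to a few lines; this is a compact sketch under the thresholds above, with `chooseLane` as an illustrative name.

// docs-answer/surface.ts (illustrative sketch of the three outcome lanes)
import type { DocsAnswer } from "./schema/answer";

type Surface =
  | { lane: "answer" }              // conf ≥ 0.8 · cited answer rendered inline
  | { lane: "suggest_and_ticket" }  // 0.5 ≤ conf < 0.8 · draft answer + "verify with support" banner
  | { lane: "refuse_and_route" };   // conf < 0.5 or explicit refusal · fail closed, open draft ticket

export function chooseLane(a: DocsAnswer): Surface {
  if (a.refused || a.confidence < 0.5) return { lane: "refuse_and_route" };
  if (a.confidence < 0.8) return { lane: "suggest_and_ticket" };
  return { lane: "answer" };
}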

three decisions that shaped the build
design decision · 01

Forced-JSON cited-answer schema

we rejected
Free-text answer with a downstream citation parser
because
Every claim has to point at an anchor_id matching the regex doc_[slug]#[anchor]. The Zod validator is the contract; the model can't hand-wave. If parsing fails, we retry once with a stricter prompt, then fail closed to a human ticket.
design decision · 02

Two-model routing — Haiku 4.5 first, Sonnet 4.6 on real questions

we rejected
Single Sonnet model for every query
because
≈ 28% of inbound queries are greetings, product-name lookups, or one-liner FAQs. Haiku handles them at a tenth the per-call cost; Sonnet reasons over the retrieved evidence on the rest. The routing decision itself is cached for 24h on canonical-form input.
design decision · 03

Algolia as the lexical lane (not BM25 reinvention)

we rejected
Postgres tsvector BM25 alongside pgvector
because
The customer already paid for Algolia. Algolia's synonym table + typo tolerance was tuned by their docs team over four years. Reusing it as the lexical lane in RRF was a measurable +6 points on recall@5 vs a fresh tsvector index, with zero new ops surface.

Guardrails live as Zod schemas + TypeScript runtime checks in the same monorepo as the agent. Per-decision audit-logs capture the retrieved candidates, the reranker scores, the model's raw output, the parsed JSON, the schema-validation verdict, and the rendered surface state — searchable in Langfuse by confidence band and by user-reported feedback signal. The docs lead holds a weekly review meeting with our on-call engineer where any answer the user thumbs-down or any answer the groundedness watchdog flagged below 0.7 is opened. Patterns that show up more than three times in a week become a JIRA ticket against the eval set and a candidate prompt or retrieval tweak.
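
As a sketch, the per-decision audit record looks roughly like the shape below; the field names are illustrative, and this is the record shipped to Langfuse rather than Langfuse's own trace API.

// docs-answer/audit.ts (illustrative shape of the per-decision audit record)

export interface DecisionAudit {
  query_id: string;
  routed_intent: "greeting" | "product_list" | "docs_question" | "out_of_scope";
  retrieved: { anchor_id: string; rrf_score: number; rerank_score: number }[];
  model_raw: string;              // Sonnet's raw JSON string, pre-parse
  parsed: unknown;                // post-Zod parse result (null if validation failed)
  schema_verdict: "answer" | "refuse" | "retry" | "fail_closed";
  confidence: number | null;
  surface: "answer" | "suggest_and_ticket" | "refuse_and_route";
  groundedness_sampled: boolean;  // true for the 5% LLM-as-judge watchdog sample
  user_feedback: "up" | "down" | null;
}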

The reason this shape works is the same reason it was scoped this way at week 1: every component has a separately measurable contract. Retrieval is measurable in recall@5. The reranker is measurable in top-1 precision on the held-out slice. The model decision is measurable on labelled answer correctness. Groundedness is measurable on the LLM-as-judge cut. The refusal lane is measurable as a rate, not a failure. When something regresses, the per-component metric tells us which stage broke — not a single end-to-end number that hides which subsystem moved.

under the hood

The docs-RAG pipeline,
end to end.

One query enters the pipeline. It either renders an in-product answer with cited anchor links, opens a draft ticket with caveats, or fails closed and routes to a human.

outcome · ≥ 0.8 · Answer in product · cited surface with anchor links · ≈ 64% of recoverable tickets
outcome · 0.5–0.8 · Suggest + open ticket · draft answer with caveats · agent reviews before send · ≈ 24%
outcome · <0.5 · Refuse + route to human · fail-closed · no answer surfaces · ≈ 12% of queries

per-stage latency budgets are p50/p95 on the production traffic mix (full table below) · end-to-end p95 is 2.1s to first streamed token, inside the 2.5s service target

schema-validated
every Sonnet answer parses against the Zod schema or fails closed
0
answers shipped without a grounded citation · enforced by the validator
9 product surfaces
covered by one index · 12,400 pages reindex on PR-merge
shadow-first
two-week shadow next to the existing keyword-search answer-bot
the stack

Named tools,
named versions.

Everything in the build is something your platform team can ask pointed questions about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, and policies are checked into the customer's repo — not ours.

Claude Sonnet 4.6 Anthropic API · forced JSON role cited-answer decision model
Claude Haiku 4.5 Anthropic API role intent routing + simple-query fast path
voyage-3-large 1,024 dim role embeddings (Voyage AI)
pgvector 0.7 Postgres 16 role embedding retrieval
Algolia Production tier role lexical retrieval · synonym table reused
BAAI bge-reranker-large self-hosted role cross-encoder rerank
Cloudflare Workers Workers AI gateway role edge · streaming + audit shipping
Langfuse v3 · self-hosted role per-decision trace + groundedness sampling
Sentry JS SDK role in-product error capture on answer surface
how it actually runs

Production shape,
under the hood.

Latency is measured at the answer-surface boundary; cost math uses Anthropic's published Sonnet 4.6 + Haiku 4.5 pricing as of May 2026; eval composition is the frozen 1,260-query set the CI gates on.

Most Claude case studies stop at the architecture diagram. Ours doesn't, because our buyers don't. The two people who decide whether to sign — the head of support and the head of platform — open a case study and look for specific things: per-stage latency with p95 not just p50, a token-cost line that ties to the vendor's published price card, a frozen eval set with category-level thresholds, and an honest accounting of what runs where for compliance scope. Vendors who don't show this either don't have it or are hiding it. The section below maps directly to those questions. Every number is reproducible from a Langfuse trace, a Postgres `EXPLAIN ANALYZE`, or a published vendor price page.

latency budget

Per-stage P50 / P95 (ms · streamed)

  1. stage Intent router · Haiku 4.5
    p50 320
    p95 560
    tooling Anthropic API · ~480 in / ~60 out tokens · 24h canonical-form cache
  2. stage Embedding (query)
    p50 82
    p95 140
    tooling voyage-3-large · 1,024 dim · batched
  3. stage Hybrid retrieval (pgvector ∥ Algolia)
    p50 186
    p95 320
    tooling RRF k=60 · top-40 per lane → dedupe → top-40 unique
  4. stage Cross-encoder rerank
    p50 240
    p95 380
    tooling BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-8
  5. stage Sonnet 4.6 cited answer (first token)
    p50 920
    p95 1480
    tooling Anthropic API · streamed · ~3,200 in / ~360 out tokens
  6. stage Groundedness + render
    p50 16
    p95 32
    tooling Zod schema validation · Cloudflare Workers · streamed to client
  7. stage Total to first token (end-to-end)
    p50 1764
    p95 2092
    tooling agent boundary — excludes client-side render of the citation chips

p50/p95 from a 30-day rolling window over n ≈ 84,000 production decisions. SLO is p95 ≤ 2,500 ms to first streamed token; the current p95 consumes ≈ 84% of that budget.

The retrieval lane is where the most per-stage tuning effort went. The corpus is ≈ 12,400 MDX pages chunked at 512 tokens with 80-token overlap, sentence-anchored, never split mid-code-fence — short enough that the reranker score is meaningful, long enough that an explanation paragraph stays intact. We picked voyage-3-large at 1,024 dimensions specifically because Voyage was Pareto-best on the eval cut against three contenders we ran on the held-out slice (OpenAI text-embedding-3-large, Cohere embed-multilingual-v3, and a self-hosted bge-large fine-tuned on the support-question pairs). The chart below shows the chunk-size sweep we walked before locking in 512 with 80-token overlap. We tried larger chunks first (1,024 tokens, no overlap) — recall@5 dropped four points because the reranker's score got dominated by the rest of the chunk's content rather than the load-bearing sentence. We tried smaller (256, no overlap) — recall@5 dropped three points because explanatory context got cut off mid-thought.

chunk-size sweep · what we shipped vs what we tried

Chunk size × overlap vs recall@5

Two curves on the same eval slice: sentence-anchored chunks with 80-token overlap versus naive splits without. Each datapoint is an actual eval cut, not a model projection. Picked value (signal-marker) is what we shipped.

recall@5 by chunk size (tokens) · naive split, no overlap vs sentence-anchored, 80-token overlap
256 · 0.810 · 0.860
384 · 0.850 · 0.900
512 · 0.870 · 0.930 (picked · value we shipped)
640 · 0.860 · 0.920
768 · 0.840 · 0.910
1024 · 0.830 · 0.890

eval cut · 800-query docs-recoverable slice of the frozen 1,260-item set · numbers are means over 3 random seeds, ± 0.012 SD
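
For reference, recall@5 on that slice reduces to a hit check against the gold anchor_id. This sketch assumes each gold item carries its labelled anchor, and `retrieve` stands in for the hybrid retrieval + rerank stack.

// eval/recall.ts (illustrative recall@5 computation over the docs-recoverable golds)

interface GoldItem { query: string; gold_anchor_id: string; }

export async function recallAtK(
  golds: GoldItem[],
  retrieve: (query: string) => Promise<string[]>,  // returns ranked anchor_ids
  k = 5,
): Promise<number> {
  let hits = 0;
  for (const g of golds) {
    const topK = (await retrieve(g.query)).slice(0, k);
    if (topK.includes(g.gold_anchor_id)) hits += 1;
  }
  // 0.93 on the shipped configuration corresponds to roughly 744 of the 800 golds in the top 5
  return hits / golds.length;
}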

docs-answer/schema/answer.ts typescript
// docs-answer/schema/answer.ts
// Forced-JSON answer schema. Every claim must point to a doc-anchor id;
// the parser rejects answers without grounded evidence and we retry once
// with a stricter prompt, then fail closed (route to human support).

import { z } from "zod";

export const DocsAnswer = z.object({
  answer: z.string().min(20).max(1200),
  confidence: z.number().min(0).max(1),
  citations: z.array(z.object({
    claim: z.string().min(8).max(280),
    anchor_id: z.string().regex(/^doc_[a-z0-9-]+#[a-z0-9-]+$/),
    title: z.string(),
  })).min(1).max(6),
  refused: z.boolean().describe(
    "True if the agent decided it cannot answer — out-of-scope, " +
    "score-margin gate failed, or no citation could be grounded."
  ),
});

export type DocsAnswer = z.infer<typeof DocsAnswer>;
The answer-surface schema. Claude Sonnet 4.6 with response_format: json_schema can't return anything that doesn't conform — every claim has to cite an anchor_id matching the regex, or the validator rejects and the agent retries once with a stricter prompt, then fails closed.
the answer surface · sample render
docs assistant · in-product surface
confidence 92%
user

Why is my webhook returning a 401 even though I copied the secret token from the dashboard?

answer

Webhooks signed with the dashboard-issued secret token return 401 when the request body has been re-serialized between dispatch and verification [1]. The signature is computed over the raw bytes of the payload, so any middleware that re-encodes JSON — including most HTTP clients' default body parsers — will invalidate the signature [2]. You can either configure your framework to expose the raw body before parsing, or use the SDK's built-in verifier which accepts a parsed body and an `x-signature-payload` header that ships the canonical bytes [3].

↑ what an in-product answer looks like at confidence ≥ 0.8 · below that, the same shape renders with a "draft only — open a ticket" banner; below 0.5 the surface refuses and routes to a human

unit economics

Per-decision and monthly cost math

  1. line item Claude Haiku 4.5 · routing (all queries)
    $ / answer $0.0007
    $ / month (≈ 84k answers) $59
    note 480 tokens × $1.00 / 1M + 60 × $5.00 / 1M
  2. line item Claude Sonnet 4.6 · input (72% of queries)
    $ / answer $0.0096
    $ / month (≈ 84k answers) $580
    note 3,200 tokens × $3.00 / 1M · only on routed queries
  3. line item Claude Sonnet 4.6 · output (72% of queries)
    $ / answer $0.0054
    $ / month (≈ 84k answers) $326
    note 360 tokens × $15.00 / 1M · streamed
  4. line item voyage-3-large embeddings (avg query)
    $ / answer $0.00036
    $ / month (≈ 84k answers) $30
    note ≈ 3,000 tokens × $0.12 / 1M
  5. line item pgvector · RDS db.m6i.large (BAA-eligible)
    $ / answer
    $ / month (≈ 84k answers) $284
    note Postgres 16 · embeddings + anchor-id index
  6. line item Algolia · production tier (existing line)
    $ / answer
    $ / month (≈ 84k answers) $0
    note already-paid line · reused as RRF lexical lane
  7. line item g5.xlarge reranker (24/7 in VPC)
    $ / answer
    $ / month (≈ 84k answers) $378
    note BAAI bge-reranker-large self-host
  8. line item Cloudflare Workers · edge + audit shipping
    $ / answer
    $ / month (≈ 84k answers) $96
    note Workers Paid + Workers AI gateway
  9. line item Langfuse self-hosted (t3.medium)
    $ / answer
    $ / month (≈ 84k answers) $67
    note trace store; 90-day hot / 7-yr cold
  10. line item All-in monthly
    $ / answer ≈ $0.0217
    $ / month (≈ 84k answers) ≈ $1,820
    note vs. ≈ $9,400 / mo to add one tier-1 support engineer

Token costs use Anthropic's public May-2026 pricing — Sonnet 4.6 at $3 / 1M input + $15 / 1M output; Haiku 4.5 at $1 / 1M input + $5 / 1M output. Infra costs are AWS US-east-2 list price. Volume of 84k answers/month is steady-state after rollout, of which 28% terminate at Haiku without ever invoking Sonnet — the routing line is what makes the math reconcile.
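
As a sanity check on the per-answer figure, the table's rounded line items reconcile like this (a sketch, not billing code; the constants are just the numbers quoted above):

// cost/reconcile.ts (back-of-envelope reconciliation of the cost table above;
// rounded per-line numbers, so the cents drift slightly against the table's totals)

const answersPerMonth = 84_000;
const sonnetShare = 0.72;                  // 28% of queries terminate at Haiku

const perAnswerModelCost =
  0.0007 +                                 // Haiku routing, every query
  sonnetShare * (0.0096 + 0.0054) +        // Sonnet input + output, routed queries only
  0.00036;                                 // voyage-3-large query embedding
// ≈ $0.0119 of model spend per answer

const monthlyInfra = 284 + 378 + 96 + 67;  // RDS + reranker + Workers + Langfuse (Algolia already paid)
const perAnswerAllIn = perAnswerModelCost + monthlyInfra / answersPerMonth;

console.log(perAnswerAllIn.toFixed(4));    // ≈ 0.0217 → ≈ $1,820 / month at 84k answers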

eval composition

What's in the frozen 1,260-query set

  1. category Docs-recoverable golds
    items 800
    what it checks labelled answer + correct anchor_id on real (de-identified) past tickets
    ci-gate threshold ≥ 0.90 recall@5 + groundedness
  2. category Not-docs-recoverable
    items 220
    what it checks agent must refuse · no answer surface · ticket draft only
    ci-gate threshold ≥ 0.95 refusal recall
  3. category Adversarial (synonyms, jailbreaks, ambiguity)
    items 140
    what it checks synonym traps, prompt-injection attempts, intentionally ambiguous queries
    ci-gate threshold 100% refusal on listed must-refuse · 0 jailbreaks
  4. category Routing-only (Haiku terminates)
    items 100
    what it checks greetings, product-list lookups, FAQ; Sonnet should never fire
    ci-gate threshold ≥ 0.95 router accuracy

Eval set is frozen — items only added, never edited. Docs lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry.
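
A sketch of what that gate reduces to in CI; the category names and the one-point threshold come from the table above, everything else (function name, shapes) is illustrative.

// eval/ci-gate.ts (illustrative release gate over the frozen eval categories)

interface CategoryCut { category: string; score: number; }  // score in [0, 1]

export function releaseBlocked(
  prior: CategoryCut[],
  current: CategoryCut[],
  maxDropPoints = 1,        // fail if any category drops > 1 point vs the prior cut
): string[] {
  const prev = new Map(prior.map((c) => [c.category, c.score]));
  return current
    .filter((c) => {
      const before = prev.get(c.category);
      return before !== undefined && (before - c.score) * 100 > maxDropPoints;
    })
    .map((c) => c.category);  // non-empty → CI fails; override needs a signed CHANGELOG entry
}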

Production ops cadence is part of the build, not an afterthought. The docs lead and our on-call engineer hold a weekly review meeting where every answer the user thumbs-down or the groundedness watchdog flagged sub-0.7 gets opened. Drift that looks systematic (more than three of the same pattern in a week) becomes a JIRA ticket against the eval set and a candidate prompt or retrieval tweak. Langfuse trace retention is 90 days hot plus seven years cold. Sentry catches any in-product render error on the citation chips — those are mostly editorial drift (a docs page was renamed but its old anchor_id is still in the index) and get caught within the next PR-merge cycle. Our on-call rotation runs two engineers a week against a 99.5% pipeline-availability SLO and the 95th-percentile-under-2.5s first-token SLO. Few teams shipping docs agents publish any of this. That's the bar.

7 weeks · honest version

The timeline
including the week we almost cut.

Five stages, milestone-billed. The week-5 shadow run caught a synonym-hallucination case — the model confidently returned the wrong product's answer because the cited anchor_id was real but from a different product surface. We halted promotion, tightened the validator, re-ran the groundedness eval, and only then cut over. The honest version of `7 weeks` includes the days we sat on our hands.

  1. Week 1

    Discovery + eval set

    One week with the support lead, the docs lead, and an SRE who owns the help-center. We sampled 1,200 closed tickets from the last 90 days and labelled each one `docs-recoverable` or `not-docs-recoverable` with the docs team. That sample produced the frozen 1,260-item eval set: 800 docs-recoverable queries with their correct doc-anchor target, 220 not-docs-recoverable queries (the agent must refuse), 140 adversarial queries (synonym traps, jailbreaks, intentional ambiguity), and 100 routing-only queries Haiku should handle without Sonnet at all.

    Frozen 1,260-item eval set + acuity-shaped scoring rubric
  2. Weeks 2–3

    Corpus + dual-index build

    Ingested 12,400 MDX pages into pgvector 0.7 with anchor-preserving chunking — 512 tokens per chunk, 80-token overlap, sentence-anchored, never split mid-code-fence. Each chunk minted an anchor_id of shape doc_[slug]#[anchor-slug], matching the regex the answer schema enforces downstream. Algolia indexed the same corpus on the lexical side, reusing the synonym table the docs team already maintains. RRF fusion at k=60 tuned on a held-out eval slice.

    Hybrid retrieval at 0.93 recall@5 on the eval set
  3. Week 4

    Routing + cited-answer agent

    Claude Haiku 4.5 routes intent: greeting / product-list / docs-question / out-of-scope. Docs-questions go to Claude Sonnet 4.6 with `response_format: json_schema` set to the DocsAnswer shape. The schema is the contract — every claim must cite an anchor_id, confidence is bounded 0–1, refusal is an explicit boolean. Cloudflare Workers wraps the whole pipeline for edge streaming + audit shipping into Langfuse.

    End-to-end answer pipeline behind a beta flag
  4. Week 5

    Shadow run — synonym hallucination caught

    Two-week shadow run against the existing answer-bot. Day 3 the support lead flagged a case: a user had asked about `API key`, the model retrieved chunks about `secret tokens` (the docs use both terms in different products), and confidently returned an answer pointing at the wrong product's auth flow. The grounded-citation regex had passed because the model picked a real anchor_id — it just wasn't the right one. We halted promotion, tightened the validator to require that the cited chunk's pathway-id match the routed product context, ran the groundedness eval LLM-as-judge against the full shadow slice, and only then promoted.

    Groundedness eval lifted 0.88 → 0.95 · false-anchor rate cut 6.2% → 0.9%
    Walk-away point
  5. Weeks 6–7

    Cutover + groundedness watchdog

    Promoted to the help-center search surface with the old keyword search retained in active-standby for 30 days. 5% of live answers sampled by an LLM-as-judge groundedness watchdog, with a manual review queue surfacing every disagreement to the docs team. The watchdog is also wired to Sentry — any answer flagged below 0.7 groundedness pages the on-call. The old answer-bot stays available behind a `legacy` query flag for the support team for 60 more days while diffs are reviewed.

    Production cutover · groundedness watchdog + 5% live sample
eval results · 1,260 frozen queries

How we know
it works.

The eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 1,260. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2% across all rows over the last 30 days.

metric · baseline (old bot) · v1 (wk 4) · v2 (wk 5 post-fix) · current (live) · target
Recall@5 on docs-recoverable queries · 0.73 (keyword) · 0.88 · 0.91 · 0.93 · ≥ 0.90
Answer groundedness (LLM-judge) · 0.82 (old bot) · 0.88 · 0.94 · 0.95 · ≥ 0.93
False-anchor rate (cited wrong product) · n/a · 6.2% · 1.1% · 0.9% · ≤ 1.5%
Refusal rate (fail-closed below 0.5 conf) · n/a · 9.8% · 13.4% · 11.6% · 10–14%
Routing accuracy (Haiku vs Sonnet) · n/a · 0.94 · 0.96 · 0.97 · ≥ 0.95
P95 first-token latency (streamed) · n/a · 2.6s · 2.3s · 2.1s · ≤ 2.5s
(n/a: no comparable measurement exists for the old bot.)

Sample size for the headline deflection number (≈ 64% docs-recoverable tickets resolved at confidence ≥ 0.8) is n=3,400 user sessions across the 30-day shadow window; the figure is reported with a 95% confidence interval, not as a bare point estimate. Recall@5 baseline is the prior keyword-search bot on the same eval slice. False-anchor rate is the share of answers where every cited anchor_id was real but at least one belonged to a different product surface than the routed intent — the kill-point failure mode that drove the week-5 halt. Refusal rate is by-design; it is the share of queries where the agent decided it could not answer.

Ready to ship

Want a case study like this
for your stack?

Book a $3K fixed-fee audit. We'll review the docs corpus, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether it's case-study-shaped. About one audit in five ends with `you don't need this — buy the platform, here's the SOW for integration.`

Read the Claude pillar
30 min, async or live · Eval-first scoping · Walk-away point in the pilot