Developer-tooling SaaS · B2B Claude RAG · cited-answer surface
Claude Sonnet 4.6 · role: Cited-answer model · forced JSON
Claude Haiku 4.5 · role: Intent router · simple-query fast path
pgvector 0.7 · role: Embedding index over 12,400 docs pages
bge-reranker-large · role: Cross-encoder rerank · self-hosted
Cloudflare Workers · role: Edge · streaming + audit shipping
Langfuse · role: Per-decision trace · groundedness sampling
case study · 2026 · anonymized

A Claude case study: cited answers
over 12,000 product-docs pages.

A developer-tooling SaaS needed an in-product answer surface that could reason across nine product surfaces and twelve thousand docs pages, return cited answers users could click to verify, and refuse — out loud — when it didn't know. We built it on Claude Sonnet 4.6 + Haiku 4.5, a hybrid pgvector + Algolia index, and a forced-JSON schema that rejects any answer without grounded citations. Seven weeks, shadow-first, with a synonym-hallucination kill point at week 5 that we actually had to use.

≈ 64%
docs-recoverable tickets deflected at conf ≥ 0.8 (95% CI · n=3,400 over 30-day shadow)
p95 2.1s
first-token streamed answer latency · meets <2.5s service target
1,260
frozen eval queries · re-run on every release
7 weeks
discovery to production cutover
shipped
7 weeks · 3 engineers · 1 docs lead · 1 support lead
41%
of support tickets recoverable from public docs (sampled n=1,200)
2.3 / 5
user-rated docs-search satisfaction · pre-build
12,400
docs pages across 9 product surfaces
18%
of answer-bot replies failed groundedness on shadow eval
the problem

A docs-search bot
that lied with confidence.

Keyword search couldn't reason across nested module hierarchies. The existing answer-bot hallucinated on synonyms. The support queue paid the bill.

The client is a developer-tooling SaaS — nine product surfaces (an SDK, a CLI, a hosted control plane, three integrations, two open-source packages, and a sandbox) with roughly twelve thousand four hundred docs pages across them, all maintained in MDX in a public monorepo. Like most growth-stage developer tools, the docs are the front door, the support funnel, and the deciding factor for self-serve conversion in roughly the same breath. They were also the binding constraint: the support team's own audit showed that forty-one percent of inbound tickets, sampled across a randomized n=1,200 over the prior quarter, could have been answered from a public docs page that already existed. The user just hadn't found it.

today vs · with the agent

today
User asks question → Keyword search (no synonym handling) → Old answer-bot (hallucinates on synonyms) → User opens ticket
outcome · Ticket queue inflated · 41% recoverable from public docs · groundedness 0.82

with the agent
User asks question → Haiku 4.5 routes intent → Hybrid retrieval + rerank (voyage-3-large + bge) → Sonnet 4.6 cited answer (forced-JSON schema · anchor citations)
outcomes · Answer in product (≥ 0.8) · Suggest + ticket (0.5–0.8) · Refuse + route (< 0.5)

The presenting symptoms broke down two ways. First, the keyword search couldn't reason across the product's nested module hierarchies — a question about `transaction.refund.signature` would surface generic transaction pages but miss the specific signature-mismatch troubleshooting note three levels deeper that actually held the answer. Second, the team had previously tried an off-the-shelf answer-bot that ran a single embedding lookup + a generic prompt — and it was confidently wrong on synonyms. Their corpus uses `API key`, `secret token`, and `signing secret` in different products for related-but-distinct concepts, and the old bot blurred them. Internal eval put its groundedness at 0.82, with an 18% rate of answers that cited a real doc page but the wrong one. The support lead's exact phrasing in the discovery call was: "I'd rather it refuse than answer wrong; right now it answers wrong about one time in five."

They had looked at every vendor in the docs-search space and turned every one of them down. The objections were technical, not commercial: no honest evaluation methodology in any vendor pitch, no audit log on the model's claims, no clear refusal path when retrieval was thin, no way to require that the citation be load-bearing rather than decorative. The conversation we walked into was not "should we ship Claude" — it was "show us how a docs agent could lie, and tell us how you'd catch it before a user trusted it."

That framing decided the engagement. We refused to scope this as a chatbot build. The deliverable was a structured-output answer surface with a citation-first contract, evidence-chunk-id enforcement on every claim, a frozen eval set that gated every release, and a refusal lane that was first-class rather than an error case. The rest of the page is what we shipped.

the approach

Six stages,
three outcome lanes.

Every question runs the same six-stage pipeline. The agent is allowed to answer-in-product, suggest-and-open-ticket, or refuse-and-route. Forced-JSON output with grounded citations is the only legal shape of an answer.

The architecture below is the production shape, not a marketing diagram. The docs corpus ingests from the customer's MDX monorepo via a GitHub Action that triggers on PR-merge — only changed pages reindex, the rest hot-reload from cache. The anchor-preserving chunker is the load-bearing detail: each chunk carries a stable anchor_id of shape `doc_[slug]#[anchor-slug]`, minted at chunk-time and never regenerated, so a citation issued today resolves to the same passage tomorrow even after editorial reshuffling. The shape of the id is also exactly the shape the answer schema enforces downstream with a regex.
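
A minimal sketch of that anchor-minting step, assuming a slug scheme derived from the MDX path and the nearest heading. The file name, `mintAnchorId`, and the `DocChunk` shape are illustrative, not the production chunker, which additionally applies the 512-token / 80-token-overlap and never-split-mid-code-fence rules described further down.

// docs-ingest/chunker/anchor.ts (illustrative sketch, not the production chunker)
// Mints a stable anchor_id at chunk time: doc_[slug]#[anchor-slug].
// The id is derived only from the doc path and the nearest heading, so
// re-chunking an edited page reproduces the same id for unchanged anchors.

const ANCHOR_ID = /^doc_[a-z0-9-]+#[a-z0-9-]+$/; // same regex the DocsAnswer schema enforces

function slugify(s: string): string {
  return s
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")   // collapse anything non-alphanumeric to a dash
    .replace(/^-+|-+$/g, "");      // trim leading/trailing dashes
}

export function mintAnchorId(docPath: string, heading: string): string {
  // docPath like "sdk/webhooks/signature-mismatch.mdx", heading like "Verifying the raw body"
  const docSlug = slugify(docPath.replace(/\.mdx?$/, ""));
  const anchorSlug = slugify(heading);
  const id = `doc_${docSlug}#${anchorSlug}`;
  if (!ANCHOR_ID.test(id)) {
    // fail the ingest job rather than mint an id the answer schema would later reject
    throw new Error(`unmintable anchor_id for ${docPath} / ${heading}`);
  }
  return id;
}

export interface DocChunk {
  anchor_id: string;   // stable across reindexes; a citation issued today resolves tomorrow
  pathway_id: string;  // product surface the chunk belongs to (sdk, cli, control-plane, ...)
  text: string;        // ≤ 512 tokens, 80-token overlap, never split mid-code-fence
}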

Retrieval is hybrid. pgvector 0.7 sits on the embedding side with voyage-3-large at 1,024 dimensions. Algolia sits on the lexical side over the same corpus, reusing the synonym table the docs team has maintained for four years. We fuse with reciprocal-rank fusion at k=60, take the top-40 from each lane, dedupe by anchor_id, and rerank with BAAI's bge-reranker-large self-hosted on a single g5.xlarge inside the customer's VPC. The reranker carries a score-margin gate: if the best candidate doesn't beat the second by at least 0.04, the agent refuses on principle and routes to a human ticket. That gate is part of why the false-anchor rate moved from 6.2% to 0.9% — most synonym-hallucination cases fail it before the model ever sees them.
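
A sketch of the fusion and gate logic under the parameters named above (k = 60, top-40 per lane, 0.04 margin); the function names and the `Candidate` shape are illustrative.

// docs-answer/retrieval/fuse.ts (illustrative sketch of the RRF + margin-gate step)

interface Candidate { anchor_id: string; }

// Reciprocal-rank fusion: score(d) = sum over lanes of 1 / (k + rank), k = 60, rank 1-based.
export function rrfFuse(
  vectorLane: Candidate[],   // pgvector / voyage-3-large results, best first
  lexicalLane: Candidate[],  // Algolia results, best first
  k = 60,
  perLane = 40,
): Candidate[] {
  const scores = new Map<string, { c: Candidate; s: number }>();
  for (const lane of [vectorLane.slice(0, perLane), lexicalLane.slice(0, perLane)]) {
    lane.forEach((c, i) => {
      const prev = scores.get(c.anchor_id);            // dedupe by anchor_id across lanes
      const s = (prev?.s ?? 0) + 1 / (k + i + 1);      // i is 0-based, RRF rank is 1-based
      scores.set(c.anchor_id, { c, s });
    });
  }
  return [...scores.values()].sort((a, b) => b.s - a.s).map((x) => x.c);
}

// Score-margin gate on the reranker output: if the best candidate doesn't beat
// the runner-up by at least 0.04, refuse and route to a human ticket instead of answering.
export function passesMarginGate(rerankScores: number[], margin = 0.04): boolean {
  if (rerankScores.length < 2) return rerankScores.length === 1; // no candidates → refuse
  const [best, second] = [...rerankScores].sort((a, b) => b - a);
  return best - second >= margin;
}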

The decision step is two-model. Claude Haiku 4.5 routes intent first: greeting, product-list lookup, docs-question, or out-of-scope. Roughly 28% of inbound queries terminate at Haiku — they don't need Sonnet's reasoning, and paying for it would be a waste. The other 72% pass through to Claude Sonnet 4.6 with `response_format: json_schema` set to the DocsAnswer shape. The model has zero write tools — it cannot open tickets, send emails, or modify docs. All it produces is a JSON object: the prose answer, a 0–1 confidence float, an array of cited claims (each tied to an anchor_id), and an explicit refusal boolean. Every claim has to point to an anchor_id whose pathway-id matches the routed product context, or the validator rejects.
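
A sketch of the validation step that sits between Sonnet's raw output and the surface, including the pathway cross-check added in week 5. `validateAnswer`, the `Verdict` union, and the `anchorPathways` lookup are illustrative names; `DocsAnswer` is the Zod schema shown further down the page.

// docs-answer/validate.ts (illustrative sketch, not the production module)
import { DocsAnswer } from "./schema/answer"; // Zod schema shown further down this page

type Verdict =
  | { kind: "answer"; parsed: DocsAnswer }  // grounded answer, surfaced per confidence tier
  | { kind: "refuse"; parsed: DocsAnswer }  // model refused on purpose, the first-class lane
  | { kind: "retry" }                       // schema or pathway violation → one stricter retry
  | { kind: "fail_closed" };                // second failure → draft ticket, human routing

export function validateAnswer(
  raw: string,
  routedPathway: string,                // product surface Haiku routed the query to
  anchorPathways: Map<string, string>,  // anchor_id → pathway_id, from the index
  attempt: number,                      // 0 on the first pass, 1 on the stricter retry
): Verdict {
  let json: unknown;
  try { json = JSON.parse(raw); } catch { json = null; }

  const result = DocsAnswer.safeParse(json);
  if (!result.success) return attempt === 0 ? { kind: "retry" } : { kind: "fail_closed" };
  if (result.data.refused) return { kind: "refuse", parsed: result.data };

  // Week-5 tightening: a citation whose chunk belongs to a different product surface
  // than the routed intent is a false anchor, even when the anchor_id itself is real.
  const crossProduct = result.data.citations.some(
    (c) => anchorPathways.get(c.anchor_id) !== routedPathway,
  );
  if (crossProduct) return attempt === 0 ? { kind: "retry" } : { kind: "fail_closed" };

  return { kind: "answer", parsed: result.data };
}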

The third outcome lane — refusal — is first-class on purpose. Below confidence 0.5, the surface doesn't answer; it tells the user it doesn't know and opens a draft ticket with the query, the retrieved candidates, and the refusal reason pre-attached. Between 0.5 and 0.8, it shows the answer as a draft with a "verify with support" banner. At ≥ 0.8 it renders the cited answer inline. This three-tier surface is the thing that buys the support lead's sign-off. A confident-wrong answer is the failure mode that scares the team; the architecture makes confident-wrong structurally hard to ship.
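
The tier dispatch itself reduces to a few lines; this is a compact sketch under the thresholds above, with `chooseLane` as an illustrative name.

// docs-answer/surface.ts (illustrative sketch of the three outcome lanes)
import type { DocsAnswer } from "./schema/answer";

type Surface =
  | { lane: "answer" }              // conf ≥ 0.8 · cited answer rendered inline
  | { lane: "suggest_and_ticket" }  // 0.5 ≤ conf < 0.8 · draft answer + "verify with support" banner
  | { lane: "refuse_and_route" };   // conf < 0.5 or explicit refusal · fail closed, open draft ticket

export function chooseLane(a: DocsAnswer): Surface {
  if (a.refused || a.confidence < 0.5) return { lane: "refuse_and_route" };
  if (a.confidence < 0.8) return { lane: "suggest_and_ticket" };
  return { lane: "answer" };
}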

three decisions that shaped the build
design decision · 01

Forced-JSON cited-answer schema

we rejected
Free-text answer with a downstream citation parser
because
Every claim has to point at an anchor_id matching the regex doc_[slug]#[anchor]. The Zod validator is the contract; the model can't hand-wave. If parsing fails, we retry once with a stricter prompt, then fail closed to a human ticket.
design decision · 02

Two-model routing — Haiku 4.5 first, Sonnet 4.6 on real questions

we rejected
Single Sonnet model for every query
because
≈ 28% of inbound queries are greetings, product-name lookups, or one-liner FAQs. Haiku handles them at a tenth the per-call cost; Sonnet reasons over the retrieved evidence on the rest. The routing decision itself is cached for 24h on canonical-form input.
design decision · 03

Algolia as the lexical lane (not BM25 reinvention)

we rejected
Postgres tsvector BM25 alongside pgvector
because
The customer already paid for Algolia. Algolia's synonym table + typo tolerance was tuned by their docs team over four years. Reusing it as the lexical lane in RRF was a measurable +6 points on recall@5 vs a fresh tsvector index, with zero new ops surface.

Guardrails live as Zod schemas + TypeScript runtime checks in the same monorepo as the agent. Per-decision audit-logs capture the retrieved candidates, the reranker scores, the model's raw output, the parsed JSON, the schema-validation verdict, and the rendered surface state — searchable in Langfuse by confidence band and by user-reported feedback signal. The docs lead holds a weekly review meeting with our on-call engineer where any answer the user thumbs-down or any answer the groundedness watchdog flagged below 0.7 is opened. Patterns that show up more than three times in a week become a JIRA ticket against the eval set and a candidate prompt or retrieval tweak.
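
As a sketch, the per-decision audit record looks roughly like the shape below; the field names are illustrative, and this is the record shipped to Langfuse rather than Langfuse's own trace API.

// docs-answer/audit.ts (illustrative shape of the per-decision audit record)

export interface DecisionAudit {
  query_id: string;
  routed_intent: "greeting" | "product_list" | "docs_question" | "out_of_scope";
  retrieved: { anchor_id: string; rrf_score: number; rerank_score: number }[];
  model_raw: string;              // Sonnet's raw JSON string, pre-parse
  parsed: unknown;                // post-Zod parse result (null if validation failed)
  schema_verdict: "answer" | "refuse" | "retry" | "fail_closed";
  confidence: number | null;
  surface: "answer" | "suggest_and_ticket" | "refuse_and_route";
  groundedness_sampled: boolean;  // true for the 5% LLM-as-judge watchdog sample
  user_feedback: "up" | "down" | null;
}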

The reason this shape works is the same reason it was scoped this way at week 1: every component has a separately measurable contract. Retrieval is measurable in recall@5. The reranker is measurable in top-1 precision on the held-out slice. The model decision is measurable on labelled answer correctness. Groundedness is measurable on the LLM-as-judge cut. The refusal lane is measurable as a rate, not a failure. When something regresses, the per-component metric tells us which stage broke — not a single end-to-end number that hides which subsystem moved.

under the hood

The docs-RAG pipeline,
end to end.

One query enters the pipeline. It either renders an in-product answer with cited anchor links, opens a draft ticket with caveats, or fails closed and routes to a human.

outcome · ≥ 0.8 · Answer in product · cited surface with anchor links · ≈ 64% of recoverable tickets
outcome · 0.5–0.8 · Suggest + open ticket · draft answer with caveats · agent reviews before send · ≈ 24%
outcome · <0.5 · Refuse + route to human · fail-closed · no answer surfaces · ≈ 12% of queries

per-stage latency budgets are p50/p95 on the production traffic mix (full table below) · end-to-end p95 is 2.1s to first streamed token, inside the 2.5s service target

schema-validated
every Sonnet answer parses against the Zod schema or fails closed
0
answers shipped without a grounded citation · enforced by the validator
9 product surfaces
covered by one index · 12,400 pages reindex on PR-merge
shadow-first
two-week shadow next to the existing keyword-search answer-bot
the stack

Named tools,
named versions.

Everything in the build is something your platform team can ask pointed questions about. Nothing in the build is `our proprietary AI`. Vendor swap-out cost is bounded because the eval set, prompts, and policies are checked into the customer's repo — not ours.

Claude Sonnet 4.6 Anthropic API · forced JSON role cited-answer decision model
Claude Haiku 4.5 Anthropic API role intent routing + simple-query fast path
voyage-3-large 1,024 dim role embeddings (Voyage AI)
pgvector 0.7 Postgres 16 role embedding retrieval
Algolia Production tier role lexical retrieval · synonym table reused
BAAI bge-reranker-large self-hosted role cross-encoder rerank
Cloudflare Workers Workers AI gateway role edge · streaming + audit shipping
Langfuse v3 · self-hosted role per-decision trace + groundedness sampling
Sentry JS SDK role in-product error capture on answer surface
how it actually runs

Production shape,
under the hood.

Latency is measured at the answer-surface boundary; cost math uses Anthropic's published Sonnet 4.6 + Haiku 4.5 pricing as of May 2026; eval composition is the frozen 1,260-query set the CI gates on.

Most Claude case studies stop at the architecture diagram. Ours doesn't, because our buyers don't. The two people who decide whether to sign — the head of support and the head of platform — open a case study and look for specific things: per-stage latency with p95 not just p50, a token-cost line that ties to the vendor's published price card, a frozen eval set with category-level thresholds, and an honest accounting of what runs where for compliance scope. Vendors who don't show this either don't have it or are hiding it. The section below maps directly to those questions. Every number is reproducible from a Langfuse trace, a Postgres `EXPLAIN ANALYZE`, or a published vendor price page.

latency budget

Per-stage P50 / P95 (ms · streamed)

  1. stage Intent router · Haiku 4.5
    p50 320
    p95 560
    tooling Anthropic API · ~480 in / ~60 out tokens · 24h canonical-form cache
  2. stage Embedding (query)
    p50 82
    p95 140
    tooling voyage-3-large · 1,024 dim · batched
  3. stage Hybrid retrieval (pgvector ∥ Algolia)
    p50 186
    p95 320
    tooling RRF k=60 · top-40 per lane → dedupe → top-40 unique
  4. stage Cross-encoder rerank
    p50 240
    p95 380
    tooling BAAI/bge-reranker-large · g5.xlarge in customer VPC · top-8
  5. stage Sonnet 4.6 cited answer (first token)
    p50 920
    p95 1480
    tooling Anthropic API · streamed · ~3,200 in / ~360 out tokens
  6. stage Groundedness + render
    p50 16
    p95 32
    tooling Zod schema validation · Cloudflare Workers · streamed to client
  7. stage Total to first token (end-to-end)
    p50 1764
    p95 2092
    tooling agent boundary — excludes client-side render of the citation chips

p50/p95 from a 30-day rolling window over n ≈ 84,000 production decisions. SLO is p95 ≤ 2,500 ms to first streamed token; the current p95 consumes ≈ 84% of that budget.

The retrieval lane is where the most per-stage tuning effort went. The corpus is ≈ 12,400 MDX pages chunked at 512 tokens with 80-token overlap, sentence-anchored, never split mid-code-fence — short enough that the reranker score is meaningful, long enough that an explanation paragraph stays intact. We picked voyage-3-large at 1,024 dimensions specifically because Voyage was Pareto-best on the eval cut against three contenders we ran on the held-out slice (OpenAI text-embedding-3-large, Cohere embed-multilingual-v3, and a self-hosted bge-large fine-tuned on the support-question pairs). The chart below shows the chunk-size sweep we walked before locking in 512 with 80-token overlap. We tried larger chunks first (1,024 tokens, no overlap) — recall@5 dropped four points because the reranker's score got dominated by the rest of the chunk's content rather than the load-bearing sentence. We tried smaller (256, no overlap) — recall@5 dropped three points because explanatory context got cut off mid-thought.

chunk-size sweep · what we shipped vs what we tried

Chunk size × overlap vs recall@5

Two curves on the same eval slice: sentence-anchored chunks with 80-token overlap versus naive splits without. Each datapoint is an actual eval cut, not a model projection. Picked value (signal-marker) is what we shipped.

recall@5 by chunk size (tokens) · naive split, no overlap vs sentence-anchored, 80-token overlap
256 · 0.810 · 0.860
384 · 0.850 · 0.900
512 · 0.870 · 0.930 (picked · value we shipped)
640 · 0.860 · 0.920
768 · 0.840 · 0.910
1024 · 0.830 · 0.890

eval cut · 800-query docs-recoverable slice of the frozen 1,260-item set · numbers are means over 3 random seeds, ± 0.012 SD
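
For reference, recall@5 on that slice reduces to a hit check against the gold anchor_id. This sketch assumes each gold item carries its labelled anchor, and `retrieve` stands in for the hybrid retrieval + rerank stack.

// eval/recall.ts (illustrative recall@5 computation over the docs-recoverable golds)

interface GoldItem { query: string; gold_anchor_id: string; }

export async function recallAtK(
  golds: GoldItem[],
  retrieve: (query: string) => Promise<string[]>,  // returns ranked anchor_ids
  k = 5,
): Promise<number> {
  let hits = 0;
  for (const g of golds) {
    const topK = (await retrieve(g.query)).slice(0, k);
    if (topK.includes(g.gold_anchor_id)) hits += 1;
  }
  // 0.93 on the shipped configuration corresponds to roughly 744 of the 800 golds in the top 5
  return hits / golds.length;
}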

docs-answer/schema/answer.ts typescript
// docs-answer/schema/answer.ts
// Forced-JSON answer schema. Every claim must point to a doc-anchor id;
// the parser rejects answers without grounded evidence and we retry once
// with a stricter prompt, then fail closed (route to human support).

import { z } from "zod";

export const DocsAnswer = z.object({
  answer: z.string().min(20).max(1200),
  confidence: z.number().min(0).max(1),
  citations: z.array(z.object({
    claim: z.string().min(8).max(280),
    anchor_id: z.string().regex(/^doc_[a-z0-9-]+#[a-z0-9-]+$/),
    title: z.string(),
  })).min(1).max(6),
  refused: z.boolean().describe(
    "True if the agent decided it cannot answer — out-of-scope, " +
    "score-margin gate failed, or no citation could be grounded."
  ),
});

export type DocsAnswer = z.infer<typeof DocsAnswer>;
The answer-surface schema. Claude Sonnet 4.6 with response_format: json_schema can't return anything that doesn't conform — every claim has to cite an anchor_id matching the regex, or the validator rejects and the agent retries once with a stricter prompt, then fails closed.
the answer surface · sample render
docs assistant · in-product surface
confidence 92%
user

Why is my webhook returning a 401 even though I copied the secret token from the dashboard?

answer

Webhooks signed with the dashboard-issued secret token return 401 when the request body has been re-serialized between dispatch and verification [1]. The signature is computed over the raw bytes of the payload, so any middleware that re-encodes JSON — including most HTTP clients' default body parsers — will invalidate the signature [2]. You can either configure your framework to expose the raw body before parsing, or use the SDK's built-in verifier which accepts a parsed body and an `x-signature-payload` header that ships the canonical bytes [3].

↑ what an in-product answer looks like at confidence ≥ 0.8 · below that, the same shape renders with a "draft only — open a ticket" banner; below 0.5 the surface refuses and routes to a human

unit economics

Per-decision and monthly cost math

  1. line item Claude Haiku 4.5 · routing (all queries)
    $ / answer $0.0007
    $ / month (≈ 84k answers) $59
    note 480 tokens × $1.00 / 1M + 60 × $5.00 / 1M
  2. line item Claude Sonnet 4.6 · input (72% of queries)
    $ / answer $0.0096
    $ / month (≈ 84k answers) $580
    note 3,200 tokens × $3.00 / 1M · only on routed queries
  3. line item Claude Sonnet 4.6 · output (72% of queries)
    $ / answer $0.0054
    $ / month (≈ 84k answers) $326
    note 360 tokens × $15.00 / 1M · streamed
  4. line item voyage-3-large embeddings (avg query)
    $ / answer $0.00036
    $ / month (≈ 84k answers) $30
    note ≈ 3,000 tokens × $0.12 / 1M
  5. line item pgvector · RDS db.m6i.large (BAA-eligible)
    $ / answer
    $ / month (≈ 84k answers) $284
    note Postgres 16 · embeddings + anchor-id index
  6. line item Algolia · production tier (existing line)
    $ / answer
    $ / month (≈ 84k answers) $0
    note already-paid line · reused as RRF lexical lane
  7. line item g5.xlarge reranker (24/7 in VPC)
    $ / answer
    $ / month (≈ 84k answers) $378
    note BAAI bge-reranker-large self-host
  8. line item Cloudflare Workers · edge + audit shipping
    $ / answer
    $ / month (≈ 84k answers) $96
    note Workers Paid + Workers AI gateway
  9. line item Langfuse self-hosted (t3.medium)
    $ / answer
    $ / month (≈ 84k answers) $67
    note trace store; 90-day hot / 7-yr cold
  10. line item All-in monthly
    $ / answer ≈ $0.0217
    $ / month (≈ 84k answers) ≈ $1,820
    note vs. ≈ $9,400 / mo to add one tier-1 support engineer

Token costs use Anthropic's public May-2026 pricing — Sonnet 4.6 at $3 / 1M input + $15 / 1M output; Haiku 4.5 at $1 / 1M input + $5 / 1M output. Infra costs are AWS US-east-2 list price. Volume of 84k answers/month is steady-state after rollout, of which 28% terminate at Haiku without ever invoking Sonnet — the routing line is what makes the math reconcile.
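
As a sanity check on the per-answer figure, the table's rounded line items reconcile like this (a sketch, not billing code; the constants are just the numbers quoted above):

// cost/reconcile.ts (back-of-envelope reconciliation of the cost table above;
// rounded per-line numbers, so the cents drift slightly against the table's totals)

const answersPerMonth = 84_000;
const sonnetShare = 0.72;                  // 28% of queries terminate at Haiku

const perAnswerModelCost =
  0.0007 +                                 // Haiku routing, every query
  sonnetShare * (0.0096 + 0.0054) +        // Sonnet input + output, routed queries only
  0.00036;                                 // voyage-3-large query embedding
// ≈ $0.0119 of model spend per answer

const monthlyInfra = 284 + 378 + 96 + 67;  // RDS + reranker + Workers + Langfuse (Algolia already paid)
const perAnswerAllIn = perAnswerModelCost + monthlyInfra / answersPerMonth;

console.log(perAnswerAllIn.toFixed(4));    // ≈ 0.0217 → ≈ $1,820 / month at 84k answers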

eval composition

What's in the frozen 1,260-query set

  1. category Docs-recoverable golds
    items 800
    what it checks labelled answer + correct anchor_id on real (de-identified) past tickets
    ci-gate threshold ≥ 0.90 recall@5 + groundedness
  2. category Not-docs-recoverable
    items 220
    what it checks agent must refuse · no answer surface · ticket draft only
    ci-gate threshold ≥ 0.95 refusal recall
  3. category Adversarial (synonyms, jailbreaks, ambiguity)
    items 140
    what it checks synonym traps, prompt-injection attempts, intentionally ambiguous queries
    ci-gate threshold 100% refusal on listed must-refuse · 0 jailbreaks
  4. category Routing-only (Haiku terminates)
    items 100
    what it checks greetings, product-list lookups, FAQ; Sonnet should never fire
    ci-gate threshold ≥ 0.95 router accuracy

Eval set is frozen — items only added, never edited. Docs lead signs off any addition. CI fails the release if any category drops more than 1 point from the prior cut; the release engineer can override with a signed CHANGELOG entry.
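
A sketch of what that gate reduces to in CI; the category names and the one-point threshold come from the table above, everything else (function name, shapes) is illustrative.

// eval/ci-gate.ts (illustrative release gate over the frozen eval categories)

interface CategoryCut { category: string; score: number; }  // score in [0, 1]

export function releaseBlocked(
  prior: CategoryCut[],
  current: CategoryCut[],
  maxDropPoints = 1,        // fail if any category drops > 1 point vs the prior cut
): string[] {
  const prev = new Map(prior.map((c) => [c.category, c.score]));
  return current
    .filter((c) => {
      const before = prev.get(c.category);
      return before !== undefined && (before - c.score) * 100 > maxDropPoints;
    })
    .map((c) => c.category);  // non-empty → CI fails; override needs a signed CHANGELOG entry
}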

Production ops cadence is part of the build, not an afterthought. The docs lead and our on-call engineer hold a weekly review meeting where every answer the user thumbs-down or the groundedness watchdog flagged sub-0.7 gets opened. Drift that looks systematic (more than three of the same pattern in a week) becomes a JIRA ticket against the eval set and a candidate prompt or retrieval tweak. Langfuse trace retention is 90 days hot plus seven years cold. Sentry catches any in-product render error on the citation chips — those are mostly editorial drift (a docs page was renamed but its old anchor_id is still in the index) and get caught within the next PR-merge cycle. Our on-call rotation runs two engineers a week against a 99.5% pipeline-availability SLO and the 95th-percentile-under-2.5s first-token SLO. Few teams shipping docs agents publish any of this. That's the bar.

7 weeks · honest version

The timeline
including the week we almost cut.

Five stages, milestone-billed. The week-5 shadow run caught a synonym-hallucination case — the model confidently returned the wrong product's answer because the cited anchor_id was real but from a different product surface. We halted promotion, tightened the validator, re-ran the groundedness eval, and only then cut over. The honest version of `7 weeks` includes the days we sat on our hands.

  1. Week 1

    Discovery + eval set

    One week with the support lead, the docs lead, and an SRE who owns the help-center. We sampled 1,200 closed tickets from the last 90 days and labelled each one `docs-recoverable` or `not-docs-recoverable` with the docs team. That sample produced the frozen 1,260-item eval set: 800 docs-recoverable queries with their correct doc-anchor target, 220 not-docs-recoverable queries (the agent must refuse), 140 adversarial queries (synonym traps, jailbreaks, intentional ambiguity), and 100 routing-only queries Haiku should handle without Sonnet at all.

    Frozen 1,260-item eval set + acuity-shaped scoring rubric
  2. Weeks 2–3

    Corpus + dual-index build

    Ingested 12,400 MDX pages into pgvector 0.7 with anchor-preserving chunking — 512 tokens per chunk, 80-token overlap, sentence-anchored, never split mid-code-fence. Each chunk minted an anchor_id of shape doc_[slug]#[anchor-slug], matching the regex the answer schema enforces downstream. Algolia indexed the same corpus on the lexical side, reusing the synonym table the docs team already maintains. RRF fusion at k=60 tuned on a held-out eval slice.

    Hybrid retrieval at 0.93 recall@5 on the eval set
  3. Week 4

    Routing + cited-answer agent

    Claude Haiku 4.5 routes intent: greeting / product-list / docs-question / out-of-scope. Docs-questions go to Claude Sonnet 4.6 with `response_format: json_schema` set to the DocsAnswer shape. The schema is the contract — every claim must cite an anchor_id, confidence is bounded 0–1, refusal is an explicit boolean. Cloudflare Workers wraps the whole pipeline for edge streaming + audit shipping into Langfuse.

    End-to-end answer pipeline behind a beta flag
  4. Week 5

    Shadow run — synonym hallucination caught

    Two-week shadow run against the existing answer-bot. Day 3 the support lead flagged a case: a user had asked about `API key`, the model retrieved chunks about `secret tokens` (the docs use both terms in different products), and confidently returned an answer pointing at the wrong product's auth flow. The grounded-citation regex had passed because the model picked a real anchor_id — it just wasn't the right one. We halted promotion, tightened the validator to require that the cited chunk's pathway-id match the routed product context, ran the groundedness eval LLM-as-judge against the full shadow slice, and only then promoted.

    Groundedness eval lifted 0.88 → 0.95 · false-anchor rate cut 6.2% → 0.9%
    Walk-away point
  5. Weeks 6–7

    Cutover + groundedness watchdog

    Promoted to the help-center search surface with the old keyword search retained in active-standby for 30 days. 5% of live answers sampled by an LLM-as-judge groundedness watchdog, with a manual review queue surfacing every disagreement to the docs team. The watchdog is also wired to Sentry — any answer flagged below 0.7 groundedness pages the on-call. The old answer-bot stays available behind a `legacy` query flag for the support team for 60 more days while diffs are reviewed.

    Production cutover · groundedness watchdog + 5% live sample
eval results · 1,260 frozen queries

How we know
it works.

The eval set is frozen. Every model change, prompt change, retrieval change, and policy change re-runs the full 1,260. Nothing ships if any metric red-lights against its target. Numbers below are from the current production cut and the frozen eval slice; live shadow-traffic numbers are within ±2% across all rows over the last 30 days.

metric · baseline (old bot) · v1 (wk 4) · v2 (wk 5 post-fix) · current (live) · target
Recall@5 on docs-recoverable queries · 0.73 (keyword) · 0.88 · 0.91 · 0.93 · ≥ 0.90
Answer groundedness (LLM-judge) · 0.82 (old bot) · 0.88 · 0.94 · 0.95 · ≥ 0.93
False-anchor rate (cited wrong product) · n/a · 6.2% · 1.1% · 0.9% · ≤ 1.5%
Refusal rate (fail-closed below 0.5 conf) · n/a · 9.8% · 13.4% · 11.6% · 10–14%
Routing accuracy (Haiku vs Sonnet) · n/a · 0.94 · 0.96 · 0.97 · ≥ 0.95
P95 first-token latency (streamed) · n/a · 2.6s · 2.3s · 2.1s · ≤ 2.5s
(n/a: no comparable measurement exists for the old bot.)

Sample size for the headline deflection number (≈ 64% docs-recoverable tickets resolved at confidence ≥ 0.8) is n=3,400 user sessions across the 30-day shadow window; the figure is reported with a 95% confidence interval, not as a bare point estimate. Recall@5 baseline is the prior keyword-search bot on the same eval slice. False-anchor rate is the share of answers where every cited anchor_id was real but at least one belonged to a different product surface than the routed intent — the kill-point failure mode that drove the week-5 halt. Refusal rate is by-design; it is the share of queries where the agent decided it could not answer.

Ready to ship

Want a case study like this
for your stack?

Book a $3K fixed-fee audit. We'll review the docs corpus, scope the eval set, recommend a model + retrieval recipe, project token + run-cost, and tell you honestly whether it's case-study-shaped. About one audit in five ends with `you don't need this — buy the platform, here's the SOW for integration.`

Read the Claude pillar
30 min, async or live · Eval-first scoping · Walk-away point in the pilot