ai chatbot development · live

AI chatbot development.
Production conversational AI, model-agnostic.

AI chatbot development services for customer service, support, and ecommerce. We ship RAG-grounded chatbots on Claude Sonnet 4.6, Haiku 4.5, and GPT-4o-mini — to your web widget, WhatsApp, voice, or Slack/Teams. Eval-gated, guardrailed, token-optimized. First chatbot live in 30 days, behind a feature flag.

See the anatomy
turn · live · one chatbot turn
  1. 01 Classify · Intent
  2. 02 Retrieve · RAG
  3. 03 Tool · Function
  4. 04 Generate · Reply
  5. 05 Guardrail · Policy gate
  6. 06 Log · Eval
30 days
first production chatbot live, behind a flag
RAG-first
every chatbot grounded in your real data
$3K
audit-to-roadmap before any chatbot build starts
Model-agnostic
Claude + GPT, picked per turn — not per contract
ai chatbot services · what we build

Six chatbot patterns we ship
for real revenue + ops teams.

Every customer service chatbot, ecommerce chatbot, and internal Slack chatbot below has been shipped from this exact playbook. Each one comes with an eval suite, audit logging, confidence gating, and a per-turn cost target — not a polished demo.

Customer service chatbots — tier-1 deflection

The crown-jewel use case. RAG-grounded chatbot over your help center + ticket history that handles password resets, order status, refund eligibility, and policy lookups. Confidence gate at 0.7; anything below escalates to a human with the AI's draft attached. We've shipped these to 30–45% tier-1 deflection on Zendesk and Intercom queues.
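The confidence gate described above reduces to a few lines. A minimal sketch, not the production implementation; `Draft`, `route_reply`, and the 0.7 threshold are illustrative names, and in practice the threshold is tuned against your eval set:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # set in the audit; re-tuned against the eval set


@dataclass
class Draft:
    text: str
    confidence: float  # self-rated by the model, 0.0 to 1.0


def route_reply(draft: Draft) -> dict:
    """Send confident replies; escalate the rest with the AI draft attached."""
    if draft.confidence >= CONFIDENCE_THRESHOLD:
        return {"action": "send", "text": draft.text}
    # Below threshold: the human agent sees the draft, nothing auto-sends.
    return {"action": "escalate", "draft_for_agent": draft.text}
```

The point of attaching the draft is that escalation still saves agent time: the human edits instead of composing from scratch.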

Ecommerce chatbots — product Q+A and order ops

Conversational AI grounded in your Shopify / WooCommerce catalog. Product recommendations from natural-language criteria, order status ("where's my package?"), return initiation, size and fit Q+A. Function calls into your OMS + 3PL APIs so the bot can act, not just describe.

Internal Slack / Teams chatbots — knowledge agents

Chatbots inside Slack or Microsoft Teams that retrieve from Notion, Confluence, Drive, or your code repo. Onboarding Q+A, policy lookups, on-call runbooks. Built on Sonnet 4.6 + tool use for the 6+ step internal workflows where a one-shot answer isn't enough.

WhatsApp + voice chatbots — outside the website

Where the user actually is. WhatsApp Business chatbots (Meta Cloud API) for international support and lead capture. Voice chatbots on Twilio / Vapi over the OpenAI Realtime API or Deepgram + Sonnet 4.6 for sub-second voice. Latency profiles differ — we'll tell you which channel makes sense before we build.

RAG chatbots — over your private corpus

When the answer lives in a 10,000-document Drive folder, a contract library, or a research archive. Retrieval-augmented chatbot with pgvector or Pinecone, top-k 5 retrieval re-ranked by bge-reranker-v2, eval-tested on your real question set before launch. We measure groundedness, not just BLEU.
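The two-stage retrieval (vector top-k, then re-rank and keep the best) can be sketched in plain Python. Assumptions: a toy cosine similarity stands in for the pgvector / Pinecone index, and a precomputed `rerank_score` stands in for a cross-encoder call such as bge-reranker-v2:

```python
import math


def cosine(a, b):
    """Toy cosine similarity; the vector DB does this stage in production."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query_vec, corpus, k=5, keep=2):
    """Stage 1: vector top-k. Stage 2: re-rank the candidates, keep the best."""
    candidates = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)[:k]
    return sorted(candidates, key=lambda d: d["rerank_score"], reverse=True)[:keep]
```

The re-ranker often promotes a document the raw vector score under-ranked, which is exactly why the second stage earns its latency.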

Lead-capture + qualification chatbots

Website chatbot that runs a structured intake (BANT, MEDDIC, or your custom rubric), drops qualified leads into HubSpot / Salesforce with a transcript, and books a meeting via Cal.com or Chili Piper. Less marketing-deck, more pipeline — built for revenue teams that want booked calls that actually show, not vanity completion rates.
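The structured intake above is, at its core, slot-filling plus a qualification rule. A minimal sketch with hypothetical names (`BANT_SLOTS`, `crm_payload`); in production the model drives the conversation and the payload goes through the HubSpot or Salesforce API:

```python
BANT_SLOTS = ("budget", "authority", "need", "timeline")


def next_question(answers: dict):
    """Return the first unfilled slot, or None once intake is complete."""
    for slot in BANT_SLOTS:
        if not answers.get(slot):
            return slot
    return None


def crm_payload(answers: dict, transcript: list) -> dict:
    """Lead record shipped to the CRM, transcript attached."""
    return {
        "qualified": next_question(answers) is None,
        "fields": dict(answers),
        "transcript": "\n".join(transcript),
    }
```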

chatbot anatomy

What actually happens
in a single chatbot turn.

Six stages every production chatbot turn moves through — from user message to logged outcome. Skip any one and you get the demo competitors ship instead of a chatbot that deflects tier-1 traffic. Each stage carries its own latency budget, model pick, and failure mode. Hover any stage for the operator detail.

  1. 01 Classify · Intent + routing · Haiku 4.5 or GPT-4o-mini · ~120ms · 60-token system prompt · $0.0001 / turn
  2. 02 Retrieve · RAG over your data · pgvector / Pinecone · top-k 5 · re-ranked with bge-reranker · ~800 in tokens
  3. 03 Tool · Function execution · Zendesk · Stripe · Shopify · your API · timeout 4s · 0–N tool calls
  4. 04 Generate · Compose the reply · Sonnet 4.6 with retrieved context + tool results · streamed · ~600 out tokens
  5. 05 Guardrail · Safety + policy gate · PII scrub · refusal rules · confidence threshold 0.7 · fail-closed
  6. 06 Log · Trace + eval · Langfuse / Helicone · prompt · tokens · latency · verdict · evaled nightly

Latencies and token counts are typical production traces from shipped chatbots. Your eval set sets the real budgets.
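Stage 05 is worth seeing concretely. A hedged sketch of a fail-closed guardrail pass; the PII pattern and refusal terms are illustrative, and real rule sets are defined per client policy:

```python
import re

PII_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # e.g. US SSN; extend per policy
REFUSAL_TERMS = ("account number",)  # illustrative "never confirm" rule


def guardrail(reply: str, confidence: float, threshold: float = 0.7) -> dict:
    """Policy gate after generation. Any failure, including an internal
    error, routes to escalation: fail-closed, never fail-open."""
    try:
        for pat in PII_PATTERNS:
            reply = pat.sub("[redacted]", reply)  # scrub, don't just flag
        if any(term in reply.lower() for term in REFUSAL_TERMS):
            return {"action": "escalate", "reason": "policy"}
        if confidence < threshold:
            return {"action": "escalate", "reason": "low_confidence"}
        return {"action": "send", "reply": reply}
    except Exception:
        return {"action": "escalate", "reason": "guardrail_error"}  # fail closed
```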

watch the bot work

What 'production chatbot'
actually looks like, live.

A real customer-service replay — left side, a chat surface streaming. Right side, the same six stages from the anatomy lighting up as each turn fires. This is what 'shipped' looks like, not what the demo deck shows.

01
Classify
Intent + routing
02
Retrieve
RAG over your data
03
Tool
Function execution
04
Generate
Compose the reply
05
Guardrail
Safety + policy gate
06
Log
Trace + eval


  1. Customer: Where's my order #4421?
  2. Bot: Looking it up — one sec.
  3. Bot: Order #4421 shipped yesterday via FedEx — tracking 7898… delivery estimated Thursday. Anything else?
  4. Customer: Can I change the address?
  5. Bot: I can change it if it's still pre-transit. Confirming with FedEx now…
  6. Bot: Address change locked in. New address: 22 Spring St, NYC 10012. Confirm?
  7. Bot: Logged to Langfuse — trace ID lf_7c2f. We're done!
deployment channels

Eight channels we ship,
with the failure modes named.

Most chatbot vendors will quietly say yes to any channel. We won't. Pick a channel to see the deployment surface, latency profile, the actual stack, and — the part competitors hide — where it fails. Channel mix is decided in the audit, not in the sales call: web widget, WhatsApp, voice, Slack + Teams, Discord, Telegram, Instagram + Messenger, and SMS + iMessage.

Deployment

Embedded floating widget on your marketing site, in-product help center, or post-login dashboard. We ship a Preact-based widget that loads under 35KB gzip and streams responses token-by-token. Same widget surface across desktop and mobile.

Latency profile

<800ms first token · streamed · p50 1.4s end-to-end

Stack we ship
Preact widget · SSE streaming · pgvector · Sonnet 4.6 · Cloudflare Workers
Where this fails

Web-widget sessions in B2C bounce at ~70%. If your customers aren't already on your website (e.g. they're in WhatsApp or your mobile app), the web widget is the wrong channel. Pick the channel your buyers already live in.

automation graph · live

And when the answer
needs to do something.

The chatbot reply is half the story. When a turn fires a tool, an automation graph kicks in — classify, lookup, decide, act. Here's a real WhatsApp refund flow playing out node-by-node, with a branching decision that escalates to a human when policy says it should.


Built on n8n, LangGraph, or custom — depending on your stack. Cost chips are illustrative per-decision economics.

  1. WhatsApp inbound — Customer message received
  2. Classify intent — Haiku 4.5 → refund_request · $0.0002
  3. Lookup order — Admin API · #4421 · $0.0008
  4. Order < 30 days? — Refund-window check
  5. Issue refund — refunds.create · $48.99 · $0.0011 (branch A)
  6. Send confirmation — WhatsApp template (branch A)
  7. Escalate to human — Slack #cx-escalations · $0.0006 (branch B)
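Once the LLM has classified the intent and the API has returned the order, the branching decision in the graph above is ordinary code. A sketch with illustrative field names, assuming the refund-window rule from the flow:

```python
from datetime import date, timedelta


def refund_flow(order: dict, today: date) -> dict:
    """Branch on the 30-day refund window: issue the refund (branch A)
    or escalate to the human channel (branch B). Field names illustrative."""
    if (today - order["placed"]) <= timedelta(days=30):
        return {"node": "issue_refund", "amount": order["total"],
                "then": "send_confirmation"}            # branch A
    return {"node": "escalate_to_human",
            "channel": "#cx-escalations"}               # branch B
```

Keeping the policy branch in deterministic code, not in the prompt, is what makes the escalation rule auditable.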
ai agent vs chatbot

When you need a chatbot,
and when you need an agent.

The naming has drifted — every vendor calls everything an “AI agent” now. The honest distinction: chatbots are scoped and short-turn; agents are multi-step and long-horizon. Most teams asking for an agent need a chatbot first. Per-dimension honest comparison below.

Dimension
You're here: Chatbot · single-turn or short-turn · grounded · scoped
AI agent · multi-step · planning · long-horizon
Turn structure How the system handles a user request.
Chatbot User asks → 1 tool call max → reply. Predictable latency.
AI agent Multi-step plan → tool → observe → re-plan. Variable latency.
Best for Where each system shines.
Chatbot Customer service · support · FAQ · lead qualification
AI agent Research · ops automation · multi-system orchestration
Latency budget What the user is willing to wait.
Chatbot Sub-2s. Users abandon at 3s on chat.
AI agent 10s–10min acceptable if the result is high-value.
Failure mode How each tends to go wrong.
Chatbot Hallucinated answer when retrieval misses. Guardrail catches most.
AI agent Tool-call drift on long traces. Needs eval + retry policy.
Cost per turn Typical production economics.
Chatbot $0.001–$0.01 per turn at scale (routed + cached)
AI agent $0.05–$2 per task (multi-step, multi-model)
Build complexity Engineering effort to ship.
Chatbot 4–6 weeks for a production chatbot with RAG + eval
AI agent 8–12 weeks for a production agent with stable tool use

Generalizations from shipped client work. Specifics vary per workload; we benchmark on your eval before recommending.

model stack we ship

The three models behind a chatbot,
picked per stage, not per vendor.

A production chatbot is not one model — it's a routed pipeline. Cheap classify in front, grounded generate in the middle, cheap log + eval at the back. Here's the default chatbot stack we ship; we'll re-pick per workflow if your eval data demands it.

chatbot token economics

How we cut a chatbot bill
without making it dumber.

Five tactics stacked, in order of impact for chatbots. Most chatbot pilots see effective per-turn cost drop to 6–10% of the naive baseline at the same eval-suite quality. This optimization pass is included in every chatbot pilot, post-cutover.

01 Raw Send every turn to Sonnet 4.6 with full context, no caching.
100%
02 Route Haiku 4.5 / GPT-4o-mini for intent classify; Sonnet 4.6 only for generate.
38%
03 Cache Anthropic prompt caching on system prompt + tool definitions (5-min TTL).
14%
04 RAG trim Re-rank top-k 5 docs, drop the bottom three before the generate call.
9%
05 Summarize Compress old turns into 200-token gists once conversation > 8 turns.
6%
Naive baseline 100% of the bill
What we ship 6% same eval quality
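Tactic 05 on the ladder is mechanical enough to sketch. A minimal version; `summarize` here is a placeholder for the cheap-model call that would produce the 200-token gist:

```python
def compact_history(turns, max_turns=8, summarize=None):
    """Once a conversation exceeds max_turns, fold everything but the
    newest turns into a single gist message. In production, `summarize`
    calls a cheap model (e.g. the classifier tier) to write the gist."""
    if summarize is None:
        summarize = lambda old: {"role": "user",
                                 "content": f"[summary of {len(old)} earlier turns]"}
    if len(turns) <= max_turns:
        return list(turns)
    old, recent = turns[:-max_turns], turns[-max_turns:]
    return [summarize(old)] + list(recent)
```

Input tokens dominate chatbot bills on long conversations, so capping the replayed history is often the single largest lever after routing.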
chatbot build playbook

How we ship a production chatbot
in 4–6 weeks, flagged + evaled.

Four stages, milestone-billed, with a walk-away point at the retrieval baseline. Most chatbot failures happen because the team skipped the eval set or skipped retrieval tuning — both are in week 1 and week 2 here, not bolted on at the end.

  1. Week 1

    Eval set + scope

    We harvest 50–200 real questions from your ticket archive (or run a structured user interview if you're greenfield) and build the eval set the chatbot will be measured against. Scope locked: channels, knowledge sources, tool surface, escalation rules.

    Eval set + scope doc + channel pick
  2. Week 2

    RAG corpus + retrieval tuning

    Ingest your docs into pgvector or Pinecone, run chunking experiments (semantic vs fixed-size, header-aware vs not), tune top-k and re-ranker, and score retrieval against the eval set. Most chatbot quality issues are retrieval issues, fixed here.

    Retrieval precision + recall baseline
    Walk-away point
  3. Weeks 3–4

    Build + guardrail + flag

    Wire the full anatomy: classifier → retrieval → tool use → generate → guardrail → log. Behind a feature flag, in your repo (or ours, your call). PII scrub, refusal rules, confidence gate, audit-log every turn. Channel-specific UI shipped in parallel.

    Production chatbot live behind a flag
  4. Weeks 5–6

    Eval + rollout + token pass

    Shadow mode against your existing channel for 2 weeks. Score on the eval set, score on real traffic, score on cost. Roll out at 10% → 50% → 100% if numbers hold. Run the token-optimization pass — most chatbots see 60–85% cost reduction at the same quality.

    Full rollout + monthly cost target
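The week-2 retrieval baseline comes down to precision and recall per eval question, averaged over the set. A minimal scoring sketch; chunk IDs and a hand-labeled relevant set are assumptions about your eval format:

```python
def retrieval_scores(retrieved, relevant):
    """Precision and recall for one eval question: retrieved chunk IDs
    against the labeled relevant set. Average these across the eval set."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant) if relevant else 0.0,
    }
```

Chunking experiments in week 2 are judged on exactly these two numbers: a chunking change that lifts recall without cratering precision ships.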
rag chatbot · production turn

The full anatomy in code.
Three models, one reply line.

The same chatbot turn — classify → retrieve → tool → generate → guardrail → log — across Sonnet 4.6, Haiku 4.5, and GPT-4o-mini. Pick a model on the left; the model= line swaps and the per-turn cost stat updates. This is how we choose: run your eval, then look at the bill.

78 lines of code
$0.003 per turn · Sonnet
1.4s p50 latency
chatbot/turn.py Python
from anthropic import Anthropic

# classify_intent, retrieve, guardrail, langfuse, trace_id, and the tool
# schemas live elsewhere in this repo; names changed for publication.
client = Anthropic()

def chat_turn(user_msg: str, history: list[dict]) -> dict:
    # 1. Intent classify with Haiku 4.5 (~$0.0001 / turn)
    intent = classify_intent(user_msg)

    # 2. RAG retrieve from pgvector (top-k 5, re-ranked)
    docs = retrieve(query=user_msg, k=5, rerank=True)

    # 3. Tool-aware generate — switch the reply model:
    response = client.messages.create(
        model="claude-sonnet-4.6",
        max_tokens=600,
        system=SYSTEM_PROMPT + format_docs(docs),
        tools=[zendesk_create_ticket, order_status_lookup],
        messages=history + [{"role": "user", "content": user_msg}],
    )

    # 4. Guardrail: confidence + PII + policy
    verdict = guardrail.check(response, intent=intent)
    if verdict.action == "escalate":
        return handoff_to_human(response, verdict)

    # 5. Log to Langfuse for nightly eval
    langfuse.log(trace_id, response, verdict, tokens=response.usage)
    return response
Real production workflow with the names changed. Lives in your repo.
engagement models

Three ways to start.
Audit, pilot, or continuous.

Same pricing as our other engagements. Most clients begin with the audit to scope channels and retrieval, run a 4–6 week pilot on the highest-ROI channel, then move to monthly for the next 2–3 channels.

1–2 weeks

Chatbot audit

Find the chatbot workflow worth shipping before you commit a budget.

$3K fixed
  • Existing chatbot review (if any) — usage, drop-off, escalation rate
  • Per-channel recommendation (web · WhatsApp · voice · Slack/Teams)
  • Model + RAG architecture pick with token-cost projection
  • Eval-set design: 50–200 questions from your ticket archive
  • 90-day chatbot roadmap with named workflows
Most teams start here
4–6 weeks

Chatbot pilot

One chatbot shipped end-to-end on your highest-ROI channel — with eval data, not a demo.

$10–25K fixed price
  • Eval set + RAG corpus tuning against your real questions
  • Production build: classify → retrieve → tool → generate → guardrail → log
  • Deployment to your chosen channel (web · WhatsApp · voice · Slack/Teams)
  • Shadow-mode metrics vs your baseline (human agent or legacy bot)
  • Token-optimization pass post-cutover (routing + caching + RAG trim)
  • Walk-away point — if deflection won't move, no phase 2
Monthly

Continuous chatbot team

Embedded squad shipping the next chatbot channel + tuning the live one.

from $5K per month
  • PM + chatbot engineer + ops analyst, embedded
  • Monthly cost-of-ownership + deflection report
  • Eval drift, retrieval precision, refusal-rate monitoring
  • New channel rollouts on cadence (WhatsApp, voice, Teams)
  • Cancel any month — no annual contract
Talk to us
Your repo, your data · Claude + OpenAI + open-source · RAG-first, eval-gated · Model-agnostic, openly
capability patterns

Chatbots we've shipped.
Same anatomy, different channels.

Three anonymized chatbot capability patterns drawn from real engagements. Named references shared under NDA once we know what you're building.

B2B SaaS · Support Pattern

Tier-1 customer service chatbot

Problem

Inbound Zendesk queue averaging 6-hour first-response time; tier-1 reps spending 60% of time on password resets, billing questions, and feature-availability lookups.

Approach

Web-widget chatbot with RAG over the help center + ticket history. Haiku 4.5 classifier, Sonnet 4.6 generate, function calls into Zendesk for ticket creation. Confidence gate at 0.7; sub-threshold escalates with a drafted reply attached for the agent. Voice-channel sibling: see the published OpenAI Realtime voice-agent case study for the same deflection pattern on inbound voice at $0.10/call.

Claude Sonnet 4.6 · Haiku 4.5 · pgvector · Zendesk API · Langfuse
Outcome
42% tier-1 deflection at 8 weeks
Read the full case study
Ecommerce · D2C Pattern

WhatsApp ecommerce chatbot

Problem

International D2C brand getting product Q+A and 'where's my order' inquiries via WhatsApp from 14 countries — manual reply queue 18 hours long at peak.

Approach

WhatsApp Cloud API chatbot grounded in Shopify catalog + 3PL tracking data. Multilingual via Sonnet 4.6's native multilingual support; function calls into Shopify Admin API + Aftership for live order status. Refund initiation gated to human review.

WhatsApp Cloud · Sonnet 4.6 · Shopify Admin · Aftership · Pinecone
Outcome
73% Q+A handled without human handoff
Internal · DevOps Pattern

Slack on-call triage chatbot

Problem

On-call rotation drowning in repeated questions about runbook locations, alert ownership, and dashboard URLs — same 12 questions answered nightly by senior engineers.

Approach

Slack chatbot with RAG over Notion runbooks + the team's PagerDuty service catalog. Tool calls into PagerDuty for on-call lookup and Grafana for dashboard linking. Escalates to senior on-call if confidence drops or alert is sev-1.

Slack Bolt · Sonnet 4.6 · PagerDuty API · Notion API · Helicone
Outcome
5 hrs saved per on-call shift per engineer
frequently asked

Questions chatbot buyers ask most.
Real answers, no hedging.

What does AI chatbot development cost in 2026?

Three engagement tiers. A 1–2 week chatbot audit is $3,000 — discovery, channel recommendation, RAG architecture, model pick, eval-set design, and a 90-day roadmap. A pilot is $10,000–$25,000 fixed price, 4–6 weeks — one chatbot shipped end-to-end on your chosen channel with eval, monitoring, and a token-optimization pass. A continuous chatbot team is from $5,000 per month — embedded engineer + PM + ops analyst, shipping new channels and tuning the live one. Run-cost (model calls + vector DB + monitoring) typically lands at $200–$2,000 per chatbot per month depending on volume and channel mix.

What's the difference between a chatbot and an AI agent?

A chatbot is scoped, single-turn or short-turn, and grounded — user asks, system retrieves, maybe makes one tool call, replies. Latency budget is sub-2s. A chatbot answers customer service questions or qualifies a lead. An <a href="/services/ai-agent-development/">AI agent</a> is multi-step and long-horizon — plans, calls multiple tools, observes results, re-plans, eventually completes a task. Latency budget is 10s–10min. An agent files a refund across three systems, researches a prospect, or runs a deployment. Most teams asking for an "AI agent" actually need a chatbot first; we'll tell you which during the audit. Cost per interaction differs by ~50×.

Should we build a customer service chatbot on Claude or GPT?

Both are production-ready for customer service chatbots. <a href="/services/claude-development/">Claude Sonnet 4.6</a> wins on long-context RAG, multilingual support without separate language models, and tool-use stability when the chatbot has 6+ functions to choose from — these are the dimensions that matter most for support. <a href="/services/openai-development/">GPT-4o-mini</a> wins as the cheap classifier in front (intent + routing) and as the voice-channel reply model via the OpenAI Realtime API. Our default chatbot stack is Haiku 4.5 or GPT-4o-mini for intent classify, Sonnet 4.6 for the grounded reply — we're model-agnostic and we'll show you the eval-set numbers before recommending.

How long does it take to ship a production chatbot?

Most pilots ship in 4–6 weeks after a 1–2 week audit. Realistic distribution: simple chatbots (single channel, single-language, narrow scope like password reset + billing FAQ) in 3–4 weeks. Mid-complexity (RAG over a 1,000-doc knowledge base, 3–5 tool calls, web + WhatsApp) in 4–6 weeks. Complex (regulated industry with PII handling, voice channel, multilingual across 5+ languages, 10+ tools) in 8–10 weeks. The audit phase tells us which bucket you're in before any pilot contract. We don't quote a 30-day chatbot for work that takes 90 days.

What is a RAG chatbot and do we need one?

A RAG (retrieval-augmented generation) chatbot grounds its replies in your actual data instead of relying on the model's general knowledge. The flow: user asks → system retrieves the top 3–5 most relevant chunks from your knowledge base (pgvector / Pinecone) → those chunks plus the user message go to the reply model (Sonnet 4.6) → the model composes an answer cited to those chunks. You almost certainly need one. The only chatbots that don't are pure-personality bots ("chat with a brand mascot") or chatbots over data the model was trained on (general programming Q+A). Every customer service, support, ecommerce, and internal-knowledge chatbot is a RAG chatbot. Most chatbot quality issues are retrieval issues, not generation issues — which is why we tune retrieval before tuning prompts.

Can you deploy a chatbot to WhatsApp, voice, or Slack as well as our website?

Yes — multi-channel deployment is standard. WhatsApp via Meta's Cloud API (business verification + template approval, typically 1–3 business days). Voice via Twilio Voice or Vapi over the OpenAI Realtime API (sub-second first-token latency) or a Deepgram + Sonnet 4.6 pipeline. Slack via the Bolt SDK with event subscriptions + slash commands. Microsoft Teams via the Bot Framework SDK with admin scope approval. Same RAG corpus and tool surface across channels; the UI differs (streaming for web, message-edit-streaming for Slack, audio streams for voice). We'll recommend which channels matter during the audit — most teams over-deploy and end up with three channels they don't measure.

Who is the best AI chatbot development company for production work?

Honest answer — there isn't a single best, but the question to ask any AI chatbot development company is: do you ship eval suites, channel-specific honesty notes, and token-cost projections, or do you ship demos? Listicle sites rank chatbot vendors by review count and case-study polish, neither of which predicts whether your chatbot will deflect tier-1 traffic in production. We score ourselves on operator detail — we use Claude Code daily, we run model-agnostic across Claude + OpenAI, and we publish a $3K audit-to-roadmap engagement before any chatbot build kicks off. If your shortlist includes vendors that can't show you their eval methodology in 30 minutes, that's the disqualifying signal. <a href="/services/ai-consulting/">AI consulting</a> + audit is a $3K way to scope what's worth building before you sign a six-figure chatbot agency contract.

How do you keep an AI chatbot from hallucinating or going off-policy?

Four layers, stacked. (1) RAG grounding — the reply model sees retrieved chunks from your real data, and the system prompt instructs it to answer only from those chunks or say "I don't know." (2) Confidence gating — every reply gets a self-rated confidence score; sub-threshold replies escalate to a human with the AI's draft attached, never auto-send. (3) Guardrails layer — separate from generation, a policy-check pass runs PII scrubbing, refusal rules ("never quote a price", "never confirm an account number"), and competitor-mention blocking. Fail-closed by default. (4) Nightly eval — Langfuse / Helicone logs every turn; we run an eval suite against held-out questions nightly and alert on regression. The combination, not any single piece, is what makes a chatbot production-safe. We include this stack in every pilot.
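The four layers compose naturally as an ordered, fail-closed pipeline: the first failing layer wins. A sketch with toy layer implementations mirroring the answer above; the `price` check is just an illustrative refusal term, not a real rule set:

```python
def grounded(turn):
    """Layer 1 proxy: reply must cite retrieved chunks."""
    return None if turn.get("cited_chunks") else "ungrounded"


def confident(turn):
    """Layer 2: confidence gate at 0.7."""
    return None if turn.get("confidence", 0.0) >= 0.7 else "low_confidence"


def policy(turn):
    """Layer 3 proxy: illustrative 'never quote a price' refusal rule."""
    return "policy" if "price" in turn.get("reply", "").lower() else None


LAYERS = [grounded, confident, policy]  # layer 4 (nightly eval) runs offline


def layered_check(turn, layers=LAYERS):
    """Each layer returns None on pass or an escalation reason string."""
    for layer in layers:
        reason = layer(turn)
        if reason:
            return {"action": "escalate", "reason": reason}
    return {"action": "send"}
```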

Ready to ship

Hire an AI chatbot development team
that ships eval data, not demos.

Book a free AI chatbot audit. We'll review your existing chatbot or support queue, recommend channels (web · WhatsApp · voice · Slack/Teams), pick models per stage (Sonnet 4.6 / Haiku 4.5 / GPT-4o-mini), project token cost vs your current spend, and give you a 90-day chatbot roadmap. No deck, no obligation to build.

Read case studies
30 min, async or live · Token-cost projection included · Channel pick + eval-set design
keep exploring

Related pages.
Pick where you are.

Building a chatbot often connects to a sibling AI service. These pages go deeper on the adjacent decisions.

01

Claude Development

Anthropic Claude integration — the default reply model for our chatbots.

Read more
02

OpenAI Development

GPT-4o-mini as classifier · Realtime API for sub-second voice.

Read more
03

AI Agent Development

When you need multi-step planning, not single-turn replies.

Read more
04

AI Integration Services

Plug your chatbot into Salesforce, Zendesk, HubSpot, NetSuite.

Read more
05

AI Consulting

Strategy and roadmap before the chatbot build.

Read more
06

AI Automation Agency

When the workflow is bigger than a single chatbot turn.

Read more
07

Healthcare AI Development Company

HIPAA-grade patient-intake chatbots — BAA, PHI scrub, audit log.

Read more
08

AI in Manufacturing

Shop-floor chatbots — diagnostic Q&A on MES + historian with citation logging, no PLC writeback.

Read more
09

AI for Law Firms

Privilege-aware chatbots — matter intake + KM Q&A on rings 2/4 with citation logging.

Read more
10

AI in Travel

Travel chatbots across pre-trip nudges, in-trip exception handling, and post-trip review handlers.

Read more
11

AI in Education

FERPA-aware chatbots — Socratic tutors, advisor Q&A, and L&D micro-coaches with integrity-zone gating.

Read more
12

AI for HR

HR-services chatbots — policy Q&A, benefits intake, ER triage with hard-escalation rules and ADA accessible alternatives.

Read more
13

AI for Insurance

Insurance AI audit + roadmap — claim-lifecycle state machine, underwriting capacity sankey, and fraud-network mapping before any core-system integration.

Read more
14

AI for Fintech

Fintech AI audit + roadmap — risk-score gauges, payment-rails routing, KYC tier-ladder, model-risk-management before any production inference.

Read more