ai in ecommerce · production

AI in ecommerce,
shipped — not slide-decked.

Agentic commerce that closes carts, AI inventory management that planners trust, and an AI chatbot for ecommerce that doesn't embarrass your brand. We ship retail AI workflows on Shopify, BigCommerce, Magento, and WooCommerce — first agent live in 6–8 weeks, cost-per-loop reported monthly, peak-mode runbook tested before Black Friday.

abandoned-cart agent · live trace ≈ $0.004 / recovered cart
retail agent loop · steps 01–05
  • 01 Signal $0.0000

    Cart abandoned · session > 22 min idle

    Shopify webhook · Klaviyo event
  • 02 Enrich $0.0002

    Pull order history, margin, LTV, last-touch channel

    pgvector RAG · Shopify Admin API
  • 03 Decide $0.0009

    Pick tone, offer band, channel. Confidence-gate at 0.72.

    Claude Haiku 4.5
  • 04 Act $0.0021

    Draft email + SMS; send via marketing platform

    Klaviyo · Twilio
  • 05 Log $0.0008

    Trace every hop, eval against baseline, write to warehouse

    Langfuse · BigQuery
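The five hops above can be sketched as a plain pipeline. This is a minimal sketch, not the shipped agent — the model call, the enrichment lookups, and the sends are stubbed out, and the field names are illustrative. The 0.72 confidence gate and the per-hop costs are the only numbers taken from the trace above.

```python
# Minimal sketch of the five-hop abandoned-cart loop from the trace above.
# Model calls, lookups, and sends are stubbed; names are illustrative.

CONFIDENCE_GATE = 0.72  # below this, fall back to the templated nudge

def run_loop(cart_event):
    trace = []  # per-hop cost log, mirroring the Langfuse spans

    # 01 Signal — webhook fires after the cart idles; nothing to pay for yet
    trace.append(("signal", 0.0))

    # 02 Enrich — order history, margin, LTV, last-touch channel (stubbed)
    context = {"ltv": 420.0, "margin": 0.38, "last_touch": "email"}
    trace.append(("enrich", 0.0002))

    # 03 Decide — small fast model picks tone, offer band, channel (stubbed)
    decision = {"offer_band": "5pct", "tone": "warm", "confidence": 0.81}
    trace.append(("decide", 0.0009))
    if decision["confidence"] < CONFIDENCE_GATE:
        # under-confident: ship the safe templated nudge instead
        decision = {"offer_band": "none", "tone": "template", "confidence": 1.0}

    # 04 Act — draft email + SMS, send via marketing platform (stubbed)
    trace.append(("act", 0.0021))

    # 05 Log — trace every hop, eval against baseline, write to warehouse
    trace.append(("log", 0.0008))

    return decision, round(sum(cost for _, cost in trace), 4)
```

Summing the stubbed per-hop costs reproduces the ≈ $0.004-per-recovered-cart figure in the header.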
6–8 wk · first agent live behind a feature flag
p99 < 2.4 s · decide step under Black Friday load
$200–$1.5K · monthly run-cost band per shipped workflow
$3K · audit-to-roadmap before any ecommerce build starts
the agentic-commerce shift

What changed.
And what's actually different this time.

Retail tech has had three AI cycles in a decade. This one is different because the unit economics finally work on a per-request basis — `agentic commerce` isn't a strategic narrative, it's a $0.001-per-decision plumbing change. Three things to know before you scope your first workflow.

From rule trees to live decisions

Yesterday's recommendation engine was a Lucene query plus a heuristics file. Today's `agentic commerce` stack uses a small fast model on every signal — abandoned cart, returning visitor, low-stock SKU — and decides what to do next at the per-request level, with the rules pushed into the prompt instead of the codebase.

Models picked per hop, not per app

A retail loop has 5+ hops; the right model is different for each. Haiku 4.5 for the cheap, high-volume routing call. Sonnet 4.6 when the next step writes a customer-facing email. GPT-5.4-mini for the structured-output catalog tag. We pick per hop, not per app — same `ai in ecommerce` stack runs all three.
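Per-hop model picks reduce to a small routing table. A minimal sketch under the assumption that the routing layer resolves a model name per hop — the hop names and model identifier strings are illustrative, not a real SDK's:

```python
# Hypothetical per-hop routing table — the point is the granularity,
# not these exact picks; swap the strings for your vendors' model IDs.
MODEL_BY_HOP = {
    "route":         "claude-haiku-4.5",   # cheap, high-volume routing call
    "customer_copy": "claude-sonnet-4.6",  # customer-facing email draft
    "catalog_tag":   "gpt-5.4-mini",       # structured-output catalog tag
}

def pick_model(hop: str) -> str:
    # Unknown hops fall back to the cheap tier, not the quality tier.
    return MODEL_BY_HOP.get(hop, "claude-haiku-4.5")
```

The fallback-to-cheap default matters: a new hop added without a routing entry should fail toward cost, not toward quality-tier spend.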

Bimodal load is the design constraint

Steady-state and Black Friday are different products. The system that costs $48/day in October will cost $14K/day on peak Friday if you don't swap models, turn caching up, and route low-priority writes to batch. That bimodality is what the peak-mode toggle below visualizes — it's the load mode our shipped retail clients hit.

ai in ecommerce, by P&L line

Six AI workflows that move ecommerce P&L.
Ranked in the audit, not the slide deck.

These are the six `ai in ecommerce` patterns that consistently pay back in the audits we run. Not every retailer needs all six — most teams have a high-ROI candidate in three of them. The audit ranks yours so you don't have to guess which to fund first.

Personalized recommendations & AI search

LLM-ranked product feeds, semantic search over your catalog, embedding refresh on new SKUs. Replaces the static recommender block with a per-session decision. Typical lift on assisted-revenue metric: 8–18%, model cost $300–$700/mo on mid-size catalogs.

AI chatbot for ecommerce + concierge agents

`ai chatbot for ecommerce` that actually closes — sees order history, knows return policy, drafts the WISMO reply in your tone, escalates with full context when the customer is heated. Replaces the rule-tree chatbot nobody likes.

AI product description generator at SKU scale

`ai product description generator` workflows that take a structured spec sheet + 3 brand examples and produce a description that won't get flagged for being off-brand. Used most for catalogs with 5K+ SKUs where copywriting is the bottleneck before launch.

Dynamic pricing AI on long-tail SKUs

`dynamic pricing ai` for the 80% of SKUs your category managers don't have time to re-price weekly. Pulls competitor signal, inventory cover, margin floor, and elasticity history; proposes a price; routes through your existing PIM. Margin-floor guardrails are non-negotiable.
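The non-negotiable margin-floor guardrail amounts to a one-line clamp between the model's proposal and the PIM write. A minimal sketch — the floor here is expressed as markup over unit cost for simplicity, and a real guardrail would also log every clamp for the audit trail:

```python
# Margin-floor guardrail sketch: a proposed price never ships below the
# floor, no matter what the model proposes. Floor is markup over unit
# cost for simplicity; values are illustrative.
def apply_margin_floor(proposed: float, unit_cost: float, floor: float = 0.15) -> float:
    min_price = unit_cost * (1 + floor)
    return round(max(proposed, min_price), 2)
```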

AI inventory management & demand forecasting

`ai inventory management` agent that reads your sales-velocity stream, weather signal, marketing calendar, and supplier lead-times, then drafts a buying plan a planner can edit. Replaces the spreadsheet-and-vibes monthly buy meeting on slow movers.

Fraud, returns, and trust-and-safety triage

Vision + text agent reading order patterns + return reason + chat transcript, scoring fraud likelihood, and routing borderline cases to a human with a draft decision. Pays back fastest on high-AOV verticals (electronics, beauty, footwear).

Don't see your retail workflow?

The highest-ROI retail AI workflow on your team is usually one we haven't listed. Bring it to the 2-week audit — we'll rank it against the rest and tell you if it ships.

Tell us yours
the load-mode our shipped retail clients hit

Holiday peak breaks naïve AI.
Bimodal routing is the fix.

A retail AI stack that works in October melts on Black Friday — different traffic shape, different model picks, different fallbacks. The systems we ship run two modes by design, and the toggle below shows what flips when peak hits.

Cost / day: $48 · ≈ 11.5K loops
p99 latency: 1.9 s · decide + act inline
Throughput: 8 rps · well under capacity

Decisions that flip · Steady-state
  • ON

    Sonnet 4.6 on decide step

    Quality tier dominates; latency budget loose.

  • SHIFT

    Prompt-cache hit rate: 41%

    Catalog context cached across users.

  • OFF

    Batch API: off

    Inline ack required for cart-recovery email.

  • OFF

    Fallback policy: pass-through

    No backpressure; primary path always wins.

Decisions that flip · Black Friday peak
  • SHIFT

    Sonnet 4.6 → Haiku 4.5 on decide

    Eval delta < 2 pts; 7× cheaper at this volume.

  • SHIFT

    Prompt-cache hit rate: 78%

    Catalog & policy context warm across the fleet.

  • ON

    Batch API: on for log + warehouse

    Logs tolerate 5-min lag; cuts write cost 60%.

  • ON

    Fallback engaged: rules-based offer

    If model latency > 3 s, ship templated email.
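The flips above can be held as a mode table the routing layer consults, rather than scattered conditionals. A minimal sketch: the model names, the 5-minute batch tolerance, and the 3 s fallback mirror the list above, while the 20× flip threshold is an assumption for illustration.

```python
# Bimodal config sketch: the decisions that flip between steady-state
# and peak, held as data. Keys and thresholds are illustrative.
MODES = {
    "steady": {
        "decide_model": "claude-sonnet-4.6",
        "batch_log_writes": False,        # inline ack required
        "fallback": "pass-through",       # no backpressure
        "latency_fallback_s": None,
    },
    "peak": {
        "decide_model": "claude-haiku-4.5",  # eval delta < 2 pts, ~7x cheaper
        "batch_log_writes": True,            # logs tolerate a 5-min lag
        "fallback": "rules-based-offer",
        "latency_fallback_s": 3.0,           # > 3 s => ship templated email
    },
}

def active_mode(rps: float, baseline_rps: float = 8.0, factor: float = 20.0) -> str:
    # Flip to peak once load exceeds ~20x the steady-state baseline.
    return "peak" if rps >= baseline_rps * factor else "steady"
```

Keeping the flips as data means the peak-mode runbook is a config diff, not a code deploy, when the sale event hits.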

platform integration patterns

Same loop. Four platforms.
Same eval harness across all of them.

The agent loop is platform-agnostic; the integration shape changes per platform. Shopify is API-rich and webhook-friendly; BigCommerce pushes more state off-platform; Magento needs a sidecar; WooCommerce wants you comfortable running your own infra. Pick a platform — the sketch and the stack swap with it.

Integration shape · Shopify
What we automate here
  • Abandoned-cart recovery agent
  • AI product description generator at SKU scale
  • Review-sentiment routing + reply drafts
  • Agentic checkout assist (post-purchase upsell)
  • Shopify Flow ↔ AI decision bridge
Stack we ship
Shopify Admin API · Webhooks · Klaviyo · Claude Haiku 4.5 · Cloudflare Workers · Langfuse
Latency profile

1–2s read · 3–6s agent decide · async webhook write

Where this fails

Admin API rate limits bite hard on bulk re-ingest — schedule against your plan's leaky-bucket, not at peak.
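Shopify's Admin API rate limiting follows a leaky-bucket shape, so a bulk re-ingest worker should throttle itself against the bucket rather than retry on 429s. A minimal sketch — bucket capacity and leak rate vary by API and plan, so the numbers below are placeholders to size against your own limits:

```python
import time

# Leaky-bucket throttle sketch for bulk Admin API re-ingest.
# Capacity and leak rate are placeholders — check your plan's limits.
class LeakyBucket:
    def __init__(self, capacity: int = 40, leak_per_s: float = 2.0):
        self.capacity, self.leak_per_s = capacity, leak_per_s
        self.level, self.last = 0.0, time.monotonic()

    def acquire(self) -> None:
        # Drain the bucket for the time elapsed since the last request.
        now = time.monotonic()
        self.level = max(0.0, self.level - (now - self.last) * self.leak_per_s)
        self.last = now
        if self.level + 1 > self.capacity:
            # Bucket full: wait just long enough for one request's worth
            # of room to leak out, instead of burning a 429.
            time.sleep((self.level + 1 - self.capacity) / self.leak_per_s)
            self.level = self.capacity - 1
        self.level += 1  # this request now occupies one slot
```

Call `acquire()` before each API request in the bulk job; the worker self-paces to the leak rate instead of slamming the bucket at peak.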

  1. 01 Shopify · API-rich · webhook-friendly (detailed above)

  2. 02 BigCommerce · thin app surface · headless-first
    • Agentic checkout-assistant for logged-in buyers
    • Headless storefront semantic search
    • Catalog enrichment + structured tagging
    • Cloudflare Worker agent (state off-platform)
    • Catalyst-shape AI bridges to the front-end
    GraphQL Storefront · REST Admin · Cloudflare Workers · pgvector · Langfuse

    Webhook delivery is occasionally lossy at peak — design idempotent handlers and reconcile from the audit log.

  3. 03 Adobe Commerce / Magento · sidecar service · B2B-strong
    • B2B catalog tagging + taxonomy
    • CSR copilot with order + quote context
    • Sidecar AI service writing through admin queue
    • Order-state webhook reactions
    • Tiered pricing-rule generator (planner-reviewed)
    REST/GraphQL · FastAPI sidecar · pgvector · RabbitMQ · Sonnet 4.6

    Rate limits differ per endpoint and plugin clashes are real — sidecar with its own DB is non-negotiable.

  4. 04 WooCommerce · WP-native · own-your-infra
    • WP-CLI bulk catalog operations
    • Agentic SKU description generator
    • Headless commerce semantic search
    • Review moderation + brand-tone classifier
    • Order-status concierge in the merchandiser UI
    WP REST · WP-CLI · Python worker · GPT-5.4-mini · Langfuse

    Plugin ecosystem can collide with custom hooks — pin versions and run the worker in its own process.
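The lossy-at-peak webhook note on the BigCommerce shape above is why every handler we ship is idempotent: a re-delivered event must not fire the agent twice. A minimal dedupe sketch, with an in-memory set standing in for the Redis or DB store (with a TTL) you'd use in production:

```python
# Idempotent webhook handler sketch: dedupe on the event id so lossy or
# re-delivered webhooks at peak can't double-fire the agent loop.
_seen: set[str] = set()  # swap for Redis/DB with a TTL in production

def handle_webhook(event: dict) -> bool:
    event_id = event["id"]
    if event_id in _seen:
        return False  # duplicate delivery — safely ignored
    _seen.add(event_id)
    # ... enqueue the agent loop for this event here ...
    return True
```

Pair this with a periodic reconcile job that replays the platform's audit log, so events dropped entirely (not just duplicated) are still picked up.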

model picks per ecommerce workflow

The model matrix.
Per hop, not per app.

Same `ai in ecommerce` stack runs four model picks. Haiku 4.5 wins on high-volume routing decisions and is the Black Friday peak swap. Sonnet 4.6 wins where tone or rationale quality matters (chatbot reply, inventory plan a planner has to trust). GPT-5.4-mini is the structured-output specialist for catalog tagging and pricing. GPT-5.4 sits on long-horizon scenario reasoning. Cost-per-decision below is roughly current — verify on your own usage before committing.

Models compared: Claude Haiku 4.5 (Anthropic · cheap, fast) · Claude Sonnet 4.6 (Anthropic · quality tier) · GPT-5.4-mini (OpenAI · structured output) · GPT-5.4 (OpenAI · long reasoning)

Recommendations / AI search · High-volume per-session decision. Latency budget tight, quality bar moderate.
  • Claude Haiku 4.5 — Default · ≈ $0.0008/decision
  • Claude Sonnet 4.6 — Overkill at this volume
  • GPT-5.4-mini — Tied; pick on embeddings stack
  • GPT-5.4 — Cost prohibitive per request

Ecommerce chatbot / WISMO · Customer-facing reply. Tone matters; one bad answer ends up in a screenshot.
  • Claude Haiku 4.5 — Fine for FAQ; tone drifts on edge cases
  • Claude Sonnet 4.6 — Default · ≈ $0.004/reply
  • GPT-5.4-mini — Workable with strict prompt eval
  • GPT-5.4 — Best on multi-turn complaint flows

Product-description generation · Structured spec → branded copy. Brand-tone classifier on the back end.
  • Claude Haiku 4.5 — Tone drift inside 100 SKUs
  • Claude Sonnet 4.6 — Strongest brand-tone retention
  • GPT-5.4-mini — Best structured-output adherence
  • GPT-5.4 — Wins on long-form but overpriced

Dynamic pricing decision · Tabular reasoning over competitor + margin + cover. Audit log mandatory.
  • Claude Haiku 4.5 — Default · with margin-floor guardrail
  • Claude Sonnet 4.6 — Slower; reserve for tier-1 SKUs
  • GPT-5.4-mini — Tied; structured output is the win
  • GPT-5.4 — Cost vs. uplift doesn't break even

Inventory / demand-forecast planning · Multi-signal reasoning, weekly cadence, planner reviews output.
  • Claude Haiku 4.5 — Misses cross-signal patterns
  • Claude Sonnet 4.6 — Default · planner-trusted output
  • GPT-5.4-mini — Workable; weaker on rationale
  • GPT-5.4 — Best on long-horizon scenarios

Fraud / returns triage · Vision + text signal, borderline cases routed to human.
  • Claude Haiku 4.5 — Pre-filter at top of funnel
  • Claude Sonnet 4.6 — Default on borderline cases
  • GPT-5.4-mini — Strong text, weaker on image cues
  • GPT-5.4 — For chargeback-defense long-form

Black-Friday peak swap · Which model the routing layer flips to under 20×+ baseline load.
  • Claude Haiku 4.5 — The peak swap · 7× cheaper at scale
  • Claude Sonnet 4.6 — Cost spikes hard at peak
  • GPT-5.4-mini — Alt peak target on OpenAI stacks
  • GPT-5.4 — Reserve for off-peak only

Cost figures are typical per-decision spend with prompt caching warm and standard context sizes. Run your own benchmark before locking a model pick; vendor prices and capabilities shift monthly.

ai in ecommerce — when it's the wrong answer

Three places we'll tell you no.
Honest scoping > pretty slide deck.

Most `ai in ecommerce` pitch decks have an AI answer for every problem. Most production retail teams should refuse three of them. If your team is scoping any of these, we'll say so in the audit — and we won't bill phase 2 to find that out.

Replacing checkout with a chat agent

Conversational checkout has been pitched for a decade and rarely wins on conversion. Your form fields exist because they convert. An `ecommerce chatbot` augmenting checkout (gift-message phrasing, address disambiguation, post-purchase upsell) is fine. Replacing it isn't.

Dynamic-pricing AI on hero SKUs

The 20 SKUs that drive 60% of revenue need a human pricing policy and an exec-signed margin floor. Letting an LLM (or any algorithm) move them autonomously is how brands end up in screenshots on Twitter. Hero SKUs stay manual; long-tail SKUs are where AI pricing pays back.

Vibes-only product descriptions

Generating SKU copy from the product image alone produces output that drifts off-brand within 100 SKUs. The pattern that works is structured-spec-in plus 3 voice examples plus an eval pass with a brand-tone classifier. Without that, you'll quietly contaminate your catalog.
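The eval-pass step above reduces to a threshold gate: score every generated description, ship the ones the brand-tone classifier clears, and queue the rest for a copywriter. A minimal sketch with the classifier stubbed as a scoring callable — the 0.8 threshold is an illustrative number you'd tune against your own labeled examples:

```python
# Gate sketch: only descriptions the brand-tone classifier scores above
# threshold ship; the rest go to the copywriter queue. Scorer is stubbed.
THRESHOLD = 0.8  # illustrative; tune against your own labeled examples

def gate(descriptions, score_tone):
    """descriptions: iterable of (sku, text); score_tone: text -> [0, 1]."""
    ship, copywriter_queue = [], []
    for sku, text in descriptions:
        (ship if score_tone(text) >= THRESHOLD else copywriter_queue).append(sku)
    return ship, copywriter_queue
```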

retail AI we've shipped

Three capability patterns.
Anonymized — your case, in plain English.

Cases below are anonymized capability patterns drawn from real retail engagements. Named references shared under NDA once we know what you're building. Stack shown is the one we shipped; your stack will look similar but not identical.

DTC apparel · 240K SKUs

Pattern: Abandoned-cart recovery agent

Problem

Generic recovery emails ignored; high-AOV carts (>$180) slipping through with the same templated nudge as $24 carts. Recovery dollars flat YoY.

Approach

Multi-step agent (the loop diagrammed in the hero) personalizes outreach using cart contents, customer LTV, return-rate, and product margin. Confidence-gates the offer band, escalates >$300 carts to human review. Klaviyo + Twilio on the send side; Langfuse traces every hop.

Claude Haiku 4.5 · Klaviyo · Shopify · Langfuse
Outcome
+22% recovered revenue on $180+ carts
Read the full case study
Mid-market beauty marketplace

Pattern: AI inventory management on long-tail SKUs

Problem

Planner team re-buying the top 200 SKUs weekly; the long tail (4K+ SKUs) re-bought monthly on vibes. Stock-out rate climbing on shoulder-season items.

Approach

`ai inventory management` agent reads 90-day velocity, supplier lead-time, marketing-calendar overlap, and last-year same-week curve. Drafts a buying plan grouped by supplier; planner accepts, edits, or rejects. Margin-floor guardrails on every line.

Sonnet 4.6 · pgvector · NetSuite · BigQuery
Outcome
−31% stock-out rate on tracked tail SKUs
B2B industrial supply · 18K SKUs

Pattern: Agentic checkout-assistant for trade buyers

Problem

Trade buyers calling sales for re-order quantity, alt-SKU substitution, and bulk-tier pricing. Sales spending 40+ hrs/week on conversations the order data could answer.

Approach

`agentic commerce` assistant inside the logged-in trade portal: sees order history, knows the tiered price list, suggests substitutes when an SKU is out, escalates to sales with a draft quote. Sales reviews and sends.

GPT-5.4-mini · Magento · HubSpot · Langfuse
Outcome
≈ 28 hrs/wk sales-team time returned
how we ship retail AI in 6–8 weeks

Four stages.
With a kill point at week 6.

Every `ai in ecommerce` engagement runs the same loop: audit, pilot, ship, scale. The pilot has an explicit walk-away point at week 6 — if the metric won't move, we stop before production hardening and you don't pay phase 2. No retainer trap.

  1. Weeks 1–2

    Retail AI audit

    Two-week shadow with merch, ops, and CX. We rank candidate `ai in ecommerce` workflows by margin lift × time-to-ship, list the per-loop cost band each will run at, and call out the ones that won't pay back so you don't fund them.

    90-day retail AI roadmap, ranked, with cost bands
  2. Weeks 3–6

    Pilot — one workflow, behind a flag

    We build the single highest-ROI candidate against your real Shopify / BigCommerce / Magento / WooCommerce stack. Live behind a feature flag, baseline vs. assisted runs measured. Bimodal config (steady + peak) tested before you go anywhere near a sale event.

    One agent live behind a flag with eval data
    Walk-away point
  3. Weeks 7–8

    Ship to production

    Production hardening: logging via Langfuse, retry + fallback policies, peak-mode runbook, eval suite gated in CI. Walk-through with your team — the workflow goes live with humans in the loop, not as an internal demo.

    Production agent + peak-mode runbook
  4. Ongoing

    Scale to next workflow

    Most retail clients run 3–5 workflows by month 6. Same eval harness, same Langfuse spans, same cost-reporting cadence. Compounding learning across recommendations → search → support → inventory.

    3–5 retail AI workflows live by month 6
engagement models

Three ways to start.
Honest pricing, named outcomes.

Most retail clients start with the 2-week audit, ship one `agentic commerce` workflow on a pilot, then move to monthly for the next three to five. Cost-per-loop reported monthly on every shipped workflow — no per-loop number, no engagement.

1–2 weeks

Retail AI audit

Find which `ai in ecommerce` workflows actually pay back before you commit a budget.

$3K fixed
  • Operator shadow with merch / ops / CX
  • ROI / time-to-ship / risk scoring per workflow
  • Per-workflow run-cost band ($200–$1,500/mo)
  • 90-day retail AI roadmap with named candidates
  • Honest list of workflows that won't pay back yet
Book the audit
Most teams start here
6–8 weeks

Pilot to production

One `agentic commerce` workflow shipped end-to-end, with bimodal load tested.

$10–25K fixed price
  • Build, integrate, deploy on your platform of record
  • Steady-state + peak-mode config tested pre-launch
  • Eval suite, Langfuse traces, retry + fallback runbook
  • Baseline vs. assisted metric report at end
  • Walk-away point — if the metric won't move, no phase 2
Start a pilot
Monthly

Continuous retail AI team

Embedded squad shipping the next workflow on your roadmap.

from $5K per month
  • PM + AI engineer + ops analyst, embedded
  • Per-workflow monthly cost-of-ownership report
  • Peak-readiness review before every sale event
  • Cancel any time — no annual contract
Talk to us
Your repo, your prompts · Cost-per-loop reported monthly · Bimodal config standard · No annual contract
frequently asked — retail AI

Questions retail teams ask first.
Real answers, no hedging.

What is agentic commerce and is it production-ready?

`Agentic commerce` is the shift from rule-based retail tech to multi-step AI agents that decide what to do at the per-request level — pick a recommendation, draft a recovery email, propose a re-order quantity, route a fraud signal. The agent observes a signal, enriches it with your data, decides, acts on a real system (Shopify, Klaviyo, NetSuite), and logs the trace. We're shipping production agentic-commerce workflows today on Shopify, BigCommerce, Magento, and WooCommerce — the constraint isn't model quality anymore, it's eval discipline and bimodal-load handling. If you skip the eval suite and the peak-mode runbook, you ship a demo that breaks on the first sale event.

What's the highest-ROI AI use case in ecommerce right now?

It depends on your stack and SKU count, but the three patterns that consistently pay back in the audits we run are (1) abandoned-cart recovery agents on AOV >$120 — generic recovery is mostly ignored, personalized recovery clears 15–25% lift on the high-AOV cohort; (2) `ai inventory management` on long-tail SKUs (the 80% of SKUs your planner team can't re-buy weekly), where stock-out reduction translates directly to revenue; and (3) Tier-1 customer-support deflection with order context — the WISMO + returns workflow that's 60% of inbound. AI search and dynamic pricing pay back too but are slower to ship and need more guardrails. Recommendations look like the obvious win in slide decks; in practice they're the hardest to attribute uplift on.

How does AI inventory management actually work?

`AI inventory management` is not a model staring at a stock table. The pattern we ship is a weekly-cadence agent that ingests four streams — 90-day sales velocity per SKU, supplier lead-time and MOQ, marketing-calendar overlap (promos, campaigns), and last-year same-week curve (or weather signal for relevant categories) — and drafts a buying plan grouped by supplier. A planner reviews, edits, or rejects line-by-line; the agent learns from the edits. Margin-floor and cash-flow guardrails are non-negotiable; the agent never executes a buy autonomously. Run cost is typically $400–$900/month for a 4–10K-SKU catalog including model + warehouse spend. The win isn't replacing the planner — it's letting one planner cover 5× the SKU tail they can today.
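The draft-plan step described above can be sketched as a grouping pass with the guardrails applied per line. This is a deliberately simplified sketch — the four signal streams are collapsed into pre-scored proposals, and a cash ceiling stands in for the full margin and cash-flow guardrail set:

```python
# Buying-plan draft sketch: group proposed re-orders by supplier and
# clamp each line to supplier MOQ and a cash ceiling. Inputs are
# pre-scored proposals standing in for the four signal streams.
def draft_plan(proposals, cash_ceiling: float):
    plan, spend = {}, 0.0
    # Highest-priority lines claim budget first.
    for p in sorted(proposals, key=lambda p: -p["priority"]):
        qty = max(p["qty"], p["moq"])        # respect supplier MOQ
        cost = qty * p["unit_cost"]
        if spend + cost > cash_ceiling:
            continue                          # guardrail: never exceed budget
        spend += cost
        plan.setdefault(p["supplier"], []).append((p["sku"], qty))
    return plan, round(spend, 2)
```

The planner reviews the returned plan line-by-line; nothing here executes a buy, which is the point.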

Should we build our own AI chatbot for ecommerce or buy one?

Buy if your support is mostly WISMO + returns + FAQ on a catalog under 2K SKUs and you don't have an engineering team. The off-the-shelf `ecommerce chatbot` platforms (Gorgias AI, Zendesk Answer Bot, Tidio) are fine and cheap on that profile. Build (or hire us to build) if any of: (1) you have a catalog over 5K SKUs and need real product retrieval, not a FAQ matcher; (2) your tone and brand voice matter — every off-the-shelf bot sounds the same and your customers will notice; (3) you're on Magento / headless / multi-storefront and the off-the-shelf integrations are weak; (4) you want the conversation data in your warehouse, not a vendor's. An `ai chatbot for ecommerce` built on Claude or GPT with order-history grounding typically runs $400–$800/month in model spend versus $1,500–$5K/month seat fees on the platforms — and you own the prompts.

How do you integrate ChatGPT or Claude with Shopify?

A `shopify chatgpt integration` (or Claude variant — the pattern is identical) has three layers. (1) The signal layer is Shopify Webhooks plus Admin API polling for the events the agent reacts to (cart-abandoned, order-created, refund-initiated). (2) The decide layer is a small service — we usually run it on Cloudflare Workers or a Lambda — that holds the prompt, calls the model, and resolves the next action. (3) The write layer is Shopify Storefront / Admin API, plus your marketing stack (Klaviyo, Postscript) for the customer-facing send. Don't put the agent inside a Shopify theme app unless the merchandiser UX is the point — for most retail use cases the agent lives off-platform and the integration is webhooks + API. Token cost on a Shopify-grounded loop with prompt caching warm averages $0.001–$0.005 per decision depending on model and context size.
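The three layers separate cleanly in code. A minimal sketch of the decide-layer service with the model call and the write-layer send injected as stubs — the function and payload shapes are illustrative, not Shopify's actual webhook schema:

```python
# Three-layer shape sketch: signal (webhook) -> decide (small off-platform
# service) -> write (Admin API / marketing stack). External calls stubbed;
# the point is the separation, not a real SDK.
def on_cart_abandoned(payload, decide, write_email):
    # Signal layer: Shopify webhook delivers the event payload.
    cart = {"id": payload["cart_id"], "value": payload["value"]}

    # Decide layer: prompt + model call live in this service, off-platform.
    action = decide(cart)  # e.g. {"send": True, "offer": "5pct"}

    # Write layer: the customer-facing send goes through the marketing stack.
    if action.get("send"):
        return write_email(cart["id"], action["offer"])
    return None
```

Because `decide` and `write_email` are injected, the same handler runs on Cloudflare Workers or a Lambda and swaps Claude for GPT without touching the signal or write layers.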

What does dynamic pricing AI actually cost to run per SKU?

`Dynamic pricing AI` on long-tail SKUs runs in two cost buckets. The signal layer (competitor scraping, elasticity history, margin lookup) is typically $80–$200/month of infrastructure regardless of catalog size — it's the same scrapers and warehouse views whether you have 1K or 20K SKUs. The model layer scales with how often you reprice: at weekly cadence on 10K SKUs using Haiku 4.5 with prompt caching, model spend is roughly $0.0002 per pricing decision, so ≈ $8/month of model spend for the whole catalog. The honest cost ceiling is engineering: getting the data clean, getting the margin-floor guardrail tight, getting the price-change diff into your PIM. Total all-in is usually $400–$1,200/month for a mid-market catalog. The pattern only pays back on tail SKUs — hero SKUs stay manually priced.
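The model-layer arithmetic above checks out as a back-of-envelope, using only the figures quoted in the answer:

```python
# Back-of-envelope check of the quoted model spend: 10K SKUs repriced
# weekly at roughly $0.0002 per pricing decision.
skus = 10_000
cost_per_decision = 0.0002        # Haiku-tier, prompt caching warm
weeks_per_month = 52 / 12
monthly_model_spend = skus * cost_per_decision * weeks_per_month
# lands around $8–9/month of model spend for the whole catalog
```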

How do you stop AI from writing off-brand product descriptions?

Three things, all of which need to be in place — pick any two and you'll quietly contaminate your catalog. (1) The input is structured spec data, not just an image — material, weight, dimensions, use-case, target customer. Image-only generation drifts within 100 SKUs. (2) The prompt includes 3–5 voice examples from your existing best-rated descriptions, refreshed quarterly. (3) Every generation runs through a brand-tone classifier (we usually train a small classifier on 500 of your existing descriptions versus 500 negative examples) and only descriptions above threshold ship; the rest go to a copywriter queue. With those three, an `ai product description generator` workflow holds tone over thousands of SKUs. Without them, the output reads generic by SKU 300 and your category pages start ranking worse, not better.

How long does it take to ship AI in an ecommerce stack?

Two weeks for the audit, then six to eight weeks to first workflow live behind a feature flag — assuming we're not waiting on platform access. The breakdown: weeks 3–4 are integration scaffolding (webhooks, API auth, eval harness), weeks 5–6 are building the decide step and tuning the model picks, week 7 is bimodal-config testing (steady-state + simulated peak), week 8 is production hardening and the runbook. Narrow workflows — say, a product-description generator on a clean catalog — can ship in 3 weeks. Wider workflows that touch ERP and merchandising (`ai inventory management`, dynamic pricing) take the full 8. We don't quote 2-week timelines for 8-week work, and the pilot has an explicit kill point at week 6: if the metric won't move, we stop before production hardening and you don't pay phase 2.

Ready to ship

Stop A/B-testing the same nudge.
Ship AI that knows the customer.

Book a free 30-minute retail AI audit. We'll identify two or three high-ROI `ai in ecommerce` candidates from your stack, give you a per-workflow cost band, and tell you which ones won't pay back yet. No deck, no obligation to build.

30 min, async or live · No NDA required · You leave with a written roadmap