ai tools for business · the stack we ship

AI tools and frameworks,
picked per workflow, not by listicle.

The 2026 AI tech stack we actually ship in — across 8 service pillars, model-agnostic, eval-tested. Claude · GPT · Gemini · open-weights for the model layer. LangGraph · OpenAI Agents SDK · CrewAI for orchestration. pgvector · Pinecone · Langfuse · Braintrust for data and observability. We pick. Listicle authors can't.

See the 5-layer stack
8
AI service pillars we ship — each tool below maps to one
Daily
we use Claude Code + OpenAI Codex internally
$3K
audit picks the stack per workflow before you commit
30 days
first workflow live on the chosen stack
ai tools · best ai tools · ai model comparison

Every other page on this SERP is a list.
This one tells you which to use.

The top 10 results for ai tools, ai frameworks, and ai tools for business are 10 numbered lists. None of them tell you which tool wins for which job, which LLM sits behind it, or what it costs to ship. We've made these picks for real buyers across all 8 of our AI service pillars — so this hub is the synthesis, not another listicle.

Decision support, not a top-10 list

Every other page on this SERP is a numbered list of AI tools. This one tells you which framework wins for which job, which LLM sits behind it, and which of our 8 service pillars to engage if you want it shipped.

Model-agnostic, openly

We don't sell a tool. We pick across Claude, GPT, Gemini, and open-weights per workload — on your eval data, not a partner badge. Sibling pillars at /services/claude-development/ and /services/openai-development/ prove it.

Priced on this page, not behind a contact form

$3K audit, $10–25K pilot, from $5K/month continuous. Listicles avoid pricing because they don't sell the build. We do — so the number is on the page.

modern ai stack · ai infrastructure · ai tech stack

The AI stack we ship in,
five layers — pick any to open.

Every enterprise ai platform on the SERP is these same five layers glued into a single product wrapper. We treat the layers separately because that's how decisions actually get made in production: pick a model, pick an orchestration framework, pick a data layer, pick observability, pick infra. Each layer below opens to the tools we name, the failure modes we've hit in production, and our default unless there's a reason not to.

  1. Model layer · tools we name
    • Claude Sonnet 4.6
    • Claude Haiku 4.5
    • GPT-5.4
    • GPT-5.4-mini
    • Gemini 2.0
    • Llama 3.3 (self-hosted)
    • Qwen 2.5
    • Mistral
    • XGBoost · scikit-learn (when an LLM is the wrong tool)
    production failure modes
    • Single-vendor lock-in — an outage takes the whole product down with no failover path.
    • Flagship model picked for everything — the classifier workflow costs 12× what it should.
    • Defaulting to an LLM where a rule engine or classical ML would beat it on cost and quality.
    our default · unless reason not to

    Sonnet 4.6 for long-context tool use and production reasoning; Haiku 4.5 or GPT-5.4-mini for high-volume narrow tasks; GPT-5.4 where the Realtime API, vision pipeline, or Codex matters; open-weights for sovereign data and cost-floor at scale. The pick is a per-workflow decision based on the eval set — not a vendor badge.
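
In code, "picked per workflow" is just a routing table that the eval set maintains. A minimal sketch — the workflow names and model IDs below are hypothetical placeholders, not picks from this page:

```python
# Hypothetical per-workflow routing table. Entries are owned by the eval set:
# a pick changes only when the golden-set run says the new model wins.
WORKFLOW_MODELS = {
    "support-triage":   "claude-haiku-4-5",    # high-volume narrow task
    "coding-agent":     "claude-sonnet-4-6",   # long-context tool use
    "voice-frontdesk":  "gpt-5.4-realtime",    # sub-second voice
    "invoice-classify": "xgboost-local",       # not an LLM job at all
}

def pick_model(workflow: str) -> str:
    """Resolve the model for a workflow; unknown workflows fail loudly."""
    try:
        return WORKFLOW_MODELS[workflow]
    except KeyError:
        raise ValueError(f"no eval-backed pick recorded for workflow {workflow!r}")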

  2. Orchestration & agent frameworks · tools we name
    • LangGraph
    • OpenAI Agents SDK
    • CrewAI
    • AutoGen
    • Anthropic computer use
    • Hand-rolled Python tool loops
    • Temporal / Inngest (for the durable workflow shell)
    production failure modes
    • Reaching for LangGraph on a 3-step workflow — the framework outweighs the problem.
    • Agent loops on the same tool 12 times before hitting a max-step cutoff.
    • Tool-use schema drifts from the model's actual output — silent JSON parse failure.
    our default · unless reason not to

    Plain Python tool loops for anything under ~6 steps. LangGraph when the graph branches or you need durable replay. OpenAI Agents SDK when the workflow is GPT-only and the team wants vendor-native ergonomics. CrewAI for fast prototyping of role-based multi-agent scenes. AutoGen rarely — we've found the others ship cleaner.
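
The "plain Python tool loop" we default to is small enough to read on one screen. A minimal sketch — call_model() and the tool registry are hypothetical stand-ins for your vendor SDK and tool set; the max-step cutoff guards the loop-forever failure mode listed above:

```python
# Minimal tool loop: ask the model, run the tool it requests, feed the result
# back, stop when it answers or hits the step cutoff. call_model() is a
# hypothetical wrapper returning {"tool": ..., "args": ...} or {"answer": ...}.
MAX_STEPS = 6

def run_tool_loop(task: str, call_model, tools: dict) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        reply = call_model(messages)
        if "answer" in reply:
            return reply["answer"]                      # model declared itself done
        result = tools[reply["tool"]](**reply["args"])  # execute the requested tool
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError(f"agent hit the {MAX_STEPS}-step cutoff without answering")
```

When the loop grows branches, retries, or replay requirements, that's the signal to graduate to LangGraph.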

  3. Data & retrieval · tools we name
    • pgvector (Postgres)
    • Pinecone
    • Qdrant
    • Weaviate
    • LlamaIndex
    • LangChain index
    • Unstructured.io
    • Cohere Rerank
    • BM25 hybrid retrieval
    production failure modes
    • Embeddings re-index on every deploy — vector-DB bill quietly balloons.
    • Chunking ignores document structure — answer quality plateaus and won't improve.
    • No reranker — top-k retrieval returns plausible-but-wrong context.
    our default · unless reason not to

    pgvector on your existing Postgres for ≤2M chunks — operationally simpler than another vendor. Pinecone or Qdrant past that scale. LlamaIndex for the loader / chunker layer; LangChain only where its index abstractions actually save time. Hybrid BM25 + dense retrieval + Cohere Rerank by default — the quality lift is bigger than picking a fancier embedding model.
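
Mechanically, "hybrid BM25 + dense" means running both retrievers and fusing the rankings before the reranker sees anything. A sketch using reciprocal rank fusion — BM25Okapi is the real rank_bm25 class; dense_scores() is a hypothetical stand-in for your embedding search:

```python
from rank_bm25 import BM25Okapi

def hybrid_retrieve(query: str, chunks: list[str], dense_scores,
                    k: int = 20, rrf_c: int = 60) -> list[str]:
    # Sparse leg: classic BM25 over whitespace-tokenized chunks.
    bm25 = BM25Okapi([doc.split() for doc in chunks])
    sparse = bm25.get_scores(query.split())
    # Dense leg: hypothetical embedding search returning one score per chunk.
    dense = dense_scores(query, chunks)

    def ranks(scores):  # position of each chunk when sorted best-first
        order = sorted(range(len(chunks)), key=lambda i: -scores[i])
        return {i: r for r, i in enumerate(order)}

    rs, rd = ranks(sparse), ranks(dense)
    # Reciprocal rank fusion: reward chunks that rank well on either leg.
    fused = {i: 1 / (rrf_c + rs[i]) + 1 / (rrf_c + rd[i]) for i in range(len(chunks))}
    top = sorted(fused, key=fused.get, reverse=True)[:k]
    return [chunks[i] for i in top]  # hand these to the reranker, then the model
```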

  4. Observability, eval & cost · tools we name
    • Langfuse
    • Braintrust
    • Arize Phoenix
    • Helicone
    • Inspect AI
    • Hand-rolled pytest eval harnesses
    • OpenTelemetry · GenAI semantic conventions
    production failure modes
    • No eval set — "works on my prompt" ships to prod and silently degrades over the next model swap.
    • Trace data lives nowhere — first you hear about a regression is a customer email.
    • Cost telemetry isn't wired in; you find out about a runaway prompt loop on the monthly invoice.
    our default · unless reason not to

    Langfuse for prompt + trace observability across vendors; Braintrust or a hand-rolled pytest harness for the golden-set regression; shadow-mode mirroring before every cutover. AI observability is the highest-CPC keyword cluster on this hub for a reason — buyers are starting to ask. If there's no eval suite, there's no ship, even from us.
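
"No eval suite, no ship" is cheap to enforce — the hand-rolled pytest harness is a dozen lines. A sketch assuming a hypothetical golden.jsonl of cases and a run_workflow() under test:

```python
import json
import pathlib

import pytest

from myapp.workflow import run_workflow  # hypothetical: the workflow under test

# One JSON object per line: {"id": ..., "input": ..., "expected_fragment": ...}
GOLDEN = [json.loads(line) for line in
          pathlib.Path("golden.jsonl").read_text().splitlines()]

@pytest.mark.parametrize("case", GOLDEN, ids=[c["id"] for c in GOLDEN])
def test_golden_case(case):
    output = run_workflow(case["input"])
    # Exact-fragment floor; fuzzier LLM-as-judge checks live in a separate suite.
    assert case["expected_fragment"] in output
```

Run it in CI on every prompt change and every model swap — the suite, not the demo, is what certifies the cutover.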

  5. Infrastructure & compliance · tools we name
    • Anthropic API direct
    • OpenAI API direct
    • AWS Bedrock
    • Azure OpenAI
    • Google Vertex AI
    • Cloudflare Workers AI
    • PrivateLink · KMS · BAA
    • Kubernetes / ECS for self-hosted open-weights
    production failure modes
    • AI vendor outage takes the whole product down — no multi-vendor failover.
    • Prompt logs hit the wrong region — accidental data-residency breach on a Tuesday.
    • Cold-start latency on serverless adds 800ms to a sub-second voice agent.
    our default · unless reason not to

    Anthropic + OpenAI direct for fastest model access; Azure OpenAI or AWS Bedrock when compliance posture (HIPAA BAA, SOC 2, FedRAMP) requires it. Multi-vendor failover wired in for any workload above $5K/mo run cost — single vendor at that scale is a liability we won't sign off on.
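
The failover wiring itself is unglamorous. A minimal sketch of the pattern — model IDs are illustrative, and a production build adds retries, timeouts, and health-based routing rather than one bare try/except:

```python
import anthropic
import openai

def complete_with_failover(prompt: str) -> str:
    """Try the primary vendor; on an API error, fail over to the secondary."""
    try:
        msg = anthropic.Anthropic().messages.create(
            model="claude-sonnet-4-6",  # illustrative model ID
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    except anthropic.APIError:
        resp = openai.OpenAI().chat.completions.create(
            model="gpt-5.4",            # illustrative model ID
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```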

Defaults reflect our current 2026 operator playbook — picked per workflow, not per partner badge. The rationale for each is in the per-layer detail above.

claude vs gemini · anthropic vs openai · best llm for coding

Pick a model:
Claude · GPT · Gemini · open-weights.

The first decision on any AI build is the model. We ship all four families and pick per workflow on the eval data, not on the discovery-call demo. Below: when each one wins, when each one loses, and which sibling pillar covers the engagement detail. Claude vs GPT — the honest version — is almost always 'both, picked per workflow.'

Claude — Sonnet 4.6 / Haiku 4.5

Default for long-context tool use, production reasoning, and any workflow where instruction-following stability matters. Wins most coding-agent evals in our internal benchmarks (best llm for coding) — Claude Code is the tool we use daily. Read the full <a href="/services/claude-development/">Claude development pillar</a> for the engagement detail.

OpenAI — GPT-5.4 / Realtime / Codex

Picked when the workflow needs the Realtime API (sub-second voice), GPT vision pipelines, image generation, or OpenAI Codex for engineering. ChatGPT brand recognition matters for some buyer-facing use cases. Read the <a href="/services/openai-development/">OpenAI development pillar</a> — same operator team, different vendor.

Gemini — 2.0 / 1M+ context

Niche but real. Wins on massive-context document QA (1M+ token windows beat chunking + retrieval for some workloads) and tight Google Workspace integration. Not our default for general production work, but we'll ship it where the context size or the Workspace surface decides it.

Open-weights — Llama · Qwen · Mistral

Self-hosted Llama 3.3, Qwen 2.5, or Mistral when sovereign data, air-gapped deployment, or cost-floor-at-scale make the per-token bill on hosted vendors untenable. Run cost only — no per-token fee. The trade is operational overhead; if you don't have an SRE team, hosted vendors usually still win.

ai agent frameworks · best ai agent framework · langchain alternatives

Orchestration & agent frameworks —
LangGraph, OpenAI Agents SDK, CrewAI, AutoGen.

The four frameworks the 2026 SERP keeps citing — graded on production weight, vendor lock-in, and where each one breaks. LangGraph is our default for non-trivial agent work; the OpenAI Agents SDK wins when the team is GPT-only and wants vendor-native ergonomics. CrewAI lives in prototype land; AutoGen we rarely reach for. Full agent engagement detail lives on the AI agent development pillar.

LangGraph — stateful graph orchestration · OpenAI Agents SDK — vendor-native, GPT-only · CrewAI — role-based multi-agent · AutoGen — Microsoft Research framework

Best when — where this framework fits naturally.
  • LangGraph: branching graphs · durable replay · multi-model routing
  • OpenAI Agents SDK: GPT-only stack · vendor-native handoffs · short ramp
  • CrewAI: fast prototyping of role-based scenes · demos and PoCs
  • AutoGen: research-y multi-agent loops · less production tooling

Model lock-in — how easy to swap vendors mid-build.
  • LangGraph: vendor-agnostic — Claude, GPT, Gemini, open-weights
  • OpenAI Agents SDK: GPT-locked · porting cost if you ever need to switch
  • CrewAI: multi-vendor via LiteLLM under the hood
  • AutoGen: multi-vendor · OSS · pick your model adapter

Production weight — how much framework you carry into prod.
  • LangGraph: heavy — worth it past ~6 steps; overkill below that
  • OpenAI Agents SDK: light · idiomatic Python · easy to read 6 months later
  • CrewAI: light at first · gets brittle as the scene grows
  • AutoGen: mid — production patterns less battle-tested

Where it breaks — the failure mode we've actually hit.
  • LangGraph: state serialization gotchas when re-hydrating long runs
  • OpenAI Agents SDK: handoff tracing thin · debugging multi-agent gets murky
  • CrewAI: role prompts drift; quality regresses as agents accumulate
  • AutoGen: termination conditions vague — runs loop until budget

Our default pick — unless the workflow says otherwise.
  • LangGraph: yes — for branching, durable, multi-model agent work
  • OpenAI Agents SDK: yes — when the team is GPT-only and wants ergonomics
  • CrewAI: prototypes only · re-platform before going to prod
  • AutoGen: rarely · the others ship cleaner for the same job

Picks reflect production builds we've shipped in 2026. The frameworks with mixed verdicts are still useful — they just earn their weight less often than we'd hoped from their docs.

Building a multi-step agent? Start with the pillar.

We have a dedicated AI agent development pillar with our AgentRecipePicker, tool-schema patterns, and the agent engagement model. The matrix above is the shortlist — the full playbook is one click away.

vector databases · rag frameworks · best vector database

Data & retrieval —
vector DBs, RAG, and the chunking question.

The data layer is where AI quality lives. The best vector database question matters less than the chunking and reranking strategy on top of it — and the latter is where most production AI quality plateaus. We default to pgvector on your existing Postgres until scale or hybrid-search needs force a swap. Full data-flow detail lives on the AI integration services pillar.

Vector databases

Where the embeddings live. pgvector on your existing Postgres handles ≤2M chunks operationally — usually the right pick for the first build. Pinecone, Qdrant, and Weaviate take over past that scale or when you need hybrid search out of the box. Read the <a href="/services/ai-integration-services/">AI integration services pillar</a> for the full data-flow story.

RAG frameworks & loaders

LlamaIndex for the loader / chunker / index layer — its document parsing is harder to beat than people expect. LangChain only where its abstractions actually save time over plain Python. Unstructured.io for messy real-world inputs (scanned PDFs, slide decks, mixed-format archives). Best vector database is a less interesting question than best chunking strategy.

Reranking & hybrid retrieval

BM25 + dense vector + Cohere Rerank by default. The quality lift from reranking is consistently bigger than picking a fancier embedding model. If your RAG quality has plateaued and you're reaching for GPT-5.5 to fix it, the answer is usually a reranker — at a tenth of the cost.

Need AI plugged into your existing data and systems?

The AI integration services pillar covers the full DataFlowDiagram, PlatformExplorer for Salesforce / Slack / NetSuite, and the audit-log + retry patterns we ship.

Read AI integration services →
ai observability · llm observability · llm monitoring · mlops tools

Observability, eval, and cost —
the layer nobody on the SERP shows.

AI observability is the highest-paying keyword cluster in our research — $47.96 CPC on the head term — and zero of the top-10 listicles give it real treatment. This is where production AI lives or dies. Four pieces every serious stack needs: tracing, eval suites, drift monitoring, and cost telemetry. Read the AI development pillar for the eval and cost-of-ownership engagement detail; read the AI consulting pillar for the audit that picks them per workflow.

Tracing — Langfuse · Braintrust

Every prompt, completion, tool call, and cost recorded with a trace ID. Langfuse for OSS-friendly self-host; Braintrust when the team wants the managed product and the eval harness in one tool. Without trace data, the first you hear about a regression is a customer email.

Eval suites — Inspect AI · pytest

Golden-set regression run on every prompt change and every model swap. Inspect AI from the UK AISI for serious eval harnesses; hand-rolled pytest for narrow workflows. The eval set is the AI's unit test — and the layer that survives the next model release. AI observability and llm observability live or die here.

Drift & quality monitoring

Shadow-mode mirroring before cutover; production sampling + LLM-as-judge scoring after. Catches the silent quality regression that token math alone won't. LLM monitoring is what separates a stack that's still shipping value at month 6 from one that quietly broke at month 2.
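
The post-cutover half of that, in sketch form: sample production traces nightly and score them with a judge. judge() is a hypothetical LLM-as-judge call; the 1–5 rubric and the floor are illustrative:

```python
import logging
import random
import statistics

def nightly_quality_check(traces: list[dict], judge,
                          sample_n: int = 50, floor: float = 4.0) -> float:
    """judge(input, output) is a hypothetical LLM-as-judge returning a 1-5 score."""
    sample = random.sample(traces, min(sample_n, len(traces)))
    scores = [judge(t["input"], t["output"]) for t in sample]
    mean = statistics.mean(scores)
    if mean < floor:
        # Token math alone won't catch this regression; the judge does.
        logging.warning("quality drift: judge mean %.2f below floor %.1f", mean, floor)
    return mean
```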

Cost telemetry — token + dollar

Per-workflow, per-customer cost dashboards. Wired in at the same layer as latency and error-rate, not bolted on by Finance later. We've watched runaway loops cost a client $9K in a weekend — cost telemetry would have alerted at $200. Read the <a href="/services/ai-development/">AI development pillar</a> for the eval and cost-of-ownership engagement model.
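
The wiring is a few lines once a metrics client exists. A sketch — the per-MTok prices and the emit() client are illustrative, not a vendor rate card:

```python
# Illustrative (input, output) prices per million tokens — not a real rate card.
PRICE_PER_MTOK = {
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (0.80, 4.00),
}

def record_cost(workflow: str, model: str, usage, emit) -> float:
    """usage carries .input_tokens / .output_tokens; emit() is a hypothetical metrics client."""
    price_in, price_out = PRICE_PER_MTOK[model]
    dollars = (usage.input_tokens * price_in + usage.output_tokens * price_out) / 1e6
    emit("ai.cost.usd", dollars, tags={"workflow": workflow, "model": model})
    return dollars
```

An alert threshold on that metric — $200/day, say — is the difference between a page on Saturday morning and a $9K invoice on Monday.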

ai orchestration platform · enterprise ai platform · ai tools for enterprise

Tool to engagement —
pick the job, we'll point at the pillar.

This is the hub's structural payload. Each row is a job-to-be-done buyers actually arrive on this page for; each row maps to the stack we ship, why we picked it, and the sibling service pillar where the full engagement detail lives. We don't try to ship the build from this page — we route you to the pillar that does.

Each row: the job, the recommended stack (what we ship today), why we pick it (the decision, in one line), and where the full engagement detail lives.

Build a generative AI app end-to-end — frontend + backend + retrieval + agent + eval.
  • Recommended stack: Claude Sonnet · Next.js or Flutter · pgvector · LangGraph · Langfuse
  • Why we pick it: default stack for production GenAI — model-agnostic, eval-first
  • Read the full pillar: /services/ai-development/ (AI software development company)

Automate a back-office workflow with AI — agents doing the work, not just summarizing it.
  • Recommended stack: LangGraph · Temporal · Claude Haiku · Salesforce/NetSuite/Slack tool use
  • Why we pick it: durable orchestration over your real systems · 6–8 week pilot
  • Read the full pillar: /services/ai-automation/ (AI automation agency)

Build a multi-step autonomous agent — tool use, memory, planning, recovery.
  • Recommended stack: LangGraph + Claude Sonnet · custom tool schemas · AgentRecipePicker patterns
  • Why we pick it: best-of-breed agent framework + most stable tool-use model in production
  • Read the full pillar: /services/ai-agent-development/ (AI agent development)

Plug AI into Salesforce, Slack, NetSuite, etc. — integration into your existing system of record.
  • Recommended stack: Claude or GPT · platform-native SDK · PlatformExplorer pattern · audit logs
  • Why we pick it: treat AI as one more service in your integration mesh — not a new platform
  • Read the full pillar: /services/ai-integration-services/ (AI integration services)

Ship a customer-service chatbot — web · WhatsApp · voice · Slack · email.
  • Recommended stack: Claude Sonnet · RAG over docs+tickets · ChannelMatrix · human-in-loop fallback
  • Why we pick it: tier-1 deflection without the failure modes of pure-LLM bots
  • Read the full pillar: /services/ai-chatbot-development/ (AI chatbot development)

Build on Anthropic Claude specifically — Sonnet 4.6 / Haiku 4.5 / Claude Code agents.
  • Recommended stack: Claude Sonnet 4.6 · prompt caching · Anthropic computer use · LangGraph
  • Why we pick it: long-context tool use, instruction-following stability, token-cost playbook
  • Read the full pillar: /services/claude-development/ (Claude developers)

Build on OpenAI GPT specifically — GPT-5.4 · Realtime API · Codex · Assistants.
  • Recommended stack: GPT-5.4 + Realtime API · Agents SDK or LangGraph · vision pipelines · Codex
  • Why we pick it: sub-second voice, vision, and the Codex eng workflow we use internally
  • Read the full pillar: /services/openai-development/ (OpenAI developers)

Get a strategy + roadmap before you build — AI consulting · audit · readiness assessment.
  • Recommended stack: workflow audit · ROI prioritisation · model-pick rationale · 90-day roadmap
  • Why we pick it: fit-test before commit — most teams arrive here, then pilot the top workflow
  • Read the full pillar: /services/ai-consulting/ (AI consulting company)

Eight jobs, eight pillars. If the job-to-be-done isn't in this table, the $3K audit exists to find which pillar fits — and sometimes the right answer is a non-AI baseline.

ai tools comparison · ai framework comparison · ai tools consulting

Build with a framework, or buy a platform —
ten honest checks.

Every listicle on this SERP gives every tool a glowing paragraph. Real buyer discipline says some workflows shouldn't be built at all — Lindy or Stack AI or a $50/seat SaaS will outperform a custom build at a tenth the cost. Below: ten green/red flag checks we run before quoting a pilot. If the green column wins, we build. If the red column wins, we'll tell you to buy.

your vendor scorecard

tap pass / fail on each criterion · saved locally in your browser

  • 01 · Eval set already exists
    Green: You have a golden set of ~50–200 real inputs with expected outputs. You know what "good" looks like and can measure it.
    Red: There's no eval set. Quality is judged by whichever prompt the founder demoed last. Build a regex baseline first.

  • 02 · The workflow is non-trivial
    Green: ≥3 steps, branching logic, tool use against real systems, or genuinely novel reasoning. A framework earns its weight.
    Red: Single API call to a model with a prompt and a return value. Use the vendor SDK directly. Skip the framework entirely.

  • 03 · You need multi-vendor failover
    Green: Workload is above ~$5K/mo or business-critical. Worth wiring routing across Claude + GPT + open-weights from day one.
    Red: PoC budget under $500/mo; single vendor for now is fine. Add abstraction when revenue says you can afford the overhead.

  • 04 · Volume justifies running the infra
    Green: Steady-state >10M tokens/month on a workload where open-weights would win on $/token. Self-host pays back in <6 months.
    Red: Bursty, narrow, or low-volume. Hosted Claude/GPT is cheaper than the SRE time to run open-weights.

  • 05 · Privacy/compliance forces self-host
    Green: HIPAA, FedRAMP, sovereign data, air-gapped deployment. The vendor BAA + PrivateLink genuinely doesn't cover it.
    Red: "We're worried about data" with no specific regulation. Azure OpenAI BAA + PrivateLink covers most of this — buy the platform.

  • 06 · Latency budget is tight
    Green: Sub-second voice agent · live-stream UX · sub-200ms classification at scale. You need to control the runtime.
    Red: Async batch processing, human-in-the-loop, or anything where a 2-second response is acceptable. Buy the managed API.

  • 07 · Team can operate the layer
    Green: You have or can hire an ML/AI engineer who can debug a LangGraph state machine or a self-hosted Llama deployment.
    Red: Single full-stack dev who's never debugged a vector index. Buy a managed platform — operating cost will eat the savings.

  • 08 · Workflow shape will keep changing
    Green: Active product evolution, new tool integrations every sprint, prompt engineering owned in-house. A framework gives you control.
    Red: Stable, narrow workflow that hasn't changed in 6 months. Managed SaaS (Lindy, Stack AI, Glean) is cheaper and faster.

  • 09 · Token cost is the bottleneck
    Green: You've maxed prompt caching, batch API, and model routing — and the bill still hurts. Open-weights or a custom optimization pass earns it.
    Red: You haven't tried prompt caching or batch API yet. Do those first; they typically cut effective cost to 8–15% of naive.

  • 10 · There's no off-the-shelf product
    Green: You looked. Honestly. Lindy, Stack AI, Glean, Zapier AI, vertical SaaS — none of them fit the workflow. Build is the only path.
    Red: There's a product that does 80% of it for $50/seat/mo. Buy that, integrate around the edges, and save the build budget for what's actually unique.

Copy this rubric into your next AI vendor discovery call. If a vendor won't honestly place themselves on the red side of any row, that's your data point.

ai tools consulting · engagement model

Three ways to start.
Audit, pilot, or continuous.

Same pricing as every sibling pillar — $3K audit, $10–25K pilot, from $5K/mo continuous. The audit picks the tools for each workflow before any build budget gets committed; the pilot ships one workflow end-to-end on the chosen stack; the continuous team carries the next ones with quarterly re-evaluation. We pick across vendors openly, and we'll tell you when no tool is the right answer.

1–2 weeks

AI stack audit

We pick the tools and frameworks for each workflow before you commit a build budget.

$3K fixed
  • Workflow inventory + ranked ROI shortlist
  • Per-workflow model pick (Claude / GPT / Gemini / open-weights)
  • Recommended framework (LangGraph / OpenAI Agents SDK / plain Python / SaaS)
  • Vector DB + retrieval recommendation per data shape
  • Observability + eval-suite spec for the chosen stack
  • 90-day implementation roadmap with named tools
Most teams start here
4–8 weeks

AI stack pilot

One workflow shipped end-to-end on the chosen stack — eval, monitoring, and runbook included.

$10–25K fixed price
  • Eval set rebuilt against your real data
  • Build on the audit-recommended stack (model · framework · DB · obs)
  • Deploy behind a feature flag with shadow-mode mirroring
  • Token-optimization pass post-cutover
  • Walk-away point — if the metric won't move, no phase 2
Monthly

Continuous AI team

Embedded squad shipping the next workflow on the same stack — or migrating it when something better lands.

from $5K per month
  • PM + AI engineer + ops analyst, embedded
  • Monthly cost-of-ownership + token-spend report
  • Drift, eval, and retry-rate monitoring
  • Tool/framework re-evaluation every quarter — we'll tell you when to swap
  • Cancel any month — no annual contract
Talk to us
Model-agnostic, openly · Your repo, your prompts, your keys · BAA / DPA available · Tools picked per workflow, not per partner badge
frequently asked

Questions AI tool buyers ask most.
Honest answers, including when to walk away.

Which AI framework should we use for our use case?

The honest answer: it depends on the workflow shape, not the framework's marketing page. Three rules of thumb. (1) Under ~6 steps, no branching, narrow tool use → plain Python with the vendor SDK directly. A framework adds weight you don't need. (2) Branching graph, durable replay, multi-vendor routing → LangGraph. It's the most production-tested orchestration layer in the 2026 stack. (3) Multi-agent role-based scenes → CrewAI for the prototype, then re-platform to LangGraph before production. The $3K audit picks the framework per workflow with your eval data — we'd rather you not build on a framework that's wrong for the shape.

What's the difference between LangChain, LangGraph, CrewAI, and the OpenAI Agents SDK?

LangChain is the older general-purpose toolkit — useful pieces (loaders, retrievers), but production code on top of it tends to fight the abstractions. LangGraph is the same team's stateful graph framework — what we reach for when the orchestration is non-trivial; vendor-agnostic. CrewAI is role-based multi-agent — great for prototypes of "researcher + writer + reviewer" scenes, brittle as the cast grows. OpenAI Agents SDK is GPT-native — clean ergonomics if your stack is GPT-only and you don't need multi-vendor failover. Most langchain alternatives questions on this SERP collapse to: "use LangGraph for graphs, OpenAI Agents SDK for GPT-only, plain Python for everything under 6 steps." Read the full <a href="/services/ai-agent-development/">AI agent development pillar</a> for our agent engagement model.

Should we pick Claude, GPT, or Gemini for our build?

We ship all three. Claude vs GPT vs Gemini, briefly: Claude Sonnet 4.6 wins most of our internal coding-agent and long-context tool-use evals — it's the default for production agent work and the model we'd reach for on best llm for coding. GPT-5.4 wins when the workflow needs the Realtime API for sub-second voice, vision pipelines, image gen, or the Codex engineering workflow. Gemini 2.0 wins on massive-context document QA (1M+ tokens beats chunking on some workloads) and tight Google Workspace integration. Open-weights win on sovereign data and cost-floor at scale. The right answer to anthropic vs openai is almost always "both" — pick per workflow on the eval, not per partner badge. See the dedicated pillars: <a href="/services/claude-development/">Claude development</a> and <a href="/services/openai-development/">OpenAI development</a>.

Do we need a vector database, or is the LLM's context window enough?

Both, often. The rule we ship: if your retrievable corpus is under ~200K tokens and changes rarely, paste it into the system prompt and use prompt caching — no vector database needed. If it's larger, changes, or needs precise retrieval, you need vector search. The best vector database question is less interesting than the chunking and reranking question — we default to pgvector on existing Postgres for ≤2M chunks (operationally simpler than another vendor) and Pinecone or Qdrant past that. Hybrid BM25 + dense retrieval + Cohere Rerank consistently beats picking a fancier embedding model. The retrieval engagement detail lives on the <a href="/services/ai-integration-services/">AI integration services pillar</a>.
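
For the small-stable-corpus case, the paste-and-cache path looks roughly like this — a sketch using the Anthropic SDK's cache_control block; the model ID and the file are illustrative:

```python
import anthropic

client = anthropic.Anthropic()
corpus = open("handbook.md").read()  # hypothetical corpus under ~200K tokens

msg = client.messages.create(
    model="claude-sonnet-4-6",       # illustrative model ID
    max_tokens=512,
    system=[{
        "type": "text",
        "text": corpus,
        # Subsequent calls read the corpus from cache at a reduced input rate.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "What is the parental-leave policy?"}],
)
print(msg.content[0].text)
```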

What does an "enterprise AI platform" actually mean — and do we need one?

"Enterprise AI platform" is the analyst-coined umbrella for everything in our 5-layer stack glued together with audit logs, RBAC, BAA paperwork, and a single vendor invoice. Real enterprise ai tools today sit in three buckets. (1) Hyperscaler stacks — AWS Bedrock, Azure OpenAI, Google Vertex — when compliance posture (HIPAA, SOC 2, FedRAMP) forces the buy. (2) Managed agent platforms — Lindy, Stack AI, Glean — when the workflow is narrow and the build doesn't earn its keep. (3) Custom stacks like ours — when the workflow is unique enough that platforms can't shape it. Most enterprises we work with end up with all three: a hyperscaler for infra, a SaaS for narrow workflows, and a custom build for the high-leverage one. The audit picks the bucket per workflow.

How do you measure AI quality in production?

AI observability (llm observability, in search terms) is the layer most listicles skip and the one where production AI lives or dies. Four pieces. (1) Tracing — every prompt, completion, tool call, and cost stamped with a trace ID. Langfuse for OSS-friendly self-host; Braintrust when the team wants the eval harness in the same tool. (2) Eval suites — golden-set regression run on every prompt change and every model swap. Inspect AI for serious harnesses; hand-rolled pytest for narrow workflows. (3) Drift monitoring — shadow-mode mirroring before cutover, production sampling + LLM-as-judge scoring after. (4) Cost telemetry — per-workflow, per-customer dashboards wired in at the same layer as latency. Without the eval set, you don't know when the model swap regressed quality. Without cost telemetry, a runaway prompt loop costs $9K before anyone notices.

How much does it cost to set up an AI stack?

Three engagement tiers, no surprises. A one-to-two-week ai infrastructure audit is $3,000 — workflow inventory, model + framework pick per workflow, token-cost projection, observability spec, 90-day roadmap. A pilot is $10,000–$25,000 fixed price, 4–8 weeks — one workflow shipped end-to-end on the chosen stack with eval, monitoring, and runbook. A continuous AI team is from $5,000 per month — embedded PM + AI engineer + ops analyst, monthly cost-of-ownership reporting, and quarterly re-evaluation of the tooling. Per-workflow run cost at steady state typically lands at $200–$1,500/month depending on volume and which model tier the workflow uses. Listicles avoid pricing because they don't sell the build. We do, so the numbers are on the page.

When is "no framework" the right answer?

Three honest cases. (1) The workflow is under ~6 steps with no branching — plain Python and the vendor SDK is cleaner, easier to debug, and faster to ship than LangGraph. (2) An off-the-shelf product covers 80% of it. If Lindy, Stack AI, Glean, Zapier AI, or a vertical SaaS does the workflow for $50/seat/month, buy it and save the build budget for what's actually unique. (3) The problem isn't AI-shaped. Regex, rule engines, and classical ML (XGBoost, scikit-learn) outperform LLM calls on cost and quality for plenty of workloads — form parsing on a single PDF type, structured ETL, narrow classification. The <a href="/services/ai-consulting/">AI consulting audit</a> exists partly to tell you when the answer is "don't build with an AI framework at all."

Ready to ship

Book an AI stack audit
and walk out with the tools picked.

One to two weeks, $3K fixed-fee. We inventory your candidate AI workflows, rank them by ROI, pick the model and framework per workflow, recommend the data + observability + infra layer, project token cost at steady state, and hand you a 90-day implementation roadmap. No deck, no obligation to build with us afterward.

See the related pillars
Fixed $3K · 1–2 weeks · no obligation to pilot · Model-agnostic across Claude, GPT, Gemini, open-weights · Tools picked per workflow, not per partner badge