From Demos to Dependable: Engineering the ‘March of Nines’ for Enterprise AI Agents

Andrej Karpathy has a concise way to puncture AI hype: “When you get a demo and something works 90% of the time, that’s just the first nine.” For enterprise teams turning large language models into real products, that line is less a quip and more an operating principle.

The “March of Nines” describes how reliability improves as you add “nines” to your success rate—90%, 99%, 99.9%, 99.99% and beyond. The core lesson: each additional nine typically costs as much engineering effort as the previous one, and the difference between a slick demo and dependable software is measured in those later nines.

For agentic workflows—multi-step pipelines that parse intent, retrieve context, call tools, validate results, and log outcomes—the compounding effect of small failure rates can make a 90% demo feel unusable in production. To make AI agents safe and dependable, enterprises need to engineer reliability as intentionally as they engineered uptime for their core SaaS platforms.

From “first nine” demos to production reality

In a single model call, 90% success might look impressive. In a conference room demo, it often is. The problem emerges when that same capability is embedded into a real workflow with many moving parts.

Typical enterprise agent flows are not single-shot prompts. They chain steps such as:

Intent parsing
Context retrieval from internal systems
Planning a multi-step action sequence
One or more tool or API calls
Validation and safety checks
Formatting outputs for downstream systems
Audit logging for compliance

Each of those steps can fail or degrade: an ambiguous intent, an empty retrieval, a flaky connector, an invalid JSON payload, a timeout. When each step succeeds “most of the time,” the end-to-end experience can still feel broken for the user.

This is why engineering teams and technical leaders quickly find that a 90% or even 99% demo does not translate to acceptable behavior in production. The March of Nines forces a more rigorous framing: what level of reliability will users perceive as “software that works,” and what engineering work is required to get there?

The compounding math: why 90% is prototype territory

Karpathy summarizes the engineering reality bluntly: “Every single nine is the same amount of work.” Why those nines matter becomes obvious when you treat an AI agent as a workflow of n steps, each with its own success probability p. If the steps fail independently and share the same p, the chance the entire workflow succeeds is pⁿ.

Consider a 10-step workflow:

At 90% success per step, end-to-end success is about 34.9%. That means roughly two-thirds of workflows are interrupted or fail. In practice, this is prototype territory.
At 99% per step, end-to-end success rises to about 90.4%. This may look fine in a short demo, but a user running 10 workflows a day should still expect about one failure a day.
At 99.9% per step, you reach roughly 99% end-to-end success. One failure in 100 still feels unreliable to business users for critical tasks.
Only around 99.99% per step, or about 99.9% end to end, does a 10-step workflow start to “feel” like dependable enterprise-grade software, with failures rare enough to fade into the background.
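
The figures in this list can be reproduced in a few lines of Python. This is a minimal sketch that assumes every step fails independently and shares the same success probability; the function name end_to_end_success is illustrative, not taken from any library.

```python
def end_to_end_success(p: float, n: int) -> float:
    """Probability that all n independent steps succeed, each with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


if __name__ == "__main__":
    # Per-step rates from the list above, applied to a 10-step workflow.
    for p in (0.90, 0.99, 0.999, 0.9999):
        rate = end_to_end_success(p, n=10)
        print(f"per-step {p:.2%} -> end-to-end {rate:.1%}")
```

Changing n shows how quickly longer workflows erode the end-to-end rate at a fixed per-step rate.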

Real systems are even more complex. Failures are often correlated: authentication outages, rate limits, shared connectors, or indexing issues can simultaneously break multiple steps. Without hardening shared dependencies, these correlated failures dominate your reliability profile.

The implication for engineering leaders is clear: the March of Nines isn’t theoretical. It describes why “good enough” per-step metrics can still produce painful user experiences—and why teams must design for compounding reliability.

Turning reliability into SLOs, not vibes

A second Karpathy observation applies directly to AI agents: “It makes a lot more sense to spend a bit more time to be more concrete in your prompts.” The same logic applies to reliability. Vague expectations like “it usually works” are not enough. Teams that reach the later nines define concrete service-level objectives (SLOs) and measure against them.

For agentic systems, that means selecting a focused set of service-level indicators (SLIs) that describe both model behavior and the surrounding orchestration:

Workflow completion rate: percentage of runs that either succeed or explicitly escalate to a safe fallback (e.g., human review), rather than silently failing.
Tool-call success rate: share of tool or API calls that complete within timeouts and satisfy strict schema validation on inputs and outputs.
Schema-valid output rate: frequency with which models produce structurally valid JSON or argument payloads for downstream tools.
Policy compliance rate: adherence to constraints around PII, secrets, and security policies.
p95 latency and cost per workflow: time and spend for end-to-end runs, which affect usability and budgets.
Fallback rate: how often the system must fall back to safer models, cached results, or human approval.

By setting SLO targets per workflow tier—low-, medium-, and high-impact paths—and managing an explicit error budget, teams can run controlled experiments without compromising mission-critical operations.
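
As a rough illustration of how a completion SLI and an error budget might be computed from run records, consider the sketch below. The RunRecord type, the three status strings, and both function names are assumptions made for this example; a real system would read these values from its own telemetry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One workflow run. The status values are illustrative assumptions."""
    status: str  # "succeeded", "escalated", or "failed"


def completion_rate(runs: list[RunRecord]) -> float:
    """Share of runs that succeeded or explicitly escalated to a safe fallback."""
    if not runs:
        raise ValueError("no runs to measure")
    ok = sum(1 for r in runs if r.status in {"succeeded", "escalated"})
    return ok / len(runs)


def error_budget_remaining(runs: list[RunRecord], slo_target: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0 or less is exhausted."""
    allowed_failures = (1.0 - slo_target) * len(runs)
    actual_failures = sum(1 for r in runs if r.status == "failed")
    if allowed_failures == 0:
        return 0.0 if actual_failures else 1.0
    return 1.0 - actual_failures / allowed_failures
```

With a 99% target over 1,000 runs, five failed runs would consume about half the budget, which is the kind of signal a team could use to decide whether an experiment may continue.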

Nine engineering levers to add more nines

Reaching later nines is rarely about a single breakthrough. It is the cumulative effect of disciplined engineering levers applied across the workflow. The source article identifies nine such levers that repeatedly prove effective in production.

1) Constrain autonomy with an explicit workflow graph

Reliability improves when the agent operates inside bounded, well-understood states rather than free-form improvisation. In practice:

Place model calls inside a state machine or directed acyclic graph (DAG) where each node defines allowed tools, maximum attempts, and an explicit success predicate.
Persist state with idempotent keys so retries are safe, repeatable, and debuggable.

This constrains behavior and gives operators predictable hooks for handling retries, timeouts, and terminal outcomes.
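
One possible shape for such a graph is sketched below. Node, run_workflow, and the act callback (which stands in for a model or tool invocation) are hypothetical names, and state persistence with idempotent keys is left out to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Node:
    """One state in the workflow graph."""
    name: str
    allowed_tools: frozenset[str]
    max_attempts: int
    success: Callable[[dict], bool]   # explicit success predicate
    next_node: Optional[str] = None   # None marks a terminal node


def run_workflow(nodes: dict[str, Node], start: str,
                 act: Callable[[Node, dict], dict]) -> tuple[str, dict]:
    """Walk the graph and return ("completed" or "escalated", accumulated state)."""
    state: dict = {}
    current: Optional[str] = start
    while current is not None:
        node = nodes[current]
        for _ in range(node.max_attempts):
            result = act(node, state)
            if result.get("tool") not in node.allowed_tools:
                continue  # reject tool calls this node does not permit
            if node.success(result):
                state[node.name] = result
                break
        else:
            return "escalated", state  # attempts exhausted without success
        current = node.next_node
    return "completed", state
```

Because each node carries its own attempt limit and predicate, an operator can tell exactly where a run stopped and why it escalated.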

2) Enforce contracts at every boundary

Many production incidents trace back to interface drift: malformed JSON, missing fields, incorrect units, or invented identifiers.

Use JSON Schema or protobuf for all structured outputs and validate server-side before executing tools.
Standardize on enums, canonical IDs, and normalized representations for time (ISO-8601 with timezone) and units (SI).

Contracts turn loosely structured model outputs into predictable inputs for downstream systems.
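
To make the idea concrete, here is a small sketch of a server-side contract check. It uses hand-written standard-library checks instead of JSON Schema or protobuf, purely to stay self-contained. The RefundRequest shape, the Action enum, and the "ord_" identifier prefix are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Action(Enum):
    REFUND = "refund"
    CREDIT = "credit"


@dataclass(frozen=True)
class RefundRequest:
    action: Action
    order_id: str
    amount_cents: int
    requested_at: datetime


def parse_refund_request(payload: dict) -> RefundRequest:
    """Validate a model-produced payload before any tool runs; raise ValueError on drift."""
    expected = {"action", "order_id", "amount_cents", "requested_at"}
    if set(payload) != expected:
        raise ValueError(f"unexpected or missing fields: {sorted(set(payload) ^ expected)}")
    action = Action(payload["action"])  # raises ValueError for values outside the enum
    order_id = payload["order_id"]
    if not isinstance(order_id, str) or not order_id.startswith("ord_"):
        raise ValueError("order_id must be a canonical 'ord_' identifier")
    amount = payload["amount_cents"]
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount_cents must be a positive integer")
    timestamp = payload["requested_at"]
    if not isinstance(timestamp, str):
        raise ValueError("requested_at must be an ISO-8601 string")
    requested_at = datetime.fromisoformat(timestamp)  # raises ValueError if malformed
    if requested_at.tzinfo is None:
        raise ValueError("requested_at must include a timezone offset")
    return RefundRequest(action, order_id, amount, requested_at)
```

Rejecting unknown fields as well as missing ones is a deliberate choice here: it surfaces interface drift early instead of letting extra keys pass through unnoticed.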

3) Layer validators: syntax, semantics, business rules

While schema validation catches structural issues, it cannot detect “plausible but wrong” answers.

Semantic checks verify referential integrity, numeric bounds, permissions, and deterministic joins on IDs where available.
Business-rule checks capture domain constraints such as approval requirements for write actions, data residency limits, and customer-tier rules.

Layered validation ensures that a syntactically valid response cannot silently break business invariants.
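
The three layers could be chained roughly as follows. The field names, the KNOWN_CUSTOMERS lookup, and the numeric limits are placeholders chosen for this sketch, not rules from any real system.

```python
from typing import Callable, Optional

# A check returns an error message, or None when the payload passes.
Check = Callable[[dict], Optional[str]]

KNOWN_CUSTOMERS = {"cus_1", "cus_2"}  # stand-in for a real customer lookup


def check_syntax(p: dict) -> Optional[str]:
    missing = {"customer_id", "amount_cents", "approved_by"} - set(p)
    return f"missing fields: {sorted(missing)}" if missing else None


def check_semantics(p: dict) -> Optional[str]:
    if p["customer_id"] not in KNOWN_CUSTOMERS:
        return "unknown customer_id"  # referential integrity
    if not 0 < p["amount_cents"] <= 1_000_000:
        return "amount out of bounds"  # numeric bounds
    return None


def check_business_rules(p: dict) -> Optional[str]:
    if p["amount_cents"] > 50_000 and p["approved_by"] is None:
        return "large refunds need an approver"
    return None


LAYERS: list[tuple[str, Check]] = [
    ("syntax", check_syntax),
    ("semantics", check_semantics),
    ("business", check_business_rules),
]


def validate(p: dict) -> Optional[tuple[str, str]]:
    """Run the layers in order; return (layer, message) for the first failure, else None."""
    for layer, check in LAYERS:
        error = check(p)
        if error is not None:
            return layer, error
    return None
```

Returning the layer name along with the message makes it easy to count failures by type, which helps show where a workflow is weakest.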

4) Route by risk using uncertainty signals

Not all agent actions carry the same risk. High-impact operations warrant higher assurance.

Use confidence or uncertainty signals—from classifiers, consistency checks, or secondary verifier models—to drive routing decisions.
Gate sensitive steps behind stronger models, additional verification, or explicit human approval.

This turns uncertainty into a product feature instead of a hidden liability.
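
A routing policy along these lines might look like the sketch below. The impact tiers, the Route names, and every threshold are assumptions chosen for illustration; real values would come from measured outcomes on each workflow.

```python
from enum import Enum


class Route(Enum):
    AUTO = "auto"      # execute without extra checks
    VERIFY = "verify"  # send through a stronger model or secondary verifier
    HUMAN = "human"    # require explicit human approval


def route(impact: str, confidence: float) -> Route:
    """Pick an assurance path from an impact tier and a confidence score in [0, 1]."""
    if impact not in {"low", "medium", "high"}:
        raise ValueError(f"unknown impact tier: {impact!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if impact == "high":
        # High-impact actions never run unchecked, whatever the confidence.
        return Route.VERIFY if confidence >= 0.95 else Route.HUMAN
    if impact == "medium":
        return Route.AUTO if confidence >= 0.90 else Route.VERIFY
    return Route.AUTO if confidence >= 0.50 else Route.VERIFY
```

Keeping the policy in one small function makes the thresholds easy to review and to tighten during an incident.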

5) Engineer tool calls like distributed systems

Connectors and external dependencies often dominate failure rates in agentic workflows, just as they do in microservices architectures.

Apply per-tool timeouts, exponential backoff with jitter, circuit breakers, and concurrency limits.
Version tool schemas and validate responses to avoid silent breakage when APIs change.

In effect, every tool becomes a first-class distributed system component with its own reliability engineering.
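
The retry and circuit-breaker pieces can be sketched with the standard library alone. CircuitBreaker, CircuitOpenError, and call_with_retries are illustrative names, and the sketch assumes the wrapped callable enforces its own timeout and signals transient trouble by raising TimeoutError or ConnectionError.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_retries(fn: Callable[[], T], breaker: CircuitBreaker, max_attempts: int = 4,
                      base_delay_s: float = 0.2, max_delay_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn with capped exponential backoff and full jitter, guarded by a breaker."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; skipping call")
        try:
            result = fn()
        except (TimeoutError, ConnectionError):
            breaker.record(ok=False)
            if attempt == max_attempts - 1:
                raise
            cap = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0.0, cap))  # full jitter: anywhere from 0 up to the cap
        else:
            breaker.record(ok=True)
            return result
    raise ValueError("max_attempts must be at least 1")
```

Sharing one breaker per tool across workflows means a failing connector is skipped quickly everywhere, which helps limit the correlated failures described earlier. Concurrency limits are not shown.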

6) Make retrieval predictable and observable

For retrieval-augmented systems, the quality of retrieval largely determines how grounded and safe the agent will be.

Track metrics such as empty-retrieval rate, document freshness, and hit rate on labeled queries.
Roll out index changes behind canaries to detect regressions early.
Apply least-privilege access and redaction at the retrieval layer to limit data leakage risks.

Treat retrieval pipelines as versioned data products, not static infrastructure.
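
Two of these metrics, plus a simple canary comparison, could be computed as in the sketch below. LabeledQuery and the function names are hypothetical, and the 0.02 tolerance is an arbitrary example value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledQuery:
    query: str
    relevant_ids: frozenset[str]    # document IDs a reviewer marked as relevant
    retrieved_ids: tuple[str, ...]  # what the retriever returned, in rank order


def empty_retrieval_rate(results: list[LabeledQuery]) -> float:
    """Share of queries for which the retriever returned nothing."""
    if not results:
        raise ValueError("no results to measure")
    return sum(1 for r in results if not r.retrieved_ids) / len(results)


def hit_rate_at_k(results: list[LabeledQuery], k: int) -> float:
    """Share of queries with at least one relevant document in the top k."""
    if not results:
        raise ValueError("no results to measure")
    hits = sum(1 for r in results if r.relevant_ids.intersection(r.retrieved_ids[:k]))
    return hits / len(results)


def canary_regressed(baseline: float, canary: float, tolerance: float = 0.02) -> bool:
    """True when the canary index's hit rate falls more than tolerance below baseline."""
    return baseline - canary > tolerance
```

Running the same labeled queries against the current and the canary index gives a direct before-and-after comparison for each index change.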

7) Build a production evaluation pipeline

As you add nines, failures become rarer—and harder to detect before they impact users. A continuous evaluation pipeline becomes essential.

Maintain an incident-driven “golden set” built from real production traffic and run it against every change.
Use shadow mode and A/B canaries with automatic rollback when SLIs regress.

This turns every incident into a permanent test and accelerates learning loops.
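
A golden-set runner can be very small. In the sketch below, GoldenCase, run_golden_set, and gate are assumed names, and each case carries a pass/fail predicate instead of an exact expected string, since model outputs often vary in wording.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str                   # for example, the incident that produced the case
    request: str
    passes: Callable[[str], bool]  # predicate applied to the candidate's output


def run_golden_set(cases: list[GoldenCase], candidate: Callable[[str], str]) -> list[str]:
    """Return the IDs of cases the candidate fails; an empty list means no regressions."""
    failed = []
    for case in cases:
        try:
            ok = case.passes(candidate(case.request))
        except Exception:
            ok = False  # a crash counts as a failure, not a skipped case
        if not ok:
            failed.append(case.case_id)
    return failed


def gate(failed: list[str]) -> str:
    """Decide what a deployment pipeline should do with the candidate."""
    return "rollback" if failed else "promote"
```

Wiring a gate like this into the release pipeline is one way to make sure a past incident cannot quietly return with a prompt or model change.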

8) Invest in observability and operational response

Beyond a certain point, reliability gains come from faster diagnosis and remediation rather than fewer bugs.

Emit traces or spans for each step, store redacted prompts and tool I/O with strict access controls, and classify every failure into a clear taxonomy.
Equip teams with runbooks and “safe mode” toggles to disable risky tools, switch models, or require human approval during incidents.

Operational maturity is what allows rare issues to stay rare and short-lived.
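
A per-step span with a failure taxonomy might be sketched like this. The FailureClass categories, the exception-to-category mapping, and the in-memory SPANS list (a stand-in for a real trace exporter) are all assumptions made for the example.

```python
import time
from contextlib import contextmanager
from enum import Enum


class FailureClass(Enum):
    TIMEOUT = "timeout"
    DEPENDENCY = "dependency"
    VALIDATION = "validation"
    UNKNOWN = "unknown"


def classify(exc: BaseException) -> FailureClass:
    """Map an exception to a coarse failure category."""
    if isinstance(exc, TimeoutError):
        return FailureClass.TIMEOUT
    if isinstance(exc, ConnectionError):
        return FailureClass.DEPENDENCY
    if isinstance(exc, ValueError):
        return FailureClass.VALIDATION
    return FailureClass.UNKNOWN


SPANS: list[dict] = []  # stand-in for a real trace exporter


@contextmanager
def span(workflow_id: str, step: str):
    """Record the duration, status, and failure class of one workflow step."""
    record = {"workflow_id": workflow_id, "step": step,
              "status": "ok", "failure_class": None}
    start = time.monotonic()
    try:
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["failure_class"] = classify(exc).value
        raise  # the span observes the failure; it does not swallow it
    finally:
        record["duration_s"] = time.monotonic() - start
        SPANS.append(record)
```

Once every failure lands in a named category, dashboards and runbooks can be organized around those categories instead of raw stack traces.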

9) Ship an autonomy slider with deterministic fallbacks

AI systems are inherently fallible. Production software needs mechanisms to gradually increase autonomy and to retreat safely when risk rises.

Default to read-only or reversible actions; require explicit confirmation or approval flows for irreversible changes.
Provide deterministic fallbacks: retrieval-only answers, cached responses, rules-based handlers, or human escalation when confidence is low.
Expose per-tenant safe modes to disable risky tools, force stronger models, lower temperature, or tighten timeouts during incidents.
Design resumable handoffs that persist state, surface plans or diffs to reviewers, and resume from the exact step using idempotency keys.

Treat autonomy as a dial, not an on–off switch, and make the safest setting the default.
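
One way to picture the dial is a small decision function. The Autonomy levels, the Action fields, and the decide function below are illustrative assumptions. In this sketch irreversible writes always need approval, and safe mode can only lower the dial.

```python
from dataclasses import dataclass
from enum import IntEnum


class Autonomy(IntEnum):
    OFF = 0         # every action is handed to a human
    READ_ONLY = 1   # reads run automatically; all writes need approval
    REVERSIBLE = 2  # reversible writes also run automatically


@dataclass(frozen=True)
class Action:
    name: str
    writes: bool
    reversible: bool


def decide(action: Action, level: Autonomy = Autonomy.READ_ONLY,
           safe_mode: bool = False) -> str:
    """Return "execute" or "needs_approval" for an action at the current dial setting."""
    if safe_mode:
        level = min(level, Autonomy.READ_ONLY)  # an incident toggle only lowers the dial
    if level == Autonomy.OFF:
        return "needs_approval"
    if not action.writes:
        return "execute"
    if action.reversible and level >= Autonomy.REVERSIBLE:
        return "execute"
    return "needs_approval"  # irreversible writes always require approval
```

The default argument is the read-only setting, so a caller that forgets to pass a level gets the conservative behavior.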

A practical implementation pattern: the bounded step wrapper

One way to operationalize many of these levers is to wrap every model or tool invocation in a common “bounded step” abstraction. Conceptually, this wrapper:

Starts a trace span for the step.
Executes the attempted action within a fixed timeout.
Validates outputs using schema, semantic, and business-rule checks.
Retries transient failures with jittered backoff.
Retries validation failures once in a “safer” mode (for example, lower temperature or stricter prompting).
Emits metrics for successes, retries, and fallbacks.
Escalates to a human or safe fallback when retries are exhausted.

By standardizing this pattern, teams convert inherently unpredictable model behavior into policy-driven, observable, and debuggable building blocks. Over time, these bounded steps become the foundation for higher-level guarantees at the workflow level.
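
The wrapper described above could be sketched as follows. Everything here is an assumption made for illustration: the bounded_step signature, the Escalation exception, and the METRICS counter standing in for a metrics client. The sketch also assumes the action enforces the timeout it is given and raises TimeoutError or ConnectionError for transient faults. The trace span is reduced to a duration measurement.

```python
import random
import time
from collections import Counter
from typing import Callable, Optional

METRICS: Counter = Counter()  # stand-in for a real metrics client


class Escalation(Exception):
    """Raised when the step gives up and hands off to a human or safe fallback."""


def bounded_step(
    name: str,
    action: Callable[[bool, float], dict],      # action(safer_mode, timeout_s) -> output
    validate: Callable[[dict], Optional[str]],  # returns an error message, or None
    timeout_s: float = 10.0,
    max_transient_retries: int = 2,
    base_delay_s: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Run one step with retries, validation, metrics, and a terminal escalation."""
    safer_mode = False
    transient_failures = 0
    started = time.monotonic()  # simplified span start
    try:
        while True:
            try:
                output = action(safer_mode, timeout_s)
            except (TimeoutError, ConnectionError):
                transient_failures += 1
                if transient_failures > max_transient_retries:
                    raise Escalation(f"{name}: transient retries exhausted")
                METRICS[f"{name}.retry"] += 1
                # Jittered exponential backoff before the next attempt.
                sleep(random.uniform(0.0, base_delay_s * 2 ** transient_failures))
                continue
            error = validate(output)
            if error is None:
                METRICS[f"{name}.success"] += 1
                return output
            if safer_mode:
                raise Escalation(f"{name}: invalid output in safer mode: {error}")
            safer_mode = True  # one more try, for example at lower temperature
            METRICS[f"{name}.safer_retry"] += 1
    except Escalation:
        METRICS[f"{name}.escalated"] += 1
        raise
    finally:
        METRICS[f"{name}.duration_ms"] += int((time.monotonic() - started) * 1000)
```

A caller would catch Escalation at the workflow level and route the run to human review or a deterministic fallback, so every step ends in a known terminal state.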

Why CIOs and CDOs insist on the later nines

The push for later nines is not academic. Reliability gaps translate directly into business and regulatory risk. A 2025 McKinsey global survey cited in the source article reports that 51% of organizations using AI experienced at least one negative consequence, with nearly one-third pointing to issues tied to AI inaccuracy.

These experiences drive executives to demand stronger measurement, guardrails, and operational controls before allowing AI agents into core business processes. For technical leaders, that means:

Defining completion SLOs for top workflows and instrumenting terminal status codes.
Adding contracts and validators around every model output and tool interaction.
Treating connectors and retrieval infrastructure as first-class reliability work, with timeouts, circuit breakers, and canaries.
Routing high-impact actions through higher-assurance paths that incorporate verification or approval.
Turning every incident into a regression test in the production golden set.

The March of Nines is ultimately a discipline: bounded workflows, strict interfaces, resilient dependencies, and fast operational learning loops. Demos may win attention at 90%. Enterprise adoption, however, arrives only when AI agents behave like the rest of the software stack: predictable, observable, and engineered to contain failures before they reach the user.

Cary Huang
