TrueFoundry’s TrueFailover Aims to Keep Enterprise AI Online When Model Providers Go Dark

For many enterprises, large language models are no longer experiments in a lab—they are threaded into revenue-generating workflows and regulated processes. That reality came into sharp focus during OpenAI’s outage in December, when a pharmacy customer of enterprise AI infrastructure provider TrueFoundry briefly faced the prospect of being unable to refill prescriptions. Every second of disruption meant lost revenue and, more importantly, patients waiting for medications.

TrueFoundry is now formalizing its answer to that risk with TrueFailover, a new resilience layer that aims to keep production AI systems online when model providers slow down, degrade in quality, or go dark altogether. Built on top of the company’s existing AI Gateway, which already serves billions of requests monthly for large enterprises, TrueFailover is designed for teams that can no longer afford to treat outages as rare edge cases.

This explainer unpacks what TrueFailover does, how it works, and what it cannot solve—so AI leaders, platform engineers, and SREs can better evaluate how to build resilience into their own stacks.

From experimental GenAI to production-critical dependencies

Only a short time ago, most enterprise AI teams were running pilots and proofs of concept. Generative AI and agentic systems were often confined to internal tools: document summarization, code assistance, or knowledge search. If an LLM went down, the impact was inconvenient but rarely existential.

That context has changed quickly. According to TrueFoundry’s account, AI now sits directly in the path of critical user journeys and revenue streams: handling prescription refills in pharmacies, generating sales proposals, powering customer support agents, and assisting software development at scale. When the underlying models stall or fail, the disruption ripples through business operations, SLAs, and brand perception.

The December OpenAI outage crystallized these stakes. TrueFoundry’s pharmacy customer depended on LLMs for prescription refills, where “seconds of downtime” translated into thousands of dollars in lost revenue and delayed patient access to medications. In this case, the customer had already implemented a failover setup with TrueFoundry and was able to shift traffic to another provider within minutes. Without that preparation, recovery would likely have taken hours.

Importantly, the risks are not limited to headline-grabbing, full provider outages. Partial failures are increasingly common and often more insidious: models that slow just enough to break UX expectations, or that start returning lower-quality outputs while remaining technically “up.” These scenarios erode trust, violate SLAs, and are harder to catch with traditional infrastructure monitoring, which was built around binary health states rather than nuanced model performance.

For AI platform owners, the shift from experimentation to production means reliability discussions must evolve beyond model benchmarks and feature sets. The core question becomes: what happens when the model that powers a key business process falters at the worst possible moment?

The fragility of single-provider AI architectures

Most enterprises today still rely heavily on a single large language model provider—commonly OpenAI, Anthropic, Google, or another major lab—much as early cloud adopters bet on a single IaaS provider. On the surface, this mirrors traditional cloud architectures. In practice, LLMs introduce a different reliability profile.

Cloud platforms like AWS, Azure, and Google Cloud offer mature uptime guarantees backed by decades of operational experience and a clear model for redundancy. Foundation models, by contrast, are extremely large, shared systems that are costly to run and still evolving in how they are provisioned and managed. Even at scale, these providers experience periodic outages, latency spikes, and quota-driven slowdowns.

TrueFoundry’s CEO Nikunj Bajaj notes that major LLM providers experience outages or significant performance events every few weeks or months, with visible downstream impact on customers that depend on a single provider. At the scale of a high-volume pharmacy or a large call center, even small windows of unavailability add up quickly in both revenue and user experience.

Shared-resource economics exacerbate the problem. Many foundation models are served from pooled capacity shared across numerous customers. When a few workloads spike, they can trigger latency or instability that affects others—especially customers that do not pay for dedicated capacity. While some providers do offer higher-uptime options, these typically correlate with reserved infrastructure and substantially higher cost, and even those tiers may spill back into shared resources when traffic exceeds quotas.

For AI leaders, the implication is straightforward but uncomfortable: traditional cloud-style SLAs are not yet the norm for LLMs, and may remain difficult to achieve at reasonable cost. Enterprises that build critical paths on top of single-provider, shared models are accepting a level of operational risk that many would not tolerate in other parts of their stack.

Inside TrueFailover: a resilience layer atop TrueFoundry’s AI Gateway


TrueFailover is positioned as an answer to this gap: a horizontal resilience layer that sits above model providers and below applications. It builds on TrueFoundry’s AI Gateway, which the company says already handles more than 10 billion requests per month for Fortune 1000 and other large customers, giving it a broad vantage point on provider behavior.

At a high level, TrueFailover weaves several capabilities into a single control plane for AI reliability:

1. Multi-model, multi-provider routing. Enterprises can define primary and backup models across different providers. If a primary model from, say, OpenAI becomes unavailable or underperforms, traffic can be routed to alternatives such as Anthropic, Google’s Gemini, Mistral, or self-hosted models, depending on what the customer has configured. This rerouting is intended to be transparent to application teams, without requiring code changes in the critical path.

2. Multi-region and multi-cloud resilience. TrueFailover extends the same logic across geographic regions and infrastructure providers. By distributing AI endpoints across zones and clouds, it can perform health-based routing at a regional level—turning what might have been a global incident into a localized adjustment users never see. In practice, this may mean moving workloads across cloud regions or between on-premises and cloud deployments, as long as those paths are pre-approved.

3. Degradation-aware routing. Perhaps the most distinctive element is that TrueFailover is not limited to binary “up/down” monitoring. It continuously evaluates latency, error rates, and quality-adjacent signals to identify when a model is drifting toward unusable performance even while ostensibly online. Rising response times, increasing error patterns, and signs of instability are combined into a composite view that feeds an AI-driven decision system for rerouting.

4. Strategic caching and rate-limit protection. To absorb traffic spikes and protect against provider-level rate limiting, TrueFailover incorporates caching strategies that reduce unnecessary calls and smooth bursty demand. This is aimed at preventing cascading rate-limit errors and “brownouts” where throughput drops unpredictably during load surges.
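To make the caching and rate-limit idea concrete, here is a minimal sketch of a client-side wrapper, assuming a generic `call_model` function; the class, parameters, and defaults are invented for illustration and are not TrueFoundry's API. It pairs a TTL cache (identical prompts are served without a provider call) with a token bucket that smooths bursts instead of letting them trip provider rate limits.

```python
import hashlib
import time
from collections import OrderedDict

class CachedLimitedClient:
    """Illustrative wrapper: TTL cache plus a token bucket to smooth bursts."""

    def __init__(self, call_model, ttl_s=300, max_entries=10_000,
                 rate_per_s=50, burst=100):
        self.call_model = call_model          # underlying model call (hypothetical)
        self.ttl_s = ttl_s
        self.max_entries = max_entries
        self.cache = OrderedDict()            # key -> (expires_at, response)
        self.tokens = burst                   # token-bucket state
        self.burst = burst
        self.rate_per_s = rate_per_s
        self.last_refill = time.monotonic()

    def _take_token(self):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last_refill) * self.rate_per_s)
        self.last_refill = now
        if self.tokens < 1:
            # Smooth the burst locally instead of letting the provider rate-limit us.
            time.sleep((1 - self.tokens) / self.rate_per_s)
            self.tokens = 1
        self.tokens -= 1

    def complete(self, prompt):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        hit = self.cache.get(key)
        if hit and hit[0] > time.monotonic():
            return hit[1]                     # cache hit: no provider call at all
        self._take_token()
        response = self.call_model(prompt)
        self.cache[key] = (time.monotonic() + self.ttl_s, response)
        self.cache.move_to_end(key)
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)    # evict the oldest entry
        return response
```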

Collectively, these layers represent a shift from reactive, manual response to proactive, policy-driven resilience. Instead of an SRE team scrambling during an outage to manually switch endpoints or reconfigure clients, the idea is that the system detects trouble early and executes pre-planned failover paths automatically.
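The "pre-planned failover paths" in items 1 and 2 above can be pictured as an ordered escalation list defined ahead of time: same model in another region first, then approved backup providers, then self-hosted capacity. The sketch below is a hypothetical illustration of that shape, not TrueFoundry's configuration format; `probe_health` stands in for whatever health or degradation signal the gateway exposes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Endpoint:
    provider: str   # e.g. "openai", "anthropic", "self-hosted"
    model: str
    region: str

# Ordered failover path, defined up front by the platform team:
# regional rebalancing first, cross-provider switching second.
FAILOVER_PATH = [
    Endpoint("openai", "gpt-4o", "us-east"),            # primary
    Endpoint("openai", "gpt-4o", "us-west"),            # same model, other region
    Endpoint("anthropic", "claude-sonnet", "us-east"),  # approved backup provider
    Endpoint("self-hosted", "llama-70b", "on-prem"),    # last resort
]

def route(probe_health):
    """Return the first healthy endpoint in the pre-approved path."""
    for endpoint in FAILOVER_PATH:
        if probe_health(endpoint):
            return endpoint
    raise RuntimeError("no approved endpoint is healthy")

# Example: during an East Coast incident, traffic lands on us-west first.
print(route(lambda ep: ep.region != "us-east"))
```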

Detecting slow failures and quality degradation before users feel it


For SREs and platform engineers, outright outages are relatively straightforward to detect: error rates spike, health checks fail, and dashboards turn red. The harder challenge lies in the “slow but up” category—when latency creeps upward, responses become inconsistent, or quality declines in ways that are highly visible to users but subtle in standard infrastructure metrics.

TrueFailover’s degradation-aware routing is designed specifically for these nuanced scenarios. Rather than relying only on simple health endpoints, it monitors:

  • Latency trends: sustained increases in response time beyond normal variance.
  • Error rate patterns: not just raw error counts, but shapes of failure that suggest instability versus isolated issues.
  • Quality-related signals: while the article does not detail exact metrics, Bajaj describes using a combination of indicators that together point to deteriorating model behavior.

Individually, any one of these signals may be ambiguous. In aggregate, they can serve as an early warning system that a model is entering a degraded state—especially in shared-resource environments where one customer’s surge can affect others. TrueFailover then uses this composite view to decide when to begin shifting traffic, and how aggressively to do so, before end users experience a noticeable drop in quality.
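One way to picture the composite view is a weighted score over a recent window of requests. The weights, thresholds, and field names below are invented for illustration; the article does not disclose TrueFoundry's actual detection logic.

```python
from statistics import median

def degradation_score(samples, baseline_latency_ms,
                      w_latency=0.5, w_errors=0.3, w_quality=0.2):
    """Combine latency drift, error rate, and a quality signal into one score.

    `samples` is a non-empty list of dicts with keys: latency_ms, is_error,
    and quality (0.0-1.0, from whatever quality-adjacent signal is available).
    Returns a score in [0, 1]; higher means more degraded.
    """
    latency_ratio = median(s["latency_ms"] for s in samples) / baseline_latency_ms
    latency_term = min(1.0, max(0.0, (latency_ratio - 1.0) / 2.0))  # saturates at 3x baseline
    error_term = sum(s["is_error"] for s in samples) / len(samples)
    quality_term = 1.0 - sum(s["quality"] for s in samples) / len(samples)
    return w_latency * latency_term + w_errors * error_term + w_quality * quality_term

# A model can be "up" yet degraded: start shifting some traffic above a soft
# threshold, fail over fully above a hard one (values are illustrative).
SOFT_THRESHOLD, HARD_THRESHOLD = 0.3, 0.6
```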

Because TrueFoundry operates this logic across multiple customers and workloads, it arguably has a broader window into systemic issues than any single enterprise observing only its own logs. That vantage point underpins the product’s premise: that a shared resilience layer can detect and mitigate provider-level issues faster and more consistently than fragmented, per-team responses.

Still, the article does not claim perfect detection or universal quality guarantees. The system’s effectiveness depends heavily on the policies, thresholds, and backup options that a given enterprise configures up front.

Switching models without breaking prompts—or user expectations

Even if routing away from a failing provider is technically simple, the semantic challenge is harder: different models respond differently to the same prompt. For production systems tuned for a particular LLM’s behavior, naive failover risks silently degrading output quality or breaking workflows.

Bajaj describes a spectrum of approaches enterprises take today:

  • “Good enough” prompt portability. Some teams rely on the fact that modern large models are sufficiently capable that minor prompt differences do not materially affect outcomes. In this mode, switching from one provider to another may lead to some visible change in behavior, but teams accept that trade-off to maintain uptime.
  • Provider-specific prompts. More mature setups maintain distinct prompts for different providers, tailored and tested to achieve comparable results. In these systems, failover is not just about switching models but switching to an associated configuration that has been validated in advance.

TrueFailover is built to support the latter pattern at scale. When traffic shifts between models, it can also shift prompts and related configuration dynamically, routing each request with the appropriate provider-specific setup. The aim is to keep output quality within agreed tolerances while avoiding manual interventions during incidents.
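A minimal sketch of that pattern, with all names hypothetical: each approved provider carries its own pre-validated prompt and settings, so a failover selects a (model, configuration) pair rather than a model alone.

```python
# Provider-specific configurations, tuned and validated ahead of time.
PROMPTS = {
    "openai": {
        "model": "gpt-4o",
        "system": "You are a pharmacy assistant. Answer in under 80 words.",
        "temperature": 0.2,
    },
    "anthropic": {
        "model": "claude-sonnet",
        # Tuned separately: the same wording does not behave identically here.
        "system": "You are a concise pharmacy assistant. Be brief and factual.",
        "temperature": 0.3,
    },
}

def build_request(provider, user_message):
    """Pair the failover target with its pre-validated configuration."""
    cfg = PROMPTS[provider]
    return {
        "model": cfg["model"],
        "temperature": cfg["temperature"],
        "messages": [
            {"role": "system", "content": cfg["system"]},
            {"role": "user", "content": user_message},
        ],
    }
```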

Critically, this is framed as a planned behavior, not an ad-hoc reaction. Enterprises are expected to define the failover logic, prompts, and guardrails ahead of time. When issues occur, the system executes those pre-defined paths rather than improvising.

Not every failover event requires cross-provider transitions, either. Bajaj notes that many incidents can be handled by simply shifting to the same model in a different region—such as moving from an East Coast to a West Coast endpoint—where no prompt adjustments are required. Geographic rebalancing often serves as the first line of defense before invoking more complex provider changes.

Guardrails for regulated industries: failover without losing compliance

Automatic routing of AI traffic naturally raises alarms in regulated sectors like healthcare and financial services, where data residency, access control, and provider selection are tightly governed. A pharmacy cannot simply send patient data to any available model; a bank cannot casually route financial data across jurisdictions or to unapproved vendors.

TrueFoundry addresses this by constraining TrueFailover’s autonomy to what enterprises explicitly approve through configuration. According to Bajaj, the system:

  • Will not route data to any model or provider that has not been explicitly authorized.
  • Operates within an admin-defined configuration layer where teams specify allowable models, providers, regions, and even model categories (such as closed-source vs. open-source).
  • Treats non-approved models as out of bounds for routing, regardless of availability.

The intent is to give central platform and compliance teams full control over the “envelope” in which automatic failover can operate. Within that envelope, TrueFailover can act quickly; outside it, routing will simply not occur.
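A sketch of how such an envelope might be enforced in code, assuming policy is expressed per provider, model, and region; the structure and names are illustrative, not TrueFoundry's implementation. The check runs before any routing decision, so a non-approved target is rejected even if it is the only healthy one.

```python
# Admin-defined envelope: only these combinations may ever receive traffic.
APPROVED = {
    ("openai", "gpt-4o", "us-east"),
    ("openai", "gpt-4o", "us-west"),
    ("self-hosted", "llama-70b", "on-prem"),
}

def assert_in_envelope(provider, model, region):
    """Refuse to route outside the approved envelope, regardless of health."""
    if (provider, model, region) not in APPROVED:
        raise PermissionError(
            f"routing to {provider}/{model} in {region} is not authorized"
        )
```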

This design reflects lessons from existing TrueFoundry deployments. The company describes a Fortune 50 healthcare customer that uses its platform to power more than 500 million IVR calls a year via an agentic AI system, while running across both cloud and on-premises environments under strict data residency requirements. Environments like this—hybrid, regulated, and high-volume—are the ones where misconfigured failover could easily cross compliance boundaries if not tightly governed.

For enterprise AI leaders, the key takeaway is that resilience planning must be done jointly with security and compliance. The architecture can support rapid failover, but only along paths that have been deliberately vetted.

What TrueFailover can’t fix: limits, failure modes, and realistic risk


Despite its ambitions, TrueFailover is not presented as a cure-all for AI reliability. The article highlights several classes of risk where the system either has limited impact or none at all—important caveats for engineers designing end-to-end resilience.

1. Large-to-small model failover without rethinking expectations. If teams configure failover from a high-capacity, high-quality model to a significantly smaller one without adjusting prompts or quality expectations, TrueFailover can move the traffic but cannot make the smaller model behave like the larger one. In other words, the routing layer cannot compensate for fundamental capability gaps.

2. Single-point infrastructure failures. When an enterprise runs all of its self-hosted models on a single GPU cluster or a tightly coupled environment, there may simply be nowhere to fail over to. If that cluster fails, no amount of routing logic can recover capacity that does not exist. This underscores that resilience still depends on underlying infrastructure diversity.

3. Hypothetical simultaneous provider-wide failures. The scenario where multiple major providers go entirely offline at the same time often appears in risk conversations. Bajaj argues that, in practice, “going down” usually means partial outages: specific models or regions affected by capacity constraints or traffic spikes. In those more common situations, TrueFailover’s layered routing—across on-premises and cloud, across regions, and across models—can significantly reduce the chance of complete outages for end users.

However, if a customer’s guardrails are narrow (few backup models, limited regions, single cloud), the value of those layers diminishes. Ultimately, the breadth of redundancy—and the investment in prompts and configuration to support it—determines how much protection TrueFailover can deliver.

For SREs, these constraints are familiar: a routing and orchestration layer can only leverage the redundancy that architects have actually provisioned. The product’s role is to make use of that redundancy intelligently, not to conjure capacity or capability that an enterprise did not plan for.

TrueFoundry’s bet on AI reliability as a platform layer

TrueFoundry’s move into formalized failover comes after several years building AI infrastructure for large-scale deployments. Founded in 2021 by former Meta engineers Nikunj Bajaj, Abhishek Choudhary, and Anuraag Gutgutia, the San Francisco-based startup initially focused on accelerating machine learning deployments before pivoting to generative AI in 2023 as enterprise interest surged.

The company reports tangible traction: more than 30 paid customers, over $1.5 million in annual recurring revenue last year, and more than 1,000 clusters managed for machine learning workloads. In February 2025, it raised a $19 million Series A led by Intel Capital with participation from Eniac Ventures, Peak XV Partners, Jump Capital, and several notable angel investors, bringing total funding to $21 million.

Customer examples illustrate the range of use cases that inform TrueFailover’s design:

  • Nvidia uses TrueFoundry to build multi-agent systems that optimize GPU cluster utilization across data centers, where even small efficiency gains have outsized impact given GPU scarcity.
  • Adopt AI routes more than 15 million requests and 40 billion input tokens through TrueFoundry’s AI Gateway to power enterprise agentic workflows.
  • Games24x7 serves models to more than 100 million users at over 200 requests per second via the platform.
  • Whatfix migrated to a microservices architecture on TrueFoundry, reporting a sixfold acceleration in release cycles and a 40% reduction in testing time.

TrueFailover will be sold as an add-on module atop the existing AI Gateway and platform. Pricing is usage-based, tied to traffic volume and the breadth of the configuration—number of users, models, providers, and regions involved. An early access program for design partners is opening in the near term, signaling that TrueFoundry intends to iterate the product with close input from its existing enterprise base.

From a market perspective, TrueFoundry is betting that as more companies embed AI into mission-critical processes, they will view reliability as a distinct platform concern, not just a feature of individual providers. In that framing, a neutral resilience layer—capable of spanning multiple clouds, models, and hosting patterns—becomes a strategic component of the enterprise AI stack.

The new reliability checklist for AI-first business processes

Behind TrueFailover’s launch is a broader shift in how CIOs, heads of platform, and SRE leaders think about AI risk. As Bajaj puts it, earlier waves of GenAI usage were largely internal and non-critical; outages did not immediately affect top-line revenue or public perception. Today, many organizations are running public-facing, revenue-impacting, and regulated workflows on top of LLMs.

That change reframes the operational questions:

  • Not just “Which model scores best on benchmarks?” but “What is our plan if that model degrades or fails?”
  • Not just “Which provider has the most compelling roadmap?” but “How many independent paths do we have to serve this workload within our compliance constraints?”
  • Not just “Can we ship this AI feature?” but “Can we keep it reliable enough that users and regulators trust it?”

TrueFoundry’s answer is that reliability in the AI era requires layered redundancy: multiple models, multiple regions, multiple clouds or hosting environments, and an orchestration layer capable of using them intelligently. It also requires doing the hard work up front—defining guardrails, tuning prompts for alternative models, and aligning with compliance teams—so that when outages or degradations occur, the response is automated and planned, not chaotic and improvised.

Somewhere, a pharmacist is refilling a prescription, a support agent is resolving a critical issue, or a sales team is finalizing a proposal. In each case, a failure in an upstream LLM could now be enough to stall the entire workflow. TrueFoundry is wagering that enterprises will invest in making those dependencies resilient—and that a dedicated failover layer can be the bridge between today’s imperfect AI infrastructure and the reliability expectations of mission-critical systems.
