
Theorem’s $6M bet: mathematically proving AI-written code is safe before it ships

As AI coding assistants race ahead, the constraint for software teams is shifting from writing code to being able to trust it. A new San Francisco startup, Theorem, is wagering that the only way to keep up is to bring mathematical rigor to AI-generated software—at industrial scale.

The company, which emerged from Y Combinator’s Spring 2025 batch, has raised $6 million in seed funding to automate formal verification of AI-written code. The round was led by Khosla Ventures, with participation from Y Combinator, e14, SAIF, Halcyon, and angels including Recursion Pharmaceuticals co-founder Blake Borgesson and Tezos co-founder Arthur Breitman.

For engineering leaders and security teams experimenting with code assistants from GitHub, Amazon, and Google, Theorem’s pitch is direct: AI can already generate more code than humans can realistically review. Unless oversight scales just as aggressively, bugs—and exploitable vulnerabilities—will slip into critical systems.

The oversight gap: AI is generating more code than humans can safely review


AI coding tools have moved from curiosity to core workflow in a few short years. Systems from major providers now produce billions of lines of code annually, accelerating development across enterprises. But the verification side of the equation has not kept up.

Jason Gross, Theorem’s co-founder, characterizes the situation as an “oversight gap” between what AI can generate and what human teams can meaningfully validate. Asked whether AI-generated code is already outpacing human review capacity, he is unequivocal: “We’re already there. If you asked me to review 60,000 lines of code, I wouldn’t know how to do it.”

For leaders rolling out AI coding agents inside large codebases, that remark captures a growing tension. The productivity gains are real: more features, faster refactors, and rapid experimentation. But the surface area of unreviewed—or lightly reviewed—logic is expanding just as quickly, especially in low-level systems where subtle bugs can be catastrophic.

Traditional mitigations don’t scale cleanly. Adding more manual code review simply cannot keep pace with the volumes involved. Even well-built testing pipelines, while indispensable, sample behavior through test cases rather than exhaustively proving that the code does what it is meant to do for all inputs.
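The sampling limitation is easy to demonstrate. In this illustrative Python sketch (invented for this article, not drawn from Theorem), a bug confined to a single input survives a typical hand-picked test suite:

```python
# Hypothetical buggy absolute-value: wrong on exactly one input.
def abs_val(x: int) -> int:
    if x == -1024:          # contrived edge-case bug for illustration
        return -1
    return x if x >= 0 else -x

# A typical sampled test suite passes...
for x in [0, 1, -1, 7, -7, 10**6]:
    assert abs_val(x) >= 0

# ...yet the property "abs_val(x) >= 0 for all x" is false:
assert abs_val(-1024) == -1   # the bug slips through the samples
```

Tests establish behavior at the points you thought to check; a proof would have rejected this function outright, because the universal claim fails.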

Theorem’s founding thesis is that as AI systems start to write large portions of the stack, organizations will need a different class of oversight—one that can grow with AI output without requiring comparable increases in highly specialized human labor.

From PhD-only tooling to AI-assisted formal verification

Theorem’s core technology is anchored in formal verification: a branch of computer science that uses mathematical proofs to show that software conforms exactly to its specification. Historically, this has been the domain of avionics software, nuclear control systems, and cryptographic protocols—places where failure is unacceptable and budgets can support years of expert effort.
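As a toy illustration of the idea (vastly simpler than verified cryptography, and not Theorem’s code), a Lean proof can certify a property for every possible input, where a test only checks one sample:

```lean
-- Hypothetical example: clamp negative integers to zero.
def clampNonNeg (x : Int) : Int := if x < 0 then 0 else x

-- A test checks a single sampled input:
example : clampNonNeg (-5) = 0 := by decide

-- A proof certifies the property for every input at once:
theorem clampNonNeg_nonneg (x : Int) : 0 ≤ clampNonNeg x := by
  unfold clampNonNeg
  split <;> omega
```

The theorem is what “conforms exactly to its specification” means in practice: the checker accepts the proof only if no input, anywhere, can violate the stated property.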

The barriers have been steep. Gross notes that conventional formal verification might require “eight lines of mathematical proof for every single line of code,” often authored by PhD-level engineers. He has firsthand experience: before founding Theorem, he completed a PhD at MIT working on verified cryptography code that now underpins the HTTPS protocol, securing trillions of internet connections. That single project consumed an estimated 15 person-years.

Theorem aims to change the economics of this process by pairing formal methods with AI models trained to generate and check proofs automatically. Instead of teams of specialists writing proofs by hand, AI systems shoulder most of the burden, with humans providing specifications and high-level guidance.

“Nobody prefers to have incorrect code,” Gross says. The reason verification has been rare in mainstream software, he argues, is simple: it was not economical. “Proofs used to be written by PhD-level engineers. Now, AI writes all of it.”

For organizations, the promise is not that formal verification becomes trivial, but that the cost and time requirements fall from years to weeks or even days—opening the door for its use beyond the most safety-critical niches.

Inside Theorem’s approach: ‘fractional proof decomposition’ and SFBench

Rather than trying to exhaustively explore every possible behavior of a system—a task that quickly becomes infeasible as complexity grows—Theorem’s system follows a strategy Gross calls “fractional proof decomposition.”

The idea is to allocate verification effort proportionally to the importance and risk profile of each component. Critical, low-level pieces of logic can be fully or near-fully verified, while less critical paths receive lighter but still principled scrutiny. This avoids the all-or-nothing trap that has historically limited formal verification deployments.
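As a rough sketch of the proportional idea (Theorem has not published its algorithm; the component names and risk weights below are invented), a fixed verification budget might be spread across components by risk weight:

```python
# Hypothetical sketch, not Theorem's actual method: split a fixed
# verification budget across components in proportion to risk weight,
# so the riskiest code gets the deepest scrutiny.

def allocate(budget_hours: float, components: dict[str, float]) -> dict[str, float]:
    """components maps name -> risk weight; returns hours per component."""
    total = sum(components.values())
    return {name: budget_hours * w / total for name, w in components.items()}

plan = allocate(100.0, {"crypto_core": 6.0, "parser": 3.0, "logging": 1.0})
# crypto_core receives the bulk of the effort; logging gets a light pass
```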

Theorem’s tooling has already uncovered issues that slipped past traditional testing. At Anthropic, the AI lab behind the Claude chatbot, the approach identified a bug that had survived conventional test suites. Gross points to this as evidence that mathematically grounded oversight can reveal faults that even careful testing misses, and do so without “expending a lot of compute.”

In a technical demonstration dubbed SFBench, Theorem used AI to translate 1,276 problems from Rocq, a formal proof assistant, into Lean, another verification language. It then automatically proved that each translation remained equivalent to the original. The company estimates this workload would have taken a human team approximately 2.7 person-years.
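In miniature, such an equivalence proof amounts to showing that a translated statement is logically interchangeable with the original. A toy Lean example (invented here, not drawn from SFBench):

```lean
-- Toy stand-ins for an "original" statement and its "translation".
def origSpec (n : Nat) : Prop := n % 2 = 0     -- one system's phrasing
def transSpec (n : Nat) : Prop := 2 ∣ n        -- phrasing after translation

-- The machine-checked bridge: both mean the same thing for every n.
theorem trans_faithful (n : Nat) : origSpec n ↔ transSpec n := by
  unfold origSpec transSpec
  omega
```

Scaling that bridge to 1,276 real problems is the part that would otherwise cost person-years of expert effort.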

Architecturally, Theorem emphasizes not only parallelism but also the ability to run agents sequentially across interdependent code. “Everyone can run agents in parallel, but we are also able to run them sequentially,” Gross explains. That matters for real-world systems, where code spans dozens of files and solutions build on each other. By contrast, many AI coding agents are constrained by context windows and struggle to track and reason about long chains of dependencies.
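One standard way to sequence work across interdependent files (a sketch of the general idea, not Theorem’s disclosed mechanism; the module graph is invented) is to process modules in dependency order, so each step builds on already-processed upstream results:

```python
from graphlib import TopologicalSorter

# Hypothetical module dependency graph: each entry maps a file to the
# files it depends on. Running agents in topological order means every
# agent sees its upstream modules already handled.
deps = {
    "parser.c": {"buffer.c"},
    "protocol.c": {"parser.c", "crypto.c"},
    "crypto.c": set(),
    "buffer.c": set(),
}

order = list(TopologicalSorter(deps).static_order())
# leaf modules (buffer.c, crypto.c) are scheduled before their dependents
```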

For engineering leaders, the practical implication is that Theorem is not just targeting isolated functions or toy benchmarks. Its design is oriented toward the messy, interconnected subsystems that are hardest to validate with today’s AI tooling.

Case study: From a 1,500-page spec to 16,000 lines of trusted code


Theorem is already working with early customers in AI research labs, electronic design automation, and GPU-accelerated computing. One case study highlights how the approach can compress complex specifications into verifiable implementations.

A customer arrived with a 1,500-page PDF specification and a legacy implementation plagued by memory leaks, crashes, and other difficult-to-reproduce bugs. The performance gap was stark: the existing system delivered around 10 megabits per second. The target was 1 gigabit per second—a 100x increase—without sacrificing correctness.

Using its system, Theorem generated 16,000 lines of production code to replace the problematic implementation. Notably, the customer deployed this code without manually reviewing it line by line. Their confidence rested instead on two artifacts:

  • A compact executable specification—a few hundred lines that distilled the intent of the 1,500-page document into precise, machine-checkable form.
  • An equivalence-checking harness that verified the new implementation behaved exactly as the specification dictated.
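The pattern can be sketched in miniature. In this hypothetical Python example (a toy length-prefixed record format, not the customer’s actual protocol), a short executable spec doubles as the oracle that an equivalence harness checks a faster implementation against; a formal proof would replace the random sampling with coverage of all inputs:

```python
import random

# Executable spec: length-prefixed records, written for clarity, not speed.
def spec_parse(data: bytes) -> list[bytes]:
    records, i = [], 0
    while i < len(data):
        n = data[i]
        records.append(data[i + 1 : i + 1 + n])
        i += 1 + n
    return records

# Stand-in for an optimized implementation (here: zero-copy memoryview slicing).
def fast_parse(data: bytes) -> list[bytes]:
    view, records, i = memoryview(data), [], 0
    while i < len(data):
        n = view[i]
        records.append(bytes(view[i + 1 : i + 1 + n]))
        i += 1 + n
    return records

# Equivalence harness: check the fast version against the spec on many inputs.
def check_equivalence(trials: int = 1000) -> None:
    rng = random.Random(0)
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        assert fast_parse(data) == spec_parse(data)

check_equivalence()
```

Trust shifts from reading the optimized code to auditing a few dozen lines of spec plus the harness that binds the implementation to it.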

According to Gross, the result is a production-grade parser capable of operating at 1 Gbps, with the assurance that “no information is lost during parsing.” For teams, the striking detail is not just the performance gain, but the shift in trust model: confidence anchored in proofs and checked equivalence, rather than exhaustive human reading of auto-generated code.

For organizations considering AI-generated components in performance-critical paths—networking stacks, data pipelines, or protocol parsers—this case illustrates one possible pattern: push the complexity into a rigorous, executable spec, then let AI-assisted verification guarantee that the implementation faithfully follows it.

Security implications: AI-written bugs in critical infrastructure


Theorem’s funding arrives amid broader anxiety about the reliability of AI systems embedded in critical infrastructure. Software already orchestrates financial markets, medical devices, transportation, and power grids. As AI accelerates how quickly that software is changed and extended, it also increases the risk that subtle bugs—and exploitable flaws—propagate widely before they are caught.

Gross frames the problem explicitly in security terms. With AI tooling, the cost of mounting sophisticated attacks is falling. Automated vulnerability discovery and exploit generation put pressure on defenders whose practices still rely heavily on manual processes and reactive patching.

“Software security is a delicate offense-defense balance,” he says. “With AI hacking, the cost of hacking a system is falling sharply. The only viable solution is asymmetric defense. If we want a software security solution that can last for more than a few generations of model improvements, it will be via verification.”

In this view, formal verification is not merely an assurance tool but a strategic lever: a way to build defenses that do not require linear increases in human effort to match future waves of automated attacks. Mathematics becomes a way to lock in certain guarantees, even as offensive capabilities evolve.

On the regulatory front, Gross is blunt when asked whether formal verification should be required for AI-generated code in critical systems. “Now that formal verification is cheap enough,” he argues, “it might be considered gross negligence to not use it for guarantees about critical systems.”

For CISOs and compliance leaders, that statement hints at a possible future trajectory: as verification tools become more accessible, expectations around how critical software is validated may tighten, whether through regulation, industry standards, or liability norms.

How Theorem positions itself in the AI verification landscape

Theorem is entering a space where multiple startups and research groups are exploring intersections between AI and formal methods. According to Gross, the company’s differentiation is its narrow focus on software oversight rather than on general mathematics or proof automation for its own sake.

“Our tools are useful for systems engineering teams, working close to the metal, who need correctness guarantees before merging changes,” he says. The emphasis is on practical integration into workflows that ship code, particularly in low-level and performance-sensitive domains.

The founding team reflects this applied orientation. Gross brings deep expertise in programming language theory and a track record of getting verified code into production at internet scale. Co-founder Rajashree Agrawal, a machine learning research engineer, focuses on training the AI models that drive Theorem’s verification pipeline.

Agrawal describes the ambition as enabling “formal program reasoning so that everyone can oversee not just the work of an average software-engineer-level AI, but really harness the capabilities of a Linus Torvalds-level AI,” referencing the creator of Linux. The analogy underscores a key tension for organizations: if AI systems eventually reach—or surpass—elite human engineering capabilities, traditional review practices will be even less able to keep up.

By constraining its mission to scaling oversight rather than expanding AI’s creative capacity, Theorem is effectively betting that the next bottleneck in AI-assisted development will be trust, not throughput.

What’s next: scaling verification before AI scales everything else

Theorem currently employs four people and plans to use its $6 million seed round to expand the team, increase compute resources for training verification models, and enter new verticals including robotics, renewable energy, cryptocurrency, and drug synthesis.

Each of these domains shares a common characteristic: complex systems where software defects can have outsized financial, safety, or societal impact. As organizations in these areas adopt AI coding tools, they may face especially intense pressure to demonstrate that rapid iteration does not compromise reliability.

More broadly, Theorem’s emergence hints at a shift in how enterprises will evaluate AI coding platforms. The initial wave emphasized productivity metrics: lines of code generated, tasks automated, developer time saved. Theorem is betting that the next wave will force a harder question: can you prove that this acceleration has not introduced hidden failure modes?

Gross frames the stakes in long-range terms. AI systems are improving exponentially, he argues, and if that continues, “superhuman software engineering” becomes plausible: systems capable of designing and evolving software beyond what human teams could manage alone.

“And without a radically different economics of oversight,” he warns, “we will end up deploying systems we don’t control.”

For engineering and security leaders, that framing suggests a strategic decision point. As AI agents take on more of the coding workload, the choice is not just which assistant to use, but which verification and assurance stack will stand behind it. The machines may be writing the code—but someone, or something, will need to check their work before it goes live.
