LLM Reasoning vs. SAST: How Anthropic and OpenAI Just Rewrote AppSec Detection

Static application security testing (SAST) has been a backbone of enterprise AppSec programs for more than a decade. But two back-to-back launches from Anthropic and OpenAI are exposing a structural limitation in pattern-matching scanners that no tuning or rule pack can fix. By putting large language model (LLM) reasoning directly on top of code, both vendors are surfacing entire classes of vulnerabilities that legacy tools were never designed to see—while making those capabilities effectively free for enterprises.

For CISOs and AppSec leaders, this is not a curiosity. It is a procurement, governance, and risk timing problem that will land on the next board agenda.

The structural blind spot in traditional SAST

SAST tools were built to detect known bad patterns: insecure function calls, unsafe APIs, tainted data flows that match defined rules. That architecture works well for recurring mistakes that can be expressed as signatures or simple logic. It does not work as well for vulnerabilities that emerge only when you understand multi-file logic, algorithmic behavior, or subtle state transitions over time.
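To make the distinction concrete, here is a minimal Python sketch of both cases. It is illustrative only: the ReportBuilder class and the names in it are hypothetical, and real SAST rule packs vary in how far they track taint across objects and files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

def lookup_direct(user_id: str):
    # A known-bad pattern on a single line: tainted input interpolated
    # straight into a query. Rule-based SAST reliably flags this.
    return conn.execute(f"SELECT * FROM users WHERE id = {user_id}")

class ReportBuilder:
    """Accumulates query fragments across calls, possibly from other modules."""
    def __init__(self):
        self._clauses: list[str] = []

    def add_filter(self, clause: str) -> None:
        self._clauses.append(clause)

    def run(self):
        # The same injection, but the tainted fragment arrives via object
        # state that may have been set in another file. Per-function taint
        # rules commonly lose this flow; catching it requires reasoning
        # about how the object is used over time.
        sql = "SELECT * FROM users WHERE " + " AND ".join(self._clauses)
        return conn.execute(sql)

payload = "1 OR 1=1"                    # same attacker-controlled value
lookup_direct(payload)                  # flagged: matches a signature
builder = ReportBuilder()
builder.add_filter(f"id = {payload}")   # taint enters here...
builder.run()                           # ...and reaches the sink here
```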

Anthropic and OpenAI both attacked that ceiling by using LLMs to reason over code, instead of treating detection as a pattern-matching exercise. Their results converged on the same conclusion: no matter how mature your SAST deployment, there are classes of bugs it is structurally blind to.

This is not about one more scanner to bolt onto an already crowded pipeline. It is about recognizing that your current AppSec stack was architected for a previous generation of threats, and that reasoning-based analysis is now table stakes for discovering certain high-impact issues.

What Anthropic and OpenAI actually found

Anthropic made its first move on February 5, publishing zero-day research alongside the release of Claude Opus 4.6. Using that model, the company reported more than 500 previously unknown high-severity vulnerabilities in long-lived open-source codebases—projects that had already survived decades of expert review and extensive fuzzing.

One example, in the CGIF library, illustrates what is new. Claude identified a heap buffer overflow by reasoning about the LZW compression algorithm itself. Coverage-guided fuzzing had achieved 100% code coverage on this target but still missed the flaw. The issue emerged not from untested branches but from understanding the algorithmic behavior at a higher level than traditional tools provide.
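Anthropic has not published the proof of concept, so the following is a simplified Python model of the general failure mode, assuming a standard 12-bit LZW code table. It shows why full coverage can coexist with a missed overflow: the bug is a property of input values accumulating over time, not of an untested branch.

```python
# Simplified Python model of LZW-style table growth (illustrative only;
# the real CGIF decoder is C). Each new code appends an entry to a
# dictionary of bounded size.
TABLE_SIZE = 4096        # 12-bit LZW code space
CLEAR_CODE = 256         # resets the table
FIRST_FREE = 258         # 256 literals + clear + end-of-information

def decode(codes: list[int]) -> int:
    next_code = FIRST_FREE
    for code in codes:
        if code == CLEAR_CODE:
            next_code = FIRST_FREE
            continue
        # Modeled flaw: no check that next_code < TABLE_SIZE. In a C
        # decoder backed by a heap array of TABLE_SIZE entries, the
        # equivalent of table[next_code++] = ... becomes an out-of-bounds
        # heap write once enough codes arrive without a clear code.
        next_code += 1
    return next_code

# A short input exercises every statement and branch above, so a
# coverage-guided fuzzer can truthfully report 100% coverage:
assert decode([10, CLEAR_CODE, 10]) == FIRST_FREE + 1

# But the overflow is a property of input values, not branches: only a
# stream long enough to push next_code past TABLE_SIZE triggers it.
print(decode([1] * 5000) > TABLE_SIZE)  # True -> OOB write in the C decoder
```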

Anthropic subsequently packaged this capability as Claude Code Security, a reasoning-based vulnerability scanner available in limited research preview as of February 20 for Enterprise and Team customers, with expedited free access for open-source maintainers. The stated goal, according to Anthropic’s communications lead in an interview, is to make these defensive capabilities widely available.

OpenAI reached similar conclusions via a different path. Codex Security, derived from an internal GPT-5–powered tool called Aardvark, reached private beta in 2025 and scanned more than 1.2 million commits across external repositories. OpenAI says the system surfaced 792 critical and 10,561 high-severity findings across targets including OpenSSH, GnuTLS, GOGS, Thorium, libssh, PHP, and Chromium. The work has already resulted in 14 assigned CVEs.

During beta, OpenAI reports that Codex Security cut false-positive rates by more than 50% across repositories and reduced severity over-reporting by more than 90%. That combination—new bug classes plus lower noise—directly challenges SAST’s value proposition as the authoritative early-stage gatekeeper.

At the same time, these vendor-reported metrics are not independently audited. Anthropic and OpenAI have not subjected their detection claims to third-party verification. Security leaders should treat the numbers as directional, not certified—useful for understanding capabilities, but not sufficient to justify rip-and-replace decisions alone.

Why reasoning-based scanners don’t replace your stack

Despite eye-catching zero-day counts, neither Claude Code Security nor Codex Security is a drop-in replacement for your existing AppSec stack. Both are research previews with access constraints. Both focus on the code reasoning layer, not the entire software lifecycle.

Independent testing has also shown limitations. Checkmarx Zero researchers, for example, demonstrated that moderately complex vulnerabilities can evade Claude Code Security, and that developers can sometimes trick the agent into overlooking problematic code. In one scan of a production-grade codebase, Claude flagged eight alleged vulnerabilities, of which only two were true positives. This suggests a detection ceiling below the initial headline numbers, especially when code is deliberately obfuscated or complex.

More importantly, these LLM-based scanners do not address adjacent layers: software composition analysis (SCA), container scanning, infrastructure-as-code (IaC) checks, dynamic application security testing (DAST), or runtime detection and response. Your current tooling remains responsible for those domains, and its importance does not diminish just because a new class of static analysis emerged.

Where the impact is immediate is in pricing power and procurement logic. If enterprise-grade code reasoning becomes effectively free from major AI labs, the value proposition for standalone static scanners shifts. You are unlikely to retire SAST entirely; you are more likely to renegotiate, right-size, or re-scope what you pay for.

Vendor reactions: commoditized scanning, harder remediation

Established AppSec vendors are already framing the shift. Developer security platform Snyk has acknowledged the technical breakthrough while arguing that detection has never been the hard part. The true bottleneck is fixing vulnerabilities at scale across hundreds of repositories without breaking production.

Snyk points to research summarized in Veracode’s 2025 GenAI Code Security Report, which found that AI-generated code is 2.74 times more likely to contain vulnerabilities than human-written code. The same class of models that is now exceptionally good at zero-day hunting is simultaneously introducing new vulnerability classes every time it writes code. This tension makes remediation and guardrails, not just detection, the core problem.

Cycode’s CTO, Ronen Slavin, has characterized Claude Code Security as a genuine advancement in static analysis but emphasized that LLMs are probabilistic by nature. Enterprise security leaders, he argues, need consistent, reproducible, audit-grade results. A scanner embedded in an IDE, however powerful, does not replace platforms that enforce governance, protect CI/CD pipeline integrity, and monitor runtime behavior.

From a budget perspective, Merritt Baer, CSO at Enkrypt AI and former Deputy CISO at AWS, captures the likely trajectory: if code reasoning scanners from major AI labs are effectively free to enterprises, static code scanning commoditizes quickly. Over the next 12 months, Baer expects AppSec spending to tilt toward three areas:

Runtime and exploitability layers. Tools that understand actual exposure in production—runtime protection, attack path analysis, and contextual exploitability—become more central as raw detection gets cheaper.

AI governance and model security. Guardrails, prompt-injection defenses, and oversight of agents (including scanning agents) will absorb more budget as these models become part of critical workflows.

Remediation automation. Platforms that shorten the cycle from discovery to patching, and orchestrate changes safely at scale, become the new bottleneck and value center.

Dual-use risk: treat these findings like near–zero days

The same properties that make LLMs powerful defensive tools create uncomfortable dual-use dynamics. Anthropic and OpenAI are responsibly disclosing and coordinating patches for the vulnerabilities they find in open-source projects. But nothing prevents adversaries from aiming similar reasoning systems—via API or competing labs—at the same public codebases.

Baer’s guidance is blunt: vulnerabilities surfaced by reasoning models in widely used open source should be treated closer to zero-days than to backlog items. The real risk window is between vendor discovery and your own deployment of the fixes. That interval is where attackers operate.

Evidence from other AI security work reinforces the point. AI security startup AISLE, for example, independently discovered all 12 zero-day vulnerabilities that appeared in OpenSSL’s January 2026 security patch, including a stack buffer overflow (CVE-2025-15467) that may be remotely exploitable without valid key material. Traditional fuzzers had been aimed at OpenSSL for years and missed every one of these issues.

The implication is clear: assume capable adversaries are already running reasoning-based analysis against the same dependencies you rely on. Your advantage comes not from hidden code but from how quickly you can detect, triage, and remediate once issues surface.

Redrawing your AppSec investment map

From a strategy standpoint, reasoning-based scanners force AppSec leaders to rethink how they allocate time and budget across the lifecycle.

First, pattern-matching SAST is no longer sufficient as the sole early-stage gate. It still has value, especially for known bad patterns and regulatory checklists, but it now sits alongside LLM reasoning tools that can assess multi-file logic, state transitions, and developer intent. As Baer framed it for boards: organizations “bought the right tools for the threats of the last decade; the technology just advanced.”

Second, the center of gravity shifts from finding to fixing. With Anthropic and OpenAI compressing the detection window, your competitive advantage becomes how quickly and safely you can manage change. Any investment that reduces friction between discovery, triage, and patching—automation, better ownership mapping, change validation—directly reduces risk.

Third, governance and data protection need to catch up. These scanners operate on your source code, which is your crown-jewel intellectual property. Yet interviews with dozens of CISOs suggest that formal governance frameworks for reasoning-based scanning are still rare. Questions around training exclusion, data retention, subprocessor use, and derived IP (such as embeddings or reasoning traces) remain under-specified. Data residency for code, long an afterthought compared with customer data, is increasingly subject to export controls and national security review.

Finally, running both Anthropic and OpenAI in parallel may be a rational interim strategy. The two take different architectural approaches: Claude Code Security emphasizes contextual reasoning, data-flow tracing, and multi-stage self-verification, while Codex Security reportedly builds project-specific threat models and validates findings in sandboxed environments. Because different models “reason differently,” the delta between them can reveal issues that neither would capture consistently on its own, at least in the short term.

Seven concrete actions before your next board meeting

Board questions are coming: Which scanner are you piloting? Why did your existing tools miss vulnerabilities a model just found? You want empirical answers, not vendor slides. The following seven moves, grounded in what these tools can actually do today, will put you in a defensible position.

  1. Run both scanners on a representative subset. Select a single representative repository: large enough to be realistic, small enough to manage. Run Claude Code Security and Codex Security side by side, and compare their findings to your current SAST output; the delta is your working inventory of structural blind spots (a minimal comparison sketch follows this list). Given that both tools are in research preview with access limits, avoid whole-estate scans for now.

  2. Define governance before you pilot. Treat each scanner as a new data processor for your most sensitive IP. Negotiate data-processing terms that explicitly address training exclusion, data retention, and subprocessor chains. Build a segmented submission pipeline so only designated repositories can be scanned. Classify which code may leave your boundary and which may not. Explicitly address whether embeddings, reasoning traces, or other derived artifacts are considered your IP, and align on data residency expectations for code.

  3. Map what these tools do not cover. Make a clear inventory of remaining responsibilities: SCA, container scanning, IaC checks, DAST, runtime detection and response. Claude Code Security and Codex Security live at the code reasoning layer; everything else remains on your current stack. This exercise clarifies where renewal and reinvestment are still critical and highlights where you now have leverage in SAST negotiations.

  4. Quantify dual-use exposure in your dependencies. For each major open-source component you rely on, track where Anthropic, OpenAI, or other AI security firms have recently disclosed vulnerabilities, and use that to estimate how many of your applications transitively depend on the affected projects (a short SBOM-checking sketch also follows this list). Assume adversaries can run similar scans: your real risk is the time between upstream disclosure and your patch deployment, so invest in SBOM visibility and automated dependency hygiene.

  5. Prepare a board-ready comparison. Build a one-page, side-by-side summary of Claude Code Security, Codex Security, and your existing SAST: detection approach, coverage, integration points, and limitations. Include the key message that legacy SAST was built for known anti-patterns, while reasoning models can evaluate multi-file logic and intent. Emphasize that the new tools are additive, not replacements—yet.

  6. Track the competitive release cycle. Both Anthropic and OpenAI are moving toward IPOs; enterprise security traction feeds their growth narratives. Expect rapid model updates on monthly cycles and fast follower behavior—when one lab closes a blind spot, the other will likely target it. Consider running both tools, at least for high-value repositories, to benefit from diversity of reasoning systems while the technology is still maturing.

  7. Set a 30-day pilot window with clear exit criteria. Time-box your initial evaluation to 30 days. Run both scanners on the same codebase, measure overlap and unique findings, evaluate operational noise, and assess integration costs. Use real findings and remediation outcomes—not marketing claims—to drive your procurement and roadmap decisions.
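
For steps 1 and 7, here is a minimal sketch of the comparison step, assuming you have exported each tool’s findings to JSON. The field names and file paths are hypothetical; adapt them to the actual export formats.

```python
import json
from pathlib import Path

def load_findings(path: str) -> set[tuple[str, int, str]]:
    """Normalize a tool's findings to (file, line, cwe) keys.
    The JSON shape here is hypothetical; adapt to each tool's real export."""
    findings = json.loads(Path(path).read_text())
    return {(f["file"], f["line"], f.get("cwe", "unknown")) for f in findings}

sast = load_findings("sast_findings.json")
claude = load_findings("claude_code_security.json")
codex = load_findings("codex_security.json")
llm_union = claude | codex

print(f"SAST-only findings (check for noise vs. unique value): {len(sast - llm_union)}")
print(f"LLM-only findings (your structural blind spots):       {len(llm_union - sast)}")
print(f"Found by both LLM scanners (highest confidence):       {len(claude & codex)}")
print(f"Found by one LLM scanner only (reasoning diversity):   {len(claude ^ codex)}")
```

Exact (file, line) keys are deliberately strict for a first pass; in practice, matching on file and CWE within a small line window will surface more genuine overlap.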

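For step 4, a short sketch of checking per-application exposure against a hand-maintained list of affected components, assuming one CycloneDX JSON SBOM per application (generated, for example, by syft or your build pipeline). The component names below are placeholders.

```python
import json
from pathlib import Path

# Components with recent reasoning-model disclosures. Maintain this by hand
# or feed it from vendor advisories; the names below are placeholders.
AFFECTED = {"openssl", "libssh", "gnutls"}

def exposed_components(sbom_path: Path) -> set[str]:
    """Return affected component names present in a CycloneDX JSON SBOM."""
    sbom = json.loads(sbom_path.read_text())
    names = {c.get("name", "").lower() for c in sbom.get("components", [])}
    return names & AFFECTED

# One SBOM per application, e.g., sboms/payments-api.json.
for sbom in sorted(Path("sboms").glob("*.json")):
    hits = exposed_components(sbom)
    if hits:
        print(f"{sbom.stem}: depends on affected components {sorted(hits)}")
```
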
Anthropic and OpenAI launched their scanners 14 days apart. The gap between subsequent releases will be even shorter. Attackers are watching the same calendar. Your job is not to pick a winner in the lab race; it is to realign your detection, governance, and remediation programs to a world where reasoning about code is cheap, widely available, and no longer optional.
