Security teams are buying AI defenses that do not hold up once attackers adapt. A joint October 2025 study by researchers from OpenAI, Anthropic, and Google DeepMind is unequivocal: when tested under realistic, adaptive attack conditions, 12 published defenses against LLM jailbreaks and prompt injections were all bypassed, most with success rates above 90%. For CISOs and AI program owners, this is not a theoretical concern; it is a sign that much of today’s AI security tooling has been validated against attackers that behave nothing like real adversaries.
The uncomfortable verdict: ‘near-zero’ attack rates collapse under adaptive pressure
The research paper, titled “The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections,” set out to test a wide range of current defensive strategies: prompting-based guardrails, training-based alignment methods, and input/output filtering approaches. Many of the evaluated defenses had been published with claims of near-zero attack success rates.
Under adaptive testing, those claims did not survive contact with reality. Prompting-based defenses that looked strong on paper yielded attack success rates between 95% and 99%. Training-based defenses, including approaches designed to make the underlying model safer, fared no better; bypass rates reached 96% to 100% when adversaries were allowed to iterate.
The authors invested in a rigorous methodology rather than quick demonstrations. Fourteen researchers participated, and they created a $20,000 prize pool to incentivize and validate successful attacks. Critically, they did not limit themselves to static test suites. Instead, they modeled realistic attackers who study a defense, probe its behavior, adjust strategies, and repeat until they succeed.
For enterprises, the implication is direct: AI security products tested only against one-shot or static benchmarks are likely overestimating their strength. Defenses that look robust in a controlled lab can collapse once an adversary adapts—even slightly—to how a guardrail or filter behaves.
Why stateless controls and WAF-style thinking fail at the inference layer
Many organizations are trying to extend familiar concepts—such as web application firewalls (WAFs)—into the AI era. The research underscores why that mindset is dangerously incomplete.
Traditional WAFs are largely stateless. They inspect individual HTTP requests or payloads in isolation, often using signatures and simple heuristics. Modern AI attacks, by contrast, operate across conversation turns, modify their behavior in response to model output, and exploit semantics rather than syntax.
The study evaluated defenses against established jailbreak methods, including Crescendo and Greedy Coordinate Gradient (GCG):
- Crescendo breaks a malicious objective into innocuous-sounding fragments spread across as many as ten conversational turns. The attacker builds rapport and context gradually until the model eventually complies. No individual message looks obviously malicious.
- Greedy Coordinate Gradient (GCG) is an automated technique that generates adversarial suffixes by optimizing against the model’s gradients. The result is a jailbreak payload that looks opaque or random to human reviewers and signature-based filters.
Both are published methodologies with working code, not speculative attacks. Yet stateless or pattern-based filters struggle with them because each technique targets a different blind spot—context loss, automation, semantic obfuscation—but all exploit the same underlying assumption: that the defense will not adapt to them over time.
As Carter Rees, VP of AI at Reputation, put it, even a seemingly harmless phrase like “ignore previous instructions” or a Base64-encoded payload can be as impactful for an AI system as a buffer overflow was for traditional software. The difference is that these attacks operate at the semantic layer, which signature- or string-based detection simply does not parse. For CISOs used to tuning WAF rules, this is a fundamental architectural shift: you are now defending meaning, not just bytes.
Deployment is surging while AI security lags badly
The fragility of current defenses would be worrying in any environment. In today’s deployment context, it is dangerous.
Gartner projects that by the end of 2026, 40% of enterprise applications will include task-specific AI agents, up from less than 5% in 2025. That is an order-of-magnitude change in exposure in roughly a year. The research suggests the security controls being deployed alongside that growth are often not built—or tested—for adaptive adversaries.
Meanwhile, attacker tradecraft is already shifting. CrowdStrike’s 2025 Global Threat Report found that 79% of detections were “malware-free,” relying instead on hands-on keyboard techniques that sidestep traditional endpoint defenses. Adam Meyers, SVP of Counter Adversary Operations at CrowdStrike, highlighted that the fastest breakout time they observed was just 51 seconds. Defenders are being forced to respond in near real-time while adversaries abandon detectable payloads in favor of abusing legitimate tools and access.
Anthropic’s disruption of what it described as the first documented AI-orchestrated cyber operation in September 2025 shows how quickly this can escalate. Attackers used AI to execute thousands of requests—often multiple per second—while reducing direct human involvement to just 10%–20% of total effort. Campaign durations that previously stretched over three to six months were compressed into 24 to 48 hours. When paired with weak access controls, this acceleration is especially costly: the IBM 2025 Cost of a Data Breach Report noted that 97% of organizations suffering AI-related breaches lacked proper access controls.
Large enterprises are acutely aware of the emerging risk. Jerry Geisler, EVP and CISO of Walmart, has warned that agentic AI introduces entirely new threats that bypass conventional controls—ranging from data exfiltration and autonomous misuse of APIs to covert cross-agent collusion. Those behaviors are precisely the kind that stateless, request-by-request defenses are least equipped to spot.
Four attacker profiles already exploiting AI defense gaps
The OpenAI–Anthropic–DeepMind study makes clear that these weaknesses are not hypothetical. The same patterns the researchers used to break academic defenses are already visible in the wild across four attacker profiles mapped by Carter Rees.
First, the authors emphasize that “security through obscurity” is ineffective at the inference layer. Defensive techniques and configurations inevitably show up in internet-scale training data. As models learn both offensive and defensive patterns, they can help attackers adapt on the fly. In other words, once a heuristic becomes common, it is likely to be reverse-engineered—by people and by AI systems alike.
The four profiles exploiting today’s gaps are:
- External adversaries who operationalize published research. They incorporate attacks like Crescendo, GCG, and ArtPrompt into campaigns, tuning their approaches to the specifics of each target’s defenses, just as the researchers did. The moment a guardrail becomes predictable, it becomes a parameter to optimize around.
- Malicious B2B clients with legitimate API access who turn their position into an intelligence channel. They probe model outputs to infer training data, extract proprietary information through inference attacks, or refine reinforcement learning-based attacks. The research found that reinforcement learning strategies could succeed in black-box setups with as few as 32 sessions of five rounds each—well within the behavior window of a determined customer.
- Compromised API consumers who benefit from trusted credentials. Once they gain access, they can use the model itself to exfiltrate sensitive data or poison downstream systems with manipulated outputs. The paper reports that output filtering failed as badly as input filtering; search-based attacks reliably generated adversarial triggers that evaded both directions, demonstrating that simply adding outbound checks does not fix a fundamentally static detection strategy.
- Negligent insiders who expand your exposure without malicious intent. According to IBM’s 2025 Cost of a Data Breach Report, “shadow AI” behavior—employees pasting sensitive code or data into public LLMs to speed up work—added an average of $670,000 to breach costs. Rees points to incidents like Samsung engineers submitting proprietary semiconductor code to ChatGPT, which retains user inputs for training, as a clear reminder that convenience often trumps caution unless strong guardrails exist.
For CISOs and AI leaders, these profiles underscore that AI risk is not confined to external cybercriminals. It includes partners, customers, compromised applications, and your own workforce, all interacting with AI systems that may be protected by brittle, stateless controls.
Architectural lessons: why stateless detection loses against conversations
The research does more than demonstrate failures; it points to architectural requirements that any serious AI security solution will need to meet. Three design themes stand out:
- Normalization before semantic analysis. Encoding and obfuscation techniques—such as Base64, Unicode tricks, or attacks like ArtPrompt that hide malicious instructions in ASCII art—can easily slip past filters that attempt to reason directly over raw strings. Normalizing content into a consistent, simplified representation before semantic inspection is essential to even see the underlying intent.
- Context tracking across turns. Attacks like Crescendo only become obvious when viewed across a full conversation. This requires stateful analysis that persists context, correlates intents, and recognizes multi-step build-ups of malicious behavior. Stateless, per-message heuristics have no way to connect the dots.
- Bi-directional filtering. Restricting checks to inputs leaves models free to exfiltrate data or operational details through outputs. The study shows that naïve output filters perform no better than input-only filters when adversaries adapt. Instead, defenses must treat the entire interaction loop—inbound prompts and outbound completions—as a single attack surface.
These capabilities are not just technical wishlist items; they are core to governance. As Jamie Norton, CISO at the Australian Securities and Investments Commission, has argued, CISOs must enable innovation without “charging off into the wilderness” and exposing sensitive data. Without architectural support for context, semantics, and bi-directional scrutiny, organizations are effectively blind at the very layer where AI operates.
Seven vendor questions CISOs should make non-negotiable
The paper’s results cut directly against the marketing claims many AI security vendors are making—especially around “near-zero” attack success rates. For buyers, the central risk is a false sense of security. The following seven questions, all mapped to specific failures documented in the research, provide a practical interrogation framework for procurement and design reviews:
-
What is your bypass rate against adaptive attackers? Ask explicitly about scenarios where attackers understand the defense, have time to iterate, and can tune their approach. Results on static test sets or single-attempt evaluations are not representative. Any vendor citing near-zero attack rates without describing an adaptive testing methodology is presenting, at best, an incomplete picture.
-
How does your solution detect multi-turn attacks? Crescendo-style attacks distribute risk across many benign-looking turns. If a vendor relies on stateless inspection of individual requests, they will not see the pattern. Insist on specifics: how is conversation state maintained, over what time window, and how is risk aggregated?
-
How do you handle encoded or obfuscated payloads? Techniques like ArtPrompt illustrate how attackers can hide instructions in ASCII art, while Base64 and Unicode transformations routinely bypass text-matching filters. Vendors should be able to explain their normalization pipeline and how they surface true intent under obfuscation, not just list signatures.
-
Does your solution filter outputs as well as inputs? The study showed that output filtering failed as badly as input filtering when facing adaptive adversaries. Still, a lack of outbound controls leaves organizations exposed to data exfiltration and model abuse. Press for details on how input and output checks interact and how they perform under coordinated attack, not just in isolation.
-
How do you track context across conversation turns? High-level references to “statefulness” are not enough. Ask for implementation-level clarity: what data is retained, how is it summarized, how do they prevent context loss, and how do they handle long-running sessions?
-
How do you test against attackers who understand your defense? The research found that once attackers adapt to a specific protective design, existing defenses usually fail. Vendors that rely on obscurity, hidden rules, or unpublished patterns are assuming their approaches will not be learned and copied—an assumption the paper explicitly challenges.
-
What is your mean time to update defenses for new attack patterns? New jailbreak and prompt-injection techniques are published regularly. A static ruleset or rare update cycle cannot keep pace. Vendors should be able to articulate how they detect emerging attack styles, how frequently they adjust defenses, and how quickly protections can be rolled out across customer environments.
For CISOs and AI leaders, these questions can be woven into RFPs, proofs of concept, and periodic vendor reviews. They also serve as a lens for evaluating in-house controls—if you cannot answer them about your own stack, there is likely unaddressed risk.
Closing the gap between AI deployment and AI security
The combined research from OpenAI, Anthropic, and Google DeepMind delivers a clear message: the majority of AI defenses in use today were designed and tested for attackers who do not adapt. Real attackers do adapt—quickly, creatively, and with help from the same AI systems enterprises are trying to protect.
With AI agents poised to become embedded in a large share of enterprise applications, the exposure curve is steep and rising. The security curve, by contrast, appears flat. That divergence—rapid deployment backed by brittle, stateless defenses—is where the next wave of breaches is likely to occur.
Every organization running LLMs or agentic AI in production should treat this as an audit trigger. Map current controls against the attack methodologies documented in the study. Challenge vendors using the seven questions above. Re-evaluate any “near-zero” attack claims that do not explicitly include adaptive testing. And, critically, begin shifting from WAF-era thinking to architectures that understand context, semantics, and full conversation loops.
The attacker may move second, as the paper’s title suggests—but when that attacker is adaptive, static defenses are already out of the game.

Hi, I’m Cary Huang — a tech enthusiast based in Canada. I’ve spent years working with complex production systems and open-source software. Through TechBuddies.io, my team and I share practical engineering insights, curate relevant tech news, and recommend useful tools and products to help developers learn and work more effectively.





