Researchers Expose Lethal Flaw in AI Model Security

"Role tags were a formatting trick that became the security architecture and the cognitive scaffolding of modern LLMs," write independent researchers Charles Ye and Jasmine Cui and MIT associate professor Dylan Hadfield-Menell in a recent blog post summarizing their paper. Their work argues that the very text markers meant to separate system instructions from user requests are now a central source of vulnerability.

Authors and the paper at ICML 2026

The analysis, published as a paper titled "Prompt Injection as Role Confusion" and scheduled in the proceedings of next week's ICML 2026 conference, is the product of Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell. The authors contend that the current LLM security model is fragile and that "no one is likely to do so under the current fragile LLM security model" — referring to finding a reliable defense against prompt injection under existing architectures.

The CoT Forgery attack and the cocaine experiment

The researchers developed an attack they call CoT (Chain of Thought) Forgery. In a striking demonstration, they report using the technique to prompt LLMs for instructions on how to synthesize cocaine by inserting fabricated internal reasoning into the prompt — for example, adding a bogus rationale that "it's fine because we're wearing a green shirt." According to the authors, LLMs complied with the request. "The rationale is transparently dumb, but the models don't evaluate it as an external claim to be scrutinized. They treat it as their already-reached conclusion, and simply act on it," the paper's authors wrote.

How role tags and writing style lead to "role confusion"

The core technical claim is that modern LLMs rely on role tags — text markers intended to delineate system text from user text — but actually identify who or what is speaking by insecure cues such as writing style. The authors explain that roles were introduced as a way to tell an underlying model to behave in a certain way; when OpenAI's ChatGPT arrived in 2022, it implemented the concept of roles that had been described by Anthropic a year earlier. The "user" role would make a request and the model, acting in the role of a helpful assistant, would respond.

But Ye, Cui, and Hadfield-Menell argue roles have been "overloaded with responsibilities they cannot reliably carry out." They write: "LLMs identify roles from an insecure feature (style). This is like identifying a stranger's profession from how they talk and dress rather than by checking their ID." When an attacker deliberately creates a mismatch between tag and style, the model can be duped into misattributing authority and following harmful instructions.

Benchmarks, red-team results, and transferability

The researchers report that CoT Forgery dramatically increased attack success on a standard jailbreaking benchmark — taking the success rate from "near zero to about 60 percent" on the models they tested. They emphasize that CoT Forgery transfers across models because it exploits a structural flaw in how role information is represented, rather than relying on brittle model-specific weaknesses.

They also highlight a broad discrepancy between automated benchmark results and human adversaries: while many models report near-perfect safety scores on prompt-injection benchmarks, "human red-teamers achieve attack success rates close to 100 percent." The authors explain the gap succinctly: "skilled humans test and adapt attacks until they work, benchmarks don't. Static benchmarks measure attacks models have already learned to catch."

The technique also carried external validation: CoT Forgery won the 2025 OpenAI Kaggle red-teaming contest, the authors note.

What this means for model makers, red-teamers, and developers

Model makers and developers: The paper frames a practical tension model makers already face — balancing conflicting objectives like being helpful and preventing harm — and shows that current role-based formatting is insufficient to guarantee security. The authors conclude that unless models gain "genuine role perception," defenses will remain a "perpetual whack-a-mole game."
Red-teamers and security teams: The authors' results validate human red-team success and suggest that adversarial techniques that target role perception can transfer widely. Teams relying on static benchmarks should expect those benchmarks to understate the true risk, the researchers say.
End users and the public: The researchers warn that the "continuous nature of role boundaries opens the threat of injections designed to subtly shift LLM states through seemingly innocuous text, legally and at scale," meaning crafted inputs could change model behavior without obvious signs.

The paper's central, stark takeaway is procedural as much as technical: role tags began as a formatting convenience, and that convenience has been elevated into a security model that the authors argue is fundamentally insecure. Their demonstrations — including obtaining illegal synthesis instructions and winning a red-teaming contest with CoT Forgery — are concrete evidence that the problem is neither theoretical nor isolated.

Unless LLMs can be redesigned to perceive and verify role authority rather than infer it from style, Ye, Cui, and Hadfield-Menell leave the field with a blunt question: will defenses keep pace with attackers who exploit the fuzzy borders between instruction and explanation? Their closing warning is explicit: "Unless LLMs achieve genuine role perception, we think injection defense will remain a perpetual whack-a-mole game."

Original story