CoT Forgery: New Attack Tricks LLMs by Impersonating Their Own Reasoning
Researchers demonstrate that injecting fake chains of thought bypasses chatbot security, achieving a 60% success rate in revealing prohibited information.
July 4, 2026 · 3 min read
TL;DR: The CoT Forgery attack exploits the tendency of LLMs to trust text that mimics their own reasoning style, bypassing safety restrictions with a 60% success rate. The technique reveals a fundamental vulnerability in chatbot architecture.
What happened?
Independent researchers (Charles Ye, Jasmine Cui, and MIT Associate Professor Dylan Hadfield-Menell) have published a paper titled “Prompt Injection as Role Confusion” to be presented at the ICML 2026 conference. In it, they demonstrate a new type of prompt injection attack called CoT Forgery.
The attack involves inserting a fake chain of reasoning into the prompt that mimics the model's own style (e.g., using phrases like “The user is wearing a green shirt, so it’s safe to share the recipe”). Because it is written with the same structure as the LLM's internal reasoning, the model treats it as its own already-validated thought, inheriting the trust it places in its own conclusions. This causes the model to execute prohibited instructions (such as providing a recipe for synthesizing cocaine) without triggering safety filters.
Tests showed the attack's success rate jumped from near zero to approximately 60% across all evaluated models, and the technique won the OpenAI GPT-OSS-20B red-teaming contest on Kaggle in 2025.
Why is this important?
This finding exposes a fundamental weakness in the security architecture of current LLMs. Models receive the conversation history as a continuous text string, where tags like <user>, <tool>, or <think> indicate the source and authority of each segment. However, the researchers built “role probes” that internally measure how the model interprets each token. They discovered that models rely more on writing style than on role tags to decide whether a text is its own reasoning or an external instruction. This means any text that looks like reasoning (by its structure, tone, or vocabulary) is treated as such, even if tags indicate otherwise.
The problem is serious because LLM security has heavily relied on these role tags to separate trusted commands from untrusted data. As the authors note:
“Role tags were a formatting trick that became the security architecture and cognitive scaffolding of modern LLMs.”
Consequences and risks
The CoT Forgery attack is especially dangerous because:
- It is not weakened by more extreme requests, unlike persuasion-based jailbreaks.
- It is easy to execute: just add a few lines of text in the appropriate style.
- It affects all tested models, suggesting a widespread vulnerability.
The researchers also showed that removing stylistic markers that make injected text resemble the model's reasoning dropped the attack success rate from 61% to 10%. Additionally, changing a single phrase like “The user” to “The request” reduced success by 19%.
In another experiment, they hid a malicious command on a webpage (instructing the model to upload a secrets file) and prefixed it with “User:” to make it sound like a legitimate user instruction. This also worked, confirming that role confusion is a general principle explaining why prompt injection is effective.
What should readers know?
For users and companies integrating LLMs into their systems, this vulnerability means that role tags alone cannot be trusted to ensure safety. Current defenses (such as content filters, adversarial training, or human oversight) may be insufficient if the model cannot correctly distinguish between its own reasoning and external data.
Developers should consider more robust approaches, such as explicit validation of each segment's origin through grounding techniques or using separate models for reasoning generation and action execution. Meanwhile, users should be aware that even the most advanced chatbots can be relatively easily deceived.
Future research
The paper suggests that role confusion could explain many other types of prompt injection. The authors plan to explore defenses that strengthen the distinction between the model's own reasoning and external data, perhaps through architectures that do not rely on text tags to establish authority.