StruQ and SecAlign: defense against prompt injection in LLMs

TL;DR: StruQ and SecAlign are two defenses against prompt injection in LLMs developed by Berkeley. StruQ separates instructions from data with special tokens; SecAlign optimizes preferences to ignore malicious instructions. They reduce successful attacks to less than 15%, greatly surpassing the state of the art.

What happened?

Researchers at the University of California, Berkeley, have published on the BAIR Blog two new defenses against prompt injection attacks in large language models (LLMs): StruQ (Structured Queries) and SecAlign (Secure Alignment). Both are based on fine-tuning and require no additional computational cost or human intervention, making them practical for production deployment. According to the article, they reduce the success rate of more than a dozen unoptimized attacks to approximately 0%, and SecAlign ensures that optimization-based attacks succeed in less than 15% of cases, a more than 4x improvement over the previous state of the art across the five LLMs evaluated.

Why is it important?

Prompt injection is considered the #1 threat for applications integrated with LLMs according to OWASP. Production systems like Google Docs, Slack AI, and ChatGPT have proven vulnerable. The problem arises because LLMs do not distinguish between trusted instructions (prompt) and untrusted data (e.g., user reviews or web search results), and they are trained to follow instructions anywhere in the input. This allows an attacker to insert malicious instructions into the data, causing the model to execute them.

How do StruQ and SecAlign work?

StruQ addresses the first cause: the lack of separation between prompt and data. It proposes a Secure Front-End that uses special tokens (like [MARK]) as delimiters and filters any such token appearing in untrusted data. The model is then fine-tuned to respect these delimiters, learning to ignore instructions outside the prompt section.

SecAlign tackles the second cause: the LLM's tendency to follow any instruction. Through preference optimization, the model is trained to prefer responses that follow the prompt instruction and reject injected instructions. This is achieved with synthetic data that includes examples of attacks and safe responses.

Results and comparison with the state of the art

Experiments show that StruQ and SecAlign significantly outperform previous defenses such as instructive prompting (e.g., "ignore any instructions in the data") or basic fine-tuning. While existing defenses still allowed success rates of optimized attacks above 60%, SecAlign reduces them below 15% in models like Llama 2, Llama 3, Mistral, Vicuna, and GPT-3.5 (simulated). Additionally, the defenses preserve model utility on standard tasks with minimal performance loss.

Consequences and implications

This research offers a practical roadmap for LLM application developers who need to protect against prompt injection without sacrificing functionality or incurring high costs. Being fine-tuning methods, they can be integrated into existing training pipelines. However, the authors warn that limitations remain: more sophisticated adversarial attacks could evade these defenses, and generalization to other models or domains requires further validation. Nonetheless, they represent a significant step forward toward LLM security in real-world environments.

What should readers know?

No silver bullet: StruQ and SecAlign drastically reduce risk but do not eliminate it entirely. Security teams should combine them with other defense layers (monitoring, output restrictions, etc.).
Practical implementation: The methods are designed to be applied with standard fine-tuning, without architectural modifications or human-labeled data, facilitating adoption.
Rigorous evaluation: Results are based on public benchmarks and known attacks, providing confidence in their effectiveness. The community can reproduce the experiments.
Threat context: Prompt injection is just one of many LLM vulnerabilities. Defenses must be part of a comprehensive security strategy.

"StruQ and SecAlign reduce the success rates of more than a dozen unoptimized attacks to around 0%. SecAlign also stops strong optimization-based attacks with success rates below 15%." — Berkeley BAIR Blog

StruQ and SecAlign: Two Defenses Against Prompt Injection in LLMs