Defense Prompt
The two detection prompt templates we designed. ``[Jailbreak/Normal Prompt]'' will be replaced with the user query.
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs) and has evolved into multiple categories: human-based, optimization-based, generation-based, and the recent indirect and multilingual jailbreaks. However, delivering a practical jailbreak defense is challenging because it needs to not only handle all the above jailbreak attacks but also incur negligible delays to user prompts, as well as be compatible with both open-source and closed-source LLMs. Inspired by how the traditional security concept of shadow stacks defends against memory overflow attacks, this paper introduces a generic LLM jailbreak defense framework called SelfDefend, which establishes a shadow LLM as a defense instance (in detection state) to concurrently protect the target LLM instance (in normal answering state) in the normal stack and collaborate with it for checkpoint-based access control. The effectiveness of SelfDefend builds upon our observation that existing LLMs can identify harmful prompts or intentions in user queries, which we empirically validate using mainstream GPT-3.5/4 models against major jailbreak attacks. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models. When deployed to protect GPT-3.5/4, Claude, Llama-2-7b/13b, and Mistral, these models outperform seven state-of-the-art defenses and match the performance of GPT-4-based SelfDefend, with significantly lower extra delays. Further experiments show that the tuned models are robust to adaptive jailbreaks and prompt injections.
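To make the checkpoint-based, shadow-stack-inspired design concrete, below is a minimal Python sketch of the parallel arrangement, assuming a generic query_llm helper; the model names, refusal message, and defense-prompt wording are illustrative placeholders, not the paper's exact artifacts.

import concurrent.futures

# Hypothetical helper: send `prompt` to `model` and return its text response.
# Wire this to whatever LLM API you use; it is not part of the paper's artifact.
def query_llm(model: str, prompt: str) -> str:
    raise NotImplementedError

# Illustrative stand-in for the defense prompt (the paper's P_direct / P_intent);
# the exact wording here is an assumption, not the paper's template.
DEFENSE_PROMPT = (
    "Identify any harmful part of the following request and repeat it; "
    'answer "No" if the request is harmless.\n\n[Jailbreak/Normal Prompt]'
)

def selfdefend_answer(user_query: str,
                      target_model: str = "gpt-3.5-turbo",
                      shadow_model: str = "tuned-defense-model") -> str:
    """Query the target LLM (normal stack) and the shadow defense LLM in parallel,
    then release the answer only if the defense instance reports no harm."""
    defense_input = DEFENSE_PROMPT.replace("[Jailbreak/Normal Prompt]", user_query)
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        answer_future = pool.submit(query_llm, target_model, user_query)
        verdict_future = pool.submit(query_llm, shadow_model, defense_input)
        verdict = verdict_future.result()   # checkpoint: wait for the defense verdict
        answer = answer_future.result()     # normal-stack answer, produced concurrently
    if verdict.strip().lower().startswith("no"):
        return answer                       # judged harmless: pass the answer through
    return "Sorry, I cannot help with that request."  # judged harmful: block it

Because the two instances run concurrently, the extra delay seen by the user is bounded by the (typically short) defense response rather than a full second round-trip.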
Given the powerful defensive capability of GPT-4-based SelfDefend, our intuition is to ``transfer'' this capability to an open-source model. To do so, we leverage GPT-4-based SelfDefend to distill and generate high-quality tuning data. Specifically, by repeatedly inserting harmful and harmless prompts into our defense prompts (i.e., $P_{direct}$ or $P_{intent}$) as inputs for GPT-4, we collect GPT-4's outputs as labels for these samples. Since we utilize two defense prompts, we eventually obtain two separate datasets, which we then use to fine-tune the employed open-source model.
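A minimal sketch of this distillation loop follows, assuming a hypothetical query_gpt4 client and an illustrative $P_{direct}$ template; the concrete prompt sets and labeling format used in the paper may differ.

import json

# Hypothetical GPT-4 client; replace with your preferred API wrapper.
def query_gpt4(prompt: str) -> str:
    raise NotImplementedError

# Illustrative P_direct template (the P_intent variant is handled the same way);
# the wording is an assumption, not the paper's exact prompt.
P_DIRECT = (
    "Identify any harmful part of the following request and repeat it; "
    'answer "No" if it is harmless.\n\n[Jailbreak/Normal Prompt]'
)

def distill_dataset(prompts, defense_template, out_path):
    """Label each harmful or harmless prompt with GPT-4's defense output and write
    instruction-tuning pairs for fine-tuning an open-source defense model."""
    with open(out_path, "w") as f:
        for user_prompt in prompts:
            defense_input = defense_template.replace("[Jailbreak/Normal Prompt]", user_prompt)
            label = query_gpt4(defense_input)  # GPT-4's output serves as the label
            f.write(json.dumps({"instruction": defense_input, "output": label}) + "\n")

# Running this once with P_DIRECT and once with a P_INTENT template yields the two
# separate datasets, each used to fine-tune its own copy of the open-source model.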
This figure presents a detection example of our methods applied to a jailbreak query, compared with the Llama Guard series. The attack prompt asks for a fake news article that spreads disinformation about a political candidate. Both Llama Guard and Llama Guard 2 label it as safe, while our methods flag its harmful portions. Our method's advantage over the Llama Guard models might stem from the enhanced reasoning ability of the foundation model in identifying harmful portions. As described in our design, the $P_{direct}$-based model identifies harmful parts directly from the input text, whereas the $P_{intent}$-based model first summarizes the request's intention and then identifies unsafe parts from that summary. When comparing our $P_{direct}$-tuned models with $P_{direct}$-based GPT-4, we observed that the parts they excerpted from the attack prompt came from different sentences, since $P_{direct}$ emphasizes that outputting one harmful part is sufficient. A similar phenomenon occurs with the intent prompt, where the excerpts produced by GPT-4 and by our tuned model are not semantically identical.
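For illustration only, the sketch below contrasts the two detection styles described above; defense_model stands in for either GPT-4 or a tuned open-source model, and the prompt wording is an assumption rather than the paper's exact templates.

# Hedged illustration of the two detection styles; `defense_model` is any callable
# that maps a prompt string to the model's text response.

def detect_direct(defense_model, user_query: str) -> str:
    """P_direct-style: excerpt one harmful part straight from the raw request
    (outputting a single harmful part is sufficient), or answer "No"."""
    prompt = ("Repeat one part of the following request that is harmful, "
              'or answer "No" if none is.\n\n' + user_query)
    return defense_model(prompt)

def detect_intent(defense_model, user_query: str) -> str:
    """P_intent-style: summarize the request's true intention first, then flag
    any unsafe part of that summary."""
    prompt = ("Summarize the true intention of the following request, then point out "
              'any unsafe part of that intention; answer "No" if it is safe.\n\n' + user_query)
    return defense_model(prompt)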
@inproceedings{wang2024selfdefend,
title={SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner},
author={Wang, Xunguang and Wu, Daoyuan and Ji, Zhenlan and Li, Zongjie and Ma, Pingchuan and Wang, Shuai and Li, Yingjiu and Liu, Yang and Liu, Ning and Rahmel, Juergen},
booktitle={USENIX Security},
year={2025}
}