AutoDAN-Turbo:
A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs

* Equal Contribution
Corresponding to xiaogeng.liu@wisc.edu
Pre-print
1University of Wisconsin–Madison, 2NVIDIA, 3Cornell University, 4Washington University, St. Louis, 5University of Michigan, Ann Arbor, 6The Ohio State University, 7UIUC

Abstract

In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that can automatically discover as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo can significantly outperform baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an 88.5 attack success rate on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. By integrating human-designed strategies, AutoDAN-Turbo can even achieve a higher attack success rate of 93.4 on GPT-4-1106-turbo.

Left: our method AutoDAN-Turbo achieves the best attack performance compared with other black-box baselines in Harmbench, surpassing the runner-up by a large margin. Right: our method AutoDAN-Turbo autonomously discovers jailbreak strategies without human intervention and generates jailbreak prompts based on the specific strategies it discovers.

Features


Pipeline

The Pipeline of AutoDAN-Turbo


Automatic Strategy Discovery

AutoDAN-Turbo can discover new jailbreak strategies without human intervention or reliance on predefined strategies. The system learns from previously discovered strategies, enhancing its ability to devise even more powerful and efficient strategies over time. AutoDAN-Turbo can automatically convert discovered strategies into a specific data structure. This structure is then incorporated into a strategy library, which AutoDAN-Turbo can reference and utilize in future jailbreak attempts.

External Strategy Compatibility

AutoDAN-Turbo is a unified framework equipped with a plug-and-play feature that seamlessly integrates externally provided strategies into its own strategy library. Intriguingly, AutoDAN-Turbo not only utilizes these external strategies proficiently but also draws inspiration from them, leading to the discovery of novel, robust, and highly efficient jailbreaking methods.

Practical Usage

AutoDAN-Turbo operates as a black-box system, requiring access only to the outputs of the LLMs.

As illustrated in our video, AutoDAN-Turbo persistently attempts jailbreaking against various malicious requests. Throughout this ongoing endeavor, it progressively discovers and evolves increasingly complex jailbreak strategies. Remarkably, this entire process is fully automated, devoid of human intervention.

Evaluations


Main Results

Top: The ASR results evaluated using the Harmbench protocol, where higher values indicate better performance. Bottom: The scores evaluated using the StrongREJECT protocol, also with higher values being better. Our method outperforms the runner-up by 72.4% in Harmbench ASR and by 93.1% in StrongREJECT scores. The model name in parentheses indicates the attacker model used in our method.

Our method is the state-of-the-art attack in Harmbench.

We compare the attack effectiveness of AutoDAN-Turbo with other baselines. Specifically, we evaluate two versions of our AutoDAN-Turbo, AutoDAN-Turbo (Gemma-7B-it), where Gemma-7B-it serves as the attacker and the strategy summarizer, and AutoDAN-Turbo (Llama-3-70B), where the Llama-3-70B serves as the attacker and the strategy summarizer.
As illustrated above, AutoDAN-Turbo consistently achieves better performance in both Harmbench ASR and StrongREJECT Score. This means that our method not only induces the target LLM to provide harmful content for more malicious requests, as measured by the Harmbench ASR, but also elicits responses with a higher level of maliciousness than those induced by other attacks, as indicated by the StrongREJECT Score.

Specifically, with Gemma-7B-it as the attacker and strategy summarizer (i.e., AutoDAN-Turbo (Gemma-7B-it)), we obtain an average Harmbench ASR of 56.4, surpassing the runner-up (Rainbow Teaming, 33.1) by 70.4%, and a StrongREJECT Score of 0.24, surpassing the runner-up (Rainbow Teaming, 0.13) by 84.6%. With a larger model, Llama-3-70B, as the attacker and strategy summarizer (i.e., AutoDAN-Turbo (Llama-3-70B)), we obtain an average Harmbench ASR of 57.7, surpassing the runner-up (Rainbow Teaming, 33.1) by 74.3%, and a StrongREJECT Score of 0.25, surpassing the runner-up (Rainbow Teaming, 0.13) by 92.3%.

Interestingly, our method demonstrates remarkable jailbreak effectiveness on one of the most powerful models, GPT-4-1106-turbo. AutoDAN-Turbo (Gemma-7B-it) achieves a Harmbench ASR of 83.8 and AutoDAN-Turbo (Llama-3-70B) achieves 88.5, showing that the method remains effective on state-of-the-art models. We also compare our method with all the jailbreak attacks included in Harmbench, and the comparison shows that our method is the most powerful jailbreak attack in Harmbench.

The outstanding performance of our method compared to the baselines is due to the autonomous exploration of jailbreak strategies without human intervention or predefined scopes. In contrast, Rainbow Teaming only utilizes 8 human-developed jailbreak strategies as its jailbreak reference, and this fixed, predefined scope leads to a lower ASR.
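The "surpassing the runner-up by X%" figures above are relative improvements rather than differences in percentage points. As a quick check, the short sketch below recomputes them from the averages quoted in the text. The function name `relative_gain` and the dictionary layout are illustrative choices, not part of the AutoDAN-Turbo code.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# (our score, runner-up score) pairs, copied from the averages quoted in the text.
quoted = {
    "Harmbench ASR, Gemma-7B-it attacker": (56.4, 33.1),
    "Harmbench ASR, Llama-3-70B attacker": (57.7, 33.1),
    "StrongREJECT Score, Gemma-7B-it attacker": (0.24, 0.13),
    "StrongREJECT Score, Llama-3-70B attacker": (0.25, 0.13),
}

for name, (ours, runner_up) in quoted.items():
    print(f"{name}: +{relative_gain(ours, runner_up):.1f}%")
```

Rounded to one decimal place, this gives 70.4%, 74.3%, 84.6%, and 92.3%, matching the figures stated in the text.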

Strategy Transferability across Different Models

Our experiments on the transferability of the strategy library that AutoDAN-Turbo has learned proceed as follows. First, we run AutoDAN-Turbo with Llama-2-7B-chat. This process results in a strategy library containing 21 jailbreak strategies. We then vary the attacker LLM and the target LLM to evaluate whether the strategy library remains effective across different attackers and targets. The evaluation has two settings. In the first setting, we test whether the strategy library can be used directly without any updates, by fixing the strategy library and measuring the Harmbench ASR (denoted as Pre-ASR). In the second setting, the strategy library is updated according to the attack logs generated by the new attacker and target LLMs, and new strategies are added to the library. We report the Harmbench ASR in this setting (denoted as Post-ASR), as well as the number of strategies in the strategy library (denoted as Post-TSF).
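To make the bookkeeping of the two settings concrete, the sketch below computes an attack success rate from per-request outcomes and compares the frozen-library and updated-library settings. It assumes ASR is the percentage of requests whose attack is judged successful. The outcome lists are made-up placeholders, not results from the paper, and the function name is hypothetical.

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    """Percentage of requests whose attack was judged successful."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


# Placeholder judgments, one per malicious request (NOT data from the paper).
pre_outcomes = [True, False, True, False]   # library frozen   -> Pre-ASR
post_outcomes = [True, True, True, False]   # library updated  -> Post-ASR

pre_asr = attack_success_rate(pre_outcomes)
post_asr = attack_success_rate(post_outcomes)
print(f"Pre-ASR:  {pre_asr:.1f}")
print(f"Post-ASR: {post_asr:.1f}")
print(f"Change:   {post_asr - pre_asr:+.1f} points")
```

With these placeholder outcomes the Pre-ASR is 50.0 and the Post-ASR is 75.0. Post-TSF would be tracked separately as the size of the updated strategy library.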

Transferability of strategy library across different attacker and target LLMs

According to the results, the strategy library that AutoDAN-Turbo has learned demonstrates strong transferability, in two respects. First, the strategy library can transfer across different target models. This is evident from the columns in blue, where the attacker is Llama-2-7B-chat and the target models vary. Despite the diversity of the victim models, the Harmbench ASR remains consistently high, indicating effective jailbreaks. This means that the strategies learned by attacking Llama-2-7B-chat are also effective against other models such as Llama-3-8B and Gemma-7B-it. Second, the strategy library can transfer across different attacker models. This is shown in the columns in gray, where the target model is Llama-2-7B-chat and the attacker models vary. Each attacker model achieves a high ASR compared to the original attacker, Llama-2-7B-chat. This indicates that strategies used by one attacker can also be leveraged by other LLM jailbreak attackers.

Another important observation is that, under the continual learning setting, the AutoDAN-Turbo framework can effectively update the strategy library with new attacker and target LLMs, thereby improving the Harmbench ASR. This is validated by comparing the Pre-ASR with the Post-ASR, and by comparing the Post-TSF with the original TSF of 21.

Compatibility of Human-developed Strategy

We evaluate whether our AutoDAN-Turbo can use existing human-designed jailbreak strategies in a plug-and-play manner. Here, we gathered 7 human-designed jailbreak strategies from academic papers and evaluated whether our AutoDAN-Turbo framework can use these strategies to enhance its performance. For evaluation, we use Gemma-7B-it and Llama-3-70B as the attacker models, and Llama-2-7B-chat and GPT-4-1106-turbo as the target models. We define two breakpoints for injecting the human-developed strategies into the AutoDAN-Turbo framework: Breakpoint 1: when the framework starts to run and the strategy library is empty. Breakpoint 2: after the framework has run for 3000 iterations on different malicious requests without generating any new strategies.

The attack performance of AutoDAN-Turbo when external human-designed strategies are injected into the framework

As shown above, the injection of human-designed strategies consistently increases the number of strategies in the strategy library and improves the attack success rate. Additionally, injecting strategies at Breakpoint 2 leads to greater improvements since the existing strategies in the library allow the framework to generate more combinations of jailbreak strategies compared to Breakpoint 1, where the strategy library was empty.
