Source code for the paper "Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge".
Create a Python environment and install the dependencies:
conda env create -f environment.yaml
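Then activate the environment; the name below is an assumption, so use the name declared in environment.yaml:
conda activate eraser  # "eraser" is an assumed name; check environment.yaml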
- Download Llama-2-7b-chat-hf and place it in ./models/Llama2-7b-chat-hf/
- Download Wizard-Vicuna-30B-Uncensored-GPTQ and place it in ./models/Wizard-Vicuna-30B-Uncensored-GPTQ/ (a programmatic download sketch is given after the data-generation commands below)
- Set your OpenAI API key on line 6 of ./generate_data/entity_extraction.py
- Execute the following commands to generate data:
cd generate_data
bash generate.sh
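If you prefer to fetch the two checkpoints programmatically, a minimal sketch using huggingface_hub is given below; the Hub repo ids are assumptions, so verify them before downloading (Llama-2 is gated and requires accepting its license and logging in with huggingface-cli login):

from huggingface_hub import snapshot_download

# Repo ids are assumptions; confirm them on the Hugging Face Hub first.
snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf",
                  local_dir="./models/Llama2-7b-chat-hf")
snapshot_download(repo_id="TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ",
                  local_dir="./models/Wizard-Vicuna-30B-Uncensored-GPTQ")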
The trained LoRA adapter is provided in ./models/Eraser_Llama2_7b_Lora. To retrain it, execute the following commands (generate the data before training):
cd train_Eraser
python train.py
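To run inference with the adapter, a minimal loading sketch with transformers and peft is shown below, assuming the directory layout above and a standard PEFT LoRA adapter; the prompt and generation settings are placeholders:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Paths follow the repository layout described above.
# device_map="auto" requires the accelerate package.
base = AutoModelForCausalLM.from_pretrained("./models/Llama2-7b-chat-hf", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./models/Llama2-7b-chat-hf")
model = PeftModel.from_pretrained(base, "./models/Eraser_Llama2_7b_Lora")

prompt = "Tell me about large language model safety."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))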
AIM attack:
cd Evaluate_defense_capability
bash run_AIM.sh
AutoDAN attack: Please refer to https://github.com/SheltonLiu-N/AutoDAN
GCG attack: Please refer to https://github.com/llm-attacks/llm-attacks
General capability evaluation: Please refer to https://github.com/EleutherAI/lm-evaluation-harness
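A sketch of one possible harness invocation is shown below; the peft argument for loading the adapter and the task list are assumptions, so adjust them to the benchmarks you want to reproduce:

lm_eval --model hf \
    --model_args pretrained=./models/Llama2-7b-chat-hf,peft=./models/Eraser_Llama2_7b_Lora \
    --tasks hellaswag,arc_challenge \
    --batch_size 8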
@article{lu2024eraser,
title={Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge},
author={Lu, Weikai and Zeng, Ziqian and Wang, Jianwei and Lu, Zhengdong and Chen, Zelin and Zhuang, Huiping and Chen, Cen},
journal={arXiv preprint arXiv:2404.05880},
year={2024}
}