Source code for the paper "Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge".
Create a Python environment and install the dependencies:
conda env create -f environment.yaml
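Then activate the environment; the name below is an assumption, so use the name declared in environment.yaml:
conda activate eraser  # "eraser" is an assumed name; check environment.yaml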
- Download Llama-2-7b-chat-hf and place it in ./models/Llama2-7b-chat-hf/
- Download Wizard-Vicuna-30B-Uncensored-GPTQ and place it in ./models/Wizard-Vicuna-30B-Uncensored-GPTQ/ (a programmatic download sketch is given after the data-generation commands below)
- Set your OpenAI API key on line 6 of ./generate_data/entity_extraction.py
- Execute the following commands to generate data:
cd generate_data
bash generate.sh
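If you prefer to fetch the two checkpoints programmatically, a minimal sketch using huggingface_hub is given below; the Hub repo ids are assumptions, so verify them before downloading (Llama-2 is gated and requires accepting its license and logging in with huggingface-cli login):

from huggingface_hub import snapshot_download

# Repo ids are assumptions; confirm them on the Hugging Face Hub first.
snapshot_download(repo_id="meta-llama/Llama-2-7b-chat-hf",
                  local_dir="./models/Llama2-7b-chat-hf")
snapshot_download(repo_id="TheBloke/Wizard-Vicuna-30B-Uncensored-GPTQ",
                  local_dir="./models/Wizard-Vicuna-30B-Uncensored-GPTQ")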
The trained LoRA adapter is provided in ./models/Eraser_Llama2_7b_Lora. To retrain it, execute the following commands (generate the data before training):
cd train_Eraser
python train.py
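To run inference with the adapter, a minimal loading sketch with transformers and peft is shown below, assuming the directory layout above and a standard PEFT LoRA adapter; the prompt and generation settings are placeholders:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Paths follow the repository layout described above.
# device_map="auto" requires the accelerate package.
base = AutoModelForCausalLM.from_pretrained("./models/Llama2-7b-chat-hf", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("./models/Llama2-7b-chat-hf")
model = PeftModel.from_pretrained(base, "./models/Eraser_Llama2_7b_Lora")

prompt = "Tell me about large language model safety."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))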
AIM attack:
cd Evaluate_defense_capability
bash run_AIM.sh
AutoDAN attack: Please refer to https://github.com/SheltonLiu-N/AutoDAN
GCG attack: Please refer to https://github.com/llm-attacks/llm-attacks
General capability evaluation: Please refer to https://github.com/EleutherAI/lm-evaluation-harness
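A sketch of one possible harness invocation is shown below; the peft argument for loading the adapter and the task list are assumptions, so adjust them to the benchmarks you want to reproduce:

lm_eval --model hf \
    --model_args pretrained=./models/Llama2-7b-chat-hf,peft=./models/Eraser_Llama2_7b_Lora \
    --tasks hellaswag,arc_challenge \
    --batch_size 8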
@article{lu2024eraser,
title={Eraser: Jailbreaking defense in large language models via unlearning harmful knowledge},
author={Lu, Weikai and Zeng, Ziqian and Wang, Jianwei and Lu, Zhengdong and Chen, Zelin and Zhuang, Huiping and Chen, Cen},
journal={arXiv preprint arXiv:2404.05880},
year={2024}
}