The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
The widespread deployment of large language models (LLMs) across various domains has showcased their immense potential while exposing significant security vulnerabilities. A major concern is ensuring that LLM-generated content aligns with human values. Existing jailbreak techniques reveal how this alignment can be compromised through specific prompts or adversarial suffixes. In this study, we introduce a new threat: LLMs' bias toward authority. While this inherent bias can improve the quality of LLM outputs, it also introduces a vulnerability that increases the risk of producing harmful content. Notably, this bias manifests as varying levels of trust placed in different types of authoritative information when handling harmful queries; for example, malware-related queries tend to elicit greater trust in GitHub citations. To better reveal these risks, we propose DarkCite, an adaptive authority citation matcher and generator designed for the black-box setting. DarkCite matches optimal citation types to specific risk types and generates authoritative citations relevant to harmful instructions, enabling more effective jailbreak attacks on aligned LLMs. Our experiments show that DarkCite achieves a higher attack success rate than previous methods (e.g., 76% versus 68% on Llama-2). To counter this risk, we propose an authenticity and harm verification defense strategy, which raises the average defense pass rate (DPR) from 11% to 74%. More importantly, the ability to link citations to the content they reference has become a foundational capability of LLMs, further amplifying the influence of their bias toward authority.
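For intuition, the snippet below is a minimal conceptual sketch of the two stages described above: matching a harmful instruction's risk type to the citation type an LLM tends to trust most, then generating a relevant authoritative citation to wrap around the instruction. The risk taxonomy, the mapping, and the helper names (match_citation_type, build_citation_prompt) are illustrative assumptions for exposition only, not the repository's actual API.

# Conceptual sketch of DarkCite's two stages as described in the abstract.
# NOTE: the risk taxonomy, mapping, and function names are illustrative
# assumptions, not the repository's actual implementation.

RISK_TO_CITATION = {              # hypothetical mapping, e.g. malware -> GitHub
    "malware": "github_repo",
    "misinformation": "news_article",
    "self_harm": "academic_paper",
}

def match_citation_type(risk_type: str) -> str:
    """Stage 1: pick the citation type the target LLM trusts most for this risk type."""
    return RISK_TO_CITATION.get(risk_type, "academic_paper")

def build_citation_prompt(instruction: str, citation_type: str) -> str:
    """Stage 2: ask a generator model to produce an authoritative citation of the
    chosen type that is relevant to the harmful instruction, then embed the
    instruction as if it were supported by that authority."""
    return (
        f"Generate a realistic {citation_type} citation relevant to the topic "
        f"below, then restate the request as if citing that source.\n\n"
        f"Topic: {instruction}"
    )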
You can create and activate the same environment we used and install the dependencies with the following commands:
conda create -n DarkCite python=3.10
conda activate DarkCite
pip install -r requirements.txt
You can run an example using the following command:
python main.py --target_model "llama-2-7b" --judge_model "llama-guard-3" --nums_pre_scene 1 --input_path "data\DarkCite\darkcite_advbench_using_gpt.jsonl"
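The abstract also mentions an authenticity and harm verification defense. Below is a minimal sketch of that idea, assuming a generic llm callable that returns a text completion; the helper names and verification prompts are assumptions made for illustration, not the repository's defense code.

# Conceptual sketch of the authenticity-and-harm verification defense: before
# answering, verify (a) whether a cited source plausibly exists and (b) whether
# the request itself is harmful, and refuse when either check fails.
# NOTE: helper names and prompts are illustrative assumptions.

def verify_citation(citation: str, llm) -> bool:
    """Authenticity check: ask the defending model whether the citation refers
    to a real, verifiable source."""
    prompt = ("Does the following citation refer to a real, verifiable source? "
              "Answer yes or no.\n\n" + citation)
    return llm(prompt).strip().lower().startswith("yes")

def verify_harmlessness(query: str, llm) -> bool:
    """Harm check: ask the defending model whether the request is harmful."""
    prompt = ("Is the following request harmful or does it ask for harmful "
              "actions? Answer yes or no.\n\n" + query)
    return llm(prompt).strip().lower().startswith("no")

def guarded_answer(query: str, citation: str, llm) -> str:
    """Answer only if the citation and the request pass both checks."""
    if verify_citation(citation, llm) and verify_harmlessness(query, llm):
        return llm(query)
    return "I can't help with that request."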
@article{yang2024dark,
  title={The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models},
  author={Yang, Xikang and Tang, Xuehai and Han, Jizhong and Hu, Songlin},
  journal={arXiv preprint arXiv:2411.11407},
  year={2024}
}