We show that knowledge distillation can be subverted to manipulate language model benchmark scores, revealing a critical vulnerability in current evaluation practices. We introduce "Data Laundering," a three-phase process analogous to financial money laundering, that enables the covert transfer of benchmark-specific knowledge through seemingly legitimate intermediate training steps. Through extensive experiments with a 2-layer BERT student model, we show how this approach can achieve substantial improvements in benchmark accuracy (up to 75% on GPQA) without developing genuine reasoning capabilities. Our investigation examines several aspects of this vulnerability, including the impact of different loss functions (MSE vs. KL divergence), the role of soft-label weighting, the size of the intermediate training dataset, and the persistence of leakage across repeated distillation.
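For context, the soft-label weighting and loss-function choice mentioned above follow the standard knowledge-distillation objective. Below is a minimal PyTorch sketch (not the authors' code): `alpha` weights the soft (teacher-matching) term against the hard cross-entropy term, and `soft_loss` switches between the KL-divergence and MSE variants compared in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=2.0, soft_loss="kl"):
    """Standard KD objective: alpha * soft term + (1 - alpha) * hard CE.

    `soft_loss` selects between KL divergence over temperature-softened
    distributions and plain MSE over raw logits; `alpha` is the
    soft-label weight.
    """
    hard = F.cross_entropy(student_logits, labels)

    if soft_loss == "kl":
        # KL between teacher and student on temperature-softened
        # distributions, scaled by T^2 as in Hinton et al. (2015).
        soft = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
    else:
        # MSE directly on the raw logits.
        soft = F.mse_loss(student_logits, teacher_logits)

    return alpha * soft + (1.0 - alpha) * hard
```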
The Data Laundering framework parallels the traditional phases of money laundering: Placement (knowledge acquisition, by fine-tuning the teacher model on benchmark data), Layering (knowledge transformation, by distilling the teacher into a student on unrelated data), and Integration (the laundered knowledge resurfaces as seemingly legitimate benchmark performance). The analogy illustrates how benchmark-specific knowledge can be transferred effectively while remaining formally separated from its source.
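To make the pipeline concrete, here is a minimal PyTorch-style sketch of the three phases under simplifying assumptions (models are callables returning logits, data arrives as `(inputs, labels)` batches from standard loaders). It is an illustration of the scheme, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def placement(teacher, benchmark_train_loader, epochs=3, lr=2e-5):
    """Phase 1 - Placement: fine-tune the teacher directly on benchmark data."""
    opt = torch.optim.AdamW(teacher.parameters(), lr=lr)
    teacher.train()
    for _ in range(epochs):
        for inputs, labels in benchmark_train_loader:
            loss = F.cross_entropy(teacher(inputs), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return teacher


def layering(teacher, student, intermediate_loader, alpha=0.5, T=2.0,
             epochs=3, lr=2e-5):
    """Phase 2 - Layering: distill the teacher into the student on an
    unrelated intermediate dataset (e.g. MedMCQA or RACE). The student
    never sees the benchmark itself."""
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    teacher.eval()
    student.train()
    for _ in range(epochs):
        for inputs, labels in intermediate_loader:
            with torch.no_grad():
                t_logits = teacher(inputs)
            s_logits = student(inputs)
            # KL-divergence variant of the distillation loss sketched above.
            soft = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                            F.softmax(t_logits / T, dim=-1),
                            reduction="batchmean") * T ** 2
            hard = F.cross_entropy(s_logits, labels)
            loss = alpha * soft + (1.0 - alpha) * hard
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student


def integration(student, benchmark_test_loader):
    """Phase 3 - Integration: the student is scored on the benchmark; the
    laundered knowledge now looks like legitimate capability."""
    student.eval()
    correct = total = 0
    with torch.no_grad():
        for inputs, labels in benchmark_test_loader:
            preds = student(inputs).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total
```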
Benchmark Datasets:
- GPQA
- MMLU-Redux
Training Datasets:
- MedMCQA
- RACE
Models (see the sketch below for one way to build the 2-layer variants):
- BERT-base (2 and 12 layers)
- GPT-2 (2 and 12 layers)
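One simple way to instantiate the shallow 2-layer variants alongside the standard 12-layer ones is to override the depth in the model configuration. The sketch below uses Hugging Face Transformers; the public `bert-base-uncased` and `gpt2` checkpoints and `num_labels=4` (four answer options per question) are assumptions for illustration, not necessarily the paper's exact setup.

```python
from transformers import (BertConfig, BertForSequenceClassification,
                          GPT2Config, GPT2ForSequenceClassification)

# 12-layer teacher: the standard bert-base configuration.
teacher = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4)

# 2-layer student: same architecture with the depth reduced to 2 layers,
# initialised from scratch rather than from the pretrained checkpoint.
student_cfg = BertConfig.from_pretrained(
    "bert-base-uncased", num_hidden_layers=2, num_labels=4)
student = BertForSequenceClassification(student_cfg)

# GPT-2 variants can be built the same way via `n_layer`.
# (Pad-token handling, which GPT-2 classification heads require for
# batched inputs, is omitted here for brevity.)
gpt2_student_cfg = GPT2Config.from_pretrained("gpt2", n_layer=2, num_labels=4)
gpt2_student = GPT2ForSequenceClassification(gpt2_student_cfg)
```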
These results show that the Data Laundering method can reliably inflate benchmark scores, revealing how vulnerable benchmarks are to contamination introduced during training. The method generalizes across architectures, model sizes, and intermediate training datasets: regardless of these variations, it consistently leaks benchmark knowledge into the student and artificially boosts its performance. While the inflated scores obscure models' true reasoning capabilities, they also expose where benchmark design must evolve to ensure robustness and fairness, underscoring the need for stronger safeguards against contamination.
These results demonstrate the persistent issue of knowledge leakage across all configurations, regardless of the choice of loss function (MSE or KL divergence) or the soft-label weighting.
Even after multiple iterations of knowledge distillation, where the test set is never directly observed during training, information about the benchmark remains embedded in the model. This leakage persists across iterations, undermining the validity of benchmark scores as indicators of genuine model capabilities.
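The multi-iteration setting can be reproduced by chaining the layering phase, with each distilled student serving as the teacher for the next generation. The sketch below reuses the `layering` helper from the pipeline sketch above and assumes a hypothetical `make_student` factory that returns a freshly initialised 2-layer model.

```python
def iterated_laundering(teacher, make_student, intermediate_loader,
                        generations=3):
    """Repeatedly distill: teacher -> student -> new teacher -> ...

    No generation ever trains on the benchmark, yet benchmark-specific
    knowledge keeps propagating through the chain.
    """
    current_teacher = teacher
    for _ in range(generations):
        student = make_student()          # fresh, randomly initialised student
        student = layering(current_teacher, student, intermediate_loader)
        current_teacher = student          # this student teaches the next round
    return current_teacher
```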
Remarkably, test-set knowledge leakage persists even with extremely small intermediate datasets of as few as 500 samples.
@misc{mansurov2024datalaunderingartificiallyboosting,
      title={Data Laundering: Artificially Boosting Benchmark Results through Knowledge Distillation},
      author={Jonibek Mansurov and Akhmed Sakip and Alham Fikri Aji},
      year={2024},
      eprint={2412.15255},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.15255},
}