| Documentation | Leaderboard | Paper (Coming Soon) | Twitter/X | Developer Slack |
Latest News 🔥
[Latest] We have fixed some bugs and released a new version of DD-Ranking. Please install the latest version via `pip install ddranking==0.1.4` or `pip install ddranking --upgrade`.
- [2025/02] We have fixed some bugs and released a new version of DD-Ranking. Please update your package via `pip install ddranking==0.1.4` or `pip install ddranking --upgrade`.
- [2025/01] Our PyPI package is officially released! Users can now install DD-Ranking via `pip install ddranking`.
- [2024/12/28] We officially released DD-Ranking! DD-Ranking provides a new benchmark that decouples the impacts of knowledge distillation and data augmentation.
Introduction
Dataset Distillation (DD) aims to condense a large dataset into a much smaller one that allows a model to achieve comparable performance after training on it. DD has gained extensive attention since it was proposed. Building on foundational methods such as DC, DM, and MTT, various works have further pushed this area forward with novel designs.
Notably, more and more methods are transitioning from "hard labels" to "soft labels" in dataset distillation, especially during evaluation. Hard labels are categorical, in the same format as the labels of the real dataset. Soft labels are the outputs of a pre-trained teacher model. Recently, Deng et al. pointed out that "a label is worth a thousand images", showing analytically that soft labels are extremely useful for improving accuracy.
However, since the essence of soft labels is knowledge distillation, we find that when applying the same evaluation method to randomly selected data, the test accuracy also improves significantly (see the figure above).
This makes us wonder: Can the test accuracy of the model trained on distilled data reflect the real informativeness of the distilled data?
Additionally, we have discovered that relying solely on test accuracy to demonstrate performance is unfair in the following three respects:
- Results of using hard and soft labels are not directly comparable since soft labels introduce teacher knowledge.
- Strategies for using soft labels are diverse. For instance, different objective functions are used during evaluation, such as soft Cross-Entropy and Kullback–Leibler divergence (see the sketch after this list). Also, one image may be mapped to one or multiple soft labels.
- Different data augmentations are used during evaluation.
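To make the second point concrete, here is a minimal PyTorch sketch (not DD-Ranking's implementation) of two objectives commonly used when training a student on soft labels; the function names and the temperature parameter are illustrative:

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(student_logits, teacher_probs):
    # Soft Cross-Entropy: -sum_c p_teacher(c) * log p_student(c), averaged over the batch
    log_p_student = F.log_softmax(student_logits, dim=1)
    return -(teacher_probs * log_p_student).sum(dim=1).mean()

def kl_divergence(student_logits, teacher_logits, temperature=1.0):
    # KL divergence between temperature-scaled teacher and student distributions
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Example: a batch of 4 samples over 10 classes with random logits
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
print(soft_cross_entropy(student_logits, F.softmax(teacher_logits, dim=1)))
print(kl_divergence(student_logits, teacher_logits, temperature=4.0))
```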
Motivated by this, we propose DD-Ranking, a new benchmark for DD evaluation. DD-Ranking provides a fair evaluation scheme for DD methods that can decouple the impacts from knowledge distillation and data augmentation to reflect the real informativeness of the distilled data.
Unfold to see more details.
DD-Ranking (DD, *i.e.*, Dataset Distillation) is an integrated and easy-to-use benchmark for dataset distillation. It aims to provide a fair evaluation scheme for DD methods that can decouple the impacts from knowledge distillation and data augmentation to reflect the real informativeness of the distilled data.
Benchmark
Revisit the original goal of dataset distillation:
The idea is to synthesize a small number of data points that do not need to come from the correct data distribution, but will, when given to the learning algorithm as training data, approximate the model trained on the original data. (Wang et al., 2020)
The evaluation method for DD-Ranking is grounded in the essence of dataset distillation, aiming to better reflect the informativeness of the synthesized data by assessing the following two aspects:
- The degree to which the real dataset is recovered under hard labels (hard label recovery): $\text{HLR} = \text{Acc.}_{\text{real-hard}} - \text{Acc.}_{\text{syn-hard}}$.
- The improvement over random selection when using personalized evaluation methods (improvement over random): $\text{IOR} = \text{Acc.}_{\text{syn-any}} - \text{Acc.}_{\text{rdm-any}}$.

$\text{Acc.}$ is the accuracy of models trained on different samples. Samples' marks are as follows:

- $\text{real-hard}$: Real dataset with hard labels;
- $\text{syn-hard}$: Synthetic dataset with hard labels;
- $\text{syn-any}$: Synthetic dataset with personalized evaluation methods (hard or soft labels);
- $\text{rdm-any}$: Randomly selected dataset (under the same compression ratio) with the same personalized evaluation methods.
DD-Ranking uses a weighted sum of $\text{IOR}$ and $-\text{HLR}$ to rank different methods:

$\alpha = w\,\text{IOR} - (1-w)\,\text{HLR}, \quad w \in [0, 1]$

Formally, the DD-Ranking Score (DDRS) is defined as:

$\text{DDRS} = \frac{e^{\alpha} - e^{-1}}{e - e^{-1}}$

By default, we set $w = 0.5$ on the leaderboard, meaning that HLR and IOR are weighted equally.
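As an illustration of the definitions above, the following sketch computes HLR, IOR, and the DD-Ranking score from placeholder accuracies (the numbers are made up, and accuracies are assumed to be fractions in $[0, 1]$):

```python
import math

# Placeholder accuracies (fractions in [0, 1]); replace with your own measurements.
acc_real_hard = 0.84  # trained on the full real dataset with hard labels
acc_syn_hard = 0.55   # trained on the synthetic dataset with hard labels
acc_syn_any = 0.67    # synthetic dataset with the method's own evaluation setup
acc_rdm_any = 0.60    # random subset (same compression ratio), same evaluation setup

hlr = acc_real_hard - acc_syn_hard  # hard label recovery (lower is better)
ior = acc_syn_any - acc_rdm_any     # improvement over random (higher is better)

w = 0.5                              # default weight on the leaderboard
alpha = w * ior - (1 - w) * hlr
ddrs = (math.exp(alpha) - math.exp(-1)) / (math.e - math.exp(-1))
print(f"HLR={hlr:.3f}  IOR={ior:.3f}  DDRS={ddrs:.3f}")
```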
DD-Ranking is integrated with:
- Multiple strategies of using soft labels in existing works;
- Commonly used data augmentation methods in existing works;
- Commonly used model architectures in existing works.
DD-Ranking has the following features:
- Fair Evaluation: DD-Ranking provides a fair evaluation scheme for DD methods that can decouple the impacts from knowledge distillation and data augmentation to reflect the real informativeness of the distilled data.
- Easy-to-use: DD-Ranking provides a unified interface for dataset distillation evaluation.
- Extensible: DD-Ranking supports various datasets and models.
- Customizable: DD-Ranking supports various data augmentations and soft label strategies.
DD-Ranking currently includes the following datasets and methods (categorized by hard/soft label). Our replication of the following baselines can be found at the methods branch. Evaluation results can be found in the leaderboard and evaluation configurations can be found at the eval branch.
| Supported Dataset | Evaluated Hard Label Methods | Evaluated Soft Label Methods |
|---|---|---|
| CIFAR-10 | DC | DATM |
| CIFAR-100 | DSA | SRe2L |
| TinyImageNet | DM | RDED |
| | MTT | D4M |
Install DD-Ranking with `pip` or from source:

From pip

```bash
pip install ddranking
```

From source

```bash
python setup.py install
```
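After installing, a quick way to confirm the package is importable (a minimal sanity check; the printed message is just illustrative):

```python
# Minimal post-install sanity check: import the package and the evaluator class
# used in the demo below. Assumes the installation above completed successfully.
import ddranking
from ddranking.metrics import SoftLabelEvaluator

print("ddranking imported:", ddranking.__name__)
```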
Below is a step-by-step guide on how to use our `dd_ranking`. This demo is based on soft labels (source code can be found in `demo_soft.py`). You can find the hard label demo in `demo_hard.py`.
Step 1: Initialize a soft-label metric evaluator object. Config files are recommended for specifying hyper-parameters. Sample config files are provided here.
```python
from ddranking.metrics import SoftLabelEvaluator
from ddranking.config import Config

config = Config.from_file("./configs/Demo_Soft_Label.yaml")
soft_label_metric_calc = SoftLabelEvaluator(config)
```
You can also pass keyword arguments.
device = "cuda"
method_name = "DATM" # Specify your method name
ipc = 10 # Specify your IPC
dataset = "CIFAR10" # Specify your dataset name
syn_data_dir = "./data/CIFAR10/IPC10/" # Specify your synthetic data path
real_data_dir = "./datasets" # Specify your dataset path
model_name = "ConvNet-3" # Specify your model name
teacher_dir = "./teacher_models" # Specify your path to teacher model chcekpoints
im_size = (32, 32) # Specify your image size
dsa_params = { # Specify your data augmentation parameters
"prob_flip": 0.5,
"ratio_rotate": 15.0,
"saturation": 2.0,
"brightness": 1.0,
"contrast": 0.5,
"ratio_scale": 1.2,
"ratio_crop_pad": 0.125,
"ratio_cutout": 0.5
}
save_path = f"./results/{dataset}/{model_name}/IPC{ipc}/dm_hard_scores.csv"
""" We only list arguments that usually need specifying"""
soft_label_metric_calc = SoftLabelEvaluator(
dataset=dataset,
real_data_path=real_data_dir,
ipc=ipc,
model_name=model_name,
soft_label_criterion='sce', # Use Soft Cross Entropy Loss
soft_label_mode='S', # Use one-to-one image to soft label mapping
data_aug_func='dsa', # Use DSA data augmentation
aug_params=dsa_params, # Specify dsa parameters
im_size=im_size,
stu_use_torchvision=False,
tea_use_torchvision=False,
teacher_dir='./teacher_models',
device=device,
save_path=save_path
)
```
For a detailed explanation of the hyper-parameters, please refer to our documentation.
Step 2: Load your synthetic data, labels (if any), and learning rate (if any).
```python
syn_images = torch.load('/your/path/to/syn/images.pt')
# You must specify your soft labels if your soft label mode is 'S'
soft_labels = torch.load('/your/path/to/syn/labels.pt')
syn_lr = torch.load('/your/path/to/syn/lr.pt')
```
Step 3: Compute the metric.
```python
metric = soft_label_metric_calc.compute_metrics(image_tensor=syn_images, soft_labels=soft_labels, syn_lr=syn_lr)
# Alternatively, you can specify the image folder path to compute the metric
metric = soft_label_metric_calc.compute_metrics(image_path='./your/path/to/syn/images', soft_labels=soft_labels, syn_lr=syn_lr)
```
The following results will be returned to you:
- `HLR mean`: The mean of hard label recovery over `num_eval` runs.
- `HLR std`: The standard deviation of hard label recovery over `num_eval` runs.
- `IOR mean`: The mean of improvement over random over `num_eval` runs.
- `IOR std`: The standard deviation of improvement over random over `num_eval` runs.
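If you want a single number to compare methods, you can plug the returned means into the DD-Ranking score defined earlier. The sketch below assumes `compute_metrics` returns a dictionary keyed exactly as listed above and uses the default weight $w = 0.5$; adjust the key names to match what your version of the package actually returns:

```python
import math

w = 0.5  # default weight between IOR and HLR
alpha = w * metric["IOR mean"] - (1 - w) * metric["HLR mean"]
ddrs = (math.exp(alpha) - math.exp(-1)) / (math.e - math.exp(-1))

print(f"HLR = {metric['HLR mean']:.4f} ± {metric['HLR std']:.4f}")
print(f"IOR = {metric['IOR mean']:.4f} ± {metric['IOR std']:.4f}")
print(f"DD-Ranking score (w=0.5) = {ddrs:.4f}")
```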
Check out our documentation to learn more.
- Evaluation results on ImageNet subsets.
- More baseline methods.
- DD-Ranking scores that decouple the impacts from data augmentation.
Feel free to submit your results to update the DD-Ranking leaderboard. We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.
- Zekai Li* (National University of Singapore)
- Xinhao Zhong* (National University of Singapore)
- Zhiyuan Liang (University of Science and Technology of China)
- Yuhao Zhou (Sichuan University)
- Mingjia Shi (Sichuan University)
- Dongwen Tang (National University of Singapore)
- Ziqiao Wang (National University of Singapore)
- Wangbo Zhao (National University of Singapore)
- Xuanlei Zhao (National University of Singapore)
- Haonan Wang (National University of Singapore)
- Ziheng Qin (National University of Singapore)
- Dai Liu (Technical University of Munich)
- Kaipeng Zhang (Shanghai AI Lab)
- Tianyi Zhou (A*STAR)
- Zheng Zhu (Tsinghua University)
- Kun Wang (University of Science and Technology of China)
- Guang Li (Hokkaido University)
- Junhao Zhang (National University of Singapore)
- Jiawei Liu (National University of Singapore)
- Yiran Huang (Technical University of Munich)
- Lingjuan Lyu (Sony)
- Jiancheng Lv (Sichuan University)
- Yaochu Jin (Westlake University)
- Zeynep Akata (Technical University of Munich)
- Jindong Gu (Oxford University)
- Rama Vedantam (Independent Researcher)
- Mike Shou (National University of Singapore)
- Zhiwei Deng (Google DeepMind)
- Yan Yan (University of Illinois at Chicago)
- Yuzhang Shang (University of Illinois at Chicago)
- George Cazenavette (Massachusetts Institute of Technology)
- Xindi Wu (Princeton University)
- Justin Cui (University of California, Los Angeles)
- Tianlong Chen (University of North Carolina at Chapel Hill)
- Angela Yao (National University of Singapore)
- Baharan Mirzasoleiman (University of California, Los Angeles)
- Hakan Bilen (University of Edinburgh)
- Manolis Kellis (Massachusetts Institute of Technology)
- Konstantinos N. Plataniotis (University of Toronto)
- Bo Zhao (Shanghai Jiao Tong University)
- Zhangyang Wang (University of Texas at Austin)
- Yang You (National University of Singapore)
- Kai Wang (National University of Singapore)
* equal contribution
DD-Ranking is released under the MIT License. See LICENSE for more details.
- Dataset Distillation, Wang et al., in arXiv 2018.
- Dataset Condensation with Gradient Matching, Zhao et al., in ICLR 2021.
- Dataset Condensation with Differentiable Siamese Augmentation, Zhao & Bilen, in ICML 2021.
- Dataset Distillation via Matching Training Trajectories, Cazenavette et al., in CVPR 2022.
- Dataset Distillation with Distribution Matching, Zhao & Bilen, in WACV 2023.
- Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective, Yin et al., in NeurIPS 2023.
- Towards Lossless Dataset Distillation via Difficulty-Aligned Trajectory Matching, Guo et al., in ICLR 2024.
- On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm, Sun et al., in CVPR 2024.
- D4M: Dataset Distillation via Disentangled Diffusion Model, Su et al., in CVPR 2024.
If you find DD-Ranking useful in your research, please consider citing the following paper:
```bibtex
@misc{li2024ddranking,
title = {DD-Ranking: Rethinking the Evaluation of Dataset Distillation},
author = {Li, Zekai and Zhong, Xinhao and Liang, Zhiyuan and Zhou, Yuhao and Shi, Mingjia and Wang, Ziqiao and Zhao, Wangbo and Zhao, Xuanlei and Wang, Haonan and Qin, Ziheng and Liu, Dai and Zhang, Kaipeng and Zhou, Tianyi and Zhu, Zheng and Wang, Kun and Li, Guang and Zhang, Junhao and Liu, Jiawei and Huang, Yiran and Lyu, Lingjuan and Lv, Jiancheng and Jin, Yaochu and Akata, Zeynep and Gu, Jindong and Vedantam, Rama and Shou, Mike and Deng, Zhiwei and Yan, Yan and Shang, Yuzhang and Cazenavette, George and Wu, Xindi and Cui, Justin and Chen, Tianlong and Yao, Angela and Kellis, Manolis and Plataniotis, Konstantinos N. and Zhao, Bo and Wang, Zhangyang and You, Yang and Wang, Kai},
year = {2024},
howpublished = {GitHub repository},
url = {https://github.com/NUS-HPC-AI-Lab/DD-Ranking}
}
```