[2019 CLUSTER] Efficient User-Level Storage Disaggregation for Deep Learning. [PDF]
[2020 FAST] Quiver: An Informed Storage Cache for Deep Learning. [PDF] [Slides]
[2020 ICPP] DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training. [PDF] [Slides]
[2021 VLDB] Analyzing and Mitigating Data Stalls in DNN Training. [PDF] [DS-Analyzer] [CoorDL Code]
[2021 VLDB] tf.data: A Machine Learning Data Processing Framework. [PDF]
[2021 ATC] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training. [PDF] [Slides]
[2022 SIGMOD] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines. [PDF] [Recording] [Code]
[2022 ATC] Cachew: Machine Learning Input Data Processing as a Service. [PDF] [Code]
[2022 TPDS] DIESEL+: Accelerating Distributed Deep Learning Tasks on Image Datasets. [PDF]
[2022 CLUSTER] HVAC: Removing I/O Bottleneck for Large-Scale Deep Learning Applications. [PDF]
[2023 FAST] SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training. [PDF] [Slides] [Code]
[2023 HPCA] iCache: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training. [PDF] [Slides] [Code]
[2023 ATC] Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training. [PDF] [Slides]
[2023 SoCC] tf.data service: A Case for Disaggregating ML Input Data Processing. [PDF]
[2024 ATC] Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement. [PDF] [Code]
[2020 CCGrid] DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models. [PDF]
[2020 ICML] On Efficient Constructions of Checkpoints. [PDF]
[2020 ICPP] Delta-DNN: Efficiently Compressing Deep Neural Networks via Exploiting Floats Similarity. [PDF]
[2021 FAST] CheckFreq: Frequent, Fine-Grained DNN Checkpointing. [PDF] [Slides] [Code]
[2021 ICCD] QD-Compressor: A Quantization-Based Delta Compression Framework for Deep Neural Networks. [PDF]
[2022 NSDI] Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models. [PDF] [Slides]
[2023 TPDS] Design of a Quantization-Based DNN Delta Compression Framework for Model Snapshots and Federated Learning. [PDF]
[2023 SOSP] GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. [PDF] [Slides]
[2023 ICCD] A Cost-Efficient Failure-Tolerant Scheme for Distributed DNN Training. [PDF] [Slides] [Code]
[2024 EuroSys] Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures. [PDF]
[2024 ICDCS] Portus: Efficient DNN Checkpointing to Persistent Memory with Zero-Copy. [PDF]
[2024 HPDC] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models. [PDF] [Code]
[2024 ICML] ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking. [PDF] [Code]
[2024 ICCD] ParaCkpt: Heterogeneous Multi-Path Checkpointing Mechanism for Training Deep Learning Models. [PDF]
[2024 SoCC] Inshrinkerator: Compressing Deep Learning Training Checkpoints via Dynamic Quantization. [PDF]
[2024 TMC] CheckBullet: A Lightweight Checkpointing System for Robust Model Training on Mobile Networks. [PDF]
[2025 FCS] BAFT: Bubble-Aware Fault-Tolerant Framework for Distributed DNN Training with Hybrid Parallelism. [PDF]
[2023 NSDI] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs. [PDF] [Slides] [Code]
[2023 SOSP] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates. [PDF] [Code]
[2023 VLDB] Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding. [PDF] [Code]
[2024 TPDS] Swift: Expedited Failure Recovery for Large-Scale DNN Training. [PDF] [Code]
[2024 SOSP] ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation. [PDF] [Slides] [Poster]
[2023 ICS] DStore: A Lightweight Scalable Learning Model Repository with Fine-Grain Tensor-Level Access. [PDF]
[2024 HPDC] EvoStore: Towards Scalable Storage of Evolving Learning Models. [PDF]
[2023 SOSP] Efficient Memory Management for Large Language Model Serving with PagedAttention. [PDF] [Code]
[2021 CCGrid] DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications. [PDF] [Code]