[2019 CLUSTER] Efficient User-Level Storage Disaggregation for Deep Learning. [PDF]
[2020 FAST] Quiver: An Informed Storage Cache for Deep Learning. [PDF] [Slides]
[2020 ICPP] DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training. [PDF] [Slides]
[2021 VLDB] Analyzing and Mitigating Data Stalls in DNN Training. [PDF] [DS-Analyzer] [CoorDL Code]
[2021 VLDB] tf.data: A Machine Learning Data Processing Framework. [PDF]
[2021 ATC] Refurbish Your Training Data: Reusing Partially Augmented Samples for Faster Deep Neural Network Training. [PDF] [Slides]
[2022 SIGMOD] Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines. [PDF] [Recording] [Code]
[2022 ATC] Cachew: Machine Learning Input Data Processing as a Service. [PDF] [Code]
[2022 TPDS] DIESEL+: Accelerating Distributed Deep Learning Tasks on Image Datasets. [PDF]
[2022 CLUSTER] HVAC: Removing I/O Bottleneck for Large-Scale Deep Learning Applications. [PDF]
[2023 FAST] SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training. [PDF] [Slides] [Code]
[2023 HPCA] iCache: An Importance-Sampling-Informed Cache for Accelerating I/O-Bound DNN Model Training. [PDF] [Slides] [Code]
[2023 ATC] Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training. [PDF] [Slides]
[2023 SoCC] tf.data service: A Case for Disaggregating ML Input Data Processing. [PDF]
[2024 ATC] Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement. [PDF] [Code]
[2020 CCGrid] DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models. [PDF]
[2020 ICML] On Efficient Constructions of Checkpoints. [PDF]
[2020 ICPP] Delta-DNN: Efficiently Compressing Deep Neural Networks via Exploiting Floats Similarity. [PDF]
[2021 FAST] CheckFreq: Frequent, Fine-Grained DNN Checkpointing. [PDF] [Slides] [Code]
[2021 ICCD] QD-Compressor: A Quantization-Based Delta Compression Framework for Deep Neural Networks. [PDF]
[2022 NSDI] Check-N-Run: a Checkpointing System for Training Deep Learning Recommendation Models. [PDF] [Slides]
[2023 TPDS] Design of a Quantization-Based DNN Delta Compression Framework for Model Snapshots and Federated Learning. [PDF]
[2023 SOSP] GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. [PDF] [Slides]
[2023 ICCD] A Cost-Efficient Failure-Tolerant Scheme for Distributed DNN Training. [PDF] [Slides] [Code]
[2024 EuroSys] Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures. [PDF]
[2024 ICDCS] Portus: Efficient DNN Checkpointing to Persistent Memory with Zero-Copy. [PDF]
[2024 HPDC] DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models. [PDF] [Code]
[2024 ICML] ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking. [PDF] [Code]
[2024 ICCD] ParaCkpt: Heterogeneous Multi-Path Checkpointing Mechanism for Training Deep Learning Models. [PDF]
[2024 SoCC] Inshrinkerator: Compressing Deep Learning Training Checkpoints via Dynamic Quantization. [PDF]
[2024 TMC] CheckBullet: A Lightweight Checkpointing System for Robust Model Training on Mobile Networks. [PDF]
[2025 FCS] BAFT: Bubble-Aware Fault-Tolerant Framework for Distributed DNN Training with Hybrid Parallelism. [PDF]
[2023 NSDI] Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs. [PDF] [Slides] [Code]
[2023 SOSP] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates. [PDF] [Code]
[2023 VLDB] Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding. [PDF] [Code]
[2024 TPDS] Swift: Expedited Failure Recovery for Large-Scale DNN Training. [PDF] [Code]
[2024 SOSP] ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation. [PDF] [Slides] [Poster]
[2023 ICS] DStore: A Lightweight Scalable Learning Model Repository with Fine-Grain Tensor-Level Access. [PDF]
[2024 HPDC] EvoStore: Towards Scalable Storage of Evolving Learning Models. [PDF]
[2023 SOSP] Efficient Memory Management for Large Language Model Serving with PagedAttention. [PDF] [Code]
[2021 CCGrid] DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications. [PDF] [Code]