This is the codebase for our paper accepted at HPCA 2023.
All the code is stored in the `deepcache-go` directory, which has two subdirectories: `dcclient-python` and `dcserver-go`. `dcclient-python` is a modified version of the PyTorch framework used for deep neural network training, while `dcserver-go` is a Golang-based, distributed training-data caching system that responds to client requests. To improve code structure and readability, we've separated different functions into their respective subdirectories.
To get started with this project, please follow these steps to obtain the code and install the necessary packages.
- Clone the repository to your local machine and navigate to the project directory.

  ```bash
  git clone https://github.com/ISCS-ZJU/iCache
  cd iCache
  ```
- Create a conda env and install the Python dependencies.

  ```bash
  conda create -n icache python==3.9   # create conda env
  conda activate icache                # activate env
  pip3 install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
  pip3 install six grpcio==1.46.1 grpcio-tools==1.46.1 requests==2.27.1   # install dependencies
  ```
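
  Optionally, you can verify that the pinned PyTorch build and gRPC packages were installed correctly:

  ```bash
  # quick sanity check of the Python environment
  python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"   # expect 1.8.0+cu111 and True on a GPU node
  python3 -c "import grpc; print(grpc.__version__)"                                # expect 1.46.1
  ```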
- Install golang and enable Go modules.

  ```bash
  wget https://go.dev/dl/go1.18.2.linux-amd64.tar.gz
  rm -rf /usr/local/go && tar -C /usr/local -xzf go1.18.2.linux-amd64.tar.gz
  export PATH=$PATH:/usr/local/go/bin   # add this line at the end of your ~/.bashrc file
  source ~/.bashrc                      # apply the changes immediately
  go version                            # verify that golang has been installed successfully
  go env -w GO111MODULE=on              # turn on go modules
  ```
- To prepare for training, install a remote parallel file system (PFS) on the storage nodes and store the training data on the PFS. Make sure the computing nodes can access the training data on the storage nodes.

  NOTE: In our paper, we used OrangeFS as the parallel file system. Depending on your specific environment, you may choose other parallel file systems instead. To install OrangeFS, please refer to the detailed installation process outlined in this link.
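
  As a quick sanity check (the mount point `/mnt/orangefs` and dataset directory `imagenet` below are hypothetical; substitute your own paths), you can confirm from a computing node that the training data on the PFS is reachable:

  ```bash
  # hypothetical paths; adjust to wherever your PFS is mounted and your dataset is stored
  df -h /mnt/orangefs                      # the PFS mount should be present
  ls /mnt/orangefs/imagenet/train | head   # training data should be listable from the computing node
  ```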
- Because we have extended iCache into a distributed version, you also need to install etcd, which serves as the distributed KV store. To do this, please download `etcd-3.4.4-linux-amd64` from here, extract the contents of the archive, and add the path of the `etcd` executable files to your `PATH` environment variable.
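
  One possible way to do this is shown below (the release URL and install location are assumptions; any directory that ends up on your `PATH` works):

  ```bash
  # download and extract the release archive (URL assumed from the etcd GitHub releases page)
  wget https://github.com/etcd-io/etcd/releases/download/v3.4.4/etcd-v3.4.4-linux-amd64.tar.gz
  tar -xzf etcd-v3.4.4-linux-amd64.tar.gz
  export PATH=$PATH:$PWD/etcd-v3.4.4-linux-amd64   # directory containing the etcd and etcdctl binaries
  etcd --version                                   # verify the installation
  ```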
- For the Python client side, modify the related files in the PyTorch framework.

  ```bash
  cd deepcache-go/dcclient-python/
  python3 prep-1.8.py
  ```
- Enable Python and Golang communication by re-generating the gRPC files.

  ```bash
  cd deepcache-go/dcclient-python/dcrpc
  protoc \
    --go_out=Mgrpc/service_config/service_config.proto=/internal/proto/grpc_service_config:. \
    --go-grpc_out=Mgrpc/service_config/service_config.proto=/internal/proto/grpc_service_config:. \
    --go_opt=paths=source_relative \
    --go-grpc_opt=paths=source_relative \
    deepcache.proto
  python -m grpc_tools.protoc --python_out=. --grpc_python_out=. -I. deepcache.proto
  cd deepcache-go/dcserver-go/rpc/cache/   # from the repo root; then execute the same two commands above
  ```
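
  If `protoc` complains that the Go plugins are missing, they can usually be installed as below (the pinned versions are examples, not requirements from this repo); the generated files then follow the standard protoc naming convention, e.g. `deepcache_pb2.py`/`deepcache_pb2_grpc.py` for Python and `deepcache.pb.go`/`deepcache_grpc.pb.go` for Go.

  ```bash
  # example plugin installation; versions are illustrative
  go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.28.0
  go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@v1.2.0
  export PATH=$PATH:$(go env GOPATH)/bin   # make the protoc-gen-go* plugins visible to protoc
  ```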
- Configure the cache server yaml file `conf/dcserver.yaml` in `deepcache-go/dcserver-go`.

  NOTE: When running DNN training on a single computing node, enter the IP address of the current node in the `etcd_nodes` field. For distributed DNN training across multiple nodes, enter the IP addresses of all nodes, separated by commas, in the `etcd_nodes` field.
To run iCache on a single computing node, please follow these steps:
- Start etcd on the single node.

  ```bash
  rm -rf etcd_name0.etcd && etcd --name etcd_name0 \
    --listen-peer-urls http://{node_ip}:2380 --initial-advertise-peer-urls http://{node_ip}:2380 \
    --listen-client-urls http://{node_ip}:2379 --advertise-client-urls http://{node_ip}:2379 \
    --initial-cluster etcd_name0=http://{node_ip}:2380,
  ```
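
  Before starting the cache server, you can optionally confirm that etcd is healthy (this assumes `etcdctl` from the same etcd release is on your `PATH`):

  ```bash
  # optional sanity check against the client URL configured above
  ETCDCTL_API=3 etcdctl --endpoints=http://{node_ip}:2379 endpoint health
  ```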
- Start the cache server.

  ```bash
  cd deepcache-go/dcserver-go
  go run dcserver.go   # you can use `-config yaml_file_path` to specify the target yaml file to run
  ```
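
  To check that the server is up before launching the client, you can look for its listening ports (18283/18284 below are the ports that appear in the client flags in the next step; the actual values depend on your yaml):

  ```bash
  # assumed port numbers, taken from the client flags below
  ss -ltn | grep -E '18283|18284'
  ```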
- Open another terminal and start the client DNN training.

  ```bash
  cd deepcache-go/dcclient-python
  CUDA_VISIBLE_DEVICES=0 time python3 -u cached_image_classification.py --arch resnet18 --epochs 90 -b 256 --worker 2 -p 20 --num-classes 10 --req-addr http://127.0.0.1:18283/ --grpc-port {node_ip}:18284
  # if the cache_type in the yaml is isa, you should add `--sbp --reuse-factor 3 --warm-up 5` flags to specify
  # the tunable hyper-parameters used by the importance sampling algorithm.
  ```
To run iCache on multiple computing nodes (we will use two nodes with IP addresses `ip0` and `ip1` as an example), please follow these steps:
- Start a distributed KV store on each node.

  ```bash
  # on node ip0
  rm -rf etcd_name0.etcd && etcd --name etcd_name0 \
    --listen-peer-urls http://{ip0}:2380 --initial-advertise-peer-urls http://{ip0}:2380 \
    --listen-client-urls http://{ip0}:2379 --advertise-client-urls http://{ip0}:2379 \
    --initial-cluster etcd_name1=http://{ip1}:2380,etcd_name0=http://{ip0}:2380,

  # on node ip1
  rm -rf etcd_name1.etcd && etcd --name etcd_name1 \
    --listen-peer-urls http://{ip1}:2380 --initial-advertise-peer-urls http://{ip1}:2380 \
    --listen-client-urls http://{ip1}:2379 --advertise-client-urls http://{ip1}:2379 \
    --initial-cluster etcd_name1=http://{ip1}:2380,etcd_name0=http://{ip0}:2380,
  ```
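
  Once both etcd instances are up, you can optionally verify that they have formed a cluster (again assuming `etcdctl` is on your `PATH`):

  ```bash
  # should list both members, etcd_name0 and etcd_name1
  ETCDCTL_API=3 etcdctl --endpoints=http://{ip0}:2379 member list
  ```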
- Start a cache server on each of the two nodes, as shown below. Make sure the two yaml configuration files are identical.
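
  This is the same command as in the single-node case, run once per node:

  ```bash
  # run on both ip0 and ip1
  cd deepcache-go/dcserver-go
  go run dcserver.go   # optionally: -config yaml_file_path (the yaml must be identical on both nodes)
  ```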
- Open a client terminal on each node and start the two clients.

  ```bash
  # rank 0 (on node ip0)
  CUDA_VISIBLE_DEVICES=0 python3 -u cached_image_classification.py --arch resnet18 --epochs 5 --worker 2 -b 256 -p 20 --num-classes 10 --seed 2022 --multiprocessing-distributed --world-size 2 --rank 0 --ngpus 1 --dist-url 'tcp://{ip0}:1234' --req-addr http://127.0.0.1:18283/ --grpc-port {ip0}:18284

  # rank 1 (on node ip1)
  CUDA_VISIBLE_DEVICES=1 python3 -u cached_image_classification.py --arch resnet18 --epochs 5 --worker 2 -b 256 -p 20 --num-classes 10 --seed 2022 --multiprocessing-distributed --world-size 2 --rank 1 --ngpus 1 --dist-url 'tcp://{ip0}:1234' --req-addr http://127.0.0.1:18283/ --grpc-port {ip1}:18284

  # if the cache_type is isa, please add the flags stated previously.
  ```