History of HPC. HPC vs HTC/Big Data. Typical HPC problems. Architecture of supercomputers. Interconnect topologies: fat tree, torus. FLOPS, Top500. Amdahl’s law. Programming for HPC systems. MPI. HPC schedulers. InfiniBand vs TCP/IP. Google TPUs and TPU pods. Nvidia DGX systems and SuperPODs. Magnum IO. Distributed Deep Learning Model Training. Uber Horovod. Distributed Training in TensorFlow and PyTorch. Distributed Training on AWS and Azure.
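For the Amdahl’s law topic above, a quick reference form of the law; the symbols p (parallelizable fraction of the work) and N (number of processors) are the conventional ones, not notation taken from the course materials.

```latex
% Amdahl's law: speedup from parallelizing a fraction p of a program across N processors.
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}.
```

For example, with p = 0.95 and N = 1024 the speedup is only about 19.6, already close to the asymptotic limit of 20, which is why the serial fraction dominates at supercomputer scale.

For the MPI topic, a minimal sketch of the message-passing model using the mpi4py binding; the all-reduce toy example and the script name are illustrative assumptions (the lecture’s own examples may use the C API instead).

```python
# Minimal MPI all-reduce sketch using mpi4py (assumes an MPI runtime such as Open MPI is installed).
# Run with, e.g.: mpirun -np 4 python allreduce_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD            # communicator spanning all launched processes (ranks)
rank = comm.Get_rank()           # this process's id within the communicator
size = comm.Get_size()           # total number of processes

local_value = rank + 1                              # each rank contributes its own value
total = comm.allreduce(local_value, op=MPI.SUM)     # collective sum, result visible on every rank

print(f"rank {rank}/{size}: local={local_value}, global sum={total}")
```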
- Top500 supercomputers (site review)
- InfiniBand in the Top500 supercomputers (review)
- Nvidia Selene blog (skim)
- Training PyTorch models on Google TPU pods (skim)
- Distributed Data Parallel with PyTorch (review)
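To accompany the Distributed Data Parallel reading, a minimal, hedged sketch of a DDP training loop; the toy linear model, random data, gloo backend, and torchrun launch command are placeholder assumptions, not code from the tutorial.

```python
# Minimal PyTorch DistributedDataParallel (DDP) sketch.
# Launch with, e.g.: torchrun --nproc_per_node=4 ddp_demo.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, etc. in the environment.
    dist.init_process_group(backend="gloo")   # use "nccl" and move the model to its GPU on GPU nodes
    rank = dist.get_rank()

    model = nn.Linear(10, 1)                  # toy model standing in for a real network
    ddp_model = DDP(model)                    # wraps the model; gradients are all-reduced across ranks

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(5):
        x = torch.randn(32, 10)               # each rank draws its own shard of (random) data
        y = torch.randn(32, 1)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()                       # DDP synchronizes gradients during backward
        optimizer.step()
        if rank == 0:
            print(f"step {step}: loss = {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The key design point is that every rank runs the same script on its own data shard and DDP hooks into backward() to all-reduce the gradients; the same all-reduce pattern underlies Horovod and the TensorFlow/PyTorch distributed training topics listed above.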