History of HPC. HPC vs HTC/Big Data. Typical HPC problems. Architecture of supercomputers. Interconnect topologies: fat tree, torus. FLOPS, Top500. Amdahl’s law. Programming for HPC systems. MPI. HPC schedulers. InfiniBand vs TCP/IP. Google TPUs and TPU pods. Nvidia DGX systems and SuperPODs. Magnum IO. Distributed Deep Learning Model Training. Uber Horovod. Distributed Training in TensorFlow and PyTorch. Distributed Training on AWS and Azure.
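For the Amdahl’s law topic above, a quick reference form of the law; the symbols p (parallelizable fraction of the work) and N (number of processors) are the conventional ones, not notation taken from the course materials.

```latex
% Amdahl's law: speedup from parallelizing a fraction p of a program across N processors.
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}.
```

For example, with p = 0.95 and N = 1024 the speedup is only about 19.6, already close to the asymptotic limit of 20, which is why the serial fraction dominates at supercomputer scale.

For the MPI topic, a minimal sketch of the message-passing model using the mpi4py binding; the all-reduce toy example and the script name are illustrative assumptions (the lecture’s own examples may use the C API instead).

```python
# Minimal MPI all-reduce sketch using mpi4py (assumes an MPI runtime such as Open MPI is installed).
# Run with, e.g.: mpirun -np 4 python allreduce_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD            # communicator spanning all launched processes (ranks)
rank = comm.Get_rank()           # this process's id within the communicator
size = comm.Get_size()           # total number of processes

local_value = rank + 1                              # each rank contributes its own value
total = comm.allreduce(local_value, op=MPI.SUM)     # collective sum, result visible on every rank

print(f"rank {rank}/{size}: local={local_value}, global sum={total}")
```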
- Top500 supercomputers (site review)
- InfiniBand in the Top500 supercomputers (review)
- Nvidia Selene blog (skim)
- Training PyTorch models on Google TPU pods (skim)
- Distributed Data Parallel with PyTorch (review)
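To accompany the Distributed Data Parallel reading, a minimal, hedged sketch of a DDP training loop; the toy linear model, random data, gloo backend, and torchrun launch command are placeholder assumptions, not code from the tutorial.

```python
# Minimal PyTorch DistributedDataParallel (DDP) sketch.
# Launch with, e.g.: torchrun --nproc_per_node=4 ddp_demo.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, etc. in the environment.
    dist.init_process_group(backend="gloo")   # use "nccl" and move the model to its GPU on GPU nodes
    rank = dist.get_rank()

    model = nn.Linear(10, 1)                  # toy model standing in for a real network
    ddp_model = DDP(model)                    # wraps the model; gradients are all-reduced across ranks

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(5):
        x = torch.randn(32, 10)               # each rank draws its own shard of (random) data
        y = torch.randn(32, 1)
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()                       # DDP synchronizes gradients during backward
        optimizer.step()
        if rank == 0:
            print(f"step {step}: loss = {loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

The key design point is that every rank runs the same script on its own data shard and DDP hooks into backward() to all-reduce the gradients; the same all-reduce pattern underlies Horovod and the TensorFlow/PyTorch distributed training topics listed above.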