---
layout: page
title: CME 213 Course Schedule
description: Detailed class schedule
---
- Introduction to the class; syllabus.
- Homework, class material
- Why we need parallelism; example of parallel program: summing up numbers
- Shared memory and multicore processors
- Introduction to Pthreads; creating and joining threads
- Homework 1 out
- Example: multiplication of two matrices
- Mutexes; example: pizza restaurant delivery
- Condition variables; worked code example
- OpenMP; introduction; parallel regions
- Parallel for loops; matrix multiplication; sections; tasks; taskwait; using tasks for tree traversal and list traversal
- Homework 1 due
- Homework 2 out
- OpenMP wrap-up: reduction, atomic, critical, single, master, barrier, etc.
- Colfax lecture 1
- Colfax lecture 2
- Homework 3 out
- CUDA introduction; moving data between CPU and GPU memory; execution of a skeleton CUDA code (malloc, memcpy, kernel); threading model, basic commands, simple example programs
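The skeleton covered in this lecture — allocate on the device, copy in, launch, copy out — can be sketched as follows (array size and kernel body are illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* Kernel: each thread updates exactly one array element. */
__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)            /* guard: the grid may overshoot n */
        x[i] += 1.0f;
}

int main() {
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));                        /* GPU allocation */
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads_per_block = 256;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    add_one<<<blocks, threads_per_block>>>(d, n);             /* kernel launch */

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[0] = %g, h[%d] = %g\n", h[0], n - 1, h[n - 1]);
    return 0;
}
```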
- Homework 2 due
- Threads, warps, blocks; execution model for a kernel; resident blocks; executing an instruction for a warp; synchronizations between threads; SIMT model
- Execution model for a kernel; resident blocks; resources required by a grid block; resources available on an SM; memory and caches; occupancy
- Warps; coalescing and performance impact; memory access pattern; caching
- Example of matrix transpose
- Shared memory; bank conflicts
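The tiled matrix transpose is the standard sketch tying these lectures together: staging through shared memory makes both global reads and writes coalesced, and padding the tile by one column avoids shared-memory bank conflicts (TILE = 32 matches the warp width; host-side launch code omitted):

```cuda
#define TILE 32

/* Transpose an n-by-n matrix one TILE x TILE block at a time. */
__global__ void transpose(float *out, const float *in, int n) {
    /* +1 padding column: threads in a warp hit distinct banks when
       reading tile[threadIdx.x][...] below. */
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  /* coalesced read */

    __syncthreads();  /* the tile must be fully loaded before reuse */

    x = blockIdx.y * TILE + threadIdx.x;  /* transposed block origin */
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y]; /* coalesced write */
}
```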
- Lecture on homework 3; finite-difference stencil; algorithm based on global memory; loop over y to increase data-reuse and arithmetic intensity; shared memory algorithm
- Homework 3 due
- Homework 4 out
- Students were asked to form teams and devise an efficient procedure for quickly computing the cumulative sum of many numbers
- Reduction algorithm; warps; shared memory; thread divergence; bank conflicts; thread blocks; use of atomics
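A sketch of the tree-style block reduction discussed here, combining shared memory, a halving stride that keeps active threads contiguous (so whole warps retire together and shared-memory accesses are conflict-free), and a final atomic to merge per-block results. Assumes a power-of-two block size and a launch with `blockDim.x * sizeof(float)` bytes of dynamic shared memory:

```cuda
__global__ void block_sum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    /* Load one element per thread; pad the tail with zeros. */
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    /* Tree reduction: stride halves each step. Threads tid < s stay
       active, so divergence only appears once s drops below warp size. */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd(out, sdata[0]);  /* combine the per-block partial sums */
}
```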
- Discussion of Thrust; segmented algorithms; examples of problems that can be broken into Thrust algorithms; lambda functions and placeholders
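A small sketch of composing Thrust algorithms with a placeholder expression (the vector length and the `2*x + 1` transform are illustrative; compile with `nvcc`):

```cpp
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <cstdio>

int main() {
    thrust::device_vector<float> x(1000);
    thrust::sequence(x.begin(), x.end());   // x = 0, 1, 2, ...

    // Placeholder expression instead of a hand-written functor:
    // y[i] = 2*x[i] + 1, element-wise on the device.
    using namespace thrust::placeholders;
    thrust::device_vector<float> y(x.size());
    thrust::transform(x.begin(), x.end(), y.begin(), 2.0f * _1 + 1.0f);

    // Chain a second algorithm: sum the transformed values.
    float total = thrust::reduce(y.begin(), y.end(), 0.0f);
    std::printf("total = %g\n", total);
    return 0;
}
```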
- Homework 4 due
- Homework 5 out
- Lecture on the final project
- Final project out; neural network for digit recognition
- OpenACC, NVIDIA
- CUDA optimization, flops and mems, NVIDIA
- CUDA profiling, nvvp, optimization guidelines, NVIDIA
- Message passing; introduction; MPI; collective communications; collective communications for final project
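A minimal sketch of a collective: every rank contributes a value and `MPI_Allreduce` returns the global sum to all ranks (run with `mpirun`; the contributed value is illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every rank contributes its rank id; all ranks receive the sum. */
    int local = rank, global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, global);

    MPI_Finalize();
    return 0;
}
```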
- Homework 5 due
- Point-to-point communication; deadlocks; ring communication; blocking vs. non-blocking; sample sort
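The ring exchange from this lecture can be sketched with non-blocking calls, which sidestep the deadlock that two matching blocking `MPI_Send` calls can produce when every rank sends first:

```c
#include <mpi.h>
#include <stdio.h>

/* Ring shift: each rank sends to (rank+1) % size and receives from
   (rank-1+size) % size. Posting the receive and send as non-blocking
   operations lets both progress regardless of ordering. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    int send_val = rank, recv_val = -1;
    MPI_Request reqs[2];
    MPI_Irecv(&recv_val, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&send_val, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* complete both */

    printf("rank %d received %d from rank %d\n", rank, recv_val, left);
    MPI_Finalize();
    return 0;
}
```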
- Memorial Day; no class
- Matrix-vector product; groups, communicators
- Cartesian topologies; application to matrix-vector products with 2D partitioning
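A minimal sketch of building the 2D Cartesian communicator used for 2D-partitioned matrix-vector products (grid shape chosen automatically by `MPI_Dims_create`):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI pick a balanced 2D grid shape, then build the topology.
       periods = {0,0}: no wraparound in either dimension. */
    int dims[2] = {0, 0}, periods[2] = {0, 0};
    MPI_Dims_create(size, 2, dims);

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, &cart);

    int coords[2];
    MPI_Cart_coords(cart, rank, 2, coords);
    printf("rank %d -> grid position (%d, %d)\n", rank, coords[0], coords[1]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```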
- Communication using graph topologies; neighbor_all_to_all communication; FEM applications
- One-page interim report due
- Performance metrics; speed-up, efficiency; Amdahl’s law; example: dot-product; efficiency and isoefficiency
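Amdahl's law as covered in this lecture: with serial fraction $f$ and $p$ processors,

```latex
S(p) = \frac{1}{f + \dfrac{1-f}{p}}, \qquad
\lim_{p \to \infty} S(p) = \frac{1}{f}
```

so, for example, a code that is 5% serial ($f = 0.05$) can never exceed a 20x speedup, no matter how many processors are used.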
- AlphaGo: board games, deep learning, and Monte Carlo tree search; algorithm and implementation
- Matrix-vector product with 1D and 2D partitioning; matrix-matrix products; Cannon and DNS algorithms; LU factorization algorithms
- Conclusion and wrap-up
- Final projects due; deadline: 11 PM