Distributed optimizer infrastructure for FP8 parameters #1723

timmoon10 · 2023-09-06T21:21:42Z

This PR refactors the distributed optimizer to enable support for FP8 parameters in NeMo. It adds the option to perform parameter all-gathers in integer dtypes and adds two member functions, _check_params_shard_dtypes and _param_copy_fragments, to handle casting into and out of the all-gather buffer. For now, these functions either do a direct cast for floating-point dtypes or copy the most significant bytes for other dtypes. I plan to override these functions in the NeMo derived class so that it casts to FP8, performs the all-gather in UINT8, and unpacks into a custom FP8 tensor class.
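To make the "copy the most significant bytes" fallback concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the helper names pack_most_significant_bytes and unpack_most_significant_bytes are hypothetical, the source values are assumed to be float32, and byte order is assumed to be little-endian. The real functions operate on PyTorch tensors inside the optimizer.

```python
import struct


def pack_most_significant_bytes(values, num_bytes):
    """Keep only the top `num_bytes` bytes of each float32 value.

    Hypothetical helper for illustration. Assumes little-endian storage,
    where the most significant bytes of each value come last.
    """
    if not 1 <= num_bytes <= 4:
        raise ValueError("num_bytes must be between 1 and 4")
    out = bytearray()
    for value in values:
        raw = struct.pack("<f", value)   # 4 bytes, least significant first
        out += raw[4 - num_bytes:]       # keep the trailing (most significant) bytes
    return bytes(out)


def unpack_most_significant_bytes(buffer, num_bytes):
    """Rebuild float32 values from a narrow buffer, zero-filling the dropped bytes."""
    if not 1 <= num_bytes <= 4:
        raise ValueError("num_bytes must be between 1 and 4")
    padding = bytes(4 - num_bytes)       # the discarded low-order bytes become zeros
    values = []
    for start in range(0, len(buffer), num_bytes):
        raw = padding + buffer[start:start + num_bytes]
        values.append(struct.unpack("<f", raw)[0])
    return values


# Values whose low-order mantissa bits are zero survive the round trip exactly.
narrow = pack_most_significant_bytes([1.0, -2.5], num_bytes=2)
restored = unpack_most_significant_bytes(narrow, num_bytes=2)
```

With num_bytes=2 the buffer is half the size of the float32 data, and any value that needs the discarded low-order bytes is truncated rather than rounded. That is why this path is only a generic fallback, and why a derived class would override it with a proper FP8 cast.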

This PR depends on #1719 and #1721.

Signed-off-by: Tim Moon <[email protected]>

apex/contrib/optimizers/distributed_fused_adam.py

Co-authored-by: Masaki Kozuki <[email protected]>

* Add distopt support for param syncs with non-floating-point dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update apex/contrib/optimizers/distributed_fused_adam.py

Co-authored-by: Masaki Kozuki <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Masaki Kozuki <[email protected]>

* Add update_scale_hysteresis
* Fix compile errors
* Massively reduce LayerNorm/RMSNorm GPU memory usage in modern networks by tricking torch autograd (#1715)
* input grad checks out
* adding clamp gamma
* Both old and proposed implementation checks out
* 2 tests not yet passed due to numerical issues
* mem_eff works
* fast-layer-norm done
* Moving mem-eff to templates
* Relax tolerance for memory efficient backward
* Fix backward api of python
* Distributed optimizer infrastructure for FP8 parameters (#1723)
* Add distopt support for param syncs with non-floating-point dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update apex/contrib/optimizers/distributed_fused_adam.py

Co-authored-by: Masaki Kozuki <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Masaki Kozuki <[email protected]>

* Add unit test
* Fix comment in unit test
* Remove unnecessary bits

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: Rui Wang <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Masaki Kozuki <[email protected]>

Add distopt support for param syncs with non-floating-point dtypes

c71321f

Signed-off-by: Tim Moon <[email protected]>

timmoon10 force-pushed the distopt-fp8 branch from fbc1ab4 to c71321f on September 6, 2023 23:59

crcrpar reviewed Sep 12, 2023


apex/contrib/optimizers/distributed_fused_adam.py (outdated, resolved)

Update apex/contrib/optimizers/distributed_fused_adam.py

7128896

Co-authored-by: Masaki Kozuki <[email protected]>

timmoon10 mentioned this pull request Sep 20, 2023

Distributed optimizer support for experimental FP8 tensors NVIDIA/NeMo#7469 (Closed, 8 tasks)

timmoon10 requested a review from crcrpar September 28, 2023 16:47

crcrpar approved these changes Sep 29, 2023


crcrpar merged commit 2386a91 into NVIDIA:master Sep 29, 2023

timmoon10 mentioned this pull request Nov 14, 2023

Distributed optimizer support for contiguous param buffer with FP8 params #1749 (Merged)
