remove old-style fsdp/ddp #1640
Conversation
d19a76d to a477a7c
I suspect the failure of the fsdp no_sync parity tests is caused by a mismatch between the parameters that `_sync_grads` collects and the ones to which `_stash_grad_for_fsdp_prim_impl` attaches the unsharded grads.
lightning-thunder/thunder/distributed/__init__.py, lines 157 to 159 in 5e18c2e:

```python
params_with_grad = tuple(filter(lambda p: hasattr(p, "_thunder_fsdp_unsharded_grad"), module.parameters()))
if not params_with_grad:
    return
```

lightning-thunder/thunder/executors/torchex.py, lines 2116 to 2128 in 5e18c2e:

```python
def _stash_grad_for_fsdp_prim_impl(
    grad: torch.Tensor,
    param_fqn: str,
    compile_data: CompileData,
) -> None:
    grad_name = "_thunder_fsdp_unsharded_grad"
    param = compile_data.fn.get_parameter(param_fqn)
    if torch.is_tensor(unsharded_grad := getattr(param, grad_name, None)):
        unsharded_grad += grad
    else:
        setattr(param, grad_name, grad)
    return grad
```
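A minimal debugging sketch (not part of this PR) that could check the hypothesis, assuming `module` is the fsdp-transformed `torch.nn.Module` inspected after a backward pass under `no_sync`: it lists the parameters that the `_sync_grads` filter above would skip because no unsharded grad was ever stashed on them. The helper name `report_missing_unsharded_grads` is hypothetical.

```python
import torch.nn as nn

def report_missing_unsharded_grads(module: nn.Module) -> set[str]:
    """Return the FQNs of grad-requiring parameters that never received a
    stashed "_thunder_fsdp_unsharded_grad" attribute; a non-empty result
    would point to the mismatch described above."""
    stashed = {
        name
        for name, p in module.named_parameters()
        if hasattr(p, "_thunder_fsdp_unsharded_grad")
    }
    expected = {name for name, p in module.named_parameters() if p.requires_grad}
    # Parameters in `expected` but not in `stashed` would be silently skipped
    # by _sync_grads, since its filter only keeps params carrying the attribute.
    return expected - stashed
```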
…ter sharded and sync'ed grads (#1643)
Stamped!
Co-authored-by: Masaki Kozuki <[email protected]>
cc: @crcrpar
Plan:
- Replace with transform-aware module functions: removed as not (currently) applicable --> we are not re-entrant.
- Fixed