[fix] fix OOM when Megatron loads a large model by having only rank 0 load weights #330

Open · uygnef wants to merge 1 commit into main from fix/fix_load_model_oom

Conversation

@uygnef (Contributor) commented Feb 20, 2025

Problem
Currently, when the Megatron worker loads a model, every rank loads the checkpoint (ckpt). For large models, this often causes out-of-memory (OOM) errors.

Solution
To address this, we've modified the process so that only rank 0 loads the actual model weights. This significantly reduces memory usage during model loading and prevents OOM issues.
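
To make the idea concrete, here is a minimal sketch of the rank-0-only loading pattern. It is an illustration only, not the code in this PR: it ignores Megatron's tensor/pipeline sharding and assumes torch.distributed is already initialized and that checkpoint keys match the model's parameter names.

```python
import torch
import torch.distributed as dist

def load_rank0_and_broadcast(model, checkpoint_path):
    """Only rank 0 reads the checkpoint; all other ranks receive weights via broadcast."""
    rank = dist.get_rank()
    state_dict = None
    if rank == 0:
        # The full checkpoint is materialized in host memory on one rank only,
        # instead of once per rank, which is what caused the OOM.
        state_dict = torch.load(checkpoint_path, map_location="cpu")

    for name, param in model.named_parameters():
        if rank == 0:
            tensor = state_dict[name].to(device=param.device, dtype=param.dtype)
        else:
            # Non-zero ranks only allocate an empty receive buffer of the right shape.
            tensor = torch.empty_like(param)
        dist.broadcast(tensor, src=0)
        with torch.no_grad():
            param.copy_(tensor)
```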

Test
The fix has been tested on a 4-node H800 cluster (800 GB of RAM per node), successfully loading the Qwen2.5-32B model with Megatron without encountering OOM errors.

@uygnef force-pushed the fix/fix_load_model_oom branch from 0b6b28f to 7b81634 on February 20, 2025 13:23
@uygnef changed the title from "[fix] fix OOM when loading large model by optimizing memory usage" to "[fix] fix OOM when megatron loading large model by only rank 0 loads weights" on Feb 21, 2025
@uygnef marked this pull request as ready for review on February 21, 2025 05:22
@uygnef (Contributor, Author) commented Feb 21, 2025

Note:
During testing, one of the test cases failed when the actor attempted to upload a policy, resulting in an NCCL error. I think this issue is unrelated to the current PR. I was unable to reproduce the error again. Based on the discussion in NVIDIA/nccl#1273, this appears to be a potential NCCL bug.

Error Log:

【2025-02-20 23:44:58】 File "/home/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1031, in send_forward_recv_backward
【2025-02-20 23:44:58】 output_tensor_grad = p2p_communication.send_forward_recv_backward(
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "/home/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 447, in send_forward_recv_backward
【2025-02-20 23:44:58】 _, output_tensor_grad, _ = _communicate(
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "/home/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 325, in _communicate
【2025-02-20 23:44:58】 reqs = p2p_func(
【2025-02-20 23:44:58】 ^^^^^^^^^
【2025-02-20 23:44:58】 File "/home/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 158, in _batched_p2p_ops
【2025-02-20 23:44:58】 reqs = torch.distributed.batch_isend_irecv(ops)
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "/home/anaconda3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2158, in batch_isend_irecv
【2025-02-20 23:44:58】 with _coalescing_manager(group, device, async_ops=True) as cm:
【2025-02-20 23:44:58】 File "/home/anaconda3/lib/python3.11/contextlib.py", line 144, in __exit__
【2025-02-20 23:44:58】 next(self.gen)
【2025-02-20 23:44:58】 File "/home/anaconda3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2105, in _coalescing_manager
【2025-02-20 23:44:58】 work = group._end_coalescing(device)
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1720538435607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:4187, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
【2025-02-20 23:44:58】ncclInternalError: Internal check failed.
【2025-02-20 23:44:58】Last error:
【2025-02-20 23:44:58】Message truncated : received 128 bytes instead of 4
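
For context, the failing frame is Megatron's batched pipeline-parallel exchange, which coalesces point-to-point sends/receives through torch.distributed.batch_isend_irecv. A rough sketch of that call pattern (illustrative only; the tensor names and peer rank are placeholders, not the actual Megatron code):

```python
import torch
import torch.distributed as dist

def send_forward_recv_backward_sketch(output_tensor: torch.Tensor, peer: int) -> torch.Tensor:
    # Send this stage's activations forward and receive the gradient from the next
    # stage in a single coalesced batch of p2p operations.
    grad_buf = torch.empty_like(output_tensor)
    ops = [
        dist.P2POp(dist.isend, output_tensor, peer),
        dist.P2POp(dist.irecv, grad_buf, peer),
    ]
    reqs = dist.batch_isend_irecv(ops)  # the call that raised ncclInternalError above
    for req in reqs:
        req.wait()
    return grad_buf
```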

@uygnef marked this pull request as draft on February 21, 2025 11:25
@uygnef force-pushed the fix/fix_load_model_oom branch from 7b81634 to a9d3b95 on February 24, 2025 02:45
@uygnef marked this pull request as ready for review on February 24, 2025 02:46
@PeterSH6 (Collaborator) commented:

Hi @uygnef, nice catch!

> During testing, one of the test cases failed when the actor attempted to upload a policy, resulting in an NCCL error.

What do you mean by "actor attempted to upload a policy"?

@uygnef (Contributor, Author) commented Feb 24, 2025

> Hi @uygnef, nice catch!
>
> > During testing, one of the test cases failed when the actor attempted to upload a policy, resulting in an NCCL error.
>
> What do you mean by "actor attempted to upload a policy"?

Yes, we have trained several times (>5) with the same setup, and it did not happen again. Do you have any suggestions?
Here's the full error stack:

【2025-02-20 23:44:54】 File "verl/verl/single_controller/base/decorator.py", line 404, in inner
【2025-02-20 23:44:54】 return func(*args, **kwargs)
【2025-02-20 23:44:54】 ^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:54】 File "verl/verl/workers/megatron_workers.py", line 345, in update_actor
【2025-02-20 23:44:54】 metrics = self.actor.update_policy(dataloader=dataloader)
【2025-02-20 23:44:54】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "verl/verl/workers/actor/megatron_actor.py", line 360, in update_policy
【2025-02-20 23:44:58】 metric_micro_batch = self.forward_backward_batch(data)
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "verl/verl/workers/actor/megatron_actor.py", line 314, in forward_backward_batch
【2025-02-20 23:44:58】 losses_reduced = forward_backward_func(
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1270, in forward_backward_pipelining_without_interleaving
【2025-02-20 23:44:58】 output_tensor_grad = send_forward_recv_backward(
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1031, in send_forward_recv_backward
【2025-02-20 23:44:58】 output_tensor_grad = p2p_communication.send_forward_recv_backward(
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 447, in send_forward_recv_backward
【2025-02-20 23:44:58】 _, output_tensor_grad, _ = _communicate(

@PeterSH6 (Collaborator) commented:

So this issue may be related to PP? Are you using VPP?

@uygnef (Contributor, Author) commented Feb 24, 2025

Based on that issue, I believe it's likely an NCCL bug and not related to PP or VPP. I'm currently using PP, but not VPP. Here are the setup details:

{'actor_rollout_ref': {
     'actor': {
         'clip_ratio': 0.2,
         'entropy_coeff': 0.001,
         'grad_clip': 1.0,
         'kl_loss_coef': 0.001,
         'kl_loss_type': 'low_var_kl',
         'load_weight': True,
         'megatron': {'num_layers_per_virtual_pipeline_stage': None,
                      'pipeline_model_parallel_size': 4,
                      'seed': 1,
                      'sequence_parallel': True,
                      'tensor_model_parallel_size': 8},
         'optim': {'clip_grad': 1.0,
                   'lr': 1e-06,
                   'lr_warmup_steps_ratio': 0.0,
                   'min_lr_ratio': None,
                   'total_training_steps': -1,
                   'warmup_style': 'constant'},
         'ppo_epochs': 1,
         'ppo_max_token_len_per_gpu': 16384,
         'ppo_micro_batch_size': None,
         'ppo_micro_batch_size_per_gpu': 1,
         'ppo_mini_batch_size': 256,
         'shuffle': True,
         'strategy': 'megatron',
         'ulysses_sequence_parallel_size': 1,
         'use_dynamic_bsz': False,
         'use_kl_loss': True},
     'hybrid_engine': True,
     'model': {'enable_gradient_checkpointing': True,
               'external_lib': None,
               'override_config': {},
               'path': '[MODEL_PATH]',
               'use_remove_padding': False},
     'ref': {'load_weight': True,
             'log_prob_max_token_len_per_gpu': 16384,
             'log_prob_micro_batch_size': None,
             'log_prob_micro_batch_size_per_gpu': 4,
             'log_prob_use_dynamic_bsz': False,
             'megatron': {'num_layers_per_virtual_pipeline_stage': None,
                          'pipeline_model_parallel_size': 4,
                          'seed': 1,
                          'sequence_parallel': True,
                          'tensor_model_parallel_size': 8},
             'param_offload': False,
             'ulysses_sequence_parallel_size': 1},
     'rollout': {'disable_log_stats': True,
                 'do_sample': True,
                 'dtype': 'bfloat16',
                 'enable_chunked_prefill': True,
                 'enforce_eager': True,
                 'free_cache_engine': True,
                 'gpu_memory_utilization': 0.8,
                 'ignore_eos': False,
                 'layer_name_map': {'gate_proj_layer_name': 'gate_up',
                                    'qkv_layer_name': 'qkv'},
                 'load_format': 'dummy_megatron',
                 'log_prob_max_token_len_per_gpu': 16384,
                 'log_prob_micro_batch_size': None,
                 'log_prob_micro_batch_size_per_gpu': 4,
                 'log_prob_use_dynamic_bsz': False,
                 'max_num_batched_tokens': 8192,
                 'max_num_seqs': 1024,
                 'n': 8,
                 'name': 'vllm',
                 'prompt_length': 2048,
                 'response_length': 8192,
                 'temperature': 1.0,
                 'tensor_model_parallel_size': 8,
                 'top_k': -1,
                 'top_p': 1}},
 'algorithm': {'adv_estimator': 'grpo',
               'gamma': 1.0,
               'kl_ctrl': {'kl_coef': 0.001, 'type': 'fixed'},
               'kl_penalty': 'kl',
               'lam': 1.0},
 'critic': {'cliprange_value': 0.5,
            'kl_ctrl': {'kl_coef': 0.001, 'type': 'fixed'},
            'load_weight': True,
            'megatron': {'num_layers_per_virtual_pipeline_stage': None,
                         'pipeline_model_parallel_size': 1,
                         'seed': 1,
                         'sequence_parallel': True,
                         'tensor_model_parallel_size': 4},
            'model': {'enable_gradient_checkpointing': False,
                      'external_lib': None,
                      'override_config': {},
                      'path': '[MODEL_PATH]',
                      'tokenizer_path': '[TOKENIZER_PATH]'},
            'optim': {'clip_grad': 1.0,
                      'lr': 1e-05,
                      'lr_warmup_steps_ratio': 0.0,
                      'min_lr_ratio': None,
                      'total_training_steps': -1,
                      'warmup_style': 'constant'},
            'ppo_epochs': 1,
            'ppo_micro_batch_size': None,
            'ppo_micro_batch_size_per_gpu': None,
            'ppo_mini_batch_size': 256,
            'shuffle': True,
            'strategy': 'megatron',
            ...
            'use_dynamic_bsz': False},
 'trainer': {'critic_warmup': 0,
             'default_hdfs_dir': None,
             'default_local_dir': 'models/qwen_7b_megatron_kl0001_deepscaleR_numina_fix-verl-grpo-deepscale_numina.parquet-1e-6',
             'experiment_name': 'qwen_7b_megatron_kl0001_deepscaleR_numina_fix_bs256_mnode',
             'logger': ['wandb'],
             'n_gpus_per_node': 8,
             'nnodes': 4,
             'project_name': 'GRPO_numinamath-TIR',
             'save_freq': 1000,
             'test_freq': 10,
             'total_epochs': 5,
             'total_training_steps': None}}
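
For reference, the parallel layout these numbers imply (a quick sanity check computed from the values above, not output from the run):

```python
# actor: tensor_model_parallel_size=8, pipeline_model_parallel_size=4, no virtual pipeline stages
n_gpus = 4 * 8              # trainer.nnodes * trainer.n_gpus_per_node
tp, pp = 8, 4               # actor Megatron TP and PP sizes
dp = n_gpus // (tp * pp)    # remaining data-parallel replicas
print(n_gpus, tp * pp, dp)  # 32 GPUs, one 32-way model-parallel layout, DP = 1
```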

@CLAassistant commented Feb 26, 2025

CLA assistant check
All committers have signed the CLA.

@uygnef force-pushed the fix/fix_load_model_oom branch from a9d3b95 to eae2b20 on February 26, 2025 02:50