[fix] fix OOM when Megatron loads a large model by having only rank 0 load weights #330

Open · uygnef wants to merge 1 commit into main from fix/fix_load_model_oom

Conversation

@uygnef (Contributor) commented Feb 20, 2025

Problem
Currently, when the Megatron worker loads a model, every rank loads the checkpoint (ckpt). For large models, this often causes out-of-memory (OOM) errors.

Solution
To address this, we've modified the process so that only rank 0 loads the actual model weights. This significantly reduces memory usage during model loading and prevents OOM issues.
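
To make the idea concrete, here is a minimal sketch of the rank-0-only loading pattern. It is an illustration only, not the code in this PR: it ignores Megatron's tensor/pipeline sharding and assumes torch.distributed is already initialized and that checkpoint keys match the model's parameter names.

```python
import torch
import torch.distributed as dist

def load_rank0_and_broadcast(model, checkpoint_path):
    """Only rank 0 reads the checkpoint; all other ranks receive weights via broadcast."""
    rank = dist.get_rank()
    state_dict = None
    if rank == 0:
        # The full checkpoint is materialized in host memory on one rank only,
        # instead of once per rank, which is what caused the OOM.
        state_dict = torch.load(checkpoint_path, map_location="cpu")

    for name, param in model.named_parameters():
        if rank == 0:
            tensor = state_dict[name].to(device=param.device, dtype=param.dtype)
        else:
            # Non-zero ranks only allocate an empty receive buffer of the right shape.
            tensor = torch.empty_like(param)
        dist.broadcast(tensor, src=0)
        with torch.no_grad():
            param.copy_(tensor)
```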

Test
The fix has been tested on a 4-node H800 cluster (800 GB of RAM per node), successfully loading the Qwen2.5-32B model with Megatron without encountering OOM errors.

@uygnef force-pushed the fix/fix_load_model_oom branch from 0b6b28f to 7b81634 on February 20, 2025 13:23
@uygnef changed the title from "[fix] fix OOM when loading large model by optimizing memory usage" to "[fix] fix OOM when megatron loading large model by only rank 0 loads weights" on Feb 21, 2025
@uygnef marked this pull request as ready for review on February 21, 2025 05:22
@uygnef (Contributor, Author) commented Feb 21, 2025

Note:
During testing, one of the test cases failed when the actor attempted to upload a policy, resulting in an NCCL error. I think this issue is unrelated to the current PR. I was unable to reproduce the error again. Based on the discussion in NVIDIA/nccl#1273, this appears to be a potential NCCL bug.

Error Log:

【2025-02-20 23:44:58】 File "/home/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1031, in send_forward_recv_backward
【2025-02-20 23:44:58】 output_tensor_grad = p2p_communication.send_forward_recv_backward(
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "/home/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 447, in send_forward_recv_backward
【2025-02-20 23:44:58】 _, output_tensor_grad, _ = _communicate(
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "/home/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 325, in _communicate
【2025-02-20 23:44:58】 reqs = p2p_func(
【2025-02-20 23:44:58】 ^^^^^^^^^
【2025-02-20 23:44:58】 File "/home/Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 158, in _batched_p2p_ops
【2025-02-20 23:44:58】 reqs = torch.distributed.batch_isend_irecv(ops)
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "/home/anaconda3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2158, in batch_isend_irecv
【2025-02-20 23:44:58】 with _coalescing_manager(group, device, async_ops=True) as cm:
【2025-02-20 23:44:58】 File "/home/anaconda3/lib/python3.11/contextlib.py", line 144, in __exit__
【2025-02-20 23:44:58】 next(self.gen)
【2025-02-20 23:44:58】 File "/home/anaconda3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2105, in _coalescing_manager
【2025-02-20 23:44:58】 work = group._end_coalescing(device)
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1720538435607/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:4187, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
【2025-02-20 23:44:58】ncclInternalError: Internal check failed.
【2025-02-20 23:44:58】Last error:
【2025-02-20 23:44:58】Message truncated : received 128 bytes instead of 4
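
For context, the failing frame is Megatron's batched pipeline-parallel exchange, which coalesces point-to-point sends/receives through torch.distributed.batch_isend_irecv. A rough sketch of that call pattern (illustrative only; the tensor names and peer rank are placeholders, not the actual Megatron code):

```python
import torch
import torch.distributed as dist

def send_forward_recv_backward_sketch(output_tensor: torch.Tensor, peer: int) -> torch.Tensor:
    # Send this stage's activations forward and receive the gradient from the next
    # stage in a single coalesced batch of p2p operations.
    grad_buf = torch.empty_like(output_tensor)
    ops = [
        dist.P2POp(dist.isend, output_tensor, peer),
        dist.P2POp(dist.irecv, grad_buf, peer),
    ]
    reqs = dist.batch_isend_irecv(ops)  # the call that raised ncclInternalError above
    for req in reqs:
        req.wait()
    return grad_buf
```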

@uygnef marked this pull request as draft on February 21, 2025 11:25
@uygnef force-pushed the fix/fix_load_model_oom branch from 7b81634 to a9d3b95 on February 24, 2025 02:45
@uygnef marked this pull request as ready for review on February 24, 2025 02:46
@PeterSH6 (Collaborator) commented:

Hi @uygnef, nice catch!

> During testing, one of the test cases failed when the actor attempted to upload a policy, resulting in an NCCL error.

What do you mean by "actor attempted to upload a policy"?

@uygnef (Contributor, Author) commented Feb 24, 2025

> Hi @uygnef, nice catch!
>
> > During testing, one of the test cases failed when the actor attempted to upload a policy, resulting in an NCCL error.
>
> What do you mean by "actor attempted to upload a policy"?

Yes, we have trained several times (>5) with the same setup, and it did not happen again. Do you have any suggestions?
Here's the full error stack:

【2025-02-20 23:44:54】 File "verl/verl/single_controller/base/decorator.py", line 404, in inner
【2025-02-20 23:44:54】 return func(*args, **kwargs)
【2025-02-20 23:44:54】 ^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:54】 File "verl/verl/workers/megatron_workers.py", line 345, in update_actor
【2025-02-20 23:44:54】 metrics = self.actor.update_policy(dataloader=dataloader)
【2025-02-20 23:44:54】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "verl/verl/workers/actor/megatron_actor.py", line 360, in update_policy
【2025-02-20 23:44:58】 metric_micro_batch = self.forward_backward_batch(data)
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "verl/verl/workers/actor/megatron_actor.py", line 314, in forward_backward_batch
【2025-02-20 23:44:58】 losses_reduced = forward_backward_func(
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1270, in forward_backward_pipelining_without_interleaving
【2025-02-20 23:44:58】 output_tensor_grad = send_forward_recv_backward(
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 1031, in send_forward_recv_backward
【2025-02-20 23:44:58】 output_tensor_grad = p2p_communication.send_forward_recv_backward(
【2025-02-20 23:44:58】 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
【2025-02-20 23:44:58】 File "Megatron-LM/megatron/core/pipeline_parallel/p2p_communication.py", line 447, in send_forward_recv_backward
【2025-02-20 23:44:58】 _, output_tensor_grad, _ = _communicate(

@PeterSH6 (Collaborator) commented:

So this issue may be related to PP? Are you using VPP?

@uygnef (Contributor, Author) commented Feb 24, 2025

Based on that issue, I believe it's likely an NCCL bug and not related to PP or VPP. I'm currently using PP, but not VPP. Here are the setup details:

{'actor_rollout_ref': {
     'actor': {
         'clip_ratio': 0.2,
         'entropy_coeff': 0.001,
         'grad_clip': 1.0,
         'kl_loss_coef': 0.001,
         'kl_loss_type': 'low_var_kl',
         'load_weight': True,
         'megatron': {'num_layers_per_virtual_pipeline_stage': None,
                      'pipeline_model_parallel_size': 4,
                      'seed': 1,
                      'sequence_parallel': True,
                      'tensor_model_parallel_size': 8},
         'optim': {'clip_grad': 1.0,
                   'lr': 1e-06,
                   'lr_warmup_steps_ratio': 0.0,
                   'min_lr_ratio': None,
                   'total_training_steps': -1,
                   'warmup_style': 'constant'},
         'ppo_epochs': 1,
         'ppo_max_token_len_per_gpu': 16384,
         'ppo_micro_batch_size': None,
         'ppo_micro_batch_size_per_gpu': 1,
         'ppo_mini_batch_size': 256,
         'shuffle': True,
         'strategy': 'megatron',
         'ulysses_sequence_parallel_size': 1,
         'use_dynamic_bsz': False,
         'use_kl_loss': True},
     'hybrid_engine': True,
     'model': {'enable_gradient_checkpointing': True,
               'external_lib': None,
               'override_config': {},
               'path': '[MODEL_PATH]',
               'use_remove_padding': False},
     'ref': {'load_weight': True,
             'log_prob_max_token_len_per_gpu': 16384,
             'log_prob_micro_batch_size': None,
             'log_prob_micro_batch_size_per_gpu': 4,
             'log_prob_use_dynamic_bsz': False,
             'megatron': {'num_layers_per_virtual_pipeline_stage': None,
                          'pipeline_model_parallel_size': 4,
                          'seed': 1,
                          'sequence_parallel': True,
                          'tensor_model_parallel_size': 8},
             'param_offload': False,
             'ulysses_sequence_parallel_size': 1},
     'rollout': {'disable_log_stats': True,
                 'do_sample': True,
                 'dtype': 'bfloat16',
                 'enable_chunked_prefill': True,
                 'enforce_eager': True,
                 'free_cache_engine': True,
                 'gpu_memory_utilization': 0.8,
                 'ignore_eos': False,
                 'layer_name_map': {'gate_proj_layer_name': 'gate_up',
                                    'qkv_layer_name': 'qkv'},
                 'load_format': 'dummy_megatron',
                 'log_prob_max_token_len_per_gpu': 16384,
                 'log_prob_micro_batch_size': None,
                 'log_prob_micro_batch_size_per_gpu': 4,
                 'log_prob_use_dynamic_bsz': False,
                 'max_num_batched_tokens': 8192,
                 'max_num_seqs': 1024,
                 'n': 8,
                 'name': 'vllm',
                 'prompt_length': 2048,
                 'response_length': 8192,
                 'temperature': 1.0,
                 'tensor_model_parallel_size': 8,
                 'top_k': -1,
                 'top_p': 1}},
 'algorithm': {'adv_estimator': 'grpo',
               'gamma': 1.0,
               'kl_ctrl': {'kl_coef': 0.001, 'type': 'fixed'},
               'kl_penalty': 'kl',
               'lam': 1.0},
 'critic': {'cliprange_value': 0.5,
            'kl_ctrl': {'kl_coef': 0.001, 'type': 'fixed'},
            'load_weight': True,
            'megatron': {'num_layers_per_virtual_pipeline_stage': None,
                         'pipeline_model_parallel_size': 1,
                         'seed': 1,
                         'sequence_parallel': True,
                         'tensor_model_parallel_size': 4},
            'model': {'enable_gradient_checkpointing': False,
                      'external_lib': None,
                      'override_config': {},
                      'path': '[MODEL_PATH]',
                      'tokenizer_path': '[TOKENIZER_PATH]'},
            'optim': {'clip_grad': 1.0,
                      'lr': 1e-05,
                      'lr_warmup_steps_ratio': 0.0,
                      'min_lr_ratio': None,
                      'total_training_steps': -1,
                      'warmup_style': 'constant'},
            'ppo_epochs': 1,
            'ppo_micro_batch_size': None,
            'ppo_micro_batch_size_per_gpu': None,
            'ppo_mini_batch_size': 256,
            'shuffle': True,
            'strategy': 'megatron',
            ...
            'use_dynamic_bsz': False},
 'trainer': {'critic_warmup': 0,
             'default_hdfs_dir': None,
             'default_local_dir': 'models/qwen_7b_megatron_kl0001_deepscaleR_numina_fix-verl-grpo-deepscale_numina.parquet-1e-6',
             'experiment_name': 'qwen_7b_megatron_kl0001_deepscaleR_numina_fix_bs256_mnode',
             'logger': ['wandb'],
             'n_gpus_per_node': 8,
             'nnodes': 4,
             'project_name': 'GRPO_numinamath-TIR',
             'save_freq': 1000,
             'test_freq': 10,
             'total_epochs': 5,
             'total_training_steps': None}}
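
For reference, the parallel layout these numbers imply (a quick sanity check computed from the values above, not output from the run):

```python
# actor: tensor_model_parallel_size=8, pipeline_model_parallel_size=4, no virtual pipeline stages
n_gpus = 4 * 8              # trainer.nnodes * trainer.n_gpus_per_node
tp, pp = 8, 4               # actor Megatron TP and PP sizes
dp = n_gpus // (tp * pp)    # remaining data-parallel replicas
print(n_gpus, tp * pp, dp)  # 32 GPUs, one 32-way model-parallel layout, DP = 1
```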

@CLAassistant commented Feb 26, 2025

CLA assistant check
All committers have signed the CLA.

@uygnef force-pushed the fix/fix_load_model_oom branch from a9d3b95 to eae2b20 on February 26, 2025 02:50