
GRPOTrainer fails to transfer weights to vLLM with _move_model_to_vllm after 7.5 hours of the job running #2840

Open
5 tasks done
casper-hansen opened this issue Feb 12, 2025 · 12 comments
Labels
🐛 bug Something isn't working 🚀 deepspeed Related to deepspeed 🏋 GRPO Related to GRPO

Comments

@casper-hansen

Reproduction

Description: I was running a job that was expected to take about 24 hours. I have seen this happen many times: the job crashes when using vLLM. However, it is hard to reproduce because it only occurs after the job has been running for a long time.

33%|███▎ | 758/2274 [7:31:16<12:37:27, 29.98s/it]

Commit (1 commit behind main at the time of reporting this): 2106b31

GRPOConfig:

training_args = GRPOConfig(
    output_dir=output_dir,
    learning_rate=2e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.07,
    lr_scheduler_type="cosine",
    logging_steps=1,
    bf16=True,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_generations=7, # 1 per GPU
    max_prompt_length=MAX_PROMPT_LENGTH,
    max_completion_length=MAX_COMPLETION_LENGTH,
    num_train_epochs=3,
    save_steps=100,
    max_grad_norm=0.1,
    report_to="wandb",
    log_on_each_node=False,
    use_vllm=True,
    vllm_max_model_len=TOTAL_LENGTH,
    vllm_gpu_memory_utilization=0.7,
    beta=0.01,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)

Error:

Traceback (most recent call last):
  File "/workspace/nlp_train/hf_trl/train.py", line 117, in <module>
    trainer.train()
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2171, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/transformers/trainer.py", line 2531, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/transformers/trainer.py", line 3669, in training_step
    inputs = self._prepare_inputs(inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 519, in _prepare_inputs
    self._move_model_to_vllm()
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/trl/trainer/grpo_trainer.py", line 490, in _move_model_to_vllm
    with unwrap_model_for_generation(
  File "/opt/conda/envs/py_3.11/lib/python3.11/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/trl/models/utils.py", line 195, in unwrap_model_for_generation
    with deepspeed.zero.GatheredParameters(model.parameters()):
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2251, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=False)
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1394, in partition
    self._partition(param_list, has_been_updated=has_been_updated, free_data=True)
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1543, in _partition
    self._partition_param(param, has_been_updated=has_been_updated, free_data=True)
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1577, in _partition_param
    free_param(param)
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/py_3.11/lib/python3.11/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 284, in free_param
    assert not param.ds_active_sub_modules, param.ds_summary()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: {'id': 0, 'status': 'AVAILABLE', 'numel': 544997376, 'ds_numel': 544997376, 'shape': (152064, 3584), 'ds_shape': (152064, 3584), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {372}, 'ds_tensor.shape': torch.Size([77856768])}
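
For context, a simplified sketch of the code path this traceback goes through, assuming DeepSpeed ZeRO-3 (gather_and_export and the export_fn callback are illustrative placeholders, not TRL's actual function names): the ZeRO-3 shards are gathered so the full weights can be copied into vLLM, and the assertion fires when the context manager exits and re-partitions the parameters.

import deepspeed

def gather_and_export(ds_model, export_fn):
    # Gather the ZeRO-3 shards so the full weights are materialized on this rank,
    # hand them to export_fn (in TRL this is where the weights are copied to vLLM),
    # then let DeepSpeed re-partition everything when the context exits.
    with deepspeed.zero.GatheredParameters(list(ds_model.parameters())):
        export_fn(ds_model.state_dict())
    # On __exit__, free_param() asserts that each param.ds_active_sub_modules set
    # is empty; that is the assertion failing in the traceback above.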

System Info

I use vllm==0.7.1.

TRL env:

  • Platform: Linux-5.15.0-1063-nvidia-x86_64-with-glibc2.35
  • Python version: 3.11.11
  • PyTorch version: 2.5.1
  • CUDA device(s): NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3, NVIDIA H100 80GB HBM3
  • Transformers version: 4.48.2
  • Accelerate version: 1.3.0
  • Accelerate config: not found
  • Datasets version: 3.2.0
  • HF Hub version: 0.28.1
  • TRL version: 0.15.0.dev0
  • bitsandbytes version: not installed
  • DeepSpeed version: 0.16.3
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 1.61.1
  • PEFT version: not installed

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
  • Any traceback provided is complete
@github-actions github-actions bot added 🏋 GRPO Related to GRPO 🚀 deepspeed Related to deepspeed 🐛 bug Something isn't working labels Feb 12, 2025
@casper-hansen
Author

The offending PR might be #2817

@AndreiCComan

Same issue here. In my case this happened immediately after a checkpoint was saved.

@qgallouedec
Member

Can you try to provide steps to reproduce? Maybe taking only a small part of your dataset could help reproduce the issue without having to wait 24 hours.

@Superskyyy
Contributor

huggingface/open-r1#299 in open-r1 seems to be the same issue.

@casper-hansen
Author

Can you try to provide steps to reproduce? Maybe taking only a small part of your dataset could help reproduce the issue without having to wait 24 hours.

This was with the following dataset: https://huggingface.co/datasets/allenai/RLVR-IFeval

@hezhefly

Same issue here. In my case this happened immediately after a checkpoint was saved.

Same situation

@hezhefly

Following the traceback, I looked through the trl and deepspeed source code and found that the error is raised by a parameter assertion inside deepspeed.zero.GatheredParameters. Digging further into the assertion logic, free_param(param) expects each parameter's ds_active_sub_modules set to be empty before it runs. I am not sure what exactly in trl causes ds_active_sub_modules to be left non-empty.

So I took a bold guess and tried clearing ds_active_sub_modules manually, adding the following logic at grpo_trainer.py#L490:

for param in self.model.parameters():
    param.ds_active_sub_modules.clear()  # manually clear DeepSpeed's submodule bookkeeping

After testing, this worked; my GRPO training run has since completed.
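
A minimal sketch of applying this workaround without editing the installed package, assuming the diagnosis above (PatchedGRPOTrainer and the hasattr guard are illustrative additions, not part of TRL):

from trl import GRPOTrainer

class PatchedGRPOTrainer(GRPOTrainer):
    def _move_model_to_vllm(self):
        # Clear DeepSpeed's ZeRO-3 submodule bookkeeping before the weights are
        # gathered, so free_param() does not hit the ds_active_sub_modules
        # assertion when the gathered parameters are re-partitioned.
        for param in self.model.parameters():
            if hasattr(param, "ds_active_sub_modules"):  # only ZeRO-3 params carry this set
                param.ds_active_sub_modules.clear()
        super()._move_model_to_vllm()

The hasattr guard also avoids the AttributeError reported further down when the model is not partitioned with ZeRO-3. Note that clearing this bookkeeping silences the assertion rather than fixing the underlying state tracking, so it may have side effects.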

@wuyifan18

Same issue

@Superskyyy
Contributor

Just cross-referencing from an OpenRLHF issue; it seems related to batch size.

OpenRLHF/OpenRLHF#630

@tsrigo

tsrigo commented Feb 19, 2025

Same issue here. In my case this happened immediately after a checkpoint was saved.

@qgallouedec Me too! Have you fixed this problem?

@tsrigo

tsrigo commented Feb 20, 2025

Same issue here. In my case this happened immediately after a checkpoint was saved.

@qgallouedec Me too! Have you fixed this problem?

I fixed it by ensuring save_interval % grad_accum == 0.
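
A minimal sanity check based on this observation (the values mirror the GRPOConfig above; this is a heuristic from the thread, not a documented requirement):

# Keep checkpoints on gradient-accumulation boundaries so a save never lands
# mid-accumulation right before the weights are gathered for vLLM.
save_steps = 100
gradient_accumulation_steps = 4
assert save_steps % gradient_accumulation_steps == 0, (
    "choose save_steps as a multiple of gradient_accumulation_steps"
)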

@loxs123

loxs123 commented Feb 21, 2025

Following the traceback, I looked through the trl and deepspeed source code and found that the error is raised by a parameter assertion inside deepspeed.zero.GatheredParameters. Digging further into the assertion logic, free_param(param) expects each parameter's ds_active_sub_modules set to be empty before it runs. I am not sure what exactly in trl causes ds_active_sub_modules to be left non-empty.

So I took a bold guess and tried clearing ds_active_sub_modules manually, adding the following logic at grpo_trainer.py#L490:

for param in self.model.parameters():
    param.ds_active_sub_modules.clear()

After testing, this worked; my GRPO training run has since completed.

When I run this code I get the error AttributeError: 'Parameter' object has no attribute 'ds_active_sub_modules'. Do you know how to fix this? Perhaps a library version mismatch?
