GRPOTrainer fails to transfer weights to vLLM with _move_model_to_vllm after 7.5 hours of the job running
#2840
Comments
Offending PR might be #2817 |
Same issue here. In my case this happened immediately after the checkpoint was saved. |
Can you provide steps to reproduce? Using only a small part of your dataset might help reproduce it without having to wait 24 hours. |
huggingface/open-r1#299 seems to be the same issue referenced in open-r1 |
This was with the following dataset https://huggingface.co/datasets/allenai/RLVR-IFeval |
Same situation |
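Following the logs, I went through the source code of both trl and deepspeed and found that it is  So I took a bold guess and tried clearing it manually: for param in self.model.parameters():
param.ds_active_sub_modules.clear() After testing, this turned out to work, and I have now completed my GRPO training run. |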
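The workaround in the comment above can be sketched as a small helper. This is only an illustration, not the commenter's exact patch or a fix shipped in trl: the helper name `clear_ds_active_sub_modules` is made up here, and where to call it (for example, before the weights are moved to vLLM) is an assumption, since the comment does not say. The `getattr` guard is also an addition so that parameters not managed by DeepSpeed ZeRO-3 are skipped. The stand-in classes exist only so the snippet runs without torch or DeepSpeed.

```python
def clear_ds_active_sub_modules(model):
    """Clear the `ds_active_sub_modules` set on every parameter that has one.

    Returns the number of parameters whose set was non-empty before clearing.
    """
    cleared = 0
    for param in model.parameters():
        # Parameters that are not managed by ZeRO-3 have no such attribute.
        active = getattr(param, "ds_active_sub_modules", None)
        if active:  # attribute exists and is non-empty
            active.clear()
            cleared += 1
    return cleared


# Hypothetical stand-ins for illustration only; in a real run `model`
# would be the DeepSpeed-wrapped model held by the trainer.
class FakeParam:
    def __init__(self, active=None):
        if active is not None:
            self.ds_active_sub_modules = set(active)


class FakeModel:
    def __init__(self, params):
        self._params = params

    def parameters(self):
        return iter(self._params)


model = FakeModel([FakeParam({1, 2}), FakeParam(set()), FakeParam()])
print(clear_ds_active_sub_modules(model))
```

Returning a count makes it easy to log whether any stale entries were actually present at the moment of the call, which may help confirm that this is the same failure.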
Same issue |
Cross-referencing an OpenRLHF issue: this seems to be related to batch size. |
@qgallouedec Me too! Have you fixed this problem? |
I fixed it by satisfying |
I got this error when running the code, |
Reproduction
Description: I was running a job that was expected to take about 24 hours. I have repeatedly seen the job crash when using vLLM. However, this is hard to reproduce because it only happens after a long time.
33%|███▎ | 758/2274 [7:31:16<12:37:27, 29.98s/it]
Commit (1 commit behind main at the time of reporting this): 2106b31
GRPOConfig:
Error:
System Info
I use vllm==0.7.1.
TRL env:
Checklist