New Version Usage Issue #24724
Here's another question: in the new version of the Transformers package, from_pretrained now loads safetensors weights by default. How can I change it back to pytorch.bin? Is there a parameter I can specify?
Hi @Excuses123, thanks for raising this issue. Without knowing the model or dataset, we're unable to reproduce and won't be able to debug this issue. Is there a minimal reproducible snippet, with a public dataset and model checkpoint, where this issue (increased memory footprint) still occurs and that you could share? To force the model to not load safetensors weights you can pass
@amyeroberts Thank you for your response. I am using the model: The data can be found at: https://huggingface.co/datasets/BelleGroup/train_0.5M_CN/blob/main/Belle_open_source_0.5M.json Below is the execution script:
After testing, the latest version that currently runs is 4.29.2; every version after that fails.
I guess it might be caused by FSDP (Fully Sharded Data Parallel), but I'm not sure.
@Excuses123 Have you tried running without FSDP? Which version of accelerate are you running?
@amyeroberts I have tried it, and without FSDP, both the new and old versions of transformers throw an OOM error. My accelerate version is 0.20.3.
@Excuses123 Is this including versions <= 4.29.2?
@amyeroberts I have tried version 4.29.0 and it works.
@Excuses123 OK, thanks for confirming. Could you:
@amyeroberts I have fixed the code formatting, and the version of my datasets is 2.11.0. My machine is currently running a task, and as soon as it is finished, I will try the latest version.
Facing the same issue. Code ran smoothly with transformers==4.28.1 but OOM with transformers==4.30.2 |
@Excuses123 @larrylawl OK, thanks for the information and updates. I'm going to cc @pacman100 and @younesbelkada, who know more about training in fp16 and torchrun.
I can confirm this. It is a bug introduced recently. It can be reproduced by the Vicuna training example. With 4.31.0, the warning is
To fix it, I followed the guide and changed these lines (transformers/src/transformers/trainer.py, lines 1646 to 1661 in e42587f) to:
    model = self.accelerator.prepare(model)
    if delay_optimizer_creation:
        self.create_optimizer_and_scheduler(num_training_steps=max_steps)
    self.optimizer = self.accelerator.prepare(self.optimizer)
Then the warnings and OOM disappeared. @pacman100 @younesbelkada I think my fix is a hack that only works for my case. Could you do a more complete fix in the main branch?
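The order in this workaround (prepare the model first, then create the optimizer) matters because FSDP-style wrapping replaces the model's parameters with new sharded ones, so an optimizer built beforehand keeps pointing at the old, full-size parameters. The toy below illustrates only that stale-reference problem in plain Python. `ToyParam`, `ToyModel`, `ToyOptimizer`, and `wrap_and_shard` are hypothetical stand-ins invented for this sketch, not Transformers, Accelerate, or PyTorch APIs, and the sizes are made up.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class ToyParam:
    """Hypothetical stand-in for a parameter tensor; only its size matters here."""
    numel: int


@dataclass
class ToyModel:
    params: list = field(default_factory=list)


class ToyOptimizer:
    """Keeps references to the parameter objects it was constructed with."""

    def __init__(self, params):
        self.params = list(params)

    def tracks(self, model):
        # True only if every parameter the model currently owns is tracked.
        tracked = {id(p) for p in self.params}
        return all(id(p) in tracked for p in model.params)


def wrap_and_shard(model, world_size):
    """Assumed behaviour: swap all parameters for one new per-rank shard."""
    total = sum(p.numel for p in model.params)
    shard_size = -(-total // world_size)  # ceiling division
    model.params = [ToyParam(shard_size)]
    return model


def build(optimizer_first, world_size=4):
    model = ToyModel([ToyParam(1000), ToyParam(3000)])
    if optimizer_first:
        optimizer = ToyOptimizer(model.params)     # sees full-size params
        model = wrap_and_shard(model, world_size)  # params replaced afterwards
    else:
        model = wrap_and_shard(model, world_size)
        optimizer = ToyOptimizer(model.params)     # sees the sharded params
    return model, optimizer


stale_model, stale_opt = build(optimizer_first=True)
fresh_model, fresh_opt = build(optimizer_first=False)

print("optimizer first tracks live params:", stale_opt.tracks(stale_model))
print("model first tracks live params:   ", fresh_opt.tracks(fresh_model))
print("elements held:", sum(p.numel for p in stale_opt.params),
      "vs", sum(p.numel for p in fresh_opt.params))
```

In the toy, the optimizer created first still holds 4000 full-size elements that the wrapped model no longer uses, while the one created after wrapping holds only the 1000-element shard. Whether this fully explains the OOM reported in the thread is not established here.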
Hello @Ying1123, thank you for the detailed info, very helpful. Could you please try out the above PRs for accelerate and transformers and see if they fix the OOM?
Thanks @pacman100, cherry-picking the PRs onto transformers v4.31.0 and accelerate v0.21.0 works for me.
@pacman100 Hi, I am still getting out-of-memory issues with the latest main. Since accelerate started being used for FSDP (from v4.30 through the current main), the example hits OOM. From these observations, I can confirm that the recent refactoring makes memory usage higher than in the older version, but I do not know how to debug it because I am not familiar with Accelerate.
Hello @merrymercy, can you post the VRAM usage with the 4.28 version?
Hi @pacman100 @Ying1123, I hit the same issue: OOM. I changed my transformers version to 4.31.0 and to 4.30.0, with accelerate=0.21.0, and none of these work!
My FSDP settings are:
@pacman100 @Ying1123 I also found that adding an fsdp_config.json makes all of the following warnings disappear:
And the hack makes this one disappear:
But all of these still hit OOM!
I think there is a better way to fix this.
I see the same memory usage across versions for the following example:
version 4.28.1 - 5.4GB VRAM
Please provide a minimal example that I can directly run without having to spend time getting it to work.
You mean the
Both Accelerate and Transformers main branch
Using the main branches of both Accelerate and Transformers works for me.
@Xuekai-Zhu did you fix the problem? I hit the same OOM on 2xA6000 with both main branches.
I confirm that @Ying1123's hack does not work for me. I have 4 A100 cards, with
due to this method. Downgrading to transformers==4.28.1 worked for me.
I tried all the solutions and am still getting OOM on an A100 80GB.
If you still have an issue, I suggest you create a new issue, share a reproducer and a traceback, and ping @pacman100; otherwise there is no way we can help you 😓
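Since the reports in this thread hinge on exact transformers and accelerate versions, a reproducer is easier to act on when it states the installed versions explicitly. A minimal sketch using only the standard library; the helper name `collect_versions` and the default package list are assumptions made for illustration, not part of any project tooling.

```python
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages=("transformers", "accelerate", "datasets", "torch")):
    """Return a {package: version} mapping for pasting into a bug report."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # Record the gap instead of failing, so the report is still complete.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, installed in collect_versions().items():
        print(f"{name}: {installed}")
```

Pasting this output next to the traceback lets maintainers see at a glance which side of the 4.29.x / 4.30.x boundary a report falls on.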
System Info
transformers version: 4.29.0
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Here is my code:
Expected behavior
Has anyone encountered this problem? I used the same instruction fine-tuning code. It runs successfully with transformers package version 4.29.0, but when I upgrade to version 4.30.2, it fails to run and throws an OOM (Out of Memory) error. Does anyone know the reason behind this?
Below is the GPU status during my successful run.