Results on TPU worse than on GPU (using colab) #526
Comments
I'm working on a very large benchmark for this, as this issue/confusion has come up multiple times. But please take a look at this issue first, because if you do not adjust the batch size at all you are not actually benchmarking your results accurately: #450
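To make the batch-size point concrete, here is a hedged sketch of keeping the *effective* batch size comparable when moving from one GPU to eight TPU cores; `BASE_BATCH_SIZE` and `BASE_LR` are illustrative names, not values from this issue.

```python
# Sketch only: two common ways to adjust for multi-process TPU training.
from accelerate import Accelerator

BASE_BATCH_SIZE = 32   # batch size tuned on a single GPU (illustrative value)
BASE_LR = 2e-5         # learning rate tuned on a single GPU (illustrative value)

accelerator = Accelerator()

# Each process loads its own batches, so the effective batch size is
# per_device_batch_size * num_processes. Either shrink the per-device
# batch size to keep the effective size constant...
per_device_batch_size = max(1, BASE_BATCH_SIZE // accelerator.num_processes)

# ...or keep the per-device batch size and scale the learning rate instead.
scaled_lr = BASE_LR * accelerator.num_processes
```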
Yep, I tried adjusting the batch size and learning rate (multiplying/dividing by the number of processes), but still no luck: TPU in multi-processing performs much worse. I cannot reach the same performance (0.9+ F1), though I can now reach it with a single TPU core.
For us to reproduce your results, please include the exact configuration you used for each test, including each batch size as you adjusted it as well as the learning rates. These are critically important. Since you are in Colab, could you also try launching via the `notebook_launcher`?
Well, I tried dividing or multiplying the LR/batch size by powers of 2. And when I launch via the `notebook_launcher`, I get similar results (well, exactly the same results).
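For readers following along, a minimal sketch of what launching via `notebook_launcher` in Colab looks like; the `training_function` body and the `num_processes` values are illustrative assumptions, not code from this issue.

```python
# Sketch only: Accelerate's notebook_launcher forks the training function
# into one process per device, which is how the TPU runs in this thread
# were started from the notebook.
from accelerate import notebook_launcher

def training_function():
    # build the dataloaders, model, and optimizer here, inside each process
    ...

# 8 processes for the 8 TPU cores of a Colab TPU runtime;
# use num_processes=1 on a single-GPU runtime.
notebook_launcher(training_function, args=(), num_processes=8)
```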
@koba35 great, thanks! Very curious to see your results there.
As it turned out, everything is much more complicated. When I moved the model out of the training function and increased the LR, I was able to get normal results, but apparently part of the issue is that in this case the model is initialized before we call set_seed. If we call set_seed before we start multiprocessing, the results drop again. It seems to me that several different details just converged in this example: a relatively small dataset, a large initial LR, a fixed number of steps in the scheduler, model initialization inside the function, and a fixed seed inside each process (if I understand correctly, each fork should be given a different seed). In my opinion, it is worth adding a few tweaks here.
Unfortunately, I can't say more precisely. Maybe it's actually something else, but I couldn't find it.
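A hedged sketch of the initialization order described above, using a hypothetical `create_model()` helper; whether every process should share one seed or get its own offset is exactly the open question in this comment, so both options are noted in the comments.

```python
# Sketch only: seed *inside* the forked training function, then build the
# model, so each process starts from a deliberate state.
from accelerate import Accelerator
from accelerate.utils import set_seed

def training_function(seed: int = 42):
    accelerator = Accelerator()

    # Same seed everywhere -> identical model initialization in every process.
    # Add accelerator.process_index here if each fork should use its own seed.
    set_seed(seed)

    model = create_model()  # hypothetical helper; model is built after seeding
    ...
```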
Hey @koba35, I believe I've finally solved this regression issue. If you could, can you try doing the following in your code:

```python
model, optimizer, scheduler, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, scheduler, train_dataloader, eval_dataloader
)
scheduler.split_batches = True  # <- THIS PART RIGHT HERE
```

and tell me what your results are? This improved my initial findings as I was looking, and practically perfectly aligned with the results when Accelerate wasn't used at all.
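For context, a hedged illustration of where that flag takes effect: `accelerator.prepare` wraps the scheduler, and with `split_batches = True` the wrapped scheduler advances once per `scheduler.step()` call rather than once per process per call, so a schedule sized for a single-process step count is not consumed early. The loop below is a generic Accelerate loop, not the poster's script.

```python
# Generic Accelerate training loop (illustrative only).
for batch in train_dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()  # with split_batches=True the LR schedule advances once per call
    optimizer.zero_grad()
```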
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
I created a notebook for reproducing this; the steps are very simple.
When I train with TPU, I get a 0.848 F1 score. When I train with GPU, I get more than 0.9. I also tried different scripts and always get much worse results with TPU. Maybe it is something Colab-specific, because as far as I can see in other TPU-related issues (for example), people get results similar to my GPU results.
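Not part of the original report, but one detail worth checking in any multi-core TPU reproduction (hedged sketch; `metric`, `eval_dataloader`, and the `"labels"` key are assumptions): each of the 8 processes only evaluates its own shard, so predictions and labels are usually gathered across processes before computing F1.

```python
# Sketch of multi-process evaluation with assumed names: gather shards from
# all TPU processes before feeding the metric, otherwise each process scores
# only its own slice of the eval set.
import torch

model.eval()
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
    predictions = outputs.logits.argmax(dim=-1)
    predictions, references = accelerator.gather((predictions, batch["labels"]))
    metric.add_batch(predictions=predictions, references=references)

eval_metric = metric.compute()
```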
Expected behavior
When I run the example scripts in Colab, I should get similar results on TPU and GPU.