[core] enable bf16 training #156

Merged
merged 2 commits into from
Feb 16, 2023

Conversation

younesbelkada
Contributor

Currently, on the main branch of trl, training in bfloat16 and logging with wandb fail. Users encounter errors that are hard to interpret.

The fixes are the following:

cc @lvwerra

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Feb 16, 2023

The documentation is not available anymore as the PR was closed or merged.

@lvwerra
Member

lvwerra commented Feb 16, 2023

What's the issue with lm_logits being in bf16?

@younesbelkada
Contributor Author

So that we can compute the loss in fp32, which I found to be more stable. Also, we sometimes log a list of tensors, and casting the loss to fp32 directly avoids the issue with numpy & bf16.
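For readers unfamiliar with the two points in this reply, here is a minimal sketch, not the actual trl change. It assumes PyTorch, uses made-up tensor shapes, and introduces a hypothetical helper `to_numpy_for_logging`. It shows upcasting bf16 logits before the loss, and why a bf16 tensor cannot be handed to NumPy directly (NumPy has no native bfloat16 dtype).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical shapes, chosen only for illustration.
batch, seq_len, vocab = 2, 4, 32

# Stand-in for logits produced by a model running in bf16.
lm_logits = torch.randn(batch, seq_len, vocab, dtype=torch.bfloat16)
labels = torch.randint(0, vocab, (batch, seq_len))

# Upcast first, so log-softmax and the loss are computed in fp32.
logits_fp32 = lm_logits.float()
logprobs = F.log_softmax(logits_fp32, dim=-1)
loss = F.nll_loss(logprobs.view(-1, vocab), labels.view(-1))


def to_numpy_for_logging(t):
    """Hypothetical helper: convert a tensor to NumPy, upcasting bf16 first."""
    if t.dtype == torch.bfloat16:
        t = t.float()
    return t.detach().cpu().numpy()


# Converting a bf16 tensor directly is expected to fail,
# because NumPy has no bfloat16 dtype to map it to.
try:
    lm_logits.numpy()
    bf16_to_numpy_worked = True
except (TypeError, RuntimeError):
    bf16_to_numpy_worked = False

logged = to_numpy_for_logging(lm_logits)
```

The sketch says nothing about whether the fp32 loss is actually more stable in practice; it only shows where the cast happens and why logging code needs fp32 values.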

@RylanSchaeffer
Contributor

RylanSchaeffer commented Aug 27, 2024

@younesbelkada @lvwerra I'm running into a problem where this forced upcast causes a massive spike in memory for models with large vocabularies (e.g., Gemma 2 by Google). This either throws an OOM error or forces me to cut the minibatch size in half, which doubles the PPO runtime.
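To make the memory concern concrete, here is a rough back-of-the-envelope sketch. The batch size, sequence length, and vocabulary size of about 256k are illustrative assumptions, not measurements from the linked issue. It counts only the logits tensor and ignores gradients, activations, and intermediate softmax buffers.

```python
def logits_bytes(batch, seq_len, vocab, bytes_per_elem):
    """Size in bytes of a dense (batch, seq_len, vocab) logits tensor."""
    return batch * seq_len * vocab * bytes_per_elem


# Assumed values for illustration only.
batch, seq_len, vocab = 8, 1024, 256_000

bf16_bytes = logits_bytes(batch, seq_len, vocab, 2)  # bf16: 2 bytes per element
fp32_bytes = logits_bytes(batch, seq_len, vocab, 4)  # fp32: 4 bytes per element

# While the upcast copy is being created, both tensors exist at once.
peak_bytes = bf16_bytes + fp32_bytes

GIB = 1024 ** 3
print(f"bf16 logits: {bf16_bytes / GIB:.2f} GiB")
print(f"fp32 copy:   {fp32_bytes / GIB:.2f} GiB")
print(f"both live:   {peak_bytes / GIB:.2f} GiB")
```

Under these assumptions the fp32 copy is twice the size of the bf16 logits, and the two together take three times the bf16 footprint, which is consistent with having to shrink the minibatch to fit.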

Issue: #1980

Could you please provide more information or evidence about the stability argument?

4 participants