[`core`] enable `bf16` training #156

younesbelkada · 2023-02-16T15:07:57Z

Currently, on the main branch of trl, training in bfloat16 while logging with wandb fails. Users encounter an error that is hard to interpret.

The fixes are the following:

Remove wandb.watch(model), since it logs the gradients, which are in bf16 here; otherwise users get an error similar to wandb/wandb#3332 ([CLI]: RuntimeError: "histogram_cpu" not implemented for 'BFloat16'). I propose removing it because it is not a core feature and is the main blocking point.
Upcast the tensors to fp32 before logging them.
Force-upcast the lm_logits to fp32.
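
As a rough illustration of the second fix (not the actual diff in this PR), the sketch below upcasts bf16 tensors to fp32 before converting them for logging, since NumPy has no bfloat16 dtype. The helper name `to_loggable` and the stat keys are hypothetical and only serve the example.

```python
import numpy as np
import torch


def to_loggable(stats):
    """Upcast tensors (or lists of tensors) to fp32 and convert them to NumPy.

    Hypothetical helper for illustration; non-tensor values pass through unchanged.
    """
    def convert(value):
        if isinstance(value, torch.Tensor):
            # .float() upcasts bf16 to fp32, which NumPy can represent
            return value.detach().float().cpu().numpy()
        return value

    out = {}
    for key, value in stats.items():
        if isinstance(value, (list, tuple)):
            out[key] = [convert(v) for v in value]
        else:
            out[key] = convert(value)
    return out


# Example stats dict with a bf16 tensor (keys are made up for this sketch)
stats = {"ppo/loss": torch.tensor([0.5, 0.25], dtype=torch.bfloat16), "step": 3}
logged = to_loggable(stats)
print(logged["ppo/loss"].dtype)
```

Calling `.numpy()` directly on the bf16 tensor would raise instead, which is why the upcast comes first.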

cc @lvwerra

HuggingFaceDocBuilderDev · 2023-02-16T15:11:26Z

The documentation is not available anymore as the PR was closed or merged.

lvwerra · 2023-02-16T15:29:26Z

What's the issue with lm_logits being in bf16?

younesbelkada · 2023-02-16T15:30:52Z

So that we can compute the loss in fp32, which I found to be more stable. Also, we sometimes log a list of tensors, and casting the loss to fp32 directly avoids the issue with numpy and bf16.
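
A minimal sketch of what computing in fp32 from bf16 logits can look like, assuming random logits and labels with made-up shapes; this is not the code from the PR, only the general pattern of upcasting before the softmax.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch of 2, sequence of 4, vocabulary of 10
lm_logits = torch.randn(2, 4, 10, dtype=torch.bfloat16)
labels = torch.randint(0, 10, (2, 4))

# Upcast first, so log_softmax and everything downstream run in fp32
logprobs = F.log_softmax(lm_logits.float(), dim=-1)

# Pick the log-probability of each label token
token_logprobs = torch.gather(logprobs, 2, labels.unsqueeze(-1)).squeeze(-1)

print(token_logprobs.dtype, token_logprobs.shape)
```

Any loss built from `token_logprobs` is then already in fp32, so it can be converted to NumPy for logging without a further cast.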

RylanSchaeffer · 2024-08-27T13:20:00Z

@younesbelkada @lvwerra I'm running into a problem where this forced upcast causes a massive spike in memory for models with large vocabularies (e.g., Gemma 2 by Google). This then either throws an OOM error or forces me to cut the minibatch size in half, which doubles the PPO runtime.

Issue: #1980

Could you please provide more information or evidence about the stability argument?
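
For a sense of scale of the memory spike described above, here is a back-of-the-envelope estimate of the logits tensor size. The batch size and sequence length are hypothetical, and the vocabulary size of 256,000 is assumed for Gemma 2; the numbers only cover the logits themselves, not activations or gradients.

```python
# Assumed values for illustration only
batch_size = 8
seq_len = 512
vocab_size = 256_000  # assumed Gemma 2 vocabulary size

BYTES_BF16 = 2
BYTES_FP32 = 4
GIB = 2**30

num_elements = batch_size * seq_len * vocab_size

# Size of the original bf16 logits and of the extra fp32 copy made by the upcast
bf16_gib = num_elements * BYTES_BF16 / GIB
fp32_gib = num_elements * BYTES_FP32 / GIB

print(f"bf16 logits: {bf16_gib:.2f} GiB, fp32 copy: {fp32_gib:.2f} GiB")
```

Under these assumptions the upcast allocates a second tensor twice the size of the bf16 logits, so the two together take roughly three times the memory of the bf16 logits alone while both are alive.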

younesbelkada added 2 commits February 16, 2023 15:01

remove watch

52405a9

fix bf16 issues

ddd6c89

lvwerra approved these changes Feb 16, 2023

younesbelkada merged commit 07d3cbe into main Feb 16, 2023

younesbelkada deleted the fix-mixed-prec branch February 16, 2023 15:42

RylanSchaeffer mentioned this pull request Aug 27, 2024

PPOTrainer OOM Error Because of Forced Upcast to torch.float32 #1980

Open

4 tasks

RylanSchaeffer mentioned this pull request Sep 5, 2024

Negative Entropy in TRL PPOv2Trainer TLDR Example #2022

Open

4 tasks
