
RewardTrainer fails with FSDP #1195

Closed
mgerstgrasser opened this issue Jan 9, 2024 · 1 comment · Fixed by #1196
@mgerstgrasser
Contributor

I've just run into an odd issue with FSDP & RewardTrainer. It seems that when using FSDP, the output of the (sequence classification) model's forward function isn't as expected.
Normally, it returns a SequenceClassifierOutputWithPast where logits contains a tensor with the logits, and loss is empty or contains some sort of generator object.
When using FSDP, I'm instead getting a dict inside the loss field (and oddly enough that dict again contains a single key logits, although that's not the issue).

Not sure why this happens, but the net effect is that when the RewardTrainer tries to get the logits through model(...)[0] (see here), in the non-FSDP case it gets the logits, while in the FSDP case it gets the dict from the now non-empty loss field, and then fails a few lines later.

Two questions:

  1. This is easily fixed by doing model(...)["logits"] instead (see the sketch below this list). Any problem with doing that?
  2. Purely out of curiosity, does anyone know why this behaves differently with FSDP?
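Concretely, here's a minimal sketch of what the fix could look like. The helper name, batch keys, and the pairwise loss are my assumptions from reading the reward modeling code, not a verbatim patch:

```python
import torch.nn as nn

# Hypothetical helper mirroring the first steps of RewardTrainer.compute_loss;
# the batch keys are assumptions based on the reward modeling example.
def compute_pairwise_loss(model, inputs):
    rewards_chosen = model(
        input_ids=inputs["input_ids_chosen"],
        attention_mask=inputs["attention_mask_chosen"],
    )["logits"]  # was: model(...)[0], which breaks under FSDP
    rewards_rejected = model(
        input_ids=inputs["input_ids_rejected"],
        attention_mask=inputs["attention_mask_rejected"],
    )["logits"]
    # pairwise log-sigmoid loss, as in the RewardTrainer
    return -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
```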

To reproduce: Run examples/scripts/reward_modeling.py with accelerate + FSDP.

forward output in a single process:

```
SequenceClassifierOutputWithPast(loss=<generator object gather.<locals>.gather_map.<locals>.<genexpr> at 0x15360f993040>, logits=tensor([[...]], device='cuda:0', grad_fn=<GatherBackward>), past_key_values=None, hidden_states=None, attentions=None)
```

And in FSDP:

```
SequenceClassifierOutputWithPast(loss={'logits': tensor([[...]], device='cuda:1', grad_fn=<ToCopyBackward0>)}, logits=tensor([[...]], device='cuda:1', grad_fn=<ToCopyBackward0>), past_key_values=None, hidden_states=None, attentions=None)
```
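For the mechanics of the failure (separate from why FSDP populates loss in the first place, which I still don't know): ModelOutput's integer indexing goes through to_tuple(), which skips fields that are None, so [0] means "first populated field" rather than "logits". A small self-contained sketch with dummy tensors, mimicking the two outputs above:

```python
import torch
from transformers.modeling_outputs import SequenceClassifierOutputWithPast

# Non-FSDP-like case: loss unset, so [0] falls through to the logits tensor.
out = SequenceClassifierOutputWithPast(logits=torch.ones(1, 2))
print(out[0] is out["logits"])  # True

# FSDP-like case: loss populated (here with a dummy dict mimicking the
# output above), so [0] now returns the loss field instead of the logits.
out_fsdp = SequenceClassifierOutputWithPast(
    loss={"logits": torch.ones(1, 2)},
    logits=torch.ones(1, 2),
)
print(out_fsdp[0])         # {'logits': tensor([[1., 1.]])}
print(out_fsdp["logits"])  # tensor([[1., 1.]]) -- key lookup is unambiguous
```

So indexing by key gives the logits regardless of whether loss ends up populated, which is why it seems like the safer option.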
@younesbelkada
Contributor

Thanks for the deep dive! I left a suggestion on the PR, let me know what you think.
