model.forward requires num_logits_to_keep, not logits_to_keep #2770
Comments
What model do you use?
I am using Qwen2.5 7B and Llama3 8B.
Can't reproduce 🤔 (with both transformers 4.49 dev and 4.48):
>>> from transformers import AutoModelForCausalLM
>>> import torch
>>> model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B").to("cuda")
>>> input_ids = torch.randint(100, 200, (4, 256), device="cuda")
>>> model(input_ids, logits_to_keep=128)
CausalLMOutputWithPast(loss=None, logits=tensor([[[ 3.1937, 5.0300, 5.1271, ..., 0.9952, 0.9950, 0.9953],
...
[-5.3295, -7.6368, -1.7083, ..., 4.3898, 4.3899, 4.3898]]],
device='cuda:0', grad_fn=<UnsafeViewBackward0>), past_key_values=DynamicCache(), hidden_states=None, attentions=None)
@qgallouedec Hey, could you check the shape of the returned logits?
out = model(input_ids, logits_to_keep=128)
out.logits.shape
On my side, using
@qgallouedec Also, do you think this is related to #2731?
Nice find. Indeed, with 4.48 you'll get 256, while with 4.49 you'll get 128, as expected.
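To make the discrepancy concrete, here is a small illustration reusing model and input_ids from the reproduction snippet above; the 256-vs-128 values are the ones reported in this thread, not a general guarantee:

```python
# `model` and `input_ids` are the ones from the earlier reproduction snippet.
out = model(input_ids, logits_to_keep=128)  # input_ids has shape (4, 256)
# As observed above: on transformers 4.48 all 256 positions come back,
# while on 4.49 only the last 128 are kept.
print(out.logits.shape[1])
```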
Right... Transformers' latest modeling_qwen2.py says:
So this is a bug that affects some models on some Transformers versions. I've modified my issue to reflect this.
Thank you so much! I was baffled and spent hours debugging my code because of this. After applying the fix, the difference is obvious:
Before:
After:
This is a pretty major bug that makes the method unusable for Qwen models (and maybe others) on the latest stable Transformers. It would also be great to add some sort of check or assertion to make sure the shape of the logits matches what is expected.
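A minimal sketch of the kind of shape check suggested above; it reuses the variables from the reproduction snippet earlier in the thread, and the error message wording is an assumption, not actual TRL code:

```python
# `model` and `input_ids` come from the reproduction snippet earlier in the thread.
n_keep = 128
out = model(input_ids, logits_to_keep=n_keep)
# Fail loudly if forward() did not actually truncate the logits.
assert out.logits.shape[1] == n_keep, (
    f"Expected logits for {n_keep} positions, got {out.logits.shape[1]}; "
    "this model's forward() may expect num_logits_to_keep instead of logits_to_keep."
)
```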
@nopepper Hey, could you share the versions of
Interestingly, even with this bug I was able to train a Qwen2.5 model properly and it actually converged; everything was normal... now my brain exploded. I checked the code and it was indeed as you found: 256, not 128.
Update: Maybe it's because I was using a commit (without num_) from the prompt cache PR, which ran no_grad on the prompt part and so offsets the issue. The only effect was that the bug increased HBM usage.
Reproduction
In the _get_per_token_logps method of grpo_trainer.py, the model is called with the logits_to_keep keyword argument. But on Transformers 4.48, Qwen 2's model.forward has no logits_to_keep argument; it should be num_logits_to_keep, i.e., the call needs to pass num_logits_to_keep instead (see the sketch below).
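A minimal sketch, assuming one wants a call that works with either keyword name; the helper name forward_keep_last and the signature-inspection approach are illustrative, not the actual TRL implementation:

```python
import inspect

import torch
from transformers import AutoModelForCausalLM

def forward_keep_last(model, input_ids, n_keep):
    """Call the model, passing whichever 'keep' keyword its forward() accepts."""
    params = inspect.signature(model.forward).parameters
    if "logits_to_keep" in params:       # newer naming (e.g. transformers 4.49)
        return model(input_ids, logits_to_keep=n_keep)
    if "num_logits_to_keep" in params:   # older naming (e.g. Qwen2 on transformers 4.48)
        return model(input_ids, num_logits_to_keep=n_keep)
    return model(input_ids)              # model cannot truncate logits; return them all

# Usage, mirroring the reproduction snippet above:
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B").to("cuda")
input_ids = torch.randint(100, 200, (4, 256), device="cuda")
out = forward_keep_last(model, input_ids, 128)
print(out.logits.shape)  # expect 128 kept positions in dimension 1
```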
System Info
Checklist