
GRPO: Why does loss start at 0 for first K steps and then increase over time? #2703

Open
arnavgarg1 opened this issue Jan 30, 2025 · 8 comments
Labels
🏋 GRPO Related to GRPO ❓ question Seeking clarification or more information

Comments

@arnavgarg1
Contributor

Reproduction

Hi all!

I've been trying to train a variety of models using GRPO, but I noticed that the train/loss metric remains 0 or close to 0 throughout training even after a large number of steps (>200). My mean rewards also don't change significantly during this same period. This seems unexpected and might indicate that the optimization step isn't working as intended.

However, on the official docs page, I see that the image of the learning curves shared suggests that loss can remain close to 0 for long periods of time and actually increases instead of decreasing: https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig. Similarly, in @philschmid's blog post from earlier today (https://www.philschmid.de/mini-deepseek-r1), I noticed a similar trend where loss stayed at 0 for nearly 200 steps before increasing.

This makes it seem like it is the expected behavior, but I'm having a hard time understanding it.

I had a few questions that I am hoping someone is able to help me understand:

  1. Is there an issue with the loss computation in GRPO?
  2. Should loss values be expected to remain near zero in this setting? What does an increasing loss suggest?
  3. Could this be related to specific hyperparameters or gradient updates not being applied correctly?

Would appreciate any insights or guidance on debugging this! Thanks!

System Info

  • OS: Linux
  • Transformers: 4.48.1
  • TRL: Main Branch

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshot, more on code blocks)
  • Any traceback provided is complete
github-actions bot added the 🏋 GRPO (Related to GRPO) and ❓ question (Seeking clarification or more information) labels on Jan 30, 2025
@qgallouedec
Member

qgallouedec commented Jan 30, 2025

Interesting question!

The answer is in the math. If you calculate the value of the loss (ignore the gradient), you'll see that it's equal to $\beta \mathrm{KL}$. That's why it starts at 0 and that's why it's increasing.
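
To make this concrete, here is a minimal sketch (not the trainer code; β, the rewards, and the shapes are made up) of why the logged value reduces to $\beta \mathrm{KL}$: the ratio torch.exp(logps - logps.detach()) is exactly 1 in value, and the group-normalized advantages average to 0, so the policy term contributes nothing to the number that gets logged. What remains is the KL penalty, which is 0 at step 0 (policy == reference) and grows as the policy drifts away.

```python
import torch

torch.manual_seed(0)
beta = 0.04                                           # made-up KL coefficient
logps = torch.randn(4, 6, requires_grad=True)         # current policy log-probs (4 completions x 6 tokens)
ref_logps = logps.detach() + 0.1 * torch.randn(4, 6)  # reference-model log-probs
rewards = torch.tensor([0.0, 1.0, 0.0, 1.0])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)  # zero mean within the group

ratio = torch.exp(logps - logps.detach())             # value is exactly 1, but it still carries gradient
per_token_kl = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1
per_token_loss = -(ratio * advantages.unsqueeze(1) - beta * per_token_kl)
loss = per_token_loss.mean(dim=1).mean()

# The policy term averages to ~0 in value, so the reported loss is just beta * mean(KL).
print(loss.item(), beta * per_token_kl.mean().item())  # the two numbers match up to float error
```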

@arnavgarg1
Contributor Author

Thanks for the prompt response @qgallouedec!

Does this mean that the loss itself is not a reliable indicator of training progression and we should primarily rely on KL and reward trends instead?

@qgallouedec
Member

You should rely mostly on the reward, and keep an eye on the generations (risk of reward hacking).

@NickyDark1

I trained this model:
https://huggingface.co/NickyNicky/Llama-1B-base-GRPO-miniThinky_v1

These are my training metrics:

Image

I see that you don't have to wait long for it to move away from a value of 0. I also observe sudden changes when the generation range goes from 300 to 500 tokens; if it is increased further, say to 1000 tokens, that change alone can drive the rewards back to zero, and they stay there.

@NickyDark1

Another model I trained:

Image

Image

I think changes in the generation length affect training.

@XiaofengZHOU

> Interesting question!
> The answer is in the math. If you calculate the value of the loss (ignore the gradient), you'll see that it's equal to $\beta \mathrm{KL}$. That's why it starts at 0 and that's why it's increasing.

Hi, I am a bit confused. From the implementation of per_token_loss, it should just be the advantages, since torch.exp(0) = 1. So how could the loss be $\beta \mathrm{KL}$ for a number of steps? 🤔

trl/trl/trainer/grpo_trainer.py, line 567 in af4ad47:

per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)

I think it's not the same as the original GRPO algorithm (the ratio and the clamp are missing).

@qgallouedec
Member

> I think it's not the same as the original GRPO algorithm (the ratio and the clamp are missing).

It is the same, since we do 1 optimization step
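
Spelling that out with a toy example (made-up numbers, not the trainer code): with a single optimization step per batch of generations, the "old" policy in the GRPO ratio is the current policy, so the ratio is exactly 1 and the clip can never trigger. The expression torch.exp(logp - logp.detach()) is 1 in value, but it is not a constant for autograd, so the advantage still reaches the parameters through the gradient:

```python
import torch

logp = torch.tensor([-0.6, -0.5], requires_grad=True)  # log-probs of one completion's tokens
advantage = torch.tensor(1.0)                           # made-up positive advantage

# Same form as the policy term above, for a single completion
surrogate = (torch.exp(logp - logp.detach()) * advantage).mean()
surrogate.backward()

print(surrogate.item())  # 1.0 -> a constant contribution to the loss *value*
print(logp.grad)         # tensor([0.5000, 0.5000]) -> nonzero gradient, scaled by the advantage
```

Maximizing this surrogate raises the log-probs of positively rewarded completions even though the logged loss value never reflects it.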

@XiaofengZHOU

> > I think it's not the same as the original GRPO algorithm (the ratio and the clamp are missing).
>
> It is the same, since we do 1 optimization step

According to the equation, the loss == $\beta \mathrm{KL}$, which would mean the bigger the KL, the better the performance? So how does the reward work?

For example:

rewards = torch.tensor([0, 1, 0], dtype=torch.float32)
per_token_logps1 = [[-0.4, -0.3], [-0.6, -0.5], [-1, -1]]
per_token_logps2 = [[-0.6, -0.5], [-0.4, -0.3], [-1, -1]]

The losses calculated are both tensor(0.0062).

Can you point out where my problem is?
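
For reference, here is a sketch that runs the numbers above through a simplified version of the loss (β = 0.04 and the reference log-probs are assumptions, since they are not given in the comment): the two sets of log-probs do produce the same loss value, because the value is essentially $\beta \mathrm{KL}$. The reward acts through the gradient instead; in both cases a descent step raises the log-probs of the completion with reward 1.

```python
import torch

def grpo_loss(per_token_logps, ref_logps, rewards, beta=0.04):
    # Simplified single-step GRPO loss (equal-length completions, no completion mask)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
    ratio = torch.exp(per_token_logps - per_token_logps.detach())
    per_token_kl = torch.exp(ref_logps - per_token_logps) - (ref_logps - per_token_logps) - 1
    per_token_loss = -(ratio * advantages.unsqueeze(1) - beta * per_token_kl)
    return per_token_loss.mean(dim=1).mean()

rewards = torch.tensor([0.0, 1.0, 0.0])
ref = torch.tensor([[-0.5, -0.4], [-0.5, -0.4], [-1.0, -1.0]])    # assumed reference log-probs
logps1 = torch.tensor([[-0.4, -0.3], [-0.6, -0.5], [-1.0, -1.0]], requires_grad=True)
logps2 = torch.tensor([[-0.6, -0.5], [-0.4, -0.3], [-1.0, -1.0]], requires_grad=True)

for logps in (logps1, logps2):
    loss = grpo_loss(logps, ref, rewards)
    loss.backward()
    # Same loss value both times (essentially beta * mean KL), but the gradient on the
    # reward-1 completion (row 1) is negative, i.e. a descent step raises those log-probs.
    print(f"loss={loss.item():.6f}  grad on rewarded completion: {logps.grad[1]}")
```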
