
GRPO: Why does loss start at 0 for first K steps and then increase over time? #2703

Open
arnavgarg1 opened this issue Jan 30, 2025 · 8 comments
Labels
🏋 GRPO Related to GRPO ❓ question Seeking clarification or more information

Comments

@arnavgarg1
Contributor

Reproduction

Hi all!

I've been trying to train a variety of models using GRPO, but I noticed that the train/loss metric remains 0 or close to 0 throughout training even after a large number of steps (>200). My mean rewards also don't change significantly during this same period. This seems unexpected and might indicate that the optimization step isn't working as intended.

However, on the official docs page, I see that the image of the learning curves shared suggests that loss can remain close to 0 for long periods of time and actually increases instead of decreasing: https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig. Similarly, in @philschmid's blog post from earlier today (https://www.philschmid.de/mini-deepseek-r1), I noticed a similar trend where loss stayed at 0 for nearly 200 steps before increasing.

This makes it seem like it is the expected behavior, but I'm having a hard time understanding it.

I had a few questions that I am hoping someone is able to help me understand:

  1. Is there an issue with the loss computation in GRPO?
  2. Should loss values be expected to remain near zero in this setting? What does an increasing loss suggest?
  3. Could this be related to specific hyperparameters or gradient updates not being applied correctly?

Would appreciate any insights or guidance on debugging this! Thanks!

System Info

  • OS: Linux
  • Transformers: 4.48.1
  • TRL: Main Branch

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshot, more on code blocks)
  • Any traceback provided is complete
github-actions bot added the 🏋 GRPO (Related to GRPO) and ❓ question (Seeking clarification or more information) labels on Jan 30, 2025
@qgallouedec
Member

qgallouedec commented Jan 30, 2025

Interesting question!

The answer is in the math. If you calculate the value of the loss (ignore the gradient), you'll see that it's equal to $\beta \mathrm{KL}$. That's why it starts at 0 and that's why it's increasing.
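
To make this concrete, here is a minimal sketch (not the trainer code; β, the rewards, and the shapes are made up) of why the logged value reduces to $\beta \mathrm{KL}$: the ratio torch.exp(logps - logps.detach()) is exactly 1 in value, and the group-normalized advantages average to 0, so the policy term contributes nothing to the number that gets logged. What remains is the KL penalty, which is 0 at step 0 (policy == reference) and grows as the policy drifts away.

```python
import torch

torch.manual_seed(0)
beta = 0.04                                           # made-up KL coefficient
logps = torch.randn(4, 6, requires_grad=True)         # current policy log-probs (4 completions x 6 tokens)
ref_logps = logps.detach() + 0.1 * torch.randn(4, 6)  # reference-model log-probs
rewards = torch.tensor([0.0, 1.0, 0.0, 1.0])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)  # zero mean within the group

ratio = torch.exp(logps - logps.detach())             # value is exactly 1, but it still carries gradient
per_token_kl = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1
per_token_loss = -(ratio * advantages.unsqueeze(1) - beta * per_token_kl)
loss = per_token_loss.mean(dim=1).mean()

# The policy term averages to ~0 in value, so the reported loss is just beta * mean(KL).
print(loss.item(), beta * per_token_kl.mean().item())  # the two numbers match up to float error
```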

@arnavgarg1
Contributor Author

Thanks for the prompt response @qgallouedec!

Does this mean that the loss itself is not a reliable indicator of training progression and we should primarily rely on KL and reward trends instead?

@qgallouedec
Member

You should rely mostly on the reward, and keep an eye on the generations (risk of reward hacking).

@NickyDark1

I trained this model:
https://huggingface.co/NickyNicky/Llama-1B-base-GRPO-miniThinky_v1

These are my training metrics:

Image

I see that you don't have to wait long for it to move away from a value of 0. I also observe sudden changes when the generation range goes from 300 to 500 tokens; if it is increased further, say to 1000 tokens, that change alone can drive the rewards back to zero, and they stay there.

@NickyDark1

Another model I trained:

Image

Image

I think changes in the generation length affect training.

@XiaofengZHOU

> Interesting question!
> The answer is in the math. If you calculate the value of the loss (ignore the gradient), you'll see that it's equal to $\beta \mathrm{KL}$. That's why it starts at 0 and that's why it's increasing.

Hi, I am a bit confused. From the implementation of per_token_loss, it should just be the advantages, since torch.exp(0) = 1. So how could the loss be $\beta \mathrm{KL}$ for a number of steps? 🤔

trl/trl/trainer/grpo_trainer.py, line 567 in af4ad47:

per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)

I think it's not the same as the original GRPO algorithm (the ratio and the clamp are missing).

@qgallouedec
Member

> I think it's not the same as the original GRPO algorithm (the ratio and the clamp are missing).

It is the same, since we do 1 optimization step
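
Spelling that out with a toy example (made-up numbers, not the trainer code): with a single optimization step per batch of generations, the "old" policy in the GRPO ratio is the current policy, so the ratio is exactly 1 and the clip can never trigger. The expression torch.exp(logp - logp.detach()) is 1 in value, but it is not a constant for autograd, so the advantage still reaches the parameters through the gradient:

```python
import torch

logp = torch.tensor([-0.6, -0.5], requires_grad=True)  # log-probs of one completion's tokens
advantage = torch.tensor(1.0)                           # made-up positive advantage

# Same form as the policy term above, for a single completion
surrogate = (torch.exp(logp - logp.detach()) * advantage).mean()
surrogate.backward()

print(surrogate.item())  # 1.0 -> a constant contribution to the loss *value*
print(logp.grad)         # tensor([0.5000, 0.5000]) -> nonzero gradient, scaled by the advantage
```

Maximizing this surrogate raises the log-probs of positively rewarded completions even though the logged loss value never reflects it.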

@XiaofengZHOU

> > I think it's not the same as the original GRPO algorithm (the ratio and the clamp are missing).
>
> It is the same, since we do 1 optimization step

According to the equation, the loss == $\beta \mathrm{KL}$, which would mean the bigger the KL, the better the performance? So how does the reward work?

For example:

rewards = torch.tensor([0, 1, 0], dtype=torch.float32)
per_token_logps1 = [[-0.4, -0.3], [-0.6, -0.5], [-1, -1]]
per_token_logps2 = [[-0.6, -0.5], [-0.4, -0.3], [-1, -1]]

The losses calculated are both tensor(0.0062).

Can you point out where my problem is?
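
For reference, here is a sketch that runs the numbers above through a simplified version of the loss (β = 0.04 and the reference log-probs are assumptions, since they are not given in the comment): the two sets of log-probs do produce the same loss value, because the value is essentially $\beta \mathrm{KL}$. The reward acts through the gradient instead; in both cases a descent step raises the log-probs of the completion with reward 1.

```python
import torch

def grpo_loss(per_token_logps, ref_logps, rewards, beta=0.04):
    # Simplified single-step GRPO loss (equal-length completions, no completion mask)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
    ratio = torch.exp(per_token_logps - per_token_logps.detach())
    per_token_kl = torch.exp(ref_logps - per_token_logps) - (ref_logps - per_token_logps) - 1
    per_token_loss = -(ratio * advantages.unsqueeze(1) - beta * per_token_kl)
    return per_token_loss.mean(dim=1).mean()

rewards = torch.tensor([0.0, 1.0, 0.0])
ref = torch.tensor([[-0.5, -0.4], [-0.5, -0.4], [-1.0, -1.0]])    # assumed reference log-probs
logps1 = torch.tensor([[-0.4, -0.3], [-0.6, -0.5], [-1.0, -1.0]], requires_grad=True)
logps2 = torch.tensor([[-0.6, -0.5], [-0.4, -0.3], [-1.0, -1.0]], requires_grad=True)

for logps in (logps1, logps2):
    loss = grpo_loss(logps, ref, rewards)
    loss.backward()
    # Same loss value both times (essentially beta * mean KL), but the gradient on the
    # reward-1 completion (row 1) is negative, i.e. a descent step raises those log-probs.
    print(f"loss={loss.item():.6f}  grad on rewarded completion: {logps.grad[1]}")
```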
