Negative Entropy in TRL PPOv2Trainer TLDR Example #2022

Open

RylanSchaeffer opened this issue Sep 5, 2024 · 3 comments
Labels
🙋 help from community wanted (Open invitation for community members to contribute) · 🏋 PPO (Related to PPO) · ❓ question (Seeking clarification or more information)

Comments

RylanSchaeffer (Contributor) commented Sep 5, 2024

System Info

  • transformers version: 4.44.0
  • Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
  • Python version: 3.11.9
  • Huggingface_hub version: 0.23.4
  • Safetensors version: 0.4.3
  • Accelerate version: 0.32.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: FSDP
    - mixed_precision: bf16
    - use_cpu: False
    - debug: True
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - fsdp_config: {'fsdp_activation_checkpointing': True, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': True, 'fsdp_offload_params': True, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_use_orig_params': True}
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
    - dynamo_config: {'dynamo_backend': 'EAGER'}
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: Yes
  • Using GPU in script? Yes
  • GPU type: NVIDIA A100-SXM4-80GB

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

In TRL's PPOv2Trainer TLDR example, run the default command:

accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
    examples/scripts/ppo/ppo_tldr.py \
    --output_dir models/minimal/ppo_tldr \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --total_episodes 1000000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
    --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
    --local_rollout_forward_batch_size 16 \
    --non_eos_penalty \
    --stop_token eos

Expected behavior

The entropy of a discrete distribution (such as a language model's next-token distribution) must be non-negative. However, when I run the official example, the logged entropy can be negative:

[Plot from the reporter's run: objective/entropy dips below zero during training]

I don't think I'm making a mistake because this negative entropy also appears in the official documentation. Specifically, look early in training, at maybe 20k episodes:

[Plot from the TRL documentation: objective/entropy is negative early in training, around 20k episodes]

The documentation describes objective/entropy as "The mean entropy of the policy, indicating the randomness of the actions chosen by the policy." If this is incorrect, and some other quantity is computed instead, then perhaps the documentation needs to be updated?
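As a sanity check on the definition: for a categorical distribution with probabilities p_i, the entropy is H = -Σ_i p_i log p_i, and every term -p_i log p_i is ≥ 0, so H can never be negative. Below is a minimal float32 sketch; the shapes and vocabulary size are illustrative, and the logsumexp-minus-expected-logit identity is one standard way to get entropy from logits, not necessarily character-for-character what the trainer does:

import torch
import torch.nn.functional as F

# Illustrative shapes: batch of 2 sequences, 5 positions, vocab of 50304.
logits = torch.randn(2, 5, 50304, dtype=torch.float32)

# H = logsumexp(logits) - E_p[logits], with p = softmax(logits).
prob_dist = F.softmax(logits, dim=-1)
entropy = torch.logsumexp(logits, dim=-1) - torch.sum(prob_dist * logits, dim=-1)

# In exact arithmetic (and comfortably in float32 here), every entry is >= 0.
assert (entropy >= 0).all()
print(entropy.mean())  # comfortably positive for these random logits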

RylanSchaeffer added the 🐛 bug (Something isn't working) label on Sep 5, 2024
RylanSchaeffer (Contributor, Author) commented Sep 5, 2024

I don't know if this is the culprit, but I noticed that the tutorial and I both use bf16, and in bf16 the following two quantities do not agree:

torch.einsum("bse,bse->bs", prob_dist, logits) - torch.sum(prob_dist * logits, dim=-1)

The difference is non-zero:

tensor([[ 0.0000,  0.1250, -0.1250,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.1250,  0.0000, ...0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000]], device='cuda:0',
       dtype=torch.bfloat16)
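In case it helps anyone reproduce this, here is a self-contained sketch of the comparison above with random bf16 tensors; the shapes are arbitrary and the exact nonzero entries will differ from run to run and across hardware:

import torch

torch.manual_seed(0)

# Arbitrary shapes: batch=1, seq=16, vocab=4096, everything in bf16.
logits = torch.randn(1, 16, 4096, dtype=torch.bfloat16)
prob_dist = torch.softmax(logits, dim=-1)

# Mathematically these two reductions are identical; in bf16 they can differ
# because the summation order and accumulation precision are not the same.
via_einsum = torch.einsum("bse,bse->bs", prob_dist, logits)
via_sum = torch.sum(prob_dist * logits, dim=-1)

print(via_einsum - via_sum)                # often has nonzero entries in bf16
print((via_einsum - via_sum).abs().max())  # magnitude of the disagreement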

RylanSchaeffer (Contributor, Author) commented:
Following this previous PR, it might be worthwhile to consider upcasting the tensors before computing logged quantities.

But I don't know if this explains how the entropy is becoming negative...
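To make the suggestion concrete, here is a sketch of the kind of upcast I mean. It is an illustrative helper, not a patch against the actual trainer code, and the function name is made up:

import torch

def entropy_for_logging(logits: torch.Tensor) -> torch.Tensor:
    # Upcast from bf16 to float32 before computing the logged quantity,
    # so rounding error is far less likely to push per-token entropy below zero.
    logits = logits.float()
    prob_dist = torch.softmax(logits, dim=-1)
    entropy = torch.logsumexp(logits, dim=-1) - torch.sum(prob_dist * logits, dim=-1)
    return entropy.mean()

# Example: bf16 logits straight from the policy forward pass.
logits_bf16 = torch.randn(1, 16, 4096, dtype=torch.bfloat16)
print(entropy_for_logging(logits_bf16))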

RylanSchaeffer (Contributor, Author) commented:
On another PPOv2 run, I again observe negative entropy:

[Plot from a second PPOv2 run: objective/entropy again goes negative]

qgallouedec added the ❓ question (Seeking clarification or more information) and 🏋 PPO (Related to PPO) labels and removed the 🐛 bug (Something isn't working) label on Oct 21, 2024
qgallouedec added the 🙋 help from community wanted (Open invitation for community members to contribute) label on Dec 14, 2024