Negative Entropy in TRL PPOv2Trainer TLDR Example #2022
Labels

- 🙋 help from community wanted (open invitation for community members to contribute)
- 🏋 PPO (related to PPO)
- ❓ question (seeking clarification or more information)
System Info
- transformers version: 4.44.0
- distributed_type: FSDP
- mixed_precision: bf16
- use_cpu: False
- debug: True
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- fsdp_config: {'fsdp_activation_checkpointing': True, 'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': True, 'fsdp_offload_params': True, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'SHARDED_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_use_orig_params': True}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- dynamo_config: {'dynamo_backend': 'EAGER'}
Information
Tasks
- An officially supported task in the examples folder
Reproduction
In TRL's PPOv2Trainer TLDR example, run the default command:
Expected behavior
Entropy of a discrete distribution (such as a language model's next-token distribution) must be non-negative. However, when I run the official example, the reported entropy can be negative:
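For reference, the Shannon entropy H(p) = -sum_i p_i * log(p_i) of any discrete distribution is non-negative, since every term -p_i * log(p_i) >= 0 for p_i in [0, 1]. A minimal sanity check in PyTorch (my own snippet, not TRL code), using the equivalent form H = logsumexp(logits) - sum(softmax(logits) * logits):

```python
import torch

torch.manual_seed(0)

# Arbitrary logits standing in for a language model's next-token scores.
logits = torch.randn(4, 32_000)  # (batch, vocab_size)

# Exact per-example entropy of the categorical distribution softmax(logits):
# H = logsumexp(logits) - sum(softmax(logits) * logits)
probs = torch.softmax(logits, dim=-1)
entropy = torch.logsumexp(logits, dim=-1) - (probs * logits).sum(dim=-1)

print(entropy)              # every entry is >= 0
assert (entropy >= 0).all()
```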
I don't think I'm making a mistake because this negative entropy also appears in the official documentation. Specifically, look early in training, at maybe 20k episodes:
The documentation describes `objective/entropy` as "The mean entropy of the policy, indicating the randomness of the actions chosen by the policy." If this description is incorrect and some other quantity is being computed instead, then perhaps the documentation needs to be updated?