Validation loss saved in filename by ModelCheckpoint is incorrect when using DDP with multiple GPUs #6138
Labels: bug (Something isn't working), checkpointing (Related to checkpointing), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 1 (Medium priority task)
🐛 Bug
When using DDP with 2 GPUs and logging the validation loss in `validation_step` with `self.log('val_loss', loss, sync_dist=True)`, the ModelCheckpoint callback embeds a validation loss in the checkpoint filename that is multiplied by 2 (the number of GPUs?). This happens in Lightning 1.2.0. This is the message printed by the ModelCheckpoint callback:
To Reproduce
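A minimal reproduction sketch, assuming a BoringModel-style setup; the model, data, and hyperparameters below are illustrative placeholders, not taken from the original report:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch[0]).sum()

    def validation_step(self, batch, batch_idx):
        loss = self(batch[0]).sum()
        # sync_dist=True should reduce the loss across both GPUs
        self.log('val_loss', loss, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def run():
    dataset = TensorDataset(torch.randn(64, 32))
    train_loader = DataLoader(dataset, batch_size=8)
    val_loader = DataLoader(dataset, batch_size=8)

    checkpoint = ModelCheckpoint(
        monitor='val_loss',
        # the val_loss formatted into the filename here is the value
        # that comes out doubled relative to the logged metric
        filename='{epoch}-{val_loss:.4f}',
    )
    trainer = pl.Trainer(
        gpus=2,
        accelerator='ddp',
        max_epochs=2,
        callbacks=[checkpoint],
    )
    trainer.fit(BoringModel(), train_loader, val_loader)


if __name__ == '__main__':
    run()
```

With this script, the filename written by ModelCheckpoint should show roughly twice the val_loss reported in its printed message and in the logger.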
Expected behavior
The loss embedded in the filename should be the same as the loss in the message and logger.
Environment
- PyTorch Lightning Version: 1.2.0
- How you installed PyTorch (`conda`, `pip`, source): pip