
[Bug] Gradients not synchronized #924

Open

mephisto28 opened this issue Nov 3, 2023 · 2 comments

Comments

@mephisto28

text_encoder, unet = train_util.transform_if_model_is_DDP(text_encoder, unet)

I have no idea what this line is for, but it unwraps the DDP module, so the training process becomes unsynchronized: there is no gradient communication in multi-GPU training, and each node trains independently on its own part of the data.
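To illustrate what I mean (just a minimal sketch with made-up names, not the actual training loop in this repo): DDP only performs its gradient all-reduce when the forward pass goes through the DDP wrapper, so once the model is unwrapped back to the plain module, backward() keeps the gradients purely local on each rank.

```python
# Minimal sketch (hypothetical names, not sd-scripts code) of why unwrapping
# DDP disables gradient synchronization.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def synced_step(ddp_model: DDP, batch: torch.Tensor) -> None:
    # Forward through the DDP wrapper prepares the reducer,
    # so backward() all-reduces gradients across ranks.
    loss = ddp_model(batch).sum()
    loss.backward()

def unsynced_step(ddp_model: DDP, batch: torch.Tensor) -> None:
    # Roughly what happens after the unwrap: the inner module is called
    # directly, DDP's reducer never runs, and each rank ends up with
    # gradients computed only from its own shard of the data.
    plain_model = ddp_model.module
    loss = plain_model(batch).sum()
    loss.backward()
```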

I confirmed this by adding a sleep in one of the workers and observing that the main training process did not hang waiting for it. After deleting this line, the job was properly synchronized.
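Another way to check (again, a sketch of my own, not code from this repo) is to compare a scalar summary of the gradients across ranks right after backward(); if DDP's all-reduce ran, every rank should report the identical value.

```python
# Sketch (hypothetical helper, not part of sd-scripts): after backward(),
# gather a gradient summary from every rank and check that they all match.
import torch
import torch.distributed as dist

def grads_are_synchronized(model: torch.nn.Module, atol: float = 1e-6) -> bool:
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    # Sum of absolute gradient values on this rank, as a 1-element tensor.
    local = torch.stack([g.abs().sum() for g in grads]).sum().reshape(1)
    gathered = [torch.empty_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    # With working DDP the averaged gradients are identical on every rank.
    return all(torch.allclose(t, gathered[0], atol=atol) for t in gathered)
```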

@mephisto28
Copy link
Author

With the above-mentioned line not deleted: [image]

With the above-mentioned line deleted: [image]

@kohya-ss
Owner

kohya-ss commented Nov 5, 2023

Thank you for opening the issue.

I have not directly developed the multi-GPU training code myself; it came from contributed PRs.

Perhaps these PRs can serve as a reference: #165 and #448
