
[Bug] Gradients not synchronized #924

Open

mephisto28 opened this issue Nov 3, 2023 · 2 comments

Comments

@mephisto28

text_encoder, unet = train_util.transform_if_model_is_DDP(text_encoder, unet)

I have no idea what this line is for, but it unwraps the DDP module, so the training process becomes unsynchronized: there is no gradient communication in multi-GPU training, and each node trains independently on its own part of the data.
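To illustrate what I mean (just a minimal sketch with made-up names, not the actual training loop in this repo): DDP only performs its gradient all-reduce when the forward pass goes through the DDP wrapper, so once the model is unwrapped back to the plain module, backward() keeps the gradients purely local on each rank.

```python
# Minimal sketch (hypothetical names, not sd-scripts code) of why unwrapping
# DDP disables gradient synchronization.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def synced_step(ddp_model: DDP, batch: torch.Tensor) -> None:
    # Forward through the DDP wrapper prepares the reducer,
    # so backward() all-reduces gradients across ranks.
    loss = ddp_model(batch).sum()
    loss.backward()

def unsynced_step(ddp_model: DDP, batch: torch.Tensor) -> None:
    # Roughly what happens after the unwrap: the inner module is called
    # directly, DDP's reducer never runs, and each rank ends up with
    # gradients computed only from its own shard of the data.
    plain_model = ddp_model.module
    loss = plain_model(batch).sum()
    loss.backward()
```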

I confirmed this by adding a sleep in one of the workers and observing that the main training process did not hang waiting for it. After deleting this line, the job was properly synchronized.
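Another way to check (again, a sketch of my own, not code from this repo) is to compare a scalar summary of the gradients across ranks right after backward(); if DDP's all-reduce ran, every rank should report the identical value.

```python
# Sketch (hypothetical helper, not part of sd-scripts): after backward(),
# gather a gradient summary from every rank and check that they all match.
import torch
import torch.distributed as dist

def grads_are_synchronized(model: torch.nn.Module, atol: float = 1e-6) -> bool:
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    # Sum of absolute gradient values on this rank, as a 1-element tensor.
    local = torch.stack([g.abs().sum() for g in grads]).sum().reshape(1)
    gathered = [torch.empty_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    # With working DDP the averaged gradients are identical on every rank.
    return all(torch.allclose(t, gathered[0], atol=atol) for t in gathered)
```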

@mephisto28
Copy link
Author

With the above-mentioned line not deleted: [image]

With the above-mentioned line deleted: [image]

@kohya-ss
Owner

kohya-ss commented Nov 5, 2023

Thank you for opening the issue.

I have not directly developed the multi-GPU training code myself; it came from contributed PRs.

Perhaps these PRs can serve as a reference: #165 and #448
