"Parameter indices which did not receive grad for rank x", Multi-GPU SDXL Training (unet + both text encoders) #997
Comments
I'm thinking that it has something to do with the text encoders still being wrapped when getting the hidden states.

Unwrapping both text encoders before calling `get_hidden_states_sdxl`:

```python
te1 = text_encoder1.module if type(text_encoder1) == DDP else text_encoder1
te2 = text_encoder2.module if type(text_encoder2) == DDP else text_encoder2
# unwrap_model is fine for models not wrapped by accelerator
encoder_hidden_states1, encoder_hidden_states2, pool2 = train_util.get_hidden_states_sdxl(
    args.max_token_length,
    input_ids1,
    input_ids2,
    tokenizer1,
    tokenizer2,
    te1,
    te2,
    None if not args.full_fp16 else weight_dtype,
)
```

and reverting this change:

```python
unwrapped_text_encoder2 = text_encoder2 if accelerator is None else accelerator.unwrap_model(text_encoder2)
pool2 = pool_workaround(unwrapped_text_encoder2, enc_out["last_hidden_state"], input_ids2, tokenizer2.eos_token_id)
```

back to:

```python
pool2 = pool_workaround(text_encoder2, enc_out["last_hidden_state"], input_ids2, tokenizer2.eos_token_id)
```

avoids the error. But with this change, I'm afraid it broke the gradient synchronization.
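For illustration, here is a minimal, self-contained sketch (not the repository's code) of why routing the forward pass through the DDP wrapper matters: DDP sets up its gradient-synchronization bookkeeping in the wrapper's `forward()`, so calling the underlying `.module` directly computes the same activations but leaves the reducer unaware of that pass.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group so the example runs standalone on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

wrapped = DDP(torch.nn.Linear(8, 8))


def unwrap(model):
    # Same idea as `text_encoder1.module if type(text_encoder1) == DDP else text_encoder1`.
    return model.module if isinstance(model, DDP) else model


x = torch.randn(4, 8)

# Forward through the wrapper: DDP notes which parameters should receive
# gradients, and backward() will all-reduce them across ranks.
wrapped(x).sum().backward()

# Forward through the unwrapped module: numerically identical output, but DDP
# is never told about this pass, so a backward() taken from this path would
# not be synchronized across ranks.
with torch.no_grad():
    _ = unwrap(wrapped)(x)

dist.destroy_process_group()
```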
I reproduced this with training only
For single-GPU training, the grad of hidden_layer11 in text_encoder1 is also 0, but this will not raise an error, because the check is performed by DDP's reducer, which is not involved in single-GPU training.
Anyway, regarding the zero grad, see Lines 400 to 401 in 4a2cef8:

```python
if train_text_encoder1:
    # frozen layers11 and final_layer_norm
    text_encoder1.text_model.encoder.layers[11].requires_grad_(False)
    text_encoder1.text_model.final_layer_norm.requires_grad_(False)
    text_encoder1 = accelerator.prepare(text_encoder1)
```
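As general background (standard PyTorch DDP behavior, not a diagnosis of this particular bug), DDP's reducer only tracks parameters that require grad at wrap time, so the order of freezing relative to `accelerator.prepare` / DDP wrapping determines whether frozen parameters are later expected to receive gradients. A minimal sketch, using a toy stack of linear layers in place of the CLIP text model:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

# Toy stand-in for the text encoder: 12 "layers".
encoder = torch.nn.Sequential(*[torch.nn.Linear(8, 8) for _ in range(12)])

# Freeze the last block *before* wrapping, analogous to freezing layers[11]
# and final_layer_norm above. DDP's reducer does not register parameters with
# requires_grad=False at construction time, so it never expects grads for them.
encoder[11].requires_grad_(False)

ddp_encoder = DDP(encoder)

out = ddp_encoder(torch.randn(4, 8))
out.sum().backward()  # no "did not receive grad" complaint for the frozen block

dist.destroy_process_group()
```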
I checked open_clip's implementation and this repo, and I don't see anything mentioning that we should freeze these layers.
I don't know the reason for the RuntimeError, but SDXL uses the output of the penultimate layer of Text Encoder 1, so I think freezing the last layer and final_layer_norm of CLIP should not change the training result.
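For reference, a minimal sketch (using Hugging Face transformers' `CLIPTextModel`, not the repository's `get_hidden_states_sdxl`) of why the last layer and `final_layer_norm` do not affect the penultimate hidden state: it is produced before either of them runs.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(["a photo of a cat"], padding="max_length", return_tensors="pt")
with torch.no_grad():
    out = text_encoder(tokens.input_ids, output_hidden_states=True)

# hidden_states[-1] is the last layer's output (final_layer_norm is applied
# afterwards to produce last_hidden_state); hidden_states[-2] is the
# penultimate layer's output used as SDXL-style conditioning.
penultimate = out.hidden_states[-2]
print(penultimate.shape)  # (1, 77, 768)
```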
Oops, sorry, I got the text encoders flipped in my mind. I thought that …
Fixed by PR #1000.
I've encountered an error while training the SDXL UNet with both text encoders using the latest development branch. Here's part of the traceback:

This error only appears when training the text encoders. The parameter indices that are consistently identified as not receiving gradients are:

178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195

I also tried adding `find_unused_parameters=True` to the `DistributedDataParallelKwargs`, but the problem persists. I'm pretty sure this started happening after PR #989 was merged into the dev branch.
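For reference, this is roughly how that flag reaches DDP through Accelerate (a sketch of the standard Accelerate API, not the training script's actual setup):

```python
from accelerate import Accelerator
from accelerate.utils import DistributedDataParallelKwargs

# Ask Accelerate to construct DDP with find_unused_parameters=True, which makes
# the reducer tolerate parameters that receive no gradient instead of raising
# "Parameter indices which did not receive grad".
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

# Anything passed through accelerator.prepare() is then wrapped with these
# DDP kwargs, e.g.:
# unet, text_encoder1, text_encoder2, optimizer = accelerator.prepare(
#     unet, text_encoder1, text_encoder2, optimizer
# )
```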