Error when trying to train using DAdaptation optimizer #1160

Closed

aearone opened this issue Jul 11, 2023 · 3 comments

aearone commented Jul 11, 2023

I don't know what's wrong, but I guess DAdaptation just isn't installed correctly in the kohya_ss\venv\Lib\site-packages directory, although the dadaptation folder is present. I tried updating/reinstalling dadaptation by activating the venv and running pip install dadaptation, but it didn't help. I still get an error when trying to train. In the venv\Lib\site-packages directory there is a dadaptation-3.1.dist-info folder with no executables, only metadata. Maybe I'm doing something wrong?
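For reference, one hedged way to check whether the package is actually importable from this venv (the paths mirror the ones in the log below; nothing beyond pip itself is assumed):

    g:\kohya_ss\kohya_ss>venv\Scripts\activate
    (venv) g:\kohya_ss\kohya_ss>pip show dadaptation
    (venv) g:\kohya_ss\kohya_ss>python -c "import dadaptation; print(dadaptation.__file__)"

If the import fails or pip show reports nothing, the package was most likely installed into a different Python environment than the one kohya_ss launches with.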

C:\Windows\System32>cd /d g:\kohya_ss\kohya_ss

g:\kohya_ss\kohya_ss>gui-user.bat
18:47:42-349135 INFO     nVidia toolkit detected
18:47:48-212559 INFO     Torch 2.0.1+cu118
18:47:48-250872 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8800
18:47:48-255871 INFO     Torch detected GPU: NVIDIA GeForce RTX 3060 Ti VRAM 8192 Arch (8, 6) Cores 38
18:47:48-257877 INFO     Verifying requirements
18:47:48-265180 INFO     Installing package: diffusers[torch]==0.10.2
18:47:55-969572 INFO     headless: False
18:47:55-977889 INFO     Load CSS...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
18:52:24-768713 INFO     Loading config...
18:53:05-329841 INFO     Start training LoRA Standard ...
18:53:05-331840 INFO     Valid image folder names found in: G:/LoraTraining/img
18:53:06-852814 INFO     Folder 10_N4: 108 images found
18:53:06-853815 INFO     Folder 10_N4: 1080 steps
18:53:06-855817 INFO     Total steps: 1080
18:53:06-856815 INFO     Train batch size: 2
18:53:06-858814 INFO     Gradient accumulation steps: 1
18:53:06-859816 INFO     Epoch: 7
18:53:06-861817 INFO     Regulatization factor: 1
18:53:06-862815 INFO     max_train_steps (1080 / 2 / 1 * 7 * 1) = 3780
18:53:06-863815 INFO     stop_text_encoder_training = 0
18:53:06-864814 INFO     lr_warmup_steps = 0
18:53:06-866815 INFO     accelerate launch --num_cpu_threads_per_process=2 "train_network.py" --enable_bucket
                         --pretrained_model_name_or_path="G:/SDCache/diffusers/stable-diffusion-v1-5/v1-5-pruned.ckpt"
                         --train_data_dir="G:/LoraTraining/img" --resolution="512,512"
                         --output_dir="G:/LoraTraining/model" --logging_dir="G:/LoraTraining/log" --network_alpha="128"
                         --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.5 --unet_lr=1.0
                         --network_dim=128 --output_name="DrakeV1" --lr_scheduler_num_cycles="7" --learning_rate="1.0"
                         --lr_scheduler="constant" --train_batch_size="2" --max_train_steps="3780"
                         --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --seed="1234"
                         --caption_extension=".txt" --cache_latents --optimizer_type="DAdaptation" --optimizer_args
                         decouple=True weight_decay=0.01 betas=0.9,0.99 --max_data_loader_n_workers="0"
                         --max_token_length=225 --bucket_reso_steps=64 --min_snr_gamma=5 --xformers --bucket_no_upscale
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
prepare tokenizer
update token length: 225
Using DreamBooth method.
prepare images.
found directory G:\LoraTraining\img\10_N4 contains 108 image files
1080 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
  batch_size: 2
  resolution: (512, 512)
  enable_bucket: True
  min_bucket_reso: 256
  max_bucket_reso: 1024
  bucket_reso_steps: 64
  bucket_no_upscale: True

  [Subset 0 of Dataset 0]
    image_dir: "G:\LoraTraining\img\10_N4"
    image_count: 108
    num_repeats: 10
    shuffle_caption: False
    keep_tokens: 0
    caption_dropout_rate: 0.0
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    is_reg: False
    class_tokens: NFE4
    caption_extension: .txt


[Dataset 0]
loading image sizes.
100%|███████████████████████████████████████████████████████████████████████████████| 108/108 [00:00<00:00, 122.60it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (384, 512), count: 80
bucket 1: resolution (384, 576), count: 80
bucket 2: resolution (384, 640), count: 170
bucket 3: resolution (448, 448), count: 10
bucket 4: resolution (448, 512), count: 100
bucket 5: resolution (448, 576), count: 30
bucket 6: resolution (512, 384), count: 20
bucket 7: resolution (512, 512), count: 580
bucket 8: resolution (576, 384), count: 10
mean ar error (without repeats): 0.013095076889679085
preparing accelerator
G:\kohya_ss\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py:258: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use `project_dir` instead.
  warnings.warn(
Using accelerator 0.15.0 or above.
loading model for process 0/1
load StableDiffusion checkpoint: G:/SDCache/diffusers/stable-diffusion-v1-5/v1-5-pruned.ckpt
loading u-net: <All keys matched successfully>
loading vae: <All keys matched successfully>
loading text encoder: <All keys matched successfully>
CrossAttention.forward has been replaced to enable xformers.
import network module: networks.lora
[Dataset 0]
caching latents.
100%|████████████████████████████████████████████████████████████████████████████████| 108/108 [00:19<00:00,  5.60it/s]
create LoRA network. base dim (rank): 128, alpha: 128.0
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder: 72 modules.
create LoRA for U-Net: 192 modules.
enable LoRA for text encoder
enable LoRA for U-Net
preparing optimizer, data loader etc.
when multiple learning rates are specified with dadaptation (e.g. for Text Encoder and U-Net), only the first one will take effect / D-Adaptationで複数の学習率を指定した場合(Text EncoderとU-Netなど)、最初の学習率のみが有効になります: lr=0.5
use D-Adaptation AdamPreprint optimizer | {'decouple': True, 'weight_decay': 0.01, 'betas': (0.9, 0.99)}
Using decoupled weight decay
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 1080
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 540
  num epochs / epoch数: 7
  batch size per device / バッチサイズ: 2
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 3780
steps:   0%|                                                                                  | 0/3780 [00:00<?, ?it/s]
epoch 1/7
G:\kohya_ss\kohya_ss\venv\lib\site-packages\xformers\ops\fmha\flash.py:339: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ g:\kohya_ss\kohya_ss\train_network.py:864 in <module>                                            │
│                                                                                                  │
│   861 │   args = parser.parse_args()                                                             │
│   862 │   args = train_util.read_config_from_file(args, parser)                                  │
│   863 │                                                                                          │
│ ❱ 864 │   train(args)                                                                            │
│   865                                                                                            │
│                                                                                                  │
│ g:\kohya_ss\kohya_ss\train_network.py:679 in train                                               │
│                                                                                                  │
│   676 │   │   │   │   │   params_to_clip = network.get_trainable_params()                        │
│   677 │   │   │   │   │   accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)        │
│   678 │   │   │   │                                                                              │
│ ❱ 679 │   │   │   │   optimizer.step()                                                           │
│   680 │   │   │   │   lr_scheduler.step()                                                        │
│   681 │   │   │   │   optimizer.zero_grad(set_to_none=True)                                      │
│   682                                                                                            │
│                                                                                                  │
│ G:\kohya_ss\kohya_ss\venv\lib\site-packages\accelerate\optimizer.py:140 in step                  │
│                                                                                                  │
│   137 │   │   │   │   # If we reduced the loss scale, it means the optimizer step was skipped    │
│   138 │   │   │   │   self._is_overflow = scale_after < scale_before                             │
│   139 │   │   │   else:                                                                          │
│ ❱ 140 │   │   │   │   self.optimizer.step(closure)                                               │
│   141 │                                                                                          │
│   142 │   def _switch_parameters(self, parameters_map):                                          │
│   143 │   │   for param_group in self.optimizer.param_groups:                                    │
│                                                                                                  │
│ G:\kohya_ss\kohya_ss\venv\lib\site-packages\torch\optim\lr_scheduler.py:69 in wrapper            │
│                                                                                                  │
│     66 │   │   │   │   instance = instance_ref()                                                 │
│     67 │   │   │   │   instance._step_count += 1                                                 │
│     68 │   │   │   │   wrapped = func.__get__(instance, cls)                                     │
│ ❱   69 │   │   │   │   return wrapped(*args, **kwargs)                                           │
│     70 │   │   │                                                                                 │
│     71 │   │   │   # Note that the returned function here is no longer a bound method,           │
│     72 │   │   │   # so attributes like `__func__` and `__self__` no longer exist.               │
│                                                                                                  │
│ G:\kohya_ss\kohya_ss\venv\lib\site-packages\torch\optim\optimizer.py:280 in wrapper              │
│                                                                                                  │
│   277 │   │   │   │   │   │   │   raise RuntimeError(f"{func} must return None or a tuple of (   │
│   278 │   │   │   │   │   │   │   │   │   │   │      f"but got {result}.")                       │
│   279 │   │   │   │                                                                              │
│ ❱ 280 │   │   │   │   out = func(*args, **kwargs)                                                │
│   281 │   │   │   │   self._optimizer_step_code()                                                │
│   282 │   │   │   │                                                                              │
│   283 │   │   │   │   # call optimizer step post hooks                                           │
│                                                                                                  │
│ G:\kohya_ss\kohya_ss\venv\lib\site-packages\dadaptation\experimental\dadapt_adam_preprint.py:142 │
│ in step                                                                                          │
│                                                                                                  │
│   139 │   │   │   eps = group['eps']                                                             │
│   140 │   │   │                                                                                  │
│   141 │   │   │   if group_lr not in [lr, 0.0]:                                                  │
│ ❱ 142 │   │   │   │   raise RuntimeError(f"Setting different lr values in different parameter    │
│   143 │   │   │                                                                                  │
│   144 │   │   │   for p in group['params']:                                                      │
│   145 │   │   │   │   if p.grad is None:                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Setting different lr values in different parameter groups is only supported for values of 0
steps:   0%|                                                                                  | 0/3780 [00:03<?, ?it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ G:\program\Program Files\Python\Python310\lib\runpy.py:196 in _run_module_as_main                │
│                                                                                                  │
│   193 │   main_globals = sys.modules["__main__"].__dict__                                        │
│   194 │   if alter_argv:                                                                         │
│   195 │   │   sys.argv[0] = mod_spec.origin                                                      │
│ ❱ 196 │   return _run_code(code, main_globals, None,                                             │
│   197 │   │   │   │   │    "__main__", mod_spec)                                                 │
│   198                                                                                            │
│   199 def run_module(mod_name, init_globals=None,                                                │
│                                                                                                  │
│ G:\program\Program Files\Python\Python310\lib\runpy.py:86 in _run_code                           │
│                                                                                                  │
│    83 │   │   │   │   │      __loader__ = loader,                                                │
│    84 │   │   │   │   │      __package__ = pkg_name,                                             │
│    85 │   │   │   │   │      __spec__ = mod_spec)                                                │
│ ❱  86 │   exec(code, run_globals)                                                                │
│    87 │   return run_globals                                                                     │
│    88                                                                                            │
│    89 def _run_module_code(code, init_globals=None,                                              │
│                                                                                                  │
│ in <module>:7                                                                                    │
│                                                                                                  │
│   4 from accelerate.commands.accelerate_cli import main                                          │
│   5 if __name__ == '__main__':                                                                   │
│   6 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 7 │   sys.exit(main())                                                                         │
│   8                                                                                              │
│                                                                                                  │
│ G:\kohya_ss\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py:45 in main     │
│                                                                                                  │
│   42 │   │   exit(1)                                                                             │
│   43 │                                                                                           │
│   44 │   # Run                                                                                   │
│ ❱ 45 │   args.func(args)                                                                         │
│   46                                                                                             │
│   47                                                                                             │
│   48 if __name__ == "__main__":                                                                  │
│                                                                                                  │
│ G:\kohya_ss\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py:918 in launch_command  │
│                                                                                                  │
│   915 │   elif defaults is not None and defaults.compute_environment == ComputeEnvironment.AMA   │
│   916 │   │   sagemaker_launcher(defaults, args)                                                 │
│   917 │   else:                                                                                  │
│ ❱ 918 │   │   simple_launcher(args)                                                              │
│   919                                                                                            │
│   920                                                                                            │
│   921 def main():                                                                                │
│                                                                                                  │
│ G:\kohya_ss\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py:580 in simple_launcher │
│                                                                                                  │
│   577 │   process.wait()                                                                         │
│   578 │   if process.returncode != 0:                                                            │
│   579 │   │   if not args.quiet:                                                                 │
│ ❱ 580 │   │   │   raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)    │
│   581 │   │   else:                                                                              │
│   582 │   │   │   sys.exit(1)                                                                    │
│   583                                                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
CalledProcessError: Command '['G:\\kohya_ss\\kohya_ss\\venv\\Scripts\\python.exe', 'train_network.py',
'--enable_bucket', '--pretrained_model_name_or_path=G:/SDCache/diffusers/stable-diffusion-v1-5/v1-5-pruned.ckpt',
'--train_data_dir=G:/LoraTraining/img', '--resolution=512,512', '--output_dir=G:/LoraTraining/model',
'--logging_dir=G:/LoraTraining/log', '--network_alpha=128', '--save_model_as=safetensors',
'--network_module=networks.lora', '--text_encoder_lr=0.5', '--unet_lr=1.0', '--network_dim=128',
'--output_name=DrakeV1', '--lr_scheduler_num_cycles=7', '--learning_rate=1.0', '--lr_scheduler=constant',
'--train_batch_size=2', '--max_train_steps=3780', '--save_every_n_epochs=1', '--mixed_precision=bf16',
'--save_precision=bf16', '--seed=1234', '--caption_extension=.txt', '--cache_latents', '--optimizer_type=DAdaptation',
'--optimizer_args', 'decouple=True', 'weight_decay=0.01', 'betas=0.9,0.99', '--max_data_loader_n_workers=0',
'--max_token_length=225', '--bucket_reso_steps=64', '--min_snr_gamma=5', '--xformers', '--bucket_no_upscale']' returned
non-zero exit status 1.

bmaltais (Owner) commented

Make sure you follow the error recommendation: Setting different lr values in different parameter groups is only supported for values of 0

You can't set different LR unless you set it to 0.
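For context, the check that fires in dadaptation/experimental/dadapt_adam_preprint.py (visible in the traceback above) boils down to roughly the sketch below; the names are illustrative, not the library's verbatim code:

    # Illustrative sketch of the D-Adaptation lr constraint (not verbatim library code)
    lr = optimizer.param_groups[0]["lr"]      # the single lr that D-Adaptation scales
    for group in optimizer.param_groups:
        group_lr = group["lr"]
        if group_lr not in [lr, 0.0]:         # any other non-zero lr is rejected
            raise RuntimeError(
                "Setting different lr values in different parameter groups "
                "is only supported for values of 0"
            )

With --text_encoder_lr=0.5 and --unet_lr=1.0 from the launch command above, two parameter groups end up with different non-zero lrs, which is exactly what trips this check.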


aearone commented Jul 11, 2023

> Make sure you follow the error recommendation: Setting different lr values in different parameter groups is only supported for values of 0
>
> You can't set different LR unless you set it to 0.

Indeed, setting the Learning rate and Unet learning rate to 0 fixed the error and the training started successfully. But then the following message appeared: learning rate is too low. If using dadaptation, set learning rate around 1.0
And if I try to set it to 1.0, I get the error again. I don't understand the logic here.


aearone commented Jul 11, 2023

Setting the value to 1 for all LR types cleared the error. I realized that I should set the same value (1 or 0.5) for LR, UNet LR, and TE LR; then there is no error and training runs fine. I'm also not sure whether I should specify any value for LR at all, or whether it should only be set for UNet LR and TE LR. It is also possible to set all of them to 0; in that case the learning speed increases many times over, but I am not sure that gives good training quality.

I also found the information in this thread very useful; it helped me understand what values I should set for DAdaptation: #181
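For anyone who lands here with the same error, a hedged example of the relevant flags consistent with what worked in this thread (same value everywhere, as D-Adaptation expects; these are the same flags used in the launch command above):

    --learning_rate="1.0" --unet_lr="1.0" --text_encoder_lr="1.0"
    --optimizer_type="DAdaptation" --optimizer_args decouple=True weight_decay=0.01 betas=0.9,0.99

Whether the base --learning_rate is strictly needed alongside the per-module lrs is the open question from the comment above; keeping all three equal sidesteps the parameter-group check either way.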
