
Support for Dylora #24

Closed
sdbds opened this issue May 5, 2023 · 15 comments

@sdbds

sdbds commented May 5, 2023

Thank you for the great work!
I often use D-Adaptation for model training, but it seems to be ineffective with this algorithm:
DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation
https://arxiv.org/abs/2210.07558

All D-Adaptation runs keep the default d0 (e.g. 1e-6), and it never changes.
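
For context, here is a toy sketch (illustrative only, not the library's actual internals) of how D-Adaptation's distance estimate d is meant to ratchet up from its tiny initial d0. If d never rises above d0 (the default 1e-6), the effective step size d*lr stays near zero and training looks ineffective, which matches the symptom above:

```python
# Toy model of D-Adaptation's nondecreasing distance estimate d.
# Names and numbers are hypothetical, chosen to illustrate the mechanism.

def update_d(d, d_hat, growth_rate=float("inf")):
    # d only ever increases, and (optionally) by at most a factor of
    # growth_rate per step
    return min(max(d, d_hat), d * growth_rate)

d = 1e-6  # default initial d0
for d_hat in [5e-7, 2e-6, 1e-4, 5e-5]:  # hypothetical per-step estimates
    d = update_d(d, d_hat)

print(d)  # ratchets up to the largest estimate seen so far
```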

@adefazio
Contributor

adefazio commented May 5, 2023

We are always looking for examples where it doesn't work so that we can improve the algorithm. Is there a particular open source codebase you use?

@sdbds
Author

sdbds commented May 6, 2023

> We are always looking for examples where it doesn't work so that we can improve the algorithm. Is there a particular open source codebase you use?

I tried two DyLoRA implementations, and neither works.
Here is their open-source code:

https://github.com/kohya-ss/sd-scripts/blob/main/networks/dylora.py
[screenshot]

https://github.com/KohakuBlueleaf/LyCORIS/blob/main/lycoris/dylora.py
[screenshot]

@drhead

drhead commented May 8, 2023

I have also been using D-Adaptation extensively with the same codebase, using it to make Stable Diffusion LoRAs. I am mostly pleased with its results, but quite frequently find myself having to change the learn rate to something other than 1 or using a rather restrictive growth_rate to stop it from choosing a high learning rate that often ends up destroying quite a bit of the base model.

As an example, I am using DAdaptAdanIP. My best training run so far was when I used a growth rate of 1.06. This caused DAdaptation to settle on a d*lr close to 0.001 (typical for Adan, and produces results that are of good quality). If I don't set a growth rate, it settles on a d*lr of around 0.0027, which tends to be bad at prior preservation. Another thing I have noticed is that I have to do further manual learning rate adjustments when I adjust network rank/dim, which is something I would normally expect to be somewhat accounted for.
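
As a back-of-envelope check on why a growth_rate cap is so restrictive: with the cap, the estimate can grow at most geometrically, so climbing from the default d0 to an Adan-scale step size takes on the order of a hundred steps (numbers here are illustrative, taken from the values mentioned above):

```python
import math

# Under a growth_rate cap, D-Adaptation's estimate after n steps is at most
# d0 * growth_rate**n, so the minimum number of steps to reach a target d is:
d0, growth_rate, target = 1e-6, 1.06, 1e-3  # values from the discussion above
steps_needed = math.ceil(math.log(target / d0) / math.log(growth_rate))
print(steps_needed)  # roughly 120 warmup-like steps before d can reach 1e-3
```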

Please note that my problem includes the Dreambooth regularization technique described in https://arxiv.org/abs/2208.12242 , and I have also been using the min-SNR-gamma technique described in https://arxiv.org/pdf/2303.09556.pdf . I have had similar results in regards to being unable to rely on D-Adaptation's learning rate estimate without using the min-SNR-gamma technique, as well as when using D-Adaptation Adam.

If there is any further information you would like from me, any tests you'd like me to run, or if I seem to be misunderstanding the best practices for this optimizer, please reach out to me! The prospect of learning rate free learning for the problems I am working on is simply too appealing for me to not try my hardest to make it work.

@sdbds
Author

sdbds commented May 8, 2023

> I have also been using D-Adaptation extensively with the same codebase, using it to make Stable Diffusion LoRAs. I am mostly pleased with its results, but quite frequently find myself having to change the learn rate to something other than 1 or using a rather restrictive growth_rate to stop it from choosing a high learning rate that often ends up destroying quite a bit of the base model.
>
> As an example, I am using DAdaptAdanIP. My best training run so far was when I used a growth rate of 1.06. This caused DAdaptation to settle on a d*lr close to 0.001 (typical for Adan, and produces results that are of good quality). If I don't set a growth rate, it settles on a d*lr of around 0.0027, which tends to be bad at prior preservation. Another thing I have noticed is that I have to do further manual learning rate adjustments when I adjust network rank/dim, which is something I would normally expect to be somewhat accounted for.
>
> Please note that my problem includes the Dreambooth regularization technique described in https://arxiv.org/abs/2208.12242 , and I have also been using the min-SNR-gamma technique described in https://arxiv.org/pdf/2303.09556.pdf . I have had similar results in regards to being unable to rely on D-Adaptation's learning rate estimate without using the min-SNR-gamma technique, as well as when using D-Adaptation Adam.
>
> If there is any further information you would like from me, any tests you'd like me to run, or if I seem to be misunderstanding the best practices for this optimizer, please reach out to me! The prospect of learning rate free learning for the problems I am working on is simply too appealing for me to not try my hardest to make it work.

Thank you for sharing this useful experience with LoRA training.
Which social media platforms do you use frequently? Discord, or others? I'd like to discuss some related experiences with you.
For instance, I've recently been testing the latest training parameters, such as multi noise, and have achieved quite impressive results.

@adefazio
Contributor

adefazio commented May 8, 2023

Thanks for the detailed information, I will investigate. The NeurIPS submission deadline is approaching (May 17th) so I probably won't have time to investigate until then.

One thing you can try is the Adam version in the v3 pull request I've put up. It's a major change to the method and could help in your setting. It can give LR values about ~1/2 as big on some unstable problems, which could be exactly what you need. I ran it through the full suite of experiments in my paper and it works as well as or better than the previous Adam version on everything.

DAdaptAdan is very experimental and I haven't experimented enough with it yet to trust it.

@sdbds
Author

sdbds commented May 8, 2023

> Thanks for the detailed information, I will investigate. The NeurIPS submission deadline is approaching (May 17th) so I probably won't have time to investigate until then.
>
> One thing you can try is the Adam version in the v3 pull request I've put up. It's a major change to the method and could help in your setting. It can give LR values about ~1/2 as big on some unstable problems, which could be exactly what you need. I ran it through the full suite of experiments in my paper and it works as well as or better than the previous Adam version on everything.
>
> DAdaptAdan is very experimental and I haven't experimented enough with it yet to trust it.

Alright, I will patiently await your update. In the meantime, I'll give DAdaptLion and the new Adam method a try.

@drhead

drhead commented May 8, 2023

Thanks for the quick response -- I have just had a chance to compare how the v3 Adam performs compared to the old version. I have run both of them on decouple=True and weight_decay=0.01, and with all parameters and the seed being the same across runs. From this first run I am finding that both optimizers are choosing similar learn rates, with the new Adam settling on 0.0002099 while the old one settled on 0.0002012. Interestingly, despite the difference in learning rate being relatively minor, the results are dramatically different, with the new Adam slightly overfitting, and the old Adam slightly underfitting. Neither are wrongly fit to an unusable degree. Unfortunately, what constitutes "success" on this problem is highly subjective, so it is hard for me to say for certain whether this is an improvement or not.

I will have to perform more tests to see if I can find a good configuration for the new Adam implementation without manual learning rate adjustments and try to test it on other datasets, because it looks promising, and I will be sure to compare with the old Adam implementation to see what has changed. I will hopefully have something by the time you are able to look into this further, and wish you luck with your NeurIPS submission.

@sdbds
Author

sdbds commented May 9, 2023

> Thanks for the quick response -- I have just had a chance to compare how the v3 Adam performs compared to the old version. I have run both of them on decouple=True and weight_decay=0.01, and with all parameters and the seed being the same across runs. From this first run I am finding that both optimizers are choosing similar learn rates, with the new Adam settling on 0.0002099 while the old one settled on 0.0002012. Interestingly, despite the difference in learning rate being relatively minor, the results are dramatically different, with the new Adam slightly overfitting, and the old Adam slightly underfitting. Neither are wrongly fit to an unusable degree. Unfortunately, what constitutes "success" on this problem is highly subjective, so it is hard for me to say for certain whether this is an improvement or not.
>
> I will have to perform more tests to see if I can find a good configuration for the new Adam implementation without manual learning rate adjustments and try to test it on other datasets, because it looks promising, and I will be sure to compare with the old Adam implementation to see what has changed. I will hopefully have something by the time you are able to look into this further, and wish you luck with your NeurIPS submission.

In my opinion, Dreambooth regularization has a significant impact on the final results and the loss; judging by image results alone may not be very accurate.

@konstmish

Thank you very much for sharing the feedback! We are still working on improving the method and the challenging instances are especially useful to us. One thing that could be helpful is the description of the training setting that you have, including the dataset and the batch size. If you could share a training script that includes these things together with any regularization that you use, that would be amazing.

Our new version of Adam has a different estimation of D, which should be somewhat smoother and more stable. However, we are also working on other variants, so having examples where D-adaptation fails is extremely valuable.

@drhead

drhead commented May 10, 2023

I have done some experiments on the dog dataset from the original Dreambooth paper and have found that it exhibits the problems much more clearly. A known working configuration for AdamW8bit was to generate 200 regularization images, use a network dim of 128, run for 800 steps, use a learning rate of 4e-6 for both the unet and text encoder, and use a cosine LR scheduler -- this deviates from some examples but is much closer to my normal training conditions. Doubling the learning rate for the unet to 8e-6 also works and seems to produce slightly better results, and is closer to my typical use cases -- training the text encoder at half the rate of the unet produces better results in most cases in my experience and others', and I have also run into a few circumstances where I have gotten better results by controlling the learning rate and dims of individual blocks of the network, which D-Adaptation unfortunately doesn't seem to support, at least as it is implemented in the scripts I am using right now.

When using DAdaptation with the learning rates set to 1.0 and the arguments "decouple=True" "weight_decay=0.01" to match AdamW, a learning rate of around 5.22e-5 is selected, which quickly overfits the model. It does achieve a much lower loss score. Lowering beta2 to 0.99 (as I have seen commonly recommended by other people attempting this) produces much better results in terms of prior preservation but still shows signs of overtraining such as highly saturated "deep fried" images, lack of editability of the trained concept, and too much resemblance to training images. Lowering beta2 even further progressively lowers the learning rate chosen and increases loss, but produces much better results (my sweep included 0.99, 0.98, 0.95, 0.92, and 0.901).

D-Adaptation definitely responds the wrong way to changes in dimensions of a LoRA model. Adjusting the network dim from 128 to 64 causes D-Adaptation to double its learning rate estimate in response. Normally, changes in network dim require a proportional change to learning rate, but D-Adaptation responds inversely proportional. Most interestingly, the learning rate estimate is close to reasonable near the maximum possible network dim (768) but it still overtrains quickly. The net effect of decreasing network dim is different depending on the beta2 value -- at the default of 0.999, the model lost all editability and learned no distinction of the trained subject. At beta2 of 0.901, I got much better result at network dim of 64 than at 128.

Overall, I was able to achieve results that were better in most respects with D-Adaptation with tuning of some hyperparameters, but the optimizer responds in very counterintuitive ways at some times and it is overall unclear which hyperparameters should be tuned to best take advantage of the adaptive optimizer.
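
One rule-of-thumb way to reason about why the beta2 sweep above changes behavior so dramatically: beta2 sets the effective averaging horizon of Adam's second-moment EMA, roughly 1/(1 - beta2) steps, so moving from 0.999 to 0.901 shrinks the window from about 1000 steps to about 10 (a generic Adam heuristic, not anything D-Adaptation-specific):

```python
# Rule-of-thumb effective window of an EMA with decay beta2:
# about 1 / (1 - beta2) steps.
def ema_horizon(beta2):
    return 1.0 / (1.0 - beta2)

# The sweep values from the comment above:
for beta2 in [0.999, 0.99, 0.98, 0.95, 0.92, 0.901]:
    print(f"beta2={beta2}: ~{ema_horizon(beta2):.0f}-step window")
```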

I have attached a set of test scripts that I use as well as the dog dataset in this zipfile. Since the regularization images take up quite a bit of space, I have not included them, and you will have to generate them yourself (I have included a script that will generate them for you in the correct location).

I have been using this script with bmaltais' fork of kohya_ss on Linux, which has install instructions here: https://github.com/bmaltais/kohya_ss#linux-and-macos

You will need the Stable Diffusion v1.5 checkpoint located here: https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors (this is a safetensors version and my script points to the .ckpt version, be sure to correct that)

If you unzip this file in the root directory of the kohya_ss repo, you shouldn't have to edit any paths other than the path to the Stable Diffusion base model (which will have to be added to both scripts), and you should be able to simply run ./make_dog_class_imgs.sh once and run ./testlora.sh which is currently configured to use DAdaptAdam with decoupled weight decay. Let me know if there are any issues getting my script to run or if you need any more testing on my end.

test_scripts.zip

@konstmish

Thanks for the scripts and the detailed feedback! Here are some of my thoughts:

  1. You can try using larger values of weight decay, especially when D-Adaptation gives small values of learning rate. Weight decay is multiplied by the learning rate, which might lead to overfitting if the estimated learning rate is small. We might look into this a bit more later and test if that's a good choice.
  2. Although this would take some major rewriting, there is nothing that prevents us from allowing the optimizer to use different learning rate estimates for different parameter groups. The drawback of this would be potentially smaller learning rates, since D-Adaptation estimates the maximum distance over all coordinates.
  3. Tuning D-Adaptation is indeed nontrivial because its behavior depends on the observed gradients, which are unpredictable. There is no obvious solution to that. D-Adapted Adagrad seems to be more reliable in that regard, but it's also very pessimistic and seems to perform worse overall. Perhaps a variant between D-Adagrad and D-Adam would work well, but we haven't explored that direction.
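
To make point 1 concrete, here is a minimal sketch (illustrative only) of decoupled, AdamW-style weight decay as it interacts with an adapted step size: the per-step shrink applied to each weight is d*lr*weight_decay, so a small estimated d also weakens regularization proportionally, which is why a larger weight_decay may be needed when the estimated learning rate is small:

```python
# Decoupled weight decay step: w <- w * (1 - effective_lr * weight_decay),
# where for D-Adaptation the effective LR is d * lr. Sketch only; the numbers
# below are hypothetical d estimates, not library defaults.
def decay_factor(d, lr=1.0, weight_decay=0.01):
    return 1.0 - d * lr * weight_decay

# A 100x smaller d estimate makes the per-step decay 100x weaker:
print(decay_factor(5e-3))  # 1 - 5e-5  = 0.99995
print(decay_factor(5e-5))  # 1 - 5e-7  = 0.9999995
```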

@konstmish

We have made some progress on the method and released a new version called Prodigy. I haven't tested it on the dog dataset, so I'm not sure if it'd help. All in all, seems like we are not done yet with finding the right method that works for all applications, but we're still working on that.

@sdbds
Author

sdbds commented Jun 12, 2023

> We have made some progress on the method and released a new version called Prodigy. I haven't tested it on the dog dataset, so I'm not sure if it'd help. All in all, seems like we are not done yet with finding the right method that works for all applications, but we're still working on that.

Thank you for the great work! I will test it when I have free time.

@sdbds
Author

sdbds commented Jun 12, 2023

> We have made some progress on the method and released a new version called Prodigy. I haven't tested it on the dog dataset, so I'm not sure if it'd help. All in all, seems like we are not done yet with finding the right method that works for all applications, but we're still working on that.

Thank you! It works well!

@sdbds
Author

sdbds commented Jun 12, 2023

[screenshot]

I will fine-tune it. Thank you again for your hard work!
