
How long does it take to train? #4

Open
qsh-zh opened this issue Feb 24, 2021 · 24 comments

Comments
@qsh-zh

qsh-zh commented Feb 24, 2021

Thanks for sharing this clean implementation.

I tried it on the CelebA dataset. After 150k steps, the generated images are not as good as those claimed in the paper or the flowers shown in the README.

Is it something to do with the dataset, or do I need more training time?

[image: generated samples after 150k steps]

@ariel415el

Hi, I'm also trying to train this repo.
What image resolution are you using?
In the paper (Appendix B) they say they trained 256x256 CelebA-HQ for 500k steps with a batch size of 64.
Did your loss plateau or is it still decreasing?
And by the way, how long did it take to train those 150k steps, and with what batch size?

@IceClear

Similar results after 145k steps on CIFAR. I wonder if it is harder to train than a GAN, or if it is just not stable enough yet...

@qsh-zh
Author

qsh-zh commented May 29, 2021

@ariel415el The loss plateaued for the figure I showed, if my memory serves me well. I forget some details of the experiment; it ran for roughly 36-48 hours on one 2080Ti. Batch size was 32 with fp16, U-Net dim 64.
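
For anyone trying to reproduce that run, here is a minimal sketch of the setup described above, assuming the denoising-diffusion-pytorch API (the paths are placeholders, the learning rate is not stated above, and argument names such as `amp` vs `fp16` differ between versions of the library):

```python
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(
    dim = 64,                 # "unet dim 64"
    dim_mults = (1, 2, 4, 8)
)

diffusion = GaussianDiffusion(
    model,
    image_size = 128,         # assumed CelebA crop size; not stated above
    timesteps = 1000
)

trainer = Trainer(
    diffusion,
    'path/to/celeba',         # placeholder image folder
    train_batch_size = 32,    # batch size 32
    train_lr = 1e-4,          # assumed; value taken from a later comment in this thread
    train_num_steps = 150000,
    gradient_accumulate_every = 2,
    ema_decay = 0.995,
    amp = True                # mixed precision ("fp16")
)

trainer.train()
```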

@qsh-zh
Author

qsh-zh commented May 29, 2021

@IceClear Do you mind sharing some sample images?

@IceClear

> @IceClear Do you mind sharing some sample images?

Sure, here it is (sample 186) after 186k steps.
[image: sample-186]

@Smith42

Smith42 commented Jun 12, 2021

I've been training using this repo and am getting (very) good results on 256x256 images after around 800,000 global steps (batch size 16). Score-based models are known to take more compute to train than a comparable GAN, so perhaps more training time is required in your cases?

@ariel415el

Thanks @Smith42,
The thing is, for both me and @qsh-zh the training loss plateaus, so I'm not sure how more steps would help. Did your loss continue decreasing throughout training?
Can you share some of your result images here so that we know what to expect?
BTW, how long did you train the model? I guess it was more than 2 days.

@Smith42

Smith42 commented Jun 14, 2021

> Can you share some of your result images here so that we know what to expect?

@ariel415el Unfortunately I can't share the results just yet, but I should have a preprint out soon that I can share.

> The thing is, for both me and @qsh-zh the training loss plateaus, so I'm not sure how more steps would help. Did your loss continue decreasing throughout training?

The loss didn't seem to plateau for me until very late in the training cycle, but this is with training on a dataset with on the order of 10^6 examples.

> BTW, how long did you train the model? I guess it was more than 2 days.

On a single V100 it took around 2 weeks of training.

@qsh-zh
Author

qsh-zh commented Jun 14, 2021

@IceClear @ariel415el This is the FID curve on CIFAR-10 for 1k sampled images.
[image: FID curve on CIFAR-10]
Step 26 in the figure corresponds to 108,000 global steps. With 50k samples, the FID is 15.13.
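
For reference, here is a rough sketch of how an FID like this can be computed over 1k samples with torchmetrics; this is not necessarily the script used for the curve above, and the random tensors are stand-ins for real CIFAR-10 images and model samples:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Stand-ins: replace with real CIFAR-10 images and 1k samples from the EMA model,
# both as uint8 tensors of shape (N, 3, H, W).
real_images = torch.randint(0, 256, (1000, 3, 32, 32), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (1000, 3, 32, 32), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute().item())
```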

@Sumching

The image size is 256, the batch size is 32, and after 480k steps the results still do not look good.
[image: generated samples after 480k steps]

@gwang-kim

gwang-kim commented Sep 8, 2021

@Sumching @qsh-zh @IceClear @ariel415el @Smith42 How low are your training losses? In my case, the noise-prediction losses are in the hundreds to thousands. Is this right?

@Smith42

Smith42 commented Sep 23, 2021

> @Sumching @qsh-zh @IceClear @ariel415el @Smith42 How low are your training losses? In my case, the noise-prediction losses are in the hundreds to thousands. Is this right?

That's way too high; I'm getting below 0.1 once fully trained. Have you checked your normalisations?
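
A quick way to check: with torchvision's `ToTensor` the training images should come out as floats in [0, 1]; a loss in the hundreds usually means the inputs are still on a 0-255 scale. A minimal sketch, with a placeholder image folder:

```python
from torchvision import datasets, transforms

ds = datasets.ImageFolder(
    'path/to/images',                # placeholder; expects class subfolders of images
    transform=transforms.ToTensor()  # scales uint8 pixels to floats in [0, 1]
)

x, _ = ds[0]
print(x.dtype, x.min().item(), x.max().item())  # expect float32, roughly 0.0 .. 1.0
assert x.max() <= 1.0 + 1e-6, "images are not normalised to [0, 1]"
```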

@jiangxiluning

@Smith42
Hi, I trained it on CIFAR-10. The batch size is 16 and the image size is 128. The loss is about 0.05, but the generated images seem blurred.

@Smith42

Smith42 commented Mar 8, 2022

> @Smith42 Hi, I trained it on CIFAR-10. The batch size is 16 and the image size is 128. The loss is about 0.05, but the generated images seem blurred.

I use a fork of Phil's code in my paper and am not getting blurring problems. Maybe there is something up with your hyperparameters?

@cajoek

cajoek commented Mar 18, 2022

Hi @Smith42 & @jiangxiluning, when you say you get a loss below 0.1, are you using an L1 or an L2 loss?

@jiangxiluning

@cajoek for me, it is L1.

@Smith42

Smith42 commented Mar 22, 2022

L1 for me too
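
Side note: L1 and L2 values are not directly comparable in scale, since one averages absolute errors and the other squared errors. A tiny sketch of the noise-prediction loss under both norms, with stand-in tensors:

```python
import torch
import torch.nn.functional as F

noise = torch.randn(8, 3, 32, 32)              # the epsilon added in the forward process
pred  = noise + 0.1 * torch.randn_like(noise)  # stand-in for the network's prediction

print('L1:', F.l1_loss(pred, noise).item())    # L1 objective
print('L2:', F.mse_loss(pred, noise).item())   # L2 objective
```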

@cajoek

cajoek commented Mar 22, 2022

Thanks @jiangxiluning @Smith42!

My loss unfortunately plateaus at about 0.10-0.15, so I plotted the mean L1 loss over one epoch versus the timestep t and noticed that the loss stays quite high for low values of t, as can be seen in this figure. Do you know if that is expected?
[image: L1 loss vs timestep t]
(L1 loss vs timestep t after many epochs on a small dataset. Convergence has not quite been reached yet.)
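
One way to produce a plot like the one above is to evaluate the L1 noise-prediction loss at fixed timesteps instead of sampling t uniformly. A sketch, assuming access to a DDPM-style forward-noising function `q_sample(x0, t, noise)` and a noise network `eps_model(x_t, t)`; both names are illustrative rather than this repo's exact API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def l1_loss_vs_timestep(eps_model, q_sample, x0, num_timesteps=1000, step=20):
    """Return (timesteps, mean L1 loss) for a batch x0 at evenly spaced t."""
    ts, losses = [], []
    for t_val in range(0, num_timesteps, step):
        t = torch.full((x0.shape[0],), t_val, dtype=torch.long, device=x0.device)
        noise = torch.randn_like(x0)
        x_t = q_sample(x0, t, noise)      # forward-noise x0 to step t
        pred = eps_model(x_t, t)          # predicted noise at step t
        ts.append(t_val)
        losses.append(F.l1_loss(pred, noise).item())
    return ts, losses
```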

@malekinho8

@Smith42 Would you be able to show some samples/results from training your CelebA model? It seems that a lot of other people are struggling to reproduce the results shown in the paper.

@Smith42

Smith42 commented Jun 5, 2022

> @Smith42 Would you be able to show some samples/results from training your CelebA model? It seems that a lot of other people are struggling to reproduce the results shown in the paper.

@malekinho8 I ran a fork of lucidrains' model on a large galaxy image data set here, not on CelebA. However, the galaxy imagery is well replicated with this codebase, so I expect it will work okay on CelebA too.

@DushyantSahoo

@jiangxiluning Can you please share your code? I am also training on CIFAR-10 and the loss does not go below 0.7. Below are my model and trainer:

```python
model = Unet(
    dim = 16,
    dim_mults = (1, 2, 4)
)

trainer = Trainer(
    diffusion,                     # diffusion model wrapping `model`, defined elsewhere
    new_train,                     # training data, defined elsewhere
    train_batch_size = 32,
    train_lr = 1e-4,
    train_num_steps = 500000,      # total training steps
    gradient_accumulate_every = 2, # gradient accumulation steps
    ema_decay = 0.995,             # exponential moving average decay
    amp = True                     # turn on mixed precision
)
```

@greens007

> @jiangxiluning Can you please share your code? I am also training on CIFAR-10 and the loss does not go below 0.7.

Hi, I got the same problem on CIFAR-10. The model still generates failed images even after 150k steps. Did you succeed?

@yiyixuxu

Hi, CIFAR-10 contains tiny 32x32 pictures - they are naturally going to look blurry if you resize them to 128x128.
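
If you want to try CIFAR-10 at its native resolution instead, the main change to the snippets above is the image size; a sketch under the same assumptions about the library API, with a placeholder data path:

```python
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

model = Unet(dim = 64, dim_mults = (1, 2, 4))

diffusion = GaussianDiffusion(
    model,
    image_size = 32,           # keep CIFAR-10 at 32x32 rather than upsampling
    timesteps = 1000
)

trainer = Trainer(
    diffusion,
    'path/to/cifar10_images',  # placeholder: CIFAR-10 exported as image files
    train_batch_size = 32,
    train_lr = 1e-4,
    train_num_steps = 500000,
    gradient_accumulate_every = 2,
    ema_decay = 0.995,
    amp = True
)
```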

@177488ZL

> Thanks for sharing this clean implementation.
>
> I tried it on the CelebA dataset. After 150k steps, the generated images are not as good as those claimed in the paper or the flowers shown in the README.
>
> Is it something to do with the dataset, or do I need more training time?

Excuse me, did you modify the code or parameters during training, or load a pre-trained weight file? The loss drops to NaN during my training.
