
training speed #3

Closed
linjing7 opened this issue Oct 2, 2022 · 2 comments
Comments


linjing7 commented Oct 2, 2022

Hi, thanks for your excellent work. I found that training is quite slow, and data loading seems to be the bottleneck. Could you tell me how long it takes to train a model? Also, what values of CPUS_PER_TASK and workers_per_gpu do you use when training with SLURM? Are there any other measures you used to speed up the training procedure?


smplbody (Owner) commented Oct 3, 2022

@linjing7 Each experiment took roughly 2 days on 8 GPUs. I have also experienced the long data loading bottleneck, which seems to be caused by recent upgrades in the mmhuman3d pipeline. I resolved it by training with cache.

I have added an example config file here: 34641c6. Before training, create an empty data/cache folder; the cache files will be generated automatically during training.
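For reference, a minimal sketch of what a cache-enabled dataset entry might look like in an mmhuman3d-style config. The field names here (dataset name, annotation file, and especially `cache_data_path`) are illustrative assumptions; the committed example config in 34641c6 is the authoritative version.

```python
# Sketch of a cache-enabled dataset entry in an mmhuman3d-style config.
# The cache path points inside the empty data/cache folder created before
# training; the cache is written on first use and re-used afterwards.
train_pipeline = []  # the real transform list is defined earlier in the config

data = dict(
    samples_per_gpu=64,
    workers_per_gpu=8,
    train=dict(
        type='HumanImageDataset',
        dataset_name='h36m',                # illustrative dataset choice
        data_prefix='data',
        ann_file='h36m_mosh_train.npz',     # illustrative annotation file
        pipeline=train_pipeline,
        cache_data_path='data/cache/h36m_train_cache.npz',  # assumed argument name
    ),
)
```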

  • workers_per_gpu in the config is the number of dataloader workers that pre-fetch data for each GPU. It needs to match your available CPU cores.
  • CPUS_PER_TASK in the slurm script is the number of CPUs allocated per task (see the sketch below for how the two settings relate).
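A rough way to relate the two settings. This is an assumption about how dataloader workers map onto CPU cores (one core per worker process), not something enforced by the repo; the numbers below are only an example.

```python
# Each dataloader worker is a separate CPU process, so the SLURM allocation
# per node should cover all workers across the GPUs on that node.
gpus_per_node = 8
workers_per_gpu = 4                    # data = dict(workers_per_gpu=4, ...) in the config
cpus_per_task = workers_per_gpu + 1    # e.g. CPUS_PER_TASK=5; +1 for the main process
cores_needed_per_node = gpus_per_node * cpus_per_task
print(cores_needed_per_node)           # 40 CPU cores for an 8-GPU node in this example
```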


linjing7 (Author) commented Oct 3, 2022

Thank you very much. The training speeds up noticeably when I train with cache.

linjing7 closed this as completed Oct 3, 2022