
training speed #3

Closed
linjing7 opened this issue Oct 2, 2022 · 2 comments
Comments


linjing7 commented Oct 2, 2022

Hi, thanks for your excellent work. I found that training is quite slow, and data loading seems to be the bottleneck. Could you tell me how long it takes to train a model? Also, what values of CPUS_PER_TASK and workers_per_gpu do you use when training with SLURM? Are there any other measures you used to speed up the training procedure?


smplbody (Owner) commented Oct 3, 2022

@linjing7 Each experiment took roughly 2 days on 8 GPUs. I have also experienced the long data loading bottleneck, which seems to be caused by recent upgrades in the mmhuman3d pipeline. I resolved it by training with cache.

I have added an example config file here: 34641c6. Before training, create an empty data/cache folder; the cache files will be generated automatically during training.
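For reference, a minimal sketch of what a cache-enabled dataset entry might look like in an mmhuman3d-style config. The field names here (dataset name, annotation file, and especially `cache_data_path`) are illustrative assumptions; the committed example config in 34641c6 is the authoritative version.

```python
# Sketch of a cache-enabled dataset entry in an mmhuman3d-style config.
# The cache path points inside the empty data/cache folder created before
# training; the cache is written on first use and re-used afterwards.
train_pipeline = []  # the real transform list is defined earlier in the config

data = dict(
    samples_per_gpu=64,
    workers_per_gpu=8,
    train=dict(
        type='HumanImageDataset',
        dataset_name='h36m',                # illustrative dataset choice
        data_prefix='data',
        ann_file='h36m_mosh_train.npz',     # illustrative annotation file
        pipeline=train_pipeline,
        cache_data_path='data/cache/h36m_train_cache.npz',  # assumed argument name
    ),
)
```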

  • workers_per_gpu in the config is the number of dataloader workers that pre-fetch data for each GPU. It needs to match your available CPU cores.
  • CPUS_PER_TASK in the slurm script is the number of CPUs allocated per task (see the sketch below for how the two settings relate).
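A rough way to relate the two settings. This is an assumption about how dataloader workers map onto CPU cores (one core per worker process), not something enforced by the repo; the numbers below are only an example.

```python
# Each dataloader worker is a separate CPU process, so the SLURM allocation
# per node should cover all workers across the GPUs on that node.
gpus_per_node = 8
workers_per_gpu = 4                    # data = dict(workers_per_gpu=4, ...) in the config
cpus_per_task = workers_per_gpu + 1    # e.g. CPUS_PER_TASK=5; +1 for the main process
cores_needed_per_node = gpus_per_node * cpus_per_task
print(cores_needed_per_node)           # 40 CPU cores for an 8-GPU node in this example
```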


linjing7 (Author) commented Oct 3, 2022

Thank you very much. The training speeds up noticeably when I train with cache.

linjing7 closed this as completed Oct 3, 2022