
prepare.py slow speed #18

Open
danpovey opened this issue Nov 16, 2020 · 12 comments

Comments

@danpovey
Contributor

I have been running some experiments to debug why prepare.py is so slow.
I'll post the results here in a bit, but let me just give you a preview of why I believe it's so slow.
By default, torch seems to use a lot of threads, but it does so in a very inefficient way which actually just wastes CPU.
We attempt to get around this with:

  torch.set_num_threads(1)
  torch.set_num_interop_threads(1)

but now that we are using "spawn" as the parallelization method and the main() is guarded by:

if __name__ == '__main__':
    main()

the spawned worker processes re-import the script without executing main(), so that code is never reached in them.
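For illustration, here is a minimal, self-contained sketch (hypothetical, not the actual prepare.py) of why the guarded code has no effect on the workers: with "spawn", each worker re-imports the module, so only module-level code runs in it.

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import torch

# Module level: spawned workers re-import this module, so code placed here
# DOES run in every worker.
# torch.set_num_threads(1)   # placing the call here is what the fix amounts to

def worker_threads(_):
    # Report how many intra-op threads torch will use in this worker.
    return torch.get_num_threads()

if __name__ == '__main__':
    # Only the parent process executes this block; spawned workers never do,
    # so a set_num_threads(1) call placed here leaves them at the default.
    torch.set_num_threads(1)
    executor = ProcessPoolExecutor(
        2, mp_context=multiprocessing.get_context("spawn"))
    print("parent:", torch.get_num_threads())                        # 1
    print("workers:", list(executor.map(worker_threads, range(2))))  # defaults, e.g. [24, 24]
    executor.shutdown()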

@danpovey
Contributor Author

Here are my notes from investigating the time to prepare dev-clean, including writing llc files.
The script prepare.py was modified to skip the prepare_librispeech step and just load the relevant manifest from
disk, so the time is just the time taken in cut_set.compute_and_store_features().

Below I show various setups. TL;DR: the following statements:

torch.set_num_threads(1)
torch.set_num_interop_threads(1)

need to be moved to OUTSIDE OF main() for a 100-fold speed improvement (see the sketch after the timings below).
This should be done regardless of the num-jobs or executor setup.

with executor = None; torch.set_num_threads(1); torch.set_num_interop_threads(1):
    real 1m17, user 0m51, sys 0m22

with executor = None and no torch.set_* calls, which means it uses about 25 CPUs:
    real 1m48, user 45m45, sys 1m6

with executor = ProcessPoolExecutor(num_jobs, mp_context=multiprocessing.get_context("spawn")),
  [inside main(): torch.set_num_threads(1); torch.set_num_interop_threads(1)]; num_jobs = 1:
    uses about 25 CPUs within the single job:
    real 1m35, user 39m30, sys 1m16

with executor = ProcessPoolExecutor(num_jobs, mp_context=multiprocessing.get_context("spawn")),
  [inside main(): torch.set_num_threads(10); torch.set_num_interop_threads(1)]; num_jobs = 10:
    each worker process uses about 500% CPU:
    real 10m16.768s, user 299m8.986s, sys 2m26.761s

with executor = ProcessPoolExecutor(num_jobs, mp_context=multiprocessing.get_context("spawn")),
  [outside main(): torch.set_num_threads(10); torch.set_num_interop_threads(1)]; num_jobs = 10:
    each worker process uses about 500% CPU:
    real 0m13.860s, user 1m8.146s, sys 0m44.639s
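For concreteness, here is a sketch of the layout the TL;DR describes (an assumption about how prepare.py would be restructured, not its actual contents):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import torch

# Outside main(): executed by the parent and, crucially, by every spawned
# worker when it re-imports this script.
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

num_jobs = 10  # illustrative value

def main():
    executor = ProcessPoolExecutor(
        num_jobs, mp_context=multiprocessing.get_context("spawn"))
    # ... cut_set.compute_and_store_features(...) would be run here,
    # passing this executor ...

if __name__ == '__main__':
    main()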

@danpovey
Contributor Author

Also notice there is about a factor of 300 difference in speed between the last two runs, if you look at the user time taken (i.e. the total across all CPUs).

@jimbozhang
Contributor

Let me try it following your suggestion and make a PR to fix it.

@jimbozhang
Contributor

jimbozhang commented Nov 16, 2020

Yes, it is much faster! 👍

Use 48 cores on ('dev-clean', 'test-clean'):

real    0m47.141s
user    3m11.562s
sys     1m39.085s

Use 72 cores on ('dev-clean', 'test-clean', 'train-clean-100'):

real    6m20.349s
user    32m31.692s
sys     7m57.317s

It is slower than Dan's result; I think that is caused by the NFS latency of my home directory.

@pzelasko
Collaborator

🎉

@mthrok

mthrok commented Nov 20, 2020

I have disabled OpenMP support for libsox on the torchaudio side (pytorch/audio#1026). It will be available through the nightly binary distribution starting tomorrow. (Please check pytorch.org for how to install the nightly build.)

This should remove the suspected cause of the crashes (GNU OpenMP and Intel OpenMP co-existing and interfering), and will allow you to go back to the fork method without worrying about the OMP_NUM_THREADS configuration (or torch.set_num_threads).

We have a minor release scheduled in early December, so we would appreciate it if you could test it and confirm whether or not it resolves the issue. pytorch/audio#1021

@danpovey
Contributor Author

@jimbozhang would you mind testing this? (I think what is required is testing that the speed is OK after you revert "spawn" back to "fork" and remove the stuff about setting the torch num_threads.)
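Concretely, reverting to the fork method presumably amounts to something like the following sketch (num_jobs is illustrative, not the value used in prepare.py):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

num_jobs = 10  # illustrative value

# With libsox's OpenMP support disabled in the torchaudio nightlies, the
# "fork" start method should be safe again, and the torch.set_num_threads /
# OMP_NUM_THREADS workarounds should no longer be needed.
executor = ProcessPoolExecutor(
    num_jobs, mp_context=multiprocessing.get_context("fork"))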

@jimbozhang
Contributor

jimbozhang commented Nov 21, 2020

@jimbozhang would you mind testing this? (I think what is required is testing that the speed is OK after you revert "spawn" back to "fork" and remove the stuff about setting the torch num_threads.)

OK, I'll work on this (and the k2 NVTX stuff) next Monday. @danpovey

@jimbozhang
Contributor

jimbozhang commented Nov 24, 2020

With the latest versions of k2 + lhotse + snowfall, under the path snowfall/egs/librispeech/asr/simple_v1:

$ time python prepare.py
dev-clean
Extracting and storing features in parallel: 100%|_________| 2703/2703 [00:07<00:00, 364.28it/s]

test-clean
Extracting and storing features in parallel: 100%|_________| 2620/2620 [00:07<00:00, 342.19it/s]

train-clean-100
Extracting and storing features in parallel: 100%|_________| 28539/28539 [01:26<00:00, 331.35it/s]

real    3m41.975s
user    33m55.149s
sys     5m25.221s

I didn't modify the job number setting, so num_jobs = min(15, os.cpu_count()) should be 15.

@danpovey
Contributor Author

danpovey commented Nov 24, 2020 via email

@jimbozhang
Contributor

Ah, apologies, I misunderstood. I didn't notice mthrok's comment at all 😓 and just tested the latest snowfall. Sorry about that ...

I'll try torchaudio with the fork method tomorrow.

@jimbozhang
Contributor

I tested it. The snowfall/egs/librispeech/asr/simple_v1/prepare.py worked fine when I just replaced "spawn" with "fork", with torchaudio (0.8.0.dev20201125) + torch (1.8.0.dev20201126+cu101).

But after I removed the torch.set_num_threads settings, it became extremely slow, same as before. @mthrok @danpovey
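For reference, a quick way to check what the forked workers actually end up with is something like the following (a hypothetical diagnostic snippet, not part of prepare.py):

import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor

import torch

def thread_info(_):
    # Intra-op thread count torch will use in this worker, plus the OpenMP
    # environment setting, if any.
    return torch.get_num_threads(), os.environ.get("OMP_NUM_THREADS")

if __name__ == '__main__':
    executor = ProcessPoolExecutor(
        4, mp_context=multiprocessing.get_context("fork"))
    print(list(executor.map(thread_info, range(4))))
    executor.shutdown()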
