# Parallelize `probs` with OpenMP (#800)
## Conversation
**Codecov Report:** All modified and coverable lines are covered by tests ✅

Coverage diff (`master` vs. #800):

| | master | #800 | +/- |
|---|---:|---:|---:|
| Coverage | 98.64% | 92.36% | -6.29% |
| Files | 114 | 73 | -41 |
| Lines | 17653 | 11167 | -6486 |
| Hits | 17414 | 10314 | -7100 |
| Misses | 239 | 853 | +614 |

View full report in Codecov by Sentry.
I think it would be nice to add a scaling efficiency plot. What do you think @vincentmr?
Review comment on `pennylane_lightning/core/src/simulators/lightning_qubit/measurements/MeasurementsLQubit.hpp`:
The implementation has good scaling and achieves a significant speedup.
Nice work! Thank you for that!
### Before submitting

Please complete the following checklist when submitting a PR:

- [x] All new features must include a unit test. If you've fixed a bug or added code that should be tested, add a test to the [`tests`](../tests) directory!
- [x] All new functions and code must be clearly commented and documented. If you do make documentation changes, make sure that the docs build and render correctly by running `make docs`.
- [x] Ensure that the test suite passes, by running `make test`.
- [x] Add a new entry to the `.github/CHANGELOG.md` file, summarizing the change, and including a link back to the PR.
- [x] Ensure that code is properly formatted by running `make format`.

When all the above are checked, delete everything above the dashed line and fill in the pull request template.

------------------------------------------------------------------------------------------------------------

**Context:** `sample` calls `generate_samples`, which computes the full probabilities and uses the alias method to generate samples for all wires. This is wasteful whenever samples are required only for a subset of all wires.

**Description of the Change:** Move the alias-method logic into the `discrete_random_variable` class (see the sketch below).

**Benefits:** Compute minimal probs and samples. We benchmark the current changes against `master`, which already benefits from some good speed-ups introduced in #795 and #800. We use ISAIC's AMD EPYC-Milan processor with a single core/thread. The times are obtained using at least 5 experiments and running for at least 250 milliseconds.

We begin by comparing `master`'s `generate_samples(num_samples)` with our `generate_samples({0}, num_samples)`. For 4-12 qubits, overheads dominate the calculation (the absolute times range from 6 microseconds to 18 milliseconds, which is not a lot). Already at 12 qubits, however, a trend appears where our implementation is significantly faster. This is to be expected for two reasons: `probs(wires)` itself is faster than `probs()` (for enough qubits), and `sample(wires)` also starts requiring significantly less work than `sample()`.



Next we turn to comparing `master`'s `generate_samples(num_samples)` with our `generate_samples({0..num_qubits/2}, num_samples)`. The situation there is similar, with speed-ups close to 1 for the smaller qubit counts and (sometimes) beyond 20x for qubit counts above 20.



Finally we compare `master`'s `generate_samples(num_samples)` with our `generate_samples({0..num_qubits-1}, num_samples)` (i.e. computing samples on all wires). We expect similar performance since the main difference comes from the caching mechanism in `master`'s discrete random variable generator. The data suggests caching samples is counter-productive compared with calculating the sample values on the fly.



Turning OMP ON, using 16 threads, and comparing `master`'s `generate_samples(num_samples)` with our `generate_samples({0}, num_samples)`, we get good speed-ups above 12 qubits. Below that, the overhead of spawning threads isn't repaid, but absolute times remain low.



**Possible Drawbacks:**

**Related GitHub Issues:** [sc-65127]

---------

Co-authored-by: ringo-but-quantum <[email protected]>
Co-authored-by: Ali Asadi <[email protected]>
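For reference, here is a minimal, self-contained sketch of the alias method described above. It is an illustration only, not the actual `discrete_random_variable` implementation in Lightning-Qubit; the class name and interface below are hypothetical.

```cpp
// Sketch of Walker's alias method for O(1) discrete sampling.
// Assumes `probs` is a non-empty, normalized probability distribution.
#include <cstddef>
#include <random>
#include <vector>

class DiscreteRandomVariable {
  public:
    explicit DiscreteRandomVariable(const std::vector<double> &probs)
        : prob_(probs.size()), alias_(probs.size()) {
        const std::size_t n = probs.size();
        std::vector<double> scaled(n);
        std::vector<std::size_t> small, large;
        for (std::size_t i = 0; i < n; ++i) {
            scaled[i] = probs[i] * static_cast<double>(n);
            (scaled[i] < 1.0 ? small : large).push_back(i);
        }
        // Pair an under-full bucket with an over-full one until every
        // bucket holds exactly one unit of probability mass.
        while (!small.empty() && !large.empty()) {
            const std::size_t s = small.back(); small.pop_back();
            const std::size_t l = large.back(); large.pop_back();
            prob_[s] = scaled[s];
            alias_[s] = l;
            scaled[l] = (scaled[l] + scaled[s]) - 1.0;
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        for (const std::size_t i : large) { prob_[i] = 1.0; }
        for (const std::size_t i : small) { prob_[i] = 1.0; }
    }

    // Draw one sample in O(1): pick a bucket uniformly, then return
    // either the bucket itself or its alias.
    template <class Gen> std::size_t operator()(Gen &gen) const {
        std::uniform_int_distribution<std::size_t> bucket(0, prob_.size() - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        const std::size_t i = bucket(gen);
        return coin(gen) < prob_[i] ? i : alias_[i];
    }

  private:
    std::vector<double> prob_;       // acceptance threshold per bucket
    std::vector<std::size_t> alias_; // fallback outcome per bucket
};
```

In this picture, sampling only a subset of wires amounts to building `probs` for the marginal distribution over the target wires and drawing indices from it on the fly, instead of computing and caching samples for all wires.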
### Before submitting

Please complete the following checklist when submitting a PR:

- All new features must include a unit test. If you've fixed a bug or added code that should be tested, add a test to the [`tests`](../tests) directory!
- All new functions and code must be clearly commented and documented. If you do make documentation changes, make sure that the docs build and render correctly by running `make docs`.
- Ensure that the test suite passes, by running `make test`.
- Add a new entry to the `.github/CHANGELOG.md` file, summarizing the change, and including a link back to the PR.
- Ensure that code is properly formatted by running `make format`.

When all the above are checked, delete everything above the dashed line and fill in the pull request template.
**Context:** `probs` is central in circuit simulation measurements.

**Description of the Change:** Parallelize the `probs` loops using OpenMP; a minimal sketch of the approach is shown below.
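To make the change concrete, here is a sketch of an OpenMP-parallelized probability kernel. It is an illustration under assumptions, not the exact kernels merged in this PR; `compute_probs` and its signature are hypothetical.

```cpp
// Illustrative sketch: computing probabilities from a state vector
// with an OpenMP-parallelized loop.
#include <complex>
#include <cstddef>
#include <vector>

// Compute |amplitude|^2 for every basis state. Each iteration writes
// to a distinct output element, so the loop parallelizes trivially.
std::vector<double>
compute_probs(const std::vector<std::complex<double>> &state) {
    const std::size_t n = state.size();
    std::vector<double> probs(n);
#pragma omp parallel for
    for (std::size_t k = 0; k < n; ++k) {
        probs[k] = std::norm(state[k]); // squared magnitude
    }
    return probs;
}
```

Marginal probabilities over a subset of target wires additionally accumulate several amplitudes into the same output bin, which requires per-thread buffers or an OpenMP reduction to stay thread-safe.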
**Benefits:** Faster execution with several threads.

The following benchmarks are performed on ISAIC's AMD EPYC-Milan processor using several cores/threads. The times are obtained by averaging the computation of `probs(target)` 5 times for various numbers of targets. We use the last release's implementation as a reference; since #795 brings some speed-ups even for a single thread, we observe speed-ups greater than the number of threads.

Another view on the data is the strong scaling efficiency: it is almost perfect for 2-4 threads, fairly good for 8 threads, and diminishes significantly for 16 threads.
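For reference, strong scaling efficiency measures how close the observed speed-up comes to ideal linear scaling at a fixed problem size:

$$E(p) = \frac{T_1}{p\,T_p},$$

where $T_1$ is the single-thread runtime, $T_p$ is the runtime on $p$ threads, and $E(p) = 1$ denotes perfect scaling.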
**Possible Drawbacks:**

**Related GitHub Issues:**