
Evaluate parallelization #1

Merged · merged 5 commits into main from evaluate-parallelization on Mar 14, 2024
Conversation

@Cesar199999 commented Mar 6, 2024

The current implementation of EqPolynomial::evaluate is not parallelized. Parallelizing it with rayon's into_par_iter reduces the function's running time.

The following time benchmarks show the performance of the function with and without parallelization:

[Table: running times of EqPolynomial::evaluate with and without parallelization, for increasing input sizes]

On my machine, parallelizing the function achieves a speedup of roughly 4x, which is already quite significant.
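
(For concreteness, a minimal sketch of this kind of change; the names `r`, `rx`, and `evaluate_parallel` are illustrative placeholders, with `u64` standing in for the field type, not the crate's actual code:)

```rust
use rayon::prelude::*;

// Illustrative only: parallelize a product over indices with into_par_iter.
// Real field arithmetic replaces the u64 placeholders in the actual crate.
fn evaluate_parallel(r: &[u64], rx: &[u64]) -> u64 {
    assert_eq!(r.len(), rx.len());
    (0..r.len())
        .into_par_iter()       // rayon splits the index range across threads
        .map(|i| r[i] * rx[i]) // per-index work runs in parallel
        .product()             // rayon reduces the partial products
}
```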

@mmagician

What's up with the jumps in running time in the parallel case? E.g., it's non-increasing between 2^5 and 2^6.

@Antonio95

The speed-up seems great! As Marcin says, there are some steps with unexpected time jumps. A possible explanation is some degree of parallelisation only activating once a runtime threshold is reached (just guessing).

Another thing to take into account: when benchmarking such low-work code as the first few entries of the table, it is best to run that code many times (say N) inside a loop, measure the elapsed time only before and after the loop (not inside each iteration), and then divide that total time by N. Sometimes that's the only way to get reasonably accurate measurements for code on the order of a few microseconds. Even for the higher-cost entries further down the table, it is good to use that procedure (perhaps reducing N, since individual iterations will be more consistent time-wise anyway).
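
(A minimal sketch of that total-then-divide measurement, reusing the placeholder `evaluate_parallel` from above; illustrative, not the actual bench code:)

```rust
use std::time::Instant;

// Illustrative only: time N calls in one loop and divide by N, rather than
// timing each call individually.
fn seconds_per_call(r: &[u64], rx: &[u64]) -> f64 {
    const N: u32 = 10_000; // lower N for the more expensive inputs
    let start = Instant::now();
    for _ in 0..N {
        // black_box keeps the optimizer from hoisting or deleting the call
        std::hint::black_box(evaluate_parallel(r, rx));
    }
    start.elapsed().as_secs_f64() / f64::from(N)
}
```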

Two questions:

  • Was the above table produced with that kind of total-then-divide averaging method? If not, that would be another possible explanation for the unexpected time jumps.
  • Since the i in the loop is only used to index two arrays, perhaps you can have a look at this rayon function and zip the two arrays (see the sketch after this list). I'm curious whether the runtime will change at all (and in which direction); likely not, but in any case the code would be a bit cleaner in my opinion. Also note that the current code indexes each array at the same position twice per iteration, which doesn't look great (although the compiler likely optimises that away).
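
(A sketch of the zipped variant from the second question, with the same placeholder names as above:)

```rust
use rayon::prelude::*;

// Illustrative only: iterate the two slices in lockstep, so each position
// is read exactly once and no explicit index is needed.
fn evaluate_zipped(r: &[u64], rx: &[u64]) -> u64 {
    r.par_iter()
        .zip(rx.par_iter())
        .map(|(ri, xi)| ri * xi)
        .product()
}
```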

@Cesar199999 (Author)

The measurements were produced in a single round; I assumed these results already show a clear speedup for large input instances. I can run multiple rounds to get more statistically significant results, though.

@Cesar199999 (Author)

@Antonio95 you are right, it definitely makes sense to zip the arrays for readability.

@Cesar199999 force-pushed the evaluate-parallelization branch 2 times, most recently from 8cfc291 to ae3ed75 on March 7, 2024 14:34
@Cesar199999 (Author)

New benchmarks with respect to the original implementation, computed using criterion:

     Running benches/bench.rs (target/release/deps/bench-507167fb87136d31)
[...]
evaluate_incremental/2^1
                        time:   [11.974 µs 11.997 µs 12.021 µs]
                        change: [+3566.1% +3600.1% +3631.2%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe
evaluate_incremental/2^3
                        time:   [15.089 µs 15.245 µs 15.372 µs]
                        change: [+1233.8% +1274.4% +1313.5%] (p = 0.00 < 0.05)
                        Performance has regressed.
evaluate_incremental/2^5
                        time:   [18.728 µs 18.848 µs 18.979 µs]
                        change: [+348.19% +360.39% +372.67%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
evaluate_incremental/2^7
                        time:   [28.957 µs 29.187 µs 29.460 µs]
                        change: [+99.432% +103.98% +108.28%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
evaluate_incremental/2^9
                        time:   [51.231 µs 51.524 µs 51.848 µs]
                        change: [-18.139% -16.317% -14.388%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
evaluate_incremental/2^11
                        time:   [111.06 µs 112.03 µs 113.14 µs]
                        change: [-52.575% -51.008% -48.892%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe
evaluate_incremental/2^13
                        time:   [315.19 µs 316.68 µs 318.05 µs]
                        change: [-66.222% -65.011% -63.865%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
evaluate_incremental/2^15
                        time:   [927.52 µs 935.10 µs 943.68 µs]
                        change: [-75.439% -74.528% -73.713%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
evaluate_incremental/2^17
                        time:   [3.2199 ms 3.2313 ms 3.2434 ms]
                        change: [-76.511% -76.253% -76.017%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  11 (11.00%) high mild
  2 (2.00%) high severe
evaluate_incremental/2^19
                        time:   [13.064 ms 13.172 ms 13.299 ms]
                        change: [-77.693% -77.435% -77.149%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
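
(For reference, a hedged sketch of a Criterion harness that would produce output of this shape; the group and size names match the log, but the input setup and `evaluate_parallel` are placeholders, not the actual benches/bench.rs:)

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

fn bench_evaluate(c: &mut Criterion) {
    let mut group = c.benchmark_group("evaluate_incremental");
    for log_n in (1..=19usize).step_by(2) {
        let n = 1usize << log_n;
        // Placeholder inputs; the real bench presumably uses random field elements.
        let (r, rx) = (vec![1u64; n], vec![1u64; n]);
        group.bench_function(BenchmarkId::from_parameter(format!("2^{log_n}")), |b| {
            b.iter(|| evaluate_parallel(&r, &rx))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_evaluate);
criterion_main!(benches);
```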

@mmagician

I wonder if there are any tricks that would allow you to use the non-parallel case for <2^7.
I don't know if it matters that much, since the absolute cost is small anyway, but it would be good to at least investigate whether rayon allows for that.

@mmagician

Also make sure that the poly definition used here is the one that is not redundant (due to microsoft#9).

@Cesar199999 (Author)

The obsolete poly definition is in polynomial.rs: https://github.com/HungryCatsStudio/Spartan2/blob/main/src/spartan/polynomial.rs

@mmagician

Cool. You can remove the obsolete definition in this PR or a follow-up, up to you.

@Cesar199999 requested a review from mmagician on March 14, 2024 12:12
@mmagician

Yeah, looks good. Could you repost the updated benches?

@Cesar199999 (Author)

> Cool. You can remove the obsolete definition in this PR or a follow-up, up to you.

Alright, I guess it does make sense to do it here.

> Yeah, looks good. Could you repost the updated benches?

Sorry, I didn't mean to re-request review. The benchmarks did not change much; I'm running them again, though.

@Cesar199999 (Author)

> I wonder if there are any tricks that would allow you to use the non-parallel case for <2^7. I don't know if it matters that much, since the absolute cost is small anyway, but it would be good to at least investigate whether rayon allows for that.

Unfortunately, any trick would require a hard-coded check on the length of the array, skipping parallelization if it is too small. This is the known problem with parallelizing small instances (rayon-rs/rayon#648).
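
(A sketch of that cutoff trick; the threshold constant is made up for illustration, and the real value would have to come from benchmarks like the ones above:)

```rust
use rayon::prelude::*;

// Illustrative only: below a hard-coded size, stay on the sequential path
// and avoid rayon's fixed scheduling overhead entirely.
const PAR_THRESHOLD: usize = 1 << 7; // hypothetical cutoff, tune per machine

fn evaluate_adaptive(r: &[u64], rx: &[u64]) -> u64 {
    if r.len() < PAR_THRESHOLD {
        r.iter().zip(rx).map(|(ri, xi)| ri * xi).product()
    } else {
        r.par_iter().zip(rx.par_iter()).map(|(ri, xi)| ri * xi).product()
    }
}
```

(For what it's worth, rayon's `with_min_len` only coarsens how an indexed parallel iterator is split; it still dispatches through the thread pool, so it doesn't remove the fixed overhead.)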

@mmagician

Right, so it's as we all thought. Thanks for pointing to a source for this!

@Cesar199999 (Author)

New benches:

evaluate_incremental/2^1
                       time:   [13.666 µs 13.717 µs 13.768 µs]
                       change: [+6812.6% +6937.9% +7055.6%] (p = 0.00 < 0.05)
                       Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
 3 (3.00%) low mild
 2 (2.00%) high mild
 1 (1.00%) high severe
evaluate_incremental/2^3
                       time:   [14.974 µs 15.108 µs 15.262 µs]
                       change: [+1966.1% +2019.2% +2074.8%] (p = 0.00 < 0.05)
                       Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
 4 (4.00%) high mild
 7 (7.00%) high severe
evaluate_incremental/2^5
                       time:   [18.359 µs 18.562 µs 18.785 µs]
                       change: [+545.64% +552.71% +559.60%] (p = 0.00 < 0.05)
                       Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
 3 (3.00%) high mild
evaluate_incremental/2^7
                       time:   [28.191 µs 28.402 µs 28.628 µs]
                       change: [+145.80% +148.43% +151.12%] (p = 0.00 < 0.05)
                       Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
 3 (3.00%) high mild
 1 (1.00%) high severe
evaluate_incremental/2^9
                       time:   [55.980 µs 57.917 µs 60.159 µs]
                       change: [+21.740% +23.970% +26.544%] (p = 0.00 < 0.05)
                       Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
 4 (4.00%) high mild
 4 (4.00%) high severe
evaluate_incremental/2^11
                       time:   [112.90 µs 114.63 µs 116.48 µs]
                       change: [-38.754% -37.528% -35.992%] (p = 0.00 < 0.05)
                       Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
 2 (2.00%) high mild
evaluate_incremental/2^13
                       time:   [277.95 µs 282.42 µs 287.06 µs]
                       change: [-62.248% -61.689% -61.053%] (p = 0.00 < 0.05)
                       Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
 1 (1.00%) high mild
 1 (1.00%) high severe
evaluate_incremental/2^15
                       time:   [953.95 µs 986.72 µs 1.0333 ms]
                       change: [-69.195% -68.426% -67.543%] (p = 0.00 < 0.05)
                       Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
 2 (2.00%) high mild
 1 (1.00%) high severe
evaluate_incremental/2^17
                       time:   [3.4105 ms 3.4747 ms 3.5640 ms]
                       change: [-71.229% -70.680% -69.911%] (p = 0.00 < 0.05)
                       Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
 2 (2.00%) high mild
 3 (3.00%) high severe
evaluate_incremental/2^19
                       time:   [11.854 ms 12.042 ms 12.241 ms]
                       change: [-75.035% -74.632% -74.186%] (p = 0.00 < 0.05)
                       Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
 1 (1.00%) high mild

@mmagician commented Mar 14, 2024

Top.
Two remaining things then:

  1. resign commits ✅
  2. remove duplicated files ✅

Then we can PR directly to upstream, I think.

Cesar Descalzo Blanco added 3 commits on March 14, 2024 14:31:
  • Fix tabulation
  • Add missing import / Fix random `Fp` slice creation
  • Fix formatting
@Cesar199999 force-pushed the evaluate-parallelization branch from c3b8b3e to c87e6ae on March 14, 2024 13:33
@Cesar199999 force-pushed the evaluate-parallelization branch from c87e6ae to 3ccd639 on March 14, 2024 13:35
@mmagician merged commit 264ed93 into main on Mar 14, 2024. 1 check passed.
Antonio95 added a commit that referenced this pull request on Mar 19, 2024:

Revert "Evaluate parallelization"

This reverts commit 264ed93, reversing changes made to d78a504.