
Parallel optimization - behind a test flag --test:ParallelOptimization #14390

Merged (70 commits) on May 19, 2023

Conversation

@safesparrow (Contributor) commented Nov 24, 2022

NOTE: The feature as it stands has a major issue caused by shared mutable state used and mutated by the optimization code.
See the comments in this PR and #14836 for details.

Summary

Release optimization for FSharp.Compiler.Service can be sped up by ~43% by making different phases run partially in parallel.

Compiler optimization can be parallelised in two ways:

  1. EncodeSignatureData and ApplyAllOptimizations steps are completely independent.
  2. In ApplyAllOptimizations, optimization of a given file in a given phase (1-7) depends on optimization results of all previous phases of that file, and on results of previous files for the same phase - but it doesn't depend on previous files' optimization for different phases.

In this PR we focus on 2.
We schedule individual file * phase work items as soon as possible.

This is mainly an optimization for Release compilation - in Debug most phases are very quick/a no-op, so there is little to parallelize.

Details of implementation

The old sequential code works in the following way:

  • Files are optimized sequentially. Each file's results consist of a few items, and are carried over to the next file's optimization.
  • Each file's optimization can be split into 7 segments/phases. These phases are arbitrary parts of the per-file optimization function that I extracted in a black-box way, without any knowledge of what they were doing, and instead purely based on their inputs and outputs. All that was relevant to perform this change was to know that phases 1-7 happen in order for each file.
  • The crucial observation is that for a given file, its phase 1 calculation does not depend on results of next segments of previous files, and so on for all phases. This observation can be directly concluded from observing the segments' code in ApplyAllOptimizations, their input and output variables. The only way for them to have such dependency would therefore be through static state. EDIT: However, this is not entirely true due to the same shared mutable state reused by multiple phases and multiple files. See comments for details.
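For reference, the sequential shape described above can be sketched as a nested loop in which each phase reads only that file's state from the previous phase plus the per-phase state carried over from earlier files. This is a hypothetical Python illustration (the actual compiler code is F#, and `phases`/`carry` are stand-ins for the real data flow):

```python
def optimize_sequentially(files, phases):
    """phases[p](state, carry) -> (state, carry).

    'state' is this file's result so far; 'carry[p]' is the state threaded
    through all files for phase p. Note phase p never reads carry[p+1] or
    carry[p-1] of a later file - which is exactly the independence the PR
    exploits.
    """
    carry = [None] * len(phases)
    results = []
    for file_input in files:
        state = file_input
        for p, phase in enumerate(phases):
            state, carry[p] = phase(state, carry[p])
        results.append(state)
    return results
```

Under this structure, file X's phase 1 clearly cannot observe file X-1's phase 2 output, since only `carry[1]` is threaded between files for phase 1.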

The changes in this PR can be summarised as follows:

  • The old sequential optimization code is refactored to distinguish seven optimization phases explicitly.
  • There is a new test flag in the compiler: --test:ParallelOptimization.
  • With the flag turned off, the refactored sequential code is used, which should behave exactly the same as the old code.
  • With the flag turned on, we schedule individual file * phase pairs independently, as soon as all their dependencies have finished processing.
  • The processing 'graph' consists of up to N * 7 nodes, where N is the number of source files.
  • As soon as processing of file X, phase Y has finished, we schedule file X, phase Y+1 for processing.
  • As soon as processing of file X, phase Y and file X+1, phase Y-1 has finished, we schedule file X+1, phase Y for processing.
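The scheduling rule in the last two bullets can be sketched as a small dependency-counting scheduler: cell (file, phase) becomes ready once (file, phase-1) and (file-1, phase) are both done. This is a hypothetical Python illustration, not the compiler's F# implementation; `run_phase` stands in for one file * phase work item:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def optimize_in_parallel(num_files, num_phases, run_phase):
    """Run run_phase(file, phase) for every cell, respecting the
    dependencies (file, phase-1) and (file-1, phase).
    Sketch only: assumes run_phase does not raise."""
    lock = threading.Lock()
    # Remaining prerequisite count per cell: 2 in the interior,
    # 1 on the first row/column, 0 for cell (0, 0).
    pending = {
        (f, p): (f > 0) + (p > 0)
        for f in range(num_files)
        for p in range(num_phases)
    }
    done = threading.Event()

    with ThreadPoolExecutor() as pool:
        def complete(f, p):
            run_phase(f, p)
            ready = []
            with lock:
                # Decrement both dependents; schedule any that became ready.
                for dep in ((f + 1, p), (f, p + 1)):
                    if dep in pending:
                        pending[dep] -= 1
                        if pending[dep] == 0:
                            ready.append(dep)
            for dep in ready:
                pool.submit(complete, *dep)
            # The last cell transitively depends on all others, so its
            # completion means the whole graph is finished.
            if (f, p) == (num_files - 1, num_phases - 1):
                done.set()

        pool.submit(complete, 0, 0)
        done.wait()
```

Each cell is scheduled exactly once (its counter reaches zero exactly once, under the lock), and file order is preserved per phase because (file-1, phase) is always a prerequisite of (file, phase).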

Diagram comparing sequential and partially parallel processing flow (example with 3 phases)


Timings for FSharp.Compiler.Service

Test run on an 8-core/16-thread CPU.
It's worth noting that no more than 7 items are processed at a time.
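The 7-item bound follows from the shape of the graph: the cells that can run simultaneously lie on an anti-diagonal file + phase = t, which never contains more than min(N, 7) cells. A quick sanity check (illustrative Python, not compiler code):

```python
def max_in_flight(num_files, num_phases=7):
    """Largest anti-diagonal of the num_files x num_phases dependency grid,
    i.e. the maximum number of cells that can ever be ready at once."""
    return max(
        sum(
            1
            for f in range(num_files)
            for p in range(num_phases)
            if f + p == t
        )
        for t in range(num_files + num_phases - 1)
    )
```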

| Optimize | Mode | Optimization Time | Total Time |
|---|---|---|---|
| + | Sequential | 12.9s | 31.0s |
| + | Parallel | 7.1s (-45%) | 25.5s (-18%) |
| - | Sequential | 5.6s | 24.3s |
| - | Parallel | 3.7s (-34%) | 23.6s (-3%) |
Details

Sequential - -optimize+:

Real: 23.4 Realdelta: 12.9 Cpu: 45.0 Cpudelta: 10.5 Mem: 1575 G0:  61 G1: 25 G2:  1 [Optimizations]
...
Real: 31.0 Realdelta:  2.6 Cpu: 62.4 Cpudelta:  2.8 Mem: 3018 G0:   1 G1:  0 G2:  0 [Write .NET Binary]

Parallel - -optimize+:

Real: 17.4 Realdelta:  7.1 Cpu: 45.2 Cpudelta: 10.7 Mem: 1956 G0:  49 G1: 21 G2:  1 [Optimizations]
...
Real: 25.5 Realdelta:  2.8 Cpu: 58.3 Cpudelta:  3.6 Mem: 3041 G0:   1 G1:  1 G2:  0 [Write .NET Binary]

Sequential - -optimize-:

Real: 15.8 Realdelta:  5.6 Cpu: 30.4 Cpudelta:  4.5 Mem: 1462 G0:  22 G1:  1 G2:  0 [Optimizations]
...
Real: 24.3 Realdelta:  3.5 Cpu: 39.5 Cpudelta:  2.9 Mem: 1975 G0:   7 G1:  4 G2:  0 [Write .NET Binary]

Parallel - -optimize-:

Real: 14.6 Realdelta:  3.7 Cpu: 33.5 Cpudelta:  4.0 Mem: 1564 G0:  21 G1:  3 G2:  0 [Optimizations]
...
Real: 23.6 Realdelta:  3.4 Cpu: 44.3 Cpudelta:  2.8 Mem: 2049 G0:   7 G1:  3 G2:  0 [Write .NET Binary]

Traces showing how the phased optimization happens in sequential (left) and parallel (right) mode.
Note that on the right we start processing file X, phase Y before file X-1, phase Y+1/Y+2 finishes, which is the source of speedup.

TODOs:

  • cleanup
  • Use proper async/NodeCode
  • Wire up DiagnosticsLogger to worker tasks correctly, if needed.
  • Revisit test flag name
  • Add test results
  • Simplify graph scheduling
  • Potentially restrict the flag to Release mode / --optimize+. However, as the test results demonstrate, significant gains are possible even with --optimize-, so I think the feature should always apply (when enabled).
  • Other?

@safesparrow marked this pull request as ready for review December 1, 2022 23:51
@safesparrow requested a review from a team as a code owner December 1, 2022 23:51

@safesparrow (Contributor, Author) commented Dec 1, 2022

I think the PR is now in a reasonable state.

The main thing I'm unsure of is these two points:

  • Use proper async/NodeCode
  • Wire up DiagnosticsLogger to worker tasks correctly, if needed.

I'm not entirely sure if they are relevant in the compilation case, and how to do it exactly, if they are.

@baronfel @vzarytovskii Would you be able to have a look at the PR and send over some suggestions? Thanks!

@safesparrow safesparrow changed the title [DRAFT] WIP - partially parallel release optimization Partially parallel release optimization - behind a test flag --test:PartiallyParallelOptimization Dec 2, 2022
@safesparrow safesparrow changed the title Partially parallel release optimization - behind a test flag --test:PartiallyParallelOptimization Partially parallel optimization - behind a test flag --test:PartiallyParallelOptimization Dec 2, 2022
Review thread on src/Compiler/Utilities/lib.fs (outdated, resolved)
@TIHan (Contributor) left a comment:

This is a very interesting change. If we could do these optimizations in parallel, it would definitely provide a significant compile-time perf boost.

When I was looking at parallelism over a year ago, I did look at the optimizer to see what could be done. I gave up pretty quickly because the order of files matters with respect to inlining functions, so I didn't look further. I would really, really check to make sure that a change like this doesn't affect it.

Another thing to note: you need to make sure that the compiler is still deterministic.

@TIHan (Contributor) commented Dec 2, 2022

> but it doesn't depend on previous files' optimization for different phases.

I'm not completely sure how true this is, but it may be plausible in certain cases.

@safesparrow (Contributor, Author) commented Dec 2, 2022

> but it doesn't depend on previous files' optimization for different phases.

This conclusion is implied by the code. If you look at the sequential code and follow its inputs and outputs, you should see that phase1 doesn't use any outputs from previous files' phase2, and phase2 doesn't use any outputs from previous files' phase3.

This is more easily seen in the refactored sequential code, which should be equivalent to the old version, unless I introduced a bug.

So my claim is that it is true in all cases, and that this is guaranteed by the code structure. Happy to be proven wrong, though.

> When I was looking at parallelism over a year ago, I did look at the optimizer to see what could be done. I gave up pretty quickly because the order of files matters with respect to inlining functions, so I didn't look further.

The order of files does matter, and this PR does maintain file order - that's why file X, phase Y depends on file X-1, phase Y.
It's just that file X, phase Y does not depend on file X-1, phase Y+1.

I added a chart illustrating this flow in the description.

@T-Gro (Member) commented May 9, 2023

I ran the test in an isolated loop over 200 times - not a single failure.
It also did not fail when run together with a few other tests.

I have a new suspicion which would make things horrible to troubleshoot:

  • This is not flakiness in terms of test execution, but rather in the self-built compiler.
  • Meaning, a bootstrapped compiler which is then used for testing either has a flaw or hasn't.
  • If it hasn't, the test obviously will not fail, no matter how many times it is repeated.

@nojaf (Contributor) commented May 10, 2023

The biggest problem is reliably reproducing the error.

Did you try adding some random Thread.Sleep calls? I found that to be useful when working on the typar problem.

@safesparrow (Contributor, Author) commented May 10, 2023

> This is not flakiness in terms of test execution, but rather in the self-built compiler

This would imply that the compiler from the SDK, used to build the bootstrap compiler, is not deterministic and can produce a buggy bootstrap compiler that then produces a buggy target compiler, which fails this particular test on this particular PR.

This seems very unlikely to me.

EDIT: Unless you mean that the bootstrap compiler is built deterministically from the SDK, but it in turn gives non-deterministic target compiler which either works or doesn't.

> My biggest suspect is the ArrayParallel module change and different behavior when it comes to threading, under load. When reverting that behavior and running the test in isolation, it never failed.

I never got a failed test when I ran it in isolation. So I don't think we can imply anything from the above.
Having said that, I'm happy to revert that change, push the branch, and do some testing locally.
It would be "funny" if it's that in the end 🙂

@T-Gro (Member) commented May 17, 2023

Finally I was able to create a methodology for investigating this issue.
Of the things that worked, this is a reproducible scenario:

  1. Create sibling tests next to 2766 which do the same thing, just with a different name. I did this to have 9 of them in total.
  2. Include them in Stress/env.lst.
  3. Comment out all the other tests, both in Stress/ and in the overall Source/test.lst, to run ONLY those stress tests and nothing else.
  4. In eng/Build.ps1, I played with the -procs parameter passed to the perl runner (runall.pl) to check what was happening.

This PR fails when multiple processes are involved AND the hosted compiler is used. It does not fail when a single process is used.

I have now reverted the change in the ArrayParallel module and am again trying various levels of process parallelism for the hosted compiler (1, 2, 4, 8), and the tests are passing. I kept all the other optimization-related code of this PR.

Yet to be verified by me: so far I have only verified a previous failure and now a success in an isolated scenario. I will run the full suite using the new code many times in a loop to observe empirically whether it ever fails.

@T-Gro (Member) commented May 17, 2023

Why is it that way?
The only reason I can think of is what MaxDegreeOfParallelism does when resources are tight.
I will research the inner workings more; very likely there are special cases/optimizations based on which it decides either to run the code on the current thread that requested the operation, or to use the thread pool for it.

Nevertheless, the fact that parsing behavior depends on the threading situation is wrong as well. I will check that further for usage of thread locals (or indirect/hidden shared data, such as usage of array pools).

@T-Gro (Member) commented May 17, 2023

The change is in. Locally, this does not fail for me in either the isolated case or the full .\Build.cmd -c Release -testFsharpQA -norestore.

I will restart CI on the server several times as well and watch it pass.

@T-Gro requested review from TIHan, dsyme and vzarytovskii May 17, 2023 17:23
@T-Gro (Member) commented May 17, 2023

/azp run

@azure-pipelines: Azure Pipelines successfully started running 2 pipeline(s).

@T-Gro (Member) commented May 17, 2023

Now it is stable in CI as well, and locally when running either the stress tests in isolation or the full fsharpQA test suite (in parallel).

The innocent refactoring of the ArrayParallel module meant that a different code path could be hit for the inner actions. And the Parallel class of .NET has optimized special cases for sometimes running actions synchronously rather than on the thread pool, which means that both the thread-local state and the remaining stack size can differ.

I do not think it is worth investigating this refactoring deeply enough to find out what happened in a degenerate scenario (a stack of 500 unfinished expressions, which is what the stress test was about), so I reverted it and kept the code as it was.
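The hazard described here - the same action sometimes running inline on the calling thread and sometimes on a pool thread, so any thread-local state it touches differs - is not specific to .NET's Parallel class. A minimal, hypothetical Python illustration of the same effect:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counters = threading.local()  # per-thread state, analogous to ThreadLocal<T>

def work():
    # Each thread sees its own 'calls' counter, so the result depends on
    # *where* the scheduler chose to run this action.
    calls = getattr(counters, "calls", 0) + 1
    counters.calls = calls
    on_main = threading.current_thread() is threading.main_thread()
    return on_main, calls

inline_result = work()  # executed synchronously on the calling thread
with ThreadPoolExecutor(max_workers=1) as pool:
    pooled_result = pool.submit(work).result()  # executed on a pool thread
# The two results disagree about thread identity, and each thread has
# accumulated its own independent counter.
```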

The main body of this PR, the parallel optimization, is unaffected by this and is now ready to be re-reviewed & merged.

@vzarytovskii (Member) left a comment:

Blocking it to not forget to review it tomorrow.

@safesparrow (Contributor, Author) commented:

Can we merge it now, @vzarytovskii?
