Parallel optimization - behind a test flag --test:ParallelOptimization
#14390
Conversation
…el for now as almost no gain.
I think the PR is now in a reasonable state. The main things I'm unsure of are these two points:
I'm not entirely sure if they are relevant in the compilation case, and how to do it exactly, if they are. @baronfel @vzarytovskii Would you be able to have a look at the PR and send over some suggestions? Thanks!
--test:PartiallyParallelOptimization
This is a very interesting change. Definitely, if we could do these optimizations in parallel, it would provide a significant compile-time perf boost.
When I was looking at parallelism over a year ago, I did look at the optimizer to see what could be done. I gave up pretty quickly because the order of files matters with respect to inlining functions, so I didn't look further. I would check really carefully to make sure that a change like this wouldn't affect it.
Another thing to note: you need to make sure that the compiler is still deterministic.
I'm not completely sure how true this is, but it may be plausible in certain cases.
This conclusion is implied by the code. If you look at the sequential code and follow its inputs and outputs, you should see that phase1 doesn't use any outputs from previous files' phase2, and phase2 doesn't use any outputs from previous files' phase3. This is more easily seen in the refactored sequential code, which should be equivalent to the old version, unless I introduced a bug. So my claim is that it is true in all cases and that this is guaranteed by the code structure. Happy to be proven wrong though.
The order of files does matter and this PR does maintain file order - that's why I added a chart illustrating this flow in the description.
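To make the claimed dependency shape concrete, here is a minimal sketch (illustrative only - the names and types are not from the compiler): the only inputs of work item `(file X, phase Y)` are `(file X, phase Y-1)` and `(file X-1, phase Y)`.

```fsharp
// Illustrative only - not compiler code. The claim above, stated as a
// dependency function over (file, phase) work items: phase Y of file X
// reads only the same file's previous phase and the previous file's
// same phase, never a later phase of an earlier file.
let dependencies (file: int, phase: int) : (int * int) list =
    [ if phase > 0 then file, phase - 1      // same file, previous phase
      if file > 0 then file - 1, phase ]     // previous file, same phase
```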
I ran the test in an isolated loop over 200 times - not a single failure. I have a new suspicion which would make things horrible to troubleshoot:
Did you try adding some random
This would imply that the compiler from the SDK, used to build the bootstrap compiler, is not deterministic and can produce a buggy bootstrap compiler that then produces a buggy target compiler, which fails this particular test on this particular PR. This seems very unlikely to me. EDIT: Unless you mean that the bootstrap compiler is built deterministically from the SDK, but it in turn gives a non-deterministic target compiler which either works or doesn't.
I never got a failed test when I ran it in isolation. So I don't think we can imply anything from the above.
Finally, I was able to create a methodology for investigating this issue.
This PR fails when multiple processes are involved AND the hosted compiler is used. It does not fail when a single process is used. I have now reverted the change in the ArrayParallel module, and am again trying various levels of process parallelism for the hosted compiler (1, 2, 4, 8), and the tests are passing. I kept all the other optimization-related code of this PR. Yet to be verified by me: so far I have only verified a previous failure and now a success in an isolated scenario. I will run the full suite using the new code many times in a loop to empirically observe whether it ever fails.
Why is it that way? Nevertheless, the fact that parsing behavior depends on the threading situation is wrong as well; I will check that further for usage of thread locals (or indirect/hidden shared data, such as usage of array pools).
The change is in. Locally, this does not fail for me in either the isolated case or the full suite. I will restart CI on the server several times as well and watch it pass.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
Now it is stable in CI as well, and locally when running either the stress tests in isolation or the full fsharpQA test suite (in parallel). The innocent refactoring of the ArrayParallel module meant that a different code path could have been hit for the inner actions. I do not think this refactoring is worth investigating deeply enough to find out what happened in a degenerate scenario (a stack of 500 unfinished expressions - that is what the stress test was about), so I reverted it and kept it as it was. The main body of this PR, the parallel optimization, is unaffected by this and is now ready to be re-reviewed & merged.
Blocking it so I don't forget to review it tomorrow.
Can we merge it now, @vzarytovskii?
NOTE: The feature as it stands has a major issue caused by shared mutable state used and mutated by optimisation code.
See comments in this PR and #14836 for details.
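As a hedged illustration of this failure mode (hypothetical code, not the optimiser's actual state): if two phases communicate through a cache that never appears in their inputs or outputs, the inputs/outputs argument used later in this description misses a real dependency, and concurrent phases can race on it.

```fsharp
// Hypothetical example - not the optimiser's actual state. A cache shared
// by two phases is invisible to an inputs/outputs analysis, so running
// (file X, phase 1) concurrently with (file X-1, phase 2) races on it.
open System.Collections.Generic

let sharedCache = Dictionary<string, int>()     // hidden shared mutable state

let phase1 (code: string) : string =
    sharedCache.[code] <- code.Length           // phase 1 writes the cache
    code

let phase2 (code: string) : string =
    // Phase 2 reads it: the result now depends on scheduling order, and
    // concurrent Dictionary writes are not thread-safe in the first place.
    match sharedCache.TryGetValue code with
    | true, n -> sprintf "%s (%d)" code n
    | _ -> code
```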
Summary
Release optimization for `FSharp.Compiler.Service` can be sped up by ~43% by making different phases run partially in parallel.
Compiler optimization can be parallelised in two ways:
1. The `EncodeSignatureData` and `ApplyAllOptimizations` steps are completely independent.
2. Within `ApplyAllOptimizations`, optimization of a given file in a given phase (1-7) depends on optimization results of all previous phases of that file, and on results of previous files for the same phase - but it doesn't depend on previous files' optimization for different phases.
In this PR we focus on 2. We schedule individual `file * phase` work items as soon as possible.
This is mainly an optimization for `Release` compilation - in `Debug` most phases are very quick/a no-op, so there is little to parallelize.
Details of implementation
The old sequential code works in the following way: each file is optimized by running 7 `phases` in order. These `phases` are arbitrary parts of the per-file optimization function that I extracted in a black-box way, without any knowledge of what they were doing, and instead purely based on their inputs and outputs. All that was relevant to perform this change was to know that phases 1-7 happen in order for each file.
A file's `phase 1` calculation does not depend on results of later phases of previous files, and so on for all phases. This observation can be directly concluded from the segments' code in `ApplyAllOptimizations` and their input and output variables. The only way for them to have such a dependency would therefore be through static state. EDIT: However, this is not entirely true, due to the same shared mutable state being reused by multiple phases and multiple files. See comments for details.
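A minimal sketch of that refactored sequential shape (assumed names - the real phase bodies and state types differ): one environment per phase is threaded across files, and each file's code flows through the phases in order, which is what makes the per-item dependencies visible.

```fsharp
// Sketch with assumed names - the real phases and state types differ.
type PhaseEnv = { ProcessedFiles: int }     // stand-in for per-phase state
type Code = string                          // stand-in for a file's typed tree

// A phase consumes the environment left at this phase by the previous file,
// plus this file's output from the previous phase.
let runPhase (env: PhaseEnv) (code: Code) : PhaseEnv * Code =
    { ProcessedFiles = env.ProcessedFiles + 1 }, code

// Sequential reference: for each file, run phases 1..N in order, threading
// one environment per phase across the files.
let optimizeSequential (phaseCount: int) (files: Code list) : Code list =
    let envs = Array.create phaseCount { ProcessedFiles = 0 }
    files
    |> List.map (fun file ->
        let mutable code = file
        for p in 0 .. phaseCount - 1 do
            let env', code' = runPhase envs.[p] code
            envs.[p] <- env'   // same-phase dependency on the previous file
            code <- code'      // same-file dependency on the previous phase
        code)
```

In this shape, `(file X, phase Y)` only needs `envs.[Y]` as left by file X-1 and `code` as left by phase Y-1, which is exactly the dependency relation used for parallel scheduling.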
The changes in this PR can be summarised as follows:
- The feature is hidden behind a test flag `--test:ParallelOptimization`.
- We process `file * phase` pairs independently, as soon as all their dependencies have finished processing. This gives `N*7` nodes, where N = number of source files.
- When `file X, phase Y` has finished, we schedule `file X, phase Y+1` for processing.
- When both `file X, phase Y` and `file X+1, phase Y-1` have finished, we schedule `file X+1, phase Y` for processing (see the sketch after this list).
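The scheduling rule above can be sketched as a small dependency graph over tasks (assumed structure, not the PR's actual `--test:ParallelOptimization` implementation):

```fsharp
// A sketch of the scheduling rule, not the PR's actual implementation:
// node (f, p) starts once (f, p-1) and (f-1, p) have both completed.
open System.Collections.Generic
open System.Threading.Tasks

let runGraph (fileCount: int) (phaseCount: int) (work: int -> int -> unit) =
    let nodes = Dictionary<int * int, Task>()
    for f in 0 .. fileCount - 1 do
        for p in 0 .. phaseCount - 1 do
            let deps =
                [| if p > 0 then nodes.[(f, p - 1)]     // same file, prev phase
                   if f > 0 then nodes.[(f - 1, p)] |]  // prev file, same phase
            // Schedule the node as soon as all its dependencies finish.
            nodes.[(f, p)] <- Task.WhenAll(deps).ContinueWith(fun _ -> work f p)
    Task.WaitAll(nodes.Values |> Seq.toArray)

// Example: 3 files, 3 phases, as in the diagram below.
runGraph 3 3 (fun f p -> printfn "optimizing file %d, phase %d" f p)
```

With this rule at most `min(fileCount, phaseCount)` items can run at once, which matches the observation below that no more than 7 items are ever in flight.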
Diagram comparing sequential and partially parallel processing flow (example with 3 phases)
Timings for `FSharp.Compiler.Service`
Test run on an 8-core/16-thread CPU.
It's worth noting that no more than 7 items are processed at a time: since `file X+1, phase Y` cannot start until `file X, phase Y` has finished, at most one file can be in each of the 7 phases at any given moment.
Details
Sequential - `-optimize+`:
Parallel - `-optimize+`:
Sequential - `-optimize-`:
Parallel - `-optimize-`:
Traces showing how the phased optimization happens in sequential (left) and parallel (right) mode.
![image](https://user-images.githubusercontent.com/2478401/205182447-ecdc41de-a643-4d4f-9533-4916697c6670.png)
Note that on the right we start processing `file X, phase Y` before `file X-1, phase Y+1/Y+2` finishes, which is the source of the speedup.
TODOs:
- Potentially only use the flag in Release mode/when using `--optimize+` - as the test results demonstrate, significant gains are possible even with `--optimize-`, so I think the feature should always apply (when enabled).