
Parallel optimization - behind a test flag --test:ParallelOptimization #14390

Merged (70 commits) on May 19, 2023

Conversation

@safesparrow (Contributor) commented Nov 24, 2022

NOTE: The feature as it stands has a major issue caused by shared mutable state used and mutated by the optimization code.
See the comments in this PR and #14836 for details.

Summary

Release optimization for FSharp.Compiler.Service can be sped up by ~43% by making different phases run partially in parallel.

Compiler optimization can be parallelised in two ways:

  1. EncodeSignatureData and ApplyAllOptimizations steps are completely independent.
  2. In ApplyAllOptimizations, optimization of a given file in a given phase (1-7) depends on optimization results of all previous phases of that file, and on results of previous files for the same phase - but it doesn't depend on previous files' optimization for different phases.

In this PR we focus on 2.
We schedule individual file * phase work items as soon as possible.

This is mainly an optimization for Release compilation - in Debug most phases are very quick/a no-op, so there is little to parallelize.

Details of implementation

The old sequential code works in the following way:

  • Files are optimized sequentially. Each file's results consist of a few items, and are carried over to the next file's optimization.
  • Each file's optimization can be split into 7 segments/phases. These phases are arbitrary parts of the per-file optimization function that I extracted in a black-box way, without any knowledge of what they were doing, and instead purely based on their inputs and outputs. All that was relevant to perform this change was to know that phases 1-7 happen in order for each file.
  • The crucial observation is that for a given file, its phase 1 calculation does not depend on results of next segments of previous files, and so on for all phases. This observation can be directly concluded from observing the segments' code in ApplyAllOptimizations, their input and output variables. The only way for them to have such dependency would therefore be through static state. EDIT: However, this is not entirely true due to the same shared mutable state reused by multiple phases and multiple files. See comments for details.
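For reference, the sequential shape described above can be sketched as a nested loop in which each phase reads only that file's state from the previous phase plus the per-phase state carried over from earlier files. This is a hypothetical Python illustration (the actual compiler code is F#, and `phases`/`carry` are stand-ins for the real data flow):

```python
def optimize_sequentially(files, phases):
    """phases[p](state, carry) -> (state, carry).

    'state' is this file's result so far; 'carry[p]' is the state threaded
    through all files for phase p. Note phase p never reads carry[p+1] or
    carry[p-1] of a later file - which is exactly the independence the PR
    exploits.
    """
    carry = [None] * len(phases)
    results = []
    for file_input in files:
        state = file_input
        for p, phase in enumerate(phases):
            state, carry[p] = phase(state, carry[p])
        results.append(state)
    return results
```

Under this structure, file X's phase 1 clearly cannot observe file X-1's phase 2 output, since only `carry[1]` is threaded between files for phase 1.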

The changes in this PR can be summarised as follows:

  • The old sequential optimization code is refactored to distinguish seven optimization phases explicitly.
  • There is a new test flag in the compiler: --test:ParallelOptimization.
  • With the flag turned off, the refactored sequential code is used, which should behave exactly the same as the old code.
  • With the flag turned on, we schedule individual file * phase pairs independently, as soon as all their dependencies have finished processing.
  • The processing 'graph' consists of up to N * 7 nodes, where N is the number of source files.
  • As soon as processing of file X, phase Y has finished, we schedule file X, phase Y+1 for processing.
  • As soon as processing of file X, phase Y and file X+1, phase Y-1 has finished, we schedule file X+1, phase Y for processing.
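The scheduling rule in the last two bullets can be sketched as a small dependency-counting scheduler: cell (file, phase) becomes ready once (file, phase-1) and (file-1, phase) are both done. This is a hypothetical Python illustration, not the compiler's F# implementation; `run_phase` stands in for one file * phase work item:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def optimize_in_parallel(num_files, num_phases, run_phase):
    """Run run_phase(file, phase) for every cell, respecting the
    dependencies (file, phase-1) and (file-1, phase).
    Sketch only: assumes run_phase does not raise."""
    lock = threading.Lock()
    # Remaining prerequisite count per cell: 2 in the interior,
    # 1 on the first row/column, 0 for cell (0, 0).
    pending = {
        (f, p): (f > 0) + (p > 0)
        for f in range(num_files)
        for p in range(num_phases)
    }
    done = threading.Event()

    with ThreadPoolExecutor() as pool:
        def complete(f, p):
            run_phase(f, p)
            ready = []
            with lock:
                # Decrement both dependents; schedule any that became ready.
                for dep in ((f + 1, p), (f, p + 1)):
                    if dep in pending:
                        pending[dep] -= 1
                        if pending[dep] == 0:
                            ready.append(dep)
            for dep in ready:
                pool.submit(complete, *dep)
            # The last cell transitively depends on all others, so its
            # completion means the whole graph is finished.
            if (f, p) == (num_files - 1, num_phases - 1):
                done.set()

        pool.submit(complete, 0, 0)
        done.wait()
```

Each cell is scheduled exactly once (its counter reaches zero exactly once, under the lock), and file order is preserved per phase because (file-1, phase) is always a prerequisite of (file, phase).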

Diagram comparing sequential and partially parallel processing flow (example with 3 phases)


Timings for FSharp.Compiler.Service

Test run on an 8-core/16-thread CPU.
It's worth noting that no more than 7 items are processed at a time.
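The 7-item bound follows from the shape of the graph: the cells that can run simultaneously lie on an anti-diagonal file + phase = t, which never contains more than min(N, 7) cells. A quick sanity check (illustrative Python, not compiler code):

```python
def max_in_flight(num_files, num_phases=7):
    """Largest anti-diagonal of the num_files x num_phases dependency grid,
    i.e. the maximum number of cells that can ever be ready at once."""
    return max(
        sum(
            1
            for f in range(num_files)
            for p in range(num_phases)
            if f + p == t
        )
        for t in range(num_files + num_phases - 1)
    )
```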

| Optimize | Mode | Optimization Time | Total Time |
|---|---|---|---|
| + | Sequential | 12.9s | 31.0s |
| + | Parallel | 7.1s (-45%) | 25.5s (-18%) |
| - | Sequential | 5.6s | 24.3s |
| - | Parallel | 3.7s (-34%) | 23.6s (-3%) |
Details

Sequential - -optimize+:

Real: 23.4 Realdelta: 12.9 Cpu: 45.0 Cpudelta: 10.5 Mem: 1575 G0:  61 G1: 25 G2:  1 [Optimizations]
...
Real: 31.0 Realdelta:  2.6 Cpu: 62.4 Cpudelta:  2.8 Mem: 3018 G0:   1 G1:  0 G2:  0 [Write .NET Binary]

Parallel - -optimize+:

Real: 17.4 Realdelta:  7.1 Cpu: 45.2 Cpudelta: 10.7 Mem: 1956 G0:  49 G1: 21 G2:  1 [Optimizations]
...
Real: 25.5 Realdelta:  2.8 Cpu: 58.3 Cpudelta:  3.6 Mem: 3041 G0:   1 G1:  1 G2:  0 [Write .NET Binary]

Sequential - -optimize-:

Real: 15.8 Realdelta:  5.6 Cpu: 30.4 Cpudelta:  4.5 Mem: 1462 G0:  22 G1:  1 G2:  0 [Optimizations]
...
Real: 24.3 Realdelta:  3.5 Cpu: 39.5 Cpudelta:  2.9 Mem: 1975 G0:   7 G1:  4 G2:  0 [Write .NET Binary]

Parallel - -optimize-:

Real: 14.6 Realdelta:  3.7 Cpu: 33.5 Cpudelta:  4.0 Mem: 1564 G0:  21 G1:  3 G2:  0 [Optimizations]
...
Real: 23.6 Realdelta:  3.4 Cpu: 44.3 Cpudelta:  2.8 Mem: 2049 G0:   7 G1:  3 G2:  0 [Write .NET Binary]

Traces showing how the phased optimization happens in sequential (left) and parallel (right) mode.
Note that on the right we start processing file X, phase Y before file X-1, phase Y+1/Y+2 finishes, which is the source of speedup.

TODOs:

  • cleanup
  • Use proper async/NodeCode
  • Wire up DiagnosticsLogger to worker tasks correctly, if needed.
  • Revisit test flag name
  • Add test results
  • Simplify graph scheduling
  • Potentially restrict the flag to Release mode / --optimize+. However, as the test results demonstrate, significant gains are possible even with --optimize-, so I think the feature should always apply (when enabled).
  • Other?

@safesparrow marked this pull request as ready for review December 1, 2022 23:51
@safesparrow requested a review from a team as a code owner December 1, 2022 23:51

@safesparrow (Contributor, Author) commented Dec 1, 2022

I think the PR is now in a reasonable state.

The main thing I'm unsure of is these two points:

  • Use proper async/NodeCode
  • Wire up DiagnosticsLogger to worker tasks correctly, if needed.

I'm not entirely sure if they are relevant in the compilation case, and how to do it exactly, if they are.

@baronfel @vzarytovskii Would you be able to have a look at the PR and send over some suggestions? Thanks!

@safesparrow safesparrow changed the title [DRAFT] WIP - partially parallel release optimization Partially parallel release optimization - behind a test flag --test:PartiallyParallelOptimization Dec 2, 2022
@safesparrow safesparrow changed the title Partially parallel release optimization - behind a test flag --test:PartiallyParallelOptimization Partially parallel optimization - behind a test flag --test:PartiallyParallelOptimization Dec 2, 2022
Review thread on src/Compiler/Utilities/lib.fs (outdated, resolved)
@TIHan (Contributor) left a comment:

This is a very interesting change. If we could do these optimizations in parallel, it would definitely provide a significant compile-time perf boost.

When I was looking at parallelism over a year ago, I did look at the optimizer to see what could be done. I gave up pretty quickly because the order of files matters with respect to inlining functions, so I didn't look further. I would really, really check to make sure that a change like this doesn't affect it.

Another thing to note: you need to make sure that the compiler is still deterministic.

@TIHan (Contributor) commented Dec 2, 2022

> but it doesn't depend on previous files' optimization for different phases.

I'm not completely sure how true this is, but it may be plausible in certain cases.

@safesparrow (Contributor, Author) commented Dec 2, 2022

> but it doesn't depend on previous files' optimization for different phases.

This conclusion is implied by the code. If you look at the sequential code and follow its inputs and outputs, you should see that phase1 doesn't use any outputs from previous files' phase2, and phase2 doesn't use any outputs from previous files' phase3.

This is more easily seen in the refactored sequential code, which should be equivalent to the old version, unless I introduced a bug.

So my claim is that it is true in all cases, and that this is guaranteed by the code structure. Happy to be proven wrong, though.

> When I was looking at parallelism over a year ago, I did look at the optimizer to see what could be done. I gave up pretty quickly because the order of files matters with respect to inlining functions, so I didn't look further.

The order of files does matter, and this PR does maintain file order - that's why file X, phase Y depends on file X-1, phase Y.
It's just that file X, phase Y does not depend on file X-1, phase Y+1.

I added a chart illustrating this flow in the description.

@T-Gro (Member) commented May 9, 2023

I ran the test in an isolated loop over 200 times - not a single failure.
It also did not fail when run together with a few other tests.

I have a new suspicion which would make things horrible to troubleshoot:

  • This is not flakiness in terms of test execution, but rather in the self-built compiler.
  • Meaning, a bootstrapped compiler which is then used for testing either has a flaw or hasn't.
  • If it hasn't, the test obviously will not fail, no matter how many times it is repeated.

@nojaf (Contributor) commented May 10, 2023

The biggest problem is reliably reproducing the error.

Did you try adding some random Thread.Sleep calls? I found that to be useful when working on the typar problem.

@safesparrow (Contributor, Author) commented May 10, 2023

> This is not flakiness in terms of test execution, but rather in the self-built compiler

This would imply that the compiler from the SDK, used to build the bootstrap compiler, is not deterministic and can produce a buggy bootstrap compiler that then produces a buggy target compiler, which fails this particular test on this particular PR.

This seems very unlikely to me.

EDIT: Unless you mean that the bootstrap compiler is built deterministically from the SDK, but it in turn gives non-deterministic target compiler which either works or doesn't.

> My biggest suspect is the ArrayParallel module change and different behavior when it comes to threading, under load. When reverting that behavior and running the test in isolation, it never failed.

I never got a failed test when I ran it in isolation. So I don't think we can imply anything from the above.
Having said that, I'm happy to revert that change, push the branch, and do some testing locally.
It would be "funny" if it's that in the end 🙂

@T-Gro (Member) commented May 17, 2023

Finally I was able to create a methodology for investigating this issue.
Of the things that worked, this is a reproducible scenario:

  1. Create sibling tests next to 2766 which do the same thing, just with a different name. I did this to have 9 of them in total.
  2. Include them in Stress/env.lst.
  3. Comment out all the other tests, both in Stress/ and in the overall Source/test.lst, to run ONLY those stress tests and nothing else.
  4. In eng/Build.ps1, I played with the -procs parameter passed to the perl runner (runall.pl) to check what was happening.

This PR fails when multiple processes are involved AND the hosted compiler is used. It does not fail when a single process is used.

I have now reverted the change in the ArrayParallel module and am again trying various levels of process parallelism for the hosted compiler (1, 2, 4, 8), and the tests are passing. I kept all the other optimization-related code of this PR.

Yet to be verified by me: so far I have only verified a previous failure and now a success in an isolated scenario. I will run the full suite using the new code many times in a loop to observe empirically whether it ever fails.

@T-Gro (Member) commented May 17, 2023

Why is it that way?
The only reason I can think of is what MaxDegreeOfParallelism does when resources are tight.
I will research the inner workings more; very likely there are special cases/optimizations based on which it decides either to run the code on the current thread that requested the operation, or to use the thread pool for it.

Nevertheless, the fact that parsing behavior depends on the threading situation is wrong as well. I will check that further for usage of thread locals (or indirect/hidden shared data, such as usage of array pools).

@T-Gro (Member) commented May 17, 2023

The change is in. Locally, this does not fail for me in either the isolated case or the full .\Build.cmd -c Release -testFsharpQA -norestore.

I will restart CI on the server several times as well and watch it pass.

@T-Gro requested review from TIHan, dsyme and vzarytovskii May 17, 2023 17:23
@T-Gro (Member) commented May 17, 2023

/azp run

@azure-pipelines: Azure Pipelines successfully started running 2 pipeline(s).

@T-Gro (Member) commented May 17, 2023

Now it is stable in CI as well, and locally when running either the stress tests in isolation or the full fsharpQA test suite (in parallel).

The innocent refactoring of the ArrayParallel module meant that a different code path could be hit for the inner actions. And the Parallel class of .NET has optimized special cases for sometimes running actions synchronously rather than on the thread pool, which means that both the thread-local state and the remaining stack size can differ.

I do not think it is worth investigating this refactoring deeply enough to find out what happened in a degenerate scenario (a stack of 500 unfinished expressions, which is what the stress test was about), so I reverted it and kept the code as it was.
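The hazard described here - the same action sometimes running inline on the calling thread and sometimes on a pool thread, so any thread-local state it touches differs - is not specific to .NET's Parallel class. A minimal, hypothetical Python illustration of the same effect:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

counters = threading.local()  # per-thread state, analogous to ThreadLocal<T>

def work():
    # Each thread sees its own 'calls' counter, so the result depends on
    # *where* the scheduler chose to run this action.
    calls = getattr(counters, "calls", 0) + 1
    counters.calls = calls
    on_main = threading.current_thread() is threading.main_thread()
    return on_main, calls

inline_result = work()  # executed synchronously on the calling thread
with ThreadPoolExecutor(max_workers=1) as pool:
    pooled_result = pool.submit(work).result()  # executed on a pool thread
# The two results disagree about thread identity, and each thread has
# accumulated its own independent counter.
```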

The main body of this PR, the parallel optimization, is unaffected by this and is now ready to be re-reviewed & merged.

@vzarytovskii (Member) left a comment:

Blocking it to not forget to review it tomorrow.

@safesparrow (Contributor, Author) commented:

Can we merge it now, @vzarytovskii?
