cmd/compile: slow to compile large map literals containing dynamic elements #19751
@josharian here's a new issue originally described at #14082
Thanks, @myleshorton. The problematic file is a giant composite literal, a map whose elements contain maps. The compiler converts this into static initialization code in which each element in turn is initialized and then placed into the top-level map. It all ends up in one giant function. Almost all the time is spent in phi insertion, at least on tip.

The file is 300k lines long. 1k lines take 0.2s to compile. 10k lines take 2s. 50k lines take 30s. That's as far as I got. I didn't see any memory blow-up in liveness, and most memory is allocated in the usual places (nodes, stackalloc), although tip's liveness implementation is significantly different than 1.8's, so maybe that particular problem is fixed.

Here's one idea. We could break up giant composite literal initialization into multiple functions. Each function by itself would be much faster to compile, and as long as they don't get too granular, I doubt we'd lose much optimization opportunity or slow things down due to call overhead. The main problem here is probably implementation: coordinating compilation and calling of all the extra functions. Maybe a binary splitting approach until each chunk's size is manageable. This has the additional benefit of (theoretically!) allowing concurrent compilation of giant init functions.

This suggestion pains me a little, since I'm trying to figure out how to eliminate the convolution caused by the "drop everything and compile this function" thing we currently do all over the compiler, but maybe if I find a good general solution to the problem, this can fit into it.

Thoughts, @randall77 or @mdempsky?
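For illustration, here is a minimal sketch of the binary-splitting idea described above. The `Stmt` type, the `split` helper, and the `emit` callback are all hypothetical stand-ins, not the compiler's actual data structures:

```go
package main

import "fmt"

// Stmt stands in for one top-level initialization statement;
// it is a placeholder type for this sketch.
type Stmt string

// split recursively halves stmts until each chunk has at most max
// entries (max must be >= 1), then hands each chunk to emit, which
// would wrap it in its own generated function.
func split(stmts []Stmt, max int, emit func([]Stmt)) {
	if len(stmts) <= max {
		emit(stmts)
		return
	}
	mid := len(stmts) / 2
	split(stmts[:mid], max, emit)
	split(stmts[mid:], max, emit)
}

func main() {
	stmts := make([]Stmt, 10)
	split(stmts, 3, func(chunk []Stmt) {
		fmt.Println("emit one function with", len(chunk), "statements")
	})
}
```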
I took a stab at this but haven't gotten anything working yet. One sticking point is that the dynamic entries may refer to other local variables. Another sticking point is that setting up a new function manually is a pain.

In the meantime, @myleshorton, would you mind running an experiment? (I'd do it myself, but I don't see your template/generator code checked in anywhere obvious. Or just point me to it.) Instead of generating code like:

```go
var targetsDict = map[string]*Targets{
	"": &Targets{
		Plain: map[string]bool{
			"s.lair.io":     true,
			"xserver.*.jp":  true,
			"youtube.co.ke": true,
		},
	},
	"0": &Targets{
		Plain: map[string]bool{
			"bit.0.md": true,
		},
	},
	"0-60": &Targets{
		Plain: map[string]bool{
			"0-60.in": true,
		},
	},
	// ...
}
```

try generating:

```go
var targetsDict = map[string]*Targets{}

func init() {
	targetsDict[""] = &Targets{
		Plain: map[string]bool{
			"s.lair.io":     true,
			"xserver.*.jp":  true,
			"youtube.co.ke": true,
		},
	}
	targetsDict["0"] = &Targets{
		Plain: map[string]bool{
			"bit.0.md": true,
		},
	}
	// ...
}

func init() {
	targetsDict["0-60"] = &Targets{
		Plain: map[string]bool{
			"0-60.in": true,
		},
	}
	// ...
}
```

where each of the init functions contains (say) 100 or 1000 of the entries. Each of the functions should be called `init`.
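Since the offending file was itself generated with text/template (per the issue report at the bottom of this page), the chunking could live in the generator. A rough, hypothetical sketch follows; the `entry` type, the `Targets` shape, and the chunk size are assumptions for illustration, not @myleshorton's actual generator:

```go
package main

import (
	"fmt"
	"os"
	"text/template"
)

// entry is a hypothetical representation of one targetsDict element.
type entry struct {
	Key     string
	Domains []string
}

// chunkTmpl emits one init function covering a batch of entries.
const chunkTmpl = `func init() {
{{- range .}}
	targetsDict[{{printf "%q" .Key}}] = &Targets{Plain: map[string]bool{
{{- range .Domains}}
		{{printf "%q" .}}: true,
{{- end}}
	}}
{{- end}}
}
`

func main() {
	entries := []entry{
		{Key: "", Domains: []string{"s.lair.io", "xserver.*.jp", "youtube.co.ke"}},
		{Key: "0", Domains: []string{"bit.0.md"}},
		{Key: "0-60", Domains: []string{"0-60.in"}},
	}
	const chunkSize = 1000 // entries per init function; tune to taste
	t := template.Must(template.New("chunk").Parse(chunkTmpl))
	fmt.Println(`var targetsDict = map[string]*Targets{}`)
	// Emit one init function per chunk of entries.
	for i := 0; i < len(entries); i += chunkSize {
		end := i + chunkSize
		if end > len(entries) {
			end = len(entries)
		}
		if err := t.Execute(os.Stdout, entries[i:end]); err != nil {
			panic(err)
		}
	}
}
```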
I wrote a quick parser and did some manual editing and hacked together a version of targets.go that puts every single targetsDict assignment in its own init function. (This was easier for a quick hack than grouping them into 100s or 1000s.) This reduced the compilation time to 33 seconds on my machine. (It used to take 22 minutes to finally panic in plive.) So this approach has some promise.
I'd like to understand what isn't linear about the compilation, that makes breaking functions up worthwhile. I'd be much happier if we could find and fix that nonlinearity instead.
@randall77 I'm already looking into it. :)
CL https://golang.org/cl/39294 mentions this issue.
This triggers 119 times during make.bash. This CL reduces the time it takes for the compiler to panic while compiling the code in #19751 from 22 minutes to 15 minutes. Yay, I guess.

Updates #19751

Change-Id: I8ca7f1ae75f89d1eb2a361d67b3055a975221734
Reviewed-on: https://go-review.googlesource.com/39294
Run-TryBot: Josh Bleecher Snyder <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
CL https://golang.org/cl/39351 mentions this issue.
Instead of walking the list of nodes twice, once to find static entries to add to an array and once to find dynamic entries to generate code for, do the split once up front, into two slices. Then process each slice individually. This makes the code easier to read and, more importantly, easier to modify.

While we're here, add a TODO to avoid using temporaries for mapassign_fast calls. It's not an important TODO; the generated code would be basically identical. It would just avoid a minor amount of pointless SSA optimization work.

Passes toolstash-check. No measurable compiler performance impact.

Updates #19751

Change-Id: I84a8f2c22f9025c718ef34639059d7bd02a3c406
Reviewed-on: https://go-review.googlesource.com/39351
Run-TryBot: Josh Bleecher Snyder <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
OK, after the previous commits are in, the compiler panics much faster. Yay! The panic occurs because there are lots of basic blocks, and we try to bulk allocate a slice with length longer than an int32 can hold. Eep. I'll send a CL soon to at least turn that into a Fatal instead of a panic.

The basic blocks in this case all come from branches from checking whether runtime.writeBarrier is set. I see lots of possible fixes, but the one I find most appealing is asserting (and ensuring!) that runtime.writeBarrier is off during init. (Other fixes include more gracefully handling lots of blocks in liveness, or following my earlier suggestion of breaking up giant init functions.)

@randall77 @cherrymui opinions about this?
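For context, here is a hand-written approximation of what a single pointer store expands to once the write barrier pass runs; each such check ends the current basic block and adds two more. The flag and helper names below are illustrative stubs, not the runtime's exact identifiers:

```go
package main

// writeBarrierEnabled and writebarrierptr stand in for the runtime's
// internal flag and write barrier routine; both names are assumptions.
var writeBarrierEnabled bool

func writebarrierptr(dst **int, ptr *int) { *dst = ptr }

// store approximates the compiler's expansion of `*dst = ptr`.
func store(dst **int, ptr *int) {
	if writeBarrierEnabled { // this check ends the current SSA block...
		writebarrierptr(dst, ptr) // ...one successor calls the barrier
	} else {
		*dst = ptr // ...the other does a plain store
	}
	// The two paths rejoin in a third block. With thousands of stores
	// in one giant init function, the block count explodes.
}

func main() {
	var x *int
	v := 42
	store(&x, &v)
}
```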
CL https://golang.org/cl/39354 mentions this issue.
We can't assume write barriers are off during init. What part of the compiler is doing the bulk allocating? The actual block ID allocator will panic if we try to allocate an ID that's too big. Or is something else trying to preallocate and gets too aggressive? Sounds like we should put some more effort into optimizing phi insertion, at least.
When a map is small, it's not worth putting the contents in an array and then looping over the array. Just generate code instead. This makes smaller binaries. It might also be better for cache lines. It also avoids adding control flow in the middle of the init function, which can be very large. Eliminating this source of extra blocks makes phi insertion easier for temp-heavy init functions.

This reduces the time required for the compiler to panic while compiling the code in #19751 from 15 minutes to 45 seconds.

The cutoff of 25 was chosen fairly unscientifically by looking at the size of cmd/go:

Cutoff of 0:   10689604
Cutoff of 5:   10683572
Cutoff of 15:  10682324
Cutoff of 25:  10681700
Cutoff of 50:  10685476
Cutoff of 100: 10689412

There are probably more sophisticated mechanisms available. For example, the smaller the key/value sizes, the better generated code will be vs a table. Nevertheless this is simple and seems like a good start.

Updates #19751

name      old time/op   new time/op   delta
Template  204ms ± 6%    202ms ± 5%    -0.78%  (p=0.027 n=47+45)
Unicode   84.8ms ± 6%   85.2ms ± 7%   ~       (p=0.146 n=46+45)
GoTypes   551ms ± 2%    556ms ± 3%    +0.76%  (p=0.004 n=43+45)
SSA       3.93s ± 3%    3.95s ± 4%    ~       (p=0.179 n=50+49)
Flate     123ms ± 4%    123ms ± 5%    ~       (p=0.201 n=47+49)
GoParser  145ms ± 3%    145ms ± 4%    ~       (p=0.937 n=50+50)
Reflect   356ms ± 3%    354ms ± 5%    -0.44%  (p=0.048 n=46+50)
Tar       107ms ± 6%    106ms ± 6%    ~       (p=0.188 n=50+49)
XML       201ms ± 4%    200ms ± 4%    ~       (p=0.085 n=50+49)

name      old user-ns/op  new user-ns/op  delta
Template  252M ± 9%       250M ± 7%       ~       (p=0.206 n=49+47)
Unicode   106M ± 7%       106M ± 9%       ~       (p=0.331 n=47+46)
GoTypes   724M ± 5%       729M ± 5%       ~       (p=0.160 n=47+49)
SSA       5.64G ± 2%      5.62G ± 4%      ~       (p=0.148 n=47+50)
Flate     147M ± 6%       147M ± 5%       ~       (p=0.466 n=50+49)
GoParser  179M ± 5%       179M ± 6%       ~       (p=0.584 n=50+49)
Reflect   448M ± 6%       441M ± 8%       -1.39%  (p=0.027 n=50+49)
Tar       124M ± 6%       123M ± 5%       ~       (p=0.221 n=50+47)
XML       244M ± 5%       243M ± 4%       ~       (p=0.275 n=49+49)

name      old alloc/op  new alloc/op  delta
Template  39.9MB ± 0%   39.4MB ± 0%   -1.28%  (p=0.008 n=5+5)
Unicode   29.8MB ± 0%   29.8MB ± 0%   ~       (p=0.310 n=5+5)
GoTypes   113MB ± 0%    113MB ± 0%    ~       (p=0.421 n=5+5)
SSA       854MB ± 0%    854MB ± 0%    ~       (p=0.151 n=5+5)
Flate     25.3MB ± 0%   25.3MB ± 0%   ~       (p=1.000 n=5+5)
GoParser  31.8MB ± 0%   31.8MB ± 0%   ~       (p=0.222 n=5+5)
Reflect   78.2MB ± 0%   78.2MB ± 0%   ~       (p=1.000 n=5+5)
Tar       26.7MB ± 0%   26.7MB ± 0%   ~       (p=0.841 n=5+5)
XML       42.3MB ± 0%   42.3MB ± 0%   -0.15%  (p=0.008 n=5+5)

name      old allocs/op  new allocs/op  delta
Template  390k ± 1%      386k ± 1%      -1.05%  (p=0.016 n=5+5)
Unicode   319k ± 0%      320k ± 0%      ~       (p=0.310 n=5+5)
GoTypes   1.14M ± 0%     1.14M ± 0%     ~       (p=0.421 n=5+5)
SSA       7.60M ± 0%     7.59M ± 0%     ~       (p=0.310 n=5+5)
Flate     234k ± 0%      235k ± 1%      ~       (p=1.000 n=5+5)
GoParser  315k ± 1%      317k ± 0%      ~       (p=0.151 n=5+5)
Reflect   978k ± 0%      978k ± 0%      ~       (p=0.841 n=5+5)
Tar       251k ± 1%      251k ± 1%      ~       (p=0.690 n=5+5)
XML       394k ± 0%      392k ± 0%      ~       (p=0.056 n=5+5)

Change-Id: Ic53a18627082abe075a1cbc33330ce015e50850a
Reviewed-on: https://go-review.googlesource.com/39354
Run-TryBot: Josh Bleecher Snyder <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
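To illustrate the strategy choice in this CL, here is a hand-written analogy (not the compiler's actual output): below the cutoff, straight-line assignments; above it, static parallel arrays plus a loop.

```go
package main

import "fmt"

func main() {
	// Small literal (below the cutoff): straight-line assignments,
	// no extra control flow in the enclosing init function.
	small := make(map[string]int, 2)
	small["a"] = 1
	small["b"] = 2

	// Large literal (above the cutoff): static parallel arrays and a
	// loop, which yields smaller code but inserts a loop (extra basic
	// blocks) into the middle of the init function.
	keys := [...]string{"a", "b", "c" /* ... many more ... */}
	vals := [...]int{1, 2, 3 /* ... */}
	large := make(map[string]int, len(keys))
	for i := range keys {
		large[keys[i]] = vals[i]
	}
	fmt.Println(small, large)
}
```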
Could we do it during autogenerated inits? IIRC those all complete before user-provided inits run, or at least, they could probably be made to. But I guess maybe that's overkill for this problem.
It's the call to bvbulkalloc in newliveness. We're calling it with nbit=48645 and count=2074940, so nword=1521 and nword*count=3155983740, leading to an allocation request whose length doesn't fit in an int32. I think it might take some significant refactoring to avoid this. It might be better and easier to try to split up large init functions, as proposed above. Note that there is some history of doing this manually: https://codereview.appspot.com/72590045. I'd say we could just keep asking users to split them up manually, since any init that big must be autogenerated, but splitting them up also has the benefit of allowing them to be compiled concurrently.
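As a quick check of the numbers quoted above (assuming the layout implied by the description: 32-bit words, nword words per bitvector, nword*count words allocated in bulk):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	nbit, count := 48645, 2074940
	nword := (nbit + 31) / 32             // = 1521 words per bitvector
	total := int64(nword) * int64(count)  // words for all bitvectors at once
	fmt.Println(total)                    // 3155983740
	fmt.Println(total > math.MaxInt32)    // true: doesn't fit in an int32
}
```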
That's definitely true in general, but the CLs that I've submitted so far mostly solve the phi problem, at least for this particular bit of code.
CL https://golang.org/cl/39358 mentions this issue.
This provides better diagnostics when it occurs.

Updates #19751

Change-Id: I87db54c22e1345891b418c1741dc76ac5fb8ed00
Reviewed-on: https://go-review.googlesource.com/39358
Run-TryBot: Josh Bleecher Snyder <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
Another idea: have static initialization track and reuse autotmps, much as order.go does. Will investigate.
Yes, the write barrier pass creates many blocks. I filed issue #19838 for grouping write barriers more aggressively. But it actually doesn't help here: there are calls (newobject, mapassign_faststr, makemap) between each write barrier.
When I tried to compile this program with
Challenge with this: We flip back and forth between sinit and walk, so it's hard to maintain state cleanly. Probably still worth doing, but it's going to require some significant refactoring.
I hacked this in with some globals. There's lots of autotmp reuse, but (of course) the exact variable in use disappears by the time we reach liveness anyway, so it doesn't help with the liveness issue. And it doesn't help as much as you'd hope with the speed of getting to the panic. So for now, I think that is a dead end.
On a second look, I think it would. Though there are calls between each writebarrier-enabled check, calls don't create a new SSA block. Basically all the blocks in this function come from write barriers, and the size of the problematic alloc is directly proportional to the number of blocks. That's not to say that if we fix this there wouldn't also be problems with the size of the generated stackmap. But it's possible those could be fruitfully addressed (for all functions, not just this one) by doing liveness compaction as we go, rather than all at the end. Sample SSA output for a random chunk from the middle of this function, at the point that it arrives at liveness:
(sample SSA output omitted)
CL https://golang.org/cl/41192 mentions this issue.
When using a concurrent backend, the overall compilation time is bounded in part by the slowest function to compile. The number of top-level statements in a function is an easily calculated and fairly reliable proxy for compilation time.

Here's a standard compilecmp output for -c=8 with this CL:

name      old time/op   new time/op   delta
Template  127ms ± 4%    125ms ± 6%    -1.33%  (p=0.000 n=47+50)
Unicode   84.8ms ± 4%   84.5ms ± 4%   ~       (p=0.217 n=49+49)
GoTypes   289ms ± 3%    287ms ± 3%    -0.78%  (p=0.002 n=48+50)
Compiler  1.36s ± 3%    1.34s ± 2%    -1.29%  (p=0.000 n=49+47)
SSA       2.95s ± 3%    2.77s ± 4%    -6.23%  (p=0.000 n=50+49)
Flate     70.7ms ± 3%   70.9ms ± 2%   ~       (p=0.112 n=50+49)
GoParser  85.0ms ± 3%   83.0ms ± 4%   -2.31%  (p=0.000 n=48+49)
Reflect   229ms ± 3%    225ms ± 4%    -1.83%  (p=0.000 n=49+49)
Tar       70.2ms ± 3%   69.4ms ± 3%   -1.17%  (p=0.000 n=49+49)
XML       115ms ± 7%    114ms ± 6%    ~       (p=0.158 n=49+47)

name      old user-time/op  new user-time/op  delta
Template  352ms ± 5%        342ms ± 8%        -2.74%  (p=0.000 n=49+50)
Unicode   117ms ± 5%        118ms ± 4%        +0.88%  (p=0.005 n=46+48)
GoTypes   986ms ± 3%        980ms ± 4%        ~       (p=0.110 n=46+48)
Compiler  4.39s ± 2%        4.43s ± 4%        +0.97%  (p=0.002 n=50+50)
SSA       12.0s ± 2%        13.3s ± 3%        +11.33% (p=0.000 n=49+49)
Flate     222ms ± 5%        219ms ± 6%        -1.56%  (p=0.002 n=50+50)
GoParser  271ms ± 5%        268ms ± 4%        -0.83%  (p=0.036 n=49+48)
Reflect   560ms ± 4%        571ms ± 3%        +1.90%  (p=0.000 n=50+49)
Tar       183ms ± 3%        183ms ± 3%        ~       (p=0.903 n=45+50)
XML       364ms ±13%        391ms ± 4%        +7.16%  (p=0.000 n=50+40)

A more interesting way of viewing the data is by looking at the ratio of the time taken to compile the slowest-to-compile function to the overall time spent compiling functions. If this ratio is small (near 0), then increased concurrency might help. If this ratio is big (near 1), then we're bounded by that single function.

I instrumented the compiler to emit this ratio per-package, ran 'go build -a -gcflags=-c=C -p=P std cmd' three times, for varying values of C and P, and collected the ratios encountered into an ASCII histogram.

Here's c=1 p=1, which is a non-concurrent backend, single process at a time:

 90%|
 80%|
 70%|
 60%|
 50%|
 40%|
 30%|
 20%|**
 10%|***
  0%|*********
----+----------
    |0123456789

The x-axis is floor(10*ratio), so the first column indicates the percent of ratios that fell in the 0% to 9.9999% range. We can see in this histogram that more concurrency will help; in most cases, the ratio is small.

Here's c=8 p=1, before this CL:

 90%|
 80%|
 70%|
 60%|
 50%|
 40%|
 30%| *
 20%| *
 10%|* * *
  0%|**********
----+----------
    |0123456789

In 30-40% of cases, we're mostly bound by the compilation time of a single function.

Here's c=8 p=1, after this CL:

 90%|
 80%|
 70%|
 60%|
 50%| *
 40%| *
 30%| *
 20%| *
 10%| *
  0%|**********
----+----------
    |0123456789

The sorting pays off; we are bound by the compilation time of a single function in over half of packages. The single * in the histogram indicates 0-10%. The actual values for this chart are: 0: 5%, 1: 1%, 2: 1%, 3: 4%, 4: 5%, 5: 7%, 6: 7%, 7: 7%, 8: 9%, 9: 55%.

This indicates that efforts to increase or enable more concurrency, e.g. by optimizing mutexes or increasing the value of c, will probably not yield fruit. That matches what compilecmp tells us. Further optimization efforts should thus focus instead on one of:

(1) making more functions compile concurrently
(2) improving the compilation time of the slowest functions
(3) speeding up the remaining serial parts of the compiler
(4) automatically splitting up some large autogenerated functions into small ones, as discussed in #19751

I hope to spend more time on (1) before the freeze.

Adding process parallelism doesn't change the story much. For example, here's c=8 p=8, after this CL:

 90%|
 80%|
 70%|
 60%|
 50%|
 40%| *
 30%| *
 20%| *
 10%| ***
  0%|**********
----+----------
    |0123456789

Since we don't need to worry much about p, these histograms can help us select a good general value of c to use as a default, assuming we're not bounded by GOMAXPROCS. Here are some charts after this CL, for c from 1 to 8:

c=1 p=1

 90%|
 80%|
 70%|
 60%|
 50%|
 40%|
 30%|
 20%|**
 10%|***
  0%|*********
----+----------
    |0123456789

c=2 p=1

 90%|
 80%|
 70%|
 60%|
 50%|
 40%|
 30%|
 20%|
 10%| **** *
  0%|**********
----+----------
    |0123456789

c=3 p=1

 90%|
 80%|
 70%|
 60%|
 50%|
 40%|
 30%|
 20%| *
 10%| ** * *
  0%|**********
----+----------
    |0123456789

c=4 p=1

 90%|
 80%|
 70%|
 60%|
 50%|
 40%|
 30%| *
 20%| *
 10%| * *
  0%|**********
----+----------
    |0123456789

c=5 p=1

 90%|
 80%|
 70%|
 60%|
 50%|
 40%|
 30%| *
 20%| *
 10%| * *
  0%|**********
----+----------
    |0123456789

c=6 p=1

 90%|
 80%|
 70%|
 60%|
 50%|
 40%| *
 30%| *
 20%| *
 10%| *
  0%|**********
----+----------
    |0123456789

c=7 p=1

 90%|
 80%|
 70%|
 60%|
 50%| *
 40%| *
 30%| *
 20%| *
 10%| **
  0%|**********
----+----------
    |0123456789

c=8 p=1

 90%|
 80%|
 70%|
 60%|
 50%| *
 40%| *
 30%| *
 20%| *
 10%| *
  0%|**********
----+----------
    |0123456789

Given the increased user-CPU costs as c increases, it looks like c=4 is probably the sweet spot, at least for now. Pleasingly, this matches (and explains) the results of the standard benchmarking that I have done.

Updates #15756

Change-Id: I82b606c06efd34a5dbd1afdbcf66a605905b2aeb
Reviewed-on: https://go-review.googlesource.com/41192
Run-TryBot: Josh Bleecher Snyder <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Robert Griesemer <[email protected]>
Reviewed-by: Matthew Dempsky <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
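Here is a minimal sketch of the scheduling idea in the commit message above; the `fn` type and the queue are hypothetical (the real compiler sorts its own internal function list), but the heuristic is the one described: compile the functions with the most top-level statements first, so the likely-slowest function starts as early as possible.

```go
package main

import (
	"fmt"
	"sort"
)

// fn is a stand-in for a function awaiting compilation.
type fn struct {
	name  string
	stmts int // top-level statement count, a proxy for compile time
}

func main() {
	queue := []fn{{"a", 3}, {"giantInit", 5000}, {"b", 40}}
	// Largest first: the slowest function overlaps with the rest of
	// the package instead of running alone at the end.
	sort.Slice(queue, func(i, j int) bool {
		return queue[i].stmts > queue[j].stmts
	})
	fmt.Println(queue) // [{giantInit 5000} {b 40} {a 3}]
}
```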
A possible alternative to splitting these init functions up (which I have tried and failed at multiple times) is to prevent generating some of these problematic write barrier calls: avoid the makemap calls by emitting some static empty map buckets to fill in.
CL https://golang.org/cl/42178 mentions this issue.
DO NOT SUBMIT

[I am not sure the best way to identify whether a particular OMAPLIT is safe for a static hmap; see the TODO in the CL]

Prior to this CL, we did code generation for a map literal like this:

	m := make(maptype, number of elements)
	m[k] = v // for all elements, sometimes using a loop

The assignment to m generates a write barrier. Each write barrier generates a new SSA block. In giant composite literals, this can lead to lots of blocks. This led to a panic when compiling the code in golang#19751.

This CL changes code generation when compiling a static composite literal to do:

	var stat hmap // static symbol
	hp := &stat
	m := (map)(hp)
	initmap(hp, number of elements)
	m[k] = v

This involves no assignments, and thus no write barriers. As a result, the code in golang#19751 compiles. Slowly still, but the compilation completes.

Change-Id: Ic0c05f5892e50fadc5e6d6adb5a3f5275ec38224
Looks like CL 42178 is unlikely to make 1.9. Moving to 1.10.
What version of Go are you using (`go version`)?

go version go1.8 darwin/amd64

What operating system and processor architecture are you using (`go env`)?

What did you do?

Used `text/template` to generate a very large Go file to embed data, and then just ran `go build`.

What did you expect to see?

A successfully built binary.

What did you see instead?