runtime: defer is slow #14939
One way to improve defer for such simple cases is to …
Some prior discussion at #6980.
I also had a CL recently where a single defer would have been the preferred solution but was not usable because of a 65ns performance hit (CL 20512). So improving simple cases like a single defer in some branch (which might even call no other methods, or only a select few) would have helped there.
Given the current cost of defers, I think it's acceptable to have two defer allocation mechanisms if that addresses the problem. And I actually don't think the runtime side of this is very complicated. This is on the list for 1.8. I'm planning to either do it myself or get someone else to do it. :) If it turns out we need to simplify things, we could limit this to defers with at most one (possibly implicit) argument, which would handle the cgo case as well as the mutex unlock case. Another possible simplification would be to only stack-allocate defers in functions where all defers can be stack allocated, which would probably simplify creating an efficient prologue.
Separate from stack allocation, we should also consider special-casing defers with no arguments, as that is a fairly common case (about half of the defer calls in the standard library). Because the no-argument case doesn't have to worry about the arguments on the stack, it can use a simpler version of deferproc, one that doesn't need to call systemstack.
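For concreteness, the two "simple" shapes being discussed look like this (a minimal illustrative snippet, not taken from the thread):

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu    sync.Mutex
	count int
)

// Single (implicit) argument: the deferred call's only argument is the
// receiver mu, recorded when the defer statement executes.
func increment() int {
	mu.Lock()
	defer mu.Unlock()
	count++
	return count
}

// Zero arguments: nothing needs to be copied into the defer record at all,
// the case suggested above for a simpler deferproc that skips systemstack.
func reset() {
	defer finish()
	count = 0
}

func finish() { fmt.Println("reset done") }

func main() {
	increment()
	reset()
}
```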
Apparently defer is not free: golang/go#14939
@bmizerany sent https://golang.org/cl/29379 to replace a defer with explicit mutex unlocks in x/time/rate. Consider reverting that optimization if this more general optimization goes in.
/cc @dr2chase, who I've been talking with about compiler changes to support this.
@dr2chase and I have been discussing an alternate approach to this that's somewhat less general than stack-allocated defers but should have essentially zero overhead versus a function call when it applies. The idea is to take a page out of the C++ exception handling book, open code the defer execution, and use a PC value table to figure out where the defer code is when handling a panic.

Specifically, if the set of defers can be statically determined at every PC in a function, then the compiler would turn those defers into closures built on the stack and generate code at every exit point to directly call the appropriate defer closures for that exit point. In the common case, then, there would be no deferreturn logic at all, and the defer execution would look essentially like hand-expanding the defer (like what CL 29379 did).

To keep this working with panics, the compiler would generate a PC value table for every function where this optimization applies that logically records, for every PC, where in the stack frame to find the set of defer closures to run for that PC. The actual encoding of this table could be quite compact in general, since large runs of PCs will have the same set of defer closures, and we could encode the tree-like structure of the set of defers to run directly in this table, so each entry would contain at most one defer closure offset and the PC to use to look up the next defer closure offset.

When panic walks the stack, it would keep an eye on both this PC value table and the defer stack. A given frame could have either defers on the stack or a defer offset PC value table, but not both. If a frame has a defer offset PC value table, panic would use the table to find the defer closures and call them.
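A minimal sketch of what open coding a defer means on the normal (non-panic) path, in the spirit of the hand-expansion in CL 29379 (hypothetical functions, not actual compiler output):

```go
package demo

import "sync"

// With defer, as written by the programmer:
func lookup(m map[string]int, key string, mu *sync.Mutex) (int, bool) {
	mu.Lock()
	defer mu.Unlock()
	v, ok := m[key]
	return v, ok
}

// The open-coded form the compiler would effectively generate: the deferred
// call is emitted directly at each exit point, so there is no deferreturn
// logic at all on the ordinary return path.
func lookupOpenCoded(m map[string]int, key string, mu *sync.Mutex) (int, bool) {
	mu.Lock()
	v, ok := m[key]
	if !ok {
		mu.Unlock() // defer body expanded at this exit point
		return 0, false
	}
	mu.Unlock() // and at this one
	return v, ok
}
```

The panic path is the part that still needs runtime support: the proposed PC value table tells the panic machinery where on the stack to find these closures.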
This is related to golang/go#14939. It's probably the only occurrence where it matters.
Will it handle this case?

    defer f0()
    if f1() {
        defer f2()
    }
    f3() // panics here

At f3, the set of deferred functions to execute cannot be statically determined. C++ exception handling doesn't need to handle this case, because object destruction is also block scoped.
I don't quite see how your proposal would correctly handle the case of a panic that occurs while executing a deferred function. At that point some of the deferred functions have been executed and some have not, but what is the PC value you will use to determine which remain to be executed?
A similar approach that might work would be to record a linked list of stack allocated deferred closures in the stack frame. You could even choose to always start the list from the same place in the stack frame, just below the return address/frame pointer, so you would only need a single bit in the traceback information. Then each time you defer a function, you update the linked list to point to the new stack allocated closure. At the end of the function, you remove each closure from the list as you execute it.
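A rough sketch of the data structure this proposal implies (hypothetical types and helpers; the runtime's real defer record is more involved):

```go
package demo

// deferRecord is a hypothetical stack-allocated defer entry.
type deferRecord struct {
	fn   func()       // the deferred closure
	link *deferRecord // next (earlier) defer in the same frame
}

// pushDefer is, conceptually, what each defer statement compiles to:
// link a stack-allocated record onto the frame's list head.
func pushDefer(head **deferRecord, rec *deferRecord, fn func()) {
	rec.fn = fn
	rec.link = *head
	*head = rec
}

// runDefers is the function-exit sequence. Each record is unlinked
// *before* its closure runs, so a panic inside fn leaves exactly the
// not-yet-run defers on the list for the panic machinery to find --
// which answers the "panic inside a deferred function" question above.
func runDefers(head **deferRecord) {
	for *head != nil {
		rec := *head
		*head = rec.link
		rec.fn()
	}
}
```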
@minux - no. If there's any point in the generated code (which is not the same as the input program) that two different defer sets can reach, then the PC-range technique won't work. Unless, say, we record which defers need running by storing into a bitmask of defers-to-(not-)run; this would also handle Ian's problem. Store a zero byte on entry, store a different value on the branches to the common area that don't enable all the defers. The PC range has to mention the location of the which-defers-not-to-run byte, of course.
That's a good question. If you're simply in the function epilogue executing local defers and one of them panics, then the PC in the epilogue tells you what's left. However, if you're running defers because of a panic and one of them panics, then you're right that things get more complicated. One possibility would be that when a panic happens, before running any defers, you walk the stack to find any frames that have PC value-based defers and weave those into the linked list of defers at the right points. Then panic just works from the linked list. If a second panic happens while walking this linked list, we know that we're already handling a panic, so it wouldn't rebuild the list, it would just keep going from the current list.
Possibly. I was hoping to avoid the overhead of even dealing with a linked list in the regular return case. If I understand your proposal, it seems like the cost at function exit wouldn't be substantially different from the cost of …
I am imagining that given […]
the defer statements would be compiled as […]
and the function exit sequence would be compiled as […]
So I don't think the performance would be like that of …
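The three code snippets in this comment did not survive extraction. A plausible reconstruction of the scheme being described, with hypothetical flag names and stub functions:

```go
package demo

func f0()      {}
func f1() bool { return true }
func f2()      {}
func f3()      {}

// Source form:
//	defer f0()
//	if f1() {
//		defer f2()
//	}
//	f3()
func body() {
	// Each defer statement compiles to a flag store (plus saving any
	// call arguments to fixed stack slots, omitted here).
	deferredF0 := true
	deferredF2 := false
	if f1() {
		deferredF2 = true
	}

	f3()

	// The exit sequence tests the flags in reverse defer order and calls
	// the deferred functions directly -- a couple of compares rather than
	// a trip through deferproc/deferreturn.
	if deferredF2 {
		f2()
	}
	if deferredF0 {
		f0()
	}
}

var _ = body
```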
I see. I thought you were saying the function return would use the linked list (rather than just unwinding it for the benefit of a panic). I imagine that would perform well.
Change https://golang.org/cl/190098 mentions this issue:
…ode and extra funcdata

Generate inline code at defer time to save the args of defer calls to unique (autotmp) stack slots, and generate inline code at exit time to check which defer calls were made and make the associated function/method/interface calls. We remember that a particular defer statement was reached by storing in the deferBits variable (always stored on the stack). At exit time, we check the bits of the deferBits variable to determine which defer function calls to make (in reverse order).

These low-cost defers are only used for functions where no defers appear in loops. In addition, we don't do these low-cost defers if there are too many defer statements or too many exits in a function (to limit code increase).

When a function uses open-coded defers, we produce extra FUNCDATA_OpenCodedDeferInfo information that specifies the number of defers, and for each defer, the stack slots where the closure and associated args have been stored. The funcdata also includes the location of the deferBits variable. Therefore, for panics, we can use this funcdata to determine exactly which defers are active, and call the appropriate functions/methods/closures with the correct arguments for each active defer.

In order to unwind the stack correctly after a recover(), we need to add an extra code segment to functions with open-coded defers that simply calls deferreturn() and returns. This segment is not reachable by the normal function, but is returned to by the runtime during recovery. We set the liveness information of this deferreturn() to be the same as the liveness at the first function call during the last defer exit code (so all return values and all stack slots needed by the defer calls will be live).

I needed to increase the stackguard constant from 880 to 896, because of a small amount of new code in deferreturn().

The -N flag disables open-coded defers. '-d defer' prints out the kind of defer being used at each defer statement (heap-allocated, stack-allocated, or open-coded).

Cost of defer statement  [ go test -run NONE -bench BenchmarkDefer$ runtime ]
  With normal (stack-allocated) defers only:           35.4 ns/op
  With open-coded defers:                               5.6 ns/op
  Cost of function call alone (remove defer keyword):   4.4 ns/op

Text size increase (including funcdata) for go cmd without/with open-coded defers: 0.09%
The average size increase (including funcdata) for only the functions that use open-coded defers is 1.1%.

The cost of a panic followed by a recover got noticeably slower, since panic processing now requires a scan of the stack for open-coded defer frames. This scan is required, even if no frames are using open-coded defers:

Cost of panic and recover [ go test -run NONE -bench BenchmarkPanicRecover runtime ]
  Without open-coded defers:   62.0 ns/op
  With open-coded defers:     255   ns/op

A CGO Go-to-C-to-Go benchmark got noticeably faster because of open-coded defers:

CGO Go-to-C-to-Go benchmark [ cd misc/cgo/test; go test -run NONE -bench BenchmarkCGoCallback ]
  Without open-coded defers:  443 ns/op
  With open-coded defers:     347 ns/op

Updates #14939 (defer performance)
Updates #34481 (design doc)

Change-Id: I51a389860b9676cfa1b84722f5fb84d3c4ee9e28
Reviewed-on: https://go-review.googlesource.com/c/go/+/190098
Reviewed-by: Austin Clements <[email protected]>
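To see which strategy the compiler picked for each defer site, the commit's '-d defer' debug flag can be passed through -gcflags. The exact command-line spelling below is an assumption; the flag itself and the benchmark command are from the commit message:

```
go build -gcflags='-d defer' ./mypkg              # prints heap-allocated / stack-allocated / open-coded per defer
go test -run NONE -bench 'BenchmarkDefer$' runtime   # the defer benchmark quoted above
```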
@rhysh Go 1.13 improved defer performance already by allocating defer records on the stack. And we have just checked into the main tree (for release in Go 1.14) a bigger change https://golang.org/cl/190098 to make defer calls directly (inline) at normal exits. This should reduce overhead even more significantly in many cases. So, if you are still seeing defer overheads for your server, it would be great to see whether the overheads have gone down with Go 1.13 (if you haven't already upgraded) or with the changes in the main tree (or with the beta release of Go 1.14 in early November).
Change https://golang.org/cl/202340 mentions this issue:
@danscales, the results for https://golang.org/cl/202340 look great: it eliminates about 80% of […]. For go1.12.12, go1.13.3, and be64a19, I counted profile samples that have […]. Zooming out to the application's total CPU spend on defer-related functions, more than 90% of the samples are caused by a single use of defer "in" a loop in […]. Thank you!
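One way to reproduce this kind of count (an assumed invocation, not the commenter's exact workflow; the grep list names the runtime's defer entry points):

```
go tool pprof -top -nodefraction=0 cpu.prof | grep -E 'deferproc|deferreturn|newdefer|freedefer'
```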
Change https://golang.org/cl/203481 mentions this issue:
How’s that defer “in a loop”? It does look like it from a source-code point of view, but in SSA it is in a block which is no more “part of the loop” than the first block after the loop in source-code order. It sounds like the loop detection made by the defer optimization is slightly off.
@rasky the loop detection occurs during escape analysis, not SSA.
Rhys Hiltner noted in #14939 that this defer was syntactically inside a loop, but was only ever executed once. Now that defer in a loop is significantly slower, pull this one out.

name                                     old time/op  new time/op  delta
Throughput/MaxPacket/1MB/TLSv12-8        3.94ms ± 8%  3.93ms ±13%     ~     (p=0.967 n=15+15)
Throughput/MaxPacket/1MB/TLSv13-8        4.33ms ± 3%  4.51ms ± 7%  +4.00%  (p=0.000 n=14+14)
Throughput/MaxPacket/2MB/TLSv12-8        6.80ms ± 6%  7.01ms ± 4%  +3.15%  (p=0.000 n=14+14)
Throughput/MaxPacket/2MB/TLSv13-8        6.96ms ± 5%  6.80ms ± 5%  -2.43%  (p=0.006 n=15+14)
Throughput/MaxPacket/4MB/TLSv12-8        12.0ms ± 3%  11.7ms ± 2%  -2.88%  (p=0.000 n=15+13)
Throughput/MaxPacket/4MB/TLSv13-8        12.1ms ± 3%  11.7ms ± 2%  -3.54%  (p=0.000 n=13+13)
Throughput/MaxPacket/8MB/TLSv12-8        22.2ms ± 3%  21.6ms ± 3%  -2.97%  (p=0.000 n=15+15)
Throughput/MaxPacket/8MB/TLSv13-8        22.5ms ± 5%  22.0ms ± 3%  -2.34%  (p=0.004 n=15+15)
Throughput/MaxPacket/16MB/TLSv12-8       42.4ms ± 3%  41.3ms ± 3%  -2.49%  (p=0.001 n=15+15)
Throughput/MaxPacket/16MB/TLSv13-8       43.4ms ± 5%  42.3ms ± 3%  -2.33%  (p=0.006 n=15+14)
Throughput/MaxPacket/32MB/TLSv12-8       83.1ms ± 4%  80.6ms ± 3%  -2.98%  (p=0.000 n=15+15)
Throughput/MaxPacket/32MB/TLSv13-8       85.2ms ± 8%  82.6ms ± 4%  -3.02%  (p=0.005 n=15+15)
Throughput/MaxPacket/64MB/TLSv12-8        167ms ± 7%   158ms ± 2%  -5.21%  (p=0.000 n=15+15)
Throughput/MaxPacket/64MB/TLSv13-8        170ms ± 4%   162ms ± 3%  -4.83%  (p=0.000 n=15+15)
Throughput/DynamicPacket/1MB/TLSv12-8    4.13ms ± 7%  4.00ms ± 8%     ~     (p=0.061 n=15+15)
Throughput/DynamicPacket/1MB/TLSv13-8    4.72ms ± 6%  4.64ms ± 7%     ~     (p=0.377 n=14+15)
Throughput/DynamicPacket/2MB/TLSv12-8    7.29ms ± 7%  7.09ms ± 7%     ~     (p=0.070 n=15+14)
Throughput/DynamicPacket/2MB/TLSv13-8    7.18ms ± 5%  6.59ms ± 4%  -8.34%  (p=0.000 n=15+15)
Throughput/DynamicPacket/4MB/TLSv12-8    12.3ms ± 3%  11.9ms ± 4%  -3.31%  (p=0.000 n=15+14)
Throughput/DynamicPacket/4MB/TLSv13-8    12.2ms ± 4%  12.0ms ± 4%  -1.91%  (p=0.019 n=15+15)
Throughput/DynamicPacket/8MB/TLSv12-8    22.4ms ± 3%  21.9ms ± 3%  -2.18%  (p=0.000 n=15+15)
Throughput/DynamicPacket/8MB/TLSv13-8    22.7ms ± 3%  22.2ms ± 3%  -2.35%  (p=0.000 n=15+15)
Throughput/DynamicPacket/16MB/TLSv12-8   42.3ms ± 3%  42.1ms ± 3%     ~     (p=0.505 n=14+15)
Throughput/DynamicPacket/16MB/TLSv13-8   42.7ms ± 3%  43.3ms ± 7%     ~     (p=0.123 n=15+14)
Throughput/DynamicPacket/32MB/TLSv12-8   82.8ms ± 3%  81.9ms ± 3%     ~     (p=0.112 n=14+15)
Throughput/DynamicPacket/32MB/TLSv13-8   84.6ms ± 6%  83.9ms ± 4%     ~     (p=0.624 n=15+15)
Throughput/DynamicPacket/64MB/TLSv12-8    166ms ± 4%   163ms ± 6%     ~     (p=0.081 n=15+15)
Throughput/DynamicPacket/64MB/TLSv13-8    165ms ± 3%   168ms ± 3%  +1.56%  (p=0.029 n=15+15)

Change-Id: I22409b05afe761b8ed1912b15c67fc03f88d3d1f
Reviewed-on: https://go-review.googlesource.com/c/go/+/203481
Run-TryBot: Josh Bleecher Snyder <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
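For reference, the general shape of such a fix (hypothetical code, not the actual crypto/tls change): a defer that sits syntactically inside a loop forces the slower runtime-managed path even if it executes at most once, so hoisting it out restores eligibility for open-coding.

```go
package demo

import "sync"

var mu sync.Mutex

func handle(s string) { _ = s }

// Before: the defer is syntactically inside the loop, so the compiler
// falls back to the runtime-managed defer even though the i == 0 guard
// means it runs at most once.
func processBefore(items []string) {
	for i, it := range items {
		if i == 0 {
			mu.Lock()
			defer mu.Unlock()
		}
		handle(it)
	}
}

// After: hoisted out of the loop, the defer can be open-coded; the
// behavior is unchanged since the lock was always taken on iteration 0.
func processAfter(items []string) {
	if len(items) == 0 {
		return
	}
	mu.Lock()
	defer mu.Unlock()
	for _, it := range items {
		handle(it)
	}
}
```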
Can we close this issue? It doesn't seem there's much left. Defers in a loop are always going to be somewhat slower.
@rhysh Glad to hear the results that you have measured! And, as several folks have already pointed out, there are a number of further optimizations we can do to tighten up the inline defer code, eliminate some of the checks, etc.
@danscales, thanks a lot for fixing defer slowness!
I think this is done, so I am going to close it. We can always make things faster. If there are specific ideas for making defers faster, let's do them in separate issues. Thanks to all, especially @danscales.
There is a known issue in Go with deferred operations being much slower than explicit operations, both directly and indirectly impacting locking. Generally, on OS X and Windows, the change gives a 10-15% performance gain. Relates: golang/go#14939
On my system, BenchmarkDefer takes 77.7ns/op. This issue arises from investigation of #9704: if I remove the "defer endcgo(mp)" and place the call at the end of func cgocall, the benchmark in #9704 improves from 144ns/op to 63.7ns/op. (Note: we can't eliminate the defer in func cgocall, though, as that would break defer/recover in the Go->C->Go scenario.)
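For readers who want to reproduce the measurement, here is a minimal benchmark in the spirit of the runtime's BenchmarkDefer (an approximation, not the actual runtime test source). Save it in a _test.go file and run go test -bench Defer:

```go
package defbench

import "testing"

// withDefer isolates the cost of a single defer statement. The noinline
// directive keeps the compiler from flattening the call into the loop.
//
//go:noinline
func withDefer() {
	defer nop()
}

func nop() {}

func BenchmarkDefer(b *testing.B) {
	for i := 0; i < b.N; i++ {
		withDefer()
	}
}
```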