runtime: markroot over finalizers is stop-the-world #11485
See issue #11484 for most of the other half of mark termination time.
@RLH and I talked today about marking objects reachable from finalizers during concurrent scan rather than during mark termination. We concluded that it should probably Just Work, modulo one quirk: if SetFinalizer is called during a GC cycle, it must itself call scanobject on the object. Otherwise, it's possible to have objects A -> B where A is reachable at the beginning of GC but does not have a finalizer at the time of concurrent scan; then, during concurrent mark, the mutator sets a finalizer on A and makes A unreachable before it gets marked. If SetFinalizer doesn't call scanobject, B won't be marked and may be collected before A's finalizer runs.

We might also need more locking. There are comments in addspecial/removespecial claiming that it's okay for the GC to access the specials list without locks, but they don't explain why and I'm not sure I believe them.
I believe the primary expense in finalizers is treating them as roots, right? For the *os.File finalizer, what we care about is getting at the 'fd int' field, not any pointers the *os.File contains.

One option is to have a runtime.unsafeSetFinalizer that arranges for the finalizer to be called as usual but without treating the object as a root. The finalizer must be careful not to access pointers contained in the object. We could use that for *os.File but not expose it otherwise.

A safer option, which I like better, might be to introduce an optimization in SetFinalizer: if the object contains no pointers, it doesn't count as a root during GC (since it points at nothing). Then we can put the 'fd int' in its own finalized mini-object that satisfies this property. That would eliminate basically all the finalizer roots in network-heavy programs with tons of open network connections, which should kill off the finalizer contribution to stop-the-world in those programs. Would that help enough?

The sweep of a span containing finalized objects would still need to arrange to call the finalizers, but that runs concurrently, so I assume it is not as critical to the speed.
The current cost of finalizers is O(# of finalizers + # of spans). The constant on the (# of spans) term is about 1ms per GB of heap. I haven't done the profiling to determine the constant on the (# of finalizers) term, but it does mean that even if we eliminate the marking for some finalizers, we still have the (# of spans) term.

If the constant on the (# of finalizers) term turns out to be high, then I think your suggestion may be a good direction. But I suspect it is low, and just moving it to concurrent mark should be enough, since the cost of the whole mark phase is much higher than 1ms per GB of heap.

Of course, improving on the (# of spans) term is good, too. Currently, we just loop over all of the spans to find spans with finalizers. We could keep a better data structure instead.

Note that the finalizer on net.netFD objects is more complex than the one on os.File objects. However, it looks like everything the finalizer actually touches is still pointer-free (except on nacl!).
CL https://golang.org/cl/13112 mentions this issue.
Marking of span roots can represent a significant fraction of the time spent in mark termination. Simply traversing the span list takes about 1ms per GB of heap, and if there are a large number of finalizers (for example, for network connections), it may take much longer.

Improve the situation by splitting the span scan into 128 subtasks that can be executed in parallel and load balanced by the markroots parallel for. This lets the GC balance this job across the Ps.

A better solution is to do this during concurrent mark, or to improve it algorithmically, but this is a simple change with a lot of bang for the buck.

This was suggested by Rhys Hiltner.

Updates #11485.

Change-Id: I8b281adf0ba827064e154a1b6cc32d4d8031c03c
Reviewed-on: https://go-review.googlesource.com/13112
Reviewed-by: Keith Randall <[email protected]>
It's certainly not safe for the GC to traverse the specials list during removefinalizer/removespecial. Memory for finalizers is manually managed, so at best the GC could try to read from freed memory while traversing the list. This isn't a problem today because everywhere that traverses this list is either STW or holds span.speciallock.

We could traverse this list during concurrent GC without a large cost if we peeked at the list head without holding span.speciallock and took the lock to walk the list only if the head is non-nil.

Traversing the list during addfinalizer/addspecial without locking could be safe, but isn't in theory under the Go memory model, and may not be in practice on non-TSO machines, because there's no publication fence between filling in the special and splicing it into the list.
I think I understand the synchronization (and lack of synchronization) involved here. I'm going to try moving this into the concurrent phase. We can separately optimize it by reducing the loop to O(# of finalizers) or even O(# of non-trivial finalizers), but at the end of the day, we could still have a program with a legitimately large number of finalizers, and I don't want STW to be proportional to that. Plus, it won't be a big time sink compared to other things in the concurrent phase, so this will shift the focus to other things that affect overall throughput more.
CL https://golang.org/cl/14982 mentions this issue.
On multi-gigabyte heaps, about half of our mark termination time (~5ms of ~10ms at 4GB) is spent in the loop over all spans in the _RootSpans case of markroot. We should fix this for 1.6.
@RLH, @dr2chase, and I discussed a few possible fixes. One is to maintain a global list of finalizers in addition to the per-span lists so that this loop is proportional to the number of finalizers rather than the number of spans. This could still be slow in a hypothetical program with many finalizers (maybe we don't care) and might complicate synchronization around the specials and finalizers lists.
Another possibility is to scan finalizers during the concurrent scan phase, perform the appropriate write barriers for finalizers added during GC, and then simply eliminate this re-scan from mark termination. It's entirely possible we already have all or most of the write barriers we need for this. There is one quirk to this solution: currently we only scan the object with the finalizer during mark termination. However, it's unclear why we do this, and it's probably a bad idea, since it means we're delaying mark work we could have done concurrently.