GH-108362: Incremental GC implementation #108038
Conversation
Numbers for a recent commit, which is not tuned for performance beyond using incremental collection. Speedup: 3-4%.
Shortest pauses go up, but no one cares about those. Throughput is a few percent better, but that can also be achieved by increasing thresholds or only using one generation. The above numbers are from the pyperformance suite.
@pablogsal @nascheme want to take a look?
I can take a look this Thursday 👍
Hey Mark, sorry for the lack of review, but unfortunately I had an accident last week where I broke my finger and required some surgery. Currently I am recovering from the surgery and the injury. I will try to review it ASAP, but it may take a bit more time. Apologies for the delay.
Take care.
My impression is this is a good idea. The long pause you can get from the current full collections could be quite undesirable, depending on your app.

Regarding the statement that "It is guaranteed to collect all cycles eventually", I have some concern about what the worst case might be. E.g. if it collects eventually but takes 1 million runs of the GC to do it, that's not so great. This property sounds similar to what you get with the "Train Algorithm" for M&S style collection.

I suppose we don't want to provide an option to switch between the current "collect everything" and incremental approaches. We could probably turn on the incremental collector by default and then let people turn it off if they run into trouble. I guess the other solution would be to downgrade to an older Python version.
All garbage cycles present in the old generation will be collected in a single traversal of the old generation.
Obviously how many incremental collections it takes to traverse the whole old generation depends on how big the old generation is, and how big the increments are.
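To make the shape of that traversal concrete, here is a toy Python model of the two-space, chunk-at-a-time scan being described. It is purely illustrative: the PR's real implementation is C code in Modules/gcmodule.c, and the names `OldGeneration`, `INCREMENT_SIZE`, and `find_garbage` are invented for the sketch.

```python
from collections import deque

INCREMENT_SIZE = 4  # hypothetical per-increment work budget


class OldGeneration:
    """Toy two-space old generation: scan `oldest` a chunk at a time,
    park survivors in `aging`, and flip when `oldest` is exhausted."""

    def __init__(self):
        self.oldest = deque()   # space being scanned incrementally
        self.aging = deque()    # survivors accumulate here until the next flip

    def add_survivor(self, obj):
        # Survivors of a young collection join the space NOT being scanned.
        self.aging.append(obj)

    def incremental_collect(self, find_garbage):
        if not self.oldest:
            # A full pass has finished: flip the spaces. Every garbage cycle
            # that was in the old generation at the previous flip has been
            # examined by now.
            self.oldest, self.aging = self.aging, self.oldest
        chunk = [self.oldest.popleft()
                 for _ in range(min(INCREMENT_SIZE, len(self.oldest)))]
        garbage_ids = {id(o) for o in find_garbage(chunk)}  # caller's cycle detection
        self.aging.extend(o for o in chunk if id(o) not in garbage_ids)
        return len(chunk)
```

The "eventually" in the guarantee then falls out of the structure: the number of increments needed for a full pass is roughly the size of the old generation at the flip divided by the increment size.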
If there is a garbage cycle with more than …
Choosing an increment is done depth first, so if part of a cycle is in an increment, all of it must be.
Oh, I see. In that case, if the collector encounters an object with many references, many more objects could be included in the collection. E.g. if you encounter …
In the worst case, where all the objects form a giant cycle, there is no way to avoid visiting all the objects. We can also get long pauses if a large number of objects are reachable from a single object that isn't part of a cycle. Because we track the number of objects seen, if we end up doing a large collection then we take a larger pause until the next one, so we do no more work overall; it is just done in bigger chunks.

Possible mitigations (for another PR): at the start of the full cycle (after swapping the pending and scanning spaces) we could do a marking stage to mark live objects. Scanning the roots on the stack probably isn't a good idea as many of those could soon become garbage, but scanning …
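As a rough illustration of the depth-first increment selection discussed above (a sketch only; `gc_referents`, `in_old_space`, and `budget` are invented names, and the PR's real code works on `PyGC_Head` lists in C):

```python
def build_increment(pending, gc_referents, in_old_space, budget):
    """Pop objects from `pending` and pull in their reachable subgraph
    within the old space, depth first. A cycle is therefore never split
    across increments, but one reference-heavy object can blow the budget."""
    increment = []
    seen = set()
    while pending and len(increment) < budget:
        stack = [pending.pop()]
        while stack:                          # depth-first walk of referents
            obj = stack.pop()
            if id(obj) in seen:
                continue
            seen.add(id(obj))
            increment.append(obj)
            stack.extend(o for o in gc_referents(obj)
                         if in_old_space(o) and id(o) not in seen)
    return increment
```

At the Python level, following an object's `tp_traverse` edges is roughly what `gc.get_referents()` exposes, which is why a single object with many references can drag a large graph into one increment.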
I want to try one more thing (which is to simulate a large app with lots of modules, classes, functions...) just to see how that interacts with …
Modules/gcmodule.c (Outdated)

```c
uintptr_t aging = cf->aging_space;
if (_PyObject_IS_GC(op)) {
    PyGC_Head *gc = AS_GC(op);
    if (_PyObject_GC_IS_TRACKED(op) &&
```
I think including objects from the other space can lead to some problematic behavior. If you have a large object graph which is referenced from smaller, more frequently allocated objects, this will continuously pull that large object graph in and blow the budget. This is basically what I was concerned with regarding `sys.modules`: you can have a module which imports `sys`, and you can be creating lambdas in that module which are short lived but become cyclical trash, and to collect a lambda you need to traverse the world.

I also think that, given this behavior, the two different lists of objects aren't really necessary. You could instead just move the objects to the end of the collecting space and we'd get back to them when we can, and I think the behavior would be identical to the existing algorithm (except maybe we'd pick up some extra objects when we flip the spaces).

I think another potential problem with this behavior is that you're not eagerly adding objects in the current space to this list transitively. That means if we visit a large object graph and blow the budget, then other transitively referenced objects that are in the space we're collecting from may never get added to the container, so despite collecting a huge object graph we still won't have collected enough to clear the cycle.

Below is a program that seems to grow unbounded. I had briefly experimented with using the `_PyGC_PREV_MASK_COLLECTING` flag here to mark objects we want to include, instead of using the aging space, but that also didn't work (I would have expected that on some collections we get a bunch of these little cycles, and then after a flip we need to walk the large object graph once), so I'm not certain what exactly needs to be done to fix this.
```python
class LinkedList:
    def __init__(self, next=None, prev=None):
        self.next = next
        if next is not None:
            next.prev = self
        self.prev = prev
        if prev is not None:
            prev.next = self

def make_ll(depth):
    head = LinkedList()
    for i in range(depth):
        head = LinkedList(head, head.prev)
    return head

head = make_ll(10000)

while True:
    newhead = make_ll(200)
    newhead.surprise = head
```
Any cycle present in the old generation at the start of the full scan (when we flip spaces) will be collected during that scan (before we flip spaces again). See #108038 (comment).

There will always be object graphs that perform badly, but all cycles will be collected eventually (assuming the program runs for long enough).

With this program it looks like most of the cycles created will be handled in the young collection, so I don't see a problem there. But if we increase `depth` so that the cycles outlive the young collection, then it might take a while to collect them, and the first increment will likely have to visit all the nodes reachable from the global variable `head`.
I guess I wasn't clear enough (and maybe you didn't run it), because this program actually runs out of memory :) You're right that most cycles should be collected in the young collection, but I'm guessing one survives every young collection, and those build up and are uncollectible. I think we probably only successfully collect one of the non-long-lived objects every collection, because we repeatedly re-process the long-lived linked list.

If I modify the program to hold onto 100 of these at a time before clearing them out, it runs out of memory even faster.
I'm seeing the memory use grow slowly, although seemingly without bound, so something is wrong.

There are two spaces in the old generation: `aging` and `oldest`. After a young or incremental collection we add survivors to the end of `aging`. We only collect cycles within the `oldest` space. After a flip, all objects will be in the `oldest` space, so if there are any cycles they will be collected, not moved back to the `aging` space.

I modified your program to include `gc.set_threshold(2000, 2, 0)`, which makes the incremental collector process objects five times as fast, and in that case the memory appears to stay bounded.

I was hoping to merge this and then play with thresholds, but it looks like we will need some sort of adaptive threshold before merging.
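For anyone reproducing these experiments, the knobs and counters mentioned in this thread are standard `gc` module APIs. A small sketch (the numbers are just the experimental values from the comment above, not recommendations, and the meaning of the later threshold arguments differs between the released collector and this PR):

```python
import gc

# First threshold controls how often a collection runs; the later two are the
# experimental values from the comment above.
gc.set_threshold(2000, 2, 0)
print(gc.get_threshold())

# gc.get_stats() returns one dict per generation; if 'collected' stops keeping
# up with the rate at which the program creates cyclic garbage, memory grows.
for gen, stats in enumerate(gc.get_stats()):
    print(gen, stats["collections"], stats["collected"], stats["uncollectable"])
```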
You only collect cycles in the `oldest` space, but the reason I placed this comment here is that you do gather the transitive closure from the `aging` space. Therefore I think the statement "We only collect cycles within the oldest space" is incorrect given this code: once you've included a single object from `aging`, you will consider its transitive closure as well.

But including these objects seems like it should be unnecessary: once you flip, you'll re-consider those objects and their transitive closure anyway.

And as I said before, I think this basically eliminates any usefulness of the two spaces... you may as well just be moving the objects to the end of the `oldest` space if you're willing to suck them into the transitive closure.
Why do you think we collect objects from the aging space?
That is not the intention, and I don't see where that happens.
With this code:

```python
class LinkedList:
    def __init__(self, next=None, prev=None):
        self.next = next
        if next is not None:
            next.prev = self
        self.prev = prev
        if prev is not None:
            prev.next = self

def make_ll(depth):
    head = LinkedList()
    for i in range(depth):
        head = LinkedList(head, head.prev)
    return head

import gc
#gc.set_threshold(2000, 2, 0)

M = 10_000
N = 5_000
head = make_ll(M)
count = M
next_count = 1_000_000

while True:
    newhead = make_ll(N)
    newhead.surprise = head
    count += N
    if count >= next_count:
        print(f"Count = {int(count/1_000_000)}M")
        print(gc.get_stats()[:2])
        next_count += 1_000_000
```

I've upped the size of the lists so that they aren't collected by the young collection.
FWIW this variation grows unbounded even with a greatly increased threshold (although slowly; I killed it after it got to 10 GB), but maybe there's some amount of auto-tuning where it would keep up. On a short run it also seems to be spending ~25% of its time in …
The first threshold just determines how often a collection is done. It shouldn't really impact whether the collector can keep up. Since this program is doing nothing but producing cyclic garbage, I'm not surprised that it spends a lot of time in GC. Is it worse than the current GC?
I am seeing much the same behavior on 3.11 and main in terms of the count of objects being collected.
Ahh I hadn't compared the CPU time and indeed the baseline GC is spending as much time in GC, so never mind on the time spent :)
I haven't been looking at the collection statistics but rather memory usage. On the most recent program I see …
With your latest example, the stats show the leak as well.
LGTM!
See issue gh-115127: multiprocessing test_thread_safety() fails with "gc_list_is_empty(to) || gc_old_space(to_tail) == gc_old_space(from_tail)" assert error. |
Revert "pythonGH-108362: Incremental GC implementation (pythonGH-108038)" (python#115132). This reverts commit 36518e6.
Implements incremental cyclic GC.
Instead of traversing one generation on each collection, we traverse the young generation and the oldest part of the old generation. By traversing the old generation a chunk at a time, we keep pause times down a lot.
See faster-cpython/ideas#613 for the idea and algorithm.
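Since shorter pauses are the main point, one way to observe pause times from pure Python while running a workload is via `gc.callbacks` (a standard API; the percentile reporting below is just an illustrative choice, not part of this PR):

```python
import gc
import time

# Record the duration of each collection via gc.callbacks.
_pauses = []
_start = None

def _on_gc(phase, info):
    global _start
    if phase == "start":
        _start = time.perf_counter()
    elif phase == "stop" and _start is not None:
        _pauses.append(time.perf_counter() - _start)
        _start = None

gc.callbacks.append(_on_gc)

# ... run the workload of interest here ...

if _pauses:
    _pauses.sort()
    p90 = _pauses[int(len(_pauses) * 0.9)]
    print(f"collections: {len(_pauses)}, "
          f"p90 pause: {p90 * 1000:.2f} ms, "
          f"max pause: {_pauses[-1] * 1000:.2f} ms")
```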