Reduce memory arena contention #8714
I wonder what happens if you plot the result of your benchmark as a function of thread count. See e.g. this NumPy issue, which reported a similar scaling issue: running a benchmark as a function of thread count was a very useful way to identify the scaling issue and to show that it was fixed by using locking that scales better.
Great job! The approach looks good to me, though I've left some comments on some specifics.
@lysnikolaou Thanks for the review! I've applied your suggestions. Please let me know if you see anything else.
@kddnewton Could you please not rebase/force-push after the first round of reviews? It makes follow-up reviews a bit harder.
@lysnikolaou ahh sorry! I saw it was out of date so just wanted to keep it synced. I'll avoid that going forward.
@ngoldbaum here's the chart. Sorry, I didn't end up getting it working with matplotlib, so it's just numbers entered manually into Google Sheets. But the numbers come straight from the benchmarking script linked in the description. You can see from the graph it gets really bad when you start to have a lot of threads on
There's a segfault on Ubuntu that might need some attention here.
Yes, and it appears that the other one is hanging. I will take a quick look. It was passing before, so I imagine this is related to my changes around looping through the arenas.
This is a follow-up to #8692, based on @wiredfool's feedback.
Previously there was one memory arena for all threads, making it the bottleneck for multi-threaded performance. As the number of threads increased, the contention for the lock on the arena would grow, causing other threads to wait to acquire it.
This commit makes it use 8 memory arenas and round-robins how they are assigned to threads. Each thread keeps track of the index it should use into the arena array; the index is assigned the first time the arena is accessed on that thread.
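As a rough illustration of the assignment scheme described above, here is a minimal Python sketch. It is not the code from this PR (the actual change is in C); the names `ARENA_COUNT`, `_next_index`, and `arena_index_for_current_thread` are hypothetical and exist only for this example.

```python
import itertools
import threading

ARENA_COUNT = 8  # the number of arenas mentioned in the description

_next_index = itertools.count()     # shared round-robin counter (hypothetical)
_counter_lock = threading.Lock()    # guards the counter only, not the arenas
_thread_state = threading.local()   # per-thread cache of the assigned index


def arena_index_for_current_thread() -> int:
    """Return this thread's arena index, assigning it on first use."""
    index = getattr(_thread_state, "arena_index", None)
    if index is None:
        # First access on this thread: take the next slot, wrapping around.
        with _counter_lock:
            index = next(_next_index) % ARENA_COUNT
        _thread_state.arena_index = index
    return index
```

With this scheme the shared counter is touched once per thread; every later lookup on that thread reads only thread-local state.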
When an image is first created, it is allocated from an arena. When the multiple-arena logic is enabled, the arena's index is also stored on the image, so that when the image is deleted its memory can be returned to the correct arena.
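The bookkeeping described above can be sketched as follows. This is a simplified Python model under stated assumptions, not the C implementation: `Arena`, `Image`, `BLOCK_SIZE`, and the free-list layout are all hypothetical stand-ins.

```python
import threading
from dataclasses import dataclass, field

ARENA_COUNT = 8
BLOCK_SIZE = 4096  # hypothetical block size, chosen only for this sketch


@dataclass
class Arena:
    """A pool of reusable blocks protected by its own lock."""
    lock: threading.Lock = field(default_factory=threading.Lock)
    free_blocks: list = field(default_factory=list)

    def take_block(self) -> bytearray:
        with self.lock:
            if self.free_blocks:
                return self.free_blocks.pop()
        return bytearray(BLOCK_SIZE)  # nothing cached, allocate a new block

    def return_block(self, block: bytearray) -> None:
        with self.lock:
            self.free_blocks.append(block)


ARENAS = [Arena() for _ in range(ARENA_COUNT)]


class Image:
    """Remembers which arena its block came from."""

    def __init__(self, arena_index: int) -> None:
        self.arena_index = arena_index
        self.block = ARENAS[arena_index].take_block()

    def close(self) -> None:
        # Return the block to the arena it was allocated from, even if
        # close() runs on a thread that is assigned a different arena.
        if self.block is not None:
            ARENAS[self.arena_index].return_block(self.block)
            self.block = None
```

Storing the index on the image matters because an image may be deleted on a different thread from the one that created it; without the stored index, the block could end up in the wrong arena's free list.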
In effect, single-threaded programs should see no real change. We also skip this logic when the GIL is enabled, since the GIL effectively acts as the lock on the default arena for us.
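To illustrate the GIL condition, here is a small runtime sketch in Python. It is only an analogy: the PR itself is C code and may well make this decision at build time instead. `sys._is_gil_enabled()` is available from CPython 3.13; the helper names `gil_enabled` and `active_arena_count` are hypothetical.

```python
import sys

ARENA_COUNT = 8


def gil_enabled() -> bool:
    """Report whether the GIL is active; assume it is on older interpreters."""
    check = getattr(sys, "_is_gil_enabled", None)  # present on CPython 3.13+
    return True if check is None else bool(check())


def active_arena_count() -> int:
    # With the GIL enabled, a single default arena is enough, because the
    # GIL already serializes access to it.
    return 1 if gil_enabled() else ARENA_COUNT
```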
As expected, this approach has no real noticeable effect on regular CPython. On free-threaded CPython, however, there is a massive difference (measuring up to about 70%).
Here is the benchmarking script that I used:
test.py
Results
3.13.0 on main
3.13.0 on branch
3.13.0t on main
3.13.0t on branch