Reduce memory arena contention #8714

kddnewton · 2025-01-24T19:46:43Z

This is a follow-up to #8692, based on @wiredfool's feedback.

Previously there was one memory arena for all threads, making it the bottleneck for multi-threaded performance. As the number of threads increased, the contention for the lock on the arena would grow, causing other threads to wait to acquire it.

This commit makes it use 8 memory arenas, and round-robins how they are assigned to threads. Threads keep track of the index that they should use into the arena array, assigned the first time the arena is accessed on a given thread.
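As a rough illustration of the assignment scheme described above (the real change is in C; this Python sketch only mirrors the idea, and `NUM_ARENAS`, `arena_index`, and the module-level counter are hypothetical names):

```python
import itertools
import threading

NUM_ARENAS = 8  # arena count taken from the description above

_next_slot = itertools.count()    # shared counter handing out slots in order
_slot_lock = threading.Lock()     # protects the shared counter
_local = threading.local()        # per-thread storage for the chosen index


def arena_index() -> int:
    """Return this thread's arena index, assigning it on first use."""
    index = getattr(_local, "index", None)
    if index is None:
        # First access on this thread: take the next slot, wrapping around.
        with _slot_lock:
            index = next(_next_slot) % NUM_ARENAS
        _local.index = index
    return index
```

With 16 threads, each of the 8 indices ends up shared by two threads, so at most two threads compete for any one arena lock instead of all 16.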

When an image is first created, it is allocated from an arena. When the logic to have multiple arenas is enabled, it then keeps track of the index on the image, so that when deleted it can be returned to the correct arena.
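A minimal sketch of why the index is stored on the image, assuming a simplified arena that just caches freed blocks. The `Arena`, `Image`, `new_image`, and `delete_image` names are hypothetical and do not correspond to Pillow's C structures:

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Arena:
    lock: threading.Lock = field(default_factory=threading.Lock)
    free_blocks: list = field(default_factory=list)


@dataclass
class Image:
    block: bytearray
    arena_index: int  # remembered so deletion can find the owning arena


arenas = [Arena() for _ in range(8)]


def new_image(size: int, index: int) -> Image:
    """Allocate from the arena at `index`, reusing a cached block if possible."""
    arena = arenas[index]
    with arena.lock:
        for i, block in enumerate(arena.free_blocks):
            if len(block) == size:
                return Image(arena.free_blocks.pop(i), index)
    return Image(bytearray(size), index)


def delete_image(image: Image) -> None:
    """Return the block to the arena it came from, whichever thread deletes it."""
    arena = arenas[image.arena_index]
    with arena.lock:
        arena.free_blocks.append(image.block)
```

Because the index travels with the image, an image created on one thread and deleted on another still goes back to its original arena.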

In single-threaded programs only one arena is ever used, so this should not really have an effect. We also skip this logic if the GIL is enabled, as the GIL effectively acts as the lock on the default arena for us.
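For illustration only, the same condition can be checked from Python. `sys._is_gil_enabled()` exists on CPython 3.13 and later; `gil_enabled` and `arena_count` are hypothetical helpers, and how the C code makes this decision is not shown here:

```python
import sys


def gil_enabled() -> bool:
    """Report whether the GIL is active in the running interpreter."""
    # sys._is_gil_enabled() was added in CPython 3.13; older versions
    # always run with the GIL, so default to True when it is missing.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else bool(check())


def arena_count() -> int:
    """One arena under the GIL, eight when running free-threaded."""
    return 1 if gil_enabled() else 8
```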

As expected, this approach has no real noticeable effect on regular CPython. On free-threaded CPython, however, there is a massive difference (run times drop by up to about 70%).

Here is the benchmarking script that I used:

test.py

import concurrent.futures
import os
import threading
import time

from PIL import Image

num_threads = 16
num_images = 1024


def operation():
    images = []
    for i in range(num_images):
        img = Image.new(
            "RGB", (100, 100), color=(i % 256, (i // 256) % 256, (i // 65536) % 256)
        )
        images.append(img)

    for img in images:
        img = img.convert("CMYK")

    images.clear()


def worker(barrier):
    barrier.wait()
    runtimes = []

    for _ in range(5):
        start_time = time.time()
        operation()
        end_time = time.time()
        runtimes.append(end_time - start_time)

    return runtimes


def benchmark():
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        barrier = threading.Barrier(num_threads)
        futures = [executor.submit(worker, barrier) for _ in range(num_threads)]

        run_times = []
        for future in concurrent.futures.as_completed(futures):
            try:
                run_times.extend(future.result())
            except IndexError:
                os._exit(-1)

        min_time = min(run_times)
        max_time = max(run_times)
        mean_time = sum(run_times) / len(run_times)
        print(f"Max: {max_time:.6f} Mean: {mean_time:.6f} Min: {min_time:.6f}")


benchmark()

Results

3.13.0 on main

$ python -VV
Python 3.13.0 (tags/v3.13.0:60403a5409f, Jan  3 2025, 14:04:52) [Clang 16.0.0 (clang-1600.0.26.6)]
$ python test.py
Max: 0.404962 Mean: 0.353120 Min: 0.303807
$ python test.py
Max: 0.369188 Mean: 0.320218 Min: 0.282613
$ python test.py
Max: 0.386692 Mean: 0.335509 Min: 0.294476
$ python test.py
Max: 0.394410 Mean: 0.350275 Min: 0.299456
$ python test.py
Max: 0.416075 Mean: 0.354347 Min: 0.309045

3.13.0 on branch

$ python -VV
Python 3.13.0 (tags/v3.13.0:60403a5409f, Jan  3 2025, 14:04:52) [Clang 16.0.0 (clang-1600.0.26.6)]
$ python test.py
Max: 0.422371 Mean: 0.354453 Min: 0.292521
$ python test.py
Max: 0.423698 Mean: 0.358393 Min: 0.313581
$ python test.py
Max: 0.405487 Mean: 0.356346 Min: 0.299354
$ python test.py
Max: 0.431244 Mean: 0.369772 Min: 0.308096
$ python test.py
Max: 0.472806 Mean: 0.377575 Min: 0.313588

3.13.0t on main

$ python -VV
Python 3.13.0 experimental free-threading build (tags/v3.13.0:60403a5409f, Jan  7 2025, 13:53:44) [Clang 16.0.0 (clang-1600.0.26.6)]
$ python test.py
Max: 0.161379 Mean: 0.114555 Min: 0.072739
$ python test.py
Max: 0.188203 Mean: 0.133376 Min: 0.095111
$ python test.py
Max: 0.181084 Mean: 0.128733 Min: 0.086086
$ python test.py
Max: 0.187286 Mean: 0.131114 Min: 0.094561
$ python test.py
Max: 0.191979 Mean: 0.133439 Min: 0.097527

3.13.0t on branch

$ python -VV
Python 3.13.0 experimental free-threading build (tags/v3.13.0:60403a5409f, Jan  7 2025, 13:53:44) [Clang 16.0.0 (clang-1600.0.26.6)]
$ python test.py
Max: 0.090055 Mean: 0.044987 Min: 0.019220
$ python test.py
Max: 0.095986 Mean: 0.040712 Min: 0.019204
$ python test.py
Max: 0.094424 Mean: 0.041949 Min: 0.016574
$ python test.py
Max: 0.084610 Mean: 0.042789 Min: 0.016866
$ python test.py
Max: 0.095480 Mean: 0.044068 Min: 0.015999
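The size of the free-threaded improvement can be worked out from the Mean values listed above for 3.13.0t. This is plain arithmetic on those ten numbers, with nothing measured anew:

```python
# Mean run times copied from the 3.13.0t results above.
main_means = [0.114555, 0.133376, 0.128733, 0.131114, 0.133439]
branch_means = [0.044987, 0.040712, 0.041949, 0.042789, 0.044068]

avg_main = sum(main_means) / len(main_means)
avg_branch = sum(branch_means) / len(branch_means)

# Fraction of run time saved, averaged over the five runs.
reduction = 1 - avg_branch / avg_main
# Most favourable pairing: fastest branch run against slowest main run.
best_case = 1 - min(branch_means) / max(main_means)

print(f"average reduction: {reduction:.1%}, best case: {best_case:.1%}")
```

Averaged over the five runs the reduction is roughly two thirds, and the most favourable pairing comes out just under 70%, in line with the "up to about 70%" figure.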

ngoldbaum · 2025-01-24T20:06:09Z

I wonder what happens if you plot the result of your benchmark as a function of thread count. See e.g. this NumPy issue, which reported a similar scaling problem. Running a benchmark as a function of thread count was a very useful way to identify the problem there, and to confirm that it was fixed by using locking that scales better.
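One way such a sweep could be sketched, assuming the workload is passed in as a callable (for this PR it would be the `operation` function from test.py). `time_workload` and `sweep` are hypothetical names, and `time.perf_counter` is used here instead of the script's `time.time`:

```python
import concurrent.futures
import threading
import time


def time_workload(workload, num_threads, repeats=5):
    """Return the mean per-call run time of `workload` across `num_threads` threads."""
    barrier = threading.Barrier(num_threads)

    def worker():
        barrier.wait()  # release all threads at the same moment
        runtimes = []
        for _ in range(repeats):
            start = time.perf_counter()
            workload()
            runtimes.append(time.perf_counter() - start)
        return runtimes

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(worker) for _ in range(num_threads)]
        run_times = [t for future in futures for t in future.result()]
    return sum(run_times) / len(run_times)


def sweep(workload, thread_counts=(1, 2, 4, 8, 16)):
    """Map each thread count to the mean run time measured at that count."""
    return {n: time_workload(workload, n) for n in thread_counts}
```

Calling `sweep(operation)` on main and on the branch would give two series that can be plotted against thread count.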

lysnikolaou

Great job! The approach looks good to me, though I've left some comments on some specifics.

src/libImaging/Storage.c

src/_imaging.c

src/libImaging/Storage.c

kddnewton · 2025-01-30T13:28:54Z

@lysnikolaou — Thanks for the review! I've applied your suggestions. Please let me know if you see anything else.
@ngoldbaum — I'll work on a graph. I chose 8 arenas after some experimentation, but it would be good to back this up.


lysnikolaou · 2025-01-31T15:31:39Z

@kddnewton Could you please not rebase/force-push after the first round of reviews? It makes follow-up reviews a bit harder.

kddnewton · 2025-01-31T15:34:32Z

@lysnikolaou ahh sorry! I saw it was out of date so just wanted to keep it synced. I'll avoid that going forward.

kddnewton · 2025-01-31T17:22:15Z

@ngoldbaum here's the chart. Sorry, I didn't end up getting it working with matplotlib, so it's just manually putting numbers into Google Sheets. But the numbers come straight from the benchmarking script linked in the description.

You can see from the graph it gets really bad when you start to have a lot of threads on main with free threading. This branch and main are about the same when running without free threading. On this branch with free threading it is significantly better.

lysnikolaou · 2025-01-31T20:32:19Z

There's a segfault on Ubuntu that might need some attention here.

kddnewton · 2025-01-31T20:51:57Z

Yes, and it appears that the other one is hanging. I will take a quick look. It was passing before, so I imagine this is related to my changes around looping through the arenas.

kddnewton mentioned this pull request Jan 24, 2025

Thread-local arenas #8692

Closed

radarhere mentioned this pull request Jan 25, 2025

Only import distutils when type checking kddnewton/Pillow#3

Merged

radarhere added the Free-threading PEP 703 support label Jan 25, 2025

lysnikolaou reviewed Jan 29, 2025


kddnewton force-pushed the reduce-contention branch from cb4b753 to b2a97bb Compare January 30, 2025 13:26

Kevin Newton and others added 3 commits January 31, 2025 10:29

[pre-commit.ci] auto fixes from pre-commit.com hooks

cff2141

for more information, see https://pre-commit.ci

Only import distutils when type checking

e2a96a5

kddnewton force-pushed the reduce-contention branch from 6479c3b to e2a96a5 Compare January 31, 2025 15:29
