Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-113055: Use pointer for interp->obmalloc state. #113412

Merged
merged 5 commits into from
Jan 27, 2024

Conversation

nascheme
Copy link
Member

@nascheme nascheme commented Dec 22, 2023

For interpreters that share state with the main interpreter, this points to the same static memory structure. For interpreters with their own obmalloc state, it is heap allocated. Add free_obmalloc_arenas() which will free the obmalloc arenas and radix tree structures for interpreters with their own obmalloc state.

Note that the free is only done if obmalloc used blocks is zero. If some blocks are used, it's not safe to free the memory since some extension might still be using it.

For interpreters that share state with the main interpreter, this points
to the same static memory structure.  For interpreters with their own
obmalloc state, it is heap allocated.  Add free_obmalloc_arenas() which
will free the obmalloc arenas and radix tree structures for interpreters
with their own obmalloc state.
@nascheme
Copy link
Member Author

This should improve the situtation for gh-113055. There will still be memory leaks if interpreters don't free their memory (or load extensions who don't free their memory). However, when the Py_RTFLAGS_USE_MAIN_OBMALLOC flag is set (the default, I think), the behaviour will be much like it was in 3.11. I.e. interpreters use the main interpreter obmalloc state.

The obmalloc state for the main interpreter is back to being a static structure in the BSS, as it was in 3.11. My guess is the get_state() function in obmalloc is a tiny bit faster, because there is no test branch in it. I didn't profile that yet though. I think get_state() will be fast enough that it doesn't matter.

The obmalloc_state_main global is okay and should be ignored.
@nascheme
Copy link
Member Author

It looks like the performance different for get_state() is tiny. Before this change (from perf record and perf report):

perf_obmalloc_before

And after:

perf_obmalloc_ptr

The counts in the left hand column are then number of samples that hit that instruction. My quick-and-dirty benchmark was:

./Programs/_testembed test_repeated_init_exec "$(cat bench_embed.py)"

with bench_embed.py containing:

import unittest

for i in range(4):
    unittest.main('test.test_dict', verbosity=1, exit=False)
    unittest.main('test.test_decimal', verbosity=1, exit=False)
    unittest.main('test.test_generators', verbosity=1, exit=False)
    unittest.main('test.test_set', verbosity=1, exit=False)

I didn't build with LTO and PGO, that would change things a bit I would guess.

@nascheme
Copy link
Member Author

With PGO and LTO turned on, the code becomes quite jumbled up and hard to follow. _PyObject_Free() disappears from the profile because it gets inlined in a bunch of places. Looking at PyObject_GC_Del(), there are no samples for the instruction corresponding to return interp->obmalloc in get_state().

image

@neonene
Copy link
Contributor

neonene commented Dec 23, 2023

Non-debug builds have memory leaks on Windows even using 33e0adb. At least, _PyObject_Arena.free() seems to never be called in the insert_to_freepool function.

The same goes for Python 3.12.1 with/without this patch applied.

EDIT: Observed with test_repeated_simple_init

@neonene
Copy link
Contributor

neonene commented Dec 23, 2023

_PyObject_Arena.free() seems to never be called in the insert_to_freepool function

Bisected to: ea2c001, which this PR does not fully affect on my non-debug build.

@nascheme nascheme requested a review from pablogsal January 11, 2024 22:10
@nascheme
Copy link
Member Author

I think this PR should get merged. It should be safe and it reduces memory used by sub-interpreters that share the main interpreter obmalloc state. It also eliminates a memory leak in the case of sub-interpreters that don't share the state. See free_obmalloc_arenas(). This call is conditional, only if _PyInterpreterState_GetAllocatedBlocks() returns zero. A more aggressive but less safe approach would be to always call free_obmalloc_arenas(). We can decide on that after merging this change.

@neonene
Copy link
Contributor

neonene commented Jan 14, 2024

@Yhg1s Any thoughts?

@gvanrossum
Copy link
Member

@ericsnowcurrently Have you reviewed this? Do you feel it's ready to be merged?

@ericsnowcurrently
Copy link
Member

I'm planning on taking a look in the next day or two.

Copy link
Member

@ericsnowcurrently ericsnowcurrently left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! I think this is basically right. I didn't see any correctness issues.

Clearly I did leave a bunch of comments, but it's mostly little things like alternate spellings, whether or not to statically initialize obmalloc_state_main, and whether or not we really have to make PyInterpreterState.obmalloc a pointer.

@bedevere-app
Copy link

bedevere-app bot commented Jan 24, 2024

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@ericsnowcurrently
Copy link
Member

Also, weren't you considering freeing the blocks that don't have any objects in them any more, to reduce how much leaks?

@ericsnowcurrently
Copy link
Member

Also, what do you think our options are for backporting this to 3.12?

@nascheme
Copy link
Member Author

I would prefer to use a pointer. I think it's cleaner/faster and I think the memory savings, while not huge, are non-trivial. On 64-bit platforms, struct _obmalloc_state is 256 kB. Most of that is the arena_map_root. If you allocate it in the BSS then typically only one entry of that array is actually used and so only one page of memory is actually resident. So typically 4 kB vs 256 kB. Also, every sub-interpreter that uses the main obmalloc state saves 256 kB of resident memory.

@nascheme
Copy link
Member Author

Also, what do you think our options are for backporting this to 3.12?

Possible but a little scary. It's not a lot of code but it changes behavior quite a bit. Up to the release manager, I guess. The number of people affected by the memory usage/leaks is likely to be really small. At least, that's my guess. So maybe they need to skip 3.12 and go from 3.11 to 3.13 it this is problem for them. @neonene might have another perspective.

@ericsnowcurrently
Copy link
Member

I would prefer to use a pointer.

Sounds good.

FWIW, subinterpreters that use the main obmalloc state supported for backward compatibility, are already uncommon, and I expect they will be more so as we move forward. So I wouldn't worry about factoring them into the decision very much.

@nascheme
Copy link
Member Author

Also, weren't you considering freeing the blocks that don't have any objects in them any more, to reduce how much leaks?

That's an idea. A bit of code to do that but probably not horrible. Another idea would be to add an -X flag or something similar that would enable freeing of all obmalloc arenas, even if there is a non-zero in-use count. Embedding programs could set this if they want to eliminate the leaks and are prepared to deal with extensions that crash because of it.

If it is safe to enable gh-113601 then maybe it is safe to just free the arenas as well. I think the same logic applies although it seems more likely freeing the arenas would be more likely to trigger a use-after-free bug.

nascheme and others added 2 commits January 26, 2024 10:39
- Remove 'runtime' argument.
- Remove 'initialized' and 'heap_allocated' members.
- Use _Py_IsMainInterpreter()
- Remove spurious whitespace change.
- Return -1 on error and 0 on success, as normal convention.
@nascheme
Copy link
Member Author

Thanks for the detailed review feedback. I think I've addressed all of your suggestions.

Copy link
Member

@ericsnowcurrently ericsnowcurrently left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ericsnowcurrently
Copy link
Member

Thanks for doing this!

@ericsnowcurrently
Copy link
Member

@Yhg1s, what are your thoughts on backporting this to 3.12?

aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
)

For interpreters that share state with the main interpreter, this points
to the same static memory structure.  For interpreters with their own
obmalloc state, it is heap allocated.  Add free_obmalloc_arenas() which
will free the obmalloc arenas and radix tree structures for interpreters
with their own obmalloc state.

Co-authored-by: Eric Snow <[email protected]>
@trygveaa
Copy link

trygveaa commented Mar 8, 2024

This fixes a segfault happening with 3.12, so it would be great if this could be backported to 3.12. I posted more details about the crash in #116510.

@bacon-cheeseburger
Copy link

Just wanted to offer that this issue appears to impact Kodi, which has a considerable user base in the millions. The number of users affected by the problem could be quite substantial.

@ericsnowcurrently
Copy link
Member

ping @Yhg1s

Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024
)

For interpreters that share state with the main interpreter, this points
to the same static memory structure.  For interpreters with their own
obmalloc state, it is heap allocated.  Add free_obmalloc_arenas() which
will free the obmalloc arenas and radix tree structures for interpreters
with their own obmalloc state.

Co-authored-by: Eric Snow <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants