AMD64 Ubuntu Shared 3.x: python processes killed with SIGKILL by Linux Out-of-Memory (OOM), maybe related to test_asyncio #98407
Comments
This makes me think that jobs like "sleep(1000000)" should really use something like the issue number so they can be identified more easily. :-) Anyway, it's likely that these are the result of the kill not really killing the subprocess. I'm asking @kumaraditya303 to look into it. |
I am able to reproduce this. I reverted #32073 locally and can still reproduce it without the fix, so this is a different regression. I feel that the test has uncovered an old bug, as this is the first test which tests killing of a process with |
I'm confused. Are you saying that after reverting #32073 you can still see Python processes running |
I think the point is that the test itself leaves processes behind, and I think you've just proved that. So what's wrong with the test? We don't want to leave those processes behind. |
Yes the processes should not be left behind but I don't know what is causing it. |
If it was a general subprocess bug we'd have heard of it before. Maybe you can put something in the subprocess so it handles SIGKILL and prints something when it arrives? |
test_asyncio leaks child processes, the system has less free memory available, and so Linux kills Python test processes with SIGKILL. The root issue is that test_asyncio leaks child processes. |
So the process being killed is the process that just spawned a subprocess, and then that subprocess is leaked? That seems perverse. But where does it leak processes in the first place? |
It appears so, yes. The use of create_subprocess_shell is presumably creating (via Popen's shell=True) a parent sh process and then the python process. It's only the latter which is leaked (and those I see are all owned by init) so presumably the general flow and kill processing is working to kill off the immediately child sh process. So the question is what's keeping the child python process around? Granted it's stuck in the sleep (kernel hrtime), but that seems fine with a very similar command in test_kill (and doesn't seem to be a problem in any interactive tests I try). Although, while the regular test_kill uses a very similar command, it also uses create_subprocess_exec and no pipes so directly executes (and presumably kills) python. If having the intermediate shell involved isn't crucial to whatever this test is fixing, it might be a quick workaround barring figuring out what's going on under the covers. |
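The shell-plus-child structure described above can be observed with a small stand-alone sketch. This is not the test from the issue: the function name orphan_survives_shell_kill and the shell command are assumptions chosen for illustration, and it uses plain subprocess.Popen with shell=True rather than asyncio. It is POSIX only. The shell starts a background sleep, reports its PID, and waits; the sketch then kills only the shell and checks whether the sleep is still there.

```python
import os
import signal
import subprocess


def orphan_survives_shell_kill() -> bool:
    """Kill only the shell wrapper and report whether its child outlives it."""
    # The shell backgrounds `sleep`, prints the sleep's PID, then waits, so the
    # shell and the sleep are guaranteed to be two separate processes.
    proc = subprocess.Popen(
        "sleep 30 & echo $!; wait",
        shell=True,
        stdout=subprocess.PIPE,
        text=True,
    )
    child_pid = int(proc.stdout.readline())

    proc.kill()  # SIGKILL is delivered to the shell only
    proc.wait()

    try:
        os.kill(child_pid, 0)  # signal 0 only checks that the PID exists
    except ProcessLookupError:
        survived = False
    else:
        survived = True
        # Clean up so that this sketch does not itself leak a process.
        os.kill(child_pid, signal.SIGKILL)

    proc.stdout.close()
    return survived


if __name__ == "__main__":
    print("child outlived the shell:", orphan_survives_shell_kill())
```

If the child does survive, that would be consistent with the leaked processes being owned by init: once the shell is gone, nothing else signals its child.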
Oh. So maybe the fix is just to change |
For what it's worth, in a quick test on the buildbot changing to If changed, it then appears the new test is just |
Here's my theory (perhaps not news to you): maybe in UNIX we're not creating a process group, so the If that's the case the only solution might be to modify either asyncio or subprocess.py to use the Meanwhile, I cannot get the test to fail with the corresponding fix from commit 7015e13 (PR #32073) reverted. So now we appear to have a fix and a test but the test doesn't appear to demonstrate the bug that the fix is supposed to fix, but the test does create additional problems. Further investigations show: Using the original test program (https://bugs.python.org/file49965/kill_subprocess.py) I can demonstrate the problem. This test program uses All these repros are with 3.11rc2 and earlier (3.10, 3.9). With current main the test program no longer shows the offending traceback. So I conclude that the fix works, but we need to skip the test until we know more. |
PR #98491 fixes this by using a process group. Longer term I would like to fix this in |
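The process-group idea can be sketched as follows. This is a minimal illustration and not the code from PR #98491: the helper name run_and_kill_group, the shell command, and the grace delay are assumptions. It relies on start_new_session=True, which asyncio forwards to subprocess.Popen, so that the shell becomes the leader of a new process group, and on os.killpg to signal every process in that group. Both are POSIX only.

```python
import asyncio
import os
import signal
import sys


async def run_and_kill_group(command: str, grace: float = 0.2) -> int:
    """Run `command` through the shell in its own session, then kill the group."""
    # start_new_session=True makes the child call setsid(), so the shell leads
    # a new process group whose ID equals proc.pid.
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.DEVNULL,
        start_new_session=True,
    )
    # Give the shell a moment to start its own child.
    await asyncio.sleep(grace)
    # Signal the whole group: the shell and anything it spawned.
    os.killpg(proc.pid, signal.SIGKILL)
    # A negative return code means the shell was terminated by that signal.
    return await proc.wait()


if __name__ == "__main__" and sys.platform != "win32":
    returncode = asyncio.run(run_and_kill_group("sleep 30; sleep 30"))
    print("shell return code:", returncode)
```

The command deliberately chains two sleeps so that the shell cannot simply replace itself with a single child, which keeps the shell and the sleeping process distinct.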
If I revert the PR, I get the following traceback check ResourceWarning: Enable tracemalloc to get the object allocation traceback
Warning -- Unraisable exception
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f437da7b3e0>
Traceback (most recent call last):
File "/workspaces/cpython/Lib/asyncio/base_subprocess.py", line 126, in __del__
self.close()
File "/workspaces/cpython/Lib/asyncio/base_subprocess.py", line 104, in close
proto.pipe.close()
File "/workspaces/cpython/Lib/asyncio/unix_events.py", line 561, in close
self._close(None)
File "/workspaces/cpython/Lib/asyncio/unix_events.py", line 585, in _close
self._loop.call_soon(self._call_connection_lost, exc)
File "/workspaces/cpython/Lib/asyncio/base_events.py", line 772, in call_soon
self._check_closed()
File "/workspaces/cpython/Lib/asyncio/base_events.py", line 519, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
/workspaces/cpython/Lib/test/support/__init__.py:738: ResourceWarning: unclosed file <_io.FileIO name=10 mode='rb' closefd=True>
gc.collect()
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/workspaces/cpython/Lib/asyncio/unix_events.py:565: ResourceWarning: unclosed transport <_UnixReadPipeTransport closing fd=10 open>
_warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
test_asyncio failed (env changed)
== Tests result: SUCCESS ==
1 test altered the execution environment:
test_asyncio
Total duration: 4.4 sec
Tests result: SUCCESS
|
Hm, I cannot repro that on my Mac. When I keep the test and remove the fix, all asyncio tests pass. Also, the original test program still fails:
If I change the offending test to use
then I do get similar tracebacks among the regular test output (but the test somehow still passes):
And with
I get no traceback. So I still suspect that there's something about the test that's not quite right. |
Alas, you are on your own here, as I don't have access to a Mac, nor do I use one. Even if the test doesn't fail on Mac, isn't the failing test on Linux enough (by failing I mean the tracebacks)? It would be better if you verified this on Linux, made a decision on my PR, and then continued the investigation on Mac. |
Also, to avoid confusion: does the current main, without any changes, pass all of your tests? By pass I mean no "Event loop is closed" tracebacks. |
I can try on WSL2 Monday or Tuesday when I am reunited with my Windows laptop. |
Okay, but FYI, on WSL2 many tests fail randomly, especially those related to sockets and processes. I gave up on it and have not used WSL2 since. You can try a VM or Codespaces. |
Ah, good idea to try codespaces. Indeed, there the test without the fix shows the spurious traceback. Nevertheless, if I change the test to have |
Okay, thanks for checking! I'll change it to your command on non-Windows, as Windows does not have a sleep command. |
pythonGH-98491) (cherry picked from commit 3b2724a) Co-authored-by: Kumar Aditya <[email protected]>
…98491) (cherry picked from commit 3b2724a) Co-authored-by: Kumar Aditya <[email protected]>
Fixed by #98491, feel free to reopen if this happens again. |
AMD64 Ubuntu Shared 3.x buildbot started to fail today:
David Bolen, the buildbot owner, found logs of Linux OOM:
and also 340 processes like python -c import time; time.sleep(100000000). These processes are likely spawned by test_kill_issue43884() of test_asyncio.test_subprocess.
Maybe PR #32073 introduced a regression in test_asyncio. My comment on the PR: #32073 (comment)
Emails about this issue on buildbot-status mailing list: https://mail.python.org/archives/list/[email protected]/thread/ZLG6D5L4SJVS6VHHCC6S2P4OZ63SCXO3/