Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMD64 Ubuntu Shared 3.x: python processes killed with SIGKILL by Linux Out-of-Memory (OOM), maybe related to test_asyncio #98407

Closed
vstinner opened this issue Oct 18, 2022 · 23 comments
Assignees
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Copy link
Member

AMD64 Ubuntu Shared 3.x buildbot started to fail today:

David Bolen, the buildbot owner, found logs of Linux OOM:

Oct 18 11:27:39 buildbot-ubuntu kernel: [34420581.683539] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=python,pid=18752,uid=1000
Oct 18 11:27:39 buildbot-ubuntu kernel: [34420581.683565] Out of memory: Killed process 18752 (python) total-vm:1381868kB, anon-rss:725500kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:1748kB oom_score_adj:0
Oct 18 11:27:39 buildbot-ubuntu kernel: [34420581.795140] oom_reaper: reaped process 18752 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

and also 340 processes like python -c import time; time.sleep(100000000). These processes are likely spawned by test_kill_issue43884() of test_asyncio.test_subprocess.

Maybe PR #32073 introduced a regression test_asyncio. My comment on the PR: #32073 (comment)

Emails about this issue on buildbot-status mailing list: https://mail.python.org/archives/list/[email protected]/thread/ZLG6D5L4SJVS6VHHCC6S2P4OZ63SCXO3/

@gvanrossum
Copy link
Member

This makes me think that jobs like "sleep(1000000)" should really use something like the issue number so they can be identified more easily. :-)

Anyway, it's likely that these are the result of the kill not really killing the subprocess. I'm asking @kumaraditya303 to look into it.

@kumaraditya303
Copy link
Contributor

I am able to reproduce this, I reverted #32073 locally and tested and I am able to reproduce this even without the fix so this is a different regression. I feel that the test has uncovered an old bug as this is first test which tests killing of a process with shell=True.

@gvanrossum
Copy link
Member

I'm confused. Are you saying that after reverting #32073 you can still see Python processes running time.sleep(100000000)? Where are those started then? That PR adds a brand new test that starts such processes -- are there other tests that also run the same test command? I cannot find any other occurrences of that string in Lib/test.

@kumaraditya303
Copy link
Contributor

I reverted #32073 and manually added the test_kill_issue43884 test to see if it is caused by #32073 or not. It is not caused by the PR and I conclude that the fix is not related to this (which makes sense since that PR only changes when proc.wait callbacks are called).

@gvanrossum
Copy link
Member

I think the point is that the test itself leaves processes behind, and I think you've just proved that. So what's wrong with the test? We don't want to leave those processes behind.

@kumaraditya303
Copy link
Contributor

So what's wrong with the test? We don't want to leave those processes behind.

Yes the processes should not be left behind but I don't know what is causing it. asyncio delegates killing of process and all other stuff to subprocess module, maybe a bug in subprocess module? I am not sure.

@gvanrossum
Copy link
Member

If it was a general subprocess bug we'd have heard of it before.

Maybe you can put something in the subprocess so it handles SIGKILL and prints something when it arrives?

@vstinner
Copy link
Member Author

test_asyncio leaks child processes, the system has less free memory available, and so Linux kills Python test processes with SIGKILL. The root issue is that test_asyncio leaks child processes.

@gvanrossum
Copy link
Member

So the process being killed is the process that just spawned a subprocess, and then that subprocess is leaked? That seems perverse. But where does it leak processen in the first place?

@db3l
Copy link
Contributor

db3l commented Oct 20, 2022

It appears so, yes.

The use of create_subprocess_shell is presumably creating (via Popen's shell=True) a parent sh process and then the python process. It's only the latter which is leaked (and those I see are all owned by init) so presumably the general flow and kill processing is working to kill off the immediately child sh process. So the question is what's keeping the child python process around?

Granted it's stuck in the sleep (kernel hrtime), but that seems fine with a very similar command in test_kill (and doesn't seem to be a problem in any interactive tests I try).

Although, while the regular test_kill uses a very similar command, it also uses create_subprocess_exec and no pipes so directly executes (and presumably kills) python. If having the intermediate shell involved isn't crucial to whatever this test is fixing, it might be a quick workaround barring figuring out what's going on under the covers.

@gvanrossum
Copy link
Member

Oh. So maybe the fix is just to change test_kill_issue43884 to use create_subprocess_exec?

@db3l
Copy link
Contributor

db3l commented Oct 20, 2022

For what it's worth, in a quick test on the buildbot changing to create_subprocess_exec that does appear to avoid any stranded processes. It seems the test (and PR) are focused on the asyncio side of cleanup after process exit, so it seems that either creation call would serve, but I'm not positive. The sample code in the original issue did use create_subprocess_shell.

If changed, it then appears the new test is just test_kill plus an asyncio.sleep(1). The windows process group stuff would be unnecessary given a single child process).

@gvanrossum
Copy link
Member

Here's my theory (perhaps not news to you): maybe in UNIX we're not creating a process group, so the proc.kill() call kills the shell subprocess only, using SIGKILL, and since SIGKILL cannot be ignored or caught (I'd forgotten that detail until just now), the sleep(1000000) process is orphaned reliably whenever that test runs. So the kernel isn't OOM-killing anything -- the test just creates orphans evey time it runs. OR, alternatively, the shell does create a process group, but proc.kill() doesn't kill the process group, it just kills the target process (the shell), still creating orphans. What do you think?

If that's the case the only solution might be to modify either asyncio or subprocess.py to use the killpg syscall instead of kill?

Meanwhile, I cannot get the test to fail with the corresponding fix from commit 7015e13 (PR #32073) reverted. So now we appear to have a fix and a test but the test doesn't appear to demonstrate the bug that the fix is supposed to fix, but the test does create additional problems.

Further investigations show: Using the original test program (https://bugs.python.org/file49965/kill_subprocess.py) I can demonstrate the problem. This test program uses sleep 2 && echo done. If I change that to just sleep 2 the program no longer demonstrates the bug, so it appears to be specific to something the shell does when one program follows another. Using sleep 1; sleep 1 I can also repro it. But using "python3 -c 'import time; time.sleep(2); print('done')'" I see no repro. My conclusion is that the repro is pretty finicky, and the current test fails to reproduce the bug properly.

All these repros are with 3.11rc2 and earlier (3.10, 3.9). With current main the test program no longer shows the offending traceback. So I conclude that the fix works, but we need to skip the test until we know more.

@kumaraditya303
Copy link
Contributor

kumaraditya303 commented Oct 20, 2022

PR #98491 fixes this by using a process group. Longer term I would like to fix this in asyncio to use process groups by default when spawning shells. I am short on time ATM so will investigate later on.

@kumaraditya303
Copy link
Contributor

Meanwhile, I cannot get the test to fail with the corresponding fix from commit 7015e13 (PR #32073) reverted. So now we appear to have a fix and a test but the test doesn't appear to demonstrate the bug that the fix is supposed to fix, but the test does create additional problems.

If I revert the PR, I get the following traceback check RuntimeError: Event loop is closed. It reproduces the bug mentioned in that issue.

ResourceWarning: Enable tracemalloc to get the object allocation traceback
Warning -- Unraisable exception
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f437da7b3e0>
Traceback (most recent call last):
  File "/workspaces/cpython/Lib/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/workspaces/cpython/Lib/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/workspaces/cpython/Lib/asyncio/unix_events.py", line 561, in close
    self._close(None)
  File "/workspaces/cpython/Lib/asyncio/unix_events.py", line 585, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/workspaces/cpython/Lib/asyncio/base_events.py", line 772, in call_soon
    self._check_closed()
  File "/workspaces/cpython/Lib/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
/workspaces/cpython/Lib/test/support/__init__.py:738: ResourceWarning: unclosed file <_io.FileIO name=10 mode='rb' closefd=True>
  gc.collect()
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/workspaces/cpython/Lib/asyncio/unix_events.py:565: ResourceWarning: unclosed transport <_UnixReadPipeTransport closing fd=10 open>
  _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
test_asyncio failed (env changed)

== Tests result: SUCCESS ==

1 test altered the execution environment:
    test_asyncio

Total duration: 4.4 sec
Tests result: SUCCESS

@gvanrossum
Copy link
Member

gvanrossum commented Oct 21, 2022

Hm, I cannot repro that on my Mac. When I keep the test and remove the fix, all asyncio tests pass. Also, the original test program still fails:

~/cpython$ cat ~/t.py
import asyncio
import sys

PROGRAM = f"{sys.executable} -c 'import time; time.sleep(10000000)'"
PROGRAM = "sleep 1; sleep 1"

async def test():
    process = await asyncio.create_subprocess_shell(
        PROGRAM,
        stdout=asyncio.subprocess.PIPE,
    )
    await asyncio.sleep(1)
    process.kill()
    await process.wait()
    # process._transport.close()

asyncio.run(test())
~/cpython$ ./python.exe ~/t.py
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x10c1a5120>
Traceback (most recent call last):
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 558, in close
    self._close(None)
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 582, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 772, in call_soon
    self._check_closed()
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
~/cpython$ 

If I change the offending test to use

blocking_shell_command = "sleep 1; sleep 1"

then I do get similar tracebacks among the regular test output (but the test somehow still passes):

~/cpython$ ./python.exe -m test test_asyncio.test_subprocess -v -m '*test_kill_issue43884*'
== CPython 3.12.0a0 (heads/generate-ceval-switch:6f344a83d3, Oct 21 2022, 10:24:03) [Clang 14.0.0 (clang-1400.0.29.102)]
== macOS-12.6-x86_64-i386-64bit little-endian
== cwd: /Users/guido/cpython/build/test_python_2010æ
== CPU count: 12
== encodings: locale=UTF-8, FS=utf-8
0:00:00 load avg: 1.88 Run tests sequentially
0:00:00 load avg: 1.88 [1/1] test_asyncio.test_subprocess
test_kill_issue43884 (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests.test_kill_issue43884) ... ok
test_kill_issue43884 (test.test_asyncio.test_subprocess.SubprocessPidfdWatcherTests.test_kill_issue43884) ... skipped 'operating system does not support pidfds'
test_kill_issue43884 (test.test_asyncio.test_subprocess.SubprocessSafeWatcherTests.test_kill_issue43884) ... ok
test_kill_issue43884 (test.test_asyncio.test_subprocess.SubprocessThreadedWatcherTests.test_kill_issue43884) ... ok

----------------------------------------------------------------------
Ran 4 tests in 3.021s

OK (skipped=1)
Warning -- Unraisable exception
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x10e562ca0>
Traceback (most recent call last):
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 558, in close
    self._close(None)
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 582, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 772, in call_soon
    self._check_closed()
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
Warning -- Unraisable exception
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x10e562ca0>
Traceback (most recent call last):
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 558, in close
    self._close(None)
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 582, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 772, in call_soon
    self._check_closed()
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
Warning -- Unraisable exception
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x10e562ca0>
Traceback (most recent call last):
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 558, in close
    self._close(None)
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 582, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 772, in call_soon
    self._check_closed()
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
test_asyncio.test_subprocess failed (env changed)

== Tests result: SUCCESS ==

1 test altered the execution environment:
    test_asyncio.test_subprocess

Total duration: 3.2 sec
Tests result: SUCCESS
~/cpython$ 

And with

blocking_shell_command = "sleep 2"

I get no traceback.

So I still suspect that there's something about the test that's not quite right.

@kumaraditya303
Copy link
Contributor

So I still suspect that there's something about the test that's not quite right.

Alas you are on your own here as I don't have access to mac nor do I use it. Even if the test doesn't fail on mac doesn't the failing test on Linux enough (By failing I mean the tracebacks)? It would be better if you verify this on Linux and take a decision on my PR and then continue investigation about mac.

@kumaraditya303
Copy link
Contributor

kumaraditya303 commented Oct 22, 2022

Also to avoid confusion does the current main without any changes passes all of your tests? By pass I mean no loop closed tracebacks.

@gvanrossum
Copy link
Member

I can try on WSL2 Monday or Tuesday when I am reunited with my Windows laptop.

@kumaraditya303
Copy link
Contributor

kumaraditya303 commented Oct 22, 2022

I can try on WSL2 Monday or Tuesday when I am reunited with my Windows laptop.

Okay but FYI on WSL2 many tests fail randomly especially related to socket and processes. I gave up on this and since then never used WSL2. You can try a VM or codespaces.

@gvanrossum
Copy link
Member

Ah, good idea to try codespaces. Indeed, there the test without the fix shows the spurious traceback.

Nevertheless, if I change the test to have blocking_shell_command = 'sleep 1; sleep 1' it demonstrates the problem both on Mac and on Linux, so maybe that's still better? And then you won't have to bother with os.killpg() -- the orphanes subprocesses will exit after 1-2 seconds.

@kumaraditya303
Copy link
Contributor

Okay thanks for checking! I'll change it your command on non windows as windows does not has sleep command.

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 24, 2022
miss-islington added a commit that referenced this issue Oct 24, 2022
@kumaraditya303
Copy link
Contributor

Fixed by #98491, feel free to reopen if this happens again.

Repository owner moved this from Todo to Done in asyncio Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error
Projects
Status: Done
Development

No branches or pull requests

4 participants