AMD64 Ubuntu Shared 3.x: python processes killed with SIGKILL by Linux Out-of-Memory (OOM), maybe related to test_asyncio #98407

vstinner · 2022-10-18T16:52:30Z

AMD64 Ubuntu Shared 3.x buildbot started to fail today:

David Bolen, the buildbot owner, found logs of Linux OOM:

Oct 18 11:27:39 buildbot-ubuntu kernel: [34420581.683539] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=python,pid=18752,uid=1000
Oct 18 11:27:39 buildbot-ubuntu kernel: [34420581.683565] Out of memory: Killed process 18752 (python) total-vm:1381868kB, anon-rss:725500kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:1748kB oom_score_adj:0
Oct 18 11:27:39 buildbot-ubuntu kernel: [34420581.795140] oom_reaper: reaped process 18752 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

and also 340 processes like python -c import time; time.sleep(100000000). These processes are likely spawned by test_kill_issue43884() of test_asyncio.test_subprocess.

Maybe PR #32073 introduced a regression test_asyncio. My comment on the PR: #32073 (comment)

Emails about this issue on buildbot-status mailing list: https://mail.python.org/archives/list/[email protected]/thread/ZLG6D5L4SJVS6VHHCC6S2P4OZ63SCXO3/

The text was updated successfully, but these errors were encountered:

gvanrossum · 2022-10-18T17:11:49Z

This makes me think that jobs like "sleep(1000000)" should really use something like the issue number so they can be identified more easily. :-)

Anyway, it's likely that these are the result of the kill not really killing the subprocess. I'm asking @kumaraditya303 to look into it.

kumaraditya303 · 2022-10-18T18:14:05Z

I am able to reproduce this, I reverted #32073 locally and tested and I am able to reproduce this even without the fix so this is a different regression. I feel that the test has uncovered an old bug as this is first test which tests killing of a process with shell=True.

gvanrossum · 2022-10-18T19:17:29Z

I'm confused. Are you saying that after reverting #32073 you can still see Python processes running time.sleep(100000000)? Where are those started then? That PR adds a brand new test that starts such processes -- are there other tests that also run the same test command? I cannot find any other occurrences of that string in Lib/test.

kumaraditya303 · 2022-10-19T11:09:57Z

I reverted #32073 and manually added the test_kill_issue43884 test to see if it is caused by #32073 or not. It is not caused by the PR and I conclude that the fix is not related to this (which makes sense since that PR only changes when proc.wait callbacks are called).

gvanrossum · 2022-10-19T15:48:19Z

I think the point is that the test itself leaves processes behind, and I think you've just proved that. So what's wrong with the test? We don't want to leave those processes behind.

kumaraditya303 · 2022-10-19T18:04:12Z

So what's wrong with the test? We don't want to leave those processes behind.

Yes the processes should not be left behind but I don't know what is causing it. asyncio delegates killing of process and all other stuff to subprocess module, maybe a bug in subprocess module? I am not sure.

gvanrossum · 2022-10-19T18:09:13Z

If it was a general subprocess bug we'd have heard of it before.

Maybe you can put something in the subprocess so it handles SIGKILL and prints something when it arrives?

vstinner · 2022-10-19T22:26:06Z

test_asyncio leaks child processes, the system has less free memory available, and so Linux kills Python test processes with SIGKILL. The root issue is that test_asyncio leaks child processes.

gvanrossum · 2022-10-20T00:56:27Z

So the process being killed is the process that just spawned a subprocess, and then that subprocess is leaked? That seems perverse. But where does it leak processen in the first place?

db3l · 2022-10-20T01:09:10Z

It appears so, yes.

The use of create_subprocess_shell is presumably creating (via Popen's shell=True) a parent sh process and then the python process. It's only the latter which is leaked (and those I see are all owned by init) so presumably the general flow and kill processing is working to kill off the immediately child sh process. So the question is what's keeping the child python process around?

Granted it's stuck in the sleep (kernel hrtime), but that seems fine with a very similar command in test_kill (and doesn't seem to be a problem in any interactive tests I try).

Although, while the regular test_kill uses a very similar command, it also uses create_subprocess_exec and no pipes so directly executes (and presumably kills) python. If having the intermediate shell involved isn't crucial to whatever this test is fixing, it might be a quick workaround barring figuring out what's going on under the covers.

gvanrossum · 2022-10-20T02:27:53Z

Oh. So maybe the fix is just to change test_kill_issue43884 to use create_subprocess_exec?

db3l · 2022-10-20T03:27:02Z

For what it's worth, in a quick test on the buildbot changing to create_subprocess_exec that does appear to avoid any stranded processes. It seems the test (and PR) are focused on the asyncio side of cleanup after process exit, so it seems that either creation call would serve, but I'm not positive. The sample code in the original issue did use create_subprocess_shell.

If changed, it then appears the new test is just test_kill plus an asyncio.sleep(1). The windows process group stuff would be unnecessary given a single child process).

gvanrossum · 2022-10-20T04:41:37Z

Here's my theory (perhaps not news to you): maybe in UNIX we're not creating a process group, so the proc.kill() call kills the shell subprocess only, using SIGKILL, and since SIGKILL cannot be ignored or caught (I'd forgotten that detail until just now), the sleep(1000000) process is orphaned reliably whenever that test runs. So the kernel isn't OOM-killing anything -- the test just creates orphans evey time it runs. OR, alternatively, the shell does create a process group, but proc.kill() doesn't kill the process group, it just kills the target process (the shell), still creating orphans. What do you think?

If that's the case the only solution might be to modify either asyncio or subprocess.py to use the killpg syscall instead of kill?

Meanwhile, I cannot get the test to fail with the corresponding fix from commit 7015e13 (PR #32073) reverted. So now we appear to have a fix and a test but the test doesn't appear to demonstrate the bug that the fix is supposed to fix, but the test does create additional problems.

Further investigations show: Using the original test program (https://bugs.python.org/file49965/kill_subprocess.py) I can demonstrate the problem. This test program uses sleep 2 && echo done. If I change that to just sleep 2 the program no longer demonstrates the bug, so it appears to be specific to something the shell does when one program follows another. Using sleep 1; sleep 1 I can also repro it. But using "python3 -c 'import time; time.sleep(2); print('done')'" I see no repro. My conclusion is that the repro is pretty finicky, and the current test fails to reproduce the bug properly.

All these repros are with 3.11rc2 and earlier (3.10, 3.9). With current main the test program no longer shows the offending traceback. So I conclude that the fix works, but we need to skip the test until we know more.

kumaraditya303 · 2022-10-20T16:03:19Z

PR #98491 fixes this by using a process group. Longer term I would like to fix this in asyncio to use process groups by default when spawning shells. I am short on time ATM so will investigate later on.

kumaraditya303 · 2022-10-20T16:13:08Z

Meanwhile, I cannot get the test to fail with the corresponding fix from commit 7015e13 (PR #32073) reverted. So now we appear to have a fix and a test but the test doesn't appear to demonstrate the bug that the fix is supposed to fix, but the test does create additional problems.

If I revert the PR, I get the following traceback check RuntimeError: Event loop is closed. It reproduces the bug mentioned in that issue.

ResourceWarning: Enable tracemalloc to get the object allocation traceback
Warning -- Unraisable exception
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f437da7b3e0>
Traceback (most recent call last):
  File "/workspaces/cpython/Lib/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/workspaces/cpython/Lib/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/workspaces/cpython/Lib/asyncio/unix_events.py", line 561, in close
    self._close(None)
  File "/workspaces/cpython/Lib/asyncio/unix_events.py", line 585, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/workspaces/cpython/Lib/asyncio/base_events.py", line 772, in call_soon
    self._check_closed()
  File "/workspaces/cpython/Lib/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
/workspaces/cpython/Lib/test/support/__init__.py:738: ResourceWarning: unclosed file <_io.FileIO name=10 mode='rb' closefd=True>
  gc.collect()
ResourceWarning: Enable tracemalloc to get the object allocation traceback
/workspaces/cpython/Lib/asyncio/unix_events.py:565: ResourceWarning: unclosed transport <_UnixReadPipeTransport closing fd=10 open>
  _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
test_asyncio failed (env changed)

== Tests result: SUCCESS ==

1 test altered the execution environment:
    test_asyncio

Total duration: 4.4 sec
Tests result: SUCCESS

gvanrossum · 2022-10-21T20:01:33Z

Hm, I cannot repro that on my Mac. When I keep the test and remove the fix, all asyncio tests pass. Also, the original test program still fails:

~/cpython$ cat ~/t.py
import asyncio
import sys

PROGRAM = f"{sys.executable} -c 'import time; time.sleep(10000000)'"
PROGRAM = "sleep 1; sleep 1"

async def test():
    process = await asyncio.create_subprocess_shell(
        PROGRAM,
        stdout=asyncio.subprocess.PIPE,
    )
    await asyncio.sleep(1)
    process.kill()
    await process.wait()
    # process._transport.close()

asyncio.run(test())
~/cpython$ ./python.exe ~/t.py
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x10c1a5120>
Traceback (most recent call last):
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 558, in close
    self._close(None)
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 582, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 772, in call_soon
    self._check_closed()
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
~/cpython$

If I change the offending test to use

blocking_shell_command = "sleep 1; sleep 1"

then I do get similar tracebacks among the regular test output (but the test somehow still passes):

~/cpython$ ./python.exe -m test test_asyncio.test_subprocess -v -m '*test_kill_issue43884*'
== CPython 3.12.0a0 (heads/generate-ceval-switch:6f344a83d3, Oct 21 2022, 10:24:03) [Clang 14.0.0 (clang-1400.0.29.102)]
== macOS-12.6-x86_64-i386-64bit little-endian
== cwd: /Users/guido/cpython/build/test_python_2010æ
== CPU count: 12
== encodings: locale=UTF-8, FS=utf-8
0:00:00 load avg: 1.88 Run tests sequentially
0:00:00 load avg: 1.88 [1/1] test_asyncio.test_subprocess
test_kill_issue43884 (test.test_asyncio.test_subprocess.SubprocessFastWatcherTests.test_kill_issue43884) ... ok
test_kill_issue43884 (test.test_asyncio.test_subprocess.SubprocessPidfdWatcherTests.test_kill_issue43884) ... skipped 'operating system does not support pidfds'
test_kill_issue43884 (test.test_asyncio.test_subprocess.SubprocessSafeWatcherTests.test_kill_issue43884) ... ok
test_kill_issue43884 (test.test_asyncio.test_subprocess.SubprocessThreadedWatcherTests.test_kill_issue43884) ... ok

----------------------------------------------------------------------
Ran 4 tests in 3.021s

OK (skipped=1)
Warning -- Unraisable exception
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x10e562ca0>
Traceback (most recent call last):
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 558, in close
    self._close(None)
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 582, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 772, in call_soon
    self._check_closed()
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
Warning -- Unraisable exception
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x10e562ca0>
Traceback (most recent call last):
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 558, in close
    self._close(None)
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 582, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 772, in call_soon
    self._check_closed()
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
Warning -- Unraisable exception
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x10e562ca0>
Traceback (most recent call last):
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 126, in __del__
    self.close()
  File "/Users/guido/cpython/Lib/asyncio/base_subprocess.py", line 104, in close
    proto.pipe.close()
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 558, in close
    self._close(None)
  File "/Users/guido/cpython/Lib/asyncio/unix_events.py", line 582, in _close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 772, in call_soon
    self._check_closed()
  File "/Users/guido/cpython/Lib/asyncio/base_events.py", line 519, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
test_asyncio.test_subprocess failed (env changed)

== Tests result: SUCCESS ==

1 test altered the execution environment:
    test_asyncio.test_subprocess

Total duration: 3.2 sec
Tests result: SUCCESS
~/cpython$

And with

blocking_shell_command = "sleep 2"

I get no traceback.

So I still suspect that there's something about the test that's not quite right.

kumaraditya303 · 2022-10-22T05:09:27Z

So I still suspect that there's something about the test that's not quite right.

Alas you are on your own here as I don't have access to mac nor do I use it. Even if the test doesn't fail on mac doesn't the failing test on Linux enough (By failing I mean the tracebacks)? It would be better if you verify this on Linux and take a decision on my PR and then continue investigation about mac.

kumaraditya303 · 2022-10-22T05:11:57Z

Also to avoid confusion does the current main without any changes passes all of your tests? By pass I mean no loop closed tracebacks.

gvanrossum · 2022-10-22T15:26:03Z

I can try on WSL2 Monday or Tuesday when I am reunited with my Windows laptop.

kumaraditya303 · 2022-10-22T15:35:35Z

I can try on WSL2 Monday or Tuesday when I am reunited with my Windows laptop.

Okay but FYI on WSL2 many tests fail randomly especially related to socket and processes. I gave up on this and since then never used WSL2. You can try a VM or codespaces.

gvanrossum · 2022-10-22T18:18:55Z

Ah, good idea to try codespaces. Indeed, there the test without the fix shows the spurious traceback.

Nevertheless, if I change the test to have blocking_shell_command = 'sleep 1; sleep 1' it demonstrates the problem both on Mac and on Linux, so maybe that's still better? And then you won't have to bother with os.killpg() -- the orphanes subprocesses will exit after 1-2 seconds.

kumaraditya303 · 2022-10-23T04:11:06Z

Okay thanks for checking! I'll change it your command on non windows as windows does not has sleep command.

)

pythonGH-98491) (cherry picked from commit 3b2724a) Co-authored-by: Kumar Aditya <[email protected]>

…98491) (cherry picked from commit 3b2724a) Co-authored-by: Kumar Aditya <[email protected]>

kumaraditya303 · 2022-10-25T13:04:04Z

Fixed by #98491, feel free to reopen if this happens again.

vstinner added the type-bug An unexpected behavior, bug, or error label Oct 18, 2022

vstinner mentioned this issue Oct 18, 2022

gh-88050: Fix asyncio subprocess kill process cleanly when process is blocked #32073

Merged

gvanrossum added the topic-asyncio label Oct 18, 2022

ezio-melotti added this to asyncio Oct 18, 2022

ezio-melotti moved this to Todo in asyncio Oct 18, 2022

gvanrossum assigned kumaraditya303 Oct 18, 2022

bedevere-bot mentioned this issue Oct 20, 2022

GH-98407: fix test_kill_issue43884 to not leak child processes #98491

Merged

gvanrossum pushed a commit that referenced this issue Oct 24, 2022

GH-98407: fix test_kill_issue43884 to not leak child processes (#98491

3b2724a

)

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 24, 2022

pythonGH-98407: fix test_kill_issue43884 to not leak child processes (

053dab5

pythonGH-98491) (cherry picked from commit 3b2724a) Co-authored-by: Kumar Aditya <[email protected]>

bedevere-bot mentioned this issue Oct 24, 2022

[3.11] GH-98407: fix test_kill_issue43884 to not leak child processes (GH-98491) #98613

Merged

miss-islington added a commit that referenced this issue Oct 24, 2022

GH-98407: fix test_kill_issue43884 to not leak child processes (GH-…

3a1eb81

…98491) (cherry picked from commit 3b2724a) Co-authored-by: Kumar Aditya <[email protected]>

kumaraditya303 closed this as completed Oct 25, 2022

Repository owner moved this from Todo to Done in asyncio Oct 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AMD64 Ubuntu Shared 3.x: python processes killed with SIGKILL by Linux Out-of-Memory (OOM), maybe related to test_asyncio #98407

AMD64 Ubuntu Shared 3.x: python processes killed with SIGKILL by Linux Out-of-Memory (OOM), maybe related to test_asyncio #98407

vstinner commented Oct 18, 2022

gvanrossum commented Oct 18, 2022

kumaraditya303 commented Oct 18, 2022

gvanrossum commented Oct 18, 2022

kumaraditya303 commented Oct 19, 2022

gvanrossum commented Oct 19, 2022

kumaraditya303 commented Oct 19, 2022

gvanrossum commented Oct 19, 2022

vstinner commented Oct 19, 2022

gvanrossum commented Oct 20, 2022

db3l commented Oct 20, 2022

gvanrossum commented Oct 20, 2022

db3l commented Oct 20, 2022

gvanrossum commented Oct 20, 2022

kumaraditya303 commented Oct 20, 2022 •

edited by gvanrossum

Loading

kumaraditya303 commented Oct 20, 2022

gvanrossum commented Oct 21, 2022 •

edited

Loading

kumaraditya303 commented Oct 22, 2022

kumaraditya303 commented Oct 22, 2022 •

edited

Loading

gvanrossum commented Oct 22, 2022

kumaraditya303 commented Oct 22, 2022 •

edited

Loading

gvanrossum commented Oct 22, 2022

kumaraditya303 commented Oct 23, 2022

kumaraditya303 commented Oct 25, 2022

AMD64 Ubuntu Shared 3.x: python processes killed with SIGKILL by Linux Out-of-Memory (OOM), maybe related to test_asyncio #98407

AMD64 Ubuntu Shared 3.x: python processes killed with SIGKILL by Linux Out-of-Memory (OOM), maybe related to test_asyncio #98407

Comments

vstinner commented Oct 18, 2022

gvanrossum commented Oct 18, 2022

kumaraditya303 commented Oct 18, 2022

gvanrossum commented Oct 18, 2022

kumaraditya303 commented Oct 19, 2022

gvanrossum commented Oct 19, 2022

kumaraditya303 commented Oct 19, 2022

gvanrossum commented Oct 19, 2022

vstinner commented Oct 19, 2022

gvanrossum commented Oct 20, 2022

db3l commented Oct 20, 2022

gvanrossum commented Oct 20, 2022

db3l commented Oct 20, 2022

gvanrossum commented Oct 20, 2022

kumaraditya303 commented Oct 20, 2022 • edited by gvanrossum Loading

kumaraditya303 commented Oct 20, 2022

gvanrossum commented Oct 21, 2022 • edited Loading

kumaraditya303 commented Oct 22, 2022

kumaraditya303 commented Oct 22, 2022 • edited Loading

gvanrossum commented Oct 22, 2022

kumaraditya303 commented Oct 22, 2022 • edited Loading

gvanrossum commented Oct 22, 2022

kumaraditya303 commented Oct 23, 2022

kumaraditya303 commented Oct 25, 2022

kumaraditya303 commented Oct 20, 2022 •

edited by gvanrossum

Loading

gvanrossum commented Oct 21, 2022 •

edited

Loading

kumaraditya303 commented Oct 22, 2022 •

edited

Loading

kumaraditya303 commented Oct 22, 2022 •

edited

Loading