Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change timeout kill signals to SIGQUIT from SIGTERM #45864

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

staticfloat
Copy link
Member

We've been having a lot of timeouts on CI recently. Our coredumps might
be helpful in tracking these down, but when we send SIGTERM or
SIGKILL we don't get coredumps. This allows customization of which
signal is sent during these timeout messages, to allow for CI to set it
to SIGSEGV or similar, to force core dumps for debugging purposes.

staticfloat added a commit to JuliaCI/julia-buildkite that referenced this pull request Jun 29, 2022
Copy link
Member

@DilumAluthge DilumAluthge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a convention in this repo to do JULIA_TEST_* instead of JULIA_TESTING_*, so it would be good to be consistent.

staticfloat added a commit to JuliaCI/julia-buildkite that referenced this pull request Jun 29, 2022
stdlib/Distributed/src/managers.jl Outdated Show resolved Hide resolved
@staticfloat
Copy link
Member Author

Canceling buildkite build, since it's truly being tested here: JuliaCI/julia-buildkite#165

DilumAluthge pushed a commit to JuliaCI/julia-buildkite that referenced this pull request Jul 10, 2022
DilumAluthge pushed a commit to JuliaCI/julia-buildkite that referenced this pull request Jul 10, 2022
@DilumAluthge
Copy link
Member

@vtjnash Could you explain why this is a bad idea?

@vtjnash
Copy link
Member

vtjnash commented Jul 25, 2022

Seemed somewhat awkward for the API, but I didn't really think much about it so I didn't leave a review. It also may mean we won't generate a backtrace before exiting anymore

Unrelatedly: currently this appears to be also killing much too fast, so our CI logs are getting incorrectly littered with errors from this, and leaving behind some tmp files that seemed to sometimes now cause some issues for the machines

DilumAluthge pushed a commit to JuliaCI/julia-buildkite that referenced this pull request Nov 2, 2022
@DilumAluthge
Copy link
Member

@staticfloat Can you rebase and fix the merge conflicts?

DilumAluthge pushed a commit to JuliaCI/julia-buildkite that referenced this pull request Nov 5, 2022
@staticfloat
Copy link
Member Author

@vtjnash now that #47459 has made it easier for us to get coredumps on SIGQUIT, I think this PR makes even more sense. We would define JULIA_TEST_TIMEOUT_SIGNUM=3 (for SIGQUIT) and get core dumps whenever we hit a timeout in the test suite. Do you agree?

staticfloat added a commit to JuliaCI/julia-buildkite that referenced this pull request Nov 7, 2022
SIGQUIT generally causes a process to coredump, so let's use that as the
termination signal throughout our test suite.

X-ref: JuliaLang/julia#45864
@staticfloat staticfloat changed the title Support customization of termination signal for timeouts Change timeout kill signals to SIGQUIT from SIGTERM Nov 8, 2022
@staticfloat
Copy link
Member Author

After talking it over with Jameson, he recommended that we skip the whole customization thing and just unconditionally use SIGQUIT.

Copy link
Member

@DilumAluthge DilumAluthge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you hard-coding SIGQUIT, instead of reading it from an environment variable?

@staticfloat staticfloat requested a review from vtjnash November 8, 2022 20:15
@DilumAluthge
Copy link
Member

😂 Left my comment before I saw your comment.

Copy link
Member

@DilumAluthge DilumAluthge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@staticfloat
Copy link
Member Author

Jameson tells me that one complication with this is that we may not be able to use SIGQUIT on Windows.

@DilumAluthge
Copy link
Member

Use SIGTERM on Windows, and SIGQUIT otherwise?

@DilumAluthge
Copy link
Member

How does one even trigger a core dump on Windows?

@vtjnash
Copy link
Member

vtjnash commented Nov 9, 2022

There are a couple library functions for that I believe (notably ReportFault) https://learn.microsoft.com/en-us/windows/win32/dxtecharts/crash-dump-analysis

@DilumAluthge
Copy link
Member

Ah, that's right... IIRC Keno had some code in julia-buildbot for getting minidumps on the Windows Buildbots. We should port that over to Buildkite.

I don't think that needs to block this PR though.

@staticfloat
Copy link
Member Author

So Jameson what do you suggest here? Should we embed Sys.iswindows() ? SIGTERM : SIGQUIT everywhere? Should we define Base.SIGQUIT = Base.SIGTERM? Should we add a patch to libuv to support SIGQUIT in the same way it supports SIGTERM?

@vtjnash
Copy link
Member

vtjnash commented Nov 9, 2022

I think we can patch libuv to make this fatal, but currently to define SIGQUIT=SIGTERM also

@vtjnash
Copy link
Member

vtjnash commented Nov 9, 2022

For extra bonus points, you could even make a dump file when sending SIGQUIT (since there is an API function for doing just that)

@staticfloat
Copy link
Member Author

Here's my initial stab at this: JuliaLang/libuv#30

staticfloat added a commit that referenced this pull request Nov 17, 2022
This commit provides two pieces; first, it enables building libuv with
`SIGQUIT` support on Windows (useful for [0]), and second it enables
coredumping when exceptions are hit, after our exception handler has
finished printing out a backtrace.  This is useful for using WER to
create dump files when we segfault, etc...

This commit requires libuv PRs [1] and [2] to properly function.

[0] #45864
[1] JuliaLang/libuv#30
[2] JuliaLang/libuv#31
@staticfloat
Copy link
Member Author

Switching this PR to draft until the following three PRs are merged:

@staticfloat staticfloat marked this pull request as draft November 17, 2022 16:23
@vtjnash
Copy link
Member

vtjnash commented Jun 20, 2024

Seems like those PRs are done now

We've been having a lot of timeouts on CI recently.  Our coredumps might
be helpful in tracking these down, but when we send `SIGTERM` or
`SIGKILL` we don't get coredumps.  This changes timeout signals to
generally use `SIGQUIT` instead of `SIGTERM`, when sending the first kill signal.
staticfloat added a commit to JuliaLang/Distributed.jl that referenced this pull request Jun 20, 2024
This allows for easier collection of core dumps.
X-ref: JuliaLang/julia#45864
@staticfloat
Copy link
Member Author

Okay, I rebased and moved the Distributed changes over to here: JuliaLang/Distributed.jl#103

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants