-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
During shutdown, don't let Conmon die with Systemd CGroups #3474
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mheon The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
f596d5d
to
54e13e8
Compare
@@ -18,6 +18,10 @@ func RunUnderSystemdScope(pid int, slice string, unitName string) error { | |||
properties = append(properties, newProp("PIDs", []uint32{uint32(pid)})) | |||
properties = append(properties, newProp("Delegate", true)) | |||
properties = append(properties, newProp("DefaultDependencies", false)) | |||
if zeroTimeout { | |||
var timeout uint64 | |||
properties = append(properties, newProp("TimeoutStopUSec", &timeout)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd the opposite here: set TimeoutStopUSec
to math.MaxUint64
and also add newProp("KillSignal", syscall.SIGUSR1))
that is used internally by conmon and it won't be forwarded to the container process.
In this way we won't lose the retcode from the container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh. I like that. Will do.
137ae13
to
e093197
Compare
Repushed. New version doesn't use 0, and instead uses a longer timeout with a different stop signal that won't kill Conmon. |
I'm gonna knock off the WIP |
e093197
to
5a155ef
Compare
LGTM |
Rebased and repushed, let's see if the tests go green |
5a155ef
to
a32672a
Compare
Hmm. The F29/F30 failures sort of make sense, though I can't replicate locally - systemd cgroups are in use. The ubuntu ones are just flakes, not Cgroup issues - makes sense, no systemd cgroups there. |
☔ The latest upstream changes (presumably #3143) made this pull request unmergeable. Please resolve the merge conflicts. |
@mheon This seems to have gotten lost. Could you rebase? |
Ack. I'll try and push this the rest of the way. |
If we set SIGUSR1, we won't kill Conmon or force a SIGTERM to be sent to the container. Instead, the container will exit as per usual during system shutdown, and conmon will remain alive to record its exit status. Use a 10 minute timeout so we don't permanently halt system shutdown if an error occurs. Signed-off-by: Matthew Heon <[email protected]>
a32672a
to
d2f96d0
Compare
☔ The latest upstream changes (presumably #3959) made this pull request unmergeable. Please resolve the merge conflicts. |
@mheon Needs a rebase. Do we need this for a release? |
Nope. We can go without it.
…On Thu, Sep 12, 2019, 16:18 Daniel J Walsh ***@***.***> wrote:
@mheon <https://github.com/mheon> Needs a rebase. Do we need this for a
release?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3474>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AB3AOCABC7DBKPQVOGJEMIDQJKPZTANCNFSM4H44M6YA>
.
|
Probably not making 1.6.0 |
needs a rebase |
@mheon ping |
@mheon what is the scoop on this one? |
CI errors were proving very difficult to track down, and there wasn't much priority since we resolved the problem in other ways. Still would be nice to get in if I can find time. |
This pull request had no activity for 30 days. In the absence of activity or the "do-not-close" label, the pull request will be automatically closed within 7 days. |
@mheon: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@mheon Still interested in this one? |
This should immediately hit Conmon with a SIGKILL, preventing potential shutdown ordering issues.