During shutdown, don't let Conmon die with Systemd CGroups #3474

Closed
wants to merge 1 commit into from


mheon
Member

@mheon mheon commented Jul 2, 2019

This should immediately hit Conmon with a SIGKILL, preventing potential shutdown ordering issues.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 2, 2019
@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mheon


@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS labels Jul 2, 2019
@mheon mheon force-pushed the zero_conmon_stop_timeout branch from f596d5d to 54e13e8 Compare July 2, 2019 14:43
@@ -18,6 +18,10 @@ func RunUnderSystemdScope(pid int, slice string, unitName string) error {
	properties = append(properties, newProp("PIDs", []uint32{uint32(pid)}))
	properties = append(properties, newProp("Delegate", true))
	properties = append(properties, newProp("DefaultDependencies", false))
	if zeroTimeout {
		var timeout uint64
		properties = append(properties, newProp("TimeoutStopUSec", &timeout))
Member


I'd do the opposite here: set TimeoutStopUSec to math.MaxUint64 and also add newProp("KillSignal", syscall.SIGUSR1). SIGUSR1 is used internally by conmon and won't be forwarded to the container process.

In this way we won't lose the retcode from the container.
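
A minimal sketch of what this suggestion could look like, not the code from the PR. The `prop` type and this `newProp` are hypothetical stand-ins for the D-Bus property helper used in the diff, and the value types (uint64 for TimeoutStopUSec, int32 for KillSignal) are assumptions about what systemd expects. It builds on Linux, where `syscall.SIGUSR1` is defined.

```go
package main

import (
	"fmt"
	"math"
	"syscall"
)

// prop is a hypothetical stand-in for a systemd D-Bus unit property.
// The real code in the PR uses its own newProp helper.
type prop struct {
	Name  string
	Value interface{}
}

// newProp mirrors the shape of the helper seen in the diff (assumed signature).
func newProp(name string, value interface{}) prop {
	return prop{Name: name, Value: value}
}

// stopProps returns the two properties proposed in the review comment:
// an effectively unlimited stop timeout, and SIGUSR1 as the stop signal
// so that systemd does not send SIGTERM to conmon when stopping the scope.
func stopProps() []prop {
	return []prop{
		// Assumption: the property takes a uint64 number of microseconds.
		newProp("TimeoutStopUSec", uint64(math.MaxUint64)),
		// Assumption: the property takes the signal number as an int32.
		newProp("KillSignal", int32(syscall.SIGUSR1)),
	}
}

func main() {
	for _, p := range stopProps() {
		fmt.Printf("%s=%v\n", p.Name, p.Value)
	}
}
```

With these two properties the scope would not be hit with SIGTERM or SIGKILL on stop, which is what lets conmon stay alive long enough to record the container's exit code.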

Member Author


Oooh. I like that. Will do.

@mheon mheon force-pushed the zero_conmon_stop_timeout branch from 137ae13 to e093197 Compare July 3, 2019 14:32
@mheon
Member Author

mheon commented Jul 3, 2019

Repushed. New version doesn't use 0, and instead uses a longer timeout with a different stop signal that won't kill Conmon.

@mheon
Member Author

mheon commented Jul 3, 2019

I'm gonna knock off the WIP

@mheon mheon changed the title [WIP] Specify a 0 timeout for Conmon under systemd cgroups Specify a 0 timeout for Conmon under systemd cgroups Jul 3, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 3, 2019
@mheon mheon changed the title Specify a 0 timeout for Conmon under systemd cgroups During shutdown, don't let Conmon die with Systemd CGroups Jul 3, 2019
@mheon mheon force-pushed the zero_conmon_stop_timeout branch from e093197 to 5a155ef Compare July 3, 2019 17:18
@giuseppe
Member

giuseppe commented Jul 3, 2019

LGTM

@mheon
Member Author

mheon commented Jul 21, 2019

Rebased and repushed, let's see if the tests go green

@mheon mheon force-pushed the zero_conmon_stop_timeout branch from 5a155ef to a32672a Compare July 21, 2019 02:47
@mheon
Member Author

mheon commented Jul 21, 2019

Hmm.

The F29/F30 failures sort of make sense, though I can't replicate locally - systemd cgroups are in use.

The ubuntu ones are just flakes, not Cgroup issues - makes sense, no systemd cgroups there.

@rh-atomic-bot
Collaborator

☔ The latest upstream changes (presumably #3143) made this pull request unmergeable. Please resolve the merge conflicts.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 22, 2019
@rhatdan
Member

rhatdan commented Aug 22, 2019

@mheon This seems to have gotten lost. Could you rebase?

@mheon
Member Author

mheon commented Aug 22, 2019

Ack. I'll try and push this the rest of the way.

If we set SIGUSR1, we won't kill Conmon or force a SIGTERM to be
sent to the container. Instead, the container will exit as per
usual during system shutdown, and conmon will remain alive to
record its exit status.

Use a 10 minute timeout so we don't permanently halt system
shutdown if an error occurs.

Signed-off-by: Matthew Heon <[email protected]>
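
As a rough illustration of the bounded timeout described in the commit message above, the sketch below converts a 10 minute duration into the microsecond value a TimeoutStopUSec-style property would carry. The names `stopTimeout` and `toUSec` are hypothetical and do not come from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// stopTimeout mirrors the 10 minute bound from the commit message.
// The constant name is hypothetical.
const stopTimeout = 10 * time.Minute

// toUSec converts a duration to whole microseconds, clamping negative
// durations to zero so the result always fits an unsigned value.
func toUSec(d time.Duration) uint64 {
	if d < 0 {
		return 0
	}
	return uint64(d / time.Microsecond)
}

func main() {
	fmt.Println("TimeoutStopUSec =", toUSec(stopTimeout))
}
```

Ten minutes works out to 600000000 microseconds. A finite value like this trades a small risk of cutting off a slow container for the guarantee that shutdown cannot hang indefinitely.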
@mheon mheon force-pushed the zero_conmon_stop_timeout branch from a32672a to d2f96d0 Compare September 10, 2019 15:17
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 10, 2019
@rh-atomic-bot
Collaborator

☔ The latest upstream changes (presumably #3959) made this pull request unmergeable. Please resolve the merge conflicts.

@rhatdan
Member

rhatdan commented Sep 12, 2019

@mheon Needs a rebase. Do we need this for a release?

@mheon
Member Author

mheon commented Sep 12, 2019 via email

@mheon
Member Author

mheon commented Sep 13, 2019

Probably not making 1.6.0

@giuseppe
Member

needs a rebase

@baude
Member

baude commented Oct 22, 2019

@mheon ping

@rhatdan
Member

rhatdan commented Nov 7, 2019

@mheon what is the scoop on this one?

@mheon
Member Author

mheon commented Nov 7, 2019

CI errors were proving very difficult to track down, and there wasn't much priority since we resolved the problem in other ways. Still would be nice to get in if I can find time.

@github-actions

github-actions bot commented Dec 8, 2019

This pull request had no activity for 30 days. In the absence of activity or the "do-not-close" label, the pull request will be automatically closed within 7 days.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 8, 2019
@openshift-ci-robot
Collaborator

@mheon: PR needs rebase.


@rhatdan
Member

rhatdan commented Dec 8, 2019

@mheon Still interested in this one?

@github-actions github-actions bot closed this Dec 16, 2019
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 26, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 26, 2023
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
stale-pr
6 participants