Bug 2073021: firstboot: Retry on failure #3070
Conversation
@cgwalters: This pull request references Bugzilla bug 2073021, which is invalid. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh
@cgwalters: This pull request references Bugzilla bug 2073021, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validations were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I didn't test this, to be clear; it should be easy to do so at least manually by jumping into a node during firstboot and killing rpm-ostree or whatever. (It could be scripted via an Ignition config too.)
/test e2e-aws
Review thread on templates/common/_base/units/machine-config-daemon-firstboot.service.yaml (outdated, resolved).
Hmm, did we somehow lose the gating on our jobs? Prow says this just needs an lgtm, but we should definitely be gating on e2e-aws, e2e-gcp-op, and e2e-agnostic-upgrade, right?
Force-pushed from aff77dd to 10d3031.
Kubernetes is all about eventual consistency and control loops. On the daemonset side we keep retrying, but not on firstboot. This exposes us to flakes where, for example, a hypervisor is heavily loaded on firstboot, or there are network issues. Let's just retry forever.
Force-pushed from 10d3031 to fc58309.
OK 🆕 now updated to have the retry loop in the Go code.
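For illustration only, here is a minimal sketch of what a retry-forever loop around the firstboot update could look like in Go. The function name runFirstBootUpdate and the 10-second interval are assumptions made for this sketch, not the actual MCO code.

```go
package main

import (
	"fmt"
	"time"
)

// runFirstBootUpdate is a hypothetical stand-in for the real firstboot
// update step (pulling the OS image and applying it via rpm-ostree).
func runFirstBootUpdate() error {
	// Stub: pretend the update succeeded so the example terminates.
	return nil
}

func main() {
	const retryInterval = 10 * time.Second // example value, not from the PR
	for {
		err := runFirstBootUpdate()
		if err == nil {
			break
		}
		// Transient failures (heavily loaded hypervisor, network blips)
		// should eventually resolve, so just log and retry forever.
		fmt.Printf("firstboot update failed, retrying in %s: %v\n", retryInterval, err)
		time.Sleep(retryInterval)
	}
	fmt.Println("firstboot update complete")
}
```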
Retry lgtm. Maybe we should have a hard timeout in case the user is wondering why the node is stuck? No strong feelings either way
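To illustrate the hard-timeout idea (just a sketch, not something this PR implements), the loop could be bounded by a deadline. This reuses the hypothetical runFirstBootUpdate from the sketch above; the timeout value would be a design decision.

```go
// updateWithDeadline retries until success or until a hard deadline expires.
// Purely illustrative; not part of the actual change.
func updateWithDeadline(timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		err := runFirstBootUpdate()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("firstboot update still failing after %s: %w", timeout, err)
		}
		fmt.Printf("firstboot update failed, retrying in %s: %v\n", interval, err)
		time.Sleep(interval)
	}
}
```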
Tested the new change; the error is not present any more. I first checked that the OS image had not been pulled into /run/mco-machine-os-content/. On retry, the command executed successfully this time, and I could also see that the image was downloaded.
Thanks for testing! We should still try to figure out why the initial image pull is failing in your environment. If it is slow disk I/O, I suspect you're going to have a rather painful experience trying to use the cluster. Anyways, @yuqi-zhang mind dropping an lgtm?
On a higher-level topic... I was thinking about this a bit, and I find myself wondering why we didn't take an approach where kubelet does run on firstboot and joins the cluster, just unschedulable/tainted, with things special-cased so that only the MCD daemonset runs to perform an OS update. It seems like this would simplify a whole lot of things and would have avoided this bug in the first place.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: cgwalters, yuqi-zhang. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest-required Please review the full test history for this PR and help us cut down flakes.
10 similar comments
@cgwalters: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@cgwalters: All pull requests linked via external trackers have merged: Bugzilla bug 2073021 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.