
Bug 2073021: firstboot: Retry on failure #3070

Merged
1 commit merged into openshift:master on Apr 15, 2022

Conversation

cgwalters
Member

Kubernetes is all about eventual consistency and control loops. On the
daemonset side, we will keep retrying, but not on firstboot.

This exposes us to flakes when e.g. a hypervisor is heavily loaded
on firstboot, or there are network issues, etc.

Let's just retry forever.
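
Conceptually the change is just a retry loop around the firstboot update. Here is a minimal Go sketch of that pattern, under assumed names (updateOS, retryInterval); it is illustrative only, not the actual machine-config-operator code:

package main

import (
	"log"
	"time"
)

// retryInterval mirrors the "Sleeping 1 minute for retry" behavior
// seen in the test logs later in this thread.
const retryInterval = time.Minute

// updateOS is a stand-in for the firstboot work that can fail
// transiently (pulling the OS content image, rpm-ostree rebase, etc.).
func updateOS() error {
	return nil
}

func main() {
	// Retry forever: transient failures (loaded hypervisor, network
	// blips) eventually resolve; a permanent failure leaves the node
	// visibly looping here instead of failing once and giving up.
	for {
		err := updateOS()
		if err == nil {
			break
		}
		log.Printf("error: %v; sleeping %v for retry", err, retryInterval)
		time.Sleep(retryInterval)
	}
}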

@openshift-ci openshift-ci bot added the bugzilla/severity-high (Referenced Bugzilla bug's severity is high for the branch this PR is targeting.) and bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.) labels on Apr 7, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 7, 2022

@cgwalters: This pull request references Bugzilla bug 2073021, which is invalid:

  • expected the bug to target the "4.11.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 2073021: firstboot: Retry on failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

/bugzilla refresh

@openshift-ci openshift-ci bot requested review from cheesesashimi and jkyros April 7, 2022 14:47
@openshift-ci openshift-ci bot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) labels, and removed the bugzilla/invalid-bug label, on Apr 7, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 7, 2022

@cgwalters: This pull request references Bugzilla bug 2073021, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

To be clear, I didn't test this; it should be easy to do so at least manually, by jumping into a node during firstboot and killing rpm-ostree or whatever. (It could be scripted via an Ignition config too, I guess.)

@kikisdeliveryservice
Contributor

/test e2e-aws

@cgwalters
Member Author

Hmm, did we somehow lose the gating on our jobs? Prow says this just needs an lgtm, but we should definitely be gating on e2e-aws, e2e-gcp-op, and e2e-agnostic-upgrade, right?

@cgwalters
Member Author

OK 🆕 now updated to have the retry loop in the Go code.

@cgwalters cgwalters requested a review from yuqi-zhang April 12, 2022 18:50
@yuqi-zhang
Contributor

Retry lgtm. Maybe we should have a hard timeout in case the user is wondering why the node is stuck? No strong feelings either way
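
For comparison, a hard cap like the one floated here could wrap the same loop with a deadline. A hypothetical Go sketch (the PR as merged retries forever, with no cap; all names here are illustrative):

package main

import (
	"fmt"
	"log"
	"time"
)

const retryInterval = time.Minute

// retryWithDeadline is a hypothetical variant of the firstboot retry
// loop that gives up after maxWait instead of retrying forever.
func retryWithDeadline(attempt func() error, maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	for {
		err := attempt()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("giving up after %v: %w", maxWait, err)
		}
		log.Printf("error: %v; sleeping %v for retry", err, retryInterval)
		time.Sleep(retryInterval)
	}
}

func main() {
	// Cap retries at 30 minutes; the value is arbitrary, for illustration.
	if err := retryWithDeadline(func() error { return nil }, 30*time.Minute); err != nil {
		log.Fatal(err)
	}
}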

@jinyunma

Tested the new change; the error is no longer present.
When the timeout error occurred:

Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: W0413 05:47:37.205136    2615 firstboot_complete_machineconfig.go:46] error: failed to update OS to registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2129765939/srv/repo:96e553b62c446ec12ef1aa4898bf6a627644ba15ed7dac41652686615842230b --custom-origin-url pivot://registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 --custom-origin-description Managed by machine-config-operator: error: Timeout was reached
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: : exit status 1
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:47:37.205145    2615 firstboot_complete_machineconfig.go:47] Sleeping 1 minute for retry

I checked that the OS image had not been pulled into /run/mco-machine-os-content/:

[root@reliability01-p7lmf-master-1 ~]# ls -ltr /run/mco-machine-os-content/
total 0
[root@reliability01-p7lmf-master-1 ~]#

Then on retry, the command executed successfully, and I could see that the image was downloaded:

[root@reliability01-p7lmf-master-1 ~]# journalctl -u machine-config-daemon-firstboot.service -b -f
-- Logs begin at Wed 2022-04-13 05:43:22 UTC. --
Apr 13 05:46:31 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:46:31.393167    2615 rpm-ostree.go:238] Current origin is not custom
Apr 13 05:46:32 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:46:32.009353    2615 rpm-ostree.go:265] Pivoting to: 411.85.202203282001-0 (96e553b62c446ec12ef1aa4898bf6a627644ba15ed7dac41652686615842230b)
Apr 13 05:46:32 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:46:32.009376    2615 rpm-ostree.go:297] Executing rebase from repo path /run/mco-machine-os-content/os-content-2129765939/srv/repo with customImageURL pivot://registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 and checksum 96e553b62c446ec12ef1aa4898bf6a627644ba15ed7dac41652686615842230b
Apr 13 05:46:32 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:46:32.009388    2615 update.go:1882] Running: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2129765939/srv/repo:96e553b62c446ec12ef1aa4898bf6a627644ba15ed7dac41652686615842230b --custom-origin-url pivot://registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 --custom-origin-description Managed by machine-config-operator
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:47:37.204797    2615 update.go:1114] Updating files
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:47:37.205121    2615 update.go:1179] Deleting stale data
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:47:37.205130    2615 update.go:1927] Removing SIGTERM protection
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: W0413 05:47:37.205136    2615 firstboot_complete_machineconfig.go:46] error: failed to update OS to registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2129765939/srv/repo:96e553b62c446ec12ef1aa4898bf6a627644ba15ed7dac41652686615842230b --custom-origin-url pivot://registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 --custom-origin-description Managed by machine-config-operator: error: Timeout was reached
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: : exit status 1
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:47:37.205145    2615 firstboot_complete_machineconfig.go:47] Sleeping 1 minute for retry
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.208899    2615 update.go:1882] Running: systemctl start rpm-ostreed
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.270676    2615 rpm-ostree.go:325] Running captured: rpm-ostree status --json
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.307741    2615 rpm-ostree.go:325] Running captured: rpm-ostree status --json
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.342257    2615 daemon.go:239] Booted osImageURL:  (411.85.202203181601-0)
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.342718    2615 update.go:1919] Adding SIGTERM protection
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.342922    2615 update.go:504] Checking Reconcilable for config mco-empty-mc to rendered-master-713b9514904a43504845f937ffe26016
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.343392    2615 update.go:1897] Starting update from mco-empty-mc to rendered-master-713b9514904a43504845f937ffe26016: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.346393    2615 update.go:1114] Updating files
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.346402    2615 update.go:1179] Deleting stale data
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.346496    2615 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-3989351951 --registry-config /var/lib/kubelet/config.json registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058
^C
[root@reliability01-p7lmf-master-1 ~]# ls -ltr /run/mco-machine-os-content/
total 0
drwxr-xr-x. 19 root root 480 Apr 13 05:48 os-content-3989351951

@cgwalters
Member Author

Thanks for testing!

We should still try to figure out why the initial image pull is failing in your environment. If it is slow disk I/O, I suspect you're going to have a rather painful experience trying to use the cluster.

Anyway, @yuqi-zhang, mind dropping an lgtm?

@cgwalters
Member Author

On a higher-level topic: I was thinking about this a bit, and I actually find myself wondering why we didn't take an approach where the kubelet does run on firstboot and joins the cluster; it would just join unschedulable/tainted, and we would special-case things so that only the MCD daemonset runs to perform an OS update.

Seems like this would simplify a whole lot of things, and would have avoided this bug in the first place.

@yuqi-zhang
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged.) label on Apr 13, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 13, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

7 similar comments

@openshift-ci
Contributor

openshift-ci bot commented Apr 15, 2022

@cgwalters: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 4c40e0d into openshift:master Apr 15, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 15, 2022

@cgwalters: All pull requests linked via external trackers have merged:

Bugzilla bug 2073021 has been moved to the MODIFIED state.

In response to this:

Bug 2073021: firstboot: Retry on failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
