
Bug 2073021: firstboot: Retry on failure #3070

Merged
1 commit merged into openshift:master on Apr 15, 2022

Conversation

cgwalters
Member

Kubernetes is all about eventual consistency and control loops. On the
daemonset side, we will keep retrying, but not on firstboot.

This exposes us to flakes when e.g. a hypervisor is heavily loaded
on firstboot, or there are network issues, etc.

Let's just retry forever.
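
Conceptually the change is just a retry loop around the firstboot update. Here is a minimal Go sketch of that pattern, under assumed names (updateOS, retryInterval); it is illustrative only, not the actual machine-config-operator code:

package main

import (
	"log"
	"time"
)

// retryInterval mirrors the "Sleeping 1 minute for retry" behavior
// seen in the test logs later in this thread.
const retryInterval = time.Minute

// updateOS is a stand-in for the firstboot work that can fail
// transiently (pulling the OS content image, rpm-ostree rebase, etc.).
func updateOS() error {
	return nil
}

func main() {
	// Retry forever: transient failures (loaded hypervisor, network
	// blips) eventually resolve; a permanent failure leaves the node
	// visibly looping here instead of failing once and giving up.
	for {
		err := updateOS()
		if err == nil {
			break
		}
		log.Printf("error: %v; sleeping %v for retry", err, retryInterval)
		time.Sleep(retryInterval)
	}
}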

@openshift-ci openshift-ci bot added the bugzilla/severity-high (Referenced Bugzilla bug's severity is high for the branch this PR is targeting.) and bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.) labels on Apr 7, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 7, 2022

@cgwalters: This pull request references Bugzilla bug 2073021, which is invalid:

  • expected the bug to target the "4.11.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 2073021: firstboot: Retry on failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

/bugzilla refresh

@openshift-ci openshift-ci bot requested review from cheesesashimi and jkyros April 7, 2022 14:47
@openshift-ci openshift-ci bot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) and bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) labels, and removed the bugzilla/invalid-bug label, on Apr 7, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 7, 2022

@cgwalters: This pull request references Bugzilla bug 2073021, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.11.0) matches configured target release for branch (4.11.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla ([email protected]), skipping review request.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@cgwalters
Member Author

To be clear, I didn't test this; it should be easy to do so at least manually, by jumping into a node during firstboot and killing rpm-ostree or whatever. (It could be scripted via an Ignition config too, I guess.)

@kikisdeliveryservice
Contributor

/test e2e-aws

@cgwalters
Member Author

Hmm, did we somehow lose the gating on our jobs? Prow says this just needs an lgtm, but we should definitely be gating on e2e-aws, e2e-gcp-op, and e2e-agnostic-upgrade, right?

@cgwalters
Member Author

OK 🆕 now updated to have the retry loop in the Go code.

@cgwalters cgwalters requested a review from yuqi-zhang April 12, 2022 18:50
@yuqi-zhang
Contributor

Retry lgtm. Maybe we should have a hard timeout in case the user is wondering why the node is stuck? No strong feelings either way
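
For comparison, a hard cap like the one floated here could wrap the same loop with a deadline. A hypothetical Go sketch (the PR as merged retries forever, with no cap; all names here are illustrative):

package main

import (
	"fmt"
	"log"
	"time"
)

const retryInterval = time.Minute

// retryWithDeadline is a hypothetical variant of the firstboot retry
// loop that gives up after maxWait instead of retrying forever.
func retryWithDeadline(attempt func() error, maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	for {
		err := attempt()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("giving up after %v: %w", maxWait, err)
		}
		log.Printf("error: %v; sleeping %v for retry", err, retryInterval)
		time.Sleep(retryInterval)
	}
}

func main() {
	// Cap retries at 30 minutes; the value is arbitrary, for illustration.
	if err := retryWithDeadline(func() error { return nil }, 30*time.Minute); err != nil {
		log.Fatal(err)
	}
}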

@jinyunma

Tested the new change; the error is no longer present.
When the timeout error occurred:

Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: W0413 05:47:37.205136    2615 firstboot_complete_machineconfig.go:46] error: failed to update OS to registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2129765939/srv/repo:96e553b62c446ec12ef1aa4898bf6a627644ba15ed7dac41652686615842230b --custom-origin-url pivot://registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 --custom-origin-description Managed by machine-config-operator: error: Timeout was reached
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: : exit status 1
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:47:37.205145    2615 firstboot_complete_machineconfig.go:47] Sleeping 1 minute for retry

I checked that the OS image had not been pulled into /run/mco-machine-os-content/:

[root@reliability01-p7lmf-master-1 ~]# ls -ltr /run/mco-machine-os-content/
total 0
[root@reliability01-p7lmf-master-1 ~]#

Then on retry, the command executed successfully, and I could see that the image was downloaded:

[root@reliability01-p7lmf-master-1 ~]# journalctl -u machine-config-daemon-firstboot.service -b -f
-- Logs begin at Wed 2022-04-13 05:43:22 UTC. --
Apr 13 05:46:31 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:46:31.393167    2615 rpm-ostree.go:238] Current origin is not custom
Apr 13 05:46:32 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:46:32.009353    2615 rpm-ostree.go:265] Pivoting to: 411.85.202203282001-0 (96e553b62c446ec12ef1aa4898bf6a627644ba15ed7dac41652686615842230b)
Apr 13 05:46:32 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:46:32.009376    2615 rpm-ostree.go:297] Executing rebase from repo path /run/mco-machine-os-content/os-content-2129765939/srv/repo with customImageURL pivot://registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 and checksum 96e553b62c446ec12ef1aa4898bf6a627644ba15ed7dac41652686615842230b
Apr 13 05:46:32 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:46:32.009388    2615 update.go:1882] Running: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2129765939/srv/repo:96e553b62c446ec12ef1aa4898bf6a627644ba15ed7dac41652686615842230b --custom-origin-url pivot://registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 --custom-origin-description Managed by machine-config-operator
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:47:37.204797    2615 update.go:1114] Updating files
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:47:37.205121    2615 update.go:1179] Deleting stale data
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:47:37.205130    2615 update.go:1927] Removing SIGTERM protection
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: W0413 05:47:37.205136    2615 firstboot_complete_machineconfig.go:46] error: failed to update OS to registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2129765939/srv/repo:96e553b62c446ec12ef1aa4898bf6a627644ba15ed7dac41652686615842230b --custom-origin-url pivot://registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058 --custom-origin-description Managed by machine-config-operator: error: Timeout was reached
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: : exit status 1
Apr 13 05:47:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:47:37.205145    2615 firstboot_complete_machineconfig.go:47] Sleeping 1 minute for retry
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.208899    2615 update.go:1882] Running: systemctl start rpm-ostreed
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.270676    2615 rpm-ostree.go:325] Running captured: rpm-ostree status --json
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.307741    2615 rpm-ostree.go:325] Running captured: rpm-ostree status --json
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.342257    2615 daemon.go:239] Booted osImageURL:  (411.85.202203181601-0)
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.342718    2615 update.go:1919] Adding SIGTERM protection
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.342922    2615 update.go:504] Checking Reconcilable for config mco-empty-mc to rendered-master-713b9514904a43504845f937ffe26016
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.343392    2615 update.go:1897] Starting update from mco-empty-mc to rendered-master-713b9514904a43504845f937ffe26016: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.346393    2615 update.go:1114] Updating files
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.346402    2615 update.go:1179] Deleting stale data
Apr 13 05:48:37 reliability01-p7lmf-master-1 machine-config-daemon[2615]: I0413 05:48:37.346496    2615 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-3989351951 --registry-config /var/lib/kubelet/config.json registry.build01.ci.openshift.org/ci-ln-vgbvcqk/stable@sha256:4e3e9ef6c2fc837dde23521f98d567cfa0001bf89a242392e2cd1ac03837b058
^C
[root@reliability01-p7lmf-master-1 ~]# ls -ltr /run/mco-machine-os-content/
total 0
drwxr-xr-x. 19 root root 480 Apr 13 05:48 os-content-3989351951

@cgwalters
Member Author

Thanks for testing!

We should still try to figure out why the initial image pull is failing in your environment. If it is slow disk I/O, I suspect you're going to have a rather painful experience trying to use the cluster.

Anyway, @yuqi-zhang, mind dropping an lgtm?

@cgwalters
Member Author

On a higher-level topic: I was thinking about this a bit, and I actually find myself wondering why we didn't take an approach where the kubelet does run on firstboot and joins the cluster; it would just join unschedulable/tainted, and we would special-case things so that only the MCD daemonset runs to perform an OS update.

Seems like this would simplify a whole lot of things, and would have avoided this bug in the first place.

@yuqi-zhang
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged.) label on Apr 13, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 13, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

7 similar comments

@openshift-ci
Contributor

openshift-ci bot commented Apr 15, 2022

@cgwalters: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 4c40e0d into openshift:master Apr 15, 2022
@openshift-ci
Contributor

openshift-ci bot commented Apr 15, 2022

@cgwalters: All pull requests linked via external trackers have merged:

Bugzilla bug 2073021 has been moved to the MODIFIED state.

In response to this:

Bug 2073021: firstboot: Retry on failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
