Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ready-for-final-review: ibu cnf: add 'rollback after a failed upgrade' test #29

Merged
merged 1 commit into from
Jun 25, 2024

Conversation

mpmaruthu
Copy link
Collaborator

adding test case 'rollback after a failed upgrade'

@mpmaruthu mpmaruthu changed the title Draft: ibu cnf: add 'rollback after a failed upgrade' test ready-for-final-review: ibu cnf: add 'rollback after a failed upgrade' test Jun 6, 2024
@mpmaruthu
Copy link
Collaborator Author

@mcornea , @klaskosk , Please review this PR when you have a time? Thanks!

err = upgradeStagePolicyStatusCheck.WaitUntilComplianceState(policiesv1.Compliant, 40*time.Minute)
Expect(err).ToNot(HaveOccurred(), "Upgrade-stage-policy failed to report Compliant state")

if err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err would always be nil here since you're using Expect(err).ToNot(HaveOccurred()) above it, did you want to assert that the error did occur here?

Copy link
Collaborator Author

@mpmaruthu mpmaruthu Jun 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have addressed it and updated with required changes.

Copy link
Collaborator

@klaskosk klaskosk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

tsparams.IbuPolicyNamespace)
Expect(err).ToNot(HaveOccurred(), "Failed to pull upgrade stage policy")

err = upgradeStagePolicyStatusCheck.WaitUntilComplianceState(policiesv1.Compliant, 40*time.Minute)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we make the upgrade fail, what's causing it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point @mcornea, Assumption here is , upgrade would fail based on code changes in lifecycle-agent operator and test configuration changes, then rollback to previous state root otherwise the SUT/DUT woudn't rollback. If you think, we should simulate a failure to make upgrade failed and perform rollback. i can think through it and add a test step.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to provide some info for us to consider. In the existing upgrade recovery job, (talm controlled backup and failure recovery using ocp upgrade method), we don't really inject a failure, but rather waiting for the upgrade to proceed to 30%+ and rollback.

We didn't find a good way to inject failure at that time, and above solution (proposed by Angie) will often/always trigger rollback after etcd was upgraded.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To prevent the upgrade from succeeding we can get the machineconfig pool into a degraded state by altering one of the files managed by MCO, e.g. /etc/mco/proxy.env. This way we can cover the rollback is automatically triggered in case of failure.

Copy link
Collaborator Author

@mpmaruthu mpmaruthu Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Yang for sharing more insights. I had a discussion with Marius in this regard. We came up with below proposed methods to trigger auto-rollback upon upgrade failure.

`
step 1
By("Creating, enabling ibu pre-prep CGU and waiting for CGU status to report completed",

step2
By("Creating, enabling ibu prep CGU and waiting for CGU status to report completed"

step3
By("Creating, and enabling ibu upgrade CGU"

step4
Wait for 15-20 mins (node has rebooted to stateroot B)

step5
By("Verifying auto rollback triggered after a failed upgrade",

step a
By("Wait for nodes to become unreachable")

step b
By("Wait for nodes to become reachable")

step c
By("Wait until all nodes are reporting as Ready")

step d
By("Wait for IBU resource to be available")

step e
By ("verifying ibu-upgrade-stage-policy status")

      upgradeStagePolicyStatusCheck, err := ocm.PullPolicy(TargetSNOAPIClient,
         tsparams.UpgradePolicyName,
         tsparams.IbuPolicyNamespace)
      Expect(err).ToNot(HaveOccurred(), "Failed to pull upgrade stage policy status")

      err = upgradeStagePolicyStatusCheck.WaitUntilComplianceState(policiesv1.NonCompliant, 3*time.Minute)
      Expect(err).ToNot(HaveOccurred(), "Upgrade-stage-policy failed to report Compliant")

step f
By ("verifying current booted stateroot name")
expected result:
stateroot: rhcos_<seed_image_version>

step g

      By("Simulating a fault after node has rebooted into stateroot B", func() {
         faultInjectCmd := "echo a > /etc/mco/proxy.env"  ==> mcp named master will be in "Degraded" state.
         _, err := cluster.ExecCmdWithStdout(TargetSNOAPIClient, faultInjectCmd)
         Expect(err).ToNot(HaveOccurred(), "could not execute command: %s", err)
      })

step h
By ("Configuring IBU CR LCA init monitor watchdog timer to 5 mins to trigger auto-rollback")

step i

      By ("verifying ibu-upgrade-stage-policy status")", func() {
         upgradeStagePolicyStatusCheck, err := ocm.PullPolicy(TargetSNOAPIClient,
            tsparams.UpgradePolicyName,
            tsparams.IbuPolicyNamespace)
         Expect(err).ToNot(HaveOccurred(), "Failed to pull upgrade stage policy status")

         err = upgradeStagePolicyStatusCheck.WaitUntilComplianceState(policiesv1.NonCompliant,
            tsparams.CustomLcaInitMonitorTimeout=> 5 mins)
         Expect(err).ToNot(HaveOccurred(), "Upgrade-stage-policy expected to report NonCompliant")

step6 (final step)
"Validating target sno cluster version after auto-rollback"
Expected Result
Target sno cluster reports correct (previous ocp version) version
`

@mpmaruthu mpmaruthu force-pushed the ibu-cnf-rollback branch 2 times, most recently from 92c550b to f9141f6 Compare June 24, 2024 19:33
@mpmaruthu
Copy link
Collaborator Author

@mcornea , @yliu127 , Please review this final patch and help to merge if it looks good when you have a time. Thanks!

test automation code roboustness validation logs:-
https://docs.google.com/document/d/14fzhWiPFQscem2YMjq4KEizRi2YZsNA3ltacC83oBdo/edit?usp=sharing

@mcornea
Copy link
Collaborator

mcornea commented Jun 25, 2024

lgtm

@mcornea mcornea merged commit 1ab709d into openshift-kni:main Jun 25, 2024
1 check passed
@mpmaruthu mpmaruthu deleted the ibu-cnf-rollback branch June 25, 2024 08:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants