-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ready-for-final-review: ibu cnf: add 'rollback after a failed upgrade' test #29
Conversation
a92ccc5
to
ffd809f
Compare
err = upgradeStagePolicyStatusCheck.WaitUntilComplianceState(policiesv1.Compliant, 40*time.Minute) | ||
Expect(err).ToNot(HaveOccurred(), "Upgrade-stage-policy failed to report Compliant state") | ||
|
||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err
would always be nil
here since you're using Expect(err).ToNot(HaveOccurred())
above it, did you want to assert that the error did occur here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I have addressed it and updated with required changes.
ffd809f
to
e732aa4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
e732aa4
to
6f1076a
Compare
tsparams.IbuPolicyNamespace) | ||
Expect(err).ToNot(HaveOccurred(), "Failed to pull upgrade stage policy") | ||
|
||
err = upgradeStagePolicyStatusCheck.WaitUntilComplianceState(policiesv1.Compliant, 40*time.Minute) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we make the upgrade fail, what's causing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point @mcornea, Assumption here is , upgrade would fail based on code changes in lifecycle-agent operator and test configuration changes, then rollback to previous state root otherwise the SUT/DUT woudn't rollback. If you think, we should simulate a failure to make upgrade failed and perform rollback. i can think through it and add a test step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to provide some info for us to consider. In the existing upgrade recovery job, (talm controlled backup and failure recovery using ocp upgrade method), we don't really inject a failure, but rather waiting for the upgrade to proceed to 30%+ and rollback.
We didn't find a good way to inject failure at that time, and above solution (proposed by Angie) will often/always trigger rollback after etcd was upgraded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To prevent the upgrade from succeeding we can get the machineconfig pool into a degraded state by altering one of the files managed by MCO, e.g. /etc/mco/proxy.env
. This way we can cover the rollback is automatically triggered in case of failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Yang for sharing more insights. I had a discussion with Marius in this regard. We came up with below proposed methods to trigger auto-rollback upon upgrade failure.
`
step 1
By("Creating, enabling ibu pre-prep CGU and waiting for CGU status to report completed",
step2
By("Creating, enabling ibu prep CGU and waiting for CGU status to report completed"
step3
By("Creating, and enabling ibu upgrade CGU"
step4
Wait for 15-20 mins (node has rebooted to stateroot B)
step5
By("Verifying auto rollback triggered after a failed upgrade",
step a
By("Wait for nodes to become unreachable")
step b
By("Wait for nodes to become reachable")
step c
By("Wait until all nodes are reporting as Ready")
step d
By("Wait for IBU resource to be available")
step e
By ("verifying ibu-upgrade-stage-policy status")
upgradeStagePolicyStatusCheck, err := ocm.PullPolicy(TargetSNOAPIClient,
tsparams.UpgradePolicyName,
tsparams.IbuPolicyNamespace)
Expect(err).ToNot(HaveOccurred(), "Failed to pull upgrade stage policy status")
err = upgradeStagePolicyStatusCheck.WaitUntilComplianceState(policiesv1.NonCompliant, 3*time.Minute)
Expect(err).ToNot(HaveOccurred(), "Upgrade-stage-policy failed to report Compliant")
step f
By ("verifying current booted stateroot name")
expected result:
stateroot: rhcos_<seed_image_version>
step g
By("Simulating a fault after node has rebooted into stateroot B", func() {
faultInjectCmd := "echo a > /etc/mco/proxy.env" ==> mcp named master will be in "Degraded" state.
_, err := cluster.ExecCmdWithStdout(TargetSNOAPIClient, faultInjectCmd)
Expect(err).ToNot(HaveOccurred(), "could not execute command: %s", err)
})
step h
By ("Configuring IBU CR LCA init monitor watchdog timer to 5 mins to trigger auto-rollback")
step i
By ("verifying ibu-upgrade-stage-policy status")", func() {
upgradeStagePolicyStatusCheck, err := ocm.PullPolicy(TargetSNOAPIClient,
tsparams.UpgradePolicyName,
tsparams.IbuPolicyNamespace)
Expect(err).ToNot(HaveOccurred(), "Failed to pull upgrade stage policy status")
err = upgradeStagePolicyStatusCheck.WaitUntilComplianceState(policiesv1.NonCompliant,
tsparams.CustomLcaInitMonitorTimeout=> 5 mins)
Expect(err).ToNot(HaveOccurred(), "Upgrade-stage-policy expected to report NonCompliant")
step6 (final step)
"Validating target sno cluster version after auto-rollback"
Expected Result
Target sno cluster reports correct (previous ocp version) version
`
92c550b
to
f9141f6
Compare
@mcornea , @yliu127 , Please review this final patch and help to merge if it looks good when you have a time. Thanks! test automation code roboustness validation logs:- |
lgtm |
f9141f6
to
fb97ff7
Compare
adding test case 'rollback after a failed upgrade'