OTA-542: pkg/cvo/internal: Do not block on Degraded=True ClusterOperator #482
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: wking
/hold Big change; we want to give folks time to weigh in.
I would personally prefer that we tighten up the definition of what degraded means such that it's only used when things are critical enough that they should block the forward progress of upgrades.
The current definition hinges on quality of service. The Available definition requires the operand to be "functional and available". Can you float an example of component behavior that would be Available=True but still sufficiently severe to need to block later-manifest reconciliation?
Thanks for referencing that; I'd say my view was mostly in line with the statement
I agree with Scott (I think) - I don’t believe degraded is an acceptable state for operators ever, especially during upgrade, as we have historically defined it in practice.
I'm not arguing for it to be acceptable, I'm arguing about it being non-blocking. We will still alert if operators go Degraded=True in the wild. We can build CI to fail operators that go Degraded=True during a run, if we don't do that already. But if an operator goes Degraded=True in a customer cluster, it's not clear to me how sticking mid-update is helping the customer resolve that situation, vs. pushing through with the rest of the update, as long as the operator is Available=True, and letting admins sort out the degrading issue orthogonally.
@openshift-bot: Closed this PR.
Picking this one back up for more discussion.
02fad45 to b0a2a4b
@wking: This pull request references OTA-542 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.
b0a2a4b to cf0f1e3
We have been blocking on this condition since 545c342 (api: make status substruct on operatorstatus, 2018-10-15, openshift#31), when it was named Failing. We softened our install-time handling to stop blocking back in b0b4902 (clusteroperator: Don't block on failing during initialization, 2019-03-11, openshift#136), motivated by install speed [1]. A degraded operator may slow dependent components in their own transitions. But as long as the operator/operand is available at all, it should not block dependent components from transitioning, so this commit removes the Degraded=True block from the remaining modes.

We still have the warning ClusterOperatorDegraded alerting admins when an operator goes Degraded=True for a while; we will just no longer block updates at that point. We won't block ReconcilingMode manifest application either, but since that's already flattened and permuted, and ClusterOperators tend to be towards the end of their TaskNode, the impact on ReconcilingMode is minimal (except that we will no longer go Failing=True in ClusterVersion when the only issue is some Degraded=True ClusterOperator).

[1]: openshift#136 (comment)
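To make the intended behavior concrete, here is a minimal Go sketch of the rule described above: an operator that is not Available=True blocks, while Degraded=True on an otherwise available operator only produces a warning. This is not the actual pkg/cvo/internal code; the `Condition` type, `findStatus`, and `checkOperator` are hypothetical names invented for illustration, and the real implementation works on the ClusterOperator API types.

```go
package main

import "fmt"

// ConditionStatus mirrors the True/False style values of a status condition.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// Condition is a simplified, hypothetical stand-in for a ClusterOperator
// status condition (the real API type carries more fields).
type Condition struct {
	Type   string
	Status ConditionStatus
}

// findStatus returns the status of the named condition, or "" if it is absent.
func findStatus(conds []Condition, name string) ConditionStatus {
	for _, c := range conds {
		if c.Type == name {
			return c.Status
		}
	}
	return ""
}

// checkOperator returns a non-nil error when the operator should block
// later manifests, and a non-empty warning when it is merely degraded.
func checkOperator(name string, conds []Condition) (block error, warning string) {
	// Anything other than Available=True (including a missing condition) blocks.
	if findStatus(conds, "Available") != ConditionTrue {
		return fmt.Errorf("cluster operator %s is not available", name), ""
	}
	// Degraded=True is reported, but no longer blocks progress.
	if findStatus(conds, "Degraded") == ConditionTrue {
		return nil, fmt.Sprintf("cluster operator %s is degraded", name)
	}
	return nil, ""
}
```

In this sketch the warning string stands in for whatever non-blocking reporting is used, with the ClusterOperatorDegraded alert remaining the main signal to admins.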
/cc
cf0f1e3 to b198db0
We have been blocking on this condition since 545c342 (#31), when it was named Failing. We softened our install-time handling to stop blocking back in b0b4902 (#136), motivated by install speed. A degraded operator may slow dependent components in their own transitions. But as long as the operator/operand is available at all, it should not block dependent components from transitioning, so this commit removes the Degraded=True block from the remaining modes.

We still have the critical ClusterOperatorDegraded waking admins up when an operator goes Degraded=True for a while; we will just no longer block updates at that point. We won't block ReconcilingMode manifest application either, but since that's already flattened and permuted, and ClusterOperators tend to be towards the end of their TaskNode, the impact on ReconcilingMode is minimal (except that we will no longer go Failing=True in ClusterVersion when the only issue is some Degraded=True ClusterOperator).

CC @abhinavdahiya, @deads2k, @smarterclayton as folks who were involved in the logic I'm removing here.