avoid setting partial api enablements in cluster status #5325
Conversation
Signed-off-by: yi.wu <[email protected]>
Force-pushed 13e1e4c to d8a6ce4
Thanks @NickYadance.
Codecov Report
Attention: Patch coverage is

@@            Coverage Diff             @@
##           master    #5325       +/-  ##
===========================================
- Coverage   51.39%   28.42%    -22.98%
===========================================
  Files         250      632       +382
  Lines       24979    43836     +18857
===========================================
- Hits        12839    12460       -379
- Misses      11433    30473     +19040
- Partials      707      903       +196

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Hi @NickYadance, thanks~ Here we do indeed need to continue updating the Cluster object instead of exiting directly, because there is a situation where, right after a cluster has successfully joined Karmada, a certain API has not yet been installed successfully while other API types are functioning normally. If we skip the update at that point, it will affect the distribution of the normally functioning API types. The current processing logic here may not be very reasonable, and there are open issues reporting problems related to this. One of my ideas is to analyze the errors returned by the request to decide whether to update the Cluster status. I also have a somewhat odd thought: if the apiEnablements field in the cluster status is empty, update it; otherwise, return an error.
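One way to picture that last idea, as a rough sketch only: the helper name, its parameters, and the assumption that the usual clusterv1alpha1 types are involved are mine, not code from this PR.

```go
package status

import (
	clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
)

// syncAPIEnablements is a hypothetical helper: overwrite the recorded
// apiEnablements only when discovery fully succeeded, or when nothing has
// been recorded yet (e.g. a freshly joined cluster); otherwise keep the
// existing list and surface the error so the reconcile is requeued.
func syncAPIEnablements(clusterStatus *clusterv1alpha1.ClusterStatus, fresh []clusterv1alpha1.APIEnablement, discoveryErr error) error {
	if discoveryErr == nil {
		// Complete list: always safe to overwrite.
		clusterStatus.APIEnablements = fresh
		return nil
	}
	if len(clusterStatus.APIEnablements) == 0 {
		// Nothing recorded yet: a partial list is better than none, so the
		// API types that do work can still receive distributed resources.
		clusterStatus.APIEnablements = fresh
		return nil
	}
	// A list already exists: keep it untouched and report the error.
	return discoveryErr
}
```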
Indeed, updating apiEnablements is a very dangerous operation that may cause the scheduler to delete already-scheduled results. We have been struggling with this lately; it caused us huge losses in our production environment.
Updating apiEnablements only once does sound strange to me. It should be fine to exit on error here, since the reconcile will requeue and succeed at some later point. The point is to not bring incomplete apiEnablements into the cluster status.
Signed-off-by: yi.wu <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED.
/ok-to-test
Signed-off-by: yi.wu <[email protected]>
Yes, it's just too dangerous for Karmada to trigger massive deletions in the cluster. The extra handling in the scheduler (#5216) is very helpful; still, keeping apiEnablements intact in the cluster status helps avoid potential issues too.
```diff
-	klog.Warningf("Maybe get partial(%d) APIs installed in Cluster %s. Error: %v.", len(apiEnables), cluster.GetName(), err)
+	if err != nil {
+		klog.Errorf("Failed to get APIs installed in Cluster %s. Error: %v.", cluster.GetName(), err)
+		return err
```
I am a little worried that this may cause other problems. If an error is returned here, the cluster will become unhealthy; in other words, whenever some APIs cannot be obtained, the cluster will be marked unhealthy. @XiShanYongYe-Chang @NickYadance @whitewindmills
Would it be unhealthy? It seems the cluster needs to be healthy and online to even reach getAPIEnablements; if an error is returned, the reconcile will be requeued.
When a cluster is changing from an unhealthy state to a healthy state, if some APIs have problems, an error will be returned here and the reconcile will be requeued, so the cluster can never be updated to a healthy state.
Sounds reasonable; maybe it's better to keep apiEnablements unchanged when an error happens.
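For what it's worth, a minimal sketch of that suggestion — the helper name and the idea of passing the Cluster object in are my assumptions, not code from this PR:

```go
package status

import (
	"k8s.io/klog/v2"

	clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
)

// resolveAPIEnablements is a hypothetical helper: when discovery fails, fall
// back to the enablements already recorded on the Cluster object instead of
// publishing a partial list into the status.
func resolveAPIEnablements(cluster *clusterv1alpha1.Cluster, fresh []clusterv1alpha1.APIEnablement, discoveryErr error) []clusterv1alpha1.APIEnablement {
	if discoveryErr != nil {
		klog.Warningf("Got partial APIs for Cluster %s, keeping previously recorded apiEnablements. Error: %v.", cluster.GetName(), discoveryErr)
		return cluster.Status.APIEnablements // last known value
	}
	return fresh
}
```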
No to the above: if we do that, the incorrect status might be kept until the next reconcile, which is not good at all.
Do you have any better ideas?
Okay.
Next week's meeting agenda is already quite busy; I'm afraid we won't have a chance to discuss it.
Don't worry, let's try it. At least this brings attention to the problem.
Hi @NickYadance, after the community meeting discussion, we will still keep it as it is but add a new condition (named
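For readers following along, a sketch of what recording such a condition could look like. The condition type string is a placeholder (the actual name is cut off above), and it assumes ClusterStatus.Conditions is a standard []metav1.Condition that the apimachinery meta helpers can manage:

```go
package status

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
)

// setAPIEnablementsCondition is a hypothetical helper: keep updating the
// cluster status as before, but surface whether the collected apiEnablements
// are believed to be complete via a dedicated condition.
func setAPIEnablementsCondition(clusterStatus *clusterv1alpha1.ClusterStatus, discoveryErr error) {
	cond := metav1.Condition{
		Type:   "CompleteAPIEnablements", // placeholder type name
		Status: metav1.ConditionTrue,
		Reason: "CollectCompleted",
	}
	if discoveryErr != nil {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "CollectIncomplete"
		cond.Message = discoveryErr.Error()
	}
	meta.SetStatusCondition(&clusterStatus.Conditions, cond)
}
```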
I understand your concern. I'll keep the PR open; feel free to close it.
@NickYadance you can continue this PR.
Hi @NickYadance, just as @whitewindmills said, would you like to continue to contribute?
Maybe not; I would personally prefer to return on error here and stay simple. If the cluster cannot be healthy because apiEnablements cannot be retrieved, then let it be unhealthy.
What type of PR is this?
bug
What this PR does / why we need it:
avoid setting partial api enablements in cluster status
Which issue(s) this PR fixes:
Fixes #5309
Special notes for your reviewer:
Does this PR introduce a user-facing change?: