Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid setting partial api enablements in cluster status #5325

Closed
wants to merge 3 commits into from

Conversation

NickYadance
Copy link

What type of PR is this?
bug

What this PR does / why we need it:
avoid setting partial api enablements in cluster status

Which issue(s) this PR fixes:
Fixes #5309

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 8, 2024
@XiShanYongYe-Chang
Copy link
Member

Thanks @NickYadance
/assign

@codecov-commenter
Copy link

codecov-commenter commented Aug 8, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.

Project coverage is 28.42%. Comparing base (e7300c3) to head (a47b104).
Report is 624 commits behind head on master.

Files Patch % Lines
...kg/controllers/status/cluster_status_controller.go 0.00% 4 Missing and 2 partials ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5325       +/-   ##
===========================================
- Coverage   51.39%   28.42%   -22.98%     
===========================================
  Files         250      632      +382     
  Lines       24979    43836    +18857     
===========================================
- Hits        12839    12460      -379     
- Misses      11433    30473    +19040     
- Partials      707      903      +196     
Flag Coverage Δ
unittests 28.42% <0.00%> (-22.98%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@XiShanYongYe-Chang
Copy link
Member

Hi @NickYadance, thanks~

Here, we do indeed need to continue updating the Cluster object instead of exiting directly, because there is a situation where, after a cluster has just been successfully created and joined Karmada, a certain API has not yet been successfully installed, but other types of APIs are functioning normally. If we skip it directly at this point, it will affect the distribution of normally functioning API types.

Currently, the processing logic here may not be very reasonable, and there are issues that reflect problems related to this. One of my ideas is to analyze the errors returned by the request to decide whether to update the Cluster status. I also have a strange thought: if the apienablements field in the cluster status is empty, then update it; otherwise, return an error.

@whitewindmills
Copy link
Member

indeed, updating APIENABLEMENTS is a very dangerous operation, which may cause the scheduler to delete the scheduled results. we're struggling with this lately. it caused us huge losses in the production environment.

@NickYadance
Copy link
Author

NickYadance commented Aug 9, 2024

I also have a strange thought: if the apienablements field in the cluster status is empty, then update it; otherwise, return an error

sounds indeed strange to me, to update aipenablements for only once.

it should be fine to exit on error here since it will requeue and succeed at sometime later. the thing is to not bring in-contact apienablements into the cluster status.

@karmada-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from xishanyongye-chang. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 9, 2024
@whitewindmills
Copy link
Member

/ok-to-test

Signed-off-by: yi.wu <[email protected]>
@NickYadance
Copy link
Author

indeed, updating APIENABLEMENTS is a very dangerous operation, which may cause the scheduler to delete the scheduled results. we're struggling with this lately. it caused us huge losses in the production environment.

yes it's just too dangerous for karmada to draw massive deletions to the cluster. Extra handling in scheduler #5216 is very helpful, still, keeping apiEnablements contact in cluster helps avoid potential issues too.

klog.Warningf("Maybe get partial(%d) APIs installed in Cluster %s. Error: %v.", len(apiEnables), cluster.GetName(), err)
if err != nil {
klog.Errorf("Failed to get APIs installed in Cluster %s. Error: %v.", cluster.GetName(), err)
return err
Copy link
Member

@yanfeng1992 yanfeng1992 Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little worried that this may cause other problems. If an error is returned here, the cluster will be unhealthy. If some APIs cannot be obtained, the cluster will be unhealthy. @XiShanYongYe-Chang @NickYadance @whitewindmills

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be unhealthy ? seems the cluster needs to be healthy&online to reach getAPIEnablements, if err returned the reconcile will be requeue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When changing from an unhealthy state to a healthy state, if some APIs have problems, an error will be returned here and the queue will be requeued. Unable to update the cluster to a healthy state

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds reasonable, maybe it's better keep apiEnablements unchanged when err happens.

Copy link
Author

@NickYadance NickYadance Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no for above, if we do that the incorrect status might be kept until next reconcile which is not good at all

@XiShanYongYe-Chang
Copy link
Member

indeed, updating APIENABLEMENTS is a very dangerous operation, which may cause the scheduler to delete the scheduled results. we're struggling with this lately. it caused us huge losses in the production environment.

Do you have any better ideas?

@whitewindmills
Copy link
Member

that's what #5325 and #5216 work on. lets move them forward.

@XiShanYongYe-Chang
Copy link
Member

that's what #5325 and #5216 work on. lets move them forward.

How about we discuss this at a community meeting, get more people's opinions?

@whitewindmills
Copy link
Member

okay

@XiShanYongYe-Chang
Copy link
Member

that's what #5325 and #5216 work on. lets move them forward.

How about we discuss this at a community meeting, get more people's opinions?

cc @NickYadance @yanfeng1992

@whitewindmills
Copy link
Member

next week's meeting agenda is already quite busy, I'm afraid we won't have a chance to discuss it.

@RainbowMango
Copy link
Member

next week's meeting agenda is already quite busy, I'm afraid we won't have a chance to discuss it.

Don't worry let's try it. At least this brings attention.

@whitewindmills
Copy link
Member

hi @NickYadance, after the community meeting discussion, we still keep it as it is but add a new condition(named InvalidAPIEnablements? whatever you name it, it's up to you) when err != nil or apiEnablements is empty.
the scheduler will consider this condition.

@NickYadance
Copy link
Author

hi @NickYadance, after the community meeting discussion, we still keep it as it is but add a new condition(named InvalidAPIEnablements? whatever you name it, it's up to you) when err != nil or apiEnablements is empty. the scheduler will consider this condition.

understand your concern, i'll keep the pr open, feel free to close it.

@whitewindmills
Copy link
Member

@NickYadance you can continue this PR.

@XiShanYongYe-Chang
Copy link
Member

Hi @NickYadance, just as @whitewindmills said, would you like to continue to contribute?

@NickYadance
Copy link
Author

Hi @NickYadance, just as @whitewindmills said, would you like to continue to contribute?

no maybe, i would personally prefer to return on err here and stay simple, if the cluster cannot be healthy due to not being able to retrieve apienablements, so be unhealthy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants