avoid setting partial api enablements in cluster status #5325

NickYadance · 2024-08-08T08:14:43Z

What type of PR is this?
bug

What this PR does / why we need it:
avoid setting partial api enablements in cluster status

Which issue(s) this PR fixes:
Fixes #5309

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Signed-off-by: yi.wu <[email protected]>

XiShanYongYe-Chang · 2024-08-08T08:25:55Z

Thanks @NickYadance
/assign

codecov-commenter · 2024-08-08T08:32:14Z


Codecov Report

Attention: Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.

Project coverage is 28.42%. Comparing base (e7300c3) to head (a47b104).
Report is 624 commits behind head on master.

Files	Patch %	Lines
...kg/controllers/status/cluster_status_controller.go	0.00%	4 Missing and 2 partials ⚠️


Additional details and impacted files

@@             Coverage Diff             @@
##           master    #5325       +/-   ##
===========================================
- Coverage   51.39%   28.42%   -22.98%     
===========================================
  Files         250      632      +382     
  Lines       24979    43836    +18857     
===========================================
- Hits        12839    12460      -379     
- Misses      11433    30473    +19040     
- Partials      707      903      +196

Flag	Coverage Δ
unittests	`28.42% <0.00%> (-22.98%)`	⬇️


XiShanYongYe-Chang · 2024-08-08T11:45:17Z

Hi @NickYadance, thanks~

Here, we do indeed need to continue updating the Cluster object instead of exiting directly, because there is a situation where, after a cluster has just been successfully created and joined Karmada, a certain API has not yet been successfully installed, but other types of APIs are functioning normally. If we skip it directly at this point, it will affect the distribution of normally functioning API types.

Currently, the processing logic here may not be very reasonable, and related problems have already been reported in issues. One of my ideas is to analyze the errors returned by the request to decide whether to update the Cluster status. I also have a strange thought: if the apienablements field in the cluster status is empty, then update it; otherwise, return an error.

whitewindmills · 2024-08-09T02:19:34Z

indeed, updating APIENABLEMENTS is a very dangerous operation, which may cause the scheduler to delete the scheduled results. we're struggling with this lately. it caused us huge losses in the production environment.

NickYadance · 2024-08-09T02:24:50Z

> I also have a strange thought: if the apienablements field in the cluster status is empty, then update it; otherwise, return an error

that does sound strange to me, updating apiEnablements only once.

it should be fine to exit on error here, since the reconcile will be requeued and succeed at some point later. the point is to not bring incomplete apiEnablements into the cluster status.

karmada-bot · 2024-08-09T02:38:34Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from xishanyongye-chang. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/controllers/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

whitewindmills · 2024-08-09T02:40:43Z

/ok-to-test

NickYadance · 2024-08-09T07:13:33Z

> indeed, updating APIENABLEMENTS is a very dangerous operation, which may cause the scheduler to delete the scheduled results. we're struggling with this lately. it caused us huge losses in the production environment.

yes, it's just too dangerous for karmada to trigger massive deletions in the cluster. The extra handling in the scheduler (#5216) is very helpful; still, keeping apiEnablements complete in the cluster status helps avoid potential issues too.

yanfeng1992 · 2024-08-09T08:32:05Z

pkg/controllers/status/cluster_status_controller.go

-		klog.Warningf("Maybe get partial(%d) APIs installed in Cluster %s. Error: %v.", len(apiEnables), cluster.GetName(), err)
+	if err != nil {
+		klog.Errorf("Failed to get APIs installed in Cluster %s. Error: %v.", cluster.GetName(), err)
+		return err


I am a little worried that this may cause other problems. If an error is returned here, the cluster will be unhealthy. If some APIs cannot be obtained, the cluster will be unhealthy. @XiShanYongYe-Chang @NickYadance @whitewindmills

NickYadance: would it be unhealthy? It seems the cluster needs to be healthy and online to reach getAPIEnablements; if err is returned, the reconcile will be requeued.

yanfeng1992: When changing from an unhealthy state to a healthy state, if some APIs have problems, an error will be returned here and the reconcile will be requeued. The cluster then cannot be updated to a healthy state.

NickYadance: sounds reasonable, maybe it's better to keep apiEnablements unchanged when err happens.

NickYadance: scratch the above: if we do that, the incorrect status might be kept until the next reconcile, which is not good at all.

XiShanYongYe-Chang · 2024-08-09T09:21:04Z

> indeed, updating APIENABLEMENTS is a very dangerous operation, which may cause the scheduler to delete the scheduled results. we're struggling with this lately. it caused us huge losses in the production environment.

Do you have any better ideas?

whitewindmills · 2024-08-09T09:34:56Z

that's what #5325 and #5216 work on. let's move them forward.

XiShanYongYe-Chang · 2024-08-09T09:42:39Z

> that's what #5325 and #5216 work on. lets move them forward.

How about we discuss this at a community meeting, get more people's opinions?

whitewindmills · 2024-08-09T09:46:07Z

okay

XiShanYongYe-Chang · 2024-08-09T09:47:55Z

> that's what #5325 and #5216 work on. lets move them forward.

> How about we discuss this at a community meeting, get more people's opinions?

cc @NickYadance @yanfeng1992

whitewindmills · 2024-08-09T09:48:52Z

next week's meeting agenda is already quite busy, I'm afraid we won't have a chance to discuss it.

RainbowMango · 2024-08-12T01:59:30Z

> next week's meeting agenda is already quite busy, I'm afraid we won't have a chance to discuss it.

Don't worry let's try it. At least this brings attention.

whitewindmills · 2024-08-13T09:16:26Z

hi @NickYadance, after the community meeting discussion, we will still keep it as it is, but add a new condition (named InvalidAPIEnablements? the name is up to you) when err != nil or apiEnablements is empty.
the scheduler will take this condition into account.
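A minimal sketch of that outcome, under stated assumptions: the `Condition` type below is a local stand-in rather than the Kubernetes one, `InvalidAPIEnablements` is just the placeholder name floated in the comment, and the scheduler-side rule (do not treat a missing API as proof when the list is flagged) is one guess at what "consider this condition" could mean.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type    string
	Status  string // "True" or "False"
	Reason  string
	Message string
}

// condInvalidAPIEnablements uses the placeholder name from the discussion.
const condInvalidAPIEnablements = "InvalidAPIEnablements"

// errDiscovery is a hypothetical discovery error used in the example.
var errDiscovery = errors.New("discovery failed for some API groups")

// apiEnablementsCondition flags the list as unreliable when discovery
// returned an error or the list is empty.
func apiEnablementsCondition(apis []string, err error) Condition {
	switch {
	case err != nil:
		return Condition{Type: condInvalidAPIEnablements, Status: "True",
			Reason: "DiscoveryError", Message: fmt.Sprintf("partial list: %v", err)}
	case len(apis) == 0:
		return Condition{Type: condInvalidAPIEnablements, Status: "True", Reason: "EmptyAPIList"}
	default:
		return Condition{Type: condInvalidAPIEnablements, Status: "False", Reason: "Complete"}
	}
}

// setCondition replaces the condition of the same type or appends it.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// apiConfirmedMissing is the assumed scheduler-side check: an API counts as
// missing only if it is absent from a list that is not flagged as invalid.
func apiConfirmedMissing(conds []Condition, apis []string, want string) bool {
	for _, a := range apis {
		if a == want {
			return false
		}
	}
	for _, c := range conds {
		if c.Type == condInvalidAPIEnablements && c.Status == "True" {
			return false // list is unreliable, so absence proves nothing
		}
	}
	return true
}

func main() {
	apis := []string{"v1"}

	conds := setCondition(nil, apiEnablementsCondition(apis, errDiscovery))
	fmt.Println(apiConfirmedMissing(conds, apis, "apps/v1"))

	conds = setCondition(conds, apiEnablementsCondition(apis, nil))
	fmt.Println(apiConfirmedMissing(conds, apis, "apps/v1"))
}
```

With this shape the controller keeps writing whatever it discovered, and the decision about whether to act on an incomplete list moves to the consumer of the status.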

NickYadance · 2024-08-13T10:59:16Z

> hi @NickYadance, after the community meeting discussion, we still keep it as it is but add a new condition(named InvalidAPIEnablements? whatever you name it, it's up to you) when err != nil or apiEnablements is empty. the scheduler will consider this condition.

understand your concern, i'll keep the pr open, feel free to close it.

whitewindmills · 2024-08-13T11:07:22Z

@NickYadance you can continue this PR.

XiShanYongYe-Chang · 2024-08-17T02:42:32Z

Hi @NickYadance, just as @whitewindmills said, would you like to continue to contribute?

NickYadance · 2024-08-19T06:59:44Z

> Hi @NickYadance, just as @whitewindmills said, would you like to continue to contribute?

probably not. i would personally prefer to return on err here and keep it simple: if the cluster cannot be healthy because apiEnablements cannot be retrieved, then let it be unhealthy.

karmada-bot requested review from jwcesign and pigletfly August 8, 2024 08:14

karmada-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Aug 8, 2024

avoid setting partial api enablements in cluster status

d8a6ce4

Signed-off-by: yi.wu <[email protected]>

NickYadance force-pushed the avoid-partial-apis branch from 13e1e4c to d8a6ce4 Compare August 8, 2024 08:18

karmada-bot assigned XiShanYongYe-Chang Aug 8, 2024

add more error handling when setting APIs in cluster status

29b581a

Signed-off-by: yi.wu <[email protected]>

karmada-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 9, 2024

karmada-bot added the ok-to-test label Aug 9, 2024

fix ut

a47b104

Signed-off-by: yi.wu <[email protected]>

yanfeng1992 reviewed Aug 9, 2024

whitewindmills mentioned this pull request Aug 16, 2024

failover feature-gate Cannot be closed correctly #5375

Open

NickYadance closed this Aug 19, 2024

whitewindmills mentioned this pull request Aug 19, 2024

add new cluster condition: CompleteAPIEnablements #5400

Merged
