Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WorkloadRebalancer doesn't get deleted after TTL. #5985

Closed
deefreak opened this issue Dec 27, 2024 · 10 comments · Fixed by #5989
Closed

WorkloadRebalancer doesn't get deleted after TTL. #5985

deefreak opened this issue Dec 27, 2024 · 10 comments · Fixed by #5989
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@deefreak
Copy link

What happened:
A WorkloadRebalancer created with a TTL(Time To Live) field in the spec is not getting deleted after TTL has reached.

What you expected to happen:
WorkloadRebalancer should get deleted after TTL.

How to reproduce it (as minimally and precisely as possible):
Create a workloadRebalancer with .spec.ttlSecondsAfterFinished . Verify that it is not deleted after TTL has been passed.

Anything else we need to know?:
NA

Possible Cause:
I debugged this and found few issues with the code of the workloadrebalancer_controller.go file.

  • The controller reconciles when the workloadrebalancer is created or the spec changes. See here.
  • We are checking for the rebalance.Status.FinishTime == nil before setting the rebalancer.Status = *newStatus here. So, line 93 always evaluates to true and it never reconciles again and the workload rebalancer doesn't get deleted.

Environment:

  • Karmada version: 1.11
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version):
  • Others:
@deefreak deefreak added the kind/bug Categorizes issue or PR as related to a bug. label Dec 27, 2024
@RainbowMango
Copy link
Member

cc @chaosi-zju

@chaosi-zju
Copy link
Member

Thank you for your feedback @deefreak

Why failed

It's a outrageous low-level error.

When reconciling for the first time, the rebalancer.Status.FinishTime is nil and run into a unexpected code block.

Here it is returned directly, skipping the following deletion detection logic.

It is expected to judge whether newStatus.FinishTime == nil.

if rebalancer.Status.FinishTime == nil {
// should never reach here.
klog.Errorf("finishTime shouldn't be nil, current status: %+v", rebalancer.Status)
return controllerruntime.Result{}, nil
}

How to fix

see #5989

How to prevent

Although the original e2e had test case for deletion, it was still not robust enough, so this time an new e2e case was added.

@RainbowMango
Copy link
Member

We are checking for the rebalance.Status.FinishTime == nil before setting the rebalancer.Status = *newStatus here. So, line 93 always evaluates to true and it never reconciles again and the workload rebalancer doesn't get deleted.

It looks like a fundamental use case. @chaosi-zju can you help to recall why it wasn't discovered in previous development?

@chaosi-zju
Copy link
Member

It looks like a fundamental use case. @chaosi-zju can you help to recall why it wasn't discovered in previous development?

I thought about it a lot this morning, but found only two possibilities:

  • tests were done when the code was first submitted, and everything seemed fine. However, after the last review changes, a bug was introduced, and the final code wasn't thoroughly tested.
  • verification is performed in two steps: first, confirm the rebalancer works, then add TTL for automatic cleanup. This will bypass the error.

@RainbowMango
Copy link
Member

That's possible.
I reviewed the #4894 several rounds:

  • May 14
  • May 17
  • May 20
  • May 21
  • May 23
  • May 24

But I didn't ask for a final test report and didn't review the E2E tests part which was reviewed by @XiShanYongYe-Chang.
That reminds me I should always ask for a test report and review the coverage of E2E tests.

@RainbowMango
Copy link
Member

/assign @chaosi-zju
in favor of #5989

Great thanks to @deefreak for reporting this issue.

I think this issue should be backported to release-1.11 and release-1.12 after fixed on the master.

@chaosi-zju
Copy link
Member

That reminds me I should always ask for a test report and review the coverage of E2E tests.

Because when I finally submitted the code, I had already added the e2e case, so there was no manual verification.

But the e2e case was just right confirm the rebalancer works first and then add TTL for automatic cleanup, rather than directly create a rebalancer with TTL.

@deefreak
Copy link
Author

That reminds me I should always ask for a test report and review the coverage of E2E tests.

Because when I finally submitted the code, I had already added the e2e case, so there was no manual verification.

But the e2e case was just right confirm the rebalancer works first and then add TTL for automatic cleanup, rather than directly create a rebalancer with TTL.

Yea , something like this I saw in the test case.
Initially, I just made this work by reconciling the workloadrebalancer on status updates as well.
Yesterday, while raising the issue, went through the code again and found that this nil check is something that is messing up with the logic.

@RainbowMango
Copy link
Member

Opened #5992 to track the backport tasks.
@deefreak welcome to help~

@deefreak
Copy link
Author

@RainbowMango raised 3 PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

3 participants