Refactor LB #31
Conversation
Force-pushed from 818ac20 to 075494b
Local test:
(2) Delete lb3; the IP is released and then lb4 gets it.
(3) When the prober fails on all backend servers:
(3.1) Disable the health check, and the LB becomes Ready.
(3.2) Change the health check port (from the non-working 80 to the working 22), and the LB becomes Ready.
Added at 2024.08.09: a dummy endpoint is appended to avoid LB traffic being accidentally routed to the local host. It is added when there is no backend server, or when no backend server is detected as ready.
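For illustration only, here is a minimal Go sketch of that dummy-endpoint idea, assuming discovery/v1 EndpointSlice endpoints; the helper name, placeholder address, and the readiness flag of the dummy endpoint are assumptions, not the values this PR uses.

// Minimal sketch, assuming discovery/v1 endpoints; the placeholder address
// and the readiness of the dummy endpoint are illustrative assumptions.
package lb

import discoveryv1 "k8s.io/api/discovery/v1"

// placeholderIP is a hypothetical unreachable address, for illustration only.
const placeholderIP = "10.255.255.1"

// ensureNonEmptyEndpoints appends a dummy endpoint when no backend server
// exists or none is ready, so LB traffic is not accidentally routed to
// services listening on the node itself.
func ensureNonEmptyEndpoints(endpoints []discoveryv1.Endpoint) []discoveryv1.Endpoint {
    for _, ep := range endpoints {
        if ep.Conditions.Ready != nil && *ep.Conditions.Ready {
            return endpoints // at least one ready backend, keep the list as-is
        }
    }
    ready := true // assumption: mark the dummy endpoint ready so the slice is not empty
    return append(endpoints, discoveryv1.Endpoint{
        Addresses:  []string{placeholderIP},
        Conditions: discoveryv1.EndpointConditions{Ready: &ready},
    })
}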
Force-pushed from e8db723 to cc6d434
Force-pushed from 2321047 to 015c0b2
Allow a placeholder LB to exist without any VM instance
Fix duplicated IP allocation or release
Simplify the VMI controller
Enhance the health check probe
Refactor the LB controller, moving the LB to not-ready status in various error cases
The refactor generally LGTM, thank you for the great work.
However, letting users create dummy LBs (LBs without backend servers) introduces a problem: if the listening port of an LB is 22, 80, 443, or any other port that Harvester exposes on the nodes, users will be able to reach the Harvester services unexpectedly. The same situation also happens to LBs whose backend servers all fail the health check. This could be an existing issue; I just want to bring up that letting users create dummy LBs will widen the hole.
// add the existing endpoint
endpoints = append(endpoints, ep)
break
}
}
// add the non-existing endpoint
- if !flag {
+ if !existing {
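To make the fragment above readable on its own, here is a hedged reconstruction of the merge loop it belongs to; the surrounding names (backendAddresses, oldEndpoints) are assumptions, not the PR's actual identifiers.

// Sketch under assumed names: keep endpoints that already exist (preserving
// their probed ready condition) and append the ones that do not exist yet.
endpoints := make([]discoveryv1.Endpoint, 0, len(backendAddresses))
for _, address := range backendAddresses {
    existing := false
    for _, ep := range oldEndpoints {
        if len(ep.Addresses) > 0 && ep.Addresses[0] == address {
            // add the existing endpoint
            endpoints = append(endpoints, ep)
            existing = true
            break
        }
    }
    // add the non-existing endpoint
    if !existing {
        endpoints = append(endpoints, discoveryv1.Endpoint{
            Addresses: []string{address},
        })
    }
}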
Can we, by default, set the ready condition of the newly added endpoint to false? In the current implementation, if an LB with a health checker defined is created, the backend server endpoints' ready condition will be true from the beginning. This behavior seems a bit unexpected from the POV of a user who intentionally configured a health checker. Though this PR does not introduce the behavior, I suggest we make it this way. WDYT?
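A minimal sketch of this suggestion, assuming the endpoint-building code above and an lb.Spec.HealthCheck field; both the surrounding variables and the field name are assumptions.

// Sketch: when a health checker is configured, start a newly added endpoint
// as not ready and let the prober promote it after the first successful check.
ready := true
if lb.Spec.HealthCheck != nil { // assumed field name for the health checker
    ready = false // wait for the first successful probe
}
endpoints = append(endpoints, discoveryv1.Endpoint{
    Addresses:  []string{address},
    Conditions: discoveryv1.EndpointConditions{Ready: &ready},
})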
A few considerations:
- A user may enable healthCheck on the fly; setting the endpoint to false by default may cause a short interruption, and if it is frequently enabled/disabled, defaulting to true seems a bit better: if the backend server works well, there is almost no interruption; if not, the traffic itself is already interrupted.
- Backward compatibility.
- Sure, it has drawbacks: the UI may show the LB as not-ready a bit late. If everything starts from scratch, the LB transfers from: created, get-lb-ip, no backend server, VM is up and its IP is detected, LB becomes ready, health check fails and it becomes not-ready. Someone watching continuously would find the LB ready for a little while and then not-ready; there are other similar cases.

At the moment, I would like to keep the initial value as true. Will add this to the Harvester documentation. Your idea? Thanks.
pkg/controller/vmi/controller.go
Outdated
Maybe we could use the handy relatedresource.Watch as follows to replace the entire vmi controller here:
relatedresource.Watch(ctx, "lb-trigger", func(namespace, name string, obj runtime.Object) ([]relatedresource.Key, error) {
    var keys []relatedresource.Key
    vmi, ok := obj.(*kubevirtv1.VirtualMachineInstance)
    if !ok {
        return keys, nil
    }
    lbs, err := handler.lbCache.List(vmi.Namespace, labels.Everything())
    if err != nil {
        return nil, fmt.Errorf("fail to list load balancers, error: %w", err)
    }
    for _, lb := range lbs {
        // skip the cluster LB or the LB whose server selector is empty
        if lb.DeletionTimestamp != nil || lb.Spec.WorkloadType == lbv1.Cluster || len(lb.Spec.BackendServerSelector) == 0 {
            continue
        }
        // notify LB
        selector, err := utils.NewSelector(lb.Spec.BackendServerSelector)
        if err != nil {
            return nil, fmt.Errorf("fail to parse selector %+v, error: %w", lb.Spec.BackendServerSelector, err)
        }
        if selector.Matches(labels.Set(vmi.Labels)) {
            logrus.Debugf("VMI %s/%s notify lb %s/%s", vmi.Namespace, vmi.Name, lb.Namespace, lb.Name)
            handler.lbController.Enqueue(lb.Namespace, lb.Name)
        }
    }
    return keys, nil
}, lbc, vmis)
Will test this in the last step, thanks.
Force-pushed from 035a24e to b938e34
@starbops @FrankYang0529 Thanks for your effective review, please take a new look. Besides the minor fixes, the second commit brings:
The validator returns such an error:
The mutator makes sure that when the user enables the healthCheck but does not fill in the time/threshold parameters, they are defaulted to:
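A hedged sketch of that defaulting, for illustration only: the lbv1.HealthCheck field names and the concrete default values are assumptions, since the actual values are not shown above.

// Sketch only: field names and default values are assumptions.
func defaultHealthCheck(hc *lbv1.HealthCheck) {
    if hc == nil {
        return
    }
    if hc.SuccessThreshold == 0 {
        hc.SuccessThreshold = 1 // hypothetical default
    }
    if hc.FailureThreshold == 0 {
        hc.FailureThreshold = 3 // hypothetical default
    }
    if hc.PeriodSeconds == 0 {
        hc.PeriodSeconds = 5 // hypothetical default
    }
    if hc.TimeoutSeconds == 0 {
        hc.TimeoutSeconds = 3 // hypothetical default
    }
}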
Force-pushed from b938e34 to 53108d3
Signed-off-by: Jian Wang <[email protected]>
Force-pushed from 53108d3 to 83e1432
LGTM. Thanks for the PR.
LGTM, thank you!
The optimizations:
1. Remove the strict limitation that at least one VM must exist when creating an LB
2. Refactor the controllers
2.1 VMI changes only enqueue the LB, instead of adding/removing backend servers directly
2.2 The LB changes to Not Ready in case of errors like IP allocation failure, service failure, endpointslice failure...
2.3 Remove the controller's looping wait for the External IP; use enqueue instead (see the sketch right after this list)
3. Refactor the IP allocation (TBD); the current code has a few bugs: [Bug] Load Balancer Deployment Fails in Both Guest and Harvester Cluster Scenarios harvester#5033 (comment)
...
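As referenced in item 2.3, a minimal sketch of the enqueue-based approach, assuming a wrangler-style controller with EnqueueAfter and the handler/service variables used elsewhere in this PR; the 5-second delay is an arbitrary illustrative value.

// Sketch: instead of looping until the service's external IP shows up,
// requeue the LB and return; a later reconcile reads the allocated IP.
if len(svc.Status.LoadBalancer.Ingress) == 0 {
    handler.lbController.EnqueueAfter(lb.Namespace, lb.Name, 5*time.Second) // try again shortly
    return lb, nil
}
externalIP := svc.Status.LoadBalancer.Ingress[0].IP // used later for the LB status update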
Related issues:
harvester/harvester#5316
harvester/harvester#4821
harvester/harvester#4972
harvester/harvester#5033
Test plan:
(1) An LB can be created alone when no VM exists; the Ready status is False.
(2) VMs can be added to / removed from an LB freely.
(3) The LB can allocate and free IPs from/to the pool robustly; test IP exhaustion of the pool, even with no VM instances.
(4) If the health-check probe is enabled and all probes fail while there is at least one active VM on this LB, the Ready status is False; if the probe is disabled, the LB turns to Ready.