
Refactor LB #31

Merged: 2 commits merged on Aug 29, 2024

Conversation

@w13915984028 (Member) commented Jul 3, 2024

The optimizations:

  1. Remove the strict limitation that at least one VM must exist when creating an LB

  2. Refactor the controllers
    2.1 VMI changes only enqueue the related LB, instead of adding/removing backend servers directly
    2.2 The LB turns Not Ready on errors such as IP allocation failure, Service failure, or EndpointSlice failure
    2.3 Remove the controller's looping wait for the external IP and use enqueueing instead (see the sketch after this list)

  3. Refactor the IP allocation (TBD); the current code has a few bugs: [Bug] Load Balancer Deployment Fails in Both Guest and Harvester Cluster Scenarios harvester#5033 (comment)
    ...
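
Below is a minimal sketch of the pattern behind items 2.2 and 2.3, not the PR's actual code; the type and helper names are stand-ins. Instead of looping inside the controller until the external IP shows up, the handler records a Not-Ready state with a message and returns, relying on the next enqueue (e.g. triggered by a Service status change) to retry.

	package servicelb // hypothetical placement

	import corev1 "k8s.io/api/core/v1"

	// loadBalancer is a stand-in for the real LoadBalancer status fields.
	type loadBalancer struct {
		Address string
		Ready   bool
		Message string
	}

	// syncExternalIP marks the LB Not Ready while its Service has no external
	// IP yet, instead of blocking in a wait loop; a later enqueue retries it.
	func syncExternalIP(lb *loadBalancer, svc *corev1.Service) {
		ingress := svc.Status.LoadBalancer.Ingress
		if len(ingress) == 0 {
			lb.Ready, lb.Message = false, "waiting for external IP"
			return
		}
		lb.Address, lb.Ready, lb.Message = ingress[0].IP, true, ""
	}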

Related issues:
harvester/harvester#5316
harvester/harvester#4821
harvester/harvester#4972
harvester/harvester#5033

Test plan:
(1) An LB can be created alone when no VM exists; its Ready status is False
(2) VMs can be added to and removed from the LB freely
(3) The LB can allocate and free IPs from/to the pool robustly; test IP-pool exhaustion, even with no VM instances
(4) If the health-check probe is enabled and all probes fail while there is at least one active VM behind this LB, the Ready status is False; if the probe is disabled, the LB turns Ready

@w13915984028 (Member, Author) commented Jul 3, 2024

Local test:
(1) Various Not-Ready statuses of the LB

image

(2) Delete lb3; its IP is released and then lb4 gets it

image

(3) When the prober fails on all backend servers:

image

(3.1) Disable the health check; it became Ready

image

(3.2) Change the health-check port (from the non-working 80 to the working 22)

image

Then it became Ready

image

Added on 2024.08.09:
4. The dummy endpoint

image

A dummy endpoint is appended to prevent the LB traffic from accidentally being routed to the local host. It is added when there is no backend server, or when every backend server is detected as not ready (a sketch follows the YAML below).

endpointslice
NAME         ADDRESSTYPE   PORTS   ENDPOINTS                AGE

lb1          IPv4          443     10.52.0.15,10.52.0.255   5h42m  // VM 10.52.0.15 is not ready, then 10.52.0.255 is added
lb3          IPv4          443     10.52.0.255              3h43m
lb5          IPv4          443     10.52.0.255              3h40m
harv31:/home/rancher # kk get endpointslice -n test1
NAME        ADDRESSTYPE   PORTS   ENDPOINTS     AGE
test1-lb1   IPv4          443     10.52.0.255   4m16s

endpointslice lb1 -oyaml
addressType: IPv4
apiVersion: discovery.k8s.io/v1
endpoints:
- addresses:
  - 10.52.0.15
  conditions:
    ready: false
  targetRef:
    name: vm1
    namespace: default
    uid: a8ebb12c-bea8-4feb-8079-3c675d7dd00e
- addresses:
  - 10.52.0.255
  conditions:
    ready: true
  targetRef:
    name: lb1
    namespace: default
    uid: dummy347-546a-4642-9da6-endpoint5608
kind: EndpointSlice
metadata:
  creationTimestamp: "2024-08-09T07:58:57Z"
  generation: 28
  labels:
    kubernetes.io/service-name: lb1
    loadbalancer.harvesterhci.io/servicelb: "true"
  name: lb1
  namespace: default
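
For illustration, a minimal sketch of how such a dummy endpoint could be appended when no endpoint in the slice is ready; this is not the PR's actual manager code, and the constant and helper names are assumptions based on the output above.

	package servicelb // hypothetical placement

	import (
		corev1 "k8s.io/api/core/v1"
		discoveryv1 "k8s.io/api/discovery/v1"
	)

	// dummyAddress mirrors the placeholder address seen in the output above.
	const dummyAddress = "10.52.0.255"

	// ensureDummyEndpoint appends a ready dummy endpoint when the slice has no
	// ready endpoints, so the LB traffic is never routed to the local host.
	func ensureDummyEndpoint(eps *discoveryv1.EndpointSlice) {
		for _, ep := range eps.Endpoints {
			if ep.Conditions.Ready != nil && *ep.Conditions.Ready {
				return // at least one ready endpoint; no dummy needed
			}
		}
		ready := true
		eps.Endpoints = append(eps.Endpoints, discoveryv1.Endpoint{
			Addresses:  []string{dummyAddress},
			Conditions: discoveryv1.EndpointConditions{Ready: &ready},
			TargetRef: &corev1.ObjectReference{
				Name:      eps.Name, // points back at the LB itself, as in the YAML above
				Namespace: eps.Namespace,
			},
		})
	}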

@w13915984028 force-pushed the fixwebhook branch 10 times, most recently from e8db723 to cc6d434 on July 9, 2024 16:38
@w13915984028 force-pushed the fixwebhook branch 5 times, most recently from 2321047 to 015c0b2 on July 11, 2024 20:06
@w13915984028 changed the title from [WIP] Refactor LB to Refactor LB on Jul 11, 2024
  Allow a placeholder LB without any existing VM instance
  Fix duplicated IP allocation or release
  Simplify the VMI controller
  Enhance the health check probe
  Refactor the LB controller; move the LB to not-ready status in various error cases
@starbops (Member) left a comment:

The refactor generally LGTM, thank you for the great work.

However, letting users create dummy LBs (LBs without backend servers) introduces a problem: if the listening port of an LB is 22, 80, 443, or anything else that Harvester exposes on the nodes, users will be able to access the Harvester services unexpectedly. The same situation also happens to LBs whose backend servers all fail the health check. This could be an existing issue; I just want to point out that letting users create dummy LBs will widen the hole.

pkg/lb/servicelb/manager.go (outdated review thread)
// add the existing endpoint
endpoints = append(endpoints, ep)
break
}
}
// add the non-existing endpoint
-if !flag {
+if !existing {
Member:

Can we, by default, set the ready condition of the newly added endpoint to false? In the current implementation, if an LB with a health checker defined is created, the backend server endpoints' ready condition will be true from the beginning. This behavior seems a bit unexpected from the POV of a user who intentionally configured a health checker. Though this PR does not introduce the behavior, I suggest we make it this way. WDYT?

@w13915984028 (Member, Author) commented Aug 9, 2024:

A few considerations:

  1. Users may enable healthCheck on the fly, and defaulting to false could cause a short interruption; if it is frequently enabled/disabled, the default true seems a bit better:
    if the backend server works well, there is almost no interruption; if not, the traffic itself is interrupted anyway.
  2. Backward compatibility.
  3. Sure, it has drawbacks: the UI may show the LB as not-ready a bit late. If everything starts from the beginning, the LB transitions through: created, LB IP acquired, no backend server, VM is up and its IP is detected, LB becomes Ready, health check fails and the LB becomes not-ready.
    Someone watching continuously would see the LB Ready for a short while and then not-ready; there are other similar cases.

At the moment, I would like to keep the initial value as true and will document this in the Harvester docs. What do you think? Thanks.

Member:

Maybe we could use the handy relatedresource.Watch as follows to replace the entire VMI controller here:

	relatedresource.Watch(ctx, "lb-trigger", func(namespace, name string, obj runtime.Object) ([]relatedresource.Key, error) {
		var keys []relatedresource.Key

		vmi, ok := obj.(*kubevirtv1.VirtualMachineInstance)
		if !ok {
			return keys, nil
		}

		lbs, err := handler.lbCache.List(vmi.Namespace, labels.Everything())
		if err != nil {
			return nil, fmt.Errorf("fail to list load balancers, error: %w", err)
		}

		for _, lb := range lbs {
			// skip the cluster LB or the LB whose server selector is empty
			if lb.DeletionTimestamp != nil || lb.Spec.WorkloadType == lbv1.Cluster || len(lb.Spec.BackendServerSelector) == 0 {
				continue
			}
			// notify LB
			selector, err := utils.NewSelector(lb.Spec.BackendServerSelector)
			if err != nil {
				return nil, fmt.Errorf("fail to parse selector %+v, error: %w", lb.Spec.BackendServerSelector, err)
			}

			if selector.Matches(labels.Set(vmi.Labels)) {
				logrus.Debugf("VMI %s/%s notify lb %s/%s", vmi.Namespace, vmi.Name, lb.Namespace, lb.Name)
				handler.lbController.Enqueue(lb.Namespace, lb.Name)
			}
		}
		return keys, nil
	}, lbc, vmis)

@w13915984028 (Member, Author):

Will test this as the last step, thanks.

pkg/prober/worker.go (review thread)
@w13915984028 force-pushed the fixwebhook branch 3 times, most recently from 035a24e to b938e34 on August 9, 2024 18:39
@w13915984028 (Member, Author) commented Aug 9, 2024

@starbops @FrankYang0529 Thanks for your thorough review; please take another look.

Besides the minor fixes, the second commit brings:
(1) A dummy endpoint is added to the LB if it has no endpoints or all of its endpoints are not Ready, so that the traffic will not accidentally be routed to another service or the local host.
(2) A validator and a mutator for potentially invalid port numbers and health-check parameters.

The validator returns errors such as the following (a sketch of the corresponding checks follows the error list):


error: loadbalancers.loadbalancer.harvesterhci.io "lb1" could not be patched: admission webhook "validator.harvester-system.harvester-load-balancer-webhook" denied the request: 
Internal error occurred: update loadbalancer default/lb1 failed: listener backend port 443333 must <= 65535

..update loadbalancer default/lb1 failed with healthyCheck: healthcheck port 4435 is not in listener backend port list

..update loadbalancer default/lb1 failed with healthyCheck: healthcheck port 443 can only be a TCP backend port

..update loadbalancer default/lb1 failed: listener port 0 must >= 1

..update loadbalancer default/lb1 failed: the loadbalancer needs to have at least one listener
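
For reference, a minimal sketch of the kind of checks behind these messages, written against plain parameters because the exact CRD field names are not shown here; the helper names are hypothetical.

	package webhook // hypothetical placement

	import "fmt"

	// validateListener sketches the port-range checks reported above.
	func validateListener(port, backendPort int32) error {
		if port < 1 || port > 65535 {
			return fmt.Errorf("listener port %d must be >= 1 and <= 65535", port)
		}
		if backendPort < 1 || backendPort > 65535 {
			return fmt.Errorf("listener backend port %d must be >= 1 and <= 65535", backendPort)
		}
		return nil
	}

	// validateHealthCheckPort sketches the membership check: the health-check
	// port must be one of the listeners' TCP backend ports, collected by the
	// caller into tcpBackendPorts.
	func validateHealthCheckPort(port int32, tcpBackendPorts map[int32]bool) error {
		if !tcpBackendPorts[port] {
			return fmt.Errorf("healthcheck port %d can only be a TCP backend port", port)
		}
		return nil
	}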

The mutator makes sure:

  healthCheck:
    failureThreshold: 2 // min value 2
    periodSeconds: 1  // min value 1
    port: 443
    successThreshold: 2 // min value 2
    timeoutSeconds: 3 // min value 1

When the user enables healthCheck but does not fill in the time/threshold parameters, they are mutated as shown below (a sketch of this defaulting follows the screenshot):
image
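
A minimal sketch of that defaulting, using a stand-in struct instead of the real health-check spec; unset (zero) parameters are filled with the minimum values listed above, though the actual defaults applied by the mutator may differ.

	package webhook // hypothetical placement

	// healthCheck is a stand-in for the real health-check spec fields.
	type healthCheck struct {
		Port             int32
		SuccessThreshold int32
		FailureThreshold int32
		PeriodSeconds    int32
		TimeoutSeconds   int32
	}

	// applyHealthCheckDefaults fills unset (zero) parameters with the minimum
	// values shown above and leaves user-provided values untouched.
	func applyHealthCheckDefaults(hc *healthCheck) {
		if hc == nil || hc.Port == 0 {
			return // health check not enabled
		}
		if hc.SuccessThreshold == 0 {
			hc.SuccessThreshold = 2
		}
		if hc.FailureThreshold == 0 {
			hc.FailureThreshold = 2
		}
		if hc.PeriodSeconds == 0 {
			hc.PeriodSeconds = 1
		}
		if hc.TimeoutSeconds == 0 {
			hc.TimeoutSeconds = 1
		}
	}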

@FrankYang0529 (Member) left a comment:

LGTM. Thanks for the PR.

@starbops (Member) left a comment:

LGTM, thank you!

@starbops merged commit df4c506 into harvester:master on Aug 29, 2024
5 checks passed