Integrate autoscaler with VerticaDB #195
Conversation
Two changes to the restart reconciler for pending pods:
- When we do re-ip, ignore pods that are pending (sketched below). The old behaviour would requeue them; they don't have an IP, so there is nothing we can do with them.
- Don't requeue restart if pods are running. We ended up getting into an infinite loop with no way out, so let's continue on with the reconciler instead. I was trying to remove the pending pods but couldn't because the removal was blocked in the restart.
I need to rethink these changes as they ended up breaking online upgrade.
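A minimal sketch of the pending-pod check, assuming a plain list of pods; this is illustrative only and the helper name is made up, not the operator's actual restart reconciler code:

package podutil

import (
	corev1 "k8s.io/api/core/v1"
)

// podsEligibleForReIP drops pods that are still Pending or have no IP assigned,
// since re-ip has nothing it can do with them; the caller simply continues
// instead of requeueing.
func podsEligibleForReIP(pods []corev1.Pod) []corev1.Pod {
	eligible := make([]corev1.Pod, 0, len(pods))
	for _, pod := range pods {
		if pod.Status.Phase == corev1.PodPending || pod.Status.PodIP == "" {
			continue
		}
		eligible = append(eligible, pod)
	}
	return eligible
}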
I'm pushing this to the autoscaler branch. It will go into main after we release 1.4.0 of the operator. I will create a changie entry when it goes into main.
A few minor changes to the e2e tests:
- In online-upgrade-kill-transient, we occasionally see a test failure at step 45. We run a pod to check access to the subcluster. This can fail due to timing, so we will change it to use a job so that it restarts if access fails. Only if the access fails continuously will it fail the test (see the sketch after this list).
- Moved revivedb-1-node to the extra tests.
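A sketch of why a Job helps here: the Job controller re-creates failed pods up to BackoffLimit, so a single timing failure does not sink the test. The image, command, and helper name below are placeholders, not the actual test step.

package e2e

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// accessCheckJob builds a Job whose pod probes the subcluster; on failure the
// Job controller retries up to BackoffLimit times before the step is marked failed.
func accessCheckJob(image string, cmd []string) *batchv1.Job {
	backoff := int32(6) // allow several retries for transient timing issues
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "subcluster-access-check"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoff,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever, // Jobs require Never or OnFailure
					Containers: []corev1.Container{
						{Name: "check", Image: image, Command: cmd},
					},
				},
			},
		},
	}
}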
Looks good. Just have a few comments.
ctrl "sigs.k8s.io/controller-runtime" | ||
) | ||
|
||
type RefreshCurrentSizeReconciler struct { |
Could we add a comment explaining what this struct does?
ctrl "sigs.k8s.io/controller-runtime" | ||
) | ||
|
||
type RefreshSelectorReconciler struct { |
Same here
var res ctrl.Result
scalingDone := false
// Update the VerticaDB with a retry mechanism for any conflict updates
err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
What exactly happens here? What is considered a conflict update?
It handles the case where someone else has updated the object since you last fetched it. It will retry the update by fetching the most recent copy of the object.
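A minimal sketch of the pattern, assuming a controller-runtime client; the helper and its arguments are illustrative, not the exact code in this PR:

package vas

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateWithConflictRetry re-fetches the latest copy of the object on every
// attempt, applies the in-memory change, and writes it back. If another writer
// updated the object since our read, the Update fails with a conflict (stale
// resourceVersion) and RetryOnConflict re-runs the closure with backoff.
func updateWithConflictRetry(ctx context.Context, c client.Client, nm types.NamespacedName,
	obj client.Object, mutate func()) error {
	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		if err := c.Get(ctx, nm, obj); err != nil {
			return err
		}
		mutate()
		return c.Update(ctx, obj)
	})
}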
// considerAddingSubclusters will grow the Vdb by adding new subclusters.
// Changes are made in-place in s.Vdb
func (s *SubclusterScaleReconciler) considerAddingSubclusters(newPodsNeeded int32) bool {
	origSize := len(s.Vdb.Spec.Subclusters)
In the function above you call this variable origNumSubclusters. I know it is definitely not a problem, but could we keep the same name in both places?
// Constants for VerticaAutoscaler reconciler
const (
	SubclusterServiceNameNotFound = "SubclusterServiceNameNotFound"
	VerticaDBNotFound             = "VerticaDBNotFound"
I think this should not be considered specific to the Autoscaler CR
The VerticaAutoscaler CR does have a reference to a VerticaDB. It is used in pkg/controllers/vas/k8s.go
Looks good.
This adds support for the horizontal pod autoscaler (HPA) to work with VerticaDB. With this integration, you can have k8s automatically scale a VerticaDB based on metrics from your workload.
We want flexibility in how we scale. Depending on the use case, we are going to look at providing two types:
- scaling the size of existing subclusters by adding or removing pods
- scaling by adding or removing entire subclusters
To support either of these use cases, or to add support for a different type of scaling, we are introducing a new custom resource that manages autoscaling of a set of subclusters. The new CR is called VerticaAutoscaler.
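For orientation, here is a rough Go sketch of the kind of spec such a CR carries. The field names are illustrative of the description above, not necessarily the exact schema introduced in this PR.

package v1beta1

// VerticaAutoscalerSpec is an illustrative sketch, not the authoritative schema.
type VerticaAutoscalerSpec struct {
	// VerticaDBName is the VerticaDB whose subclusters this autoscaler manages.
	VerticaDBName string `json:"verticaDBName"`
	// ServiceName selects the set of subclusters (sharing one service) to scale.
	ServiceName string `json:"serviceName,omitempty"`
	// ScalingGranularity chooses between resizing existing subclusters and
	// adding/removing whole subclusters.
	ScalingGranularity string `json:"scalingGranularity"`
	// TargetSize is the total pod count the selected subclusters should reach;
	// the HPA drives this value through the scale subresource.
	TargetSize int32 `json:"targetSize,omitempty"`
}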
We created a separate package to handle reconciliation of the new CR. It is handled by the same operator though.
A webhook was added for this new CR.
Sample usage:
Create your VerticaDB as normal. Be sure to include a CPU resource setting if scaling by CPU percent. Note: to scale beyond 3 nodes, be sure to include your license.
kubectl apply -f config/samples/v1beta1_verticadb.yaml
Create the VerticaAutoscaler to indicate how we want to scale. The default in this file will scale by adding/removing subclusters.
kubectl apply -f config/samples/v1beta1_verticaautoscaler.yaml
Create the HPA object to tell k8s to autoscale.
kubectl autoscale verticaautoscaler/verticaautoscaler-sample --cpu-percent=70 --min=3 --max=6