
Integrate autoscaler with VerticaDB #195

Merged: 32 commits merged into vertica:autoscaler on Apr 12, 2022
Conversation

spilchen
Collaborator

@spilchen spilchen commented Apr 8, 2022

This adds support for the horizontal pod autoscaler (HPA) to work with VerticaDB. With this integration, you can have k8s automatically scale a VerticaDB based on metrics from your workload.

We want flexibility in how we scale. Depending on the use case we are going to look at providing two types:

  • Subcluster: Scale by adding or removing subclusters. This is the preferred approach for "dashboard"-style queries, which typically complete in a short amount of time. Multiple subclusters will all share the same service object, so client connections do not have to change their endpoints.
  • Pod: Scale by adding or removing pods in an existing subcluster. This is the preferred approach for longer-running analytic queries. We recommend a shard-to-node ratio of 2:1 or 3:1. If the ratio in the subcluster is smaller, it could benefit from increasing its size.
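
The ratio guidance above can be turned into a quick back-of-the-envelope calculation. The helper below is purely illustrative (podsNeededForRatio is not part of the operator): given a shard count and a target shard-to-node ratio, it returns the smallest subcluster size that keeps the ratio at or below the target.

```go
package main

import "fmt"

// podsNeededForRatio is a hypothetical helper: given the database's shard
// count and a target shard-to-node ratio (e.g. 3 for 3:1), it returns the
// minimum number of pods a subcluster needs so the ratio is not exceeded.
func podsNeededForRatio(shardCount, targetRatio int) int {
	// Round up so the shard-to-node ratio never exceeds the target.
	return (shardCount + targetRatio - 1) / targetRatio
}

func main() {
	fmt.Println(podsNeededForRatio(12, 3)) // 12 shards at 3:1 -> 4 pods
	fmt.Println(podsNeededForRatio(12, 2)) // 12 shards at 2:1 -> 6 pods
}
```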

To support either use case, and to leave room for other types of scaling, we are introducing a new custom resource that manages autoscaling of a set of subclusters. The new CR is called VerticaAutoscaler.

We created a separate package to handle reconciliation of the new CR. It is handled by the same operator, though.

A webhook was added for this new CR.

Sample usage:

  1. Create your VerticaDB as normal. Be sure to include a CPU resource setting if scaling by CPU percent. Note: to scale beyond 3 nodes, be sure to include your license.
    kubectl apply -f config/samples/v1beta1_verticadb.yaml

  2. Create the VerticaAutoscaler to indicate how we want to scale. The default in this file will scale by adding/removing subclusters.
    kubectl apply -f config/samples/v1beta1_verticaautoscaler.yaml

  3. Create the HPA object to tell k8s to autoscale.
    kubectl autoscale verticaautoscaler/verticaautoscaler-sample --cpu-percent=70 --min=3 --max=6
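
For reference, a VerticaAutoscaler manifest along the lines of the sample file might look like the sketch below. The field names here (verticaDBName, scalingGranularity, serviceName) are recalled from the operator's v1beta1 API and should be checked against config/samples/v1beta1_verticaautoscaler.yaml; treat this as an illustrative sketch, not an authoritative schema.

```yaml
apiVersion: vertica.com/v1beta1
kind: VerticaAutoscaler
metadata:
  name: verticaautoscaler-sample
spec:
  # Name of the VerticaDB CR to scale (assumed field name).
  verticaDBName: verticadb-sample
  # "Subcluster" scales by adding/removing subclusters;
  # "Pod" grows/shrinks an existing subcluster (assumed values).
  scalingGranularity: Subcluster
  # Subclusters that share this service are the scaling target (assumed).
  serviceName: primary
```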

Matt Spilchen added 26 commits March 10, 2022 15:42
Two changes to the restart reconciler for pending pods:
- when we do re-ip, let's ignore pods that are pending. The old behaviour
would requeue them. They don't have an IP, so there is nothing we can do.
- don't requeue restart if pods are running. We ended up getting into an
infinite loop without any way to get out. Let's continue on with the
reconciler. I was trying to remove the pending pods but I couldn't
because it was blocked in the restart.
I need to rethink these as it ended up breaking online upgrade.
@spilchen spilchen requested a review from roypaulin April 8, 2022 15:55
@spilchen spilchen self-assigned this Apr 8, 2022
@spilchen
Collaborator Author

spilchen commented Apr 8, 2022

I'm pushing this to the autoscaler branch. It will go into main after we release 1.4.0 of the operator. I will create a changie entry when it goes into main.

A few minor changes to the e2e tests:

- in online-upgrade-kill-transient, we occasionally see a test failure at step
  45. We run a pod to check access to the subcluster. This can fail due to
  timing, so we changed it to use a job so that it restarts if access fails.
  Only if access fails continuously will it fail the test.
- moved revivedb-1-node to the extra tests
Collaborator

@roypaulin roypaulin left a comment


Looks good. Just have a few comments.

ctrl "sigs.k8s.io/controller-runtime"
)

type RefreshCurrentSizeReconciler struct {
Collaborator


Could we add a comment explaining what this struct does?

ctrl "sigs.k8s.io/controller-runtime"
)

type RefreshSelectorReconciler struct {
Collaborator


Same here

var res ctrl.Result
scalingDone := false
// Update the VerticaDB with a retry mechanism for any conflict updates
err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
Collaborator


What exactly happens here? What is considered a conflict update?

Collaborator Author


It handles the case where someone else has updated the object since you last fetched it. The update is retried after fetching the most recent copy of the object.
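
retry.RetryOnConflict comes from k8s.io/client-go/util/retry. As a rough, stdlib-only sketch of the pattern (not the real implementation, which also applies a backoff between attempts), the retry loop looks like this:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the Kubernetes "409 Conflict" status error the
// API server returns when the object's resourceVersion is stale.
var errConflict = errors.New("conflict: object was modified")

// retryOnConflict is a simplified sketch of retry.RetryOnConflict: run fn,
// and if it fails with a conflict, call fn again (which is expected to
// re-fetch the latest object and reapply the update), up to maxRetries times.
func retryOnConflict(maxRetries int, fn func() error) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		err = fn()
		if err == nil || !errors.Is(err, errConflict) {
			return err // success, or a non-conflict error: stop retrying
		}
		// A real implementation backs off between attempts.
	}
	return err
}

func main() {
	attempts := 0
	err := retryOnConflict(5, func() error {
		attempts++ // each attempt would re-fetch the latest object here
		if attempts < 3 {
			return errConflict // someone else updated the object first
		}
		return nil // update applied cleanly
	})
	fmt.Println(attempts, err) // prints: 3 <nil>
}
```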

// considerAddingSubclusters will grow the Vdb by adding new subclusters.
// Changes are made in-place in s.Vdb
func (s *SubclusterScaleReconciler) considerAddingSubclusters(newPodsNeeded int32) bool {
origSize := len(s.Vdb.Spec.Subclusters)
Collaborator


In the function above you call this variable origNumSubclusters. I know it is definitely not a problem, but could we keep the same name for both?

// Constants for VerticaAutoscaler reconciler
const (
SubclusterServiceNameNotFound = "SubclusterServiceNameNotFound"
VerticaDBNotFound = "VerticaDBNotFound"
Collaborator


I think this should not be considered specific to the Autoscaler CR

Collaborator Author


The VerticaAutoscaler CR does have a reference to a VerticaDB. It is used in pkg/controllers/vas/k8s.go

Collaborator

@roypaulin roypaulin left a comment


Looks good.

@spilchen spilchen merged commit df5569d into vertica:autoscaler Apr 12, 2022
@spilchen spilchen deleted the autoscaler branch April 12, 2022 13:45