Integrate autoscaler with VerticaDB #195
Conversation
Two changes to the restart reconciler for pending pods:
- When we do re-ip, ignore pods that are pending (sketched below). The old behaviour would requeue them; they don't have an IP, so there is nothing we can do with them.
- Don't requeue restart if pods are running. We ended up getting into an infinite loop with no way out, so let's continue on with the reconciler instead. I was trying to remove the pending pods but couldn't because the removal was blocked in the restart.
I need to rethink these changes as they ended up breaking online upgrade.
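A minimal sketch of the pending-pod check, assuming a plain list of pods; this is illustrative only and the helper name is made up, not the operator's actual restart reconciler code:

package podutil

import (
	corev1 "k8s.io/api/core/v1"
)

// podsEligibleForReIP drops pods that are still Pending or have no IP assigned,
// since re-ip has nothing it can do with them; the caller simply continues
// instead of requeueing.
func podsEligibleForReIP(pods []corev1.Pod) []corev1.Pod {
	eligible := make([]corev1.Pod, 0, len(pods))
	for _, pod := range pods {
		if pod.Status.Phase == corev1.PodPending || pod.Status.PodIP == "" {
			continue
		}
		eligible = append(eligible, pod)
	}
	return eligible
}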
I'm pushing this to the autoscaler branch. It will go into main after we release 1.4.0 of the operator. I will create a changie entry when it goes into main.
A few minor changes to the e2e tests:
- In online-upgrade-kill-transient, we occasionally see a test failure at step 45. We run a pod to check access to the subcluster. This can fail due to timing, so we will change it to use a job so that it restarts if access fails. Only if the access fails continuously will it fail the test (see the sketch after this list).
- Moved revivedb-1-node to the extra tests.
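A sketch of why a Job helps here: the Job controller re-creates failed pods up to BackoffLimit, so a single timing failure does not sink the test. The image, command, and helper name below are placeholders, not the actual test step.

package e2e

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// accessCheckJob builds a Job whose pod probes the subcluster; on failure the
// Job controller retries up to BackoffLimit times before the step is marked failed.
func accessCheckJob(image string, cmd []string) *batchv1.Job {
	backoff := int32(6) // allow several retries for transient timing issues
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "subcluster-access-check"},
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoff,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever, // Jobs require Never or OnFailure
					Containers: []corev1.Container{
						{Name: "check", Image: image, Command: cmd},
					},
				},
			},
		},
	}
}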
Looks good. Just have a few comments.
ctrl "sigs.k8s.io/controller-runtime" | ||
) | ||
|
||
type RefreshCurrentSizeReconciler struct { |
Could we add a comment explaining what this struct does?
ctrl "sigs.k8s.io/controller-runtime" | ||
) | ||
|
||
type RefreshSelectorReconciler struct { |
Same here
var res ctrl.Result
scalingDone := false
// Update the VerticaDB with a retry mechanism for any conflict updates
err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
What exactly happens here? What is considered a conflict update?
It handles the case where someone else has updated the object since you last fetched it. It will retry the update by fetching the most recent copy of the object.
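A minimal sketch of the pattern, assuming a controller-runtime client; the helper and its arguments are illustrative, not the exact code in this PR:

package vas

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateWithConflictRetry re-fetches the latest copy of the object on every
// attempt, applies the in-memory change, and writes it back. If another writer
// updated the object since our read, the Update fails with a conflict (stale
// resourceVersion) and RetryOnConflict re-runs the closure with backoff.
func updateWithConflictRetry(ctx context.Context, c client.Client, nm types.NamespacedName,
	obj client.Object, mutate func()) error {
	return retry.RetryOnConflict(retry.DefaultBackoff, func() error {
		if err := c.Get(ctx, nm, obj); err != nil {
			return err
		}
		mutate()
		return c.Update(ctx, obj)
	})
}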
// considerAddingSubclusters will grow the Vdb by adding new subclusters.
// Changes are made in-place in s.Vdb
func (s *SubclusterScaleReconciler) considerAddingSubclusters(newPodsNeeded int32) bool {
	origSize := len(s.Vdb.Spec.Subclusters)
In the function above you call this variable origNumSubclusters. I know it is definitely not a problem, but could we keep the same name in both places?
// Constants for VerticaAutoscaler reconciler
const (
	SubclusterServiceNameNotFound = "SubclusterServiceNameNotFound"
	VerticaDBNotFound             = "VerticaDBNotFound"
I think this should not be considered specific to the Autoscaler CR
The VerticaAutoscaler CR does have a reference to a VerticaDB. It is used in pkg/controllers/vas/k8s.go
Looks good.
This adds support for the horizontal pod autoscaler (HPA) to work with VerticaDB. With this integration, you can have k8s automatically scale a VerticaDB based on metrics from your workload.
We want flexibility in how we scale. Depending on the use case, we are going to look at providing two types:
- scaling the size of existing subclusters by adding or removing pods
- scaling by adding or removing entire subclusters
To support either of these use cases, or to add support for a different type of scaling, we are introducing a new custom resource that manages autoscaling of a set of subclusters. The new CR is called VerticaAutoscaler.
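For orientation, here is a rough Go sketch of the kind of spec such a CR carries. The field names are illustrative of the description above, not necessarily the exact schema introduced in this PR.

package v1beta1

// VerticaAutoscalerSpec is an illustrative sketch, not the authoritative schema.
type VerticaAutoscalerSpec struct {
	// VerticaDBName is the VerticaDB whose subclusters this autoscaler manages.
	VerticaDBName string `json:"verticaDBName"`
	// ServiceName selects the set of subclusters (sharing one service) to scale.
	ServiceName string `json:"serviceName,omitempty"`
	// ScalingGranularity chooses between resizing existing subclusters and
	// adding/removing whole subclusters.
	ScalingGranularity string `json:"scalingGranularity"`
	// TargetSize is the total pod count the selected subclusters should reach;
	// the HPA drives this value through the scale subresource.
	TargetSize int32 `json:"targetSize,omitempty"`
}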
We created a separate package to handle reconciliation of the new CR. It is handled by the same operator though.
A webhook was added for this new CR.
Sample usage:
Create your VerticaDB as normal. Be sure to include a CPU resource setting if scaling by CPU percent. Note: to scale beyond 3 nodes, be sure to include your license.
kubectl apply -f config/samples/v1beta1_verticadb.yaml
Create the VerticaAutoscaler to indicate how we want to scale. The default in this file will scale by adding/removing subclusters.
kubectl apply -f config/samples/v1beta1_verticaautoscaler.yaml
Create the HPA object to tell k8s to autoscale.
kubectl autoscale verticaautoscaler/verticaautoscaler-sample --cpu-percent=70 --min=3 --max=6