Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token) #598
Conversation
@@ -192,13 +192,13 @@ func (r *AWSMachinePoolReconciler) Reconcile(ctx context.Context, req ctrl.Reque
		return ctrl.Result{}, r.reconcileDelete(machinePoolScope, infraScope, infraScope)
	}

-	return ctrl.Result{}, r.reconcileNormal(ctx, machinePoolScope, infraScope, infraScope, s3Scope)
+	return r.reconcileNormal(ctx, machinePoolScope, infraScope, infraScope, s3Scope)
reconcileNormal now returns a Result object as well, since it may return "please wait 30 seconds until the previous ASG instance refresh is cancelled".
-	canUpdateLaunchTemplate := func() (bool, error) {
+	canStartInstanceRefresh := func() (bool, *string, error) {
I've renamed this. The name was much too generic: there's only one usage of this function, and that's for ASGs + instance refresh, so there's no need to give it a name with broad meaning.
This now additionally returns the status of the previous instance refresh, if any. With that, we can decide whether CancelInstanceRefresh needs to be called (e.g. status InProgress), or not (already in status Cancelling).
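A minimal, dependency-free sketch of the decision this renamed helper makes (the helper below is hypothetical, not the PR's code; the status names follow the EC2 Auto Scaling instance refresh API):

```go
package main

// unfinishedStatuses are refresh states during which a new instance refresh
// cannot yet be started.
var unfinishedStatuses = map[string]bool{
	"Pending":    true,
	"InProgress": true,
	"Cancelling": true,
}

// canStartInstanceRefresh returns (true, nil) when a new refresh may start,
// or (false, &status) when the most recent refresh is still unfinished, so
// the caller can decide whether cancellation is needed (e.g. InProgress)
// or already underway (Cancelling).
func canStartInstanceRefresh(latestStatus string) (bool, *string) {
	if unfinishedStatuses[latestStatus] {
		return false, &latestStatus
	}
	return true, nil
}
```

With the returned status, the caller only issues a cancel for refreshes that are actually still running, and merely waits when one is already Cancelling.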
-			return err
+			return ctrl.Result{}, err
		}
		if res != nil {
As above: this function can now optionally return a Result. In that case, return early (it delays the reconciliation with `RequeueAfter: ... seconds`).
		}
		return asgsvc.CanStartASGInstanceRefresh(machinePoolScope)
	}
	cancelInstanceRefresh := func() error {
This callback function is new, and gets passed to the launch template / ASG reconciliation code, which decides whether to cancel the previous instance refresh (because there are changes that require a new refresh, i.e. rolling all nodes again).
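A hypothetical sketch of that wiring: the reconciliation code decides *whether* to cancel, while the injected callback knows *how* (in the real code it calls the ASG API; here it's just a closure, and all names are illustrative):

```go
package main

// reconcileLaunchTemplate models the launch template / ASG reconciliation
// receiving a cancel callback. When the detected changes require rolling
// all nodes again, the previous (now obsolete) refresh is cancelled first.
func reconcileLaunchTemplate(needsNewRefresh bool, cancelInstanceRefresh func() error) error {
	if needsNewRefresh {
		if err := cancelInstanceRefresh(); err != nil {
			return err
		}
	}
	return nil
}
```

Injecting the callback keeps the ASG-specific cancel logic out of the launch template code while letting the latter trigger it exactly when a relevant change is found.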
@@ -222,13 +226,29 @@ func (s *Service) ReconcileLaunchTemplate(
	launchTemplateNeedsUserDataSecretKeyTag := launchTemplateUserDataSecretKey == nil

-	if needsUpdate || tagsChanged || amiChanged || userDataSecretKeyChanged {
-		canUpdate, err := canUpdateLaunchTemplate()
+	// More than just the bootstrap token changed
Getting into this `if` block means we want to roll all nodes because more than just the join token was updated/changed.
Remember the rename and return-values change from above: s/canUpdateLaunchTemplate/canStartInstanceRefresh
-	if !canUpdate {
-		conditions.MarkFalse(scope.GetSetter(), expinfrav1.PreLaunchTemplateUpdateCheckCondition, expinfrav1.PreLaunchTemplateUpdateCheckFailedReason, clusterv1.ConditionSeverityWarning, "")
-		return errors.New("Cannot update the launch template, prerequisite not met")
+	if !canStartRefresh {
This is the main change
	} else {
		scope.Info("Existing instance refresh is not finished, delaying reconciliation until the next one can be started", "unfinishedRefreshStatus", unfinishedRefreshStatus)
	}
	return &ctrl.Result{RequeueAfter: 30 * time.Second}, nil
So if I understand correctly, it will requeue if there is an instance refresh currently being cancelled, or there is an instance refresh with no status message? When would that happen?
It would requeue if there's a refresh in any unfinished status (InstanceRefreshStatusInProgress, InstanceRefreshStatusPending, InstanceRefreshStatusCancelling). Once it transitions to a status where a new refresh can be started, we go ahead (no requeue). There's always a status.
@@ -18,6 +18,7 @@ package services

import (
+	apimachinerytypes "k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
This is similar to the other comment I left you on a different PR 😄 I think we need to be careful with the dependencies between packages or everything will be tangled up. Can we maybe return a custom error that we can check in the caller somehow, so that we don't have to make the services package depend on controller-runtime?
For this one, I wouldn't make the effort to pull it out. Other code below this package uses the Result object as well, and also has other dependencies on controller-runtime.
Rebasing with no changes
…king until previous one is finished (which may have led to failing nodes due to outdated join token) (#598)
* Add Giant Swarm fork modifications
* Push to Azure registry
* aws-cni-deleted-helm-managed-resources
* import-order
* Filter CNI subnets when creating EKS NodeGroup
* add godoc
* 🐛 Create a `aws.Config` with region to be able to work different AWS partition (like gov cloud or china AWS partition) (#588)
* create-aws-client-with-region
* 🐛 Add ID to secondary subnets (#589)
* give name to secondary subnets
* make linter happy
* Add non root volumes to AWSMachineTemplate
* Support adding custom secondary VPC CIDR blocks in `AWSCluster` (backport) (#590)
* S3 user data support for `AWSMachinePool` (#592)
* Delete machine pool user data files that did not get deleted yet by the lifecycle policy (#593)
* Use paging for S3 results
* Log S3 list operation
* Handle NotFound
* Remove duplicated argument
* Add `make test` to Circle CI build, S3 test fixes (#596)
* Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token) (#598)
* Use feature gate for S3 storage (#599)
* Fixes after cherry-pick our customizations

---------

Co-authored-by: Andreas Sommer <[email protected]>
Co-authored-by: calvix <[email protected]>
Co-authored-by: Mario Nitchev <[email protected]>
Co-authored-by: calvix <[email protected]>
…king until previous one is finished (which may have led to failing nodes due to outdated join token) (#598)
Towards https://github.com/giantswarm/giantswarm/issues/31105
If CAPA detects two changes subsequently, such as two upgrades in a row, it previously waited until the ongoing ASG instance refresh was finished. While waiting, it didn't update the join token, so after 10+ minutes, no new nodes could join and the cluster was screwed up.
Waiting for the refresh makes no sense if, with the next detected change, all nodes would be rolled again anyway. Therefore, we introduce cancelling the ongoing instance refresh, then waiting until a new one can be started (e.g. while in Cancelling, not Cancelled, status), and only then starting a new one.

I've tested this by applying either 1 or 2 cluster-aws changes subsequently, observing logs and instance refresh behavior.
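The new behavior described above can be summarized as a small state decision (an illustrative sketch, not the PR's code; status names follow the EC2 Auto Scaling instance refresh API):

```go
package main

// nextAction decides what the reconciler does based on the latest instance
// refresh status:
//   - an active refresh (Pending/InProgress) is cancelled, since its nodes
//     would be rolled again anyway,
//   - a refresh that is still Cancelling is waited out (requeue),
//   - any finished state (Cancelled/Successful/Failed) allows starting the
//     new refresh with the up-to-date join token.
func nextAction(latestRefreshStatus string) string {
	switch latestRefreshStatus {
	case "Pending", "InProgress":
		return "cancel"
	case "Cancelling":
		return "wait"
	default:
		return "start"
	}
}
```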