Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token) #598

Merged: 1 commit into release-2.3 on Jul 3, 2024

Conversation

AndiDog commented Jul 2, 2024

Towards https://github.com/giantswarm/giantswarm/issues/31105

If CAPA detects two changes in quick succession, such as two upgrades in a row, it previously waited until the ongoing ASG instance refresh was finished. While waiting, it didn't update the join token, so after 10+ minutes, no new nodes could join and the cluster was screwed up.

Waiting for the refresh to finish makes no sense if the next detected change would roll all nodes again anyway. Therefore, we now cancel the ongoing instance refresh, wait until a new one can be started (i.e. while the old one is still in Cancelling, not yet Cancelled, status), and only then start a new one.
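Conceptually, the new flow looks like this (a minimal sketch against the AWS SDK for Go v1 `autoscaling` API; the helper and its names are illustrative, not the PR's actual code):

```go
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// ensureFreshInstanceRefresh cancels an unfinished refresh (whose nodes
// would be rolled again anyway) and reports whether the caller should
// requeue and retry, or has just started a new refresh.
func ensureFreshInstanceRefresh(svc *autoscaling.AutoScaling, asgName string) (requeue bool, err error) {
	out, err := svc.DescribeInstanceRefreshes(&autoscaling.DescribeInstanceRefreshesInput{
		AutoScalingGroupName: aws.String(asgName),
	})
	if err != nil {
		return false, err
	}
	if len(out.InstanceRefreshes) > 0 {
		// Refreshes are returned newest first; only the latest one matters.
		switch aws.StringValue(out.InstanceRefreshes[0].Status) {
		case autoscaling.InstanceRefreshStatusPending, autoscaling.InstanceRefreshStatusInProgress:
			// A previous refresh is still rolling nodes with outdated
			// user data: cancel it instead of waiting for it to finish.
			_, err := svc.CancelInstanceRefresh(&autoscaling.CancelInstanceRefreshInput{
				AutoScalingGroupName: aws.String(asgName),
			})
			return true, err
		case autoscaling.InstanceRefreshStatusCancelling:
			// Cancellation already requested; a new refresh can only be
			// started once the status reaches Cancelled.
			return true, nil
		}
	}
	// No unfinished refresh: start a new one that rolls all nodes.
	_, err = svc.StartInstanceRefresh(&autoscaling.StartInstanceRefreshInput{
		AutoScalingGroupName: aws.String(asgName),
	})
	return false, err
}
```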

I've tested this by applying either 1 or 2 cluster-aws changes in quick succession, observing logs and instance refresh behavior.

@AndiDog AndiDog force-pushed the s3-user-data-support-for-awsmachinepool branch from 3d4cd25 to 00e1b55 Compare July 2, 2024 15:21
@AndiDog AndiDog marked this pull request as ready for review July 2, 2024 15:49
@AndiDog AndiDog requested a review from a team July 2, 2024 15:49
@AndiDog AndiDog force-pushed the s3-user-data-support-for-awsmachinepool branch from 00e1b55 to 935e729 Compare July 2, 2024 16:17
@@ -192,13 +192,13 @@ func (r *AWSMachinePoolReconciler) Reconcile(ctx context.Context, req ctrl.Reque
 		return ctrl.Result{}, r.reconcileDelete(machinePoolScope, infraScope, infraScope)
 	}

-	return ctrl.Result{}, r.reconcileNormal(ctx, machinePoolScope, infraScope, infraScope, s3Scope)
+	return r.reconcileNormal(ctx, machinePoolScope, infraScope, infraScope, s3Scope)
AndiDog (Author):
reconcileNormal now returns a Result object as well, since it may return "please wait 30 seconds until the previous ASG instance refresh is cancelled"
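Roughly (a simplified sketch, not the literal PR code; assumes the usual controller-runtime imports):

```go
// Before: func (r *AWSMachinePoolReconciler) reconcileNormal(...) error
// After:
func (r *AWSMachinePoolReconciler) reconcileNormal(ctx context.Context /* ... */) (ctrl.Result, error) {
	// ...
	if waitingForRefreshCancellation { // hypothetical condition
		// Ask controller-runtime to come back later instead of blocking.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}
```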

-	canUpdateLaunchTemplate := func() (bool, error) {
+	canStartInstanceRefresh := func() (bool, *string, error) {
AndiDog (Author):
I've renamed this. The name was much too generic: there's only one usage of this function, and that's for ASGs + instance refresh, so there's no need to give it a name with broad meaning.

This now additionally returns the status of the previous instance refresh, if any. With that, we can decide whether CancelInstanceRefresh needs to be called (e.g. status InProgress), or not (already in status Cancelling).
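The caller can then act on it roughly like this (a sketch; identifier names are assumed, and the exact condition in the PR may differ):

```go
canStartRefresh, unfinishedRefreshStatus, err := canStartInstanceRefresh()
if err != nil {
	return nil, err
}
if !canStartRefresh {
	if unfinishedRefreshStatus != nil && *unfinishedRefreshStatus != autoscaling.InstanceRefreshStatusCancelling {
		// Still Pending/InProgress: request cancellation.
		if err := cancelInstanceRefresh(); err != nil {
			return nil, err
		}
	}
	// Already Cancelling: nothing to do but wait for Cancelled.
	return &ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```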

-		return err
+		return ctrl.Result{}, err
 	}
+	if res != nil {
AndiDog (Author):
As above: this function can now optionally return a Result. In that case, return early (it delays the reconciliation with RequeueAfter: ... seconds).
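So the call site ends up looking roughly like this (sketch; the parameter list and its order are assumptions):

```go
res, err := ec2svc.ReconcileLaunchTemplate(scope, canStartInstanceRefresh, cancelInstanceRefresh /* ... */)
if err != nil {
	return ctrl.Result{}, err
}
if res != nil {
	// The launch template code asked us to wait, e.g. 30 seconds until
	// the previous instance refresh is fully cancelled.
	return *res, nil
}
```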

 		}
 		return asgsvc.CanStartASGInstanceRefresh(machinePoolScope)
 	}
+	cancelInstanceRefresh := func() error {
AndiDog (Author):
This callback function is new, and gets passed to the launch template / ASG reconciliation code, which decides whether to cancel the previous instance refresh (because there are changes that require a new refresh, i.e. rolling all nodes again).
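A plausible body for the callback (the ASG service method name here is an assumption, not necessarily what the PR calls it):

```go
cancelInstanceRefresh := func() error {
	machinePoolScope.Info("Cancelling previous ASG instance refresh")
	// Hypothetical service method wrapping autoscaling.CancelInstanceRefresh.
	return asgsvc.CancelASGInstanceRefresh(machinePoolScope)
}
```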

@@ -222,13 +226,29 @@ func (s *Service) ReconcileLaunchTemplate(
 	launchTemplateNeedsUserDataSecretKeyTag := launchTemplateUserDataSecretKey == nil

 	if needsUpdate || tagsChanged || amiChanged || userDataSecretKeyChanged {
-		canUpdate, err := canUpdateLaunchTemplate()
+		// More than just the bootstrap token changed
AndiDog (Author):
Getting into this if block means we want to roll all nodes because more than just the join token was updated/changed.

Remember the rename and return-value change from above: s/canUpdateLaunchTemplate/canStartInstanceRefresh.

-		if !canUpdate {
-			conditions.MarkFalse(scope.GetSetter(), expinfrav1.PreLaunchTemplateUpdateCheckCondition, expinfrav1.PreLaunchTemplateUpdateCheckFailedReason, clusterv1.ConditionSeverityWarning, "")
-			return errors.New("Cannot update the launch template, prerequisite not met")
+		if !canStartRefresh {
AndiDog (Author):

This is the main change.

+		} else {
+			scope.Info("Existing instance refresh is not finished, delaying reconciliation until the next one can be started", "unfinishedRefreshStatus", unfinishedRefreshStatus)
+		}
+		return &ctrl.Result{RequeueAfter: 30 * time.Second}, nil
Reviewer (Member):
So if I understand correctly, it will requeue if there is an instance refresh currently being cancelled, or there is an instance refresh with no status message? When would that happen?

AndiDog (Author):
It would requeue if there's a refresh in any unfinished status (InstanceRefreshStatusInProgress, InstanceRefreshStatusPending, InstanceRefreshStatusCancelling). Once it transitions to a status where a new refresh can be started, we go ahead (no requeue). There's always a status.
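In other words, the check boils down to something like this (sketch using the AWS SDK v1 status constants):

```go
// A new instance refresh can only be started once the previous one has
// reached a terminal status.
func refreshFinished(status string) bool {
	switch status {
	case autoscaling.InstanceRefreshStatusPending,
		autoscaling.InstanceRefreshStatusInProgress,
		autoscaling.InstanceRefreshStatusCancelling:
		return false // unfinished: requeue and check again later
	}
	return true // Successful, Failed or Cancelled: go ahead
}
```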

@@ -18,6 +18,7 @@ package services

 import (
 	apimachinerytypes "k8s.io/apimachinery/pkg/types"
+	ctrl "sigs.k8s.io/controller-runtime"
Reviewer (Member):
This is similar to the other comment I left you on a different PR 😄 I think we need to be careful with the dependencies between packages or everything will be tangled up. Can we maybe return a custom error that we can check in the caller somehow, so that we don't have to make the services package depend on controller-runtime?
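One shape the suggestion could take (a sketch of the reviewer's idea, which the PR ultimately did not adopt; assumes `errors`, `fmt`, `time` and the controller-runtime import on the caller side):

```go
// In the services package: no controller-runtime import required.
type RequeueAfterError struct{ After time.Duration }

func (e RequeueAfterError) Error() string {
	return fmt.Sprintf("requeue after %s", e.After)
}

// In the controller, which already depends on controller-runtime:
var requeueErr RequeueAfterError
if errors.As(err, &requeueErr) {
	return ctrl.Result{RequeueAfter: requeueErr.After}, nil
}
```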

AndiDog (Author):
For this one, I wouldn't make the effort to pull it out. Other code below this package uses the Result object as well, and also has other dependencies on controller-runtime.

Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token)
@AndiDog AndiDog force-pushed the s3-user-data-support-for-awsmachinepool branch from 935e729 to 9a5622b Compare July 3, 2024 09:34
AndiDog (Author) left a comment:

Rebasing with no changes


AndiDog added a commit to giantswarm/giantswarm-aws-account-prerequisites that referenced this pull request Jul 3, 2024
@AndiDog AndiDog changed the base branch from s3-test-fixes to release-2.3 July 3, 2024 09:39
AndiDog added a commit to giantswarm/giantswarm-aws-account-prerequisites that referenced this pull request Jul 3, 2024
@AndiDog AndiDog merged commit 021690f into release-2.3 Jul 3, 2024
5 of 8 checks passed
AndiDog added a commit to giantswarm/giantswarm-aws-account-prerequisites that referenced this pull request Jul 3, 2024
AndiDog added a commit to giantswarm/giantswarm-aws-account-prerequisites that referenced this pull request Jul 4, 2024
fiunchinho pushed a commit that referenced this pull request Jul 4, 2024
Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token) (#598)
fiunchinho added a commit that referenced this pull request Jul 4, 2024
* Add Giant Swarm fork modifications

* Push to Azure registry

* aws-cni-deleted-helm-managed-resources

* import-order

* Filter CNI subnets when creating EKS NodeGroup

* add godoc

* 🐛 Create a `aws.Config` with region to be able to work different AWS partition (like gov cloud or china AWS partition) (#588)

* create-aws-client-with-region

* 🐛 Add ID to secondary subnets (#589)

* give name to secondary subnets

* make linter happy

* Add non root volumes to AWSMachineTemplate

* Support adding custom secondary VPC CIDR blocks in `AWSCluster` (backport) (#590)

* S3 user data support for `AWSMachinePool` (#592)

* Delete machine pool user data files that did not get deleted yet by the lifecycle policy (#593)

* Delete machine pool user data files that did not get deleted yet by the lifecycle policy

* Use paging for S3 results

* Log S3 list operation

* Handle NotFound

* Remove duplicated argument

* Add `make test` to Circle CI build, S3 test fixes (#596)

* Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token) (#598)

* Use feature gate for S3 storage (#599)

* Fixes after cherry-pick our customizations

---------

Co-authored-by: Andreas Sommer <[email protected]>
Co-authored-by: calvix <[email protected]>
Co-authored-by: Mario Nitchev <[email protected]>
Co-authored-by: calvix <[email protected]>
fiunchinho pushed a commit that referenced this pull request Aug 21, 2024
Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token) (#598)