Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token) #598

Merged: 1 commit into release-2.3 on Jul 3, 2024

Conversation

AndiDog commented Jul 2, 2024

Towards https://github.com/giantswarm/giantswarm/issues/31105

If CAPA detects two changes in quick succession, such as two upgrades in a row, it previously waited until the ongoing ASG instance refresh was finished. While waiting, it didn't update the join token, so after 10+ minutes, no new nodes could join and the cluster was screwed up.

Waiting for the refresh to finish makes no sense if the next detected change would roll all nodes again anyway. Therefore, we now cancel the ongoing instance refresh, wait until a new one can be started (i.e. while the old one is still in Cancelling, not yet Cancelled, status), and only then start a new one.
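Conceptually, the new flow looks like this (a minimal sketch against the AWS SDK for Go v1 `autoscaling` API; the helper and its names are illustrative, not the PR's actual code):

```go
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// ensureFreshInstanceRefresh cancels an unfinished refresh (whose nodes
// would be rolled again anyway) and reports whether the caller should
// requeue and retry, or has just started a new refresh.
func ensureFreshInstanceRefresh(svc *autoscaling.AutoScaling, asgName string) (requeue bool, err error) {
	out, err := svc.DescribeInstanceRefreshes(&autoscaling.DescribeInstanceRefreshesInput{
		AutoScalingGroupName: aws.String(asgName),
	})
	if err != nil {
		return false, err
	}
	if len(out.InstanceRefreshes) > 0 {
		// Refreshes are returned newest first; only the latest one matters.
		switch aws.StringValue(out.InstanceRefreshes[0].Status) {
		case autoscaling.InstanceRefreshStatusPending, autoscaling.InstanceRefreshStatusInProgress:
			// A previous refresh is still rolling nodes with outdated
			// user data: cancel it instead of waiting for it to finish.
			_, err := svc.CancelInstanceRefresh(&autoscaling.CancelInstanceRefreshInput{
				AutoScalingGroupName: aws.String(asgName),
			})
			return true, err
		case autoscaling.InstanceRefreshStatusCancelling:
			// Cancellation already requested; a new refresh can only be
			// started once the status reaches Cancelled.
			return true, nil
		}
	}
	// No unfinished refresh: start a new one that rolls all nodes.
	_, err = svc.StartInstanceRefresh(&autoscaling.StartInstanceRefreshInput{
		AutoScalingGroupName: aws.String(asgName),
	})
	return false, err
}
```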

I've tested this by applying either 1 or 2 cluster-aws changes in quick succession, observing logs and instance refresh behavior.

@AndiDog AndiDog force-pushed the s3-user-data-support-for-awsmachinepool branch from 3d4cd25 to 00e1b55 Compare July 2, 2024 15:21
@AndiDog AndiDog marked this pull request as ready for review July 2, 2024 15:49
@AndiDog AndiDog requested a review from a team July 2, 2024 15:49
@AndiDog AndiDog force-pushed the s3-user-data-support-for-awsmachinepool branch from 00e1b55 to 935e729 Compare July 2, 2024 16:17
@@ -192,13 +192,13 @@ func (r *AWSMachinePoolReconciler) Reconcile(ctx context.Context, req ctrl.Reque
 		return ctrl.Result{}, r.reconcileDelete(machinePoolScope, infraScope, infraScope)
 	}

-	return ctrl.Result{}, r.reconcileNormal(ctx, machinePoolScope, infraScope, infraScope, s3Scope)
+	return r.reconcileNormal(ctx, machinePoolScope, infraScope, infraScope, s3Scope)
AndiDog (Author):
reconcileNormal now returns a Result object as well, since it may return "please wait 30 seconds until the previous ASG instance refresh is cancelled"
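Roughly (a simplified sketch, not the literal PR code; assumes the usual controller-runtime imports):

```go
// Before: func (r *AWSMachinePoolReconciler) reconcileNormal(...) error
// After:
func (r *AWSMachinePoolReconciler) reconcileNormal(ctx context.Context /* ... */) (ctrl.Result, error) {
	// ...
	if waitingForRefreshCancellation { // hypothetical condition
		// Ask controller-runtime to come back later instead of blocking.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}
```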

-	canUpdateLaunchTemplate := func() (bool, error) {
+	canStartInstanceRefresh := func() (bool, *string, error) {
AndiDog (Author):
I've renamed this. The name was much too generic: there's only one usage of this function, and that's for ASGs + instance refresh, so there's no need to give it a name with broad meaning.

This now additionally returns the status of the previous instance refresh, if any. With that, we can decide whether CancelInstanceRefresh needs to be called (e.g. status InProgress), or not (already in status Cancelling).
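The caller can then act on it roughly like this (a sketch; identifier names are assumed, and the exact condition in the PR may differ):

```go
canStartRefresh, unfinishedRefreshStatus, err := canStartInstanceRefresh()
if err != nil {
	return nil, err
}
if !canStartRefresh {
	if unfinishedRefreshStatus != nil && *unfinishedRefreshStatus != autoscaling.InstanceRefreshStatusCancelling {
		// Still Pending/InProgress: request cancellation.
		if err := cancelInstanceRefresh(); err != nil {
			return nil, err
		}
	}
	// Already Cancelling: nothing to do but wait for Cancelled.
	return &ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```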

-		return err
+		return ctrl.Result{}, err
 	}
+	if res != nil {
AndiDog (Author):
As above: this function can now optionally return a Result. In that case, return early (it delays the reconciliation with RequeueAfter: ... seconds).
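So the call site ends up looking roughly like this (sketch; the parameter list and its order are assumptions):

```go
res, err := ec2svc.ReconcileLaunchTemplate(scope, canStartInstanceRefresh, cancelInstanceRefresh /* ... */)
if err != nil {
	return ctrl.Result{}, err
}
if res != nil {
	// The launch template code asked us to wait, e.g. 30 seconds until
	// the previous instance refresh is fully cancelled.
	return *res, nil
}
```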

 		}
 		return asgsvc.CanStartASGInstanceRefresh(machinePoolScope)
 	}
+	cancelInstanceRefresh := func() error {
AndiDog (Author):
This callback function is new, and gets passed to the launch template / ASG reconciliation code, which decides whether to cancel the previous instance refresh (because there are changes that require a new refresh, i.e. rolling all nodes again).
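A plausible body for the callback (the ASG service method name here is an assumption, not necessarily what the PR calls it):

```go
cancelInstanceRefresh := func() error {
	machinePoolScope.Info("Cancelling previous ASG instance refresh")
	// Hypothetical service method wrapping autoscaling.CancelInstanceRefresh.
	return asgsvc.CancelASGInstanceRefresh(machinePoolScope)
}
```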

@@ -222,13 +226,29 @@ func (s *Service) ReconcileLaunchTemplate(
 	launchTemplateNeedsUserDataSecretKeyTag := launchTemplateUserDataSecretKey == nil

 	if needsUpdate || tagsChanged || amiChanged || userDataSecretKeyChanged {
-		canUpdate, err := canUpdateLaunchTemplate()
+		// More than just the bootstrap token changed
AndiDog (Author):
Getting into this if block means we want to roll all nodes because more than just the join token was updated/changed.

Remember the rename and return-value change from above: s/canUpdateLaunchTemplate/canStartInstanceRefresh.

-		if !canUpdate {
-			conditions.MarkFalse(scope.GetSetter(), expinfrav1.PreLaunchTemplateUpdateCheckCondition, expinfrav1.PreLaunchTemplateUpdateCheckFailedReason, clusterv1.ConditionSeverityWarning, "")
-			return errors.New("Cannot update the launch template, prerequisite not met")
+		if !canStartRefresh {
AndiDog (Author):

This is the main change.

+		} else {
+			scope.Info("Existing instance refresh is not finished, delaying reconciliation until the next one can be started", "unfinishedRefreshStatus", unfinishedRefreshStatus)
+		}
+		return &ctrl.Result{RequeueAfter: 30 * time.Second}, nil
Reviewer (Member):
So if I understand correctly, it will requeue if there is an instance refresh currently being cancelled, or there is an instance refresh with no status message? When would that happen?

AndiDog (Author):
It would requeue if there's a refresh in any unfinished status (InstanceRefreshStatusInProgress, InstanceRefreshStatusPending, InstanceRefreshStatusCancelling). Once it transitions to a status where a new refresh can be started, we go ahead (no requeue). There's always a status.
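In other words, the check boils down to something like this (sketch using the AWS SDK v1 status constants):

```go
// A new instance refresh can only be started once the previous one has
// reached a terminal status.
func refreshFinished(status string) bool {
	switch status {
	case autoscaling.InstanceRefreshStatusPending,
		autoscaling.InstanceRefreshStatusInProgress,
		autoscaling.InstanceRefreshStatusCancelling:
		return false // unfinished: requeue and check again later
	}
	return true // Successful, Failed or Cancelled: go ahead
}
```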

@@ -18,6 +18,7 @@ package services

 import (
 	apimachinerytypes "k8s.io/apimachinery/pkg/types"
+	ctrl "sigs.k8s.io/controller-runtime"
Reviewer (Member):
This is similar to the other comment I left you on a different PR 😄 I think we need to be careful with the dependencies between packages or everything will be tangled up. Can we maybe return a custom error that we can check in the caller somehow, so that we don't have to make the services package depend on controller-runtime?
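One shape the suggestion could take (a sketch of the reviewer's idea, which the PR ultimately did not adopt; assumes `errors`, `fmt`, `time` and the controller-runtime import on the caller side):

```go
// In the services package: no controller-runtime import required.
type RequeueAfterError struct{ After time.Duration }

func (e RequeueAfterError) Error() string {
	return fmt.Sprintf("requeue after %s", e.After)
}

// In the controller, which already depends on controller-runtime:
var requeueErr RequeueAfterError
if errors.As(err, &requeueErr) {
	return ctrl.Result{RequeueAfter: requeueErr.After}, nil
}
```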

AndiDog (Author):
For this one, I wouldn't make the effort to pull it out. Other code below this package uses the Result object as well, and also has other dependencies on controller-runtime.

Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token)
@AndiDog AndiDog force-pushed the s3-user-data-support-for-awsmachinepool branch from 935e729 to 9a5622b Compare July 3, 2024 09:34
AndiDog (Author) left a comment:

Rebasing with no changes


AndiDog added a commit to giantswarm/giantswarm-aws-account-prerequisites that referenced this pull request Jul 3, 2024
@AndiDog AndiDog changed the base branch from s3-test-fixes to release-2.3 July 3, 2024 09:39
AndiDog added a commit to giantswarm/giantswarm-aws-account-prerequisites that referenced this pull request Jul 3, 2024
@AndiDog AndiDog merged commit 021690f into release-2.3 Jul 3, 2024
5 of 8 checks passed
AndiDog added a commit to giantswarm/giantswarm-aws-account-prerequisites that referenced this pull request Jul 3, 2024
AndiDog added a commit to giantswarm/giantswarm-aws-account-prerequisites that referenced this pull request Jul 4, 2024
fiunchinho pushed a commit that referenced this pull request Jul 4, 2024
Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token) (#598)
fiunchinho added a commit that referenced this pull request Jul 4, 2024
* Add Giant Swarm fork modifications

* Push to Azure registry

* aws-cni-deleted-helm-managed-resources

* import-order

* Filter CNI subnets when creating EKS NodeGroup

* add godoc

* 🐛 Create a `aws.Config` with region to be able to work different AWS partition (like gov cloud or china AWS partition) (#588)

* create-aws-client-with-region

* 🐛 Add ID to secondary subnets (#589)

* give name to secondary subnets

* make linter happy

* Add non root volumes to AWSMachineTemplate

* Support adding custom secondary VPC CIDR blocks in `AWSCluster` (backport) (#590)

* S3 user data support for `AWSMachinePool` (#592)

* Delete machine pool user data files that did not get deleted yet by the lifecycle policy (#593)

* Delete machine pool user data files that did not get deleted yet by the lifecycle policy

* Use paging for S3 results

* Log S3 list operation

* Handle NotFound

* Remove duplicated argument

* Add `make test` to Circle CI build, S3 test fixes (#596)

* Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token) (#598)

* Use feature gate for S3 storage (#599)

* Fixes after cherry-pick our customizations

---------

Co-authored-by: Andreas Sommer <[email protected]>
Co-authored-by: calvix <[email protected]>
Co-authored-by: Mario Nitchev <[email protected]>
Co-authored-by: calvix <[email protected]>
fiunchinho pushed a commit that referenced this pull request Aug 21, 2024
Cancel instance refresh on any relevant change to ASG instead of blocking until previous one is finished (which may have led to failing nodes due to outdated join token) (#598)