Ansible Automation Platform Idler #631

Open · wants to merge 11 commits into master
Conversation

alexeykazakov (Contributor)

It turned out that we can't rely on the "Running" condition in the AAP CRs: it doesn't properly reflect the status of AAPs. So we can't just use a simple cron job to idle AAPs.

This PR extends our idler so it can now handle AAPs too.

  • The idling timeout for AAPs is shortened (compared to the regular timeout from the Idler.Spec) so that the regular Pod idler does not interfere with the AAP idler; the AAP idler should kick in first.
  • I also tried to separate the AAP idler code from the main pod idler code.

TODO:

  • tests

openshift-ci bot commented Feb 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexeykazakov

idledAAPs := []string{}

// Check if there are any AAP CRs in the namespace
aapList, err := r.DynamicClient.Resource(aapGVR).Namespace(idler.Name).List(ctx, metav1.ListOptions{})
Contributor

@MatousJobanek discussed updating the idler to watch Pods and mentioned he'll work on that tomorrow. If we're doing that, then I guess we don't need to list AAP CRs like this? Can we start from the pod and work our way up the owner references to get to the AAP resource?

Contributor Author

There are two reasons to list AAPs:

  1. A working AAP instance has a lot of pods (around 20), and they have a complex ownership structure (for example, some of them look like this: AAP -> another AAP CR -> yet another AAP CR -> Deployment -> ReplicaSet -> Pod). In most cases we expect users to have just one AAP (or none) rather than many. Our UI will work with a single AAP instance only (it will create, show, and un-idle just one AAP with a hardcoded name).
    So, when there is just one AAP, we list all AAPs first to avoid walking this whole ownership chain for every pod (20 times) for a single AAP instance. If that AAP is already idled (after going through the ownership chain for the first pod), we don't need to handle the rest of the pods (the 19 remaining AAP pods and any other non-AAP pods the user might have).
    In other words, in most cases we only need to handle a single pod instead of 20.
  2. Since we want to list all AAPs for the reason above anyway, why not skip checking the pods altogether if there is no AAP in the namespace?

It's basically all about optimization. And a pod watcher doesn't change it, does it?
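
Roughly, the short-circuit looks like this (just a sketch to illustrate the idea, not the exact code in this PR; podList is assumed to hold the namespace's pods, and idleAAP stands in for the actual AAP idling step):

// Sketch only: list AAP CRs once, then walk pod owner references, stopping as
// soon as every AAP in the namespace has been idled.
aapList, err := r.DynamicClient.Resource(aapGVR).Namespace(idler.Name).List(ctx, metav1.ListOptions{})
if err != nil {
    return err
}
if len(aapList.Items) == 0 {
    return nil // no AAP in the namespace, nothing to do
}

idledAAPs := map[string]bool{} // names of AAP CRs that have already been idled
for _, pod := range podList.Items {
    if len(idledAAPs) == len(aapList.Items) {
        break // all AAPs are idled; no need to inspect the remaining ~19 pods
    }
    // getAAPOwner walks the owner-reference chain:
    // Pod -> ReplicaSet -> Deployment -> ... -> AAP CR (or nil if the pod is not AAP-owned).
    aapOwner, err := r.getAAPOwner(ctx, &pod, idler.Name)
    if err != nil {
        return err
    }
    if aapOwner == nil || idledAAPs[aapOwner.GetName()] {
        continue
    }
    // idleAAP (placeholder name) marks the AAP CR as idled.
    if err := r.idleAAP(ctx, aapOwner); err != nil {
        return err
    }
    idledAAPs[aapOwner.GetName()] = true
}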

Contributor

That makes sense, thanks for the explanation. I think that info would be good to capture in a comment as well

Contributor

How will this actually work together with the current implementation of the idler, which is executed before the AAP-specific one? Correct me if I'm wrong, but if a pod is owned by a controller (which is the case for all of these ~20 pods), then the current idler will try to idle it using the standard procedure:

func (r *Reconciler) scaleControllerToZero(ctx context.Context, meta metav1.ObjectMeta) (string, string, bool, error) {
    log.FromContext(ctx).Info("Scaling controller to zero", "name", meta.Name)
    owners := meta.GetOwnerReferences()
    for _, owner := range owners {
        if owner.Controller != nil && *owner.Controller {
            switch owner.Kind {
            case "Deployment":
                return r.scaleDeploymentToZero(ctx, meta.Namespace, owner)
            case "ReplicaSet":
                return r.scaleReplicaSetToZero(ctx, meta.Namespace, owner)
            case "DaemonSet":
                return r.deleteDaemonSet(ctx, meta.Namespace, owner) // Nothing to scale down. Delete instead.
            case "StatefulSet":
                return r.scaleStatefulSetToZero(ctx, meta.Namespace, owner)
            case "DeploymentConfig":
                return r.scaleDeploymentConfigToZero(ctx, meta.Namespace, owner)
            case "ReplicationController":
                return r.scaleReplicationControllerToZero(ctx, meta.Namespace, owner)
            case "Job":
                return r.deleteJob(ctx, meta.Namespace, owner) // Nothing to scale down. Delete instead.
            case "VirtualMachineInstance":
                return r.stopVirtualMachine(ctx, meta.Namespace, owner) // Nothing to scale down. Stop instead.
            }
        }
    }
    return "", "", false, nil
}

The code will go through this for all existing pods anyway, and it doesn't care whether the deployment is owned by someone else. Does this mean that the main logic simply scales everything down, AAP reverts the change back, and only the final logic in this aap_idler.go file scales the correct instance down?

Contributor Author

Yes, the main idler logic and the AAP idler logic kind of compete with each other. This is why I shortened the AAP idling timeout so it kicks in one hour earlier than the main idler: https://github.com/codeready-toolchain/member-operator/pull/631/files#diff-79b76c6fb335e6e635f582d5c79bbd4cadbb780830fa6af6e2e936e78b83253eR140-R143

// Check if the idler timeout is less than one hour and if so, set it to half of the timeout.
// Otherwise, subtract one hour from the timeout.
// This is to ensure that the AAP idler kicks in before the main idler.

So in theory the main idler should never try to idle controllers that are managed by the AAP idler, because the AAP idler would idle them one hour earlier.
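
For illustration, the adjustment described in that comment amounts to roughly this (a sketch; the exact variable and field names in the PR may differ):

// Sketch: derive the AAP idling timeout from the regular idler timeout so that
// the AAP idler always fires before the main pod idler does.
aapTimeoutSeconds := idler.Spec.TimeoutSeconds - 60*60 // one hour earlier than the main idler
if idler.Spec.TimeoutSeconds < 60*60 {
    // For short timeouts (< 1h) subtracting an hour would go negative,
    // so use half of the configured timeout instead.
    aapTimeoutSeconds = idler.Spec.TimeoutSeconds / 2
}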

@@ -111,6 +114,10 @@ func (r *Reconciler) Reconcile(ctx context.Context, request ctrl.Request) (ctrl.
        return reconcile.Result{}, r.wrapErrorWithStatusUpdate(ctx, idler, r.setStatusFailed, err,
            "failed to ensure idling '%s'", idler.Name)
    }
    if err := r.ensureAnsiblePlatformIdling(ctx, idler); err != nil {
Contributor

Couldn't we also call the AAP idler logic from the scaleControllerToZero function, only for pods that have an AAP owner reference, like we do for VMs?

Contributor Author

Unfortunately no. This is one of the reasons why I wanted to keep the AAP idler separate. AAP resource ownership is complex: pods are not owned by the AAP CRs directly; there is a long chain of ownership (see more details in my comment above). Integrating it with the main idler loop would require deep refactoring of the existing idler logic and deep integration of the main logic with the AAP idling, which I really want to avoid.
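
To give an idea of what going up the chain involves: each hop means reading the owner object and checking its own owner references, until an AAP CR is reached or the chain ends. A rough sketch of that idea (the real getAAPOwner in this PR may differ in signature and details; findGVRForKind and the cached resourceLists are the discovery helpers discussed further down):

// Sketch: follow controller owner references upward until an AAP CR is found
// or there are no more owners. Every hop is a GET of the owner object, which is
// why this is only done when AAP CRs actually exist in the namespace.
func (r *Reconciler) getAAPOwner(ctx context.Context, obj metav1.Object, namespace string) (*unstructured.Unstructured, error) {
    for _, ref := range obj.GetOwnerReferences() {
        if ref.Controller == nil || !*ref.Controller {
            continue
        }
        if ref.Kind == aapKind { // the owner is the AAP CR itself
            return r.DynamicClient.Resource(aapGVR).Namespace(namespace).Get(ctx, ref.Name, metav1.GetOptions{})
        }
        // Not an AAP yet: fetch the owner and keep climbing
        // (Pod -> ReplicaSet -> Deployment -> intermediate AAP CRs -> AAP).
        ownerGVR, err := findGVRForKind(ref.Kind, ref.APIVersion, r.resourceLists) // resourceLists assumed cached on r
        if err != nil {
            return nil, err
        }
        owner, err := r.DynamicClient.Resource(ownerGVR).Namespace(namespace).Get(ctx, ref.Name, metav1.GetOptions{})
        if err != nil {
            return nil, err
        }
        return r.getAAPOwner(ctx, owner, namespace)
    }
    return nil, nil // not owned, directly or indirectly, by an AAP
}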

Contributor

Are we sure that there is really no way (a label, an annotation, some spec field) to decide whether the pod is created/owned by the AAP instance?

I'm wondering if it wouldn't make sense to give it a try and incorporate it into the main logic - something like this:

  • if the controller is not owned by someone else, then scale it down
  • if the controller is owned by someone else, then:
    • if it's AAP, then go up the chain as you do using the getAAPOwner function
    • if it's owned by an unknown instance, then try to scale it down (expecting that the owner reverts the change)

In other words, incorporate the knowledge and usage of getAAPOwner into scaleControllerToZero or any of the functions called from it. It shouldn't be so hard, and it would be very similar to what you do in aap_idler.go. Roughly, the flow would be something like the sketch below.
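
Something along these lines (just a sketch of the suggested flow, mirroring the bullets above; scaleToZero and idleAAP are placeholder names for the existing scale-down and AAP-idling logic, not functions from this PR):

// Sketch of the suggested flow: decide what to do with a pod's controller
// before scaling it down.
func (r *Reconciler) scaleOrIdleController(ctx context.Context, namespace string, controller *unstructured.Unstructured) error {
    if len(controller.GetOwnerReferences()) == 0 {
        // Controller is not owned by anything else: scale it down as today.
        return r.scaleToZero(ctx, namespace, controller)
    }
    aapOwner, err := r.getAAPOwner(ctx, controller, namespace)
    if err != nil {
        return err
    }
    if aapOwner != nil {
        // Owned (directly or transitively) by an AAP CR: idle the AAP instead.
        return r.idleAAP(ctx, aapOwner)
    }
    // Owned by an unknown instance: scale it down anyway and expect the owner
    // to revert the change if it disagrees.
    return r.scaleToZero(ctx, namespace, controller)
}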

Contributor Author

Are we sure that there is really no way (a label, an annotation, some spec field) to decide whether the pod is created/owned by the AAP instance?

Yes, I'm sure, I checked. Theoretically it could be added on the AAP side, but we can't expect that to happen soon enough.

I'm wondering if it wouldn't make sense to give it a try and incorporate it into the main logic - something like this:

I wouldn't refactor the main idler for this yet. While it should not be super hard to do, IMO it's not a trivial change either. I would still try to keep the AAP code as separate as possible for now.

@mfrancisc (Contributor) left a comment

Looks good overall!

Comment on lines +33 to +44
// Get all API resources from the cluster using the discovery client. We need it for constructing GVRs for unstructured objects.
// Do it here once, so we do not have to list it multiple times before listing/getting every unstructured resource.
resourceLists, err := r.DiscoveryClient.ServerPreferredResources()
if err != nil {
    return err
}

// Check if the AAP API is even available/installed
aapGVR, err := findGVRForKind(aapKind, aapAPIVersion, resourceLists)
if err != nil {
    return err
}
Contributor

This could be done only once, either as part of the controller initialization or using sync.Once logic.
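
For reference, the sync.Once variant could look roughly like this (a sketch; it assumes findGVRForKind returns a schema.GroupVersionResource and that caching the result for the operator's lifetime is acceptable, see the reply below):

// Sketch: resolve the AAP GVR once per operator process instead of on every reconcile.
// Note that this also caches a discovery failure until the operator is restarted.
var (
    aapGVROnce   sync.Once
    cachedAAPGVR schema.GroupVersionResource
    aapGVRErr    error
)

func (r *Reconciler) aapGVRCached() (schema.GroupVersionResource, error) {
    aapGVROnce.Do(func() {
        resourceLists, err := r.DiscoveryClient.ServerPreferredResources()
        if err != nil {
            aapGVRErr = err
            return
        }
        cachedAAPGVR, aapGVRErr = findGVRForKind(aapKind, aapAPIVersion, resourceLists)
    })
    return cachedAAPGVR, aapGVRErr
}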

Contributor Author

I was thinking about that, but then if the API changes (operators are installed/uninstalled) we would have to restart the member operator manually. The requests seem to be quick enough, but we could move it to the initialization (or sync.Once) and keep in mind this limitation around installing new operators while the member operator is running.

Contributor Author

The most realistic risk here is if the AAP operator is updated and introduces some API changes that could break the idler (like introducing another AAP CRD that could appear in the middle of the owner chain).
