Ansible Automation Platform Idler #631

Open · wants to merge 11 commits into master
Conversation

alexeykazakov (Contributor)

It turned out that we can't rely on the "Running" condition in the AAP CRs: it doesn't properly reflect the status of AAPs. So we can't just use a simple cron job to idle AAPs.

This PR extends our idler so it can now handle AAPs too.

  • The idling timeout for AAPs is shortened (compared to the regular timeout from the Idler.Spec) so that the regular Pod idler does not interfere with the AAP idler; the AAP idler should kick in first.
  • I also tried to separate the AAP idler code from the main pod idler code.

TODO:

  • tests

openshift-ci bot commented Feb 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexeykazakov

idledAAPs := []string{}

// Check if there are any AAP CRs in the namespace
aapList, err := r.DynamicClient.Resource(aapGVR).Namespace(idler.Name).List(ctx, metav1.ListOptions{})
Contributor

@MatousJobanek discussed updating the idler to watch Pods and mentioned he'll work on that tomorrow. If we're doing that, then I guess we don't need to list AAP CRs like this? Can we start from the pod and work our way up the owner references to get to the AAP resource?

Contributor Author

There are two reasons to list AAPs:

  1. A working AAP instance has a lot of pods (around 20), and they have a complex ownership structure (for example, some of them look like this: AAP -> another AAP CR -> yet another AAP CR -> Deployment -> ReplicaSet -> Pod). In most cases we expect users to have just one AAP (or none) rather than many. Our UI will work with a single AAP instance only (it will create, show, and un-idle just one AAP with a hardcoded name).
    So, when there is just one AAP, we list all AAPs first to avoid walking this whole ownership chain for every pod (20 times) for a single AAP instance. If that AAP is already idled (after going through the ownership chain for the first pod), we don't need to handle the rest of the pods (the 19 remaining AAP pods and any other non-AAP pods the user might have).
    In other words, in most cases we only need to handle a single pod instead of 20.
  2. Since we want to list all AAPs for the reason above anyway, why not skip checking the pods altogether if there is no AAP in the namespace?

It's basically all about optimization. And a pod watcher doesn't change it, does it?
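
Roughly, the short-circuit looks like this (just a sketch to illustrate the idea, not the exact code in this PR; podList is assumed to hold the namespace's pods, and idleAAP stands in for the actual AAP idling step):

// Sketch only: list AAP CRs once, then walk pod owner references, stopping as
// soon as every AAP in the namespace has been idled.
aapList, err := r.DynamicClient.Resource(aapGVR).Namespace(idler.Name).List(ctx, metav1.ListOptions{})
if err != nil {
    return err
}
if len(aapList.Items) == 0 {
    return nil // no AAP in the namespace, nothing to do
}

idledAAPs := map[string]bool{} // names of AAP CRs that have already been idled
for _, pod := range podList.Items {
    if len(idledAAPs) == len(aapList.Items) {
        break // all AAPs are idled; no need to inspect the remaining ~19 pods
    }
    // getAAPOwner walks the owner-reference chain:
    // Pod -> ReplicaSet -> Deployment -> ... -> AAP CR (or nil if the pod is not AAP-owned).
    aapOwner, err := r.getAAPOwner(ctx, &pod, idler.Name)
    if err != nil {
        return err
    }
    if aapOwner == nil || idledAAPs[aapOwner.GetName()] {
        continue
    }
    // idleAAP (placeholder name) marks the AAP CR as idled.
    if err := r.idleAAP(ctx, aapOwner); err != nil {
        return err
    }
    idledAAPs[aapOwner.GetName()] = true
}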

Contributor

That makes sense, thanks for the explanation. I think that info would be good to capture in a comment as well

Contributor

How will this actually work together with the current implementation of the idler, which is executed before the AAP-specific one? Correct me if I'm wrong, but if a pod is owned by a controller (which is the case for all of these ~20 pods), then the current idler will try to idle it using the standard procedure:

func (r *Reconciler) scaleControllerToZero(ctx context.Context, meta metav1.ObjectMeta) (string, string, bool, error) {
    log.FromContext(ctx).Info("Scaling controller to zero", "name", meta.Name)
    owners := meta.GetOwnerReferences()
    for _, owner := range owners {
        if owner.Controller != nil && *owner.Controller {
            switch owner.Kind {
            case "Deployment":
                return r.scaleDeploymentToZero(ctx, meta.Namespace, owner)
            case "ReplicaSet":
                return r.scaleReplicaSetToZero(ctx, meta.Namespace, owner)
            case "DaemonSet":
                return r.deleteDaemonSet(ctx, meta.Namespace, owner) // Nothing to scale down. Delete instead.
            case "StatefulSet":
                return r.scaleStatefulSetToZero(ctx, meta.Namespace, owner)
            case "DeploymentConfig":
                return r.scaleDeploymentConfigToZero(ctx, meta.Namespace, owner)
            case "ReplicationController":
                return r.scaleReplicationControllerToZero(ctx, meta.Namespace, owner)
            case "Job":
                return r.deleteJob(ctx, meta.Namespace, owner) // Nothing to scale down. Delete instead.
            case "VirtualMachineInstance":
                return r.stopVirtualMachine(ctx, meta.Namespace, owner) // Nothing to scale down. Stop instead.
            }
        }
    }
    return "", "", false, nil
}

The code will go through this for all existing pods anyway, and it doesn't care whether the deployment is owned by someone else. Does this mean that the main logic simply scales everything down, AAP reverts the change back, and only the final logic in this aap_idler.go file scales the correct instance down?

Contributor Author

Yes, the main idler logic and the AAP idler logic kind of compete with each other. This is why I shortened the AAP idling timeout so it kicks in one hour earlier than the main idler: https://github.com/codeready-toolchain/member-operator/pull/631/files#diff-79b76c6fb335e6e635f582d5c79bbd4cadbb780830fa6af6e2e936e78b83253eR140-R143

// Check if the idler timeout is less than one hour and if so, set it to half of the timeout.
// Otherwise, subtract one hour from the timeout.
// This is to ensure that the AAP idler kicks in before the main idler.

So in theory the main idler should never try to idle controllers that are managed by the AAP idler, because the AAP idler would idle them one hour earlier.
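
For illustration, the adjustment described in that comment amounts to roughly this (a sketch; the exact variable and field names in the PR may differ):

// Sketch: derive the AAP idling timeout from the regular idler timeout so that
// the AAP idler always fires before the main pod idler does.
aapTimeoutSeconds := idler.Spec.TimeoutSeconds - 60*60 // one hour earlier than the main idler
if idler.Spec.TimeoutSeconds < 60*60 {
    // For short timeouts (< 1h) subtracting an hour would go negative,
    // so use half of the configured timeout instead.
    aapTimeoutSeconds = idler.Spec.TimeoutSeconds / 2
}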

@@ -111,6 +114,10 @@ func (r *Reconciler) Reconcile(ctx context.Context, request ctrl.Request) (ctrl.
        return reconcile.Result{}, r.wrapErrorWithStatusUpdate(ctx, idler, r.setStatusFailed, err,
            "failed to ensure idling '%s'", idler.Name)
    }
    if err := r.ensureAnsiblePlatformIdling(ctx, idler); err != nil {
Contributor

Couldn't we also call the AAP idler logic from the scaleControllerToZero function, only for pods that have an AAP owner reference, like we do for VMs?

Contributor Author

Unfortunately no. This is one of the reasons why I wanted to keep the AAP idler separate. AAP resource ownership is complex: pods are not owned by the AAP CRs directly; there is a long chain of ownership (see more details in my comment above). Integrating it with the main idler loop would require deep refactoring of the existing idler logic and deep integration of the main logic with the AAP idling, which I really want to avoid.
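
To give an idea of what going up the chain involves: each hop means reading the owner object and checking its own owner references, until an AAP CR is reached or the chain ends. A rough sketch of that idea (the real getAAPOwner in this PR may differ in signature and details; findGVRForKind and the cached resourceLists are the discovery helpers discussed further down):

// Sketch: follow controller owner references upward until an AAP CR is found
// or there are no more owners. Every hop is a GET of the owner object, which is
// why this is only done when AAP CRs actually exist in the namespace.
func (r *Reconciler) getAAPOwner(ctx context.Context, obj metav1.Object, namespace string) (*unstructured.Unstructured, error) {
    for _, ref := range obj.GetOwnerReferences() {
        if ref.Controller == nil || !*ref.Controller {
            continue
        }
        if ref.Kind == aapKind { // the owner is the AAP CR itself
            return r.DynamicClient.Resource(aapGVR).Namespace(namespace).Get(ctx, ref.Name, metav1.GetOptions{})
        }
        // Not an AAP yet: fetch the owner and keep climbing
        // (Pod -> ReplicaSet -> Deployment -> intermediate AAP CRs -> AAP).
        ownerGVR, err := findGVRForKind(ref.Kind, ref.APIVersion, r.resourceLists) // resourceLists assumed cached on r
        if err != nil {
            return nil, err
        }
        owner, err := r.DynamicClient.Resource(ownerGVR).Namespace(namespace).Get(ctx, ref.Name, metav1.GetOptions{})
        if err != nil {
            return nil, err
        }
        return r.getAAPOwner(ctx, owner, namespace)
    }
    return nil, nil // not owned, directly or indirectly, by an AAP
}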

Contributor

Are we sure that there is really no way (a label, an annotation, some spec field) to decide whether the pod is created/owned by the AAP instance?

I'm wondering if it wouldn't make sense to give it a try and incorporate it into the main logic - something like this:

  • if the controller is not owned by someone else, then scale it down
  • if the controller is owned by someone else, then:
    • if it's AAP, then go up the chain as you do using the getAAPOwner function
    • if it's owned by an unknown instance, then try to scale it down (expecting that the owner reverts the change)

In other words, incorporate the knowledge and usage of getAAPOwner into scaleControllerToZero or any of the functions called from it. It shouldn't be so hard, and it would be very similar to what you do in aap_idler.go. Roughly, the flow would be something like the sketch below.
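
Something along these lines (just a sketch of the suggested flow, mirroring the bullets above; scaleToZero and idleAAP are placeholder names for the existing scale-down and AAP-idling logic, not functions from this PR):

// Sketch of the suggested flow: decide what to do with a pod's controller
// before scaling it down.
func (r *Reconciler) scaleOrIdleController(ctx context.Context, namespace string, controller *unstructured.Unstructured) error {
    if len(controller.GetOwnerReferences()) == 0 {
        // Controller is not owned by anything else: scale it down as today.
        return r.scaleToZero(ctx, namespace, controller)
    }
    aapOwner, err := r.getAAPOwner(ctx, controller, namespace)
    if err != nil {
        return err
    }
    if aapOwner != nil {
        // Owned (directly or transitively) by an AAP CR: idle the AAP instead.
        return r.idleAAP(ctx, aapOwner)
    }
    // Owned by an unknown instance: scale it down anyway and expect the owner
    // to revert the change if it disagrees.
    return r.scaleToZero(ctx, namespace, controller)
}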

Contributor Author

Are we sure that there is really no way (a label, an annotation, some spec field) to decide whether the pod is created/owned by the AAP instance?

Yes, I'm sure, I checked. Theoretically it could be added on the AAP side, but we can't expect that to happen soon enough.

I'm wondering if it wouldn't make sense to give it a try and incorporate it into the main logic - something like this:

I wouldn't refactor the main idler for this yet. While it should not be super hard to do, IMO it's not a trivial change either. I would still try to keep the AAP code as separate as possible for now.

@mfrancisc (Contributor) left a comment

Looks good overall!

Comment on lines +33 to +44
// Get all API resources from the cluster using the discovery client. We need it for constructing GVRs for unstructured objects.
// Do it here once, so we do not have to list it multiple times before listing/getting every unstructured resource.
resourceLists, err := r.DiscoveryClient.ServerPreferredResources()
if err != nil {
    return err
}

// Check if the AAP API is even available/installed
aapGVR, err := findGVRForKind(aapKind, aapAPIVersion, resourceLists)
if err != nil {
    return err
}
Contributor

This could be done only once, either as part of the controller initialization or using sync.Once logic.
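
For reference, the sync.Once variant could look roughly like this (a sketch; it assumes findGVRForKind returns a schema.GroupVersionResource and that caching the result for the operator's lifetime is acceptable, see the reply below):

// Sketch: resolve the AAP GVR once per operator process instead of on every reconcile.
// Note that this also caches a discovery failure until the operator is restarted.
var (
    aapGVROnce   sync.Once
    cachedAAPGVR schema.GroupVersionResource
    aapGVRErr    error
)

func (r *Reconciler) aapGVRCached() (schema.GroupVersionResource, error) {
    aapGVROnce.Do(func() {
        resourceLists, err := r.DiscoveryClient.ServerPreferredResources()
        if err != nil {
            aapGVRErr = err
            return
        }
        cachedAAPGVR, aapGVRErr = findGVRForKind(aapKind, aapAPIVersion, resourceLists)
    })
    return cachedAAPGVR, aapGVRErr
}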

Contributor Author

I was thinking about that, but then if the API changes (operators are installed/uninstalled) we would have to restart the member operator manually. The requests seem to be quick enough, but we could move it to the initialization (or sync.Once) and keep in mind this limitation around installing new operators while the member operator is running.

Contributor Author

The most realistic risk here is if the AAP operator is updated and introduces some API changes that could break the idler (like introducing another AAP CRD that could appear in the middle of the owner chain).
