Ansible Automation Platform Idler #631
base: master
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alexeykazakov. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
controllers/idler/app_idler.go
Outdated
idledAAPs := []string{}

// Check if there are any AAP CRs in the namespace
aapList, err := r.DynamicClient.Resource(aapGVR).Namespace(idler.Name).List(ctx, metav1.ListOptions{})
@MatousJobanek discussed updating the idler to watch Pods and mentioned he'll work on that tomorrow. If we're doing that, then I guess we don't need to list AAP CRs like this? Can we start from the pod and work our way up the owner references to get to the AAP resource?
There are two reasons to list AAPs:
- A working AAP instance has a lot of pods (around 20), and they have a complex ownership structure (for example, some of them look like this: AAP -> another AAP CR -> yet another AAP CR -> Deployment -> ReplicaSet -> Pod). In most cases we expect users to have just one AAP (or none) rather than many, and our UI will work with a single AAP instance only (it will create, show, and un-idle just one AAP with a hardcoded name). So, when there is just one AAP, instead of walking the whole ownership chain for every pod (20 times) for a single AAP instance, we list all AAPs first; if the first AAP is already idled (after walking the ownership chain for the first pod), we don't need to handle the rest of the pods (the 19 remaining AAP pods plus any non-AAP pods the user might have). In other words, in most cases we only need to handle a single pod instead of 20 of them.
- Since we want to list all AAPs for the reason above anyway, we can also skip checking the pods altogether if there is no AAP in the namespace.
It's basically all about optimization. And a pod watcher doesn't change that, does it?
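For illustration, here is a minimal sketch of that optimization, assuming hypothetical helpers (getAAPOwner, idleAAP) passed in as closures; none of these names are taken from the PR:

```go
package idler

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// ensureAAPIdlingSketch illustrates the optimization described above; it is not the PR's code.
// getAAPOwner and idleAAP are hypothetical helpers passed in as closures.
func ensureAAPIdlingSketch(
	ctx context.Context,
	dyn dynamic.Interface,
	aapGVR schema.GroupVersionResource,
	namespace string,
	pods []corev1.Pod,
	getAAPOwner func(context.Context, *corev1.Pod) (*unstructured.Unstructured, error),
	idleAAP func(context.Context, *unstructured.Unstructured) error,
) error {
	// List AAP CRs once; if there are none, skip inspecting the pods altogether.
	aapList, err := dyn.Resource(aapGVR).Namespace(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	if len(aapList.Items) == 0 {
		return nil
	}

	idled := map[string]bool{}
	for i := range pods {
		// Walk up the owner references to the owning AAP CR, if any.
		owner, err := getAAPOwner(ctx, &pods[i])
		if err != nil {
			return err
		}
		if owner == nil || idled[owner.GetName()] {
			continue
		}
		if err := idleAAP(ctx, owner); err != nil {
			return err
		}
		idled[owner.GetName()] = true
		// Common case: a single AAP, so after its first pod there is nothing left to do.
		if len(idled) == len(aapList.Items) {
			break
		}
	}
	return nil
}
```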
That makes sense, thanks for the explanation. I think that info would be good to capture in a comment as well
How will this actually work together with the current implementation of the idler, which is executed before the app-specific one? Correct me if I'm wrong, but if a pod is owned by a controller (which is the case for all of these ~20 pods), then the current idler will try to idle it using the standard procedure:
member-operator/controllers/idler/idler_controller.go
Lines 316 to 342 in fc36345
func (r *Reconciler) scaleControllerToZero(ctx context.Context, meta metav1.ObjectMeta) (string, string, bool, error) {
	log.FromContext(ctx).Info("Scaling controller to zero", "name", meta.Name)
	owners := meta.GetOwnerReferences()
	for _, owner := range owners {
		if owner.Controller != nil && *owner.Controller {
			switch owner.Kind {
			case "Deployment":
				return r.scaleDeploymentToZero(ctx, meta.Namespace, owner)
			case "ReplicaSet":
				return r.scaleReplicaSetToZero(ctx, meta.Namespace, owner)
			case "DaemonSet":
				return r.deleteDaemonSet(ctx, meta.Namespace, owner) // Nothing to scale down. Delete instead.
			case "StatefulSet":
				return r.scaleStatefulSetToZero(ctx, meta.Namespace, owner)
			case "DeploymentConfig":
				return r.scaleDeploymentConfigToZero(ctx, meta.Namespace, owner)
			case "ReplicationController":
				return r.scaleReplicationControllerToZero(ctx, meta.Namespace, owner)
			case "Job":
				return r.deleteJob(ctx, meta.Namespace, owner) // Nothing to scale down. Delete instead.
			case "VirtualMachineInstance":
				return r.stopVirtualMachine(ctx, meta.Namespace, owner) // Nothing to scale down. Stop instead.
			}
		}
	}
	return "", "", false, nil
}
The code will go through this for all existing pods anyway, and it doesn't care whether the deployment is owned by someone else. Does that mean that the main logic simply scales everything down, AAP then reverts the change, and only the final logic in this aap_idler.go file scales the correct instance down?
Yes, the main idler logic and the AAP idler logic kind of compete with each other. This is why I shortened the AAP idling timeout so it kicks in one hour earlier than the main idler: https://github.com/codeready-toolchain/member-operator/pull/631/files#diff-79b76c6fb335e6e635f582d5c79bbd4cadbb780830fa6af6e2e936e78b83253eR140-R143
// Check if the idler timeout is less than one hour and if so, set it to half of the timeout.
// Otherwise, subtract one hour from the timeout.
// This is to ensure that the AAP idler kicks in before the main idler.
So in theory the main idler should never try to idle controllers that are managed by the AAP idler, because the AAP idler would idle them one hour earlier.
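For reference, a minimal sketch of the timeout adjustment those comments describe (the helper name is made up; the actual code is behind the link above):

```go
package idler

// aapTimeoutSeconds is a sketch of the adjustment described in the linked code,
// not the PR's exact implementation. It shortens the idler timeout so the AAP
// idler kicks in before the main idler.
func aapTimeoutSeconds(idlerTimeoutSeconds int32) int32 {
	const oneHour int32 = 60 * 60
	// If the idler timeout is less than one hour, use half of the timeout.
	if idlerTimeoutSeconds < oneHour {
		return idlerTimeoutSeconds / 2
	}
	// Otherwise, subtract one hour from the timeout.
	return idlerTimeoutSeconds - oneHour
}
```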
@@ -111,6 +114,10 @@ func (r *Reconciler) Reconcile(ctx context.Context, request ctrl.Request) (ctrl.
		return reconcile.Result{}, r.wrapErrorWithStatusUpdate(ctx, idler, r.setStatusFailed, err,
			"failed to ensure idling '%s'", idler.Name)
	}
	if err := r.ensureAnsiblePlatformIdling(ctx, idler); err != nil {
Couldn't we also call the AAP idler logic from the scaleControllerToZero function only for pods that have the AAP owner reference, similar to how we do it for VMs?
switch owner.Kind {
Unfortunately, no. This is one of the reasons why I wanted to keep the AAP idler separate. AAP resource ownership is complex: pods are not owned by the AAP CRs directly, and there is a long chain of ownership (see more details in my comment above). Integrating it with the main idler loop would require deep refactoring of the existing idler logic and deep integration of the main logic with the AAP idling, which I really want to avoid.
Are we sure that there is really no way (a label, annotation, or some spec field) we could use to decide whether the pod is created/owned by the AAP instance?
I'm wondering if it wouldn't make sense to give it a try and incorporate it into the main logic - something like this:
- if the controller is not owned by someone else, then scale it down
- if the controller is owned by someone else, then:
  - if it's AAP, then go up the chain as you do using the getAAPOwner function
  - if it's owned by an unknown instance, then try to scale it down (expecting that the owner reverts the change)
In other words, incorporate the knowledge and usage of getAAPOwner into scaleControllerToZero or any of the functions called from it. It shouldn't be so hard, and it would be very similar to what you do in aap_idler.go.
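For context, a rough sketch of what a getAAPOwner-style chain walk could look like; the signature, the resolveGVR closure, and the parameter names are guesses, not the PR's API:

```go
package idler

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// getAAPOwnerSketch walks an object's owner references up through intermediate
// objects until it reaches one of the AAP kind, or the top of the chain.
// resolveGVR stands in for the PR's findGVRForKind helper, and aapKind is assumed
// to match the PR's aapKind constant; the whole signature is a guess.
func getAAPOwnerSketch(
	ctx context.Context,
	dyn dynamic.Interface,
	resolveGVR func(kind, apiVersion string) (schema.GroupVersionResource, bool),
	namespace, aapKind string,
	obj metav1.Object,
) (*unstructured.Unstructured, error) {
	owners := obj.GetOwnerReferences()
	for len(owners) > 0 {
		// Follow the controlling owner only; ignore any other owner references.
		var controller *metav1.OwnerReference
		for i := range owners {
			if owners[i].Controller != nil && *owners[i].Controller {
				controller = &owners[i]
				break
			}
		}
		if controller == nil {
			return nil, nil
		}
		gvr, ok := resolveGVR(controller.Kind, controller.APIVersion)
		if !ok {
			return nil, nil // owner kind unknown to the cluster: not part of an AAP chain
		}
		u, err := dyn.Resource(gvr).Namespace(namespace).Get(ctx, controller.Name, metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		if u.GetKind() == aapKind {
			return u, nil // found the owning AAP CR
		}
		owners = u.GetOwnerReferences() // keep climbing the chain
	}
	return nil, nil // reached the top without finding an AAP
}
```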
Are we sure that there is really no way (a label, annotation, or some spec field) we could use to decide whether the pod is created/owned by the AAP instance?
Yes, I'm sure. I checked it. Theoretically it could be added on the AAP side, but we can't expect that to happen soon enough.
I'm wondering if it wouldn't make sense to give it a try and incorporate it into the main logic - something like this:
I wouldn't refactor the main idler for this yet. While it should not be super hard to do, IMO it's not a trivial change either. I would still try to keep the AAP code as separate as possible for now.
Looks good overall!
Co-authored-by: Francisc Munteanu <[email protected]>
// Get all API resources from the cluster using the discovery client. We need it for constructing GVRs for unstructured objects.
// Do it here once, so we do not have to list it multiple times before listing/getting every unstructured resource.
resourceLists, err := r.DiscoveryClient.ServerPreferredResources()
if err != nil {
	return err
}

// Check if the AAP API is even available/installed
aapGVR, err := findGVRForKind(aapKind, aapAPIVersion, resourceLists)
if err != nil {
	return err
}
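For illustration, one plausible shape of such a findGVRForKind helper (the PR's actual implementation may differ):

```go
package idler

import (
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// findGVRForKindSketch scans the discovery results for the given kind and apiVersion
// and builds the corresponding GroupVersionResource. It is a guess at what the
// helper used in the diff does, not the PR's code.
func findGVRForKindSketch(kind, apiVersion string, resourceLists []*metav1.APIResourceList) (schema.GroupVersionResource, error) {
	for _, list := range resourceLists {
		if list.GroupVersion != apiVersion {
			continue
		}
		gv, err := schema.ParseGroupVersion(list.GroupVersion)
		if err != nil {
			return schema.GroupVersionResource{}, err
		}
		for _, r := range list.APIResources {
			// Skip subresources such as "something/status".
			if r.Kind == kind && !strings.Contains(r.Name, "/") {
				return gv.WithResource(r.Name), nil
			}
		}
	}
	return schema.GroupVersionResource{}, fmt.Errorf("API %q with kind %q is not available in the cluster", apiVersion, kind)
}
```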
This could be done only once - either as part of the controller initialization or using sync.Once logic.
I was thinking about that, but then if the API changes (operators get installed/uninstalled), we would have to restart the member operator manually. The request seems to be quick enough, but we could move it to the initialization (or a sync.Once) and keep in mind this limitation around installing new operators while the member operator is running.
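A minimal sketch of the sync.Once variant being discussed; the type, field, and helper names are placeholders, and note that a transient discovery error would be cached too:

```go
package idler

import (
	"sync"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/discovery"
)

// aapGVRCache resolves the AAP GVR at most once per process. The trade-off from
// the thread applies: if the AAP operator is installed or uninstalled while the
// member operator is running, the cached result (or a cached error) goes stale
// until the operator is restarted.
type aapGVRCache struct {
	once sync.Once
	gvr  schema.GroupVersionResource
	err  error
}

// get performs the discovery call and the GVR lookup only on the first invocation.
// findGVRForKind is passed in and stands in for the helper shown in the diff above.
func (c *aapGVRCache) get(
	dc discovery.DiscoveryInterface,
	kind, apiVersion string,
	findGVRForKind func(kind, apiVersion string, lists []*metav1.APIResourceList) (schema.GroupVersionResource, error),
) (schema.GroupVersionResource, error) {
	c.once.Do(func() {
		var resourceLists []*metav1.APIResourceList
		resourceLists, c.err = dc.ServerPreferredResources()
		if c.err != nil {
			return
		}
		c.gvr, c.err = findGVRForKind(kind, apiVersion, resourceLists)
	})
	return c.gvr, c.err
}
```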
The most realistic risk here is if the AAP operator is updated and introduces some API changes that could break the idler (like introducing another AAP CRD in the middle of the owner chain).
Co-authored-by: Matous Jobanek <[email protected]>
It turned out that we can't rely on the "Running" condition in the AAP CRs: it doesn't properly reflect the status of AAPs. So we can't just use a simple cron job to idle AAPs.
This PR extends our idler so it can now handle AAPs too.
TODO: