fix requeue in idler controller #633

Merged

Conversation

MatousJobanek
Contributor

@MatousJobanek MatousJobanek commented Mar 3, 2025

Added a watcher on pods with a predicate (sketched below) that should trigger the reconcile only if:

  • the pod exists in a user's namespace
  • either the startTime was set in the pod's status, or the pod reached the restart threshold.

Two fixes to the scheduling of the next requeue in the idler controller:

  1. drops the RequeueTimeThreshold - the original value caused the controller to keep reconciling like crazy in instances with a high number of users
  2. takes into account that VMs should live only one twelfth of the idler timeout - the next requeue should be scheduled accordingly.
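
A minimal sketch of what such a predicate can look like with controller-runtime, purely for illustration: the podIdlerPredicate name, the restartThreshold constant, and the isUserNamespace helper are assumptions made for this example, not the PR's actual code (which lives in controllers/idler/predicate.go).

```go
package idler

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// restartThreshold mirrors the controller's RestartThreshold constant (50).
const restartThreshold = 50

// podIdlerPredicate lets pod events through only when the idler could act on them.
type podIdlerPredicate struct {
	predicate.Funcs // default behaviour for Create/Delete/Generic events
}

func (p podIdlerPredicate) Update(e event.UpdateEvent) bool {
	pod, ok := e.ObjectNew.(*corev1.Pod)
	if !ok {
		return false
	}
	// Only pods living in user namespaces are interesting.
	if !isUserNamespace(pod.Namespace) {
		return false
	}
	// Reconcile once the pod has actually started...
	if pod.Status.StartTime != nil {
		return true
	}
	// ...or once any of its containers crossed the restart threshold.
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.RestartCount >= restartThreshold {
			return true
		}
	}
	return false
}

// isUserNamespace is only a placeholder for this sketch; the real check is
// whatever the member-operator uses to recognize user namespaces.
func isUserNamespace(ns string) bool {
	return strings.HasSuffix(ns, "-dev") || strings.HasSuffix(ns, "-stage")
}
```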

Contributor

@metlos metlos left a comment


Nice simplification of the code 👍🏼


codecov bot commented Mar 3, 2025

Codecov Report

Attention: Patch coverage is 88.57143% with 8 lines in your changes missing coverage. Please review.

Project coverage is 81.79%. Comparing base (fc36345) to head (4e807fb).
Report is 1 commit behind head on master.

Files with missing lines               Patch %   Lines
controllers/idler/idler_controller.go  90.90%    4 Missing ⚠️
controllers/idler/predicate.go         78.94%    3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #633      +/-   ##
==========================================
+ Coverage   81.71%   81.79%   +0.07%     
==========================================
  Files          29       31       +2     
  Lines        3330     3350      +20     
==========================================
+ Hits         2721     2740      +19     
- Misses        459      460       +1     
  Partials      150      150              
Files with missing lines                        Coverage Δ
controllers/idler/mapper.go                     100.00% <100.00%> (ø)
pkg/webhook/mutatingwebhook/userpods_mutate.go  87.09% <ø> (ø)
controllers/idler/idler_controller.go           93.76% <90.90%> (+0.49%) ⬆️
controllers/idler/predicate.go                  78.94% <78.94%> (ø)

```diff
@@ -35,7 +35,7 @@ import (

 const (
 	RestartThreshold = 50
-	RequeueTimeThreshold = 300 * time.Second
+	RequeueTimeThreshold = 6 * time.Hour
```
Contributor


I wonder if we can make it a bit shorter, like 3h? Basically, crash-looping pods will be killed after the threshold is reached (50 restarts) plus up to RequeueTimeThreshold. With this change it's up to 6h of extra time after the threshold instead of 5m. Which is fine. But it could potentially be shorter (3h???) if that's not too often for the idler with many namespaces.
But again, it's minor. We can keep it as 6h for now and shorten it after we switch to a pod watcher.

Contributor


+1

Contributor Author

@MatousJobanek MatousJobanek Mar 4, 2025


Well, I'm not really a fan of the short requeue time. Having 3h would result in a new reconcile of the controller every 5-6 seconds in an environment with 2k namespaces. Do we really need that?

Anyway, I implemented the watcher with a predicate that should take care of this without triggering unnecessary reconcile loops.
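
For context, the arithmetic behind that figure: 3 h is 10,800 s, and with roughly 2,000 namespaces each getting requeued once per cycle, that averages out to about 10,800 / 2,000 ≈ 5.4 s between reconciles - which matches the 5-6 second estimate above.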

Contributor

@ranakan19 ranakan19 left a comment


Thanks for the changes!

A couple of thoughts/comments/questions:

  1. Sorry, could you please clarify the issue with shorter VM idling mentioned in point 2 of the PR description? Was it not functioning as expected before, aside from the 5m requeue time being too aggressive and causing issues?
  2. The current implementation checks restartCount against the threshold but doesn't tightly couple requeueTime to it. Previously, the logic for restartCount also wasn't tied to the timeout, but the shorter requeue time would almost instantly idle the pod once it reached the restartThreshold. For instance, reaching a restartCount of 50 takes about 4 hours, yet idling a crash-looping pod with requeueTimeThreshold set to 6 hours can take anywhere between 6 and 12 hours. As Alexey pointed out, it's still better than not idling crash-looping pods at all, but a requeueTimeThreshold shorter than 6h should help too.

@MatousJobanek
Contributor Author

Sorry, could you please clarify the issue with shorter VM idling mentioned in point 2 of the PR description? Was it not functioning as expected before, aside from the 5m requeue time being too aggressive and causing issues?

Yeah, you are actually right - with the extremely short requeue time it wasn't actually a big problem; the VM would eventually get killed within 5 minutes of the scheduled idle time. I was mainly looking at the logic where it calculated the requeue time from the list of tracked pods, and this logic didn't take into account the pods owned by a VM. As a result, modifying the requeue threshold time indirectly had a negative impact on the VM idling.

The current implementation checks restartCount against the threshold but doesn't tightly couple requeueTime to it. Previously, the logic for restartCount also wasn't tied to the timeout, but the shorter requeue time would almost instantly idle the pod once it reached the restartThreshold. For instance, reaching a restartCount of 50 takes about 4 hours, yet idling a crash-looping pod with requeueTimeThreshold set to 6 hours can take anywhere between 6 and 12 hours. As Alexey pointed out, it's still better than not idling crash-looping pods at all, but a requeueTimeThreshold shorter than 6h should help too.

We need to find a balance between overloading the controller and our logs vs. being able to proactively act on changes in the cluster. The best way to act on changes is to use a watcher, so I replaced the requeue-threshold approach with a watcher and a predicate (see the sketch below).
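
To make that concrete, here is a minimal sketch of the kind of per-pod requeue calculation being discussed, where VM-owned pods are given one twelfth of the idler timeout (point 2 of the PR description). The trackedPod type, its field names, and the function name are assumptions for this example, not the controller's actual code.

```go
package idler

import "time"

// trackedPod is an illustrative stand-in for whatever the controller tracks per
// pod; only the fields needed for the example are shown.
type trackedPod struct {
	startTime time.Time
	ownedByVM bool
}

// nextRequeueAfter returns how long to wait before the next reconcile: the time
// until the soonest tracked pod is due to be idled. VM-owned pods use 1/12 of
// the idler timeout, so the next requeue is scheduled accordingly.
func nextRequeueAfter(idlerTimeout time.Duration, pods []trackedPod, now time.Time) time.Duration {
	shortest := idlerTimeout
	for _, p := range pods {
		timeout := idlerTimeout
		if p.ownedByVM {
			timeout = idlerTimeout / 12
		}
		if expiresIn := p.startTime.Add(timeout).Sub(now); expiresIn < shortest {
			shortest = expiresIn
		}
	}
	if shortest < time.Second {
		shortest = time.Second // never schedule an immediate, tight-loop requeue
	}
	return shortest
}
```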

```diff
@@ -248,7 +249,7 @@ func main() {
 	DynamicClient: dynamicClient,
 	GetHostCluster: cluster.GetHostCluster,
 	Namespace: namespace,
-	}).SetupWithManager(mgr); err != nil {
+	}).SetupWithManager(mgr, allNamespacesCluster); err != nil {
```
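
For readers following the diff, a rough sketch of how a pod watch backed by the all-namespaces cluster's cache can be wired into SetupWithManager using controller-runtime's builder (v0.15/v0.16-style API). The mapPodToIdler mapper, the assumption that an Idler is named after the namespace it manages, and the reuse of the podIdlerPredicate sketched earlier are illustrative, not the PR's exact code.

```go
package idler

import (
	"context"

	toolchainv1alpha1 "github.com/codeready-toolchain/api/api/v1alpha1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	runtimecluster "sigs.k8s.io/controller-runtime/pkg/cluster"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// SetupWithManager registers the existing idler Reconciler and, in addition, a
// watch on pods served from the all-namespaces cluster's cache, filtered by the
// predicate sketched earlier in this thread.
func (r *Reconciler) SetupWithManager(mgr ctrl.Manager, allNamespacesCluster runtimecluster.Cluster) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&toolchainv1alpha1.Idler{}).
		WatchesRawSource(
			source.Kind(allNamespacesCluster.GetCache(), &corev1.Pod{}), // pod events come from the secondary cache
			handler.EnqueueRequestsFromMapFunc(mapPodToIdler),           // map pod -> owning Idler (placeholder below)
			builder.WithPredicates(podIdlerPredicate{}),                 // only relevant pod updates get through
		).
		Complete(r)
}

// mapPodToIdler is a placeholder for this sketch (the PR's mapping lives in
// controllers/idler/mapper.go), assuming an Idler is named after the namespace
// it manages.
func mapPodToIdler(_ context.Context, obj client.Object) []reconcile.Request {
	return []reconcile.Request{{NamespacedName: types.NamespacedName{Name: obj.GetNamespace()}}}
}
```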
Contributor


Just curious, do you know the difference between allNamespacesCluster.GetCache() and allNamespacesCache? I'm wondering if we have overlap between these clients and caches.

Contributor Author

@MatousJobanek MatousJobanek Mar 6, 2025


It's actually the same:

```go
allNamespacesClient, allNamespacesCache, err := newAllNamespacesClient(cfg)
```

member-operator/main.go

Lines 338 to 345 in fc36345

```go
func newAllNamespacesClient(config *rest.Config) (client.Client, cache.Cache, error) {
	clusterAllNamespaces, err := runtimecluster.New(config, func(clusterOptions *runtimecluster.Options) {
		clusterOptions.Scheme = scheme
	})
	if err != nil {
		return nil, nil, err
	}
	return clusterAllNamespaces.GetClient(), clusterAllNamespaces.GetCache(), nil
}
```

The code in main.go could be refactored a bit to make this clearer, but that's for a different PR.

Contributor Author


As discussed in the call, I misunderstood the variables you were referring to. So yes, you are right, there is a duplication of the clients - addressed here: #635
Thanks for pointing it out 👍

@ranakan19
Contributor

I was mainly looking at the logic where it calculated the requeue time from the list of tracked pods, and this logic didn't take into account the pods owned by a VM

Ah, because only the timeoutSeconds variable was updated within the trackedPods loop, while the requeue-time calculation relied on the idler spec! I see now, thank you so much :)

Contributor

@ranakan19 ranakan19 left a comment


Great improvements! The predicate looks really nice.
Minor: the testIdlertimeoutseconds const in idler_controller_test can be a lower value now.


sonarqubecloud bot commented Mar 6, 2025

@MatousJobanek
Contributor Author

Minor: the testIdlertimeoutseconds const in idler_controller_test can be a lower value now.

It's a unit test - it's just a number. TBH, it's sometimes even better to keep it higher to have room for playing with the age of the workload and other numbers, so I'm keeping it as it is.

Contributor

@rajivnathan rajivnathan left a comment


Nice! 👍


openshift-ci bot commented Mar 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexeykazakov, MatousJobanek, metlos, rajivnathan, ranakan19


Needs approval from an approver in each of these files:
  • OWNERS [MatousJobanek,alexeykazakov,metlos,rajivnathan,ranakan19]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MatousJobanek MatousJobanek merged commit 8fdaf17 into codeready-toolchain:master Mar 7, 2025
12 of 13 checks passed