fix requeue in idler controller #633

Merged

Conversation

MatousJobanek
Contributor

@MatousJobanek MatousJobanek commented Mar 3, 2025

Added a watcher on pods with a predicate (sketched below) that should trigger the reconcile only if:

  • the pod exists in a user's namespace
  • either the startTime was set in the pod's status, or the pod reached the restart threshold.

Two fixes to the scheduling of the next requeue in the idler controller:

  1. drops the RequeueTimeThreshold - the original value caused the controller to keep reconciling like crazy in instances with a high number of users
  2. takes into account that VMs should live only one twelfth of the idler timeout - the next requeue should be scheduled accordingly.
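
A minimal sketch of what such a predicate can look like with controller-runtime, purely for illustration: the podIdlerPredicate name, the restartThreshold constant, and the isUserNamespace helper are assumptions made for this example, not the PR's actual code (which lives in controllers/idler/predicate.go).

```go
package idler

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// restartThreshold mirrors the controller's RestartThreshold constant (50).
const restartThreshold = 50

// podIdlerPredicate lets pod events through only when the idler could act on them.
type podIdlerPredicate struct {
	predicate.Funcs // default behaviour for Create/Delete/Generic events
}

func (p podIdlerPredicate) Update(e event.UpdateEvent) bool {
	pod, ok := e.ObjectNew.(*corev1.Pod)
	if !ok {
		return false
	}
	// Only pods living in user namespaces are interesting.
	if !isUserNamespace(pod.Namespace) {
		return false
	}
	// Reconcile once the pod has actually started...
	if pod.Status.StartTime != nil {
		return true
	}
	// ...or once any of its containers crossed the restart threshold.
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.RestartCount >= restartThreshold {
			return true
		}
	}
	return false
}

// isUserNamespace is only a placeholder for this sketch; the real check is
// whatever the member-operator uses to recognize user namespaces.
func isUserNamespace(ns string) bool {
	return strings.HasSuffix(ns, "-dev") || strings.HasSuffix(ns, "-stage")
}
```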

Contributor

@metlos metlos left a comment


Nice simplification of the code 👍🏼


codecov bot commented Mar 3, 2025

Codecov Report

Attention: Patch coverage is 88.57143% with 8 lines in your changes missing coverage. Please review.

Project coverage is 81.79%. Comparing base (fc36345) to head (4e807fb).
Report is 1 commit behind head on master.

Files with missing lines               Patch %   Lines
controllers/idler/idler_controller.go  90.90%    4 Missing ⚠️
controllers/idler/predicate.go         78.94%    3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #633      +/-   ##
==========================================
+ Coverage   81.71%   81.79%   +0.07%     
==========================================
  Files          29       31       +2     
  Lines        3330     3350      +20     
==========================================
+ Hits         2721     2740      +19     
- Misses        459      460       +1     
  Partials      150      150              
Files with missing lines                        Coverage Δ
controllers/idler/mapper.go                     100.00% <100.00%> (ø)
pkg/webhook/mutatingwebhook/userpods_mutate.go  87.09% <ø> (ø)
controllers/idler/idler_controller.go           93.76% <90.90%> (+0.49%) ⬆️
controllers/idler/predicate.go                  78.94% <78.94%> (ø)

```diff
@@ -35,7 +35,7 @@ import (

 const (
 	RestartThreshold = 50
-	RequeueTimeThreshold = 300 * time.Second
+	RequeueTimeThreshold = 6 * time.Hour
```
Contributor


I wonder if we can make it a bit shorter, like 3h? Basically, crash-looping pods will be killed after the threshold is reached (50 restarts) plus up to RequeueTimeThreshold. With this change it's up to 6h of extra time after the threshold instead of 5m. Which is fine. But it could potentially be shorter (3h???) if that's not too often for the idler with many namespaces.
But again, it's minor. We can keep it as 6h for now and shorten it after we switch to a pod watcher.

Contributor


+1

Contributor Author

@MatousJobanek MatousJobanek Mar 4, 2025


Well, I'm not really a fan of the short requeue time. Having 3h would result in a new reconcile of the controller every 5-6 seconds in an environment with 2k namespaces. Do we really need that?

Anyway, I implemented the watcher with a predicate that should take care of this without triggering unnecessary reconcile loops.
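
For context, the arithmetic behind that figure: 3 h is 10,800 s, and with roughly 2,000 namespaces each getting requeued once per cycle, that averages out to about 10,800 / 2,000 ≈ 5.4 s between reconciles - which matches the 5-6 second estimate above.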

Contributor

@ranakan19 ranakan19 left a comment


Thanks for the changes!

A couple of thoughts/comments/questions:

  1. Sorry, could you please clarify the issue with shorter VM idling mentioned in point 2 of the PR description? Was it not functioning as expected before, aside from the 5m requeue time being too aggressive and causing issues?
  2. The current implementation checks restartCount against the threshold but doesn't tightly couple requeueTime to it. Previously, the logic for restartCount also wasn't tied to the timeout, but the shorter requeue time would almost instantly idle the pod once it reached the restartThreshold. For instance, reaching a restartCount of 50 takes about 4 hours, yet idling a crash-looping pod with requeueTimeThreshold set to 6 hours can take anywhere between 6 and 12 hours. As Alexey pointed out, it's still better than not idling crash-looping pods at all, but a requeueTimeThreshold shorter than 6h should help too.

@MatousJobanek
Contributor Author

Sorry, could you please clarify the issue with shorter VM idling mentioned in point 2 of the PR description? Was it not functioning as expected before, aside from the 5m requeue time being too aggressive and causing issues?

Yeah, you are actually right - with the extremely short requeue time it wasn't actually a big problem; the VM would eventually get killed within 5 minutes of the scheduled idle time. I was mainly looking at the logic where it calculated the requeue time from the list of tracked pods, and this logic didn't take into account the pods owned by a VM. As a result, modifying the requeue threshold time indirectly had a negative impact on the VM idling.

The current implementation checks restartCount against the threshold but doesn't tightly couple requeueTime to it. Previously, the logic for restartCount also wasn't tied to the timeout, but the shorter requeue time would almost instantly idle the pod once it reached the restartThreshold. For instance, reaching a restartCount of 50 takes about 4 hours, yet idling a crash-looping pod with requeueTimeThreshold set to 6 hours can take anywhere between 6 and 12 hours. As Alexey pointed out, it's still better than not idling crash-looping pods at all, but a requeueTimeThreshold shorter than 6h should help too.

We need to find a balance between overloading the controller and our logs vs. being able to proactively act on changes in the cluster. The best way to act on changes is to use a watcher, so I replaced the requeue-threshold approach with a watcher and a predicate (see the sketch below).
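
To make that concrete, here is a minimal sketch of the kind of per-pod requeue calculation being discussed, where VM-owned pods are given one twelfth of the idler timeout (point 2 of the PR description). The trackedPod type, its field names, and the function name are assumptions for this example, not the controller's actual code.

```go
package idler

import "time"

// trackedPod is an illustrative stand-in for whatever the controller tracks per
// pod; only the fields needed for the example are shown.
type trackedPod struct {
	startTime time.Time
	ownedByVM bool
}

// nextRequeueAfter returns how long to wait before the next reconcile: the time
// until the soonest tracked pod is due to be idled. VM-owned pods use 1/12 of
// the idler timeout, so the next requeue is scheduled accordingly.
func nextRequeueAfter(idlerTimeout time.Duration, pods []trackedPod, now time.Time) time.Duration {
	shortest := idlerTimeout
	for _, p := range pods {
		timeout := idlerTimeout
		if p.ownedByVM {
			timeout = idlerTimeout / 12
		}
		if expiresIn := p.startTime.Add(timeout).Sub(now); expiresIn < shortest {
			shortest = expiresIn
		}
	}
	if shortest < time.Second {
		shortest = time.Second // never schedule an immediate, tight-loop requeue
	}
	return shortest
}
```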

```diff
@@ -248,7 +249,7 @@ func main() {
 	DynamicClient: dynamicClient,
 	GetHostCluster: cluster.GetHostCluster,
 	Namespace: namespace,
-	}).SetupWithManager(mgr); err != nil {
+	}).SetupWithManager(mgr, allNamespacesCluster); err != nil {
```
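
For readers following the diff, a rough sketch of how a pod watch backed by the all-namespaces cluster's cache can be wired into SetupWithManager using controller-runtime's builder (v0.15/v0.16-style API). The mapPodToIdler mapper, the assumption that an Idler is named after the namespace it manages, and the reuse of the podIdlerPredicate sketched earlier are illustrative, not the PR's exact code.

```go
package idler

import (
	"context"

	toolchainv1alpha1 "github.com/codeready-toolchain/api/api/v1alpha1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	runtimecluster "sigs.k8s.io/controller-runtime/pkg/cluster"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// SetupWithManager registers the existing idler Reconciler and, in addition, a
// watch on pods served from the all-namespaces cluster's cache, filtered by the
// predicate sketched earlier in this thread.
func (r *Reconciler) SetupWithManager(mgr ctrl.Manager, allNamespacesCluster runtimecluster.Cluster) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&toolchainv1alpha1.Idler{}).
		WatchesRawSource(
			source.Kind(allNamespacesCluster.GetCache(), &corev1.Pod{}), // pod events come from the secondary cache
			handler.EnqueueRequestsFromMapFunc(mapPodToIdler),           // map pod -> owning Idler (placeholder below)
			builder.WithPredicates(podIdlerPredicate{}),                 // only relevant pod updates get through
		).
		Complete(r)
}

// mapPodToIdler is a placeholder for this sketch (the PR's mapping lives in
// controllers/idler/mapper.go), assuming an Idler is named after the namespace
// it manages.
func mapPodToIdler(_ context.Context, obj client.Object) []reconcile.Request {
	return []reconcile.Request{{NamespacedName: types.NamespacedName{Name: obj.GetNamespace()}}}
}
```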
Contributor


Just curious, do you know the difference between allNamespacesCluster.GetCache() and allNamespacesCache? I'm wondering if we have overlap between these clients and caches.

Contributor Author

@MatousJobanek MatousJobanek Mar 6, 2025


It's actually the same:

```go
allNamespacesClient, allNamespacesCache, err := newAllNamespacesClient(cfg)
```

member-operator/main.go

Lines 338 to 345 in fc36345

```go
func newAllNamespacesClient(config *rest.Config) (client.Client, cache.Cache, error) {
	clusterAllNamespaces, err := runtimecluster.New(config, func(clusterOptions *runtimecluster.Options) {
		clusterOptions.Scheme = scheme
	})
	if err != nil {
		return nil, nil, err
	}
	return clusterAllNamespaces.GetClient(), clusterAllNamespaces.GetCache(), nil
}
```

The code in main.go could be refactored a bit to make this clearer, but that's for a different PR.

Contributor Author


As discussed in the call, I misunderstood the variables you were referring to. So yes, you are right, there is a duplication of the clients - addressed here: #635
Thanks for pointing it out 👍

@ranakan19
Contributor

I was mainly looking at the logic where it calculated the requeue time from the list of tracked pods, and this logic didn't take into account the pods owned by a VM

Ah, because only the timeoutSeconds variable was updated within the trackedPods loop, while the requeue-time calculation relied on the idler spec! I see now, thank you so much :)

Contributor

@ranakan19 ranakan19 left a comment


Great improvements! The predicate looks really nice.
Minor: the testIdlertimeoutseconds const in idler_controller_test can be a lower value now.


sonarqubecloud bot commented Mar 6, 2025

@MatousJobanek
Contributor Author

Minor: the testIdlertimeoutseconds const in idler_controller_test can be a lower value now.

It's a unit test - it's just a number. TBH, it's sometimes even better to keep it higher to have room for playing with the age of the workload and other numbers, so I'm keeping it as it is.

Contributor

@rajivnathan rajivnathan left a comment


Nice! 👍


openshift-ci bot commented Mar 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexeykazakov, MatousJobanek, metlos, rajivnathan, ranakan19


Needs approval from an approver in each of these files:
  • OWNERS [MatousJobanek,alexeykazakov,metlos,rajivnathan,ranakan19]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@MatousJobanek MatousJobanek merged commit 8fdaf17 into codeready-toolchain:master Mar 7, 2025
12 of 13 checks passed