
SANDBOX-967: check restartCounts and idle crashlooping pods #630

Merged: 14 commits, Feb 27, 2025

Conversation

@ranakan19 (Contributor) commented on Feb 20, 2025:

This PR

  • introduces functionality to idle crashlooping pods if restartCount > 50 (see the sketch after this description)
  • changes the requeue time to 5 minutes when there is nothing to idle
  • sends an idler notification when a crashlooping pod is idled; the idler notification messaging needs to be changed to not mention 12 hrs - host PR: make idler email content generic instead host-operator#1147

Ref: https://issues.redhat.com/browse/SANDBOX-967
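
For orientation, here is a minimal sketch of how a crashlooping pod can be detected from its container restart counts. The threshold of 50 comes from this PR's description; the helper name isCrashlooping and its standalone shape are illustrative assumptions, not the actual controller code.

import (
	corev1 "k8s.io/api/core/v1"
)

// restartThreshold is the restart count above which a pod is treated as crashlooping (assumed constant name).
const restartThreshold = 50

// isCrashlooping reports whether any container in the pod has restarted more than restartThreshold times.
func isCrashlooping(pod *corev1.Pod) bool {
	for _, status := range pod.Status.ContainerStatuses {
		if status.RestartCount > restartThreshold {
			return true
		}
	}
	return false
}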

@alexeykazakov (Contributor) left a comment:

Looks good overall!

@alexeykazakov (Contributor) left a comment:

Looks good! Just a couple of minor comments.

VMStopped(podsRunningForTooLong.virtualmachine).
VMRunning(podsTooEarlyToKill.virtualmachine).
VMRunning(noise.virtualmachine)

// Still tracking all pods. Even deleted ones.
// Only tracks pods that have not been processed in this reconcile

(Contributor) commented:

Not been processed, or rather not deleted yet?

(Contributor Author) commented:

Updated in 82f8118; I have also added a comment in the third reconcile loop where we track controlled pods again.


openshift-ci bot commented Feb 24, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexeykazakov, ranakan19


Needs approval from an approver in each of these files:
  • OWNERS [alexeykazakov,ranakan19]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rajivnathan (Contributor) left a comment:

Looks good! Just some minor suggestions

// if it is a standalone pod, delete it.
// Send a notification if the deleted pod was managed by a controller, was a standalone pod that was not completed, or was crashlooping
func deletePodsAndCreateNotification(ctx context.Context, podLogger logr.Logger, pod corev1.Pod, r *Reconciler, idler *toolchainv1alpha1.Idler) error {
lctx := log.IntoContext(ctx, podLogger)

(Contributor) commented:

This is needed only when you need to propagate some values from the logger to the context, but I don't see anything like that here.
If you want to propagate the values set in the logger earlier, then it should be done outside of this function.

(Contributor Author) commented:

lctx is used in the scaleControllerToZero function, which was called outside this function before. I also see that this change was purposely made in #495.
I've now passed two contexts to the function deletePodsAndCreateNotification in sync with the changes of the mentioned PR. Let me know what you think.

(Contributor Author) commented:

Two contexts passed in: 9164efe

RestartCountWithinThresholdContainer1 = 30
RestartCountWithinThresholdContainer2 = 24
RestartCountOverThreshold = 52
TestIdlerTimeOutSeconds = 540

(Contributor Author) commented:

I wanted to keep TestIdlerTimeOutSeconds different from RequeueTimeThresholdSeconds. Making it shorter caused the VMs to idle too fast and the assertions on the count of tracked pods to fail. Keeping the idler timeout greater preserves the existing test flow with the max timeout of 5m.


codecov bot commented Feb 26, 2025

Codecov Report

Attention: Patch coverage is 95.52239% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.71%. Comparing base (7ed2c18) to head (96d7089).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
controllers/idler/idler_controller.go 95.52% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #630      +/-   ##
==========================================
+ Coverage   81.64%   81.71%   +0.06%     
==========================================
  Files          29       29              
  Lines        3302     3330      +28     
==========================================
+ Hits         2696     2721      +25     
- Misses        457      459       +2     
- Partials      149      150       +1     
Files with missing lines Coverage Δ
controllers/idler/idler_controller.go 93.27% <95.52%> (-0.23%) ⬇️

Comment on lines 115 to 119
// Requeue in idler.Spec.TimeoutSeconds or RequeueTimeThresholdSeconds, whichever is sooner.
nextPodToKillAfter := nextPodToBeKilledAfter(logger, idler)
maxRequeDuration := time.Duration(RequeueTimeThresholdSeconds) * time.Second
idlerTimeoutDuration := time.Duration(idler.Spec.TimeoutSeconds) * time.Second
after := findShortestDuration(nextPodToKillAfter, &maxRequeDuration, &idlerTimeoutDuration)

(Contributor) commented:

I like the idea of the findShortestDuration function, but wouldn't it make more sense to move

nextPodToKillAfter := nextPodToBeKilledAfter(logger, idler)
maxRequeDuration := time.Duration(RequeueTimeThresholdSeconds) * time.Second
idlerTimeoutDuration := time.Duration(idler.Spec.TimeoutSeconds) * time.Second

into the findShortestDuration function? The variables don't seem to be used anywhere else in this method...

(Contributor Author) commented:

Makes sense, moved it into the function and renamed it to findShortestRequeueDuration in bfc3cda.
I also removed the log.Info in nextPodToKillAfter, so only the shortest requeue duration is logged in findShortestRequeueDuration.
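
For reference, a minimal sketch of what such a helper could look like after the refactoring. The constant RequeueTimeThresholdSeconds, the idler timeout field, and the nextPodToBeKilledAfter call are taken from the quoted snippet; the exact signature, the nil handling, and the log message are assumptions, not the merged implementation, and imports are omitted.

func findShortestRequeueDuration(logger logr.Logger, idler *toolchainv1alpha1.Idler) *time.Duration {
	// Candidate durations: the next pod to be killed, the global requeue threshold, and the idler's own timeout.
	nextPodToKillAfter := nextPodToBeKilledAfter(logger, idler)
	maxRequeueDuration := time.Duration(RequeueTimeThresholdSeconds) * time.Second
	idlerTimeoutDuration := time.Duration(idler.Spec.TimeoutSeconds) * time.Second

	shortest := &maxRequeueDuration
	if idlerTimeoutDuration < *shortest {
		shortest = &idlerTimeoutDuration
	}
	// nextPodToKillAfter may be nil when there is nothing to idle yet.
	if nextPodToKillAfter != nil && *nextPodToKillAfter < *shortest {
		shortest = nextPodToKillAfter
	}
	logger.Info("requeueing", "after", *shortest)
	return shortest
}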

Comment on lines 143 to 145
lctx := log.IntoContext(ctx, podLogger)
// Check if it belongs to a controller (Deployment, DeploymentConfig, etc) and scale it down to zero.
err := deletePodsAndCreateNotification(ctx, lctx, pod, r, idler)

(Contributor) commented:

Could you please clarify what the point is of using two types of context? Why not use just one?
Also, I would rename the context to make it clear that it's related to the pod.

Suggested change
lctx := log.IntoContext(ctx, podLogger)
// Check if it belongs to a controller (Deployment, DeploymentConfig, etc) and scale it down to zero.
err := deletePodsAndCreateNotification(ctx, lctx, pod, r, idler)
podCtx := log.IntoContext(ctx, podLogger)
// Check if it belongs to a controller (Deployment, DeploymentConfig, etc) and scale it down to zero.
err := deletePodsAndCreateNotification(podCtx, pod, r, idler)

@ranakan19 (Contributor Author) commented on Feb 26, 2025:

I tried answering it here - #630 (comment)
To add though, podCtx would have the pod_name and pod_phase, which is useful to have in the logs when scaling the controller and pods down, instead of passing it in the logs every time.
Below is a snippet of the logs with the normal ctx and podCtx. I'd keep the pod context - it's helpful to see which pod is referenced in the log itself.
I'll rename the variable to podCtx though to make it clearer - thanks!

2025-02-26T13:52:38-05:00	INFO	Scaling controller to zero	{"pod_name": "restartCount-alex-stage-pod-fail", "pod_phase": "", "name": "restartCount-alex-stage-pod-fail"}
2025-02-26T13:52:38-05:00	INFO	Deleting pod without controller
2025-02-26T13:52:38-05:00	INFO	Pod deleted

(Contributor Author) commented:

We could just use podCtx for the entire function, but it would be overkill IMO - for example, pod_name and pod_phase are not needed in the logs when creating a notification; it would be good to have, sure, but I think we can get by without it too.

(Contributor) commented:

I'm completely fine with using podCtx, which is also what I proposed, but I don't see any reason for passing and using the original ctx context - this is what I'm still missing in your explanation.

> We could just use podCtx for the entire function, but it would be overkill IMO - for example, pod_name and pod_phase are not needed in the logs when creating a notification; it would be good to have, sure, but I think we can get by without it too.

Overkill is using two types of contexts when they are not really needed for cancellation or anything like that.
This whole function deletePodsAndCreateNotification is executed in the context of a specific pod, so it's completely valid (and TBH also expected) to include the pod information in the logs. From your snippet, you can also see that the lines:

2025-02-26T13:52:38-05:00	INFO	Deleting pod without controller
2025-02-26T13:52:38-05:00	INFO	Pod deleted

are missing which pod they refer to, which is more of a bug than a feature.
Providing the context values is the best practice we follow everywhere else; just look at the controller context and the logger - it contains the request metadata of the object the reconcile was triggered for throughout the whole reconcile loop.
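
To make the pattern under discussion concrete, here is a minimal sketch of propagating pod fields through the context with controller-runtime's log package. The podCtx name mirrors the suggestion above; the handlePod/deletePod functions are illustrative stand-ins, not the repository's actual code.

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

func handlePod(ctx context.Context, pod corev1.Pod) {
	// Attach the pod metadata to the logger once, then carry it via the context.
	podLogger := log.FromContext(ctx).WithValues("pod_name", pod.Name, "pod_phase", pod.Status.Phase)
	podCtx := log.IntoContext(ctx, podLogger)
	deletePod(podCtx, pod)
}

func deletePod(ctx context.Context, pod corev1.Pod) {
	// Every log line taken from this context now includes pod_name and pod_phase.
	// The actual deletion is omitted from this sketch.
	log.FromContext(ctx).Info("Deleting pod without controller")
}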

Comment on lines 165 to 167

} else {
newStatusPods = append(newStatusPods, *trackedPod) // keep tracking

(Contributor) commented:

Only a small detail: another option (to keep it consistent with the previous section) would be

				continue
			}
			newStatusPods = append(newStatusPods, *trackedPod) // keep tracking

but feel free to keep it as it is

(Contributor Author) commented:

I've kept it as it is for now - making the change caused some test failures in the number of pods being tracked; I didn't debug further for now.

@alexeykazakov (Contributor) left a comment:

Looks good to me. Besides Matous's comments :)

@ranakan19 merged commit 3ef3b29 into codeready-toolchain:master on Feb 27, 2025
12 of 13 checks passed

@MatousJobanek (Contributor) left a comment:

Thanks for addressing my comments. Looks good 👍
