
SANDBOX-967: check restartCounts and idle crashlooping pods #630

Merged: 14 commits, Feb 27, 2025

Conversation

@ranakan19 (Contributor) commented on Feb 20, 2025:

This PR

  • introduces functionality to idle crashlooping pods if restartCount > 50 (see the sketch after this description)
  • changes the requeue time to 5 minutes when there is nothing to idle
  • sends an idler notification when a crashlooping pod is idled; the idler notification messaging needs to be changed to not mention 12 hrs - host PR: make idler email content generic instead host-operator#1147

Ref: https://issues.redhat.com/browse/SANDBOX-967
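
For orientation, here is a minimal sketch of how a crashlooping pod can be detected from its container restart counts. The threshold of 50 comes from this PR's description; the helper name isCrashlooping and its standalone shape are illustrative assumptions, not the actual controller code.

import (
	corev1 "k8s.io/api/core/v1"
)

// restartThreshold is the restart count above which a pod is treated as crashlooping (assumed constant name).
const restartThreshold = 50

// isCrashlooping reports whether any container in the pod has restarted more than restartThreshold times.
func isCrashlooping(pod *corev1.Pod) bool {
	for _, status := range pod.Status.ContainerStatuses {
		if status.RestartCount > restartThreshold {
			return true
		}
	}
	return false
}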

@alexeykazakov (Contributor) left a comment:

Looks good overall!

@alexeykazakov (Contributor) left a comment:

Looks good! Just a couple of minor comments.

VMStopped(podsRunningForTooLong.virtualmachine).
VMRunning(podsTooEarlyToKill.virtualmachine).
VMRunning(noise.virtualmachine)

// Still tracking all pods. Even deleted ones.
// Only tracks pods that have not been processed in this reconcile

(Contributor) commented:

Not been processed, or rather not deleted yet?

(Contributor Author) commented:

Updated in 82f8118; I have also added a comment in the third reconcile loop where we track controlled pods again.


openshift-ci bot commented Feb 24, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexeykazakov, ranakan19


Needs approval from an approver in each of these files:
  • OWNERS [alexeykazakov,ranakan19]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rajivnathan (Contributor) left a comment:

Looks good! Just some minor suggestions

// if it is a standalone pod, delete it.
// Send a notification if the deleted pod was managed by a controller, was a standalone pod that was not completed, or was crashlooping
func deletePodsAndCreateNotification(ctx context.Context, podLogger logr.Logger, pod corev1.Pod, r *Reconciler, idler *toolchainv1alpha1.Idler) error {
lctx := log.IntoContext(ctx, podLogger)

(Contributor) commented:

This is needed only when you need to propagate some values from the logger to the context, but I don't see anything like that here.
If you want to propagate the values set in the logger earlier, then it should be done outside of this function.

(Contributor Author) commented:

lctx is used in the scaleControllerToZero function, which was called outside this function before. I also see that this change was purposely made in #495.
I've now passed two contexts to the function deletePodsAndCreateNotification in sync with the changes of the mentioned PR. Let me know what you think.

(Contributor Author) commented:

Two contexts passed in: 9164efe

RestartCountWithinThresholdContainer1 = 30
RestartCountWithinThresholdContainer2 = 24
RestartCountOverThreshold = 52
TestIdlerTimeOutSeconds = 540

(Contributor Author) commented:

I wanted to keep TestIdlerTimeOutSeconds different from RequeueTimeThresholdSeconds. Making it shorter caused the VMs to idle too fast and the assertions on the count of tracked pods to fail. Keeping the idler timeout greater preserves the existing test flow with the max timeout of 5m.


codecov bot commented Feb 26, 2025

Codecov Report

Attention: Patch coverage is 95.52239% with 3 lines in your changes missing coverage. Please review.

Project coverage is 81.71%. Comparing base (7ed2c18) to head (96d7089).
Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
controllers/idler/idler_controller.go 95.52% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #630      +/-   ##
==========================================
+ Coverage   81.64%   81.71%   +0.06%     
==========================================
  Files          29       29              
  Lines        3302     3330      +28     
==========================================
+ Hits         2696     2721      +25     
- Misses        457      459       +2     
- Partials      149      150       +1     
Files with missing lines Coverage Δ
controllers/idler/idler_controller.go 93.27% <95.52%> (-0.23%) ⬇️

Comment on lines 115 to 119
// Requeue in idler.Spec.TimeoutSeconds or RequeueTimeThresholdSeconds, whichever is sooner.
nextPodToKillAfter := nextPodToBeKilledAfter(logger, idler)
maxRequeDuration := time.Duration(RequeueTimeThresholdSeconds) * time.Second
idlerTimeoutDuration := time.Duration(idler.Spec.TimeoutSeconds) * time.Second
after := findShortestDuration(nextPodToKillAfter, &maxRequeDuration, &idlerTimeoutDuration)

(Contributor) commented:

I like the idea of the findShortestDuration function, but wouldn't it make more sense to move

nextPodToKillAfter := nextPodToBeKilledAfter(logger, idler)
maxRequeDuration := time.Duration(RequeueTimeThresholdSeconds) * time.Second
idlerTimeoutDuration := time.Duration(idler.Spec.TimeoutSeconds) * time.Second

into the findShortestDuration function? The variables don't seem to be used anywhere else in this method...

(Contributor Author) commented:

Makes sense, moved it into the function and renamed it to findShortestRequeueDuration in bfc3cda.
I also removed the log.Info in nextPodToKillAfter, so only the shortest requeue duration is logged in findShortestRequeueDuration.
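
For reference, a minimal sketch of what such a helper could look like after the refactoring. The constant RequeueTimeThresholdSeconds, the idler timeout field, and the nextPodToBeKilledAfter call are taken from the quoted snippet; the exact signature, the nil handling, and the log message are assumptions, not the merged implementation, and imports are omitted.

func findShortestRequeueDuration(logger logr.Logger, idler *toolchainv1alpha1.Idler) *time.Duration {
	// Candidate durations: the next pod to be killed, the global requeue threshold, and the idler's own timeout.
	nextPodToKillAfter := nextPodToBeKilledAfter(logger, idler)
	maxRequeueDuration := time.Duration(RequeueTimeThresholdSeconds) * time.Second
	idlerTimeoutDuration := time.Duration(idler.Spec.TimeoutSeconds) * time.Second

	shortest := &maxRequeueDuration
	if idlerTimeoutDuration < *shortest {
		shortest = &idlerTimeoutDuration
	}
	// nextPodToKillAfter may be nil when there is nothing to idle yet.
	if nextPodToKillAfter != nil && *nextPodToKillAfter < *shortest {
		shortest = nextPodToKillAfter
	}
	logger.Info("requeueing", "after", *shortest)
	return shortest
}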

Comment on lines 143 to 145
lctx := log.IntoContext(ctx, podLogger)
// Check if it belongs to a controller (Deployment, DeploymentConfig, etc) and scale it down to zero.
err := deletePodsAndCreateNotification(ctx, lctx, pod, r, idler)

(Contributor) commented:

Could you please clarify what the point is of using two types of context? Why not use just one?
Also, I would rename the context to make it clear that it's related to the pod.

Suggested change
lctx := log.IntoContext(ctx, podLogger)
// Check if it belongs to a controller (Deployment, DeploymentConfig, etc) and scale it down to zero.
err := deletePodsAndCreateNotification(ctx, lctx, pod, r, idler)
podCtx := log.IntoContext(ctx, podLogger)
// Check if it belongs to a controller (Deployment, DeploymentConfig, etc) and scale it down to zero.
err := deletePodsAndCreateNotification(podCtx, pod, r, idler)

@ranakan19 (Contributor Author) commented on Feb 26, 2025:

I tried answering it here - #630 (comment)
To add though, podCtx would have the pod_name and pod_phase, which is useful to have in the logs when scaling the controller and pods down, instead of passing it in the logs every time.
Below is a snippet of the logs with the normal ctx and podCtx. I'd keep the pod context - it's helpful to see which pod is referenced in the log itself.
I'll rename the variable to podCtx though to make it clearer - thanks!

2025-02-26T13:52:38-05:00	INFO	Scaling controller to zero	{"pod_name": "restartCount-alex-stage-pod-fail", "pod_phase": "", "name": "restartCount-alex-stage-pod-fail"}
2025-02-26T13:52:38-05:00	INFO	Deleting pod without controller
2025-02-26T13:52:38-05:00	INFO	Pod deleted

(Contributor Author) commented:

We could just use podCtx for the entire function, but it would be overkill IMO - for example, pod_name and pod_phase are not needed in the logs when creating a notification; it would be good to have, sure, but I think we can get by without it too.

(Contributor) commented:

I'm completely fine with using podCtx, which is also what I proposed, but I don't see any reason for passing and using the original ctx context - this is what I'm still missing in your explanation.

> We could just use podCtx for the entire function, but it would be overkill IMO - for example, pod_name and pod_phase are not needed in the logs when creating a notification; it would be good to have, sure, but I think we can get by without it too.

Overkill is using two types of contexts when they are not really needed for cancellation or anything like that.
This whole function deletePodsAndCreateNotification is executed in the context of a specific pod, so it's completely valid (and TBH also expected) to include the pod information in the logs. From your snippet, you can also see that the lines:

2025-02-26T13:52:38-05:00	INFO	Deleting pod without controller
2025-02-26T13:52:38-05:00	INFO	Pod deleted

are missing which pod they refer to, which is more of a bug than a feature.
Providing the context values is the best practice we follow everywhere else; just look at the controller context and the logger - it contains the request metadata of the object the reconcile was triggered for throughout the whole reconcile loop.
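
To make the pattern under discussion concrete, here is a minimal sketch of propagating pod fields through the context with controller-runtime's log package. The podCtx name mirrors the suggestion above; the handlePod/deletePod functions are illustrative stand-ins, not the repository's actual code.

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

func handlePod(ctx context.Context, pod corev1.Pod) {
	// Attach the pod metadata to the logger once, then carry it via the context.
	podLogger := log.FromContext(ctx).WithValues("pod_name", pod.Name, "pod_phase", pod.Status.Phase)
	podCtx := log.IntoContext(ctx, podLogger)
	deletePod(podCtx, pod)
}

func deletePod(ctx context.Context, pod corev1.Pod) {
	// Every log line taken from this context now includes pod_name and pod_phase.
	// The actual deletion is omitted from this sketch.
	log.FromContext(ctx).Info("Deleting pod without controller")
}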

Comment on lines 165 to 167

} else {
newStatusPods = append(newStatusPods, *trackedPod) // keep tracking

(Contributor) commented:

Only a small detail: another option (to keep it consistent with the previous section) would be

				continue
			}
			newStatusPods = append(newStatusPods, *trackedPod) // keep tracking

but feel free to keep it as it is

(Contributor Author) commented:

I've kept it as it is for now - making the change caused some test failures in the number of pods being tracked; I didn't debug further for now.

@alexeykazakov (Contributor) left a comment:

Looks good to me. Besides Matous's comments :)

@ranakan19 merged commit 3ef3b29 into codeready-toolchain:master on Feb 27, 2025
12 of 13 checks passed

@MatousJobanek (Contributor) left a comment:

Thanks for addressing my comments. Looks good 👍
