SANDBOX-967: check restartCounts and idle crashlooping pods #630
Conversation
Looks good overall!
Looks good! Just a couple of minor comments.
VMStopped(podsRunningForTooLong.virtualmachine).
VMRunning(podsTooEarlyToKill.virtualmachine).
VMRunning(noise.virtualmachine)

// Still tracking all pods. Even deleted ones.
// Only tracks pods that have not been processed in this reconcile
Not been processed, or rather not deleted yet?
Updated in 82f8118; I have also added a comment in the third reconcile loop where we track controlledpods again.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alexeykazakov, ranakan19
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Looks good! Just some minor suggestions
// if it is a standalone pod, delete it.
// Send notification if the deleted pod was managed by a controller, was a standalone pod that was not completed or was crashlooping
func deletePodsAndCreateNotification(ctx context.Context, podLogger logr.Logger, pod corev1.Pod, r *Reconciler, idler *toolchainv1alpha1.Idler) error {
	lctx := log.IntoContext(ctx, podLogger)
This is needed only when you need to propagate some values from the logger to the context, but I don't see anything like that here.
If you want to propagate the values set in the logger before, then it should be done outside of this function.
lctx is used in the scaleControllerToZero function. It was called outside this function before. I also see that this change was purposely made in #495.
I've now passed two contexts to the function deletePodsAndCreateNotification, in sync with the changes of the mentioned PR. Let me know what you think.
two contexts passed in - 9164efe
RestartCountWithinThresholdContainer1 = 30
RestartCountWithinThresholdContainer2 = 24
RestartCountOverThreshold = 52
TestIdlerTimeOutSeconds = 540
I wanted to keep the TestIdlerTimeOutSeconds different from RequeueTimeThresholdSeconds. Making it shorter was making the VMs idle too fast and the assertions on the count of pods being tracked fail. Keeping the idler timeout greater keeps the existing flow of tests with the max timeout of 5m.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
##           master     #630      +/-   ##
==========================================
+ Coverage   81.64%   81.71%   +0.06%
==========================================
  Files          29       29
  Lines        3302     3330      +28
==========================================
+ Hits         2696     2721      +25
- Misses        457      459       +2
- Partials      149      150       +1
// Requeue in idler.Spec.TimeoutSeconds or RequeueTimeThresholdSeconds, whichever is sooner.
nextPodToKillAfter := nextPodToBeKilledAfter(logger, idler)
maxRequeDuration := time.Duration(RequeueTimeThresholdSeconds) * time.Second
idlerTimeoutDuration := time.Duration(idler.Spec.TimeoutSeconds) * time.Second
after := findShortestDuration(nextPodToKillAfter, &maxRequeDuration, &idlerTimeoutDuration)
I like the idea of the findShortestDuration function, but wouldn't it make more sense to move

nextPodToKillAfter := nextPodToBeKilledAfter(logger, idler)
maxRequeDuration := time.Duration(RequeueTimeThresholdSeconds) * time.Second
idlerTimeoutDuration := time.Duration(idler.Spec.TimeoutSeconds) * time.Second

into the findShortestDuration function? The variables don't seem to be used anywhere else in this method...
Makes sense, moved it into the function and renamed it to findShortestRequeueDuration in bfc3cda.
Also removed the log.Info in nextPodToKillAfter - thus only the shortest requeue duration is logged, in findShortestRequeueDuration.
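For illustration only, below is a minimal, self-contained sketch of the shortest-requeue-duration idea being discussed; the standalone signature, the parameter names and the main function are assumptions made for the example, not the actual helper from bfc3cda.

package main

import (
	"fmt"
	"time"
)

// findShortestRequeueDuration picks the shortest non-nil duration among the
// next-pod-to-kill delay, the requeue threshold and the idler timeout.
// Hypothetical standalone form, for illustration only.
func findShortestRequeueDuration(nextPodToKillAfter *time.Duration, requeueThresholdSeconds, idlerTimeoutSeconds int32) time.Duration {
	maxRequeueDuration := time.Duration(requeueThresholdSeconds) * time.Second
	idlerTimeoutDuration := time.Duration(idlerTimeoutSeconds) * time.Second

	shortest := maxRequeueDuration
	for _, candidate := range []*time.Duration{nextPodToKillAfter, &idlerTimeoutDuration} {
		if candidate != nil && *candidate < shortest {
			shortest = *candidate
		}
	}
	return shortest
}

func main() {
	next := 90 * time.Second
	// 90s beats the 300s requeue threshold and the 540s idler timeout.
	fmt.Println(findShortestRequeueDuration(&next, 300, 540)) // prints 1m30s
}

With nextPodToKillAfter set to nil, the sketch falls back to the shorter of the requeue threshold and the idler timeout, matching the "whichever is sooner" comment in the diff above.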
lctx := log.IntoContext(ctx, podLogger)
// Check if it belongs to a controller (Deployment, DeploymentConfig, etc) and scale it down to zero.
err := deletePodsAndCreateNotification(ctx, lctx, pod, r, idler)
Could you please clarify what is the point of using two types of context? Why not use one?
Also, I would rename the context to make it clear that it's related to the pod:
lctx := log.IntoContext(ctx, podLogger)
// Check if it belongs to a controller (Deployment, DeploymentConfig, etc) and scale it down to zero.
err := deletePodsAndCreateNotification(ctx, lctx, pod, r, idler)

Suggested change:

podCtx := log.IntoContext(ctx, podLogger)
// Check if it belongs to a controller (Deployment, DeploymentConfig, etc) and scale it down to zero.
err := deletePodsAndCreateNotification(podCtx, pod, r, idler)
I tried answering it here - #630 (comment).
To add though, podCtx would have the pod_name and pod_phase, which is useful to have in the logs when scaling the controller and pods down, instead of passing it in the logs every time.
Below is a snippet of logs with the normal ctx and with podCtx. I'd keep the pod context - it's helpful to see which pod is referenced in the log itself.
I'll rename the variable to podCtx though to make it clearer - thanks!
2025-02-26T13:52:38-05:00 INFO Scaling controller to zero {"pod_name": "restartCount-alex-stage-pod-fail", "pod_phase": "", "name": "restartCount-alex-stage-pod-fail"}
2025-02-26T13:52:38-05:00 INFO Deleting pod without controller
2025-02-26T13:52:38-05:00 INFO Pod deleted
We could just use podCtx for the entire function, but it would be overkill imo - for example, pod_name and pod_phase are not needed in the logs when creating a notification; it'd be good to have, sure, but I think we can get by without it too.
I'm completely fine with using podCtx, which is also what I proposed, but I don't see any reason for passing and using the original ctx context - this is what I'm still missing in your explanation.

"We could just use podCtx for the entire function, but it would be overkill imo - for example, pod_name and pod_phase are not needed in the logs when creating a notification; it'd be good to have, sure, but I think we can get by without it too."

Overkill is using two types of contexts when they are not really needed for cancellation reasons or anything like that.
The whole function deletePodsAndCreateNotification is executed in the context of a specific pod, so it's completely valid (and TBH also expected) to include the pod information in the logs. From your snippet, you can also see that the lines:
2025-02-26T13:52:38-05:00 INFO Deleting pod without controller
2025-02-26T13:52:38-05:00 INFO Pod deleted
are missing which pod they refer to, which is more a bug than a feature.
Providing the context values is the best practice we follow everywhere else - just look at the controller context and the logger: it contains the request metadata of the object the reconcile was triggered for, throughout the whole reconcile loop.
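To make the propagation concrete, here is a small, self-contained sketch of the podCtx approach being discussed (assuming controller-runtime's log package; deletePodSketch and the example pod name are illustrative, not the real function):

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	log.SetLogger(zap.New())
	ctx := context.Background()

	pod := corev1.Pod{}
	pod.Name = "restartCount-alex-stage-pod-fail"

	// Attach the pod metadata to the logger once and embed it in the context...
	podLogger := log.FromContext(ctx).WithValues("pod_name", pod.Name, "pod_phase", pod.Status.Phase)
	podCtx := log.IntoContext(ctx, podLogger)

	deletePodSketch(podCtx, pod)
}

// ...so every log line emitted by the callee automatically carries pod_name and
// pod_phase, without passing them explicitly on each call (illustrative helper).
func deletePodSketch(ctx context.Context, _ corev1.Pod) {
	logger := log.FromContext(ctx)
	logger.Info("Deleting pod without controller")
	logger.Info("Pod deleted")
}

With this shape, both "Deleting pod without controller" and "Pod deleted" carry the pod fields, which addresses the missing pod reference in the log snippet above.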
} else {
	newStatusPods = append(newStatusPods, *trackedPod) // keep tracking
Only a small detail: another option (to keep it consistent with the previous section) would be

	continue
}
newStatusPods = append(newStatusPods, *trackedPod) // keep tracking

but feel free to keep it as it is.
I've kept it as it is for now - making the change caused some test failures in the number of pods being tracked; I didn't debug further for now.
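For reference, a tiny self-contained sketch of the continue-based shape suggested above; trackedPod and the deleted flag are hypothetical stand-ins for the real tracked-pod type and skip condition:

package main

import "fmt"

// trackedPod is a hypothetical stand-in for the pods tracked in the idler status.
type trackedPod struct {
	name    string
	deleted bool
}

// keepTracking handles the "stop tracking" case first and continues, so the
// keep-tracking append does not need to sit in an else branch.
func keepTracking(pods []trackedPod) []trackedPod {
	var newStatusPods []trackedPod
	for _, p := range pods {
		if p.deleted {
			continue // stop tracking pods that are gone
		}
		newStatusPods = append(newStatusPods, p) // keep tracking
	}
	return newStatusPods
}

func main() {
	pods := []trackedPod{{name: "a", deleted: true}, {name: "b"}}
	fmt.Println(keepTracking(pods)) // [{b false}]
}

Behaviour-wise the two shapes are equivalent in this isolated loop; the actual reconcile loop has more surrounding logic, which is presumably where the test failures mentioned above came from.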
Looks good to me. Besides Matou's comments :)
Thanks for addressing my comments. Looks good 👍
This PR - wrt https://issues.redhat.com/browse/SANDBOX-967