-
Notifications
You must be signed in to change notification settings - Fork 40.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automated cherry pick of #120413: scheduler: fix tracking of concurrent events #120445
Automated cherry pick of #120413: scheduler: fix tracking of concurrent events #120445
Conversation
The previous approach was based on the assumption that an in-flight pod can use the head of the received event list as marker for identifying all events that occur while the pod is in flight. That assumption is incorrect: when that existing element gets removed from the list because all pods that were in-flight when it was received are done, that marker's Next method returns nil and the code which should have seen several concurrent events (if there were any) missed all of those. As a result, a pod with concurrent events could incorrectly get moved to the unschedulable queue where it could got stuck until the next periodic purging after 5 minutes if there was no other event for it. The approach with maintaining a single list of concurrent events can be fixed by inserting each in-flight pod into the list and using that element to identify "more recent" events for the pod.
The problematic scenario was having one pod in flight, one event in the list, and then detecting a concurrent event for a second pod after the first pod is done. The new test case covers that. To make it work without assumptions about the implementation, the QueuedPodInfo returned by Pop must be the one passed to AddUnschedulableIfNotPresent after (potentially) populating UnschedulablePlugins. This is done via callback functions which bind to the same shared variable.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
LGTM label has been added. Git tree hash: 850ab12d839d47b7f7c3cf2d4ed6169c84552796
|
/cc kubernetes/release-managers |
@pohly @alculquicondor Is this affecting only v1.28 or should this be cherry-picked to other supported release branches as well? |
Just 1.28 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For RelEng:
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, pohly, xmudrii The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest |
/retest The "OOM killer" flake strikes again... |
Cherry pick of #120413 on release-1.28.
#120413: scheduler: fix tracking of concurrent events
For details on the cherry pick process, see the cherry pick requests page.