Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-5007: DRA Device Binding Conditions #5012

Merged
merged 29 commits into from
Feb 12, 2025

Conversation

KobayashiD27
Copy link
Contributor

  • One-line enhancement description: Some network- or fabric-attached devices need to be attached to a node before a pod using them can be scheduled.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Dec 20, 2024
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 20, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @KobayashiD27. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 20, 2024
@KobayashiD27
Copy link
Contributor Author

@pohly
I have drafted the KEP. I would appreciate it if we could discuss it further.

@KobayashiD27 KobayashiD27 marked this pull request as draft December 20, 2024 07:54
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 20, 2024
@KobayashiD27 KobayashiD27 marked this pull request as ready for review December 22, 2024 23:37
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 22, 2024
@KobayashiD27
Copy link
Contributor Author

Hi @pohly,
When you have a moment, could you please take a look at my proposal?

CC: @towca @hase1128
Since this is related to CA and DRA, I would appreciate it if you could also review it when you have the chance.

Thank you!

@pohly
Copy link
Contributor

pohly commented Jan 10, 2025

/ok-to-test

I'm a bit time-constrained right now. Not sure whether I can review for 1.33.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 10, 2025
@wojtek-t
Copy link
Member

@x13n - FYI

@hase1128
Copy link
Contributor

@pohly
I understand your situation. Is there anyone else who might review it?
Should we explain KEP at the regular meeting of wg-device-management?

@pohly
Copy link
Contributor

pohly commented Jan 17, 2025

Discussing it in the meeting would be a good first step to get more people on board.

@zvonkok
Copy link

zvonkok commented Jan 19, 2025

/cc @zvonkok

@k8s-ci-robot k8s-ci-robot requested a review from zvonkok January 19, 2025 17:28
@KobayashiD27
Copy link
Contributor Author

I have implemented an experimental version of this DRA scheduler feature. I hope it will be useful for your review.

KobayashiD27/kubernetes@74e2ab9

@KobayashiD27 KobayashiD27 changed the title KEP-5007: DRA Device Attach Before Pod Scheduled KEP-5007: DRA: Device Binding Conditions Feb 12, 2025
@KobayashiD27 KobayashiD27 changed the title KEP-5007: DRA: Device Binding Conditions KEP-5007: DRA Device Binding Conditions Feb 12, 2025
@KobayashiD27
Copy link
Contributor Author

@dom4ha We modified the KEP to pass the pull-enhancements-verify. So would you please add lgtm again?
@alculquicondor SIG-Scheduling reviewers and DRA maintainer consider this KEP should be allowed to proceed as an alpha, so please add approve label.

@hase1128 there are still some unresolved comments, please resolve those issues first. For example, the naming and types of some fields, and the need to copy over all the "as-was" fields.

I would also like to retitle this KEP to represent more succinctly what it does. What do you think about something like "DRA: Device Binding Conditions"? That is more accurate, rather than naming it after the specific use case.

@johnbelamaric
Thank you for pointing that out. It appears I had a misunderstanding regarding the usage for "Gates" and "Condition." I have revised the document to reflect the reviews.

Copy link
Contributor

@pohly pohly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API review: some small gaps in the mechanics (+listType) and suggestions to make the API description a bit more complete.

UsageRestrictedToNode bool

// BindingTimeout indicates the prepare timeout period.
// If the timeout period is exceeded before all BindingConditions reach a True state, the scheduler clears the allocation in the ResourceClaim and reschedules the Pod.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If the timeout period is exceeded before all BindingConditions reach a True state, the scheduler clears the allocation in the ResourceClaim and reschedules the Pod.
// If the timeout period is exceeded before all BindingConditions reach a True state, the scheduler clears the allocation in the ResourceClaim and reschedules the Pod.
//
// The default timeout if not set is 10 minutes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could argue that this should be set via defaulting, but then we get that value set also when there aren't any binding conditions, or need to be very (too!) clever with the defaulting.

// BindingTimeout indicates the prepare timeout period.
// If the timeout period is exceeded before all BindingConditions reach a True state, the scheduler clears the allocation in the ResourceClaim and reschedules the Pod.
//
// +optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// +optional
// The default timeout if not set is 10 minutes.
//
// +optional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your review. I have reflected all of your suggestions and comments.

@dom4ha
Copy link
Member

dom4ha commented Feb 12, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2025
@sanposhiho
Copy link
Member

sanposhiho commented Feb 12, 2025

/lgtm
for alpha, per the discussion. We need more in-depth discussion later, especially when it's proceeding for beta.

(FYI, I'll be taking two days off from tomorrow, so feel free to DM me on Slack, or just skip me if that's something not critical. I've joined only a few discussion here in this PR, so likely there would be nothing such though)

@pohly
Copy link
Contributor

pohly commented Feb 12, 2025

/assign @alculquicondor

For SIG Scheduling approval. For reviews, see #5012 (comment) (@sanposhiho), #5012 (comment) (@dom4ha) and #5012 (comment) (summary by @dom4ha).

PRR is done and only needs final approval, see #5012 (comment).

Copy link
Member

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
for SIG Scheduling, based on the discussions during the last meeting.

@hase1128
Copy link
Contributor

@alculquicondor
Thank you for your approval.

@johnbelamaric
Please add approval label as PRR approver.

@johnbelamaric
Copy link
Member

/approve
/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, johnbelamaric, KobayashiD27, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 12, 2025
@k8s-ci-robot k8s-ci-robot merged commit 66cb318 into kubernetes:master Feb 12, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 12, 2025
@KobayashiD27 KobayashiD27 deleted the dra-attach-device-to-node branch February 13, 2025 00:11
@hase1128
Copy link
Contributor

@johnbelamaric @pohly @klueska

I too am wary of this. I would like to think about if there is a better way. But all this KEP actually contains right now is the ability to gate binding on conditions. The modification of ResourceSlices as described here is not actually part of the KEP; it's part of the specific implementation of the driver.

So, I think we can proceed with an alpha of the condition-based gating of binding.

I will create a separate issue about this(updating ResourceSlices). Please let me discuss it there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.