Dynamically reclaiming resources #756

Merged
merged 7 commits into from
May 12, 2023

trasc
Contributor

@trasc trasc commented May 9, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add support for dynamically reclaiming resources.

Which issue(s) this PR fixes:

Fixes #78

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Add support for dynamically reclaiming resources.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels May 9, 2023
@k8s-ci-robot k8s-ci-robot requested review from kerthcet and tenzen-y May 9, 2023 10:17
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 9, 2023
@netlify

netlify bot commented May 9, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit cf30176
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/645e7cf2febc69000831fabc

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 9, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 9, 2023
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 9, 2023
@trasc
Contributor Author

trasc commented May 9, 2023

/cc @alculquicondor

@trasc trasc mentioned this pull request May 10, 2023
3 tasks
@trasc trasc force-pushed the dyn-reclaim branch 2 times, most recently from de1ca9e to 5f30896 Compare May 11, 2023 07:00
// +optional
// +listType=map
// +listMapKey=name
ReclaimablePods []ReclaimablePod `json:"reclaimablePods,omitempty"`
Contributor

I'm having second thoughts here.

If we assume that the admission section can be modified after admission (to update Count and usage), then I prefer we have reclaimableCount inside the admission struct.

I think this might be a better idea as we move towards elastic jobs. @tenzen-y ?

Member

> then I prefer we have reclaimableCount inside the admission struct.

Does this mean we would have reclaimableCount inside the admission instead of ReclaimablePods (that is, removing ReclaimablePods)?

Contributor

yes, that's what I'm thinking.

Member

IIRC, we selected ReclaimablePods instead of reclaimableCount because reclaimableCount could be misleading when the job is suspended.

#742 (comment)

However, as @alculquicondor says, we can ignore that concern if we have reclaimableCount inside the Admission.

> I think this might be a better idea as we move towards elastic jobs.

Right. Having reclaimableCount inside the Admission would be more natural for elastic jobs.

@trasc Are you concerned about having reclaimableCount inside the Admission?

Contributor Author

> If we assume that the admission section can be modified after admission (to update Count and usage), then I prefer we have reclaimableCount inside the admission struct.

The admission is not changed when the reclaimable pods change.

> @trasc Are you concerned about having reclaimableCount inside the Admission?

Yes, because of SSA (server-side apply) conflicts. We should keep them separate.

Contributor

Let's say we have elastic jobs. Then the kueue scheduler+preemption could update count.

However, this is generally independent from the concept of reclaiming pods.

Ok, let's keep it like this.

Any reason not to make this a map[string]int32?

Contributor Author

Just to look similar to flavors and usage.
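
For readers comparing the two shapes discussed in this thread, here is a minimal sketch of a name-keyed list next to the suggested map[string]int32. The ReclaimablePod type below is a local stand-in: the Name field follows from the +listMapKey=name marker, while the Count field and both helper functions are assumptions made only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// ReclaimablePod is a local stand-in for the API type in this PR.
// Name follows from +listMapKey=name; Count is an assumed field.
type ReclaimablePod struct {
	Name  string
	Count int32
}

// toMap (hypothetical helper) converts the list form into the
// map[string]int32 form suggested in the review.
func toMap(pods []ReclaimablePod) map[string]int32 {
	m := make(map[string]int32, len(pods))
	for _, p := range pods {
		m[p.Name] = p.Count
	}
	return m
}

// toList (hypothetical helper) converts the map form back into a list,
// sorted by name so the result is deterministic.
func toList(m map[string]int32) []ReclaimablePod {
	pods := make([]ReclaimablePod, 0, len(m))
	for name, count := range m {
		pods = append(pods, ReclaimablePod{Name: name, Count: count})
	}
	sort.Slice(pods, func(i, j int) bool { return pods[i].Name < pods[j].Name })
	return pods
}

func main() {
	pods := []ReclaimablePod{{Name: "workers", Count: 3}, {Name: "driver", Count: 1}}
	m := toMap(pods)
	fmt.Println(m["workers"], m["driver"])
	fmt.Println(toList(m))
}
```

Both forms carry the same information; a list with a map key keeps the entries as structs, which leaves room for extra per-entry fields later.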

Contributor

@alculquicondor alculquicondor left a comment

A few nits

@@ -288,3 +341,30 @@ func GetQueueOrderTimestamp(w *kueue.Workload) *metav1.Time {
func IsAdmitted(w *kueue.Workload) bool {
	return apimeta.IsStatusConditionTrue(w.Status.Conditions, kueue.WorkloadAdmitted)
}

// UpdateReclaimablePods updates the ReclaimablePods list for the workload with SSA.
func UpdateReclaimablePods(ctx context.Context, c client.Client, w *kueue.Workload, reclaimablePods []kueue.ReclaimablePod) error {
Contributor

Suggested change
- func UpdateReclaimablePods(ctx context.Context, c client.Client, w *kueue.Workload, reclaimablePods []kueue.ReclaimablePod) error {
+ func ApplyReclaimablePods(ctx context.Context, c client.Client, w *kueue.Workload, reclaimablePods []kueue.ReclaimablePod) error {

Contributor Author

Let's keep "Apply" for when we only push parts of a workload to the API server.

Contributor

Which this does?

Contributor Author

@trasc trasc May 12, 2023

Here reclaimablePods []kueue.ReclaimablePod is not part of the workload being passed in; w *kueue.Workload is only used to identify the target.

Compare with func ApplyAdmissionStatus(ctx context.Context, c client.Client, w *kueue.Workload, strict bool) error, where w itself carries the content to apply.

Contributor

I want to be able to quickly identify which function is using Update (PUT), versus SSA (PATCH). For that reason, I find Apply very obvious.

But we can clean this up later.
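
To illustrate the distinction this thread draws between Update (PUT) and server-side apply (PATCH), here is a heavily simplified toy model in plain Go. It does not use the Kubernetes API or controller-runtime; the object type, its methods, the field paths, and the manager names are all hypothetical. Real server-side apply also tracks shared ownership and removes fields a manager stops applying, which this sketch leaves out.

```go
package main

import "fmt"

// object is a toy stand-in for an API object: flat field path -> value,
// plus the field manager that last wrote each field.
type object struct {
	fields map[string]string
	owners map[string]string
}

func newObject() *object {
	return &object{fields: map[string]string{}, owners: map[string]string{}}
}

// update models a PUT: the caller's full copy replaces the stored object.
func (o *object) update(full map[string]string, manager string) {
	o.fields = map[string]string{}
	o.owners = map[string]string{}
	for k, v := range full {
		o.fields[k] = v
		o.owners[k] = manager
	}
}

// apply models an SSA PATCH: only the fields in patch are touched.
// Changing a field owned by another manager is a conflict unless force is set.
func (o *object) apply(patch map[string]string, manager string, force bool) error {
	for k, v := range patch {
		owner, owned := o.owners[k]
		if owned && owner != manager && o.fields[k] != v && !force {
			return fmt.Errorf("conflict: field %q is owned by %q", k, owner)
		}
	}
	for k, v := range patch {
		o.fields[k] = v
		o.owners[k] = manager
	}
	return nil
}

func main() {
	wl := newObject()
	// Two hypothetical managers own disjoint parts of the status.
	fmt.Println(wl.apply(map[string]string{"status.admission": "cq-a"}, "admission-manager", false))
	fmt.Println(wl.apply(map[string]string{"status.reclaimablePods": "workers=3"}, "reclaim-manager", false))
	// Touching a field owned by the other manager is reported as a conflict.
	fmt.Println(wl.apply(map[string]string{"status.admission": "cq-b"}, "reclaim-manager", false))
}
```

In this model, keeping the reclaimable pods in their own field with their own manager means the two writers never touch the same path, which is the separation argued for earlier in the review.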

r[name] /= f
}
}

// UpdateStatus updates the condition of a workload with SSA,
// fieldManager being set to managerPrefix + "-" + conditionType
func UpdateStatus(ctx context.Context,
Contributor

Can you rename to ApplyStatus?

Contributor Author

Let's keep "Apply" for when we only push parts of a workload to the API server.
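
The doc comment on UpdateStatus above describes the field manager name as managerPrefix + "-" + conditionType. A minimal sketch of that naming rule follows; the helper name fieldManagerFor and the example values "kueue" and "Admitted" are assumptions for illustration, not taken from the PR.

```go
package main

import "fmt"

// fieldManagerFor (hypothetical helper) builds a per-condition field
// manager name as managerPrefix + "-" + conditionType.
func fieldManagerFor(managerPrefix, conditionType string) string {
	return managerPrefix + "-" + conditionType
}

func main() {
	// Example values are placeholders chosen for this sketch.
	fmt.Println(fieldManagerFor("kueue", "Admitted"))
}
```

Using a distinct manager name per condition type gives each condition its own owner under server-side apply.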

gomega.Eventually(func() []kueue.ReclaimablePod {
gomega.Expect(k8sClient.Get(ctx, wlKey, wl)).Should(gomega.Succeed())
return wl.Status.ReclaimablePods

Member

Suggested change

Reverting this empty line would be good.

@tenzen-y
Member

Totally LGTM.

Contributor

@alculquicondor alculquicondor left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 12, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2023
@k8s-ci-robot k8s-ci-robot merged commit 60cec2f into kubernetes-sigs:main May 12, 2023
@trasc trasc deleted the dyn-reclaim branch May 15, 2023 05:42