Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nrt: log: introduce and use "generation" for cache #798

Merged
merged 1 commit into from
Sep 11, 2024

Conversation

ffromani
Copy link
Contributor

@ffromani ffromani commented Sep 4, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Improves debuggability of the overreserve cache

We introduce the concept of "generation" which is an opaque monotonically increasing integer similar in spirit to the resourceVersion kube API field.
Every time the internal state of the cache is updated, which happens only in the resync loop by design, we increment the generation.

GetCachedNRTCopy will also return the generation of the data being used, so we have now an uniform way to correlate readers and writer of the cache, and we gain better visibility of the data being used.

Which issue(s) this PR fixes:

Fixes N/A

Special notes for your reviewer:

In a nutshell, this change is to enable better logging and to make it easier to correlate users of the cache (filter/score) with the reconciliation loop, so it's easier to infer which data was used/when was updated.

NONE

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 4, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ffromani

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 4, 2024
Copy link

netlify bot commented Sep 4, 2024

Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.

Name Link
🔨 Latest commit 0dae3ec
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-scheduler-plugins/deploys/66deee147dc40b0008909ce4

@ffromani
Copy link
Contributor Author

ffromani commented Sep 4, 2024

/test all

@ffromani
Copy link
Contributor Author

ffromani commented Sep 4, 2024

/test all

@ffromani ffromani changed the title WIP: nrt: log: introduce and use "generation" for cache nrt: log: introduce and use "generation" for cache Sep 4, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 4, 2024
@ffromani
Copy link
Contributor Author

ffromani commented Sep 4, 2024

/test all

@ffromani
Copy link
Contributor Author

ffromani commented Sep 8, 2024

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 8, 2024
@ffromani ffromani marked this pull request as ready for review September 8, 2024 15:11
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 8, 2024
@ffromani
Copy link
Contributor Author

ffromani commented Sep 8, 2024

Howdy @Tal-or @PiotrProkop could you PTAL? Changes to discardreserved are trivial and minimal, but still I'd like a review

@@ -33,6 +31,7 @@ const (
KeyFlow string = "flow"
KeyContainer string = "container"
KeyContainerKind string = "kind"
KeyGeneration string = "gen"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or generation? size of log entries is a concern too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer readability over compactness

Copy link
Contributor Author

@ffromani ffromani Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair enough, "generation" it is

Copy link
Contributor

@PiotrProkop PiotrProkop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some small nits, overall looks good!

@@ -97,30 +98,33 @@ func NewOverReserve(ctx context.Context, lh logr.Logger, cfg *apiconfig.NodeReso
return obj, nil
}

func (ov *OverReserve) GetCachedNRTCopy(ctx context.Context, nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, bool) {
func (ov *OverReserve) GetCachedNRTCopy(ctx context.Context, nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, CachedNRTInfo) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (ov *OverReserve) GetCachedNRTCopy(ctx context.Context, nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, CachedNRTInfo) {
func (ov *OverReserve) GetCachedNRTCopy(ctx context.Context, nodeName string, pod *corev1.Pod) (nrt *topologyv1alpha2.NodeResourceTopology, info CachedNRTInfo) {

and then just change all return to return nrt, info ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of the named returns but for silly reasons, but this case seems interesting I'll check. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, tried out. Looks very nice in the overreserved impl, but doesn't look so great in the discardreserved and passthrough impl, IMO leading to a slightly more convoluted code than we have now. I value the fact implementation across the implementations is as consistent as could be, so I think overall I'd like more the current approach and not using the named parameters just yet.

pkg/noderesourcetopology/cache/cache.go Outdated Show resolved Hide resolved
@ffromani ffromani force-pushed the nrt-log-generation branch 2 times, most recently from 454246b to 5bd1762 Compare September 9, 2024 11:50
In order to improve the debuggability of the overreserve cache, we
would like to
1. correlate the cache state being used with
2. the actions the resync loop is doing
3. infer in a easier way the current state of the cache

This change aims to improve points 1 and 2, while also trying to make
3 easier in the future.

We introduce the concept of "generation" which is an opaque
monotonically increasing integer similar in spirit to the
`resourceVersion` kube API field.
Every time the internal state of the cache is updated, which happens
only in the resync loop by design, we increment the generation.

GetCachedNRTCopy will also return the generation of the data
being used, so we have now an uniform way to correlate readers
and writer of the cache, and we gain better visibility of the data
being used.

With verbose enough logging, using the generation is now easier
(albeit admittedly still clunky) to reconstruct the chain of changes
which lead to a given cache state, which was much harder previously.
Similarly, there's now a clear way to learn which cache state was
used to make a given scheduling decision, which was much harder before.

The changes involve mostly logging; to avoid proliferation of return
values, however, a trivial refactoring is done in `GetCachedNRTCopy`.
A beneficial side effect is much improved documentation of the
return values.

Signed-off-by: Francesco Romani <[email protected]>
@PiotrProkop
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2024
@ffromani
Copy link
Contributor Author

@Tal-or please un-hold if you like the PR

@Tal-or
Copy link
Contributor

Tal-or commented Sep 11, 2024

/hold cancel
Thanks LGTM

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 11, 2024
@k8s-ci-robot k8s-ci-robot merged commit 7a836bc into kubernetes-sigs:master Sep 11, 2024
10 checks passed
@ffromani ffromani deleted the nrt-log-generation branch September 11, 2024 14:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants