nrt: log: introduce and use "generation" for cache #798
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ffromani. The full list of commands accepted by this bot can be found here. The pull request process is described here.
✅ Deploy Preview for kubernetes-sigs-scheduler-plugins canceled.
/test all
Force-pushed from ab78a02 to 344741b
/test all
Force-pushed from 344741b to 6de85d7
/test all
/hold
Howdy @Tal-or @PiotrProkop, could you PTAL? Changes to …
@@ -33,6 +31,7 @@ const (
 	KeyFlow          string = "flow"
 	KeyContainer     string = "container"
 	KeyContainerKind string = "kind"
+	KeyGeneration    string = "gen"
Or "generation"? Size of log entries is a concern too.
I would prefer readability over compactness
fair enough, "generation" it is
Some small nits; overall looks good!
@@ -97,30 +98,33 @@ func NewOverReserve(ctx context.Context, lh logr.Logger, cfg *apiconfig.NodeReso
 	return obj, nil
 }

-func (ov *OverReserve) GetCachedNRTCopy(ctx context.Context, nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, bool) {
+func (ov *OverReserve) GetCachedNRTCopy(ctx context.Context, nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, CachedNRTInfo) {
Suggested change:
-func (ov *OverReserve) GetCachedNRTCopy(ctx context.Context, nodeName string, pod *corev1.Pod) (*topologyv1alpha2.NodeResourceTopology, CachedNRTInfo) {
+func (ov *OverReserve) GetCachedNRTCopy(ctx context.Context, nodeName string, pod *corev1.Pod) (nrt *topologyv1alpha2.NodeResourceTopology, info CachedNRTInfo) {

and then just change all returns to `return nrt, info`?
I'm not a fan of named returns, for admittedly silly reasons, but this case seems interesting. I'll check, thanks!
OK, I tried it out. It looks very nice in the overreserved impl, but not so great in the discardreserved and passthrough impls; IMO it leads to slightly more convoluted code than we have now. I value the implementations being as consistent with each other as they can be, so overall I prefer the current approach and would rather not use named return parameters just yet.
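To make the trade-off discussed above concrete, here is a neutral, self-contained sketch of the two styles; all names here (lookupExplicit, lookupNamed, the info struct) are illustrative, not taken from the plugin code:

```go
package main

import "fmt"

// info is a stand-in for the cache lookup metadata; the names in this
// sketch are made up for illustration, not the actual plugin types.
type info struct {
	Fresh      bool
	Generation uint64
}

// lookupExplicit uses plain (unnamed) return values: every return site
// must spell out both values, which keeps each exit path explicit.
func lookupExplicit(ok bool) (string, info) {
	if !ok {
		return "", info{}
	}
	return "nrt-data", info{Fresh: true, Generation: 7}
}

// lookupNamed uses named return values: a bare "return" implicitly
// returns the current nrt/inf values, which reads nicely when the body
// fills them in gradually, but hides what each exit actually returns.
func lookupNamed(ok bool) (nrt string, inf info) {
	if !ok {
		return // zero values: "", info{}
	}
	nrt = "nrt-data"
	inf = info{Fresh: true, Generation: 7}
	return
}

func main() {
	a, ai := lookupExplicit(true)
	b, bi := lookupNamed(true)
	fmt.Println(a == b, ai == bi) // the two styles behave identically
}
```

The behavior is identical either way; the disagreement above is purely about readability and consistency across the cache implementations.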
Force-pushed from 454246b to 5bd1762
In order to improve the debuggability of the overreserve cache, we would like to:
1. correlate the cache state being used with
2. the actions the resync loop is doing
3. infer in an easier way the current state of the cache

This change aims to improve points 1 and 2, while also trying to make 3 easier in the future.

We introduce the concept of "generation": an opaque, monotonically increasing integer similar in spirit to the `resourceVersion` kube API field. Every time the internal state of the cache is updated, which by design happens only in the resync loop, we increment the generation.

GetCachedNRTCopy will also return the generation of the data being used, so we now have a uniform way to correlate the readers and the writer of the cache, and we gain better visibility into the data being used.

With verbose enough logging, using the generation makes it easier (albeit admittedly still clunky) to reconstruct the chain of changes which led to a given cache state, which was much harder previously. Similarly, there is now a clear way to learn which cache state was used to make a given scheduling decision, which was much harder before.

The changes mostly involve logging; to avoid a proliferation of return values, however, a trivial refactoring is done in `GetCachedNRTCopy`. A beneficial side effect is much improved documentation of the return values.

Signed-off-by: Francesco Romani <[email protected]>
Force-pushed from 5bd1762 to 0dae3ec
/lgtm
@Tal-or please un-hold if you like the PR
/hold cancel
What type of PR is this?
/kind feature
What this PR does / why we need it:
Improves debuggability of the overreserve cache
We introduce the concept of "generation": an opaque, monotonically increasing integer similar in spirit to the `resourceVersion` kube API field. Every time the internal state of the cache is updated, which by design happens only in the resync loop, we increment the generation.
GetCachedNRTCopy will also return the generation of the data being used, so we now have a uniform way to correlate the readers and the writer of the cache, and we gain better visibility into the data being used.
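The "trivial refactoring to avoid proliferation of return values" can be pictured as bundling the extra metadata into one struct. The following is a toy sketch: the field names and the stand-in lookup are guesses for illustration, not the plugin's real API.

```go
package main

import "fmt"

// cachedNRTInfo sketches the idea of grouping metadata about the
// returned cache copy into a single struct, so that adding a new field
// (like Generation) does not change the method signature again.
type cachedNRTInfo struct {
	Fresh      bool   // whether the copy is usable for this scheduling cycle
	Generation uint64 // cache state the copy was taken from
}

// getCachedNRTCopy is a hypothetical stand-in for the real method: it
// returns the data plus one info value instead of a growing list of
// booleans and counters.
func getCachedNRTCopy(node string) (string, cachedNRTInfo) {
	if node == "" {
		return "", cachedNRTInfo{} // miss: zero info, Fresh == false
	}
	return "nrt-for-" + node, cachedNRTInfo{Fresh: true, Generation: 3}
}

func main() {
	nrt, info := getCachedNRTCopy("node-0")
	fmt.Println(nrt, info.Fresh, info.Generation)
}
```

A side benefit, as the description notes, is that the struct gives the return values a natural place for documentation.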
Which issue(s) this PR fixes:
Fixes N/A
Special notes for your reviewer:
In a nutshell, this change enables better logging and makes it easier to correlate the users of the cache (filter/score) with the reconciliation loop, so it is easier to infer which data was used and when it was updated.