Add metrics for time saved from local and remote caching #11601
FWIW, hdrhistogram has a sum method, which should be roughly equivalent here. It won't have perfect accuracy, but it might still be worth it from a clarity/redundancy perspective to avoid double-storing things.
Interesting, I don't see that in the Python library, though: https://github.com/HdrHistogram/HdrHistogram_py/blob/master/hdrh/histogram.py
So, if I understand correctly, we would need to add special-casing code that sums the histogram in Rust and then exposes the sum over FFI to Python. The double counting seems cleaner than that.
The Python HdrHistogram exposes an iterator (get_recorded_iterator, I believe) over all recorded values. You could just sum over that, right?
Sure, but that's still adding special-cased code to our stats_aggregator.py, and it loses precision. A major use case I'm trying to facilitate: Pants users have reported in the past that they wanted to know how much time caching saved compared to the cost of uploading/downloading the cache in CI. I want them to be able to answer that question without writing custom code, e.g. by putting this advice on our Pants And CI docs page.
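For illustration, a rough sketch of the "double counting" approach being defended here: record each time-saved observation into a histogram (for percentiles) and also into an exact running total (for sums). It assumes the Rust hdrhistogram crate's new_with_bounds/saturating_record API; the type and field names are hypothetical, not the actual Pants metrics code.

```rust
use hdrhistogram::Histogram;

/// Hypothetical container for the "time saved" metric: the histogram answers
/// percentile questions, while total_ms keeps an exact sum so nothing has to
/// be re-derived from histogram buckets over FFI.
struct TimeSavedMetrics {
    histogram: Histogram<u64>,
    total_ms: u64,
}

impl TimeSavedMetrics {
    fn new() -> Self {
        TimeSavedMetrics {
            // Track millisecond values up to one hour, with 3 significant figures.
            histogram: Histogram::new_with_bounds(1, 60 * 60 * 1000, 3)
                .expect("static bounds are valid"),
            total_ms: 0,
        }
    }

    fn observe(&mut self, time_saved_ms: u64) {
        // saturating_record clamps out-of-range values instead of erroring.
        self.histogram.saturating_record(time_saved_ms);
        self.total_ms += time_saved_ms;
    }
}
```

The cost is storing each observation twice, but the sum stays exact and consumers like stats_aggregator.py need no special-cased summing logic.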
Should we consider storing a 0 here instead (similar to the negative case)? The idea would be to always have an entry in this histogram when we end up using the cache, even if we can't report a benefit.
I don't think so - wouldn't that skew the data? Let's say we have 50 cache hits, only 30 of which had the original execution time recorded. That means 40% of our histogram entries would be 0, which would skew the data to make caching look less helpful than it is.
Fwiw, local execution should always be storing the original execution time. The only reason we might not have the execution time is that we can't guarantee remote execution has set the ExecutedActionMetadata. But cache.rs and remote_cache.rs only ever deal with local process executions, so in practice those codepaths will always have access to the original time.
It wouldn't skew the data... it would make it more accurate, if anything: if there are a bunch of zeros being recorded, we need to expose that, as it could be a problem in one of a few different places.
In exactly the case I've described, I think. You ran the process in the past: it took 1.5 seconds to run. You're running it again, and racing it against a cache lookup. For some reason, when you run the process again, it takes a lot longer to run than it did the first time... say 3 seconds. Meanwhile, your cache lookup also takes a lot longer, but still wins the race in 2 seconds. You end up with a time saved value of 1.5 - 2, i.e. a negative number.
Okay, so there are two different scenarios, and I think we want to handle them differently:

1. The original process time was never recorded. cache.rs and remote_cache.rs solely work with local processes, not remote execution, so we're only handling this missing-data case for type safety. If you want, we could error or use Result here, so that we eagerly know when the original process time wasn't recorded? I'm not comfortable with using 0 to represent this case.
2. The cache hit was slower than the original process run.

So, we could change the modeling of .time_saved_from_cache() to be Result<Duration, String>, where a Duration of 0 means the cache hit was slower than the original process. Or stick with Option<Duration> and use 0. Wdyt?
It can happen in the context of remote execution, where a server does not return the value. We shouldn't error for a server that cannot return us metrics, but we should do something to indicate that the server is reporting bad metrics.
Warnings would be annoying, and we'd probably need to add a way to disable them. We could put it at debug, but then there's a fair chance we never notice it. So yeah, I'd still vote for 0 here, as it exposes the issue iff someone is actually looking at these metrics... but debug-level logging wouldn't be the end of the world.
👍
Yes, but cache.rs and remote_cache.rs will never consume the results of remote execution, right? I'm only talking about case #1 here.
K, I'll change case #2 to use 0, but keep Option to represent case #1.
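A minimal sketch of that resolution, with hypothetical names and signatures rather than the actual Pants code: None represents case #1 (the original run time was never recorded), and a zero Duration represents case #2 (the cache hit was slower than the original run, as in the 1.5s vs. 2s race above).

```rust
use std::time::Duration;

/// Hypothetical sketch: how much time a cache hit saved relative to the
/// original run. Returns None when the original run time was never recorded
/// (case #1), and clamps to zero when the cache lookup took longer than the
/// original run (case #2).
fn time_saved_from_cache(
    original_run_time: Option<Duration>,
    cache_lookup_time: Duration,
) -> Option<Duration> {
    original_run_time.map(|original| {
        original
            .checked_sub(cache_lookup_time)
            .unwrap_or_else(|| Duration::from_secs(0))
    })
}
```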
They will if a local cache is wrapped around remote execution, which is common.
I'm not familiar with this workflow - when does this happen?
See pants/src/rust/engine/src/context.rs, lines 157 to 218 (at e19336d): maybe_local_cached_command_runner is wrapped around maybe_remote_enabled_command_runner.
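To make the layering concrete, here is a simplified, hypothetical sketch of that wrapping; the real CommandRunner trait and the constructors in context.rs are considerably more involved.

```rust
// Hypothetical, simplified model of the runner stack described above.
trait CommandRunner {
    fn run(&self, process: &str) -> String;
}

struct RemoteExecutionRunner;

impl CommandRunner for RemoteExecutionRunner {
    fn run(&self, process: &str) -> String {
        format!("remotely executed: {}", process)
    }
}

/// A local cache wrapped around any inner runner. When the inner runner is the
/// remote-execution runner, the caching layer ends up consuming (and storing)
/// remotely executed results, so it cannot assume the original run was local.
struct LocalCacheRunner<R: CommandRunner> {
    inner: R,
}

impl<R: CommandRunner> CommandRunner for LocalCacheRunner<R> {
    fn run(&self, process: &str) -> String {
        // A real implementation would consult the local cache first and only
        // fall back to the inner runner on a miss.
        self.inner.run(process)
    }
}

fn main() {
    let runner = LocalCacheRunner { inner: RemoteExecutionRunner };
    println!("{}", runner.run("compile thirdparty"));
}
```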