Add metrics for time saved from local and remote caching #11601
Conversation
lgtm generally
Would time saved ever be negative?
More generally, we use speculation with remote cache reads, which should avoid time saved ever being negative, but I imagine that's not a firm guarantee. And theoretically, a local cache read could be slower. But these scenarios both seem pretty uncommon, so I'm not sure we need metrics for them.
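(For readers unfamiliar with speculation: the rough idea is to race the cache lookup against actually running the process and take whichever finishes first. A minimal standalone sketch of that idea follows; it uses tokio and simulated durations purely for illustration and is not Pants's actual CommandRunner implementation.)

use std::time::Duration;
use tokio::time::sleep;

// Minimal sketch of the speculation idea only: race a cache lookup against a
// local run and take whichever finishes first. Durations are simulated.
async fn run_with_speculation() -> &'static str {
  let cache_lookup = async {
    sleep(Duration::from_millis(200)).await; // simulated cache round trip
    "cache hit"
  };
  let local_run = async {
    sleep(Duration::from_millis(1500)).await; // simulated process execution
    "local run"
  };
  tokio::select! {
    winner = cache_lookup => winner,
    winner = local_run => winner,
  }
}

#[tokio::main]
async fn main() {
  println!("winner: {}", run_with_speculation().await);
}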
It's possible to be negative if the current process run takes longer than the previous one did. I would suggest normalizing negative values into "0"... you still did save time (relative to waiting for the currently running process), we're just not sure how long.
When would that happen? These metrics are solely used in cache.rs and remote_cache.rs.
Looks good! Nits mostly.
context
  .workunit_store
  .increment_counter(Metric::RemoteCacheTotalTimeSavedMs, time_saved);
FWIW, hdrhistogram has a sum method, which should be roughly equivalent here. It won't have perfect accuracy, but it might still be worth it from a clarity/redundancy perspective to avoid double storing things.
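(As a rough illustration of the sum-instead-of-separate-counter idea: the sketch below approximates the total by walking the recorded buckets of a Rust hdrhistogram Histogram. The iterator and accessor names are assumptions about the crate's API rather than confirmed by this thread, and the result loses some precision to bucketing, as noted above.)

use hdrhistogram::Histogram;

// Rough sketch only: approximate the total of all recorded values by walking
// the recorded buckets instead of keeping a separate running counter.
// Assumes the crate exposes `iter_recorded()`, `value_iterated_to()`, and
// `count_at_value()`; precision is limited by the histogram's bucketing.
fn approximate_total_ms(histogram: &Histogram<u64>) -> u64 {
  histogram
    .iter_recorded()
    .map(|v| v.value_iterated_to() * v.count_at_value())
    .sum()
}

fn main() {
  let mut histogram = Histogram::<u64>::new_with_bounds(1, 60 * 60 * 1000, 3).unwrap();
  for time_saved_ms in [1_500_u64, 250, 3_200] {
    histogram.record(time_saved_ms).unwrap();
  }
  println!("approximate total time saved: {} ms", approximate_total_ms(&histogram));
}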
Interesting, I don't see that with the Python library though: https://github.com/HdrHistogram/HdrHistogram_py/blob/master/hdrh/histogram.py
So IIUC, we would need to add special-casing code that sums the histogram in Rust, then exposes it over FFI to Python. The double counting seems cleaner than that.
The Python HdrHistogram exposes an iterator (get_recorded_iterator, I believe) over all values. You could just sum over that, right?
Sure, but that's still adding special-cased code to our stats_aggregator.py, and it loses precision.
A major use case I'm trying to facilitate: Pants users have reported in the past that they wanted to know how much time caching saved compared to the cost of uploading/downloading the cache in CI. I want to make that question answerable without users needing to add custom code, e.g. by putting this advice in our Pants And CI docs page.
// If the original process time wasn't recorded, we can't compute the time saved.
assert_eq!(
  ProcessResultMetadata::default().time_saved_from_cache(Duration::new(1, 100)),
  None
);
Should we consider storing a 0 here instead (similar to the negative case)? The idea would be to always have an entry in this histogram when we end up using the cache, even if we can't report a benefit.
I don't think so - wouldn't that skew the data? Let's say we have 50 cache hits, only 30 of which had the original execution time recorded. That means that 40% of our histogram entries will be 0 and will skew the data to look like the caching is not very helpful.
FWIW, local execution should always be storing the original execution time. The only reason we might not have the execution time is that we can't guarantee remote execution has set the ExecutedActionMetadata. But cache.rs and remote_cache.rs only ever deal with local process executions, so in practice, those codepaths will always have access to the original time.
I don't think so - wouldn't that skew the data? Let's say we have 50 cache hits, only 30 of which had the original execution time recorded. That means that 40% of our histogram entries will be 0 and will skew the data to look like the caching is not very helpful.
It wouldn't skew the data... it would make it more accurate, if anything: if there are a bunch of zeros being recorded, we need to expose that, as it could be a problem in one of a few different places.
It's possible to be negative if the current process run takes longer than the previous one did.
When would that happen?
In exactly the case I've described, I think. You ran the process in the past: it took 1.5 seconds to run. You're running it again, and racing it against a cache lookup. For some reason, when you run the process again, it takes a lot longer to run than it did the first time... say 3 seconds. Meanwhile, your cache lookup also takes a lot longer, but still wins the race in 2 seconds. You end up with a time saved value of 1.5 - 2 = -0.5 seconds.
Okay, so there are two different scenarios and I think we want to handle them differently:
- The original time was not recorded. As explained above, this should never happen because local execution should always have the timing recorded, and cache.rs and remote_cache.rs solely work with local processes, not remote execution. We're only handling the missing-data case for type safety - if you want, we could error or use Result in this case, so that we eagerly know when the original process time wasn't recorded? I'm not comfortable with using 0 to represent this case.
- The cache hit was slower than the original process, but still faster than rerunning the process (via speculation). This is the negative-numbers case. We, by definition, cannot know how slow the process would have been on the second run, so the best we can do is represent it with a 0. That sounds sensible to me.
So, we could change the modeling of .time_saved_from_cache() to be Result<Duration, String>, where a Duration of 0 means the cache hit was slower than the original process. Or stick with Option<Duration> and use 0. Wdyt?
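(To make the Option-with-0 proposal concrete, here is a hypothetical sketch. The struct shape and the field name original_elapsed are invented for illustration and do not match the real ProcessResultMetadata; it only demonstrates the "None means unrecorded, zero means clamped negative" semantics being discussed.)

use std::time::Duration;

// Hypothetical sketch of the Option-based modeling proposed above. `None`
// means the original elapsed time was never recorded; a zero Duration means
// the cache hit was not faster than the original run (the negative case
// clamped to 0). The field name is invented for illustration.
struct ProcessResultMetadata {
  original_elapsed: Option<Duration>,
}

impl ProcessResultMetadata {
  fn time_saved_from_cache(&self, cache_lookup_elapsed: Duration) -> Option<Duration> {
    self.original_elapsed.map(|original| {
      // `checked_sub` is None when the cache lookup was slower than the
      // original run, which we normalize to zero rather than dropping.
      original.checked_sub(cache_lookup_elapsed).unwrap_or_default()
    })
  }
}

fn main() {
  let missing = ProcessResultMetadata { original_elapsed: None };
  assert_eq!(missing.time_saved_from_cache(Duration::from_secs(1)), None);

  let recorded = ProcessResultMetadata { original_elapsed: Some(Duration::from_millis(1500)) };
  // Cache lookup was slower than the original run: clamp to zero.
  assert_eq!(recorded.time_saved_from_cache(Duration::from_secs(2)), Some(Duration::ZERO));
  // Cache lookup was faster: report the difference.
  assert_eq!(
    recorded.time_saved_from_cache(Duration::from_millis(500)),
    Some(Duration::from_millis(1000))
  );
}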
The original time was not recorded. As explained above, this should never happen because local execution should always have the timing recorded, and cache.rs and remote_cache.rs solely work with local processes, not remote execution. We're only handling the missing-data case for type safety - if you want, we could error or use Result in this case, so that we eagerly know when the original process time wasn't recorded? I'm not comfortable with using 0 to represent this case.
It can happen in the context of remote execution, where a server does not return the value. We shouldn't error for a server that cannot return us metrics, but we should do something to indicate that the server is reporting bad metrics.
So, we could change the modeling of .time_saved_from_cache() to be Result<Duration, String>, where a Duration of 0 means the cache hit was slower than the original process. Or stick with Option<Duration> and use 0. Wdyt?
Warnings would be annoying, and we'd probably need to add a way to disable them. We could put it at debug level, but then there's a fair chance we'd never notice it. So yeah, I'd still vote for 0 here, as it exposes the issue iff someone is actually looking at these metrics... but debug-level logging wouldn't be the end of the world.
That sounds sensible to me.
👍
Yes, but cache.rs and remote_cache.rs will never consume the results of remote execution, right? I'm only talking about case #1 here.
They will if a local cache is wrapped around remote execution, which is common.
They will if a local cache is wrapped around remote execution, which is common.
I'm not familiar with this workflow - when does this happen?
pants/src/rust/engine/src/context.rs
Lines 157 to 218 in e19336d
// Possibly either add the remote execution runner or the remote cache runner.
// `global_options.py` already validates that both are not set at the same time.
let maybe_remote_enabled_command_runner: Box<dyn CommandRunner> =
  if remoting_opts.execution_enable {
    Box::new(BoundedCommandRunner::new(
      Box::new(process_execution::remote::CommandRunner::new(
        // No problem unwrapping here because the global options validation
        // requires the remoting_opts.execution_server be present when
        // remoting_opts.execution_enable is set.
        &remoting_opts.execution_address.clone().unwrap(),
        remoting_opts.store_addresses.clone(),
        process_execution_metadata.clone(),
        root_ca_certs.clone(),
        remoting_opts.execution_headers.clone(),
        full_store.clone(),
        // TODO if we ever want to configure the remote platform to be something else we
        // need to take an option all the way down here and into the remote::CommandRunner struct.
        Platform::Linux,
        remoting_opts.execution_overall_deadline,
        Duration::from_millis(100),
      )?),
      exec_strategy_opts.remote_parallelism,
    ))
  } else if remote_caching_used {
    let action_cache_address = remote_store_addresses
      .first()
      .ok_or_else(|| "At least one remote store must be specified".to_owned())?;
    Box::new(process_execution::remote_cache::CommandRunner::new(
      local_command_runner.into(),
      process_execution_metadata.clone(),
      executor.clone(),
      full_store.clone(),
      action_cache_address.as_str(),
      root_ca_certs.clone(),
      remoting_opts.store_headers.clone(),
      Platform::current()?,
      exec_strategy_opts.remote_cache_read,
      exec_strategy_opts.remote_cache_write,
      remoting_opts.cache_eager_fetch,
    )?)
  } else {
    local_command_runner
  };

// Possibly use the local cache runner, regardless of remote execution/caching.
let maybe_local_cached_command_runner = if exec_strategy_opts.use_local_cache {
  let process_execution_store = ShardedLmdb::new(
    local_store_dir.join("processes"),
    2 * DEFAULT_LOCAL_STORE_GC_TARGET_BYTES,
    executor.clone(),
    DEFAULT_LEASE_TIME,
  )
  .map_err(|err| format!("Could not initialize store for process cache: {:?}", err))?;
  Box::new(process_execution::cache::CommandRunner::new(
    maybe_remote_enabled_command_runner.into(),
    process_execution_store,
    full_store.clone(),
    process_execution_metadata.clone(),
  ))
} else {
  maybe_remote_enabled_command_runner
};
maybe_local_cached_command_runner is wrapped around maybe_remote_enabled_command_runner.
This is useful for users to tune caching -- for example, how much time is saved by local caching in CI, and how does this compare to the cache download/upload time?
[ci skip-build-wheels]