
Add metrics for time saved from local and remote caching #11601

Merged — 6 commits merged into pantsbuild:master from the cache-speedup-metrics branch on Feb 25, 2021

Conversation

@Eric-Arellano (Contributor) commented Feb 25, 2021

This is useful for users to tune caching -- for example, how much time is saved by local caching in CI, and how does this compare to the cache download/upload time?

❯ ./pants --plugins=hdrhistogram --stats-log --pants-config-files=pants.remote-cache.toml lint src/python/pants/util/strutil.py --no-pantsd --no-process-execution-use-local-cache
...
03:02:24.70 [INFO] Counters:
  ...
  remote_cache_total_time_saved_ms: 22222
  ...
03:02:24.70 [INFO] Observation histogram summaries:
03:02:24.98 [INFO] Summary of `remote_cache_time_saved_ms` observation histogram:
  min: 67
  max: 8239
  mean: 2777.750
  std dev: 2866.382
  total observations: 8
  p25: 201
  p50: 222
  p75: 4359
  p90: 5007
  p95: 8239
  p99: 8239
❯ ./pants --plugins=hdrhistogram --stats-log lint src/python/pants/util/strutil.py --no-pantsd
...
03:04:46.64 [INFO] Counters:
  ...
  local_cache_total_time_saved_ms: 22314
  ...
03:04:46.64 [INFO] Observation histogram summaries:
03:04:46.72 [INFO] Summary of `local_cache_time_saved_ms` observation histogram:
  min: 76
  max: 8247
  mean: 2789.125
  std dev: 2866.593
  total observations: 8
  p25: 207
  p50: 238
  p75: 4371
  p90: 5027
  p95: 8247
  p99: 8247

[ci skip-build-wheels]

@Eric-Arellano changed the title from "Add histograms for time saved from local and remote caching" to "Add metrics for time saved from local and remote caching" on Feb 25, 2021
@Eric-Arellano requested a review from asherf on February 25, 2021 03:07
@tdyas (Contributor) left a comment

lgtm generally

Would time saved ever be negative?

@Eric-Arellano (Contributor Author)

Would time saved ever be negative?

We use Duration::checked_sub(), which returns None if the result would be negative.

More generally, speculation with remote cache reads should avoid time saved ever being negative, but I imagine that's not a firm guarantee. And theoretically, a local cache read could be slower. But both of these scenarios seem pretty uncommon, so I'm not sure we need metrics for them.
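
For reference, a minimal sketch of that std behavior (illustrative only, not code from this PR):

use std::time::Duration;

fn main() {
  let original_run = Duration::from_millis(1500);
  let cache_lookup = Duration::from_millis(200);

  // Cache lookup was faster than the original run: we get Some(time saved).
  assert_eq!(
    original_run.checked_sub(cache_lookup),
    Some(Duration::from_millis(1300))
  );

  // If the subtraction would go negative, checked_sub returns None rather
  // than panicking or wrapping.
  assert_eq!(cache_lookup.checked_sub(original_run), None);
}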

@stuhood (Member) commented Feb 25, 2021

Would time saved ever be negative?

We use Duration.checked_sub(), which results in None if the result is negative.

More generally, speculation with remote cache reads should avoid time saved ever being negative, but I imagine that's not a firm guarantee.

It's possible to be negative if the current process run takes longer than the previous one did. I would suggest normalizing negative values into "0"... you still did save time (relative to waiting for the currently running process), we're just not sure how long.

@Eric-Arellano (Contributor Author)

It's possible to be negative if the current process run takes longer than the previous one did.

When would that happen? These metrics are solely used in cache.rs and remote_cache.rs when we had cache hits. If we had a cache miss and need to rerun, then we have already moved past the metrics. We are only comparing the cache overhead to the original cost of running the cached process.

@stuhood (Member) left a comment

Looks good! Nits mostly.

Comment on lines +461 to +463
context
  .workunit_store
  .increment_counter(Metric::RemoteCacheTotalTimeSavedMs, time_saved);
@stuhood (Member)

FWIW, hdrhistogram has a sum method, which should be roughly equivalent here. It won't have perfect accuracy, but it might still be worth it from a clarity/redundancy perspective to avoid double storing things.

@Eric-Arellano (Contributor Author)

Interesting, I don't see that with the Python library though: https://github.com/HdrHistogram/HdrHistogram_py/blob/master/hdrh/histogram.py

So if I understand correctly, we would need to add special-casing code that sums the histogram in Rust and then exposes it over FFI to Python. The double counting seems cleaner than that.

@tdyas (Contributor)

The Python HdrHistogram exposes an iterator (get_recorded_iterator I believe) over all values. You could just sum over that, right?

@Eric-Arellano (Contributor Author)

Sure, but that's still adding special-cased code to our stats_aggregator.py, and it loses precision.

A major use case I'm trying to facilitate: Pants users have reported in the past that they wanted to know how much time caching saved them compared to the cost of uploading/downloading the cache in CI. I want them to be able to answer that question without adding custom code, e.g. by putting this advice in our Pants And CI docs page.

Comment on lines +98 to +102
// If the original process time wasn't recorded, we can't compute the time saved.
assert_eq!(
  ProcessResultMetadata::default().time_saved_from_cache(Duration::new(1, 100)),
  None
);
@stuhood (Member)

Should we consider storing a 0 here instead (similar to the negative case)? The idea would be to always have an entry in this histogram when we end up using the cache, even if we can't report a benefit.

@Eric-Arellano (Contributor Author)

I don't think so, wouldn't that skew the data? Let's say we have 50 cache hits, only 30 of which had the original execution time recorded. That means that 60% of our histogram entries will be 0 and will skew the data to look like the caching is not very helpful.

Fwiw, local execution should always store the original execution time. The only reason we might not have it is that we can't guarantee remote execution has set the ExecutedActionMetadata. But cache.rs and remote_cache.rs only ever deal with local process executions, so in practice those codepaths will always have access to the original time.

@stuhood (Member)

I don't think so, wouldn't that skew the data? Let's say we have 50 cache hits, only 30 of which had the original execution time recorded. That means that 60% of our histogram entries will be 0 and will skew the data to look like the caching is not very helpful.

It wouldn't skew the data... it would make it more accurate, if anything: if there are a bunch of zeros being recorded, we need to expose that, as it could be a problem in one of a few different places.

It's possible to be negative if the current process run takes longer than the previous one did.

When would that happen?

In exactly the case I've described, I think. You ran the process in the past: it took 1.5 seconds to run. You're running it again, and racing it against a cache lookup. For some reason, when you run the process again, it takes a lot longer to run than it did the first time... say 3 seconds. Meanwhile, your cache lookup also takes a lot longer, but still wins the race in 2 seconds. You end up with a time saved value of 1.5 - 2 = -0.5 seconds.

@Eric-Arellano (Contributor Author) Feb 25, 2021

Okay so there are two different scenarios and I think we want to handle them differently:

  1. The original time was not recorded. As explained above, this should never happen because local execution should always have the timing recorded, and cache.rs and remote_cache.rs solely work with local processes, not remote caching. We're only handling the missing data case for type safety - if you want, we could error or use Result in this case, so that we eagerly know when the original process time wasn't recorded? I'm not comfortable with using 0 to represent this case.
  2. The cache hit was slower than the original process, but still faster than rerunning the process (via speculation). This is the negative numbers case. We, by definition, cannot know how slow the process would be on the second run, so the best we can do is represent with a 0. That sounds sensible to me.

So, we could change the modeling of .time_saved_from_cache() to be Result<Duration, String>, where a Duration of 0 means the cache hit was slower than the original process. Or stick with Option<Duration> and use 0. Wdyt?

@stuhood (Member)

The original time was not recorded. As explained above, this should never happen because local execution should always have the timing recorded, and cache.rs and remote_cache.rs solely work with local processes, not remote caching. We're only handling the missing data case for type safety - if you want, we could error or use Result in this case, so that we eagerly know when the original process time wasn't recorded? I'm not comfortable with using 0 to represent this case.

It can happen in the context of remote execution, where a server does not return the value. We shouldn't error for a server that cannot return us metrics, but we should do something to indicate that the server is reporting bad metrics.

So, we could change the modeling of .time_saved_from_cache() to be Result<Duration, String>, where a Duration of 0 means the cache hit was slower than the original process. Or stick with Option<Duration> and use 0. Wdyt?

Warnings would be annoying, and we'd need to add a way to disable them probably. Could put it at debug, but then fair chance that we never notice it. So yea, I'd still vote for 0 here, as it exposes the issue iff someone is actually looking at these metrics... but debug level logging wouldn't be the end of the world.

That sounds sensible to me.

👍

@Eric-Arellano (Contributor Author)

It can happen in the context of remote execution, where a server does not return the value.

Yes, but cache.rs and remote_cache.rs never will consume the results of remote execution, right? I'm only talking about case #1 here.

K, I'll change case #2 to use 0, but keep Option to represent case #1.
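
Roughly what I have in mind (a sketch with illustrative field names, not the exact struct from this PR):

use std::time::Duration;

#[derive(Default)]
struct ProcessResultMetadata {
  // Case #1: the original run time can be missing (e.g. a remote execution
  // server that didn't populate ExecutedActionMetadata), so keep the Option.
  original_run_time: Option<Duration>,
}

impl ProcessResultMetadata {
  fn time_saved_from_cache(&self, cache_lookup: Duration) -> Option<Duration> {
    self.original_run_time.map(|original| {
      // Case #2: the cache hit was slower than the original run. Instead of
      // dropping the observation, clamp the negative value to zero.
      original
        .checked_sub(cache_lookup)
        .unwrap_or_else(|| Duration::from_secs(0))
    })
  }
}

fn main() {
  let metadata = ProcessResultMetadata {
    original_run_time: Some(Duration::from_millis(1500)),
  };
  // A slower cache hit records 0 rather than going negative.
  assert_eq!(
    metadata.time_saved_from_cache(Duration::from_millis(2000)),
    Some(Duration::from_secs(0))
  );
  // A missing original run time (case #1) still yields None.
  assert_eq!(
    ProcessResultMetadata::default().time_saved_from_cache(Duration::from_millis(100)),
    None
  );
}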

@stuhood (Member)

Yes, but cache.rs and remote_cache.rs never will consume the results of remote execution, right? I'm only talking about case #1 here.

They will if a local cache is wrapped around remote execution, which is common.

@Eric-Arellano (Contributor Author)

They will if a local cache is wrapped around remote execution, which is common.

I'm not familiar with this workflow - when does this happen?

@stuhood (Member)

// Possibly either add the remote execution runner or the remote cache runner.
// `global_options.py` already validates that both are not set at the same time.
let maybe_remote_enabled_command_runner: Box<dyn CommandRunner> =
  if remoting_opts.execution_enable {
    Box::new(BoundedCommandRunner::new(
      Box::new(process_execution::remote::CommandRunner::new(
        // No problem unwrapping here because the global options validation
        // requires the remoting_opts.execution_server be present when
        // remoting_opts.execution_enable is set.
        &remoting_opts.execution_address.clone().unwrap(),
        remoting_opts.store_addresses.clone(),
        process_execution_metadata.clone(),
        root_ca_certs.clone(),
        remoting_opts.execution_headers.clone(),
        full_store.clone(),
        // TODO if we ever want to configure the remote platform to be something else we
        // need to take an option all the way down here and into the remote::CommandRunner struct.
        Platform::Linux,
        remoting_opts.execution_overall_deadline,
        Duration::from_millis(100),
      )?),
      exec_strategy_opts.remote_parallelism,
    ))
  } else if remote_caching_used {
    let action_cache_address = remote_store_addresses
      .first()
      .ok_or_else(|| "At least one remote store must be specified".to_owned())?;
    Box::new(process_execution::remote_cache::CommandRunner::new(
      local_command_runner.into(),
      process_execution_metadata.clone(),
      executor.clone(),
      full_store.clone(),
      action_cache_address.as_str(),
      root_ca_certs.clone(),
      remoting_opts.store_headers.clone(),
      Platform::current()?,
      exec_strategy_opts.remote_cache_read,
      exec_strategy_opts.remote_cache_write,
      remoting_opts.cache_eager_fetch,
    )?)
  } else {
    local_command_runner
  };
// Possibly use the local cache runner, regardless of remote execution/caching.
let maybe_local_cached_command_runner = if exec_strategy_opts.use_local_cache {
  let process_execution_store = ShardedLmdb::new(
    local_store_dir.join("processes"),
    2 * DEFAULT_LOCAL_STORE_GC_TARGET_BYTES,
    executor.clone(),
    DEFAULT_LEASE_TIME,
  )
  .map_err(|err| format!("Could not initialize store for process cache: {:?}", err))?;
  Box::new(process_execution::cache::CommandRunner::new(
    maybe_remote_enabled_command_runner.into(),
    process_execution_store,
    full_store.clone(),
    process_execution_metadata.clone(),
  ))
} else {
  maybe_remote_enabled_command_runner
};

maybe_local_cached_command_runner is wrapped around maybe_remote_enabled_command_runner, so when remote execution is enabled, cache.rs will consume results produced by the remote execution runner.

@Eric-Arellano merged commit 9dd433d into pantsbuild:master on Feb 25, 2021
@Eric-Arellano deleted the cache-speedup-metrics branch on February 25, 2021 23:06