
Add metrics for time saved from local and remote caching #11601

Merged — 6 commits merged into pantsbuild:master from the cache-speedup-metrics branch on Feb 25, 2021

Conversation

@Eric-Arellano (Contributor) commented Feb 25, 2021

This is useful for users to tune caching -- for example, how much time is saved by local caching in CI, and how does this compare to the cache download/upload time?

❯ ./pants --plugins=hdrhistogram --stats-log --pants-config-files=pants.remote-cache.toml lint src/python/pants/util/strutil.py --no-pantsd --no-process-execution-use-local-cache
...
03:02:24.70 [INFO] Counters:
  ...
  remote_cache_total_time_saved_ms: 22222
  ...
03:02:24.70 [INFO] Observation histogram summaries:
03:02:24.98 [INFO] Summary of `remote_cache_time_saved_ms` observation histogram:
  min: 67
  max: 8239
  mean: 2777.750
  std dev: 2866.382
  total observations: 8
  p25: 201
  p50: 222
  p75: 4359
  p90: 5007
  p95: 8239
  p99: 8239
❯ ./pants --plugins=hdrhistogram --stats-log lint src/python/pants/util/strutil.py --no-pantsd
...
03:04:46.64 [INFO] Counters:
  ...
  local_cache_total_time_saved_ms: 22314
  ...
03:04:46.64 [INFO] Observation histogram summaries:
03:04:46.72 [INFO] Summary of `local_cache_time_saved_ms` observation histogram:
  min: 76
  max: 8247
  mean: 2789.125
  std dev: 2866.593
  total observations: 8
  p25: 207
  p50: 238
  p75: 4371
  p90: 5027
  p95: 8247
  p99: 8247

[ci skip-build-wheels]

@Eric-Arellano changed the title from "Add histograms for time saved from local and remote caching" to "Add metrics for time saved from local and remote caching" on Feb 25, 2021
@Eric-Arellano requested a review from asherf on February 25, 2021 03:07
@tdyas (Contributor) left a comment

lgtm generally

Would time saved ever be negative?

@Eric-Arellano (Contributor Author)

Would time saved ever be negative?

We use Duration::checked_sub(), which returns None if the result would be negative.

More generally, speculation with remote cache reads should avoid time saved ever being negative, but I imagine that's not a firm guarantee. And theoretically, a local cache read could be slower. But both of these scenarios seem pretty uncommon, so I'm not sure we need metrics for them.
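
For reference, a minimal sketch of that std behavior (illustrative only, not code from this PR):

use std::time::Duration;

fn main() {
  let original_run = Duration::from_millis(1500);
  let cache_lookup = Duration::from_millis(200);

  // Cache lookup was faster than the original run: we get Some(time saved).
  assert_eq!(
    original_run.checked_sub(cache_lookup),
    Some(Duration::from_millis(1300))
  );

  // If the subtraction would go negative, checked_sub returns None rather
  // than panicking or wrapping.
  assert_eq!(cache_lookup.checked_sub(original_run), None);
}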

@stuhood (Member) commented Feb 25, 2021

Would time saved ever be negative?

We use Duration.checked_sub(), which results in None if the result is negative.

More generally, speculation with remote cache reads should avoid time saved ever being negative, but I imagine that's not a firm guarantee.

It's possible to be negative if the current process run takes longer than the previous one did. I would suggest normalizing negative values into "0"... you still did save time (relative to waiting for the currently running process), we're just not sure how long.

@Eric-Arellano (Contributor Author)

It's possible to be negative if the current process run takes longer than the previous one did.

When would that happen? These metrics are solely used in cache.rs and remote_cache.rs when we had cache hits. If we had a cache miss and need to rerun, then we have already moved past the metrics. We are only comparing the cache overhead to the original cost of running the cached process.

@stuhood (Member) left a comment

Looks good! Nits mostly.

Comment on lines +461 to +463
context
  .workunit_store
  .increment_counter(Metric::RemoteCacheTotalTimeSavedMs, time_saved);
@stuhood (Member)

FWIW, hdrhistogram has a sum method, which should be roughly equivalent here. It won't have perfect accuracy, but it might still be worth it from a clarity/redundancy perspective to avoid double storing things.

@Eric-Arellano (Contributor Author)

Interesting, I don't see that with the Python library though: https://github.com/HdrHistogram/HdrHistogram_py/blob/master/hdrh/histogram.py

So if I understand correctly, we would need to add special-casing code that sums the histogram in Rust and then exposes it over FFI to Python. The double counting seems cleaner than that.

@tdyas (Contributor)

The Python HdrHistogram exposes an iterator (get_recorded_iterator I believe) over all values. You could just sum over that, right?

@Eric-Arellano (Contributor Author)

Sure, but that's still adding special-cased code to our stats_aggregator.py, and it loses precision.

A major use case I'm trying to facilitate: Pants users have reported in the past that they wanted to know how much time caching saved them compared to the cost of uploading/downloading the cache in CI. I want them to be able to answer that question without adding custom code, e.g. by putting this advice in our Pants And CI docs page.

Comment on lines +98 to +102
// If the original process time wasn't recorded, we can't compute the time saved.
assert_eq!(
  ProcessResultMetadata::default().time_saved_from_cache(Duration::new(1, 100)),
  None
);
@stuhood (Member)

Should we consider storing a 0 here instead (similar to the negative case)? The idea would be to always have an entry in this histogram when we end up using the cache, even if we can't report a benefit.

@Eric-Arellano (Contributor Author)

I don't think so, wouldn't that skew the data? Let's say we have 50 cache hits, only 30 of which had the original execution time recorded. That means that 60% of our histogram entries will be 0 and will skew the data to look like the caching is not very helpful.

Fwiw, local execution should always store the original execution time. The only reason we might not have it is that we can't guarantee remote execution has set the ExecutedActionMetadata. But cache.rs and remote_cache.rs only ever deal with local process executions, so in practice those codepaths will always have access to the original time.

@stuhood (Member)

I don't think so, wouldn't that skew the data? Let's say we have 50 cache hits, only 30 of which had the original execution time recorded. That means that 60% of our histogram entries will be 0 and will skew the data to look like the caching is not very helpful.

It wouldn't skew the data... it would make it more accurate, if anything: if there are a bunch of zeros being recorded, we need to expose that, as it could be a problem in one of a few different places.

It's possible to be negative if the current process run takes longer than the previous one did.

When would that happen?

In exactly the case I've described, I think. You ran the process in the past: it took 1.5 seconds to run. You're running it again, and racing it against a cache lookup. For some reason, when you run the process again, it takes a lot longer to run than it did the first time... say 3 seconds. Meanwhile, your cache lookup also takes a lot longer, but still wins the race in 2 seconds. You end up with a time saved value of 1.5 - 2 = -0.5 seconds.

@Eric-Arellano (Contributor Author) Feb 25, 2021

Okay so there are two different scenarios and I think we want to handle them differently:

  1. The original time was not recorded. As explained above, this should never happen because local execution should always have the timing recorded, and cache.rs and remote_cache.rs solely work with local processes, not remote caching. We're only handling the missing data case for type safety - if you want, we could error or use Result in this case, so that we eagerly know when the original process time wasn't recorded? I'm not comfortable with using 0 to represent this case.
  2. The cache hit was slower than the original process, but still faster than rerunning the process (via speculation). This is the negative numbers case. We, by definition, cannot know how slow the process would be on the second run, so the best we can do is represent with a 0. That sounds sensible to me.

So, we could change the modeling of .time_saved_from_cache() to be Result<Duration, String>, where a Duration of 0 means the cache hit was slower than the original process. Or stick with Option<Duration> and use 0. Wdyt?

@stuhood (Member)

The original time was not recorded. As explained above, this should never happen because local execution should always have the timing recorded, and cache.rs and remote_cache.rs solely work with local processes, not remote caching. We're only handling the missing data case for type safety - if you want, we could error or use Result in this case, so that we eagerly know when the original process time wasn't recorded? I'm not comfortable with using 0 to represent this case.

It can happen in the context of remote execution, where a server does not return the value. We shouldn't error for a server that cannot return us metrics, but we should do something to indicate that the server is reporting bad metrics.

So, we could change the modeling of .time_saved_from_cache() to be Result<Duration, String>, where a Duration of 0 means the cache hit was slower than the original process. Or stick with Option<Duration> and use 0. Wdyt?

Warnings would be annoying, and we'd need to add a way to disable them probably. Could put it at debug, but then fair chance that we never notice it. So yea, I'd still vote for 0 here, as it exposes the issue iff someone is actually looking at these metrics... but debug level logging wouldn't be the end of the world.

That sounds sensible to me.

👍

@Eric-Arellano (Contributor Author)

It can happen in the context of remote execution, where a server does not return the value.

Yes, but cache.rs and remote_cache.rs never will consume the results of remote execution, right? I'm only talking about case #1 here.

K, I'll change case #2 to use 0, but keep Option to represent case #1.
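
Roughly what I have in mind (a sketch with illustrative field names, not the exact struct from this PR):

use std::time::Duration;

#[derive(Default)]
struct ProcessResultMetadata {
  // Case #1: the original run time can be missing (e.g. a remote execution
  // server that didn't populate ExecutedActionMetadata), so keep the Option.
  original_run_time: Option<Duration>,
}

impl ProcessResultMetadata {
  fn time_saved_from_cache(&self, cache_lookup: Duration) -> Option<Duration> {
    self.original_run_time.map(|original| {
      // Case #2: the cache hit was slower than the original run. Instead of
      // dropping the observation, clamp the negative value to zero.
      original
        .checked_sub(cache_lookup)
        .unwrap_or_else(|| Duration::from_secs(0))
    })
  }
}

fn main() {
  let metadata = ProcessResultMetadata {
    original_run_time: Some(Duration::from_millis(1500)),
  };
  // A slower cache hit records 0 rather than going negative.
  assert_eq!(
    metadata.time_saved_from_cache(Duration::from_millis(2000)),
    Some(Duration::from_secs(0))
  );
  // A missing original run time (case #1) still yields None.
  assert_eq!(
    ProcessResultMetadata::default().time_saved_from_cache(Duration::from_millis(100)),
    None
  );
}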

@stuhood (Member)

Yes, but cache.rs and remote_cache.rs never will consume the results of remote execution, right? I'm only talking about case #1 here.

They will if a local cache is wrapped around remote execution, which is common.

@Eric-Arellano (Contributor Author)

They will if a local cache is wrapped around remote execution, which is common.

I'm not familiar with this workflow - when does this happen?

@stuhood (Member)

// Possibly either add the remote execution runner or the remote cache runner.
// `global_options.py` already validates that both are not set at the same time.
let maybe_remote_enabled_command_runner: Box<dyn CommandRunner> =
  if remoting_opts.execution_enable {
    Box::new(BoundedCommandRunner::new(
      Box::new(process_execution::remote::CommandRunner::new(
        // No problem unwrapping here because the global options validation
        // requires the remoting_opts.execution_server be present when
        // remoting_opts.execution_enable is set.
        &remoting_opts.execution_address.clone().unwrap(),
        remoting_opts.store_addresses.clone(),
        process_execution_metadata.clone(),
        root_ca_certs.clone(),
        remoting_opts.execution_headers.clone(),
        full_store.clone(),
        // TODO if we ever want to configure the remote platform to be something else we
        // need to take an option all the way down here and into the remote::CommandRunner struct.
        Platform::Linux,
        remoting_opts.execution_overall_deadline,
        Duration::from_millis(100),
      )?),
      exec_strategy_opts.remote_parallelism,
    ))
  } else if remote_caching_used {
    let action_cache_address = remote_store_addresses
      .first()
      .ok_or_else(|| "At least one remote store must be specified".to_owned())?;
    Box::new(process_execution::remote_cache::CommandRunner::new(
      local_command_runner.into(),
      process_execution_metadata.clone(),
      executor.clone(),
      full_store.clone(),
      action_cache_address.as_str(),
      root_ca_certs.clone(),
      remoting_opts.store_headers.clone(),
      Platform::current()?,
      exec_strategy_opts.remote_cache_read,
      exec_strategy_opts.remote_cache_write,
      remoting_opts.cache_eager_fetch,
    )?)
  } else {
    local_command_runner
  };
// Possibly use the local cache runner, regardless of remote execution/caching.
let maybe_local_cached_command_runner = if exec_strategy_opts.use_local_cache {
  let process_execution_store = ShardedLmdb::new(
    local_store_dir.join("processes"),
    2 * DEFAULT_LOCAL_STORE_GC_TARGET_BYTES,
    executor.clone(),
    DEFAULT_LEASE_TIME,
  )
  .map_err(|err| format!("Could not initialize store for process cache: {:?}", err))?;
  Box::new(process_execution::cache::CommandRunner::new(
    maybe_remote_enabled_command_runner.into(),
    process_execution_store,
    full_store.clone(),
    process_execution_metadata.clone(),
  ))
} else {
  maybe_remote_enabled_command_runner
};

maybe_local_cached_command_runner is wrapped around maybe_remote_enabled_command_runner, so when remote execution is enabled, cache.rs will consume results produced by the remote execution runner.

@Eric-Arellano merged commit 9dd433d into pantsbuild:master on Feb 25, 2021
@Eric-Arellano deleted the cache-speedup-metrics branch on February 25, 2021 23:06