Record a metric of proxy errors #417

olix0r · 2020-01-28T00:09:14Z

The proxy handles errors by synthesizing responses; however, depending
on where such an error occurs, it may not be recorded by a metrics
layer.

This change introduces a request_errors_total counter that records
the number/kind of errors encountered. This is immediately helpful for
debugging and diagnostics, but could also be helpful to view in the UI.
We may want to revisit this metric, especialy its labels, before relying
on it more prominently. But in the meantime it's extraordinarily useful
as it is.

The proxy handles errors by synthesizing responses; however, depending on where such an error occurs, it may not be recorded by a metrics layer. This change introduces an `request_errors_total` counter that records the number/kind of errors encountered. This is immediately helpful for debugging and diagnostics, but could also be helpful to view in the UI. We may want to revisit this metric, especialy its labels, before relying on it more prominently. But in the meantime it's extraordinarily useful as it is.

hawkw

seems very straightforward — lgtm!

Cargo.lock

kleimkuhler

I have a question about the implementation of error-metrics. It's not a blocker on merging, but once that's resolved, everything else looks good to me.

linkerd/app/core/src/errors.rs

kleimkuhler · 2020-02-18T19:51:35Z

linkerd/error-metrics/src/lib.rs

+pub struct RecordErrorLayer<L, K: Hash + Eq> {
+    label: L,
+    errors: Arc<Mutex<IndexMap<K, Counter>>>,
+}
+
+pub struct RecordError<L, K: Hash + Eq, S> {
+    label: L,
+    errors: Arc<Mutex<IndexMap<K, Counter>>>,
+    inner: S,
+}


What are your thoughts on changing these structs to:

pub struct RecordErrorLayer<L, K: Hash + Eq> { label: L, registry: Registry<K>, } pub struct RecordError<L, K: Hash + Eq, S> { label: L, registry: Registry<K>, inner: S, }

When reading through this change, this change is what I expected to see and how I thought about it when understanding the rest of the impls.

Alternatively, this could be similar to the Registry in #428 where:

type Registry<L> = Arc<Mutex<IndexMap<L, Arc<Metrics>>>>;

Eh, the Registry type is really there as a public handle for getting layers and reporting the results... It's a bit confusing for Registry to produce layers but those layers just hold Registries?

This might all be clearer if I split the types up into separate files?

Yea it helps, but I think what I'm reacting to is it seems like each *-metrics crate has it's own way of implementing what it's concept of a Registry is and they are just different enough that I feel like I'm reviewing new impls.

Reviewing this change, the other open #428 metrics change, and the previous http-metrics crate change all have a similar purpose but don't really make their metrics recording services the same way.

I'm okay merging this as is, but it may just be something to be aware of that as more metrics are added, see if it makes sense to introduce a reusable "metrics registry".

linkerd/error-metrics/src/service.rs

This release includes the results from continued profiling & performance analysis. In addition to modifying internals to prevent unwarranted memory growth, we've introduced new metrics to aid in debugging and diagnostics: a new `request_errors_total` metric exposes the number of requests that receive synthesized responses due to proxy errors; and a suite of `stack_*` metrics expose proxy internals that can help us identify unexpected behavior. --- * trace: update `tracing-subscriber` dependency to 0.2.1 (linkerd/linkerd2-proxy#426) * Reimplement the Lock middleware with tokio::sync (linkerd/linkerd2-proxy#427) * Add the request_errors_total metric (linkerd/linkerd2-proxy#417) * Expose the number of service instances in the proxy (linkerd/linkerd2-proxy#428) * concurrency-limit: Share a limit across Services (linkerd/linkerd2-proxy#429) * profiling: add benchmark and profiling scripts (linkerd/linkerd2-proxy#406) * http-box: Box HTTP payloads via middleware (linkerd/linkerd2-proxy#430) * lock: Generalize to protect a guarded value (linkerd/linkerd2-proxy#431)

olix0r requested a review from a team January 28, 2020 00:09

olix0r self-assigned this Jan 28, 2020

This comment has been minimized.

Sign in to view

olix0r force-pushed the ver/error-metrics branch from 5eba0a0 to da9767b Compare February 18, 2020 01:22

olix0r changed the base branch from ver/error-handling to master February 18, 2020 01:24

olix0r force-pushed the ver/error-metrics branch 4 times, most recently from f685ccc to 6f5ac1a Compare February 18, 2020 01:39

olix0r marked this pull request as ready for review February 18, 2020 01:41

olix0r force-pushed the ver/error-metrics branch from 6f5ac1a to b98e38a Compare February 18, 2020 01:42

deny warnings

348772d

hawkw approved these changes Feb 18, 2020

View reviewed changes

Cargo.lock Outdated Show resolved Hide resolved

restore chrono version

4bb8736

olix0r requested a review from a team February 18, 2020 17:06

kleimkuhler reviewed Feb 18, 2020

View reviewed changes

Split crate into modules

d9fea7f

kleimkuhler approved these changes Feb 18, 2020

View reviewed changes

linkerd/error-metrics/src/service.rs Outdated Show resolved Hide resolved

typo

d5cf282

olix0r merged commit d36f76e into master Feb 18, 2020

olix0r deleted the ver/error-metrics branch February 18, 2020 21:47

kleimkuhler mentioned this pull request Feb 18, 2020

Expose the number of service instances in the proxy #428

Merged

olix0r mentioned this pull request Feb 19, 2020

proxy: v2.86.0 linkerd/linkerd2#4075

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record a metric of proxy errors #417

Record a metric of proxy errors #417

olix0r commented Jan 28, 2020 •

edited

Loading

This comment has been minimized.

hawkw left a comment

kleimkuhler left a comment

kleimkuhler Feb 18, 2020

kleimkuhler Feb 18, 2020

olix0r Feb 18, 2020

kleimkuhler Feb 18, 2020

Record a metric of proxy errors #417

Record a metric of proxy errors #417

Conversation

olix0r commented Jan 28, 2020 • edited Loading

This comment has been minimized.

hawkw left a comment

Choose a reason for hiding this comment

kleimkuhler left a comment

Choose a reason for hiding this comment

kleimkuhler Feb 18, 2020

Choose a reason for hiding this comment

kleimkuhler Feb 18, 2020

Choose a reason for hiding this comment

olix0r Feb 18, 2020

Choose a reason for hiding this comment

kleimkuhler Feb 18, 2020

Choose a reason for hiding this comment

olix0r commented Jan 28, 2020 •

edited

Loading