
Record a metric of proxy errors #417

Merged
merged 5 commits into master on Feb 18, 2020
Conversation

@olix0r olix0r (Member) commented Jan 28, 2020

The proxy handles errors by synthesizing responses; however, depending
on where such an error occurs, it may not be recorded by a metrics
layer.

This change introduces a request_errors_total counter that records
the number/kind of errors encountered. This is immediately helpful for
debugging and diagnostics, but could also be helpful to view in the UI.
We may want to revisit this metric, especially its labels, before relying
on it more prominently. But in the meantime it's extraordinarily useful
as it is.
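For illustration only (the label scheme was still under discussion, so both the help text and the `message` label below are hypothetical), the counter might surface in the proxy's Prometheus exposition along these lines:

```
# HELP request_errors_total Total requests that received synthesized error responses.
# TYPE request_errors_total counter
request_errors_total{message="connect timed out"} 7
```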

@olix0r olix0r requested a review from a team January 28, 2020 00:09
@olix0r olix0r self-assigned this Jan 28, 2020

@olix0r olix0r changed the base branch from ver/error-handling to master February 18, 2020 01:24
@olix0r olix0r force-pushed the ver/error-metrics branch 4 times, most recently from f685ccc to 6f5ac1a Compare February 18, 2020 01:39
@olix0r olix0r marked this pull request as ready for review February 18, 2020 01:41
@hawkw hawkw (Contributor) left a comment
seems very straightforward — lgtm!

Cargo.lock (outdated; resolved)
@olix0r olix0r requested a review from a team February 18, 2020 17:06
@kleimkuhler kleimkuhler (Contributor) left a comment

I have a question about the implementation of error-metrics. It's not a blocker on merging, but once that's resolved, everything else looks good to me.

linkerd/app/core/src/errors.rs (resolved)
Comment on lines 28 to 37
pub struct RecordErrorLayer<L, K: Hash + Eq> {
    label: L,
    errors: Arc<Mutex<IndexMap<K, Counter>>>,
}

pub struct RecordError<L, K: Hash + Eq, S> {
    label: L,
    errors: Arc<Mutex<IndexMap<K, Counter>>>,
    inner: S,
}
Contributor:

What are your thoughts on changing these structs to:

pub struct RecordErrorLayer<L, K: Hash + Eq> {
    label: L,
    registry: Registry<K>,
}

pub struct RecordError<L, K: Hash + Eq, S> {
    label: L,
    registry: Registry<K>,
    inner: S,
}

When reading through this PR, this is what I expected to see, and it's how I thought about the rest of the impls as I read them.
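Since the thread centers on how these structs share counter state, here is a minimal, self-contained sketch of that sharing using std-only stand-ins. The `record` helper and the `Errors` alias are hypothetical names for illustration; the real crate uses `IndexMap`, its own `Counter` type, and wraps an inner tower Service rather than exposing a `record` method.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

type Counter = AtomicU64;
// Shared registry of per-key error counters; each layer/service holds a clone
// of the same Arc, so counts recorded anywhere are visible to the reporter.
type Errors<K> = Arc<Mutex<HashMap<K, Arc<Counter>>>>;

struct RecordError<K: Hash + Eq> {
    errors: Errors<K>,
}

impl<K: Hash + Eq> RecordError<K> {
    // When the wrapped service fails, derive a key from the error and bump
    // (creating on first use) the corresponding counter.
    fn record(&self, key: K) {
        self.errors
            .lock()
            .unwrap()
            .entry(key)
            .or_insert_with(|| Arc::new(AtomicU64::new(0)))
            .fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let errors: Errors<&'static str> = Arc::new(Mutex::new(HashMap::new()));
    let rec = RecordError { errors: Arc::clone(&errors) };
    rec.record("timeout");
    rec.record("timeout");
    let n = errors.lock().unwrap()["timeout"].load(Ordering::Relaxed);
    println!("timeout errors: {}", n);
}
```

Whether the shared map lives behind a named `Registry` type or an `Arc<Mutex<...>>` field is exactly the naming question being debated above; the runtime behavior is the same either way.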

Contributor:
Alternatively, this could be similar to the Registry in #428 where:

type Registry<L> = Arc<Mutex<IndexMap<L, Arc<Metrics>>>>;

Member Author:
Eh, the Registry type is really there as a public handle for getting layers and reporting the results... It's a bit confusing for Registry to produce layers but those layers just hold Registries?

This might all be clearer if I split the types up into separate files?

Contributor:
Yeah, it helps, but I think what I'm reacting to is that each *-metrics crate has its own way of implementing its concept of a Registry, and they're just different enough that I feel like I'm reviewing new impls.

This change, the other open #428 metrics change, and the previous http-metrics crate change all have a similar purpose, but they don't build their metrics-recording services the same way.

I'm okay merging this as is, but as more metrics are added it may be worth watching whether it makes sense to introduce a reusable "metrics registry".

linkerd/error-metrics/src/service.rs (outdated; resolved)
@olix0r olix0r merged commit d36f76e into master Feb 18, 2020
@olix0r olix0r deleted the ver/error-metrics branch February 18, 2020 21:47
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Feb 19, 2020
This release includes the results from continued profiling & performance
analysis. In addition to modifying internals to prevent unwarranted
memory growth, we've introduced new metrics to aid in debugging and
diagnostics: a new `request_errors_total` metric exposes the number of
requests that receive synthesized responses due to proxy errors; and a
suite of `stack_*` metrics expose proxy internals that can help us
identify unexpected behavior.

---

* trace: update `tracing-subscriber` dependency to 0.2.1 (linkerd/linkerd2-proxy#426)
* Reimplement the Lock middleware with tokio::sync (linkerd/linkerd2-proxy#427)
* Add the request_errors_total metric (linkerd/linkerd2-proxy#417)
* Expose the number of service instances in the proxy (linkerd/linkerd2-proxy#428)
* concurrency-limit: Share a limit across Services (linkerd/linkerd2-proxy#429)
* profiling: add benchmark and profiling scripts (linkerd/linkerd2-proxy#406)
* http-box: Box HTTP payloads via middleware (linkerd/linkerd2-proxy#430)
* lock: Generalize to protect a guarded value (linkerd/linkerd2-proxy#431)
3 participants