-
Notifications
You must be signed in to change notification settings - Fork 271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Record a metric of proxy errors #417
Conversation
This comment has been minimized.
This comment has been minimized.
5eba0a0
to
da9767b
Compare
f685ccc
to
6f5ac1a
Compare
The proxy handles errors by synthesizing responses; however, depending on where such an error occurs, it may not be recorded by a metrics layer. This change introduces an `request_errors_total` counter that records the number/kind of errors encountered. This is immediately helpful for debugging and diagnostics, but could also be helpful to view in the UI. We may want to revisit this metric, especialy its labels, before relying on it more prominently. But in the meantime it's extraordinarily useful as it is.
6f5ac1a
to
b98e38a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems very straightforward — lgtm!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a question about the implementation of error-metrics
. It's not a blocker on merging, but once that's resolved, everything else looks good to me.
linkerd/error-metrics/src/lib.rs
Outdated
pub struct RecordErrorLayer<L, K: Hash + Eq> { | ||
label: L, | ||
errors: Arc<Mutex<IndexMap<K, Counter>>>, | ||
} | ||
|
||
pub struct RecordError<L, K: Hash + Eq, S> { | ||
label: L, | ||
errors: Arc<Mutex<IndexMap<K, Counter>>>, | ||
inner: S, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are your thoughts on changing these structs to:
pub struct RecordErrorLayer<L, K: Hash + Eq> {
label: L,
registry: Registry<K>,
}
pub struct RecordError<L, K: Hash + Eq, S> {
label: L,
registry: Registry<K>,
inner: S,
}
When reading through this change, this change is what I expected to see and how I thought about it when understanding the rest of the impl
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, this could be similar to the Registry
in #428 where:
type Registry<L> = Arc<Mutex<IndexMap<L, Arc<Metrics>>>>;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, the Registry
type is really there as a public handle for getting layers and reporting the results... It's a bit confusing for Registry to produce layers but those layers just hold Registries?
This might all be clearer if I split the types up into separate files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea it helps, but I think what I'm reacting to is it seems like each *-metrics
crate has it's own way of implementing what it's concept of a Registry
is and they are just different enough that I feel like I'm reviewing new impls.
Reviewing this change, the other open #428 metrics change, and the previous http-metrics
crate change all have a similar purpose but don't really make their metrics recording services the same way.
I'm okay merging this as is, but it may just be something to be aware of that as more metrics are added, see if it makes sense to introduce a reusable "metrics registry".
This release includes the results from continued profiling & performance analysis. In addition to modifying internals to prevent unwarranted memory growth, we've introduced new metrics to aid in debugging and diagnostics: a new `request_errors_total` metric exposes the number of requests that receive synthesized responses due to proxy errors; and a suite of `stack_*` metrics expose proxy internals that can help us identify unexpected behavior. --- * trace: update `tracing-subscriber` dependency to 0.2.1 (linkerd/linkerd2-proxy#426) * Reimplement the Lock middleware with tokio::sync (linkerd/linkerd2-proxy#427) * Add the request_errors_total metric (linkerd/linkerd2-proxy#417) * Expose the number of service instances in the proxy (linkerd/linkerd2-proxy#428) * concurrency-limit: Share a limit across Services (linkerd/linkerd2-proxy#429) * profiling: add benchmark and profiling scripts (linkerd/linkerd2-proxy#406) * http-box: Box HTTP payloads via middleware (linkerd/linkerd2-proxy#430) * lock: Generalize to protect a guarded value (linkerd/linkerd2-proxy#431)
This release includes the results from continued profiling & performance analysis. In addition to modifying internals to prevent unwarranted memory growth, we've introduced new metrics to aid in debugging and diagnostics: a new `request_errors_total` metric exposes the number of requests that receive synthesized responses due to proxy errors; and a suite of `stack_*` metrics expose proxy internals that can help us identify unexpected behavior. --- * trace: update `tracing-subscriber` dependency to 0.2.1 (linkerd/linkerd2-proxy#426) * Reimplement the Lock middleware with tokio::sync (linkerd/linkerd2-proxy#427) * Add the request_errors_total metric (linkerd/linkerd2-proxy#417) * Expose the number of service instances in the proxy (linkerd/linkerd2-proxy#428) * concurrency-limit: Share a limit across Services (linkerd/linkerd2-proxy#429) * profiling: add benchmark and profiling scripts (linkerd/linkerd2-proxy#406) * http-box: Box HTTP payloads via middleware (linkerd/linkerd2-proxy#430) * lock: Generalize to protect a guarded value (linkerd/linkerd2-proxy#431)
The proxy handles errors by synthesizing responses; however, depending
on where such an error occurs, it may not be recorded by a metrics
layer.
This change introduces a
request_errors_total
counter that recordsthe number/kind of errors encountered. This is immediately helpful for
debugging and diagnostics, but could also be helpful to view in the UI.
We may want to revisit this metric, especialy its labels, before relying
on it more prominently. But in the meantime it's extraordinarily useful
as it is.