Expose the number of service instances in the proxy #428
Conversation
When diagnosing issues, it can be difficult to reason about the states of the various caches in the proxy. This change introduces a set of `stack` metrics to help proxy developers diagnose issues. The metrics track stack creations, stack deletions, and the amount of time a given service is not available (via `stack_poll_total_ms`).

```
:; kubectl exec -n burn-http deploy/client-burst -c client -- curl -s localhost:4191/metrics |grep -e ^stack
stack_create_total{direction="outbound",name="fallback.endpoint"} 0
stack_create_total{direction="outbound",name="balance.endpoint"} 1
stack_create_total{direction="outbound",name="logical.dst"} 1
stack_create_total{direction="outbound",name="addr"} 1
stack_create_total{direction="outbound",name="source"} 100
stack_create_total{direction="inbound",name="endpoint"} 0
stack_create_total{direction="inbound",name="route"} 0
stack_create_total{direction="inbound",name="logical"} 0
stack_create_total{direction="inbound",name="source"} 0
stack_drop_total{direction="outbound",name="fallback.endpoint"} 0
stack_drop_total{direction="outbound",name="balance.endpoint"} 0
stack_drop_total{direction="outbound",name="logical.dst"} 0
stack_drop_total{direction="outbound",name="addr"} 0
stack_drop_total{direction="outbound",name="source"} 0
stack_drop_total{direction="inbound",name="endpoint"} 0
stack_drop_total{direction="inbound",name="route"} 0
stack_drop_total{direction="inbound",name="logical"} 0
stack_drop_total{direction="inbound",name="source"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="true"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="true"} 70371
stack_poll_total{direction="outbound",name="logical.dst",ready="true"} 23600
stack_poll_total{direction="outbound",name="addr",ready="true"} 23600
stack_poll_total{direction="outbound",name="source",ready="true"} 23600
stack_poll_total{direction="inbound",name="endpoint",ready="true"} 0
stack_poll_total{direction="inbound",name="route",ready="true"} 0
stack_poll_total{direction="inbound",name="logical",ready="true"} 0
stack_poll_total{direction="inbound",name="source",ready="true"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="false"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="false"} 23175
stack_poll_total{direction="outbound",name="logical.dst",ready="false"} 0
stack_poll_total{direction="outbound",name="addr",ready="false"} 1
stack_poll_total{direction="outbound",name="source",ready="false"} 0
stack_poll_total{direction="inbound",name="endpoint",ready="false"} 0
stack_poll_total{direction="inbound",name="route",ready="false"} 0
stack_poll_total{direction="inbound",name="logical",ready="false"} 0
stack_poll_total{direction="inbound",name="source",ready="false"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="error"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="error"} 0
stack_poll_total{direction="outbound",name="logical.dst",ready="error"} 0
stack_poll_total{direction="outbound",name="addr",ready="error"} 0
stack_poll_total{direction="outbound",name="source",ready="error"} 0
stack_poll_total{direction="inbound",name="endpoint",ready="error"} 0
stack_poll_total{direction="inbound",name="route",ready="error"} 0
stack_poll_total{direction="inbound",name="logical",ready="error"} 0
stack_poll_total{direction="inbound",name="source",ready="error"} 0
stack_poll_total_ms{direction="outbound",name="fallback.endpoint"} 0
stack_poll_total_ms{direction="outbound",name="balance.endpoint"} 40
stack_poll_total_ms{direction="outbound",name="logical.dst"} 0
stack_poll_total_ms{direction="outbound",name="addr"} 2
stack_poll_total_ms{direction="outbound",name="source"} 0
stack_poll_total_ms{direction="inbound",name="endpoint"} 0
stack_poll_total_ms{direction="inbound",name="route"} 0
stack_poll_total_ms{direction="inbound",name="logical"} 0
stack_poll_total_ms{direction="inbound",name="source"} 0
```
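The bookkeeping behind these counters is small: each instrumented service counts its creation and drop, and it only takes a timestamp when a readiness poll transitions between not-ready and ready (or errors). A rough sketch of that pattern, with simplified, hypothetical types (`StackMetrics`, `TrackService`, `record_poll` are illustrative only, not the proxy's actual API or metrics registry):

```rust
// Hypothetical, simplified sketch of the stack-metrics bookkeeping.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Instant;

#[derive(Default)]
struct StackMetrics {
    create_total: AtomicU64,
    drop_total: AtomicU64,
    ready_total: AtomicU64,
    not_ready_total: AtomicU64,
    error_total: AtomicU64,
    poll_millis: AtomicU64,
}

struct TrackService {
    metrics: Arc<StackMetrics>,
    // Set the first time a poll reports not-ready; cleared (and the blocked
    // duration recorded) once the service becomes ready again or errors.
    blocked_since: Option<Instant>,
}

impl TrackService {
    fn new(metrics: Arc<StackMetrics>) -> Self {
        metrics.create_total.fetch_add(1, Ordering::Relaxed);
        Self { metrics, blocked_since: None }
    }

    // Record the outcome of one readiness poll:
    // Ok(true) = ready, Ok(false) = not ready, Err(()) = poll failed.
    fn record_poll(&mut self, outcome: Result<bool, ()>) {
        match outcome {
            Ok(false) => {
                self.metrics.not_ready_total.fetch_add(1, Ordering::Relaxed);
                // Only start the clock on the first not-ready poll.
                if self.blocked_since.is_none() {
                    self.blocked_since = Some(Instant::now());
                }
            }
            Ok(true) => {
                self.metrics.ready_total.fetch_add(1, Ordering::Relaxed);
                if let Some(t0) = self.blocked_since.take() {
                    let not_ready = Instant::now() - t0;
                    self.metrics
                        .poll_millis
                        .fetch_add(not_ready.as_millis() as u64, Ordering::Relaxed);
                }
            }
            Err(()) => {
                self.metrics.error_total.fetch_add(1, Ordering::Relaxed);
                if let Some(t0) = self.blocked_since.take() {
                    let not_ready = Instant::now() - t0;
                    self.metrics
                        .poll_millis
                        .fetch_add(not_ready.as_millis() as u64, Ordering::Relaxed);
                }
            }
        }
    }
}

impl Drop for TrackService {
    fn drop(&mut self) {
        // A dropped service shows up in stack_drop_total.
        self.metrics.drop_total.fetch_add(1, Ordering::Relaxed);
    }
}
```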
🎉
(I've recently read through finagle stats again, and the ease of adding new stats is enviable...)
This seems like something that could potentially be a good use-case for the …
The new metrics seem like they could be very useful; this is great.
I was curious about the potential performance impact of taking the timestamps necessary for the poll duration metric. It would be nice to know if this actually makes a meaningful difference or not...
linkerd/stack-metrics/src/lib.rs
Outdated
                if self.blocked_since.is_none() {
                    self.blocked_since = Some(Instant::now());
                }
                Ok(Async::NotReady)
            }
            Ok(Async::Ready(())) => {
                self.metrics.ready_total.incr();
                if let Some(t0) = self.blocked_since.take() {
                    let not_ready = Instant::now() - t0;
                    self.metrics.poll_millis.add(not_ready.as_millis() as u64);
                }
                Ok(Async::Ready(()))
            }
            Err(e) => {
                self.metrics.error_total.incr();
                if let Some(t0) = self.blocked_since.take() {
                    let not_ready = Instant::now() - t0;
                    self.metrics.poll_millis.add(not_ready.as_millis() as u64);
                }
Any concerns about the performance implications of recording all these timestamps? It seems like if we are instrumenting every stack in the proxy with one of these layers, we are potentially adding a fair number of syscalls in the request path.
It could be good to do some benchmarking and see if there's a noticeable perf impact from this? We might want to add a config to turn off the stack metrics, or something?
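For what it's worth, a quick way to sanity-check the cost would be a micro-benchmark of `Instant::now()` itself, since that is the only extra work added per readiness transition. A minimal sketch using the `criterion` crate (an assumption on my part — not part of this PR or the proxy's existing bench setup, and the file name is hypothetical):

```rust
// benches/instant_now.rs (hypothetical bench file)
// Measures the raw cost of taking a monotonic timestamp.
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::time::Instant;

fn bench_instant_now(c: &mut Criterion) {
    c.bench_function("Instant::now", |b| b.iter(|| black_box(Instant::now())));
}

criterion_group!(benches, bench_instant_now);
criterion_main!(benches);
```

On Linux this typically resolves to a vDSO `clock_gettime` rather than a full syscall, so the per-call cost is usually tens of nanoseconds, but measuring beats guessing.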
How is this different from any of the other latency metrics we record?
Furthermore, this is only engaged when a service is NotReady, which is fairly rare in the scheme of things. I'm not particularly worried about this? Unless you really think we should be?
Change looks good. +1 to @hawkw's question as I had similar thoughts on the `Instant::now()` calls.
Change looks good and that final reorganization helps a lot.
This release includes the results from continued profiling & performance analysis. In addition to modifying internals to prevent unwarranted memory growth, we've introduced new metrics to aid in debugging and diagnostics: a new `request_errors_total` metric exposes the number of requests that receive synthesized responses due to proxy errors; and a suite of `stack_*` metrics expose proxy internals that can help us identify unexpected behavior.

---

* trace: update `tracing-subscriber` dependency to 0.2.1 (linkerd/linkerd2-proxy#426)
* Reimplement the Lock middleware with tokio::sync (linkerd/linkerd2-proxy#427)
* Add the request_errors_total metric (linkerd/linkerd2-proxy#417)
* Expose the number of service instances in the proxy (linkerd/linkerd2-proxy#428)
* concurrency-limit: Share a limit across Services (linkerd/linkerd2-proxy#429)
* profiling: add benchmark and profiling scripts (linkerd/linkerd2-proxy#406)
* http-box: Box HTTP payloads via middleware (linkerd/linkerd2-proxy#430)
* lock: Generalize to protect a guarded value (linkerd/linkerd2-proxy#431)