
Expose the number of service instances in the proxy #428

Merged: 3 commits into master from ver/stack-metrics on Feb 18, 2020

Conversation

@olix0r olix0r (Member) commented Feb 18, 2020

When diagnosing issues, it can be difficult to reason about the states
of the various caches in the proxy.

This change introduces a set of `stack` metrics to help proxy developers
diagnose issues. The metrics track stack creations (`stack_create_total`),
stack drops (`stack_drop_total`), and the amount of time a given service is
not available (via `stack_poll_total_ms`).

```
:; kubectl exec  -n burn-http deploy/client-burst -c client -- curl -s localhost:4191/metrics |grep -e ^stack
stack_create_total{direction="outbound",name="fallback.endpoint"} 0
stack_create_total{direction="outbound",name="balance.endpoint"} 1
stack_create_total{direction="outbound",name="logical.dst"} 1
stack_create_total{direction="outbound",name="addr"} 1
stack_create_total{direction="outbound",name="source"} 100
stack_create_total{direction="inbound",name="endpoint"} 0
stack_create_total{direction="inbound",name="route"} 0
stack_create_total{direction="inbound",name="logical"} 0
stack_create_total{direction="inbound",name="source"} 0
stack_drop_total{direction="outbound",name="fallback.endpoint"} 0
stack_drop_total{direction="outbound",name="balance.endpoint"} 0
stack_drop_total{direction="outbound",name="logical.dst"} 0
stack_drop_total{direction="outbound",name="addr"} 0
stack_drop_total{direction="outbound",name="source"} 0
stack_drop_total{direction="inbound",name="endpoint"} 0
stack_drop_total{direction="inbound",name="route"} 0
stack_drop_total{direction="inbound",name="logical"} 0
stack_drop_total{direction="inbound",name="source"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="true"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="true"} 70371
stack_poll_total{direction="outbound",name="logical.dst",ready="true"} 23600
stack_poll_total{direction="outbound",name="addr",ready="true"} 23600
stack_poll_total{direction="outbound",name="source",ready="true"} 23600
stack_poll_total{direction="inbound",name="endpoint",ready="true"} 0
stack_poll_total{direction="inbound",name="route",ready="true"} 0
stack_poll_total{direction="inbound",name="logical",ready="true"} 0
stack_poll_total{direction="inbound",name="source",ready="true"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="false"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="false"} 23175
stack_poll_total{direction="outbound",name="logical.dst",ready="false"} 0
stack_poll_total{direction="outbound",name="addr",ready="false"} 1
stack_poll_total{direction="outbound",name="source",ready="false"} 0
stack_poll_total{direction="inbound",name="endpoint",ready="false"} 0
stack_poll_total{direction="inbound",name="route",ready="false"} 0
stack_poll_total{direction="inbound",name="logical",ready="false"} 0
stack_poll_total{direction="inbound",name="source",ready="false"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="error"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="error"} 0
stack_poll_total{direction="outbound",name="logical.dst",ready="error"} 0
stack_poll_total{direction="outbound",name="addr",ready="error"} 0
stack_poll_total{direction="outbound",name="source",ready="error"} 0
stack_poll_total{direction="inbound",name="endpoint",ready="error"} 0
stack_poll_total{direction="inbound",name="route",ready="error"} 0
stack_poll_total{direction="inbound",name="logical",ready="error"} 0
stack_poll_total{direction="inbound",name="source",ready="error"} 0
stack_poll_total_ms{direction="outbound",name="fallback.endpoint"} 0
stack_poll_total_ms{direction="outbound",name="balance.endpoint"} 40
stack_poll_total_ms{direction="outbound",name="logical.dst"} 0
stack_poll_total_ms{direction="outbound",name="addr"} 2
stack_poll_total_ms{direction="outbound",name="source"} 0
stack_poll_total_ms{direction="inbound",name="endpoint"} 0
stack_poll_total_ms{direction="inbound",name="route"} 0
stack_poll_total_ms{direction="inbound",name="logical"} 0
stack_poll_total_ms{direction="inbound",name="source"} 0
```
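
For readers unfamiliar with how counters like these are typically maintained, here is a minimal sketch of the create/drop half of the accounting (the not-ready timing is visible in the review snippet further down). The type and field names are hypothetical and not taken from the `linkerd/stack-metrics` crate:

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

/// Hypothetical per-stack counters, shared by every service built from one stack.
#[derive(Clone, Default)]
struct StackMetrics {
    create_total: Arc<AtomicU64>,
    drop_total: Arc<AtomicU64>,
}

/// Wraps an inner service so that its construction and destruction are counted.
struct Instrument<S> {
    inner: S,
    metrics: StackMetrics,
}

impl<S> Instrument<S> {
    fn new(inner: S, metrics: StackMetrics) -> Self {
        // Counted once per service built from the stack.
        metrics.create_total.fetch_add(1, Ordering::Relaxed);
        Self { inner, metrics }
    }
}

impl<S> Drop for Instrument<S> {
    fn drop(&mut self) {
        // Counted once per service evicted or otherwise dropped.
        self.metrics.drop_total.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let metrics = StackMetrics::default();
    {
        let _svc = Instrument::new("inner service placeholder", metrics.clone());
    } // `_svc` dropped here
    assert_eq!(metrics.create_total.load(Ordering::Relaxed), 1);
    assert_eq!(metrics.drop_total.load(Ordering::Relaxed), 1);
}
```

Sharing the counters behind `Arc`s keeps the per-service cost to a couple of relaxed atomic increments; a scrape endpoint would then report these values under names like the `stack_create_total` and `stack_drop_total` series shown above.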
@olix0r olix0r requested a review from a team February 18, 2020 15:59
@seanmonstar seanmonstar (Contributor) left a comment

🎉

(I've recently read through Finagle stats again, and the ease of adding new stats is enviable...)

@olix0r olix0r requested a review from a team February 18, 2020 17:06
@hawkw hawkw (Contributor) commented Feb 18, 2020

This seems like something that could potentially be a good use-case for the tracing metrics integration that some folks in the community are supposedly working on, but that's not here yet, and this works now...

@hawkw hawkw (Contributor) left a comment


The new metrics seem like they could be very useful; this is great.

I was curious about the potential performance impact of taking the timestamps necessary for the poll duration metric. It would be nice to know if this actually makes a meaningful difference or not...
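
For reference, a quick way to get a rough upper bound on the cost of the extra timestamps is a standalone micro-benchmark of `Instant::now()`; this is not part of the PR, and the numbers depend heavily on the platform and clock source:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const N: u32 = 1_000_000;
    let start = Instant::now();
    for _ in 0..N {
        // black_box keeps the optimizer from discarding the call; we only
        // care about the cost of taking the timestamp itself.
        black_box(Instant::now());
    }
    let total = start.elapsed();
    println!("{} calls took {:?} (~{:?} per call)", N, total, total / N);
}
```

On Linux, `Instant::now()` usually resolves through the vDSO rather than a full syscall, so the per-call cost tends to be small, but measuring on the target platform is the only way to know for sure.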

linkerd/stack-metrics/src/lib.rs (outdated review threads, resolved)
Comment on lines 119 to 137
```rust
        if self.blocked_since.is_none() {
            self.blocked_since = Some(Instant::now());
        }
        Ok(Async::NotReady)
    }
    Ok(Async::Ready(())) => {
        self.metrics.ready_total.incr();
        if let Some(t0) = self.blocked_since.take() {
            let not_ready = Instant::now() - t0;
            self.metrics.poll_millis.add(not_ready.as_millis() as u64);
        }
        Ok(Async::Ready(()))
    }
    Err(e) => {
        self.metrics.error_total.incr();
        if let Some(t0) = self.blocked_since.take() {
            let not_ready = Instant::now() - t0;
            self.metrics.poll_millis.add(not_ready.as_millis() as u64);
        }
```
@hawkw hawkw (Contributor) commented

Any concerns about the performance implications of recording all these timestamps? It seems like if we are instrumenting every stack in the proxy with one of these layers, we are potentially adding a fair number of syscalls in the request path.

It could be good to do some benchmarking and see if there's a noticeable perf impact from this? We might want to add a config to turn off the stack metrics, or something?
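
As a sketch of the kind of toggle suggested here, assuming a purely hypothetical environment variable (this is not an existing Linkerd configuration option):

```rust
use std::env;

/// Hypothetical opt-out switch: install the stack-metrics layer only when the
/// variable is unset, so the timestamp bookkeeping can be disabled outright if
/// it ever shows up in profiles.
fn stack_metrics_enabled() -> bool {
    env::var("HYPOTHETICAL_DISABLE_STACK_METRICS").is_err()
}

fn main() {
    if stack_metrics_enabled() {
        // ... layer the stack-metrics middleware into the proxy stacks here ...
        println!("stack metrics enabled");
    } else {
        println!("stack metrics disabled");
    }
}
```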

@olix0r olix0r (Member, Author) replied

How is this different from any of the other latency metrics we record?

@olix0r olix0r (Member, Author) commented Feb 18, 2020

Furthermore, this is only engaged when a service is NotReady, which is fairly rare in the scheme of things. I'm not particularly worried about this? Unless you really think we should be?

linkerd/app/core/src/svc.rs (review thread, resolved)
@kleimkuhler kleimkuhler (Contributor) left a comment

Change looks good. +1 to @hawkw's question as I had similar thoughts on the `Instant::now()` calls.

linkerd/stack-metrics/src/lib.rs (outdated review thread, resolved)
@kleimkuhler kleimkuhler (Contributor) left a comment

Change looks good and that final reorganization helps a lot.

@olix0r olix0r merged commit f44d273 into master Feb 18, 2020
@olix0r olix0r deleted the ver/stack-metrics branch February 18, 2020 22:54
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Feb 19, 2020
This release includes the results from continued profiling & performance
analysis. In addition to modifying internals to prevent unwarranted
memory growth, we've introduced new metrics to aid in debugging and
diagnostics: a new `request_errors_total` metric exposes the number of
requests that receive synthesized responses due to proxy errors; and a
suite of `stack_*` metrics expose proxy internals that can help us
identify unexpected behavior.

---

* trace: update `tracing-subscriber` dependency to 0.2.1 (linkerd/linkerd2-proxy#426)
* Reimplement the Lock middleware with tokio::sync (linkerd/linkerd2-proxy#427)
* Add the request_errors_total metric (linkerd/linkerd2-proxy#417)
* Expose the number of service instances in the proxy (linkerd/linkerd2-proxy#428)
* concurrency-limit: Share a limit across Services (linkerd/linkerd2-proxy#429)
* profiling: add benchmark and profiling scripts (linkerd/linkerd2-proxy#406)
* http-box: Box HTTP payloads via middleware (linkerd/linkerd2-proxy#430)
* lock: Generalize to protect a guarded value (linkerd/linkerd2-proxy#431)