
Expose the number of service instances in the proxy #428

Merged: 3 commits into master from ver/stack-metrics on Feb 18, 2020

Conversation

@olix0r olix0r (Member) commented Feb 18, 2020

When diagnosing issues, it can be difficult to reason about the states
of the various caches in the proxy.

This change introduces a set of `stack` metrics to help proxy developers
diagnose issues. The metrics track stack creations (`stack_create_total`),
stack drops (`stack_drop_total`), and the amount of time a given service is
not available (via `stack_poll_total_ms`).

```
:; kubectl exec  -n burn-http deploy/client-burst -c client -- curl -s localhost:4191/metrics |grep -e ^stack
stack_create_total{direction="outbound",name="fallback.endpoint"} 0
stack_create_total{direction="outbound",name="balance.endpoint"} 1
stack_create_total{direction="outbound",name="logical.dst"} 1
stack_create_total{direction="outbound",name="addr"} 1
stack_create_total{direction="outbound",name="source"} 100
stack_create_total{direction="inbound",name="endpoint"} 0
stack_create_total{direction="inbound",name="route"} 0
stack_create_total{direction="inbound",name="logical"} 0
stack_create_total{direction="inbound",name="source"} 0
stack_drop_total{direction="outbound",name="fallback.endpoint"} 0
stack_drop_total{direction="outbound",name="balance.endpoint"} 0
stack_drop_total{direction="outbound",name="logical.dst"} 0
stack_drop_total{direction="outbound",name="addr"} 0
stack_drop_total{direction="outbound",name="source"} 0
stack_drop_total{direction="inbound",name="endpoint"} 0
stack_drop_total{direction="inbound",name="route"} 0
stack_drop_total{direction="inbound",name="logical"} 0
stack_drop_total{direction="inbound",name="source"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="true"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="true"} 70371
stack_poll_total{direction="outbound",name="logical.dst",ready="true"} 23600
stack_poll_total{direction="outbound",name="addr",ready="true"} 23600
stack_poll_total{direction="outbound",name="source",ready="true"} 23600
stack_poll_total{direction="inbound",name="endpoint",ready="true"} 0
stack_poll_total{direction="inbound",name="route",ready="true"} 0
stack_poll_total{direction="inbound",name="logical",ready="true"} 0
stack_poll_total{direction="inbound",name="source",ready="true"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="false"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="false"} 23175
stack_poll_total{direction="outbound",name="logical.dst",ready="false"} 0
stack_poll_total{direction="outbound",name="addr",ready="false"} 1
stack_poll_total{direction="outbound",name="source",ready="false"} 0
stack_poll_total{direction="inbound",name="endpoint",ready="false"} 0
stack_poll_total{direction="inbound",name="route",ready="false"} 0
stack_poll_total{direction="inbound",name="logical",ready="false"} 0
stack_poll_total{direction="inbound",name="source",ready="false"} 0
stack_poll_total{direction="outbound",name="fallback.endpoint",ready="error"} 0
stack_poll_total{direction="outbound",name="balance.endpoint",ready="error"} 0
stack_poll_total{direction="outbound",name="logical.dst",ready="error"} 0
stack_poll_total{direction="outbound",name="addr",ready="error"} 0
stack_poll_total{direction="outbound",name="source",ready="error"} 0
stack_poll_total{direction="inbound",name="endpoint",ready="error"} 0
stack_poll_total{direction="inbound",name="route",ready="error"} 0
stack_poll_total{direction="inbound",name="logical",ready="error"} 0
stack_poll_total{direction="inbound",name="source",ready="error"} 0
stack_poll_total_ms{direction="outbound",name="fallback.endpoint"} 0
stack_poll_total_ms{direction="outbound",name="balance.endpoint"} 40
stack_poll_total_ms{direction="outbound",name="logical.dst"} 0
stack_poll_total_ms{direction="outbound",name="addr"} 2
stack_poll_total_ms{direction="outbound",name="source"} 0
stack_poll_total_ms{direction="inbound",name="endpoint"} 0
stack_poll_total_ms{direction="inbound",name="route"} 0
stack_poll_total_ms{direction="inbound",name="logical"} 0
stack_poll_total_ms{direction="inbound",name="source"} 0
```
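
For readers unfamiliar with how counters like these are typically maintained, here is a minimal sketch of the create/drop half of the accounting (the not-ready timing is visible in the review snippet further down). The type and field names are hypothetical and not taken from the `linkerd/stack-metrics` crate:

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

/// Hypothetical per-stack counters, shared by every service built from one stack.
#[derive(Clone, Default)]
struct StackMetrics {
    create_total: Arc<AtomicU64>,
    drop_total: Arc<AtomicU64>,
}

/// Wraps an inner service so that its construction and destruction are counted.
struct Instrument<S> {
    inner: S,
    metrics: StackMetrics,
}

impl<S> Instrument<S> {
    fn new(inner: S, metrics: StackMetrics) -> Self {
        // Counted once per service built from the stack.
        metrics.create_total.fetch_add(1, Ordering::Relaxed);
        Self { inner, metrics }
    }
}

impl<S> Drop for Instrument<S> {
    fn drop(&mut self) {
        // Counted once per service evicted or otherwise dropped.
        self.metrics.drop_total.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let metrics = StackMetrics::default();
    {
        let _svc = Instrument::new("inner service placeholder", metrics.clone());
    } // `_svc` dropped here
    assert_eq!(metrics.create_total.load(Ordering::Relaxed), 1);
    assert_eq!(metrics.drop_total.load(Ordering::Relaxed), 1);
}
```

Sharing the counters behind `Arc`s keeps the per-service cost to a couple of relaxed atomic increments; a scrape endpoint would then report these values under names like the `stack_create_total` and `stack_drop_total` series shown above.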
@olix0r olix0r requested a review from a team February 18, 2020 15:59
@seanmonstar seanmonstar (Contributor) left a comment

🎉

(I've recently read through Finagle stats again, and the ease of adding new stats is enviable...)

@olix0r olix0r requested a review from a team February 18, 2020 17:06
@hawkw hawkw (Contributor) commented Feb 18, 2020

This seems like something that could potentially be a good use-case for the tracing metrics integration that some folks in the community are supposedly working on, but that's not here yet, and this works now...

@hawkw hawkw (Contributor) left a comment


The new metrics seem like they could be very useful; this is great.

I was curious about the potential performance impact of taking the timestamps necessary for the poll duration metric. It would be nice to know if this actually makes a meaningful difference or not...
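
For reference, a quick way to get a rough upper bound on the cost of the extra timestamps is a standalone micro-benchmark of `Instant::now()`; this is not part of the PR, and the numbers depend heavily on the platform and clock source:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const N: u32 = 1_000_000;
    let start = Instant::now();
    for _ in 0..N {
        // black_box keeps the optimizer from discarding the call; we only
        // care about the cost of taking the timestamp itself.
        black_box(Instant::now());
    }
    let total = start.elapsed();
    println!("{} calls took {:?} (~{:?} per call)", N, total, total / N);
}
```

On Linux, `Instant::now()` usually resolves through the vDSO rather than a full syscall, so the per-call cost tends to be small, but measuring on the target platform is the only way to know for sure.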

linkerd/stack-metrics/src/lib.rs (outdated review threads, resolved)
Comment on lines 119 to 137
```rust
        if self.blocked_since.is_none() {
            self.blocked_since = Some(Instant::now());
        }
        Ok(Async::NotReady)
    }
    Ok(Async::Ready(())) => {
        self.metrics.ready_total.incr();
        if let Some(t0) = self.blocked_since.take() {
            let not_ready = Instant::now() - t0;
            self.metrics.poll_millis.add(not_ready.as_millis() as u64);
        }
        Ok(Async::Ready(()))
    }
    Err(e) => {
        self.metrics.error_total.incr();
        if let Some(t0) = self.blocked_since.take() {
            let not_ready = Instant::now() - t0;
            self.metrics.poll_millis.add(not_ready.as_millis() as u64);
        }
```
@hawkw hawkw (Contributor) commented

Any concerns about the performance implications of recording all these timestamps? It seems like if we are instrumenting every stack in the proxy with one of these layers, we are potentially adding a fair number of syscalls in the request path.

It could be good to do some benchmarking and see if there's a noticeable perf impact from this? We might want to add a config to turn off the stack metrics, or something?
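
As a sketch of the kind of toggle suggested here, assuming a purely hypothetical environment variable (this is not an existing Linkerd configuration option):

```rust
use std::env;

/// Hypothetical opt-out switch: install the stack-metrics layer only when the
/// variable is unset, so the timestamp bookkeeping can be disabled outright if
/// it ever shows up in profiles.
fn stack_metrics_enabled() -> bool {
    env::var("HYPOTHETICAL_DISABLE_STACK_METRICS").is_err()
}

fn main() {
    if stack_metrics_enabled() {
        // ... layer the stack-metrics middleware into the proxy stacks here ...
        println!("stack metrics enabled");
    } else {
        println!("stack metrics disabled");
    }
}
```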

@olix0r olix0r (Member, Author) replied

How is this different from any of the other latency metrics we record?

@olix0r olix0r (Member, Author) commented Feb 18, 2020

Furthermore, this is only engaged when a service is NotReady, which is fairly rare in the scheme of things. I'm not particularly worried about this? Unless you really think we should be?

linkerd/app/core/src/svc.rs (review thread, resolved)
@kleimkuhler kleimkuhler (Contributor) left a comment

Change looks good. +1 to @hawkw's question as I had similar thoughts on the `Instant::now()` calls.

linkerd/stack-metrics/src/lib.rs (outdated review thread, resolved)
@kleimkuhler kleimkuhler (Contributor) left a comment

Change looks good and that final reorganization helps a lot.

@olix0r olix0r merged commit f44d273 into master Feb 18, 2020
@olix0r olix0r deleted the ver/stack-metrics branch February 18, 2020 22:54
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Feb 19, 2020
This release includes the results from continued profiling & performance
analysis. In addition to modifying internals to prevent unwarranted
memory growth, we've introduced new metrics to aid in debugging and
diagnostics: a new `request_errors_total` metric exposes the number of
requests that receive synthesized responses due to proxy errors; and a
suite of `stack_*` metrics expose proxy internals that can help us
identify unexpected behavior.

---

* trace: update `tracing-subscriber` dependency to 0.2.1 (linkerd/linkerd2-proxy#426)
* Reimplement the Lock middleware with tokio::sync (linkerd/linkerd2-proxy#427)
* Add the request_errors_total metric (linkerd/linkerd2-proxy#417)
* Expose the number of service instances in the proxy (linkerd/linkerd2-proxy#428)
* concurrency-limit: Share a limit across Services (linkerd/linkerd2-proxy#429)
* profiling: add benchmark and profiling scripts (linkerd/linkerd2-proxy#406)
* http-box: Box HTTP payloads via middleware (linkerd/linkerd2-proxy#430)
* lock: Generalize to protect a guarded value (linkerd/linkerd2-proxy#431)