Commit
[DOC] Add service graph metrics queries (#2815) (#2821)
* Add service graph metrics queries

* Fix broken links

* Added intro

* Update docs/sources/tempo/metrics-generator/service_graphs/metrics-queries.md

* Update metrics-queries.md

Tried a small language tweak

* Apply suggestions from code review

Co-authored-by: Jennifer Villa <[email protected]>

---------

Co-authored-by: Jennifer Villa <[email protected]>
(cherry picked from commit f1c7e6b)

Co-authored-by: Kim Nylander <[email protected]>
github-actions[bot] and knylander-grafana authored Aug 21, 2023
1 parent a48e5bb commit 39a1c6f
Showing 8 changed files with 180 additions and 91 deletions.
Original file line number Diff line number Diff line change
@@ -62,4 +62,4 @@ metrics:
The same service graph metrics can also be generated by Tempo.
This is more efficient and recommended for larger installations.

For additional information about viewing service graph metrics in Grafana and calculating cardinality, refer to the [server side documentation]({{< relref "../../metrics-generator/service_graphs#enable-service-graphs-in-Grafana" >}}).
For additional information about viewing service graph metrics in Grafana and calculating cardinality, refer to the [server side documentation]({{< relref "../../metrics-generator/service_graphs/enable-service-graphs" >}}).
4 changes: 2 additions & 2 deletions docs/sources/tempo/metrics-generator/service-graph-view.md
@@ -28,7 +28,7 @@ You have to enable span metrics and service graph generation on the Grafana back
To use the service graph view, you need:

* Tempo or Grafana Cloud Traces with either 1) the metrics generator enabled and configured or 2) the Grafana Agent enabled and configured to send data to a Prometheus-compatible metrics store
* [Service graphs]({{< relref "../metrics-generator/service_graphs#how-to-run" >}}), which are enabled by default in Grafana
* [Service graphs]({{< relref "../metrics-generator/service_graphs/enable-service-graphs" >}}), which are enabled by default in Grafana
* [Span metrics]({{< relref "../metrics-generator/span_metrics#how-to-run" >}}) enabled in your Tempo data source configuration

The service graph view can be derived from metrics generated by either Tempo's metrics-generator or by the Grafana Agent.
@@ -108,7 +108,7 @@ If you are using the metrics-generator, then it processes traces and generates s
tempo_service_graph_request_total{client="app", server="db"} 20
```

For information about service graphs and how they are calculated, refer to the [Service Graphs documentation]({{< relref "../metrics-generator/service_graphs.md" >}}).
For information about service graphs and how they are calculated, refer to the [Service Graphs documentation]({{< relref "../metrics-generator/service_graphs" >}}).

## Use filters to reveal details

@@ -78,91 +78,3 @@ Additional labels can be included using the `dimensions` configuration option.
Since the service graph processor has to process both sides of an edge,
it needs to process all spans of a trace to function properly.
If spans of a trace are spread out over multiple instances, spans are not paired up reliably.

## Estimate cardinality from traces

Cardinality can pose a problem when you have lots of services.
There isn't a direct formula or solution to this issue.
The following guide should help estimate the cardinality that the feature will generate.

For more information on cardinality, refer to the [Cardinality]({{< relref "./cardinality" >}}) documentation.

### How to estimate the cardinality

The number of edges depends on the number of nodes in the system and the direction of the requests between them.
Let’s call this number hops. Every hop is a unique combination of client + server labels.

For example:
- A system with 3 nodes `(A, B, C)` of which A only calls B and B only calls C will have 2 hops `(A → B, B → C)`
- A system with 3 nodes `(A, B, C)` that all call each other (that is, every link is bidirectional) will have 6 hops `(A → B, B → A, B → C, C → B, A → C, C → A)`

We can’t calculate the number of hops automatically from the number of nodes,
but for a connected system it should be a value between `#services - 1` and `#services * (#services - 1)` (every service calling every other service).

If we know the number of hops in a system, we can calculate the cardinality of the generated
[service graphs]({{< relref "./service_graphs" >}}):

```
traces_service_graph_request_total: #hops
traces_service_graph_request_failed_total: #hops
traces_service_graph_request_server_seconds: 3 buckets * #hops
traces_service_graph_request_client_seconds: 3 buckets * #hops
traces_service_graph_unpaired_spans_total: #services (absolute worst case)
traces_service_graph_dropped_spans_total: #services (absolute worst case)
```

Finally, we get the following cardinality estimation:

```
Sum: 8 * #hops + 2 * #services
```

{{% admonition type="note" %}}
To estimate the number of metrics, refer to the [Dry run metrics generator]({{< relref "./cardinality" >}}) documentation.
{{% /admonition %}}

## How to run

Service graphs are generated in Tempo and pushed to a metrics store.
Grafana can then render them as a graph.
You need all three components (Tempo, a metrics store, and Grafana) to fully use service graphs.

{{% admonition type="note" %}}
Cardinality can pose a problem when you have lots of services.
To learn more about cardinality and how to perform a dry run of the metrics generator, see the [Cardinality documentation]({{< relref "./cardinality" >}}).
{{% /admonition %}}

### Enable service graphs in Tempo/GET

To enable service graphs in Tempo/GET, enable the metrics generator and add an overrides section which enables the `service-graphs` generator. See [here for configuration details]({{< relref "../configuration#metrics-generator" >}}).

### Enable service graphs in Grafana

{{% admonition type="note" %}}
Since Grafana 9.0.4, service graphs have been enabled by default. Prior to Grafana 9.0.4, service graphs were hidden
under the [feature toggle](/docs/grafana/latest/setup-grafana/configure-grafana/#feature_toggles) `tempoServiceGraph`.
{{% /admonition %}}

Configure a Tempo data source's 'Service Graphs' by linking to the Prometheus backend where metrics are being sent:

```
apiVersion: 1
datasources:
  # Prometheus backend where metrics are sent
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: <prometheus-url>
    jsonData:
      httpMethod: GET
    version: 1
  - name: Tempo
    type: tempo
    uid: tempo
    url: <tempo-url>
    jsonData:
      httpMethod: GET
      serviceMap:
        datasourceUid: 'prometheus'
    version: 1
```
@@ -0,0 +1,56 @@
---
aliases:
- /docs/tempo/latest/server_side_metrics/service_graphs/
- /docs/tempo/latest/metrics-generator/service_graphs/
title: Enable service graphs
description: Learn how to enable service graphs
weight: 300
---


## Enable service graphs

Service graphs are generated in Tempo and pushed to a metrics store.
Grafana can then render them as a graph.
You need all three components (Tempo, a metrics store, and Grafana) to fully use service graphs.

{{% admonition type="note" %}}
Cardinality can pose a problem when you have lots of services.
To learn more about cardinality and how to perform a dry run of the metrics generator, see the [Cardinality documentation]({{< relref "../cardinality" >}}).
{{% /admonition %}}

### Enable service graphs in Tempo/GET

To enable service graphs in Tempo or Grafana Enterprise Traces (GET), enable the metrics generator and add an overrides section that enables the `service-graphs` generator.
For more information, refer to the [configuration details]({{< relref "../../configuration#metrics-generator" >}}).

### Enable service graphs in Grafana

{{% admonition type="note" %}}
Since Grafana 9.0.4, service graphs have been enabled by default. Prior to Grafana 9.0.4, service graphs were hidden
under the [feature toggle](/docs/grafana/latest/setup-grafana/configure-grafana/#feature_toggles) `tempoServiceGraph`.
{{% /admonition %}}

Configure a Tempo data source's service graphs by linking to the Prometheus backend where metrics are being sent:

```
apiVersion: 1
datasources:
  # Prometheus backend where metrics are sent
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: <prometheus-url>
    jsonData:
      httpMethod: GET
    version: 1
  - name: Tempo
    type: tempo
    uid: tempo
    url: <tempo-url>
    jsonData:
      httpMethod: GET
      serviceMap:
        datasourceUid: 'prometheus'
    version: 1
```
@@ -0,0 +1,48 @@
---
title: Estimate cardinality from traces
menuTitle: Estimate cardinality
description: Service graphs help you understand the structure of a distributed system and the connections and dependencies between its components.
weight: 300
---

## Estimate cardinality from traces

Cardinality can pose a problem when you have lots of services.
There isn't a direct formula or solution to this issue.
The following guide should help estimate the cardinality that the feature will generate.

For more information on cardinality, refer to the [Cardinality]({{< relref "../cardinality" >}}) documentation.

### How to estimate the cardinality

The number of edges depends on the number of nodes in the system and the direction of the requests between them.
Let’s call this number hops. Every hop is a unique combination of client + server labels.

For example:
- A system with 3 nodes `(A, B, C)` of which A only calls B and B only calls C will have 2 hops `(A → B, B → C)`
- A system with 3 nodes `(A, B, C)` that all call each other (that is, every link is bidirectional) will have 6 hops `(A → B, B → A, B → C, C → B, A → C, C → A)`

We can’t calculate the number of hops automatically from the number of nodes,
but for a connected system it should be a value between `#services - 1` and `#services * (#services - 1)` (every service calling every other service).

If we know the number of hops in a system, we can calculate the cardinality of the generated
[service graphs]({{< relref "../service_graphs" >}}):

```
traces_service_graph_request_total: #hops
traces_service_graph_request_failed_total: #hops
traces_service_graph_request_server_seconds: 3 buckets * #hops
traces_service_graph_request_client_seconds: 3 buckets * #hops
traces_service_graph_unpaired_spans_total: #services (absolute worst case)
traces_service_graph_dropped_spans_total: #services (absolute worst case)
```

Finally, we get the following cardinality estimation:

```
Sum: 8 * #hops + 2 * #services
```
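The estimate above can be reproduced with a few lines of Python. This is a minimal sketch and not part of Tempo: the function name `estimate_cardinality` and its `calls` input (a hypothetical list of client/server pairs) are assumptions made for illustration, and the default of 3 buckets per histogram simply mirrors the figures in the list above.

```python
def estimate_cardinality(calls, buckets=3):
    """Estimate the number of service graph series for a set of calls.

    `calls` is an iterable of (client, server) pairs (hypothetical input).
    """
    hops = set(calls)  # each unique client/server pair is one hop
    services = {name for pair in hops for name in pair}
    # Per hop: request_total, request_failed_total, and two histograms
    # (server_seconds and client_seconds) with `buckets` series each.
    per_hop = 2 + 2 * buckets
    # Per service (worst case): unpaired_spans_total and dropped_spans_total.
    return per_hop * len(hops) + 2 * len(services)


# The two example systems from this page.
chain = [("A", "B"), ("B", "C")]
mesh = [(c, s) for c in "ABC" for s in "ABC" if c != s]

print(estimate_cardinality(chain))
print(estimate_cardinality(mesh))
```

With the default of 3 buckets this evaluates `8 * #hops + 2 * #services`; pass a different `buckets` value if your histograms are configured differently.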

{{% admonition type="note" %}}
To estimate the number of metrics, refer to the [Dry run metrics generator]({{< relref "../cardinality" >}}) documentation.
{{% /admonition %}}
@@ -0,0 +1,73 @@
---
title: Service graph metrics queries
menuTitle: Metrics queries
description: Use PromQL queries to access metrics from service graphs
weight: 300
---

# Service graph metrics queries

A collection of useful PromQL queries for service graphs.

In most cases, users want to see a visual representation of their service graph. Grafana uses the service graph metrics created by Tempo and builds that visual for the user. However, in some cases, users may want to interact with the metrics that define that service graph directly. They may want to, for example, programmatically analyze how their services are interconnected and build downstream applications that use this information.

To help with this, we've provided a collection of useful PromQL queries that can be used to explore service graph metrics.

## Instant queries

An instant query returns a single value for each series, evaluated at the end of the selected time range.
[Instant queries](https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries) are quicker to execute, and their results are often easier to understand, so they are preferable in some scenarios:

![Instant query in Grafana](screenshot-serv-graph-instant-query.png)

### Connectivity between services

Show the total number of calls in the last 7 days for every client/server pair:

```promql
sum(increase(traces_service_graph_request_server_seconds_count{}[7d])) by (server, client) > 0
```

To see only the calls where a single service is the server:

```promql
sum(increase(traces_service_graph_request_server_seconds_count{server="foo"}[7d])) by (client) > 0
```

To see only the calls where a single service is the client:

```promql
sum(increase(traces_service_graph_request_server_seconds_count{client="foo"}[7d])) by (server) > 0
```

In all of the above queries, you can adjust the range (`[7d]`) to change the period over which the total is calculated. For example, to run the same analysis over one day:

```promql
sum(increase(traces_service_graph_request_server_seconds_count{}[1d])) by (server, client) > 0
```
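If you want to run these queries programmatically, the following sketch builds the query strings shown above and the corresponding request URL for the Prometheus HTTP API instant query endpoint (`/api/v1/query`). The helper names and the base URL `http://localhost:9090` are assumptions for illustration; substitute the address of your own Prometheus-compatible backend.

```python
from urllib.parse import urlencode

METRIC = "traces_service_graph_request_server_seconds_count"


def connectivity_query(window="7d", server=None, client=None):
    """Build the connectivity query, optionally fixed to one server or client."""
    labels = (("server", server), ("client", client))
    # Filter on the labels that were given, group by the ones that were not.
    matchers = ", ".join(f'{name}="{value}"' for name, value in labels if value)
    group_by = ", ".join(name for name, value in labels if not value)
    return f"sum(increase({METRIC}{{{matchers}}}[{window}])) by ({group_by}) > 0"


def instant_query_url(base_url, query):
    """Build the URL for an instant query (hypothetical helper)."""
    params = urlencode({"query": query})
    return f"{base_url}/api/v1/query?{params}"


query = connectivity_query(window="1d", server="foo")
print(query)
print(instant_query_url("http://localhost:9090", query))
```

Sending the resulting URL with any HTTP client returns a JSON document; building the URL separately keeps the sketch runnable without a live backend.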

## Range queries

Range queries are useful for calculating service graph information over a time range instead of at a single point in time.

![Range query in Grafana](screenshot-serv-graph-range-query.png)

### Rates over time between services

Taking two of the queries above, we can request the rate over time that any given service acted as the client or server:

```promql
sum(rate(traces_service_graph_request_server_seconds_count{server="foo"}[5m])) by (client) > 0
sum(rate(traces_service_graph_request_server_seconds_count{client="foo"}[5m])) by (server) > 0
```

Notice that the range dropped to `5m`. Each point is now the rate over the preceding 5 minutes, which creates a more responsive graph.
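Outside Grafana, a range query can be sent to the Prometheus HTTP API range query endpoint (`/api/v1/query_range`), which takes `query`, `start`, `end`, and `step` parameters. The sketch below only builds the request URL; the helper name, the base URL, the example timestamps, and the 60 second step are assumptions for illustration.

```python
from urllib.parse import urlencode


def range_query_url(base_url, query, start, end, step="60s"):
    """Build the URL for a range query (hypothetical helper).

    `start` and `end` are Unix timestamps in seconds; `step` is the
    resolution of the returned series.
    """
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


query = (
    "sum(rate(traces_service_graph_request_server_seconds_count"
    '{server="foo"}[5m])) by (client) > 0'
)
end = 1692576000      # example Unix timestamp
start = end - 3600    # one hour earlier
url = range_query_url("http://localhost:9090", query, start, end)
print(url)
```

The `[5m]` range in the query and the `step` parameter are independent: the first controls how much data each point averages over, the second controls how many points are returned.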

### Latency percentiles over time between services

The following query gives latency quantiles for the traffic measured by the rate queries above. Use it to see how the latency between any two services changes over time. In the query, `.9` means we're calculating the 90th percentile. Adjust this value to calculate a different latency percentile (for example, p50, p95, or p99).

```promql
histogram_quantile(.9, sum(rate(traces_service_graph_request_server_seconds_bucket{client="foo"}[5m])) by (server, le))
```
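To build intuition for what the query returns, here is a simplified Python model of how `histogram_quantile` estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. It is a sketch under stated assumptions, not Prometheus source code: the bucket bounds and counts are made up, and it assumes `0 < q <= 1` and a final `+Inf` bucket.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile `q` from (upper_bound, cumulative_count) pairs.

    Assumes 0 < q <= 1 and that the last bucket has an infinite upper bound.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The rank falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    # The first bucket is assumed to start at 0.
    lower, prev = (0.0, 0.0) if i == 0 else buckets[i - 1]
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Made-up cumulative counts for bucket bounds of 0.1s, 0.5s, 1s, and +Inf.
example = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.5, example))
print(histogram_quantile(0.95, example))
```

Because the result is interpolated between bucket bounds, its accuracy depends on how closely the configured buckets match the real latency distribution.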