[DOC] Add service graph metrics queries (#2815) (#2821)

* Add service graph metrics queries
* Fix broken links
* Added intro
* Update docs/sources/tempo/metrics-generator/service_graphs/metrics-queries.md
* Update metrics-queries.md

  Tried a small language tweak
* Apply suggestions from code review

  Co-authored-by: Jennifer Villa <[email protected]>

---------

Co-authored-by: Jennifer Villa <[email protected]>
(cherry picked from commit f1c7e6b)
Co-authored-by: Kim Nylander <[email protected]>
grafana · Aug 21, 2023 · 39a1c6f
1 parent a48e5bb
commit 39a1c6f
Showing 8 changed files with 180 additions and 91 deletions.
diff --git a/docs/sources/tempo/configuration/grafana-agent/service-graphs.md b/docs/sources/tempo/configuration/grafana-agent/service-graphs.md
@@ -62,4 +62,4 @@ metrics:
 The same service graph metrics can also be generated by Tempo.
 This is more efficient and recommended for larger installations.
 
-For additional information about viewing service graph metrics in Grafana and calculating cardinality, refer to the [server side documentation]({{< relref "../../metrics-generator/service_graphs#enable-service-graphs-in-Grafana" >}}).
+For additional information about viewing service graph metrics in Grafana and calculating cardinality, refer to the [server side documentation]({{< relref "../../metrics-generator/service_graphs/enable-service-graphs" >}}).
diff --git a/docs/sources/tempo/metrics-generator/service-graph-view.md b/docs/sources/tempo/metrics-generator/service-graph-view.md
@@ -28,7 +28,7 @@ You have to enable span metrics and service graph generation on the Grafana back
 To use the service graph view, you need:
 
 * Tempo or Grafana Cloud Traces with either 1) the metrics generator enabled and configured or 2) the Grafana Agent enabled and configured to send data to a Prometheus-compatible metrics store
-* [Services graphs]({{< relref "../metrics-generator/service_graphs#how-to-run" >}}), which are enabled by default in Grafana
+* [Services graphs]({{< relref "../metrics-generator/service_graphs/enable-service-graphs" >}}), which are enabled by default in Grafana
 * [Span metrics]({{< relref "../metrics-generator/span_metrics#how-to-run" >}}) enabled in your Tempo data source configuration
 
 The service graph view can be derived from metrics generated by either Tempo's metrics-generator or by the Grafana Agent.
@@ -108,7 +108,7 @@ If you are using the metrics-generator, then it processes traces and generates s
 tempo_service_graph_request_total{client="app", server="db"} 20
 ```
 
-For information about service graphs and how they are calculated, refer to the [Service Graphs documentation]({{< relref "../metrics-generator/service_graphs.md" >}}).
+For information about service graphs and how they are calculated, refer to the [Service Graphs documentation]({{< relref "../metrics-generator/service_graphs" >}}).
 
 ## Use filters to reveal details
 

diff --git a/...tempo/metrics-generator/service_graphs.md → ...etrics-generator/service_graphs/_index.md b/...tempo/metrics-generator/service_graphs.md → ...etrics-generator/service_graphs/_index.md
@@ -78,91 +78,3 @@ Additional labels can be included using the `dimensions` configuration option.
 Since the service graph processor has to process both sides of an edge,
 it needs to process all spans of a trace to function properly.
 If spans of a trace are spread out over multiple instances, spans are not paired up reliably.
-
-## Estimate cardinality from traces
-
-Cardinality can pose a problem when you have lots of services.
-There isn't a direct formula or solution to this issue.
-The following guide should help estimate the cardinality that the feature will generate.
-
-For more information on cardinality, refer to the [Cardinality]({{< relref "./cardinality" >}}) documentation.
-
-### How to estimate the cardinality
-
-The amount of edges depends on the number of nodes in the system and the direction of the requests between them.
-Let’s call this amount hops. Every hop will be a unique combination of client + server labels.
-
-For example:
-- A system with 3 nodes `(A, B, C)` of which A only calls B and B only calls C will have 2 hops `(A → B, B → C)`
-- A system with 3 nodes `(A, B, C)` that call each other (i.e., all bidirectional link) will have 6 hops `(A → B, B → A, B → C, C → B, A → C, C → A)`
-
-We can’t calculate the amount of hops automatically based upon the nodes,
-but it should be a value between `#services - 1` and `#services!`.
-
-If we know the amount of hops in a system, we can calculate the cardinality of the generated
-[service graphs]({{< relref "./service_graphs" >}}):
-
-```
-  traces_service_graph_request_total: #hops
-  traces_service_graph_request_failed_total: #hops
-  traces_service_graph_request_server_seconds: 3 buckets * #hops
-  traces_service_graph_request_client_seconds: 3 buckets * #hops
-  traces_service_graph_unpaired_spans_total: #services (absolute worst case)
-  traces_service_graph_dropped_spans_total: #services (absolute worst case)
-```
-
-Finally, we get the following cardinality estimation:
-
-```
-  Sum: 8 * #hops + 2 * #services
-```
-
-{{% admonition type="note" %}}
-To estimate the number of metrics, refer to the [Dry run metrics generator]({{< relref "./cardinality" >}}) documentation.
-{{% /admonition %}}
-
-## How to run
-
-Service graphs are generated in Tempo and pushed to a metrics storage.
-Then, they can be represented in Grafana as a graph.
-You will need those components to fully use service graphs.
-
-{{% admonition type="note" %}}
-Cardinality can pose a problem when you have lots of services.
-To learn more about cardinality and how to perform a dry run of the metrics generator, see the [Cardinality documentation]({{< relref "./cardinality" >}}).
-{{% /admonition %}}
-
-### Enable service graphs in Tempo/GET
-
-To enable service graphs in Tempo/GET, enable the metrics generator and add an overrides section which enables the `service-graphs` generator. See [here for configuration details]({{< relref "../configuration#metrics-generator" >}}).
-
-### Enable service graphs in Grafana
-
-{{% admonition type="note" %}}
-Since Grafana 9.0.4, service graphs have been enabled by default. Prior to Grafana 9.0.4, service graphs were hidden
-under the [feature toggle](/docs/grafana/latest/setup-grafana/configure-grafana/#feature_toggles) `tempoServiceGraph`.
-{{% /admonition %}}
-
-Configure a Tempo data source's 'Service Graphs' by linking to the Prometheus backend where metrics are being sent:
-
-```
-apiVersion: 1
-datasources:
-  # Prometheus backend where metrics are sent
-  - name: Prometheus
-    type: prometheus
-    uid: prometheus
-    url: <prometheus-url>
-    jsonData:
-        httpMethod: GET
-    version: 1
-  - name: Tempo
-    type: tempo
-    uid: tempo
-    url: <tempo-url>
-    jsonData:
-      httpMethod: GET
-      serviceMap:
-        datasourceUid: 'prometheus'
-    version: 1
-```
diff --git a/docs/sources/tempo/metrics-generator/service_graphs/enable-service-graphs.md b/docs/sources/tempo/metrics-generator/service_graphs/enable-service-graphs.md
@@ -0,0 +1,56 @@
+---
+aliases:
+- /docs/tempo/latest/server_side_metrics/service_graphs/
+- /docs/tempo/latest/metrics-generator/service_graphs/
+title: Enable service graphs
+description: Learn how to enable service graphs
+weight: 300
+---
+
+
+## Enable service graphs
+
+Service graphs are generated in Tempo and pushed to a metrics storage.
+Then, they can be represented in Grafana as a graph.
+You will need those components to fully use service graphs.
+
+{{% admonition type="note" %}}
+Cardinality can pose a problem when you have lots of services.
+To learn more about cardinality and how to perform a dry run of the metrics generator, see the [Cardinality documentation]({{< relref "../cardinality" >}}).
+{{% /admonition %}}
+
+### Enable service graphs in Tempo/GET
+
+To enable service graphs in Tempo/GET, enable the metrics generator and add an overrides section which enables the `service-graphs` generator.
+For more information, refer to the [configuration details]({{< relref "../../configuration#metrics-generator" >}}).
+
+### Enable service graphs in Grafana
+
+{{% admonition type="note" %}}
+Since Grafana 9.0.4, service graphs have been enabled by default. Prior to Grafana 9.0.4, service graphs were hidden
+under the [feature toggle](/docs/grafana/latest/setup-grafana/configure-grafana/#feature_toggles) `tempoServiceGraph`.
+{{% /admonition %}}
+
+Configure a Tempo data source's service graphs by linking to the Prometheus backend where metrics are being sent:
+
+```
+apiVersion: 1
+datasources:
+  # Prometheus backend where metrics are sent
+  - name: Prometheus
+    type: prometheus
+    uid: prometheus
+    url: <prometheus-url>
+    jsonData:
+        httpMethod: GET
+    version: 1
+  - name: Tempo
+    type: tempo
+    uid: tempo
+    url: <tempo-url>
+    jsonData:
+      httpMethod: GET
+      serviceMap:
+        datasourceUid: 'prometheus'
+    version: 1
+```
diff --git a/docs/sources/tempo/metrics-generator/service_graphs/estimate-cardinality.md b/docs/sources/tempo/metrics-generator/service_graphs/estimate-cardinality.md
@@ -0,0 +1,48 @@
+---
+title: Estimate cardinality from traces
+menuTitle: Estimate cardinality
+description: Service graphs help you understand the structure of a distributed system and the connections and dependencies between its components.
+weight: 300
+---
+
+## Estimate cardinality from traces
+
+Cardinality can pose a problem when you have lots of services.
+There isn't a direct formula or solution to this issue.
+The following guide should help estimate the cardinality that the feature will generate.
+
+For more information on cardinality, refer to the [Cardinality]({{< relref "../cardinality" >}}) documentation.
+
+### How to estimate the cardinality
+
+The amount of edges depends on the number of nodes in the system and the direction of the requests between them.
+Let’s call this amount hops. Every hop will be a unique combination of client + server labels.
+
+For example:
+- A system with 3 nodes `(A, B, C)` of which A only calls B and B only calls C will have 2 hops `(A → B, B → C)`
+- A system with 3 nodes `(A, B, C)` that call each other (i.e., all bidirectional link) will have 6 hops `(A → B, B → A, B → C, C → B, A → C, C → A)`
+
+We can’t calculate the amount of hops automatically based upon the nodes,
+but it should be a value between `#services - 1` and `#services!`.
+
+If we know the amount of hops in a system, we can calculate the cardinality of the generated
+[service graphs]({{< relref "../service_graphs" >}}):
+
+```
+  traces_service_graph_request_total: #hops
+  traces_service_graph_request_failed_total: #hops
+  traces_service_graph_request_server_seconds: 3 buckets * #hops
+  traces_service_graph_request_client_seconds: 3 buckets * #hops
+  traces_service_graph_unpaired_spans_total: #services (absolute worst case)
+  traces_service_graph_dropped_spans_total: #services (absolute worst case)
+```
+
+Finally, we get the following cardinality estimation:
+
+```
+  Sum: 8 * #hops + 2 * #services
+```
+
+{{% admonition type="note" %}}
+To estimate the number of metrics, refer to the [Dry run metrics generator]({{< relref "../cardinality" >}}) documentation.
+{{% /admonition %}}
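
The per-metric counts in `estimate-cardinality.md` can be sanity-checked with a short calculation. The following is a minimal sketch and is not part of the commit: the function name and the `buckets` parameter are hypothetical, and the default of 3 buckets simply mirrors the figure used in the doc rather than any particular Tempo configuration.

```python
def estimate_service_graph_cardinality(hops: int, services: int, buckets: int = 3) -> int:
    """Estimate the number of series from the per-metric counts listed in the doc.

    hops     -- number of unique client -> server pairs (edges)
    services -- number of services (nodes)
    buckets  -- histogram buckets per edge (assumed to be 3, as in the doc)
    """
    if hops < 0 or services < 0 or buckets < 0:
        raise ValueError("hops, services, and buckets must be non-negative")

    per_metric = {
        "traces_service_graph_request_total": hops,
        "traces_service_graph_request_failed_total": hops,
        "traces_service_graph_request_server_seconds": buckets * hops,
        "traces_service_graph_request_client_seconds": buckets * hops,
        # Worst case according to the doc: one series per service.
        "traces_service_graph_unpaired_spans_total": services,
        "traces_service_graph_dropped_spans_total": services,
    }
    return sum(per_metric.values())


# The two 3-node examples from the doc: a chain (2 hops) and a fully
# bidirectional system (6 hops).
chain = estimate_service_graph_cardinality(hops=2, services=3)
full_mesh = estimate_service_graph_cardinality(hops=6, services=3)
print(chain, full_mesh)
```

With the default of 3 buckets, this reduces to the doc's `8 * #hops + 2 * #services`.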
diff --git a/docs/sources/tempo/metrics-generator/service_graphs/metrics-queries.md b/docs/sources/tempo/metrics-generator/service_graphs/metrics-queries.md
@@ -0,0 +1,73 @@
+---
+title: Service graph metrics queries
+menuTitle: Metrics queries
+description: Use PromQL queries to access metrics from service graphs
+weight: 300
+---
+
+# Service graph metrics queries
+
+A collection of useful PromQL queries for service graphs.
+
+In most cases, users want to see a visual representation of their service graph. Grafana uses the service graph metrics created by Tempo and builds that visual for the user. However, in some cases, users may want to interact with the metrics that define that service graph directly. They may want to, for example, programmatically analyze how their services are interconnected and build downstream applications that use this information. 
+
+To help with this, we've provided a collection of useful PromQL queries that can be used to explore service graph metrics. 
+
+## Instant queries
+
+An instant query will give a single value at the end of the selected time range.
+[Instant queries](https://prometheus.io/docs/prometheus/latest/querying/api/#instant-queries) are quicker to execute, and their results are often easier to understand, so they are preferable in some scenarios:
+
+![Instant query in Grafana](screenshot-serv-graph-instant-query.png)
+
+### Connectivity between services
+
+Show me the total calls in the last 7 days for every client/server pair:
+
+```promql
+sum(increase(traces_service_graph_request_server_seconds_count{}[7d])) by (server, client) > 0
+```
+
+If you'd like to only see when a single service is the server:
+
+```promql
+sum(increase(traces_service_graph_request_server_seconds_count{server="foo"}[7d])) by (client) > 0
+```
+
+If you'd like to only see when a single service is the client:
+
+```promql
+sum(increase(traces_service_graph_request_server_seconds_count{client="foo"}[7d])) by (server) > 0
+```
+
+In all of the above queries, you can adjust the range selector (for example, `[7d]`) to change the period the total is calculated over. For example, to run the same analysis over one day:
+
+```promql
+sum(increase(traces_service_graph_request_server_seconds_count{}[1d])) by (server, client) > 0
+```
+
+## Range queries
+
+Range queries calculate service graph metrics over a time range instead of at a single point in time.
+
+![Range query in Grafana](screenshot-serv-graph-range-query.png)
+
+### Rates over time between services
+
+Taking two of the queries above, we can request the rate over time that any given service acted as the client or server:
+
+```promql
+sum(rate(traces_service_graph_request_server_seconds_count{server="foo"}[5m])) by (client) > 0
+
+sum(rate(traces_service_graph_request_server_seconds_count{client="foo"}[5m])) by (server) > 0
+```
+
+Notice that the interval dropped to 5m. Each point is now the rate over the preceding 5 minutes, which creates a more responsive graph.
+
+### Latency percentiles over time between services
+
+These queries give latency quantiles for the traffic measured by the rate queries above, which shows how latency between any two services changes over time. In the following query, `.9` selects the 90th percentile. Adjust this value to calculate a different percentile (for example, `.5` for p50, `.95` for p95, or `.99` for p99).
+
+```promql
+histogram_quantile(.9, sum(rate(traces_service_graph_request_server_seconds_bucket{client="foo"}[5m])) by (server, le))
+```
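
The new page mentions programmatically analyzing how services are interconnected. As a rough illustration (not part of the commit), the sketch below builds the URL for the 7-day connectivity query against the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and turns a response into a client-to-server adjacency map. The helper names, the `http://localhost:9090` base URL, and the sample response values are assumptions made for this example; no network request is made.

```python
from collections import defaultdict
from urllib.parse import urlencode


def build_instant_query_url(base_url: str, window: str = "7d") -> str:
    """Build an instant-query URL for the client/server connectivity query."""
    query = (
        "sum(increase(traces_service_graph_request_server_seconds_count{}"
        f"[{window}])) by (server, client) > 0"
    )
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def edges_from_response(payload: dict) -> dict:
    """Convert an instant-query vector result into {client: {server: calls}}."""
    if payload.get("status") != "success":
        raise ValueError("query did not succeed")
    graph = defaultdict(dict)
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # Prometheus returns the value as a string
        graph[labels["client"]][labels["server"]] = float(value)
    return dict(graph)


# Hand-written sample in the shape of an instant-query response (illustrative values).
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"client": "app", "server": "db"}, "value": [1692576000, "20"]},
            {"metric": {"client": "lb", "server": "app"}, "value": [1692576000, "35"]},
        ],
    },
}

url = build_instant_query_url("http://localhost:9090")
edges = edges_from_response(SAMPLE_RESPONSE)
print(url)
print(edges)
```

In a real setup, the URL would be fetched with an HTTP client and the decoded JSON body passed to `edges_from_response`; authentication and error handling depend on the metrics backend and are left out here.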
diff --git a/.../tempo/metrics-generator/service_graphs/screenshot-serv-graph-instant-query.png b/.../tempo/metrics-generator/service_graphs/screenshot-serv-graph-instant-query.png
diff --git a/...es/tempo/metrics-generator/service_graphs/screenshot-serv-graph-range-query.png b/...es/tempo/metrics-generator/service_graphs/screenshot-serv-graph-range-query.png