
Add doc for max_span_attr_byte and restructure troubleshoot doc #4551

Merged

Changes from 6 commits
26 changes: 20 additions & 6 deletions docs/sources/tempo/configuration/_index.md
@@ -19,13 +19,14 @@ The Tempo configuration options include:
- [Use environment variables in the configuration](#use-environment-variables-in-the-configuration)
- [Server](#server)
- [Distributor](#distributor)
- [Set max attribute size to help control out of memory errors](#set-max-attribute-size-to-help-control-out-of-memory-errors)
- [Ingester](#ingester)
- [Metrics-generator](#metrics-generator)
- [Query-frontend](#query-frontend)
- [Limit query size to improve performance and stability](#limit-query-size-to-improve-performance-and-stability)
- [Limit the spans per spanset](#limit-the-spans-per-spanset)
- [Cap the maximum query length](#cap-the-maximum-query-length)
- [Querier](#querier)
- [Cap the maximum query length](#cap-the-maximum-query-length)
- [Querier](#querier)
- [Compactor](#compactor)
- [Storage](#storage)
- [Local storage recommendations](#local-storage-recommendations)
@@ -251,6 +252,19 @@ distributor:
[stale_duration: <duration> | default = 15m0s]
```

### Set max attribute size to help control out of memory errors

Tempo queriers can run out of memory when fetching traces that have spans with very large attributes.
This issue has been observed when trying to fetch a single trace using the [`tracebyID` endpoint](https://grafana.com/docs/tempo/<TEMPO_VERSION>/api_docs/#query).
For example, a trace might have relatively few spans (roughly 500) but a large overall size (approximately 250 KB) because some of those spans carry attributes with very large values.

To avoid these out-of-memory crashes, use `max_span_attr_byte` to limit the maximum allowable size of any individual attribute.
Any key or value that exceeds the configured limit is truncated before it's stored.
The default value is `2048`.

Use the `tempo_distributor_attributes_truncated_total` metric to track how many attributes are truncated.
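As a minimal sketch (assuming the setting sits in the distributor block, as described in this section), the configuration looks like this:

```yaml
distributor:
  # Maximum allowable size, in bytes, of any individual attribute key or value.
  # Keys and values larger than this are truncated before the span is stored.
  # Setting this to 0 disables the check.
  max_span_attr_byte: 2048
```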

## Ingester

For more information on configuration options, refer to [this file](https://github.com/grafana/tempo/blob/main/modules/ingester/config.go).
@@ -315,7 +329,7 @@ If you want to enable metrics-generator for your Grafana Cloud account, refer to
You can use `metrics_ingestion_time_range_slack` to limit metrics generation to spans whose end times fall within the configured duration.
In Grafana Cloud, this value defaults to 30 seconds, so any span sent to the metrics-generator with an end time more than 30 seconds in the past is discarded or rejected.
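As a rough sketch (assuming the setting sits directly under the metrics-generator block; verify against the full configuration block that follows), it looks like this:

```yaml
metrics_generator:
  # Spans with an end time older than this duration are discarded or rejected
  # before metrics generation.
  metrics_ingestion_time_range_slack: 30s
```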

For more information about the `local-blocks` configuration option, refer to [TraceQL metrics](https://grafana.com/docs/tempo/latest/operations/traceql-metrics/#configure-the-local-blocks-processor).
For more information about the `local-blocks` configuration option, refer to [TraceQL metrics](https://grafana.com/docs/tempo/<TEMPO_VERSION>/operations/traceql-metrics/#configure-the-local-blocks-processor).

```yaml
# Metrics-generator configuration block
@@ -724,14 +738,14 @@ In a similar manner, excessive queries result size can also negatively impact qu
#### Limit the spans per spanset

You can set the maximum spans per spanset by setting `max_spans_per_span_set` for the query-frontend.
The default value is 100.

In Grafana or Grafana Cloud, you can use the **Span Limit** field in the [TraceQL query editor](https://grafana.com/docs/grafana-cloud/connect-externally-hosted/data-sources/tempo/query-editor/) in Grafana Explore.
This field sets the maximum number of spans to return for each span set.
The maximum value that you can set for the **Span Limit** value (or the `spss` query hint) is controlled by `max_spans_per_span_set`.
To disable the maximum spans per span set limit, set `max_spans_per_span_set` to `0`.
When set to `0`, there is no maximum and users can put any value in **Span Limit**.
However, this can only be set by a Tempo administrator, not by the user.
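A hedged sketch of the query-frontend setting follows; the nesting under `search` is an assumption here, so verify it against the query-frontend configuration reference for your version:

```yaml
query_frontend:
  search:
    # Maximum number of spans returned for each span set; 0 disables the limit.
    max_spans_per_span_set: 100
```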

#### Cap the maximum query length

20 changes: 12 additions & 8 deletions docs/sources/tempo/troubleshooting/_index.md
@@ -16,18 +16,22 @@ In addition, the [Tempo runbook](https://github.com/grafana/tempo/blob/main/oper

## Sending traces

- [Spans are being refused with "pusher failed to consume trace data"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/max-trace-limit-reached/)
- [Is Grafana Alloy sending to the backend?](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/alloy/)
- [Spans are being refused with "pusher failed to consume trace data"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/send-traces/max-trace-limit-reached/)
- [Is Grafana Alloy sending to the backend?](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/send-traces/alloy/)

## Querying

- [Unable to find my traces in Tempo](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/unable-to-see-trace/)
- [Error message "Too many jobs in the queue"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/too-many-jobs-in-queue/)
- [Queries fail with 500 and "error using pageFinder"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/bad-blocks/)
- [I can search traces, but there are no service name or span name values available](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/search-tag)
- [Error message `response larger than the max (<number> vs <limit>)`](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/response-too-large/)
- [Search results don't match trace lookup results with long-running traces](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/long-running-traces/)
- [Unable to find my traces in Tempo](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/unable-to-see-trace/)
- [Error message "Too many jobs in the queue"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/too-many-jobs-in-queue/)
- [Queries fail with 500 and "error using pageFinder"](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/bad-blocks/)
- [I can search traces, but there are no service name or span name values available](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/search-tag)
- [Error message `response larger than the max (<number> vs <limit>)`](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/response-too-large/)
- [Search results don't match trace lookup results with long-running traces](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/querying/long-running-traces/)

## Metrics-generator

- [Metrics or service graphs seem incomplete](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/metrics-generator/)

## Out-of-memory errors

- [Set the max attribute size to help control out of memory errors](https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/out-of-memory-errors/)
10 changes: 5 additions & 5 deletions docs/sources/tempo/troubleshooting/metrics-generator.md
@@ -11,17 +11,17 @@ aliases:

If you are concerned with data quality issues in the metrics-generator, we'd first recommend:

- Reviewing your telemetry pipeline to determine the number of dropped spans. We are only looking for major issues here.
- Reviewing the [service graph documentation]({{< relref "../metrics-generator/service_graphs" >}}) to understand how they are built.
- Reviewing your telemetry pipeline to determine the number of dropped spans. You are only looking for major issues here.
- Reviewing the [service graph documentation](https://grafana.com/docs/tempo/<TEMPO_VERSION>/metrics-generator/service_graphs/) to understand how they are built.

If everything seems ok from these two perspectives, consider the following topics to help resolve general issues with all metrics and span metrics specifically.
If everything seems acceptable from these two perspectives, consider the following topics to help resolve general issues with all metrics and span metrics specifically.

## All metrics

### Dropped spans in the distributor

The distributor has a queue of outgoing spans to the metrics-generators. If that queue is full then the distributor
will drop spans before they reach the generator. Use the following metric to determine if that is happening:
The distributor has a queue of outgoing spans to the metrics-generators.
If the queue is full, then the distributor drops spans before they reach the generator. Use the following metric to determine if that's happening:

```
sum(rate(tempo_distributor_queue_pushes_failures_total{}[1m]))
29 changes: 29 additions & 0 deletions docs/sources/tempo/troubleshooting/out-of-memory-errors.md
@@ -0,0 +1,29 @@
---
title: Troubleshoot out-of-memory errors
menuTitle: Out-of-memory errors
description: Gain an understanding of how to debug out-of-memory (OOM) errors.
weight: 600
---

# Troubleshoot out-of-memory errors

Learn about out-of-memory (OOM) errors and how to troubleshoot them.

## Set the max attribute size to help control out of memory errors

Tempo queriers can run out of memory when fetching traces that have spans with very large attributes.
This issue has been observed when trying to fetch a single trace using the [`tracebyID` endpoint](https://grafana.com/docs/tempo/<TEMPO_VERSION>/api_docs/#query).

To avoid these out-of-memory crashes, use `max_span_attr_byte` to limit the maximum allowable size of any individual attribute.
Any key or value that exceeds the configured limit is truncated before it's stored.

Use the `tempo_distributor_attributes_truncated_total` metric to track how many attributes are truncated.

```yaml
# Optional
# Configures the max size an attribute can be. Any key or value that exceeds this limit will be truncated before storing
# Setting this parameter to '0' would disable this check against attribute size
[max_span_attr_byte: <int> | default = '2048']
```
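To track truncation over time, a query along these lines can help (a hedged example; adjust the rate window and labels for your environment):

```
sum(rate(tempo_distributor_attributes_truncated_total{}[5m]))
```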

Refer to the [configuration for distributors](https://grafana.com/docs/tempo/<TEMPO_VERSION>/configuration/#distributor) documentation for more information.
12 changes: 12 additions & 0 deletions docs/sources/tempo/troubleshooting/querying/_index.md
@@ -0,0 +1,12 @@
---
title: Issues with querying
menuTitle: Querying
description: Troubleshoot issues related to querying.
weight: 300
---

# Issues with querying

Learn about issues related to querying.

{{< section withDescriptions="true">}}
@@ -3,7 +3,8 @@ title: Bad blocks
description: Troubleshoot queries failing with an error message indicating bad blocks.
weight: 475
aliases:
- ../operations/troubleshooting/bad-blocks/
- ../../operations/troubleshooting/bad-blocks/
- ../bad-blocks/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/bad-blocks/
---

# Bad blocks
@@ -26,7 +27,7 @@ To fix such a block, first download it onto a machine where you can run the `tem

Next, run the `tempo-cli` `gen index` or `gen bloom` command, depending on which file is corrupt or deleted.
The command creates a fresh index or bloom filter from the data file at the required location (in the block folder).
To view all of the options for this command, see the [cli docs]({{< relref "../operations/tempo_cli" >}}).
To view all of the options for this command, see the [CLI docs](https://grafana.com/docs/tempo/<TEMPO_VERSION>/operations/tempo_cli/).

Finally, upload the generated index or bloom-filter onto the object store backend under the folder for the block.

@@ -3,7 +3,8 @@ title: Long-running traces
description: Troubleshoot search results when using long-running traces
weight: 479
aliases:
- ../operations/troubleshooting/long-running-traces/
- ../../operations/troubleshooting/long-running-traces/
- ../long-running-traces/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/long-running-traces/
---

# Long-running traces
@@ -4,11 +4,12 @@ description: Troubleshoot response larger than the max error message
weight: 477
aliases:
- ../operations/troubleshooting/response-too-large/
- ../response-too-large/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/response-too-large/
---

# Response larger than the max

The error message will take a similar form to the following:
The error message is similar to the following:

```
500 Internal Server Error Body: response larger than the max (<size> vs <limit>)
@@ -3,7 +3,8 @@ title: Tag search
description: Troubleshoot No options found in Grafana tag search
weight: 476
aliases:
- ../operations/troubleshooting/search-tag/
- ../../operations/troubleshooting/search-tag/
- ../search-tag/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/search-tag/
---

# Tag search
@@ -25,4 +26,4 @@ when a query exceeds the configured value.
There are two main solutions to this issue:

* Reduce the cardinality of tags pushed to Tempo. Reducing the number of unique tag values will reduce the size returned by a tag search query.
* Increase the `max_bytes_per_tag_values_query` parameter in the [overrides]({{< relref "../configuration#overrides" >}}) block of your Tempo configuration to a value as high as 50MB.
* Increase the `max_bytes_per_tag_values_query` parameter in the [overrides](https://grafana.com/docs/tempo/<TEMPO_VERSION>/configuration/#overrides) block of your Tempo configuration to a value as high as 50MB.
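As a hedged sketch using the flat overrides form (newer Tempo versions may group this key under default or per-tenant overrides, so verify the placement for your version):

```yaml
overrides:
  # Raise the cap on the response size of tag-values queries (roughly 50MB here).
  max_bytes_per_tag_values_query: 50000000
```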
@@ -4,6 +4,7 @@ description: Troubleshoot too many jobs in the queue
weight: 474
aliases:
- ../operations/troubleshooting/too-many-jobs-in-queue/
- ../too-many-jobs-in-queue/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/too-many-jobs-in-queue/
---

# Too many jobs in the queue
@@ -18,28 +19,32 @@ Possible reasons why the compactor may not be running are:
- Insufficient permissions.
- Compactor sitting idle because no block is hashing to it.
- Incorrect configuration settings.
## Diagnosing the issue

## Diagnose the issue

- Check the metric `tempodb_compaction_bytes_written_total`.
If it's greater than zero (0), the compactor is running and writing to the backend.
- Check the metric `tempodb_compaction_errors_total`.
If it's greater than zero (0), check the compactor logs for an error message. An example query for both checks follows this list.
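For example, rate queries along these lines surface both signals (a hedged example; adjust the window for your environment):

```
sum(rate(tempodb_compaction_bytes_written_total{}[5m]))
sum(rate(tempodb_compaction_errors_total{}[5m]))
```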

## Solutions

- Verify that the Compactor has the LIST, GET, PUT, and DELETE permissions on the bucket objects.
- If these permissions are missing, assign them to the compactor container.
- For detailed information, check - https://grafana.com/docs/tempo/latest/configuration/s3/#permissions
- For detailed information, refer to the [Amazon S3 permissions](https://grafana.com/docs/tempo/<TEMPO_VERSION>/configuration/hosted-storage/s3/#permissions) documentation.
- If there’s a compactor sitting idle while others are running, port-forward to the compactor’s HTTP endpoint. Then go to `/compactor/ring` and click **Forget** on the inactive compactor.
- Check the following configuration parameters to ensure that there are correct settings:
- `max_block_bytes` to determine when the ingester cuts blocks. A good number is anywhere from 100MB to 2GB depending on the workload.
- `max_compaction_objects` to determine the max number of objects in a compacted block. This should be relatively high, generally in the millions.
- `retention_duration` for how long traces should be retained in the backend.
- Check the storage section of the config and increase `queue_depth`. Do bear in mind that a deeper queue could mean longer
- Check the storage section of the configuration and increase `queue_depth`. Do bear in mind that a deeper queue could mean longer
waiting times for query responses. Adjust `max_workers` accordingly, which configures the number of parallel workers
that query backend blocks.
```

```yaml
storage:
trace:
pool:
max_workers: 100 # worker pool determines the number of parallel requests to the object store backend
queue_depth: 10000
```
@@ -3,15 +3,16 @@ title: Unable to find traces
description: Troubleshoot missing traces
weight: 473
aliases:
- ../operations/troubleshooting/missing-trace/
- ../operations/troubleshooting/unable-to-see-trace/
- ../../operations/troubleshooting/missing-trace/
- ../../operations/troubleshooting/unable-to-see-trace/
- ../unable-to-see-trace/ # htt/docs/tempo/<TEMPO_VERSION>/troubleshooting/unable-to-see-trace/
---

# Unable to find traces

The two main causes of missing traces are:

- Issues in ingestion of the data into Tempo. Spans are either not being sent correctly to Tempo or they are not getting sampled.
- Issues in ingestion of the data into Tempo. Spans are either not sent correctly to Tempo or they aren't getting sampled.
- Issues querying for traces that have been received by Tempo.

## Section 1: Diagnose and fix ingestion issues
@@ -106,8 +107,8 @@ If the pipeline isn't reporting any dropped spans, check whether application spa
- If you require a higher ingest volume, increase the configuration for the rate limiting by adjusting the `max_traces_per_user` property in the [configured override limits](https://grafana.com/docs/tempo/<TEMPO_VERSION>/configuration/#standard-overrides).
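As a hedged sketch using the flat overrides form (an example value; newer Tempo versions group ingestion limits under default or per-tenant overrides, so check the standard overrides reference linked above):

```yaml
overrides:
  # Example: allow up to 20,000 active traces per tenant before spans are refused.
  max_traces_per_user: 20000
```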

{{< admonition type="note" >}}
Check the [ingestion limits page]({{< relref "../configuration#ingestion-limits" >}}) for further information on limits.
{{% /admonition %}}
Check the [ingestion limits page](https://grafana.com/docs/tempo/<TEMPO_VERSION>/configuration/#overrides) for further information on limits.
{{< /admonition >}}

## Section 3: Diagnose and fix issues with querying traces

12 changes: 12 additions & 0 deletions docs/sources/tempo/troubleshooting/send-traces/_index.md
@@ -0,0 +1,12 @@
---
title: Issues with sending traces
menuTitle: Sending traces
description: Troubleshoot issues related to sending traces.
weight: 200
---

# Issues with sending traces

Learn about issues related to sending traces.

{{< section withDescriptions="true">}}
@@ -5,7 +5,8 @@ description: Gain visibility on how many traces are being pushed to Grafana Allo
weight: 472
aliases:
- ../operations/troubleshooting/agent/
- ./agent.md # /docs/tempo/<TEMPO_VERSION>/troubleshooting/agent.md
- ../agent.md # /docs/tempo/<TEMPO_VERSION>/troubleshooting/agent.md
- ../alloy/ # https://grafana.com/docs/tempo/<TEMPO_VERSION>/troubleshooting/alloy/
---

# Troubleshoot Grafana Alloy
@@ -22,21 +23,21 @@ If your logs are showing no obvious errors, one of the following suggestions may
Alloy publishes a few Prometheus metrics that are useful to determine how much trace traffic it receives and successfully forwards.
These metrics are a good place to start when diagnosing Alloy tracing issues.

From the [`otelcol.receiver.otlp`](https://grafana.com/docs/alloy/<ALLOY_LATEST>/reference/components/otelcol/otelcol.receiver.otlp/) component:
From the [`otelcol.receiver.otlp`](https://grafana.com/docs/alloy/<ALLOY_VERSION>/reference/components/otelcol/otelcol.receiver.otlp/) component:
```
receiver_accepted_spans_ratio_total
receiver_refused_spans_ratio_total
```

From the [`otelcol.exporter.otlp`](https://grafana.com/docs/alloy/<ALLOY_LATEST>/reference/components/otelcol/otelcol.exporter.otlp/) component:
From the [`otelcol.exporter.otlp`](https://grafana.com/docs/alloy/<ALLOY_VERSION>/reference/components/otelcol/otelcol.exporter.otlp/) component:
```
exporter_sent_spans_ratio_total
exporter_send_failed_spans_ratio_total
```
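As a quick sanity check (a hedged example using the metric names above; adjust labels and the window for your setup), compare what Alloy accepts with what fails to export:

```
sum(rate(receiver_accepted_spans_ratio_total{}[1m]))
sum(rate(exporter_send_failed_spans_ratio_total{}[1m]))
```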

Alloy has a Prometheus scrape endpoint, `/metrics`, that you can use to check metrics locally by opening a browser to `http://localhost:12345/metrics`.
The `/metrics` HTTP endpoint of the Alloy HTTP server exposes the Alloy component and controller metrics.
Refer to the [Monitor the Grafana Alloy component controller](https://grafana.com/docs/alloy/latest/troubleshoot/controller_metrics/) documentation for more information.
Refer to the [Monitor the Grafana Alloy component controller](https://grafana.com/docs/alloy/<ALLOY_VERSION>/troubleshoot/controller_metrics/) documentation for more information.

### Check metrics in Grafana Cloud
