
pusher failed to consume trace data due to "context canceled" #3957

Closed
dastrobu opened this issue Aug 13, 2024 · 9 comments · Fixed by #4341

Comments

@dastrobu
Contributor

Is your feature request related to a problem? Please describe.

We encountered the log message msg="pusher failed to consume trace data" err="context canceled" and had significant difficulty identifying the cause. The root cause was not related to the reasons documented in the troubleshooting guide.

The metric tempo_discarded_spans_total indicated that spans were being discarded, but the logs did not clarify the reason.

We ended up patching Grafana Tempo and even the Go SDK to gather more information about the discarded spans. I would like to propose improvements to observability in this area.

Here is what we did:

  1. Additional Logging in Distributor

    We added extra logging to distributor.go to identify which spans were causing issues:

    if err != nil {
        logSpans(batches, &LogReceivedSpansConfig{ // TODO should be LogDiscardedSpansConfig
            Enabled:              true,
            IncludeAllAttributes: true,
            FilterByStatusError:  false,
        }, d.logger)
        return nil, err
    }

    I propose adding a LogDiscardedSpansConfig in addition to LogReceivedSpansConfig to log the discarded spans that are causing issues (a sketch follows after this list). I can open a PR for this if you think it is a good idea.

  2. Context Cancellation Debugging

    It was extremely challenging to determine why the Context was canceled. This is a weak spot in Go, in my opinion. We had to patch the Go SDK to get more information about context cancellation.

    We added stack traces to context.go:

    func (c *cancelCtx) cancel(removeFromParent bool, err, cause error) {
    	if err == nil {
    		panic("context: internal error: missing cancel error")
    	}
    	if err == Canceled {
    		stack := debug.Stack()
    		s := string(stack)
    		err = fmt.Errorf("%w stack: %v", err, s)
    	}

    This led to the following logs:

    level=error ts=2024-06-20T06:05:40.085835663Z caller=rate_limited_logger.go:27 msg="pusher failed to consume trace data" err="context canceled stack:
    goroutine 869 [running]:
    runtime/debug.Stack()
        /opt/hostedtoolcache/go/1.22.2/x64/src/runtime/debug/stack.go:24 +0x6b
    context.(*cancelCtx).cancel(0xc047f05a90, 0x1, {0x42ee3a0, 0xc000130c20}, {0x0, 0x0})
        /opt/hostedtoolcache/go/1.22.2/x64/src/context/context.go:546 +0x8a
    context.WithCancel.func1()
        /opt/hostedtoolcache/go/1.22.2/x64/src/context/context.go:240 +0x34
    google.golang.org/grpc/internal/transport.(*http2Server).closeStream(0xc0438ad380, 0xc047e65b00, 0x0, 0x0, 0x0)
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/internal/transport/http2_server.go:1313 +0x39
    google.golang.org/grpc/internal/transport.(*http2Server).handleRSTStream(0xc0438ad380, 0xc047da5d60)
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/internal/transport/http2_server.go:833 +0x94
    google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc0438ad380, {0x43139d0, 0xc04410f830}, 0xc04410f8c0)
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/internal/transport/http2_server.go:685 +0xaa9
    google.golang.org/grpc.(*Server).serveStreams(0xc000670a00, {0x43139d0, 0xc04410f830}, {0x432e560, 0xc0438ad380}, {0x432a618, 0xc0410f96d0})
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/server.go:1013 +0x4f3
    google.golang.org/grpc.(*Server).handleRawConn.func1()
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/server.go:949 +0x92
    created by google.golang.org/grpc.(*Server).handleRawConn in goroutine 866
        /home/runner/actions-runner/_work/grafana-tempo/grafana-tempo/vendor/google.golang.org/grpc/server.go:948 +0x389

    As you can see, the context was canceled because the client sent an RST_STREAM frame; see handleRSTStream and closeStream.

    Therefore, Grafana Tempo is behaving correctly, but it was extremely difficult to determine why the context was canceled.

  3. Confusing Metric Labels

    The discarded spans are added to the tempo_discarded_spans_total metric with the reason reasonInternalError (distributor.go#L440), which was also confusing.
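
On the LogDiscardedSpansConfig idea from point 1, here is a minimal sketch of what the new config could look like, assuming it simply mirrors the shape of the existing LogReceivedSpansConfig; the struct, field names, and yaml keys below are assumptions, not the final API:

```go
// Illustrative sketch only: a config mirroring LogReceivedSpansConfig and the
// proposed yaml block it would map onto. Names are assumptions, not Tempo's API.
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

type LogSpansConfig struct {
	Enabled              bool `yaml:"enabled"`
	IncludeAllAttributes bool `yaml:"include_all_attributes"`
	FilterByStatusError  bool `yaml:"filter_by_status_error"`
}

type DistributorConfig struct {
	LogDiscardedSpans LogSpansConfig `yaml:"log_discarded_spans"`
	LogReceivedSpans  LogSpansConfig `yaml:"log_received_spans"`
}

func main() {
	raw := []byte(`
log_discarded_spans:
  enabled: true
  include_all_attributes: true
log_received_spans:
  enabled: false
`)
	var cfg DistributorConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg) // shows both blocks parsed into their configs
}
```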

Describe the solution you'd like

I propose two improvements:

  1. Add a LogDiscardedSpansConfig to the distributor to log discarded spans causing issues.
    distributor:
      config:
        log_discarded_spans:
          enabled: true
        log_received_spans:
          # ...
    This should be easily doable.
  2. Add a new reason label to tempo_discarded_spans_total to indicate when the client's fault causes traces to be discarded. I am unsure how to implement this, though.
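
On point 2, here is a hypothetical sketch of how a dedicated reason could be recorded, using a plain client_golang CounterVec; reasonClientCancelled and recordDiscardedSpans are illustrative names, not Tempo's actual metric wiring:

```go
// Hypothetical sketch, not Tempo's actual wiring: record a dedicated reason
// when the push failed because the client canceled, instead of counting it
// under internal_error.
package distributor

import (
	"context"
	"errors"

	"github.com/prometheus/client_golang/prometheus"
)

const (
	reasonInternalError   = "internal_error"
	reasonClientCancelled = "client_cancelled" // assumed new label value
)

var metricDiscardedSpans = prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "tempo",
	Name:      "discarded_spans_total",
	Help:      "The total number of spans discarded.",
}, []string{"reason", "tenant"})

// recordDiscardedSpans picks the reason label based on the error.
func recordDiscardedSpans(err error, tenant string, spanCount int) {
	reason := reasonInternalError
	if errors.Is(err, context.Canceled) {
		reason = reasonClientCancelled
	}
	metricDiscardedSpans.WithLabelValues(reason, tenant).Add(float64(spanCount))
}
```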

Describe alternatives you've considered

Switching to Jaeger 😉.

Additional context

Grafana was deployed using the tempo-distributed Helm chart.

@joe-elliott
Member

Thank you for the very detailed issue. Please don't switch to Jaeger ;).

I propose adding a LogDiscardedSpansConfig in addition to LogReceivedSpansConfig to log discarded spans that are causing issues. I can open a PR for this if you think it is a good idea.

100%. I think this is a great idea and would love a PR.

It was extremely challenging to determine why the Context was canceled. This is a weak spot in Go, in my opinion.

Yes. I have most recently felt this pain in the frontend/read path.

Therefore, Grafana Tempo is behaving correctly, but it was extremely difficult to determine why the context was canceled.

Thank you for digging so deep. This is impressive research and a great contribution to the project.

Add a new reason label to tempo_discarded_spans_total to indicate when the client's fault causes traces to be discarded. I am unsure how to implement this, though.

I don't have a good answer for this either. Perhaps we can log the cancel cause (added in go 1.20):
https://pkg.go.dev/context#WithCancelCause

I don't think much uses this yet, but we could slowly add causes in our code. Presumably Go will also add causes to the framework over time. Perhaps this will grow in value?
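
For illustration, a minimal sketch of what that gives us; nothing here is Tempo-specific, both functions have been in the standard library since Go 1.20:

```go
// Whoever cancels can attach a descriptive cause; ctx.Err() still reports the
// generic sentinel, while context.Cause(ctx) surfaces the attached cause.
package main

import (
	"context"
	"errors"
	"fmt"
)

func main() {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errors.New("client sent RST_STREAM"))

	fmt.Println(ctx.Err())          // context canceled
	fmt.Println(context.Cause(ctx)) // client sent RST_STREAM
}
```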

dastrobu added a commit to dastrobu/tempo that referenced this issue Aug 14, 2024
…ns causing issues.

```yaml
distributor:
  config:
    log_discarded_spans:
      enabled: true
    log_received_spans:
      # ...

```

Relates to grafana#3957.

Signed-off-by: Daniel Strobusch <[email protected]>
@dastrobu
Contributor Author

dastrobu commented Aug 19, 2024

@joe-elliott, thanks for your feedback.

100%. I think this is a great idea and would love a PR.

See #3964 – it took a few days to get it right (hopefully).

I don't have a good answer for this either. Perhaps we can log the cancel cause.

True, but for the particular case I encountered, we would need to convince the google.golang.org/grpc team to add this. Nevertheless, I'd second adding a cause to Tempo-internal context cancellations where it makes sense.

@dastrobu
Contributor Author

The gRPC team is considering adding the cancel cause; see grpc/grpc-go#7541 (comment).

@dastrobu
Contributor Author

Found another issue that would need to be fixed to get the cancel cause reported: batch.go would need to be changed as follows:

	case <-ctx.Done():
-		return ctx.Err()
+		return context.Cause(ctx)
	}

When I find time, I might open an issue/PR to https://github.com/grafana/dskit.
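
In runnable form, this is the difference that one-line change makes (assuming the caller cancels with a cause attached):

```go
// Returning context.Cause(ctx) instead of ctx.Err() from a select loop, as in
// the batch.go diff above, preserves the cancel cause for the caller.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

func wait(ctx context.Context) error {
	select {
	case <-time.After(time.Second):
		return nil
	case <-ctx.Done():
		// return ctx.Err()       // would only ever report "context canceled"
		return context.Cause(ctx) // reports the attached cause, if any
	}
}

func main() {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errors.New("stream reset by client"))
	fmt.Println(wait(ctx)) // stream reset by client
}
```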

joe-elliott pushed a commit that referenced this issue Aug 23, 2024
…ns causing issues. (#3964)

```yaml
distributor:
  config:
    log_discarded_spans:
      enabled: true
    log_received_spans:
      # ...

```

Relates to #3957.

Signed-off-by: Daniel Strobusch <[email protected]>
@dastrobu
Contributor Author

@joe-elliott I created grafana/dskit#576 and grafana/dskit#577 in dskit, but I am not sure whether there is a process to encourage someone to look at them.


This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions bot added the stale label on Oct 29, 2024
@dastrobu
Contributor Author

remove stale.

Open tasks:

@dastrobu
Contributor Author

@joe-elliott

It looks like the gRPC team is not convinced by the use of context cancel causes. If you or someone from the Tempo team wishes to, you might want to step into the discussion at grpc/grpc-go#7541. I personally will probably not be able to invest much more time, as I see the proposal as unlikely to be accepted.

I personally think that adding cancel causes everywhere would be a nice non-breaking improvement for gRPC, Tempo, and other packages. I would love to see cancel causes in the Tempo codebase as well in the future.

It was a pleasure to contribute to Grafana Tempo on this topic, and I value the open and constructive culture of the Tempo team very much.

@joe-elliott
Copy link
Member

Thx for the kind words @dastrobu. We appreciate your contributions to Tempo and the wider go and OSS community. Reviewing the linked thread, you've made some great arguments and I doubt I have anything to add. Agree 100% that cancel cause is a great addition to go and we should use it as much as possible.

Thank you for all your work!
