
Tempo occasionally 404s on existing traces #565

Closed
joe-elliott opened this issue Mar 2, 2021 · 1 comment · Fixed by #583

@joe-elliott (Member)

Describe the bug
We have very rarely (<0.1% of queries) seen a 404 returned for a trace that actually existed in the backend. Some example logs:

level=info ts=2021-03-02T18:35:10.996468172Z caller=frontend.go:63 method=GET traceID=14369e2468612583 url=/api/traces/7ec7cc871f20688b duration=1.708811458s status=404
level=info ts=2021-03-02T20:01:51.099858089Z caller=frontend.go:63 method=GET traceID=26c59851cbe88622 url=/api/traces/00000000000000007ec7cc871f20688b duration=2.547382198s status=200
level=info ts=2021-03-02T20:03:09.096726326Z caller=frontend.go:63 method=GET traceID=2f9c2851e68eff69 url=/api/traces/00000000000000007ec7cc871f20688b duration=4.932888762s status=200

The trace was initially created at 15:04 UTC and the first failed query occurred ~3.5 hours later. Based on our internal operational data, the trace was definitely in the backend at that time. Perhaps this is related to a polling sync issue?
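For reference, a minimal Go sketch that replays the /api/traces/{id} requests from the logs above and prints the status codes. The endpoint path comes from the log lines; the query-frontend base URL is a hypothetical placeholder for your own deployment.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	const base = "http://tempo-query-frontend:3100" // hypothetical address; substitute your deployment

	traceIDs := []string{
		"7ec7cc871f20688b",                 // unpadded form from the 404 log line
		"00000000000000007ec7cc871f20688b", // zero-padded form from the 200 log lines
	}

	for _, id := range traceIDs {
		resp, err := http.Get(fmt.Sprintf("%s/api/traces/%s", base, id))
		if err != nil {
			fmt.Println(id, "error:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println(id, "status:", resp.StatusCode)
	}
}
```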

@annanay25

Expected behavior
Tempo should always correctly return a trace that exists in the backend (or return a 500 on error), never a 404.

@joe-elliott joe-elliott added this to the kv store complete milestone Mar 2, 2021
@annanay25 annanay25 added the type/bug Something isn't working label Mar 3, 2021
@annanay25 (Contributor) commented Mar 3, 2021

We now have a working theory for why this occurs: a race condition between queriers and compactors, introduced by the addition of query sharding through the query frontend.

  • Each querier can have a different view of the blocklist because of randomness in polling times.
  • The query frontend shards queries across the queriers, assigning a block range to each querier.
  • Compactors move traces between blocks, and those moves can cross these block ranges.
  • There are windows in which
    • the querier responsible for the compacted (source) blocks already has an updated view of the blocklist, while
    • the querier responsible for the block range of the newly written block still has a stale view of the blocklist.

During these windows, a query for a trace in the compacted block will return a 404.
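To make the window concrete, here is a minimal Go sketch of the race. This is not Tempo's actual code: the block IDs, shard ranges, and polling model are simplified stand-ins. It only shows how two blocklist snapshots taken on opposite sides of a compaction can leave a trace invisible to every shard.

```go
package main

import "fmt"

type block struct {
	id     int // stand-in for a block UUID
	traces map[string]bool
}

// querier holds its own snapshot of the blocklist, refreshed by polling.
type querier struct {
	name      string
	blocklist []block
}

// owns reports whether a block ID falls inside the shard range assigned by the frontend.
func owns(shardLo, shardHi, blockID int) bool {
	return blockID >= shardLo && blockID < shardHi
}

// find searches only the blocks in this querier's assigned range, using its own snapshot.
func (q *querier) find(shardLo, shardHi int, traceID string) bool {
	for _, b := range q.blocklist {
		if owns(shardLo, shardHi, b.id) && b.traces[traceID] {
			return true
		}
	}
	return false
}

func main() {
	traceID := "7ec7cc871f20688b"
	oldBlock := block{id: 10, traces: map[string]bool{traceID: true}}
	newBlock := block{id: 90, traces: map[string]bool{traceID: true}} // written by the compactor

	// Querier A polled after compaction: the old block is gone, the new one is visible.
	a := &querier{name: "A", blocklist: []block{newBlock}}
	// Querier B polled before compaction: it still sees only the old block.
	b := &querier{name: "B", blocklist: []block{oldBlock}}

	// The frontend assigns A the low block-ID range and B the high range.
	foundByA := a.find(0, 50, traceID)   // the new block (id 90) is outside A's range
	foundByB := b.find(50, 100, traceID) // B's stale list has no block in its range holding the trace

	if !foundByA && !foundByB {
		fmt.Println("404: neither shard sees the trace during the stale window")
	}
}
```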

We are actively working towards a fix for this!
