We now have a working theory for why this occurs. The race condition discovered between Queriers and Compactors was introduced by the addition of query sharding through the query frontend.
Each querier can have a different view of the state of the blocklist due to randomness in its polling times.
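As a rough illustration (not Tempo's actual code; `pollLoop` and its shape are hypothetical), each querier refreshing its own snapshot on a jittered timer is enough for two queriers to disagree about the blocklist at any given instant:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// pollLoop models one querier refreshing its private blocklist snapshot.
// Random jitter means two queriers rarely refresh at the same instant,
// so between refreshes their snapshots can disagree.
func pollLoop(id int, fetch func() []string, base time.Duration, out chan<- string) {
	for {
		jitter := time.Duration(rand.Int63n(int64(base)))
		time.Sleep(base + jitter)
		out <- fmt.Sprintf("querier %d now sees %v", id, fetch())
	}
}

func main() {
	// fetch simulates listing block IDs in object storage.
	fetch := func() []string { return []string{"block-A", "block-B"} }
	out := make(chan string)
	for i := 1; i <= 2; i++ {
		go pollLoop(i, fetch, 100*time.Millisecond, out)
	}
	for i := 0; i < 4; i++ {
		fmt.Println(<-out)
	}
}
```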
The query frontend shards queries, assigning a block ID range to each querier.
Compactors move traces between blocks, and those moves can cross these block ranges.
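A minimal sketch of the sharding idea, assuming the block ID space is split into contiguous ranges (the `shardForBlock` function and the first-byte split are illustrative simplifications, not Tempo's implementation):

```go
package main

import "fmt"

// shardForBlock maps a block ID's first byte to one of n shards; sharding a
// real UUID space is more involved, but the idea is the same: each querier
// is responsible for a contiguous slice of the block ID space.
func shardForBlock(blockID [16]byte, n int) int {
	return int(blockID[0]) * n / 256
}

func main() {
	oldBlock := [16]byte{0x10} // block holding the trace before compaction
	newBlock := [16]byte{0xf0} // block holding the trace after compaction

	// The compacted block lands in a different shard, i.e. a different
	// querier is now responsible for finding the trace.
	fmt.Println(shardForBlock(oldBlock, 4)) // 0
	fmt.Println(shardForBlock(newBlock, 4)) // 3
}
```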
There are windows in which:
- some blocks are compacted, and the querier instance responsible for these has an updated view of the blocklist, while
- the querier instance responsible for the block range of the compacted block has a stale view of the blocklist.

During these windows, a query for traces in the compacted block will return a 404.
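Putting the pieces together, a toy model of the window (all names hypothetical, not Tempo's types): the querier with the fresh view no longer sees the old block, while the querier that owns the new block's range hasn't polled yet, so neither finds the trace:

```go
package main

import "fmt"

type blocklist map[string]bool // set of live block IDs

type querier struct {
	id   int
	view blocklist // last-polled snapshot, possibly stale
}

// wouldSearch reports whether this querier's snapshot contains the block.
func (q querier) wouldSearch(block string) bool {
	return q.view[block]
}

func main() {
	// Backend state before compaction: the trace lives in "block-A".
	before := blocklist{"block-A": true}
	// Compaction replaces block-A with block-B, in a different shard range.
	after := blocklist{"block-B": true}

	q1 := querier{id: 1, view: after}  // has polled since compaction
	q2 := querier{id: 2, view: before} // stale view

	// Suppose the frontend assigns block-B's range to q2. q2 has not seen
	// block-B yet, and block-A is already gone from q1's view, so no
	// querier searches the block that actually holds the trace.
	fmt.Println("q1 searches block-A:", q1.wouldSearch("block-A")) // false
	fmt.Println("q2 searches block-B:", q2.wouldSearch("block-B")) // false
}
```

In this state every sub-query comes back empty, and the frontend reports a 404 even though the trace is durably stored in the backend.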
Describe the bug
We have very rarely (<0.1% of queries) seen a 404 for a trace that actually existed in the backend. Some example logs:
The trace was initially created at 15:04 UTC and the first failed query occurred ~3.5 hours later. Based on our internal operations, the trace was definitely in the backend at the time. Perhaps related to a polling sync issue?
@annanay25
Expected behavior
Tempo should always correctly return a trace that is in the backend (or return a 500).