
Tempo occasionally 404s on existing traces #565

Closed
joe-elliott opened this issue Mar 2, 2021 · 1 comment · Fixed by #583

@joe-elliott (Member)

Describe the bug
We have very rarely (<0.1% of queries) seen a 404 returned for a trace that actually existed in the backend. Some example logs:

level=info ts=2021-03-02T18:35:10.996468172Z caller=frontend.go:63 method=GET traceID=14369e2468612583 url=/api/traces/7ec7cc871f20688b duration=1.708811458s status=404
level=info ts=2021-03-02T20:01:51.099858089Z caller=frontend.go:63 method=GET traceID=26c59851cbe88622 url=/api/traces/00000000000000007ec7cc871f20688b duration=2.547382198s status=200
level=info ts=2021-03-02T20:03:09.096726326Z caller=frontend.go:63 method=GET traceID=2f9c2851e68eff69 url=/api/traces/00000000000000007ec7cc871f20688b duration=4.932888762s status=200

The trace was initially created at 15:04 UTC and the first failed query occurred ~3.5 hours later. Based on our internal operational data, the trace was definitely in the backend at that time. Perhaps this is related to a polling sync issue?
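For reference, a minimal Go sketch that replays the /api/traces/{id} requests from the logs above and prints the status codes. The endpoint path comes from the log lines; the query-frontend base URL is a hypothetical placeholder for your own deployment.

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	const base = "http://tempo-query-frontend:3100" // hypothetical address; substitute your deployment

	traceIDs := []string{
		"7ec7cc871f20688b",                 // unpadded form from the 404 log line
		"00000000000000007ec7cc871f20688b", // zero-padded form from the 200 log lines
	}

	for _, id := range traceIDs {
		resp, err := http.Get(fmt.Sprintf("%s/api/traces/%s", base, id))
		if err != nil {
			fmt.Println(id, "error:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println(id, "status:", resp.StatusCode)
	}
}
```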

@annanay25

Expected behavior
Tempo should always correctly return a trace that exists in the backend (or return a 500 on error), never a 404.

@joe-elliott joe-elliott added this to the kv store complete milestone Mar 2, 2021
@annanay25 annanay25 added the type/bug Something isn't working label Mar 3, 2021
@annanay25 (Contributor) commented Mar 3, 2021

We now have a working theory for why this occurs: a race condition between queriers and compactors, introduced by the addition of query sharding through the query frontend.

  • Each querier can have a different view of the blocklist because of randomness in polling times.
  • The query frontend shards queries across the queriers, assigning a block range to each querier.
  • Compactors move traces between blocks, and those moves can cross these block ranges.
  • There are windows in which
    • the querier responsible for the compacted (source) blocks already has an updated view of the blocklist, while
    • the querier responsible for the block range of the newly written block still has a stale view of the blocklist.

During these windows, a query for a trace in the compacted block will return a 404.
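To make the window concrete, here is a minimal Go sketch of the race. This is not Tempo's actual code: the block IDs, shard ranges, and polling model are simplified stand-ins. It only shows how two blocklist snapshots taken on opposite sides of a compaction can leave a trace invisible to every shard.

```go
package main

import "fmt"

type block struct {
	id     int // stand-in for a block UUID
	traces map[string]bool
}

// querier holds its own snapshot of the blocklist, refreshed by polling.
type querier struct {
	name      string
	blocklist []block
}

// owns reports whether a block ID falls inside the shard range assigned by the frontend.
func owns(shardLo, shardHi, blockID int) bool {
	return blockID >= shardLo && blockID < shardHi
}

// find searches only the blocks in this querier's assigned range, using its own snapshot.
func (q *querier) find(shardLo, shardHi int, traceID string) bool {
	for _, b := range q.blocklist {
		if owns(shardLo, shardHi, b.id) && b.traces[traceID] {
			return true
		}
	}
	return false
}

func main() {
	traceID := "7ec7cc871f20688b"
	oldBlock := block{id: 10, traces: map[string]bool{traceID: true}}
	newBlock := block{id: 90, traces: map[string]bool{traceID: true}} // written by the compactor

	// Querier A polled after compaction: the old block is gone, the new one is visible.
	a := &querier{name: "A", blocklist: []block{newBlock}}
	// Querier B polled before compaction: it still sees only the old block.
	b := &querier{name: "B", blocklist: []block{oldBlock}}

	// The frontend assigns A the low block-ID range and B the high range.
	foundByA := a.find(0, 50, traceID)   // the new block (id 90) is outside A's range
	foundByB := b.find(50, 100, traceID) // B's stale list has no block in its range holding the trace

	if !foundByA && !foundByB {
		fmt.Println("404: neither shard sees the trace during the stale window")
	}
}
```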

We are actively working towards a fix for this!
