Handle Partial Success in ConsumeTraces #2571

ie-pham · 2023-06-19T23:01:36Z

What this PR does:

Right now, if a single trace in a batch triggers an error, Tempo drops the entire batch. With this PR, every trace in the batch gets a turn at being pushed before a response is returned to the client.
Following internal discussion, we are going to match what is required here for partial success and return a 200 instead of an error.

The tricky part was tallying up the discarded span count.

If the replication factor is 3, then the minimum success is 2 (replication factor / 2 + 1) and the maximum allowable number of errors is 1. So initially I thought simple logic would account for this: tally up all the responses, and if a trace is marked as "error" twice, consider that trace discarded. However, this approach only works if the number of responses always equals the replication factor, and that is not the case. We think the ring.DoBatch function returns as soon as 2 errors or 2 successes are recorded, without waiting for the third response (a partial success is still considered a success). In a perfect world where those two responses are exactly the same, we would not have a problem. However, if the lists of discarded traces differ between the two responses, it is hard to determine accurately whether a trace was discarded without the third response.
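The quorum arithmetic described above can be sketched as follows. This is a minimal illustration only; the function names `minSuccess` and `maxErrors` are hypothetical and are not taken from the Tempo or ring code.

```go
package main

import "fmt"

// minSuccess returns the number of successful writes required for a
// given replication factor: replication factor / 2 + 1 (integer division).
// Hypothetical helper for illustration.
func minSuccess(replicationFactor int) int {
	return replicationFactor/2 + 1
}

// maxErrors returns how many replicas may fail while the write can
// still reach the minimum success count.
func maxErrors(replicationFactor int) int {
	return replicationFactor - minSuccess(replicationFactor)
}

func main() {
	for _, rf := range []int{3, 5} {
		fmt.Printf("replication factor %d: min success %d, max errors %d\n",
			rf, minSuccess(rf), maxErrors(rf))
	}
}
```

These numbers only hold when all replicas respond, which is exactly the assumption that breaks down when the batch call returns early.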

Decision:
Since the discarded span count currently does not play an important role in the functionality of Tempo, I am going with the "over counting" solution.

Since the number of responses is always at least the minimum success, the rule is:

	// max number of errors allowed is {number of responses} - {minimum success required}

So if the replication factor is 3 and we get only 2 responses (and the minimum success required is 2), then the max number of errors allowed is 0: if a trace is marked "error" even once, it is considered discarded. With a replication factor of 5 and a minimum success of 3, however, 4 responses give a max of 1 error allowed under the same rule, not 0.
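A minimal sketch of this "over counting" rule, under stated assumptions: each response is modelled as a list of failed trace indexes (each index at most once per response), and `discardedTraces` is a hypothetical helper, not the code in this PR.

```go
package main

import "fmt"

// discardedTraces applies the rule from the description: a trace counts
// as discarded once it is marked "error" in more responses than
// {number of responses} - {minimum success required}.
// responses[i] holds the trace indexes that response i reported as failed.
func discardedTraces(replicationFactor int, responses [][]int, numTraces int) []int {
	minSuccess := replicationFactor/2 + 1
	maxErrAllowed := len(responses) - minSuccess
	if maxErrAllowed < 0 {
		// Defensive clamp; the description assumes this never happens.
		maxErrAllowed = 0
	}

	errCount := make([]int, numTraces)
	for _, failed := range responses {
		for _, idx := range failed {
			if idx >= 0 && idx < numTraces {
				errCount[idx]++
			}
		}
	}

	var discarded []int
	for idx, n := range errCount {
		if n > maxErrAllowed {
			discarded = append(discarded, idx)
		}
	}
	return discarded
}

func main() {
	// Replication factor 3, two responses: one partial success that
	// failed trace 1, and one total success.
	fmt.Println(discardedTraces(3, [][]int{{1}, {}}, 3))

	// Replication factor 5, four responses: trace 0 failed once,
	// trace 2 failed twice.
	fmt.Println(discardedTraces(5, [][]int{{0, 2}, {2}, {}, {}}, 3))
}
```

In the first call trace 1 is reported as discarded after a single error, which is the overcounting case: the missing third response might have accepted it.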

This solution has a drawback: we may overcount the number of discarded spans. For example, if we have one partial success and one total success and never see the third response, a trace that failed in the partial success may have been processed successfully by the third ingester, but we will still mark it as discarded.

Another note: I am adding a mutex lock to collect the responses from the ingesters. I will need to test this to make sure it does not impact performance too much.
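For context, collecting per-ingester responses under a mutex might look roughly like the sketch below. The `pushResult` type and `collect` function are hypothetical stand-ins; the goroutines simulate concurrent ingester callbacks and are not how the distributor actually issues requests.

```go
package main

import (
	"fmt"
	"sync"
)

// pushResult is a hypothetical stand-in for one ingester's response.
type pushResult struct {
	failedTraces []int
}

// collect appends results from concurrent callbacks into one slice.
// The mutex guards the shared slice, since append is not safe for
// concurrent use.
func collect(results []pushResult) []pushResult {
	var (
		mu        sync.Mutex
		wg        sync.WaitGroup
		collected []pushResult
	)
	for _, r := range results {
		wg.Add(1)
		go func(r pushResult) {
			defer wg.Done()
			mu.Lock()
			collected = append(collected, r)
			mu.Unlock()
		}(r)
	}
	wg.Wait()
	return collected
}

func main() {
	got := collect([]pushResult{{failedTraces: []int{1}}, {}, {}})
	fmt.Println("responses collected:", len(got))
}
```

The critical section is a single append, so contention should be small, but that is the part that would need measuring on the real write path.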

Which issue(s) this PR fixes:
Fixes #1957

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

ie-pham · 2023-06-30T03:31:07Z

So the logic works, but I'm not sure whether it's too convoluted to record the correct discarded span count by reason.

joe-elliott

I question the efforts to perfectly record dropped spans for each reason. I'm wondering if a better approach would be to change PushResponse to return trace indexes and rejection reasons back to the distributor. The distributor could then update the appropriate metrics based on the response.
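One way to picture this suggestion, as a sketch under stated assumptions: the response carries one rejection reason per pushed trace, aligned by index, and the distributor sums discarded spans per reason. The `pushResponse` type, the reason constants, and `discardedSpansByReason` are all hypothetical and do not reflect the actual fields in pkg/tempopb/tempo.proto.

```go
package main

import "fmt"

// rejectReason is a hypothetical enum of per-trace outcomes.
type rejectReason int

const (
	reasonNone rejectReason = iota // trace was accepted
	reasonLiveTracesExceeded
	reasonTraceTooLarge
	reasonUnknown
)

// pushResponse is a hypothetical response shape: reasons[i] is the
// outcome for the i-th trace in the push request.
type pushResponse struct {
	reasons []rejectReason
}

// discardedSpansByReason sums the span counts of rejected traces per
// reason, which a distributor could then feed into its metrics.
func discardedSpansByReason(resp pushResponse, spansPerTrace []int) map[rejectReason]int {
	counts := make(map[rejectReason]int)
	for i, reason := range resp.reasons {
		if reason == reasonNone || i >= len(spansPerTrace) {
			continue
		}
		counts[reason] += spansPerTrace[i]
	}
	return counts
}

func main() {
	resp := pushResponse{reasons: []rejectReason{
		reasonNone, reasonTraceTooLarge, reasonLiveTracesExceeded, reasonTraceTooLarge,
	}}
	counts := discardedSpansByReason(resp, []int{4, 10, 2, 5})
	fmt.Println("trace too large:", counts[reasonTraceTooLarge])
	fmt.Println("live traces exceeded:", counts[reasonLiveTracesExceeded])
}
```

Keeping the reasons index-aligned with the request avoids sending trace IDs back, at the cost of requiring both sides to agree on ordering.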

integration/util.go

integration/e2e/limits_test.go

modules/distributor/distributor.go

pkg/tempopb/tempo.proto

modules/ingester/instance.go

joe-elliott

I have reviewed mainly the distributor part and I think we need to make some changes so I'm not quite going to dig deep on the ingester part yet.

Overall, this is really heading the right direction. A lot of comments and questions b/c this is a very performance sensitive part of the code and a very nuanced change.

Also, I think this invalidates #1958, since we are just going to start returning 200 on dropped spans. Can you confirm?

CHANGELOG.md

integration/e2e/limits_test.go

modules/distributor/distributor.go

modules/ingester/ingester.go

pkg/tempopb/tempo.proto

CHANGELOG.md

modules/distributor/distributor.go

modules/ingester/ingester.go

modules/ingester/instance.go

joe-elliott

Nice addition!

adugar-conga · 2024-02-07T15:58:54Z

@ie-pham @joe-elliott In which release version of Grafana Tempo will this be addressed?

adugar-conga · 2024-02-12T18:11:36Z

@joe-elliott @ie-pham In which release version of Grafana Tempo will this be addressed?

joe-elliott · 2024-02-12T20:03:30Z

This feature is merged into main and will be in the next release of Tempo. We attempt a 3 month cadence on OSS releases but make no guarantees.

Releases are listed here: https://github.com/grafana/tempo/releases/

ie-pham and others added 5 commits June 8, 2023 09:31

testing

728f91d

Merge branch 'grafana:main' into main

fea278d

undo logs

e75219c

Merge branch 'grafana:main' into main

b90e527

Merge branch 'grafana:main' into main

8f78ff2

ie-pham marked this pull request as ready for review June 30, 2023 14:30

ie-pham requested review from joe-elliott, annanay25, mdisibio, mapno, yvrhdn, zalegrala, electron0zero and stoewer as code owners June 30, 2023 14:30

joe-elliott reviewed Jun 30, 2023

Merge branch 'grafana:main' into main

2aabff9

ie-pham requested a review from joe-elliott July 6, 2023 13:59

ie-pham added 2 commits July 13, 2023 15:49

Merge branch 'grafana:main' into main

90c9912

Merge branch 'grafana:main' into main

d6ee86a

ie-pham changed the title ~~Fix1957~~ Handle Partial Success in ConsumeTraces Jul 24, 2023

joe-elliott reviewed Jul 26, 2023

pkg/tempopb/tempo.proto (outdated)

ie-pham added 2 commits July 26, 2023 14:25

Merge branch 'grafana:main' into main

95ca0d1

Merge branch 'grafana:main' into main

974920f

ie-pham requested a review from joe-elliott July 27, 2023 21:29

Merge branch 'grafana:main' into main

8153c92

joe-elliott reviewed Jul 31, 2023

Merge branch 'grafana:main' into main

78e1b9c

ie-pham requested a review from joe-elliott August 2, 2023 02:23

ie-pham added 20 commits January 3, 2024 14:24

fix

6ef65d6

test log commit

8a38ac5

refactor

7795d04

no list

d795d85

lint

77bf98a

changed proto

5fd59b5

handle old proto

fafc6c2

rebase

082f608

using two lists instead of nested list

5822956

clean up tests

16c9194

make ingester more efficent

90ae1b2

lint

f48dcb4

lint

529b466

lint

15a3792

remove pkg logger

0a84958

moar lint

6a94734

refactor response handling code

7b37e5e

refactor

03228dc

refactored instance

6985e74

add unknown error as reason

a879efc

ie-pham requested a review from joe-elliott January 4, 2024 13:54

joe-elliott approved these changes Jan 4, 2024

ie-pham added 2 commits January 4, 2024 14:25

lint

c0b9f4e

more lint

a0c2a62

ie-pham merged commit 306fdd7 into grafana:main Jan 4, 2024
13 checks passed

ie-pham mentioned this pull request Jan 15, 2024

Op dashboard #3296

Closed

3 tasks

adugar-conga mentioned this pull request Feb 12, 2024

When will this PR be part of a tempo release #3384

Closed
