Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid False Positives 409s in Thanos Receive Replication #1615

Closed
squat opened this issue Oct 9, 2019 · 0 comments · Fixed by #1616
Closed

Avoid False Positives 409s in Thanos Receive Replication #1615

squat opened this issue Oct 9, 2019 · 0 comments · Fixed by #1616

Comments

@squat
Copy link
Member

squat commented Oct 9, 2019

There is a little bit of nuance in the replication error logic that we can improve in order to avoid false positives when discovering 409s. Namely, when the replicate func checks for errors, it should only return the 409 if the number of 409s in the multi-error was greater than or equal to the minimum replication threshold. Otherwise, the replication should be retried.

In order words, let's say we have replication factor 3. This means the minimum replication threshold is 2.

  • if there was only 1 error, then no matter what the error was, the replication will succeed and return a nil error;
  • if there were two or more errors:
    • if just one of the errors was a 409, then we shouldn't return a 409 because this didn't really cause the failure, the 500 from the other error is responsible; the error should either be fixed in code, or prometheus should retry sending the sample;
    • if two or more errors were 409s, then the sample is definitely conflicting and should not be retried. There is no internal server error that could be fixed and so retrying will never work

xref: #1563 #1586

squat added a commit to squat/thanos that referenced this issue Oct 9, 2019
This commit ensures that the `replicate` func does not falsely identify
a replication error as a conflict. If the number of conflict errors
during replication is less than the minimum replication threshold, then
replication cannot be said to have failed due to a conflict and the
request should be retried.

Fixes: thanos-io#1615

xref: thanos-io#1563
Signed-off-by: Lucas Servén Marín <[email protected]>
squat added a commit to squat/thanos that referenced this issue Oct 9, 2019
This commit ensures that the `replicate` func does not falsely identify
a replication error as a conflict. If the number of conflict errors
during replication is less than the minimum replication threshold, then
replication cannot be said to have failed due to a conflict and the
request should be retried.

Fixes: thanos-io#1615

xref: thanos-io#1563
Signed-off-by: Lucas Servén Marín <[email protected]>
brancz pushed a commit that referenced this issue Oct 9, 2019
This commit ensures that the `replicate` func does not falsely identify
a replication error as a conflict. If the number of conflict errors
during replication is less than the minimum replication threshold, then
replication cannot be said to have failed due to a conflict and the
request should be retried.

Fixes: #1615

xref: #1563
Signed-off-by: Lucas Servén Marín <[email protected]>
GiedriusS pushed a commit that referenced this issue Oct 28, 2019
This commit ensures that the `replicate` func does not falsely identify
a replication error as a conflict. If the number of conflict errors
during replication is less than the minimum replication threshold, then
replication cannot be said to have failed due to a conflict and the
request should be retried.

Fixes: #1615

xref: #1563
Signed-off-by: Lucas Servén Marín <[email protected]>
Signed-off-by: Giedrius Statkevičius <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant