Fix ketama quorum #5910
Conversation
Force-pushed from a37d49f to c120cfb
Thanks for this @fpetkovski, I think the approach in general looks good and somewhat converges with #5791 as you mentioned. A couple of thoughts:
- I'm not sure error determination works correctly here. We might end up with a mix of different failure reasons for different replication batches (some may end up with 409, some with 500); in such a case I think we have no option other than to tell the client to retry (i.e. return a server error).
- It would be good to add more test cases with different numbers of nodes / replication factors, plus E2E tests, perhaps taken over from Receiver: Fix quorum handling for all hashing algorithms #5791
Force-pushed from 57f7fab to 4437ca1
Force-pushed from 7542349 to 2ec06ca
The quorum calculation is currently broken when using the Ketama hashring. The reasons are explained in detail in issue thanos-io#5784. This commit fixes quorum calculation by tracking successful writes for each individual time-series inside a remote-write request. The commit also removes the replicate() method inside the Handler and moves the entire logic of fanning out and calculating success into the fanoutForward() method. Signed-off-by: Filip Petkovski <[email protected]>
Force-pushed from 2ec06ca to 572bddf
Thanks everyone for the review. We had a sync with @matej-g, and it seems like the only correct way to verify quorum is to track successful writes for each individual time-series. I've updated this PR to reflect that.
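For readers following the thread, here is a minimal, self-contained sketch of the per-series quorum idea (all names and shapes are illustrative, not the handler's actual code): count how many replicas acknowledged each series, and fail the request if any single series falls short of quorum.

```go
package main

import "fmt"

// seriesID identifies a single time-series inside a remote-write request,
// e.g. a hash of its label set. Illustrative only.
type seriesID uint64

// countSuccessfulWrites tallies, for each series, how many replicas
// acknowledged the write. outcomes maps a replica endpoint to the series it
// successfully stored.
func countSuccessfulWrites(outcomes map[string][]seriesID) map[seriesID]int {
	successes := make(map[seriesID]int)
	for _, stored := range outcomes {
		for _, s := range stored {
			successes[s]++
		}
	}
	return successes
}

// quorumReached reports whether every series in the request was written to
// at least (replicationFactor/2 + 1) replicas. A single series falling short
// fails the whole request, so the client will retry it.
func quorumReached(successes map[seriesID]int, totalSeries, replicationFactor int) bool {
	quorum := replicationFactor/2 + 1
	if len(successes) < totalSeries {
		return false // some series were not written anywhere
	}
	for _, n := range successes {
		if n < quorum {
			return false
		}
	}
	return true
}

func main() {
	// Replication factor 3: series 1 landed on all replicas, series 2 on one only.
	outcomes := map[string][]seriesID{
		"receive-0": {1, 2},
		"receive-1": {1},
		"receive-2": {1},
	}
	fmt.Println(quorumReached(countSuccessfulWrites(outcomes), 2, 3)) // false: series 2 missed quorum
}
```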
As we discussed, overall this approach seems fine and more understandable than replicating batches. One more part to figure out is the error handling / determination.
On the other hand, I'm uncertain about the performance implications, as we're changing the characteristics of how replication in the receiver works. That's true on both the micro level (we'll now track the replication of each series instead of batches) and the macro level (we'll send fewer but bigger requests). It would be nice to run some of the benchmarks we have for the handler, as well as to see this in action on a cluster with some real traffic or a synthetic load test.
pkg/receive/handler.go
Outdated
if seriesReplicated {
	errs.Add(rerr.Err())
} else if uint64(len(rerr)) >= failureThreshold {
	cause := determineWriteErrorCause(rerr.Err(), quorum)
I think we'll also have to change how we determine the HTTP error we return to the client when this `cause` error bubbles up back to `handleRequest`. Right now, we return the error that occurs the most, or the original multi-error, since we use threshold 1. But this might be incorrect: if the `cause` error for any individual series replication is a server error, we have to retry the whole request. I think the solution would be (see the sketch after this list):
- Return a server error if any of the `cause` errors is an unknown error / unavailable / not ready (cases where we have to retry). A tricky but less important part here is exactly which error to return if we have a mixed bag of server errors; the client's behavior should be the same regardless of the error message we decide to return.
- Otherwise we should only have conflict errors and can return conflict.
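A rough sketch of that decision rule (the sentinel errors and function names here are illustrative, loosely mirroring errConflict / errNotReady / errUnavailable from handler.go, not the PR's exact code): any retryable cause forces a server error so the client re-sends the whole request; otherwise only conflicts remain and 409 is safe.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Illustrative sentinel errors; the real handler has similar ones.
var (
	errConflict    = errors.New("conflict")
	errNotReady    = errors.New("target not ready")
	errUnavailable = errors.New("target not available")
)

// httpStatusFor maps the per-series causes of a request to a single HTTP
// status. If any series failed with a retryable (server-side) cause, the
// whole request must be retried, so 5xx wins over 409.
func httpStatusFor(causes []error) int {
	sawConflict := false
	for _, c := range causes {
		switch {
		case c == nil:
			// series replicated successfully
		case errors.Is(c, errConflict):
			sawConflict = true
		case errors.Is(c, errNotReady), errors.Is(c, errUnavailable):
			return http.StatusServiceUnavailable // retryable: tell the client to resend
		default:
			return http.StatusInternalServerError // unknown error: also retryable
		}
	}
	if sawConflict {
		return http.StatusConflict
	}
	return http.StatusOK
}

func main() {
	fmt.Println(httpStatusFor([]error{nil, errConflict}))            // 409
	fmt.Println(httpStatusFor([]error{errConflict, errUnavailable})) // 503
}
```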
Another thing we should be mindful of here: when `cause` returns the original multi-error (and the same actually applies to the `if` branch above), we are putting a multi-error inside the `errs` multi-error, which can lead to erroneous 5xx responses as described in #5407 (comment).
Yes, you are correct. However, I wonder if we already have this issue in main, because we calculate the top-level cause the same way using threshold=1. So if we have 2 batches with a conflict and 1 batch with a server error, we will return conflict to the user and not retry the request.
In any case, I would also prefer to solve this problem now, since it can lead to data loss.
One thing I am not sure about is what the error code should be when we try to replicate a series and get one success, one server error and one client error. Right now I believe we return a client error, but if we change the rules, we would return a server error. It also means that in the case of 2 conflicts (samples already exist in TSDB) and 1 server error, we would still return a server error, even though that might not be necessary.
Maybe for replicating an individual series we can treat client errors as successes and only return a 5xx when two replicas fail. For the overall response, we can return a 5xx if any series has a 5xx.
Yes, I believe we basically have to treat a conflict as a 'success'. It's just important to return the correct status upstream, so if we have any conflicts in the replication, we'll want to return that to the client. Otherwise 5xx and OK should be clear (5xx if any series fails quorum; OK if there are no failed quorums or conflicts).
Makes sense. I think the `MultiError` and `determineWriteErrorCause` are not good abstractions for this. The `determineWriteErrorCause` function is overloaded and tries to determine the error for both cases.
Because of this, I added two error types, `writeErrors` and `replicationErrors`, with their own `Cause()` methods. The `writeErrors` cause prioritizes server errors, while the one from `replicationErrors` is mostly identical to `determineWriteErrorCause` and is used for determining the error of replicating a single series.
This way we always use the `Cause` method and, depending on the error type, we bubble up the appropriate error.
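A condensed sketch of that split (assumed names and simplified logic, not the exact implementation in this PR): each aggregate error type prioritizes a different cause when asked for `Cause()`.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative sentinel errors.
var (
	errConflict    = errors.New("conflict")
	errUnavailable = errors.New("target not available")
)

// writeErrors aggregates errors from writing a request. Its cause prioritizes
// server errors, because a single retryable failure means the client has to
// resend the request.
type writeErrors struct{ errs []error }

func (w *writeErrors) Add(err error) { w.errs = append(w.errs, err) }

func (w *writeErrors) Cause() error {
	var cause error
	for _, err := range w.errs {
		if !errors.Is(err, errConflict) {
			return err // any non-conflict (server) error wins
		}
		cause = err
	}
	return cause // only conflicts left, or nil
}

// replicationErrors aggregates per-replica failures for one series and
// decides whether quorum was still met, treating conflicts as successes.
type replicationErrors struct {
	errs              []error
	replicationFactor int
}

func (r *replicationErrors) Cause() error {
	conflicts, serverErrs := 0, 0
	for _, err := range r.errs {
		if errors.Is(err, errConflict) {
			conflicts++
		} else {
			serverErrs++
		}
	}
	// Quorum fails only if more than half of the replicas hit server errors.
	if serverErrs > r.replicationFactor/2 {
		return errUnavailable
	}
	if conflicts > 0 {
		return errConflict
	}
	return nil
}

func main() {
	re := &replicationErrors{errs: []error{errConflict, errUnavailable}, replicationFactor: 3}
	fmt.Println(re.Cause()) // conflict: quorum still met (success + conflict), but the conflict is surfaced
}
```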
Force-pushed from 5510b7c to 5383af3
Force-pushed from 12dd5d5 to b0a227b
Signed-off-by: fpetkovski <[email protected]>
Force-pushed from 4c7536b to 0c6087b
Signed-off-by: fpetkovski <[email protected]>
Force-pushed from 2c4ed70 to a19fa93
Here are the benchmark results with the per-series error tracking:
There is a notable difference when we have actual errors, but this is likely expected because we have more errors to work with and more objects to manage.
Signed-off-by: fpetkovski <[email protected]>
Force-pushed from a19fa93 to a08ba98
Signed-off-by: Filip Petkovski <[email protected]>
Latest changes are looking good; differentiating between write and replication errors makes the error handling more digestible 👍 I have a couple more nits, and it would be good if in general we could add a few more comments here and there in the `forward` method to make the 'funneling' from writer errors -> replication errors -> final error a bit more obvious. I'm also wondering, since we now have quite a lot of error handling code, whether it would make sense to extract these types and methods into a separate file (e.g. `receive/errors.go`).
We also load tested the changes with @philipgough on our test cluster but could not see any difference in performance. The microbenchmark runs also look acceptable to me. So performance-wise I'd expect this to be all good.
}

expErrs := expectedErrors{
	{err: errUnavailable, cause: isUnavailable},
Technically, we should not expect unavailable here, as that is expected on node level. I think we can only expect not ready (if TSDB appender is not ready) or conflict.
I think we can still have an unavailable error because write errors can come either from writing to a local TSDB, or from sending a request for replication to a different node. And that node can return unavailable in various different cases:
Lines 837 to 850 in afdb30e
switch determineWriteErrorCause(err, 1) {
case nil:
	return &storepb.WriteResponse{}, nil
case errNotReady:
	return nil, status.Error(codes.Unavailable, err.Error())
case errUnavailable:
	return nil, status.Error(codes.Unavailable, err.Error())
case errConflict:
	return nil, status.Error(codes.AlreadyExists, err.Error())
case errBadReplica:
	return nil, status.Error(codes.InvalidArgument, err.Error())
default:
	return nil, status.Error(codes.Internal, err.Error())
}
If the cause of a `replicationErr` is an unavailable error, then this error will bubble up to the write errors, and we need to be able to detect it.
Got it, you're right, I see the flow now. I got confused because I associated write errors only with the narrow sense (i.e. TSDB write errors), but we're also using them to capture remote-write errors on line 626, which can originate in a node's unavailability, etc.
Signed-off-by: Filip Petkovski <[email protected]>
This PR looks good to me now 👍, great job @fpetkovski.
One more theoretical concern I discussed with @fpetkovski is what effect the increased resource usage for error handling would have in an 'unhappy path' scenario (e.g. some nodes in our hashring are down, or clients keep sending us invalid data, resulting in an increased error rate in the system). Since the microbenchmarks show this could consume ~20% more memory, would that translate to an overall increase in memory usage in a receive replica? Could that lead to further destabilization of the hashring? We could run an additional load test to try out this hypothesis (cc @philipgough).
With this in mind, I'm still happy to go forward and iterate on this solution if any performance issues pop up.
Still I'd also like more eyes on this, nominating @bwplotka @philipgough @douglascamata 😜
@@ -51,196 +47,6 @@ import (
	"github.com/thanos-io/thanos/pkg/testutil"
)

func TestDetermineWriteErrorCause(t *testing.T) {
I wonder if we could replace this with a couple of test cases for the `replicationErrors` and `writeErrors` cause?
Suggesting some changes to variable names to make understanding this code slightly easier.
pkg/receive/handler.go
Outdated
	return err
}
key := endpointReplica{endpoint: endpoint, replica: rn}
er, ok := wreqs[key]
Could this variable named `er` receive a better name? I have no clue what an `er` is, and it's easy to mistake it for `err` and even `endpointReplica` (variables of this type often have the name `er`, which is something else I think we have to slowly move away from).
Makes sense, I renamed this variable to `writeTarget` for clarity.
pkg/receive/handler.go
Outdated
if er.endpoint == h.options.Endpoint {
	go func(er endpointReplica) {
Similar comment here about the `er` variable name, which also applies to other occurrences: it gives no clue what it is in this context and can easily be confused with `err`. Could we rename it? Some suggestions: `replicationKey`, `replicaKey`, `replicationID`, `endpointReplica`.
pkg/receive/handler.go
Outdated
@@ -607,68 +644,41 @@ func (h *Handler) fanoutForward(pctx context.Context, tenant string, wreqs map[e
	tLogger = log.With(h.logger, logTags)
}

-	ec := make(chan error)
+	ec := make(chan writeResponse)
Could `ec` receive a better name? It's used many times in the next hundreds of lines and the name isn't clear. Suggestion: `errorChannel`, if that's even what it actually is. 😅
Renamed to `responses`.
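For context, a minimal sketch of the pattern under discussion (the `writeResponse` shape and the `fanout` helper are assumptions for illustration, not the PR's exact types): fan-out goroutines send one small result per write target over a single channel, and the caller collects them.

```go
package main

import (
	"fmt"
	"sync"
)

// writeResponse is an assumed, simplified shape of what each fan-out
// goroutine reports back: which replica it wrote to and the outcome.
type writeResponse struct {
	endpoint string
	err      error
}

func fanout(endpoints []string, write func(string) error) []writeResponse {
	responses := make(chan writeResponse)
	var wg sync.WaitGroup
	for _, ep := range endpoints {
		wg.Add(1)
		go func(ep string) {
			defer wg.Done()
			responses <- writeResponse{endpoint: ep, err: write(ep)}
		}(ep)
	}
	// Close the channel once all writers are done so the collector loop ends.
	go func() { wg.Wait(); close(responses) }()

	var out []writeResponse
	for r := range responses {
		out = append(out, r)
	}
	return out
}

func main() {
	results := fanout([]string{"receive-0", "receive-1", "receive-2"}, func(ep string) error {
		return nil // pretend every write succeeds
	})
	fmt.Println(len(results), "responses collected")
}
```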
Signed-off-by: Filip Petkovski <[email protected]>
Force-pushed from 38cb138 to 4c870b4
Thanks a lot for the work, @fpetkovski. 🙇
This is a LATM (looks amazing to me)! 🚀
Nice job, especially on tests. LGTM 👍🏽
Although I would really want to batch those requests at some point.
// It will return cause of each contained error but will not traverse any deeper.
func determineWriteErrorCause(err error, threshold int) error {

// errorSet is a set of errors.
type errorSet struct {
Long term, perhaps it would be better to just use `merrors` and some Dedup function?
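As a purely hypothetical illustration of such a dedup step (this is not the `merrors` API; just grouping identical error messages before reporting):

```go
package main

import (
	"errors"
	"fmt"
)

// dedupErrors collapses errors with identical messages and annotates each
// surviving error with how many times it occurred. Purely illustrative.
func dedupErrors(errs []error) []error {
	counts := make(map[string]int)
	order := make([]string, 0, len(errs))
	for _, err := range errs {
		if err == nil {
			continue
		}
		msg := err.Error()
		if counts[msg] == 0 {
			order = append(order, msg)
		}
		counts[msg]++
	}
	out := make([]error, 0, len(order))
	for _, msg := range order {
		if n := counts[msg]; n > 1 {
			out = append(out, fmt.Errorf("%s (x%d)", msg, n))
		} else {
			out = append(out, errors.New(msg))
		}
	}
	return out
}

func main() {
	errs := []error{errors.New("conflict"), errors.New("conflict"), errors.New("not ready")}
	fmt.Println(dedupErrors(errs)) // [conflict (x2) not ready]
}
```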
Let's also consider some compatibility or usage of the official error (un)wrapping coming with Go 1.20: https://tip.golang.org/doc/go1.20#errors
That looks awesome 👍
* Fix quorum calculation for Ketama hashring

  The quorum calculation is currently broken when using the Ketama hashring. The reasons are explained in detail in issue thanos-io#5784. This commit fixes quorum calculation by tracking successful writes for each individual time-series inside a remote-write request. The commit also removes the replicate() method inside the Handler and moves the entire logic of fanning out and calculating success into the fanoutForward() method.

  Signed-off-by: Filip Petkovski <[email protected]>

* Fix error propagation

  Signed-off-by: fpetkovski <[email protected]>

* Fix writer errors

  Signed-off-by: fpetkovski <[email protected]>

* Separate write from replication errors

  Signed-off-by: fpetkovski <[email protected]>

* Add back replication metric

  Signed-off-by: Filip Petkovski <[email protected]>

* Address PR comments

  Signed-off-by: Filip Petkovski <[email protected]>

* Address code review comments

  Signed-off-by: Filip Petkovski <[email protected]>

Signed-off-by: Filip Petkovski <[email protected]>
Signed-off-by: fpetkovski <[email protected]>
The quorum calculation is currently broken when using the Ketama
hashring. The reasons are explained in detail in issue #5784.
This commit fixes quorum calculation by tracking successful writes
for each individual time-series inside a remote-write request.
The commit also removes the replicate() method inside the Handler
and moves the entire logic of fanning out and calculating success
into the fanoutForward() method.
Signed-off-by: Filip Petkovski [email protected]
Fixes #5784