gateway: use keyspace events when buffering requests #8700
Conversation
Signed-off-by: Vicent Marti <[email protected]>
OK, I think this PR is ready for review and merge. I spent all of today implementing tests for the new functionality and I think I was successful. I've split the existing buffering tests in two: the new tests ensure that queries are buffered and do not error out when a Vitess cluster is reparented or resharded. They both use the new Keyspace Events Watcher API for event detection.

The reparenting tests are green when using both the KEW and the old HealthCheck API, but I decided to run them only with the KEW because we're deprecating the old API and these are expensive tests. The resharding tests are green with the KEW and fail with the HealthCheck API, which is, huh, kinda the whole point of this project.

One last thing I noticed while stress testing the resharding operation is another failure case that we never detected and never discussed before: in a high-traffic Vitess cluster with high QPS, it's possible that a query arrives at the VTGate during a resharding operation such that the query is planned for a shard that has just stopped serving. This is easy to reproduce with high enough QPS, and in practice it results in a very short burst of instantly failed queries from the VTGate at the exact moment the reparenting starts. Fortunately, the new Keyspace Event Watcher is smart enough to notice this new corner case, so I've updated the gateway's retry logic to handle it:
// if we have a keyspace event watcher, check if the reason why our primary is not available is that it's currently being resharded
if target.TabletType == topodatapb.TabletType_PRIMARY && gw.kev != nil && gw.kev.TargetIsBeingResharded(target.Keyspace, target.Shard) {
	err = vterrors.Errorf(vtrpcpb.Code_CLUSTER_EVENT, "current keyspace is being resharded")
	continue
}
This is the special handling I was talking about.
Question to make sure I'm following this change correctly: now, in addition to the previously existing logic, it will use these events to retry.
One follow-up question about these retries: the other errors in this block were not that time-sensitive. For instance, if a connection to a tablet failed, the next iteration of the block would use another tablet for the retry. In this case, some amount of time needs to elapse before new primaries become available. Would it make sense to start thinking about exponential backoff between retries?
There is no need for exponential backoff here. In this specific case, we're setting an explicit buffering error, so that the next iteration of the loop will block directly on the buffering code and will not resume until the primary becomes available. This is not a busy loop, so we don't need to back off. 👌
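To illustrate why, here is a minimal self-contained sketch (hypothetical names such as waitForClusterEventToEnd and retrySketch, not the actual TabletGateway code): once an attempt fails with a CLUSTER_EVENT-style error, the next loop iteration parks inside the buffering call until the failover or resharding is over, so the loop never spins.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errClusterEvent stands in for an error with vtrpcpb.Code_CLUSTER_EVENT.
var errClusterEvent = errors.New("cluster event: primary unavailable")

// waitForClusterEventToEnd is a stand-in for the buffering call: it blocks
// until the failover/resharding is over (simulated here with a timer).
func waitForClusterEventToEnd(ctx context.Context) error {
	select {
	case <-time.After(100 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// retrySketch: when the previous attempt failed with a cluster event, the
// next iteration blocks on buffering instead of retrying immediately, which
// is why no exponential backoff is needed.
func retrySketch(ctx context.Context, attempt func() error, retries int) error {
	var err error
	for i := 0; i < retries; i++ {
		if errors.Is(err, errClusterEvent) {
			if werr := waitForClusterEventToEnd(ctx); werr != nil {
				return werr
			}
		}
		if err = attempt(); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	calls := 0
	err := retrySketch(context.Background(), func() error {
		calls++
		if calls == 1 {
			return errClusterEvent // first try lands mid-resharding
		}
		return nil // after buffering, the new shards are serving
	}, 3)
	fmt.Println("result after", calls, "attempts:", err)
}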
Nice, this makes sense.
Am I right that "buffering" a request here does not really put the request into a queue? We queue the entry, and the actual request goroutine just gets blocked until the shard split/failover has finished?
This is correct. What is stored in memory is a watcher-struct which is shared between all the requests to the same target (i.e. shard + keyspace), and the individual requests for the target become individually blocked on their corresponding goroutines until the buffering has finished.
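Roughly this pattern, shown as a self-contained sketch with made-up names (shardBuffer, waitForFailoverEnd; this is not the real buffer package API): one small struct per keyspace/shard is shared by every buffered request, and each request goroutine blocks on its done channel until the keyspace event closes it.

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// shardBuffer is the shared, per-target state: all buffered requests for the
// same keyspace/shard wait on the same done channel.
type shardBuffer struct {
	done chan struct{}
}

type buffer struct {
	mu      sync.Mutex
	buffers map[string]*shardBuffer // key: keyspace/shard
}

func (b *buffer) getOrCreate(keyspace, shard string) *shardBuffer {
	b.mu.Lock()
	defer b.mu.Unlock()
	key := keyspace + "/" + shard
	sb, ok := b.buffers[key]
	if !ok {
		sb = &shardBuffer{done: make(chan struct{})}
		b.buffers[key] = sb
	}
	return sb
}

// waitForFailoverEnd blocks the calling request goroutine until the shared
// watcher is signalled (or the request's context expires). Nothing is queued
// except this small shared struct.
func (b *buffer) waitForFailoverEnd(ctx context.Context, keyspace, shard string) error {
	sb := b.getOrCreate(keyspace, shard)
	select {
	case <-sb.done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// endFailover is what a keyspace event would trigger: closing the channel
// unblocks every goroutine buffered for that target at once.
func (b *buffer) endFailover(keyspace, shard string) {
	close(b.getOrCreate(keyspace, shard).done)
}

func main() {
	b := &buffer{buffers: map[string]*shardBuffer{}}
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			_ = b.waitForFailoverEnd(context.Background(), "commerce", "-80")
			fmt.Println("request", i, "resumed")
		}(i)
	}
	time.Sleep(50 * time.Millisecond)
	b.endFailover("commerce", "-80") // simulated keyspace event
	wg.Wait()
}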
Nice work. Very clean as usual.
I had a couple of nits, otherwise LGTM.
Signed-off-by: Vicent Marti <[email protected]>
Fixed all the tests 🍏. I'll merge at the end of the day in case somebody else wants to review.
Use the new flag
Arriving late to this party. This looks good to me as well. I left a couple of comments that are more for my own curiosity and understanding of the change.
@@ -231,10 +258,22 @@ func (gw *TabletGateway) withRetry(ctx context.Context, target *querypb.Target,
			defer retryDone()
			bufferedOnce = true
		}

		if bufferErr != nil {
Could you provide some context on this change in the order of the logic here? We are checking retryDone before checking the error.
I'm not familiar with the details of the logic here, but this change caught my attention.
This is a small corner-case fix: there are now cases where the buffering code returns both an error and a cancellation function. It's important to defer the cancellation function whenever it's returned, even if we also have an error and must exit the function right away -- not deferring the function would cause a (tiny) memory leak.
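A tiny sketch of the ordering in question (bufferRequest and handleRequest are made-up stand-ins, not the real API): the cleanup function is deferred as soon as it is non-nil, and only then is the error checked.

package main

import (
	"context"
	"errors"
	"fmt"
)

// bufferRequest is a stand-in for the buffering entry point: it may return
// both a cleanup function and an error at the same time.
func bufferRequest(ctx context.Context) (retryDone func(), err error) {
	cleanup := func() { fmt.Println("buffer slot released") }
	return cleanup, errors.New("buffering window expired")
}

func handleRequest(ctx context.Context) error {
	retryDone, bufferErr := bufferRequest(ctx)
	if retryDone != nil {
		// Defer the cleanup *before* looking at the error: even on the error
		// path below, the buffer's bookkeeping must be released, otherwise
		// the slot is leaked.
		defer retryDone()
	}
	if bufferErr != nil {
		return bufferErr
	}
	return nil
}

func main() {
	fmt.Println("err:", handleRequest(context.Background()))
}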
Ah cool. Makes sense as well.
func TestBufferResharding(t *testing.T) {
	t.Run("slow queries", func(t *testing.T) {
		bt := &buffer.BufferingTest{
			Assert:      assertResharding,
			Failover:    reshard02,
			SlowQueries: true,
			VSchema:     vschema,
		}
		bt.Test(t)
	})

	t.Run("fast queries", func(t *testing.T) {
		bt := &buffer.BufferingTest{
			Assert:      assertResharding,
			Failover:    reshard02,
			SlowQueries: false,
			VSchema:     vschema,
		}
		bt.Test(t)
	})
}
Like we have reparenting tests with reserved connections, we should also check resharding with reserved connections.
for _, shard := range ksevent.Shards {
	sb := b.getOrCreateBuffer(shard.Target.Keyspace, shard.Target.Shard)
	if sb != nil {
		sb.recordKeyspaceEvent(shard.Tablet, shard.Serving)
For my own understanding, how would this work when the vtgate gets a healthcheck saying the primary is down? My mental model is that if we detect the primary is not serving via the health check, we should start buffering requests, but it looks like recordKeyspaceEvent
will always call stopBufferingLocked, so nothing will be buffered?
What am I missing?
We actually don't trigger any buffering operations when we receive the healthcheck failure for a primary! That's because these healthchecks are processed by the topology engine, which often lags behind the actual availability issue. Instead, what we do is start buffering once the vtgate
itself fails to reach the primary (i.e. when we get an error back during a request), and the Keyspace Events handler is designed to detect when the availability incident is over cluster-wide.
Thanks @vmg that makes sense.
start buffering once the vtgate itself fails to reach the primary (i.e. when we get an error return during a request)
it looks like the retry is triggered on these 3 kinds of error codes: https://github.com/vitessio/vitess/blob/main/go/vt/vttablet/queryservice/wrapped.go#L75
My current understanding is that vttablet maps to those error codes here: Code_FAILED_PRECONDITION here, Code_UNAVAILABLE here and Code_CLUSTER_EVENT here; they don't seem related to a primary failure. The other place where we return Code_CLUSTER_EVENT
is when the primary tablet has a !serving servingState, which only happens after a reparent event.
I think I'm missing some details here on the failover behavior, could you shed some light on it? Thanks
The error handling on the tablet was wired up by @harshit-gangal, so I don't know all the details, but the actual buffering code is triggered with this check, so the only relevant error code when it comes to buffering is CLUSTER_EVENT:
vitess/go/vt/vtgate/buffer/buffer.go
Lines 72 to 75 in ca5a27e

func CausedByFailover(err error) bool {
	log.V(2).Infof("Checking error (type: %T) if it is caused by a failover. err: %v", err, err)
	return vterrors.Code(err) == vtrpcpb.Code_CLUSTER_EVENT
}
...In retrospect, this function could use a different name, since CLUSTER_EVENT can also be caused by a resharding operation. 😅
Got it. Looking at the check here, it looks like the buffering logic only covers the window after a reparent (but before the vtgate gets notified via the health check), when a vttablet would return Code_CLUSTER_EVENT, since replHealthy
is always true for a primary tablet and the serving state only changes via a reparent event. Does that sound right to you?
}

for shard, sstate := range kss.shards {
	if sstate.serving && !activeShardsInPartition[shard] {
cosmetic nit: it seems this won't happen because of the early return at L209
That's not accurate: the first check at line 209 iterates through all the shards that the topology service knows about, making sure we already know about them and know them to be healthy. This second loop iterates through all the shards that we know about, to make sure there is no healthy shard that we know about but the topology service doesn't.
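A condensed sketch of the two complementary checks (simplified, hypothetical names; not the actual keyspace_events.go code): the first loop verifies that every shard the topology server reports is known locally and serving, and the second verifies that no locally-serving shard has dropped out of the topology partition.

package main

import "fmt"

// isConsistentSketch mirrors the two complementary checks discussed above:
// topoShards is the shard list the topology server reports for the primary
// partition, localServing is what the watcher currently believes is serving.
func isConsistentSketch(topoShards []string, localServing map[string]bool) bool {
	inTopo := make(map[string]bool, len(topoShards))

	// First check: every shard the topology service knows about must already
	// be known locally and be serving, otherwise the keyspace isn't consistent.
	for _, shard := range topoShards {
		inTopo[shard] = true
		if !localServing[shard] {
			return false
		}
	}

	// Second check: no shard we still consider serving may be missing from
	// the topology partition (e.g. a source shard retired by the resharding).
	for shard, serving := range localServing {
		if serving && !inTopo[shard] {
			return false
		}
	}
	return true
}

func main() {
	// Mid-resharding: topo already lists the new shards, but the old source
	// shard "0" is still marked serving locally.
	fmt.Println(isConsistentSketch(
		[]string{"-80", "80-"},
		map[string]bool{"-80": true, "80-": true, "0": true},
	)) // false
	fmt.Println(isConsistentSketch(
		[]string{"-80", "80-"},
		map[string]bool{"-80": true, "80-": true, "0": false},
	)) // true
}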
activeShardsInPartition := make(map[string]bool)
for _, shard := range primary.ShardReferences {
	sstate := kss.shards[shard.Name]
	if sstate == nil || !sstate.serving {
Just to make sure I understand it correctly: this is only for shard split cases, right? i.e., when we mark the source shard as "not_serving", it will be reflected in the srv keyspace, and we only care about the "end" of the shard split; therefore we can early-return at L209.
I've documented this behavior in #8890 -- it explains all the different consistency checks.
// If result is nil it must mean the channel has been closed. Stop goroutine in that case
bufferCancel()
gw.setupBuffering(ctx)
gw.QueryService = queryservice.Wrap(nil, gw.withRetry)
I see we rely on the srv keyspace to detect shard split events; how would a failover be handled here? Say we have a failover in a shard from A to B:
t1: A becomes problematic / not responding
t2: Orchestrator detects the problem and does an external reparent from A to B
t3: the healthcheck detects the reparent event and sets the primary of the shard to B
After t3, requests should be handled properly. I'm wondering what the mechanism is (if one exists) to handle / buffer requests before t3. I'm asking because if we can buffer as many requests as possible between t1 and t3, in theory we should have higher availability.
I see the tablet has logic to return Code_CLUSTER_EVENT when it is not serving.
Since replHealthy should always be true for a primary (or is it?), in order to trigger buffering in the vtgate, the orchestrator needs to modify sm.state on the old primary (A) at t2 so that the vtgate can buffer requests from t2 onward. Is my understanding correct, or is sm.state already set somehow magically by vttablet when the primary is unhealthy?
Description
Fixes #8462 (lol, hopefully).
Fixes #7059
Fixes #7061
Alright, this is the first draft of a solution for the dreaded request buffering issue. After a lot of investigation, we've arrived at this approach to fix the issue.
Summarizing:
- Replaces the old HealthCheck-based approach for finding out when a failover is finished with a brand new KeyspaceEventWatcher.
- The KeyspaceEventWatcher is a new implementation that augments HealthCheck events with metadata from the topology server for keyspace changes. This allows it to process primary promotion events holistically for the whole keyspace, so when these events are part of a resharding operation, they are only reported once the whole keyspace has been properly sharded and all the new shards are healthy and serving.

As far as we can tell (and as far as I can test manually), this implementation can resume buffering after any kind of resharding event. Notably, however, it does not support MoveTable events, on which we're punting for now.

The implementation has a bit of duplicated code in Buffer because I wanted to leave in the old implementation, making the two swappable while we gain confidence in the new KeyspaceEvent-based buffering code.

I need a drink.
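To make the "holistic" reporting concrete, here is a rough sketch under assumed names (nothing here is the real KeyspaceEventWatcher API): per-shard health updates are folded into a per-keyspace view, and subscribers are only notified once every shard in the keyspace is serving again.

package main

import (
	"fmt"
	"sync"
)

// keyspaceWatcherSketch aggregates per-shard serving state and reports a
// single "keyspace is consistent again" event, instead of one event per
// primary promotion.
type keyspaceWatcherSketch struct {
	mu      sync.Mutex
	name    string
	serving map[string]bool // shard name -> serving
	subs    []chan string   // subscribers, notified with the keyspace name
}

func newWatcher(name string, shards []string) *keyspaceWatcherSketch {
	w := &keyspaceWatcherSketch{name: name, serving: map[string]bool{}}
	for _, s := range shards {
		w.serving[s] = false
	}
	return w
}

func (w *keyspaceWatcherSketch) Subscribe() <-chan string {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan string, 1)
	w.subs = append(w.subs, ch)
	return ch
}

// OnHealthUpdate stands in for a HealthCheck event: it records the new
// per-shard state and only broadcasts once the whole keyspace is serving,
// which is what lets a resharding operation be reported as a single event.
func (w *keyspaceWatcherSketch) OnHealthUpdate(shard string, serving bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.serving[shard] = serving
	for _, s := range w.serving {
		if !s {
			return // keyspace not consistent yet: stay quiet
		}
	}
	for _, ch := range w.subs {
		select {
		case ch <- w.name:
		default:
		}
	}
}

func main() {
	w := newWatcher("commerce", []string{"-80", "80-"})
	done := w.Subscribe()
	w.OnHealthUpdate("-80", true) // only half of the keyspace is serving
	w.OnHealthUpdate("80-", true) // now the whole keyspace is consistent
	fmt.Println("keyspace consistent:", <-done)
}

In the real implementation the shard set itself comes from the topology server, which is what allows the watcher to treat a resharding operation as one keyspace-wide event rather than a series of per-shard promotions.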
cc @deepthi @harshit-gangal @sougou
Related Issue(s)
Checklist
Deployment Notes