
gateway: use keyspace events when buffering requests #8700

Merged: 9 commits merged into vitessio:main on Sep 2, 2021

Conversation

@vmg (Collaborator) commented Aug 27, 2021

Description

Fixes #8462 (lol, hopefully).
Fixes #7059
Fixes #7061

Alright, this is the first draft of a solution for the dreaded request buffering issue. After a lot of investigation, we've arrived at this approach to fix the issue.

Summarizing:

  • We're still performing the request buffering at the shard level, like we were before, but we've replaced the old HealthCheck-based approach to find out when a failover is finished with a brand new KeyspaceEventWatcher.
  • The KeyspaceEventWatcher is a new implementation that augments HealthCheck events with metadata from the topology server for keyspace changes. This allows it to process primary promotion events holistically for the whole keyspace, so when these events are part of a resharding operation, they are only reported once the whole keyspace has been properly resharded and all the new shards are healthy and serving.
  • Because of this, the Buffer code can now distinguish between plain failovers for a single primary (whose buffered requests can be retried) and primary promotions that are part of a resharding operation (whose buffered requests cannot be retried as-is, because the shard they were targeting is gone once the resharding completes).
  • When the latter situation arises, the buffering code reports a special error code that is handled in the execution engine. This handling has been implemented at the primitive level (as opposed to the initial approach that intended to handle this at the plan generation level). This means that the buffered query is only re-executed in the new shard for the primitive subquery, and not for the whole plan -- this is required for handling some corner cases; see engine: add support for finding a plan's affected shards #8681 for examples. (A rough sketch of this primitive-level retry follows right after this list.)
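
Below is a minimal sketch of the primitive-level retry described in the last bullet. It is not the actual Vitess implementation: all names here (ErrShardGone, Primitive, Resolver, ExecuteWithReshardRetry) are illustrative stand-ins for the real CLUSTER_EVENT error code and the engine types.

package sketch

import (
	"context"
	"errors"
)

// ErrShardGone stands in for the special error the buffer reports when a
// buffered request can no longer be retried on its original shard because the
// keyspace was resharded underneath it.
var ErrShardGone = errors.New("shard no longer exists: keyspace was resharded")

// Primitive is a minimal stand-in for an execution-engine primitive.
type Primitive interface {
	Execute(ctx context.Context, shards []string, query string) (string, error)
}

// Resolver re-resolves which shards should serve a query after a keyspace change.
type Resolver func(ctx context.Context, keyspace string) ([]string, error)

// ExecuteWithReshardRetry runs a single primitive and, if the buffer reports
// that the targeted shards vanished mid-flight, re-resolves the destination
// shards and re-executes only this primitive's subquery -- not the whole plan.
func ExecuteWithReshardRetry(ctx context.Context, p Primitive, resolve Resolver, keyspace, query string) (string, error) {
	shards, err := resolve(ctx, keyspace)
	if err != nil {
		return "", err
	}
	res, err := p.Execute(ctx, shards, query)
	if errors.Is(err, ErrShardGone) {
		// The keyspace finished resharding while the request was buffered:
		// the old shard is gone, so re-resolve and retry just this subquery.
		newShards, rerr := resolve(ctx, keyspace)
		if rerr != nil {
			return "", rerr
		}
		return p.Execute(ctx, newShards, query)
	}
	return res, err
}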

As far as we can tell (and as far as I can test manually), this implementation can resume buffering after any kind of resharding event. Notably, however, it does not support MoveTables events, which we're punting on for now.

The implementation has a bit of duplicated code in the Buffer because I wanted to leave the old implementation in place so the two are swappable while we gain confidence in the new KeyspaceEvent-based buffering code.

I need a drink.

cc @deepthi @harshit-gangal @sougou

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@vmg added labels Component: Cluster management, Component: Query Serving, Type: Enhancement (Logical improvement, somewhere between a bug and feature) on Aug 30, 2021
@vmg marked this pull request as ready for review on August 30, 2021 at 15:26
@vmg mentioned this pull request on Aug 31, 2021
@vmg (Collaborator Author) commented Sep 1, 2021

OK, I think this PR is ready for review and merge. I spent all of today implementing tests for the new functionality and I think I was successful. I've split the existing tabletgateway/buffer end-to-end tests into "reparenting" and "resharding" tests, to verify both behaviors.

The new tests ensure that queries are buffered and do not error out when a Vitess cluster is reparented or resharded. They both use the new Keyspace Events Watcher API for event detection. The reparenting tests are green when using the KEW and the old HealthCheck API, but I decided to run them only with the KEW because we're deprecating the old API and these are expensive tests. The resharding tests are green with the KEW and fail with the HealthCheck API, which is, huh, kinda the whole point of this project.

One last thing I noticed while stress testing the resharding operation is another failure case that we never detected and never discussed before: in a high-traffic Vitess cluster with high QPS, it's possible for a query to arrive at the VTGate during a resharding operation such that the query is planned for shard 0, but that shard has already been marked as unhealthy by the resharding operation by the time the query is executed. This would usually cause TabletGateway to fail fast with an UNAVAILABLE error that is not retried, since the topology server would return no shards capable of serving the query, and the buffering code would not be aware (yet) of any buffering events for the shard (as the shard was removed before any queries could fail on it and hence start the buffering process).

This is easy to reproduce with high enough QPS, and in practice it results in a very short burst of instantly failed queries from the VTGate at the exact moment the reparenting starts. Fortunately, the new Keyspace Event Watcher is smart enough to notice this new corner case, so I've updated TabletGateway to handle it explicitly: if there are no shards available to serve the query, but the KEW knows that no shards are healthy because the keyspace is currently being resharded, we'll buffer the query all the same and retry it at the end of the event so that it lands in the new shard. 👌

Comment on lines +272 to +276
// if we have a keyspace event watcher, check if the reason why our primary is not available is that it's currently being resharded
if target.TabletType == topodatapb.TabletType_PRIMARY && gw.kev != nil && gw.kev.TargetIsBeingResharded(target.Keyspace, target.Shard) {
	err = vterrors.Errorf(vtrpcpb.Code_CLUSTER_EVENT, "current keyspace is being resharded")
	continue
}
@vmg (Collaborator Author):

This is the special handling I was talking about.

Member:

Question to make sure I'm following this change correctly: now, in addition to the previously existing logic, it will use these events to retry.

One follow-up question about these retries: the other errors in this block were not that time-sensitive. For instance, if a connection to a tablet failed, in the next iteration of the block another tablet would be used for the retry. In this case, some amount of time needs to elapse before new primaries become available. Would it make sense to start thinking of exponential backoff between retries?

@vmg (Collaborator Author):

There is no need for exponential backoff here. In this specific case, we're setting an explicit buffering error, so the next iteration of the loop will block directly on the buffering code and will not resume until the primary becomes available. This is not a busy loop, so we don't need to back off. 👌
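
A rough sketch of the loop shape being described (illustrative, not the actual withRetry code): setting a "cluster event" error and continuing means the next iteration blocks inside the buffer until the event is over, which is why no backoff is needed.

package sketch

import (
	"context"
	"errors"
)

var errClusterEvent = errors.New("cluster event in progress")

// withRetry retries execute up to maxAttempts times. When the previous attempt
// failed with a cluster-event error, the next iteration first blocks in
// waitForEventEnd until the failover/resharding is over -- so the loop never spins.
func withRetry(ctx context.Context, maxAttempts int,
	waitForEventEnd func(context.Context) error,
	execute func(context.Context) error,
) error {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if errors.Is(lastErr, errClusterEvent) {
			// Blocks until the keyspace event watcher reports the primary is back.
			if err := waitForEventEnd(ctx); err != nil {
				return err
			}
		}
		if err := execute(ctx); err != nil {
			lastErr = err
			continue
		}
		return nil
	}
	return lastErr
}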

Member:

Nice, this makes sense.

Member:

Am I right that the "buffer" request here does not really put a "request" into a queue; we are queueing the entry, and the actual request goroutine just gets blocked until the shard split/failover has finished?

@vmg (Collaborator Author):

This is correct. What is stored in memory is a watcher-struct which is shared between all the requests to the same target (i.e. shard + keyspace), and the individual requests for the target become individually blocked on their corresponding goroutines until the buffering has finished.
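
A minimal sketch of that mechanism, with assumed names (not the actual buffer.go types): all requests for the same keyspace/shard share one entry, and each request's goroutine blocks on a channel that is closed when the event ends.

package sketch

import (
	"context"
	"sync"
)

type shardBuffer struct {
	done chan struct{} // closed when the failover/resharding event is over
}

type Buffer struct {
	mu      sync.Mutex
	entries map[string]*shardBuffer // keyed by "keyspace/shard"
}

// wait blocks the calling request until buffering for the target has finished
// (or the request's context expires).
func (b *Buffer) wait(ctx context.Context, keyspace, shard string) error {
	key := keyspace + "/" + shard

	b.mu.Lock()
	if b.entries == nil {
		b.entries = make(map[string]*shardBuffer)
	}
	sb, ok := b.entries[key]
	if !ok {
		sb = &shardBuffer{done: make(chan struct{})}
		b.entries[key] = sb
	}
	b.mu.Unlock()

	select {
	case <-sb.done:
		return nil // event is over; the caller retries against the new primary
	case <-ctx.Done():
		return ctx.Err()
	}
}

// stopBuffering is called when the keyspace event watcher reports that the
// keyspace is consistent again; it unblocks every waiting request at once.
func (b *Buffer) stopBuffering(keyspace, shard string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	key := keyspace + "/" + shard
	if sb, ok := b.entries[key]; ok {
		close(sb.done)
		delete(b.entries, key)
	}
}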

@deepthi (Member) left a comment:

Nice work. Very clean as usual.
I had a couple of nits, otherwise LGTM.

go/vt/vtgate/buffer/buffer.go — review thread resolved
go/vt/vtgate/buffer/flags.go — review thread resolved
go/vt/vtgate/gateway.go — review thread resolved
@deepthi (Member) commented Sep 2, 2021

Cluster test tabletgateway_buffer_reshard is failing right now; that needs to be resolved.

@vmg (Collaborator Author) commented Sep 2, 2021

Fixed all the tests 🍏, I'll merge at the end of the day in case somebody else wants to review.

@vmg vmg merged commit 990d49e into vitessio:main Sep 2, 2021
@deepthi (Member) commented Sep 2, 2021

Use the new flag buffer_implementation=keyspace_events to enable this feature.
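
For illustration, here is a minimal sketch of how such a string flag could gate the implementation choice. The flag name and the keyspace_events value come from the comment above; the default value and the wiring shown here are assumptions, not the actual vtgate code.

package main

import (
	"flag"
	"fmt"
)

// bufferImplementation selects the buffering backend; "keyspace_events"
// enables the new KeyspaceEventWatcher-based buffering described in this PR.
var bufferImplementation = flag.String("buffer_implementation", "healthcheck",
	"buffering implementation to use: healthcheck (legacy) or keyspace_events")

func main() {
	flag.Parse()
	switch *bufferImplementation {
	case "healthcheck":
		fmt.Println("using legacy HealthCheck-based buffering")
	case "keyspace_events":
		fmt.Println("using KeyspaceEventWatcher-based buffering")
	default:
		fmt.Printf("unknown buffer_implementation %q\n", *bufferImplementation)
	}
}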

@rafael (Member) left a comment:

Arriving late to this party. This looks good to me as well. I left a couple of comments that are more for my own curiosity and understanding of the change.

@@ -231,10 +258,22 @@ func (gw *TabletGateway) withRetry(ctx context.Context, target *querypb.Target,
defer retryDone()
bufferedOnce = true
}

if bufferErr != nil {
Member:

Could you provide some context on this change in the order of the logic here? We are checking retryDone before checking the error.

I'm not familiar with the details of the logic here, but this change caught my attention.

@vmg (Collaborator Author):

This is a small corner-case fix: there are now cases where the buffering code returns both an error and a cancellation function. It's important to defer the cancellation function whenever it's returned, even if we also have an error and must exit the function right away -- not deferring the function would cause a (tiny) memory leak.
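
A small sketch of that ordering (illustrative names, not the actual TabletGateway code): the buffering call can return both a cleanup callback and an error, and the callback must be deferred before the error is checked so that it still runs on the error path.

package sketch

import "context"

type retryDoneFunc func()

// waitForFailoverEnd stands in for the buffering call; it may return a non-nil
// cleanup function together with a non-nil error.
func waitForFailoverEnd(ctx context.Context, keyspace, shard string) (retryDoneFunc, error) {
	cleanup := func() { /* release the buffered-request slot */ }
	return cleanup, ctx.Err()
}

func executeBuffered(ctx context.Context, keyspace, shard string) error {
	retryDone, bufferErr := waitForFailoverEnd(ctx, keyspace, shard)
	if retryDone != nil {
		// Defer first: even if bufferErr is non-nil and we bail out below,
		// the cleanup still runs and nothing is leaked.
		defer retryDone()
	}
	if bufferErr != nil {
		return bufferErr
	}
	// ... retry the query against the (new) primary ...
	return nil
}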

Member:

Ah cool. Makes sense as well.


Comment on lines +82 to +102
func TestBufferResharding(t *testing.T) {
	t.Run("slow queries", func(t *testing.T) {
		bt := &buffer.BufferingTest{
			Assert:      assertResharding,
			Failover:    reshard02,
			SlowQueries: true,
			VSchema:     vschema,
		}
		bt.Test(t)
	})

	t.Run("fast queries", func(t *testing.T) {
		bt := &buffer.BufferingTest{
			Assert:      assertResharding,
			Failover:    reshard02,
			SlowQueries: false,
			VSchema:     vschema,
		}
		bt.Test(t)
	})
}
Member:

Like we have reparenting tests with reserved connections, we should also check resharding with reserved connections.

for _, shard := range ksevent.Shards {
	sb := b.getOrCreateBuffer(shard.Target.Keyspace, shard.Target.Shard)
	if sb != nil {
		sb.recordKeyspaceEvent(shard.Tablet, shard.Serving)
Member:

For my own understanding, how would this work when vtgate gets a healthcheck saying the primary is down? My mental model is that if we detect the primary is not serving via the health check, we should start buffering requests, but it looks like recordKeyspaceEvent will always call stopBufferingLocked, so nothing will be buffered?

What am I missing?

@vmg (Collaborator Author):

We actually don't trigger any buffering operations when we receive the healthcheck failure for a primary! That's because these healthchecks are processed by the topology engine, which often lags behind the actual availability issue. Instead, what we do is start buffering once the vtgate itself fails to reach the primary (i.e. when we get an error returned during a request), and the Keyspace Events handler is designed to detect when the availability incident is over cluster-wide.

Member:

Thanks @vmg that makes sense.

start buffering once the vtgate itself fails to reach the primary (i.e. when we get an error return during a request)

it looks like the retry is triggered on these 3 kinds of error codes: https://github.com/vitessio/vitess/blob/main/go/vt/vttablet/queryservice/wrapped.go#L75

My current understanding is that vttablet maps to those error codes here: Code_FAILED_PRECONDITION here, Code_UNAVAILABLE here and Code_CLUSTER_EVENT here -- they don't seem related to a primary failure. The other place where we return Code_CLUSTER_EVENT is when the primary tablet has a !serving servingState, which only happens after a reparent event.

I think I'm missing some details here on the failover behavior, could you shed some light on it? Thanks

@vmg (Collaborator Author):

The error handling on the tablet was wired up by @harshit-gangal, so I don't know all the details, but the actual buffering code is triggered with this check, so the only relevant error code when it comes to buffering is CLUSTER_EVENT:

func CausedByFailover(err error) bool {
	log.V(2).Infof("Checking error (type: %T) if it is caused by a failover. err: %v", err, err)
	return vterrors.Code(err) == vtrpcpb.Code_CLUSTER_EVENT
}

@vmg (Collaborator Author):

...In retrospect, this function could use a different name since CLUSTER_EVENT can also be caused by a resharding operation. 😅

Member:

Got it. Looking at the check here, it looks like the buffering logic only covers the window after a reparent (but before vtgate gets notified via the health check), when a vttablet would return Code_CLUSTER_EVENT, since replHealthy is always true for a primary tablet and the serving state only changes via a reparent event. Does that sound right to you?

}

for shard, sstate := range kss.shards {
	if sstate.serving && !activeShardsInPartition[shard] {
Member:

cosmetic nit: it seems like this can't happen because of the early return at L209

@vmg (Collaborator Author):

That's not accurate: the first check at line 209 iterates through all the shards that the topology service knows about, making sure we already know about them and that we know them to be healthy. This second loop iterates through all the shards that we know about, to make sure there are no healthy shards that we know about but the topology service doesn't.
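
A compact sketch of those two complementary checks (assumed names and shapes, not the actual KeyspaceEventWatcher code): the first loop verifies that every shard in the topology view is tracked and serving locally, and the second verifies that no locally-serving shard is missing from the topology view.

package sketch

// keyspaceIsConsistent reports whether the local shard state and the topology
// service's shard list agree; only then is the keyspace event considered over.
func keyspaceIsConsistent(topoShards []string, servingLocally map[string]bool) bool {
	inTopo := make(map[string]bool, len(topoShards))

	// 1) every shard the topology service knows about must be known and serving here
	for _, name := range topoShards {
		inTopo[name] = true
		if !servingLocally[name] {
			return false
		}
	}

	// 2) no shard we consider serving may be missing from the topology view
	for name, serving := range servingLocally {
		if serving && !inTopo[name] {
			return false
		}
	}
	return true
}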

activeShardsInPartition := make(map[string]bool)
for _, shard := range primary.ShardReferences {
	sstate := kss.shards[shard.Name]
	if sstate == nil || !sstate.serving {
Member:

Just to make sure I understand it correctly: this is only for shard split cases, right? i.e., when we mark the source shard as "not_serving", it will be reflected in the srv keyspace, and we only care about the "end" of the shard split -- therefore we can early-return at L209

@vmg (Collaborator Author):

I've documented this behavior in #8890 -- it explains all the different consistency checks.


// If result is nil it must mean the channel has been closed. Stop goroutine in that case
bufferCancel()
gw.setupBuffering(ctx)
gw.QueryService = queryservice.Wrap(nil, gw.withRetry)
@5antelope (Member) commented Sep 24, 2021:

I see we rely on the srv keyspace to detect shard split events; how would a failover be handled here? Say we have a failover in a shard from A to B:
t1: A becomes problematic / not responding
t2: Orchestrator detects the problem and does an external reparent from A to B
t3: the healthcheck detects the reparent event and sets the primary of the shard to B

After t3, requests should be handled properly. I'm wondering what the mechanism (if it exists) is to handle / buffer requests before t3. I'm asking because if we can buffer as many requests as possible between t1 and t3, in theory we should have higher availability.

I see the tablet has logic to return Code_CLUSTER_EVENT when it is not serving.

Since replHealthy should always be true for a primary (or is it?), in order to trigger buffering in vtgate, the orchestrator needs to modify sm.state on the old primary (A) at t2 so that vtgate can buffer requests from t2. Is my understanding correct, or is sm.state somehow set magically by vttablet already when the primary is unhealthy?
