Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvcoord: fix rangefeed retries on transport errors #66910

Merged
merged 2 commits into from
Jun 29, 2021

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Jun 25, 2021

rangecache: consider FailedPrecondition a retryable error

In #66199, the gRPC error FailedPrecondition (returned when contacting
a decommissioned node) was incorrectly considered a permanent error
during range descriptor lookups, via IsRangeLookupErrorRetryable().
These lookups are all trying to reach a meta-range (addressed by key),
so when encountering a decommissioned node they should evict the cache
token and retry the lookup.

This patch reverses that change, making only authentication errors
permanent (i.e. when the local node is decommissioned or otherwise not
part of the cluster).

Release note: None

kvcoord: fix rangefeed retries on transport errors

DistSender.RangeFeed() was meant to retry transport errors after
refreshing the range descriptor (invalidating the cached entry).
However, due to an incorrect error type check (*sendError vs
sendError), these errors failed the range feed without invalidating
the cached range descriptor.

This was particularly severe in cases where a large number of nodes had
been decommissioned, where some stale range descriptors on some nodes
contained only decommissioned nodes. Since change feeds set up range
feeds across many nodes and ranges in the cluster, they are likely to
encounter these decommissioned nodes and return an error -- and since
the descriptor cache wasn't invalidated they would keep erroring until
the nodes were restarted such that the caches were flushed (often
requiring a full cluster restart).

Resolves #66636, touches #66586.

Release note (bug fix): Change feeds now properly invalidate cached
range descriptors and retry when encountering decommissioned nodes.

/cc @cockroachdb/kv

In cockroachdb#66199, the gRPC error `FailedPrecondition` (returned when contacting
a decommissioned node) was incorrectly considered a permanent error
during range descriptor lookups, via `IsRangeLookupErrorRetryable()`.
These lookups are all trying to reach a meta-range (addressed by key),
so when encountering a decommissioned node they should evict the cache
token and retry the lookup.

This patch reverses that change, making only authentication errors
permanent (i.e. when the local node is decommissioned or otherwise not
part of the cluster).

Release note: None
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, 9 of 9 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)

@erikgrinaker erikgrinaker force-pushed the rangefeed-retry branch 4 times, most recently from 466bd00 to 8355fc9 Compare June 26, 2021 17:20
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: nice, thank you for going the extra mile with mockgen! I know you've struggled with getting this to play ball with bazel, is this going to be the same amount of pain for the next person trying to use mockgen elsewhere? If so, I wonder where a good place would be to document how this works.

Reviewed 1 of 1 files at r1, 3 of 9 files at r2, 13 of 13 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aliher1911 and @erikgrinaker)


pkg/kv/kvclient/kvcoord/BUILD.bazel, line 73 at r3 (raw file):

)

# keep

Missing words? Or some secret bazel stuff I don't know about.


pkg/kv/kvclient/kvcoord/BUILD.bazel, line 85 at r3 (raw file):

)

gomock(

Uh what, why did you need that.


pkg/kv/kvclient/rangefeed/BUILD.bazel, line 47 at r3 (raw file):

    name = "mocks_rangefeed",
    out = "mocks_generated.go",
    interfaces = [""],  # required, yet ignored when using source -- bug?

:(

Did you want to ask Ricky for a review on the bazel-specific parts of this PR?


pkg/kv/kvclient/rangefeed/rangefeed_mock_test.go, line 346 at r3 (raw file):

	// Make sure scan failure gets retried.
	db.EXPECT().Scan(gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any()).Return(errors.New("scan failed"))
	db.EXPECT().Scan(gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any()).Return(nil)

This (and below) can't be chained into the invocation above, right? .Return(x).Return(y)?


pkg/kv/kvclient/rangefeed/rangefeed_mock_test.go, line 348 at r3 (raw file):

	db.EXPECT().Scan(gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any()).Return(nil)

	// Make sure rangefeed is retried even after 3 failures, then succeed and cancel context.

(which is the signal for the range feed to shut down gracefully)?


pkg/kv/kvclient/rangefeed/rangefeed_mock_test.go, line 370 at r3 (raw file):

	select {
	case <-ctx.Done():
	case <-time.After(time.Second):

10s because race builds?

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this going to be the same amount of pain for the next person trying to use mockgen elsewhere?

Hopefully the next person can just grep for gomock and follow the same pattern.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @rickystewart, and @tbg)


pkg/kv/kvclient/kvcoord/BUILD.bazel, line 73 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Missing words? Or some secret bazel stuff I don't know about.

Secret Bazel stuff. Gazelle, which autogenerates Bazel build files from Go source files, doesn't handle the gomock target libraries properly so it will go ahead and replace all of this stuff with its own targets. # keep tells Gazelle to back off. This also means that we can't get Gazelle to manage dependencies for these targets, so they must be updated by hand.


pkg/kv/kvclient/kvcoord/BUILD.bazel, line 85 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Uh what, why did you need that.

Bazel doesn't use the go:generate comments, it wants to handle all code generation and dependency tracking itself, so we need separate code generation mechanisms for Bazel and the Go build chain.


pkg/kv/kvclient/rangefeed/BUILD.bazel, line 47 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

:(

Did you want to ask Ricky for a review on the bazel-specific parts of this PR?

Good idea. @rickystewart mind looking over the Bazel bits here? The purpose is to autogenerate test mock objects from interface definitions.


pkg/kv/kvclient/rangefeed/rangefeed_mock_test.go, line 346 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This (and below) can't be chained into the invocation above, right? .Return(x).Return(y)?

This can (updated), but not the one below because of the Times(3) thing.

`DistSender.RangeFeed()` was meant to retry transport errors after
refreshing the range descriptor (invalidating the cached entry).
However, due to an incorrect error type check (`*sendError` vs
`sendError`), these errors failed the range feed without invalidating
the cached range descriptor.

This was particularly severe in cases where a large number of nodes had
been decommissioned, where some stale range descriptors on some nodes
contained only decommissioned nodes. Since change feeds set up range
feeds across many nodes and ranges in the cluster, they are likely to
encounter these decommissioned nodes and return an error -- and since
the descriptor cache wasn't invalidated they would keep erroring until
the nodes were restarted such that the caches were flushed (often
requiring a full cluster restart).

Release note (bug fix): Change feeds now properly invalidate cached
range descriptors and retry when encountering decommissioned nodes.
Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @rickystewart, and @tbg)


pkg/kv/kvclient/rangefeed/rangefeed_mock_test.go, line 346 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This can (updated), but not the one below because of the Times(3) thing.

Actually no, the chained return replaces the previous definition.

Copy link
Contributor

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rickystewart and @tbg)


pkg/kv/kvclient/rangecache/range_cache.go, line 1329 at r1 (raw file):

	// Auth errors are not retryable. These imply that the local node has been
	// decommissioned or is otherwise not part of the cluster.
	return !grpcutil.IsAuthError(err)

The original motivation was that since range lookup goes through DistSender.sendToReplicas to get range it would retry FailedPrecondition and evict replicas, but the error code itself would never bubble up. So there would be no need to handle it specifically anywhere above. I'm probably missing invalidation case here.

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rickystewart and @tbg)


pkg/kv/kvclient/rangecache/range_cache.go, line 1329 at r1 (raw file):

Previously, aliher1911 (Oleg) wrote…

The original motivation was that since range lookup goes through DistSender.sendToReplicas to get range it would retry FailedPrecondition and evict replicas, but the error code itself would never bubble up. So there would be no need to handle it specifically anywhere above. I'm probably missing invalidation case here.

Sure, that's probably true, but the logic is a bit backwards -- if it ever did see a FailedPrecondition it shouldn't consider it permanent, so there's no reason to mark it as permanent.

irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jun 28, 2021
Exploring an alternative to cockroachdb#66910.

Release note: None
Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the bazel stuff, something looks off. After trying it out a bit locally, I can see why this PR was so cumbersome with Bazel. So the way go:generate mockgen stuff is setup here will not play well with Bazel. Here's why:

  • To generate mock_generated.go in pkg/kv/kvclient/kvcoord, we depend on the code already present in pkg/kv/kvclient/kvcoord.
  • The test that uses these mocks, TestDistSenderRangeFeedRetryOnTransportErrors, is also inkvcoord.
  • There's a dependency cycle! kvcoord (TestDistSenderRangeFeedRetryOnTransportErrors) depends on mocks (generated with package name kvcoord) which in turn depends on kvcoord.

The reason go:generate stuff was comparatively easier to write is that it'll run the generation script using an on-disk image of the kvcoord package, to generate more stuff in kvcoord, and then the tests in kvcoord can simply assume the on-disk state has been updated to include the mocks in kvcoord. Bazel doesn't like this -- it can't figure out the full set of transitive dependencies needed to compile+run TestDistSenderRangeFeedRetryOnTransportErrors. It forces you to break this hidden dependency cycles to be able to correctly cache artifacts + parallelize your test runs. This subtle go build behavior, though convenient during development, is difficult to build on-top of to get the kind of CI performance bazel promises.

Breaking this dependency cycle is pretty easy, it would look something like #66975. I'd normally tested my changes locally with bazel build //pkg/kv/kvclient/kvcoord:{kvcoord_test,kvcoord,mocks_transport}, but it looks like my local bazel builds of the kvcoord_test were not working at SHAs even before this PR -- not investigating. In any case, the attempt in that draft PR above doesn't break bazel build cockroach-short.


As for the the rangefeed bazel work, have you taken a look at #65029? Given there we're using source mode generation, our approach there should apply here as well. Happy to work through it with you if it proves to be as annoying. Looking at the upstream libraries, this could all be documented better. Perhaps we could in-code so it's greppable.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rickystewart and @tbg)


pkg/kv/kvclient/rangefeed/BUILD.bazel, line 47 at r3 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Good idea. @rickystewart mind looking over the Bazel bits here? The purpose is to autogenerate test mock objects from interface definitions.

Reading through https://github.com/jmhodges/bazel_gomock and https://github.com/golang/mock, and the corresponding mockgen invocation in this package:

//go:generate mockgen -package=rangefeed -source rangefeed.go -destination=mocks_generated.go .

We're using the "source mode", which is "Required if source is not set, and ignored if source is set". I can't speak to whether or not using source mode is what we want, but this is not a bug.

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An short-term alternative to all this is to not try and bother with generating these mocks through bazel now, and just listing these mock_generated.go files as regular user-edited files in bazel (so adding them to srcs for the go_{library,test} targets), and punting off the proper generation of these in a separate issue (that I'd be happy to take on soon, documenting the right gomock patterns as I go). It'd be good to have @rickystewart chime in though, seeing as how they're very close to migrating all instances of go:generate to bazel.

@rickystewart
Copy link
Collaborator

Haven't had a chance to review this in earnest yet, I'll do so ASAP.

An short-term alternative to all this is to not try and bother with generating these mocks through bazel now, and just listing these mock_generated.go files as regular user-edited files in bazel (so adding them to srcs for the go_{library,test} targets), and punting off the proper generation of these in a separate issue (that I'd be happy to take on soon, documenting the right gomock patterns as I go). It'd be good to have @rickystewart chime in though, seeing as how they're very close to migrating all instances of go:generate to bazel.

The go:generate work has actually been done (#57787), so it would be great if we could avoid regressing it.

Copy link
Collaborator

@rickystewart rickystewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine to me.

Copy link
Contributor Author

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for having a look @irfansharif! I get why Bazel needs things to be set up in this way. For Bazel itself, I don't think it matters much that these are all in the same Go package though -- it strictly just needs to know which source files gomock depends on (at least for source mode), then the Go library could depend on those source files and the mocks. But that wouldn't work well with Gazelle, I think. Being disciplined about splitting tests out to a separate package would help -- but I'd like to keep this PR fairly backportable without doing a bunch of restructuring.

I'm going to merge this as-is for now, and we can revisit this setup later.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rickystewart and @tbg)


pkg/kv/kvclient/rangefeed/BUILD.bazel, line 47 at r3 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Reading through https://github.com/jmhodges/bazel_gomock and https://github.com/golang/mock, and the corresponding mockgen invocation in this package:

//go:generate mockgen -package=rangefeed -source rangefeed.go -destination=mocks_generated.go .

We're using the "source mode", which is "Required if source is not set, and ignored if source is set". I can't speak to whether or not using source mode is what we want, but this is not a bug.

Why wouldn't this be a bug? source and interfaces are mutually exclusive, so requiring interfaces when source is set doesn't make any sense.

@erikgrinaker
Copy link
Contributor Author

bors r=miretskiy,tbg,aliher1911,rickystewart

@aliher1911
Copy link
Contributor

When do you plan to backport this? It requires my other backport to land first.

@erikgrinaker
Copy link
Contributor Author

We should get these backported soon. Let's discuss elsewhere.

@craig
Copy link
Contributor

craig bot commented Jun 29, 2021

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

changefeedccl: Rangefeeds might fail due to stale range cache
7 participants