-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvcoord: fix rangefeed retries on transport errors #66910
Conversation
In cockroachdb#66199, the gRPC error `FailedPrecondition` (returned when contacting a decommissioned node) was incorrectly considered a permanent error during range descriptor lookups, via `IsRangeLookupErrorRetryable()`. These lookups are all trying to reach a meta-range (addressed by key), so when encountering a decommissioned node they should evict the cache token and retry the lookup. This patch reverses that change, making only authentication errors permanent (i.e. when the local node is decommissioned or otherwise not part of the cluster). Release note: None
f9673c3
to
370a014
Compare
370a014
to
7bbdb4c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 9 of 9 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)
466bd00
to
8355fc9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, thank you for going the extra mile with mockgen! I know you've struggled with getting this to play ball with bazel, is this going to be the same amount of pain for the next person trying to use mockgen elsewhere? If so, I wonder where a good place would be to document how this works.
Reviewed 1 of 1 files at r1, 3 of 9 files at r2, 13 of 13 files at r3.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aliher1911 and @erikgrinaker)
pkg/kv/kvclient/kvcoord/BUILD.bazel, line 73 at r3 (raw file):
) # keep
Missing words? Or some secret bazel stuff I don't know about.
pkg/kv/kvclient/kvcoord/BUILD.bazel, line 85 at r3 (raw file):
) gomock(
Uh what, why did you need that.
pkg/kv/kvclient/rangefeed/BUILD.bazel, line 47 at r3 (raw file):
name = "mocks_rangefeed", out = "mocks_generated.go", interfaces = [""], # required, yet ignored when using source -- bug?
:(
Did you want to ask Ricky for a review on the bazel-specific parts of this PR?
pkg/kv/kvclient/rangefeed/rangefeed_mock_test.go, line 346 at r3 (raw file):
// Make sure scan failure gets retried. db.EXPECT().Scan(gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any()).Return(errors.New("scan failed")) db.EXPECT().Scan(gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any()).Return(nil)
This (and below) can't be chained into the invocation above, right? .Return(x).Return(y)
?
pkg/kv/kvclient/rangefeed/rangefeed_mock_test.go, line 348 at r3 (raw file):
db.EXPECT().Scan(gomock.Any(), gomock.Any(), gomock.Any(), gomock.Any()).Return(nil) // Make sure rangefeed is retried even after 3 failures, then succeed and cancel context.
(which is the signal for the range feed to shut down gracefully)?
pkg/kv/kvclient/rangefeed/rangefeed_mock_test.go, line 370 at r3 (raw file):
select { case <-ctx.Done(): case <-time.After(time.Second):
10s because race builds?
8355fc9
to
3a990da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this going to be the same amount of pain for the next person trying to use mockgen elsewhere?
Hopefully the next person can just grep for gomock
and follow the same pattern.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @rickystewart, and @tbg)
pkg/kv/kvclient/kvcoord/BUILD.bazel, line 73 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Missing words? Or some secret bazel stuff I don't know about.
Secret Bazel stuff. Gazelle, which autogenerates Bazel build files from Go source files, doesn't handle the gomock target libraries properly so it will go ahead and replace all of this stuff with its own targets. # keep
tells Gazelle to back off. This also means that we can't get Gazelle to manage dependencies for these targets, so they must be updated by hand.
pkg/kv/kvclient/kvcoord/BUILD.bazel, line 85 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Uh what, why did you need that.
Bazel doesn't use the go:generate
comments, it wants to handle all code generation and dependency tracking itself, so we need separate code generation mechanisms for Bazel and the Go build chain.
pkg/kv/kvclient/rangefeed/BUILD.bazel, line 47 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
:(
Did you want to ask Ricky for a review on the bazel-specific parts of this PR?
Good idea. @rickystewart mind looking over the Bazel bits here? The purpose is to autogenerate test mock objects from interface definitions.
pkg/kv/kvclient/rangefeed/rangefeed_mock_test.go, line 346 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This (and below) can't be chained into the invocation above, right?
.Return(x).Return(y)
?
This can (updated), but not the one below because of the Times(3)
thing.
`DistSender.RangeFeed()` was meant to retry transport errors after refreshing the range descriptor (invalidating the cached entry). However, due to an incorrect error type check (`*sendError` vs `sendError`), these errors failed the range feed without invalidating the cached range descriptor. This was particularly severe in cases where a large number of nodes had been decommissioned, where some stale range descriptors on some nodes contained only decommissioned nodes. Since change feeds set up range feeds across many nodes and ranges in the cluster, they are likely to encounter these decommissioned nodes and return an error -- and since the descriptor cache wasn't invalidated they would keep erroring until the nodes were restarted such that the caches were flushed (often requiring a full cluster restart). Release note (bug fix): Change feeds now properly invalidate cached range descriptors and retry when encountering decommissioned nodes.
3a990da
to
821ee1a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @rickystewart, and @tbg)
pkg/kv/kvclient/rangefeed/rangefeed_mock_test.go, line 346 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This can (updated), but not the one below because of the
Times(3)
thing.
Actually no, the chained return replaces the previous definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rickystewart and @tbg)
pkg/kv/kvclient/rangecache/range_cache.go, line 1329 at r1 (raw file):
// Auth errors are not retryable. These imply that the local node has been // decommissioned or is otherwise not part of the cluster. return !grpcutil.IsAuthError(err)
The original motivation was that since range lookup goes through DistSender.sendToReplicas to get range it would retry FailedPrecondition and evict replicas, but the error code itself would never bubble up. So there would be no need to handle it specifically anywhere above. I'm probably missing invalidation case here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rickystewart and @tbg)
pkg/kv/kvclient/rangecache/range_cache.go, line 1329 at r1 (raw file):
Previously, aliher1911 (Oleg) wrote…
The original motivation was that since range lookup goes through DistSender.sendToReplicas to get range it would retry FailedPrecondition and evict replicas, but the error code itself would never bubble up. So there would be no need to handle it specifically anywhere above. I'm probably missing invalidation case here.
Sure, that's probably true, but the logic is a bit backwards -- if it ever did see a FailedPrecondition
it shouldn't consider it permanent, so there's no reason to mark it as permanent.
Exploring an alternative to cockroachdb#66910. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the bazel stuff, something looks off. After trying it out a bit locally, I can see why this PR was so cumbersome with Bazel. So the way go:generate mockgen
stuff is setup here will not play well with Bazel. Here's why:
- To generate
mock_generated.go
inpkg/kv/kvclient/kvcoord
, we depend on the code already present inpkg/kv/kvclient/kvcoord
. - The test that uses these mocks,
TestDistSenderRangeFeedRetryOnTransportErrors
, is also inkvcoord
. - There's a dependency cycle!
kvcoord
(TestDistSenderRangeFeedRetryOnTransportErrors
) depends on mocks (generated with package namekvcoord
) which in turn depends onkvcoord
.
The reason go:generate stuff
was comparatively easier to write is that it'll run the generation script using an on-disk image of the kvcoord package, to generate more stuff in kvcoord, and then the tests in kvcoord can simply assume the on-disk state has been updated to include the mocks in kvcoord. Bazel doesn't like this -- it can't figure out the full set of transitive dependencies needed to compile+run TestDistSenderRangeFeedRetryOnTransportErrors
. It forces you to break this hidden dependency cycles to be able to correctly cache artifacts + parallelize your test runs. This subtle go build
behavior, though convenient during development, is difficult to build on-top of to get the kind of CI performance bazel promises.
Breaking this dependency cycle is pretty easy, it would look something like #66975. I'd normally tested my changes locally with bazel build //pkg/kv/kvclient/kvcoord:{kvcoord_test,kvcoord,mocks_transport}
, but it looks like my local bazel builds of the kvcoord_test
were not working at SHAs even before this PR -- not investigating. In any case, the attempt in that draft PR above doesn't break bazel build cockroach-short
.
As for the the rangefeed bazel work, have you taken a look at #65029? Given there we're using source mode generation, our approach there should apply here as well. Happy to work through it with you if it proves to be as annoying. Looking at the upstream libraries, this could all be documented better. Perhaps we could in-code so it's greppable.
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rickystewart and @tbg)
pkg/kv/kvclient/rangefeed/BUILD.bazel, line 47 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Good idea. @rickystewart mind looking over the Bazel bits here? The purpose is to autogenerate test mock objects from interface definitions.
Reading through https://github.com/jmhodges/bazel_gomock and https://github.com/golang/mock, and the corresponding mockgen
invocation in this package:
//go:generate mockgen -package=rangefeed -source rangefeed.go -destination=mocks_generated.go .
We're using the "source mode", which is "Required if source is not set, and ignored if source is set". I can't speak to whether or not using source mode is what we want, but this is not a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An short-term alternative to all this is to not try and bother with generating these mocks through bazel now, and just listing these mock_generated.go files as regular user-edited files in bazel (so adding them to srcs
for the go_{library,test}
targets), and punting off the proper generation of these in a separate issue (that I'd be happy to take on soon, documenting the right gomock patterns as I go). It'd be good to have @rickystewart chime in though, seeing as how they're very close to migrating all instances of go:generate
to bazel.
Haven't had a chance to review this in earnest yet, I'll do so ASAP.
The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for having a look @irfansharif! I get why Bazel needs things to be set up in this way. For Bazel itself, I don't think it matters much that these are all in the same Go package though -- it strictly just needs to know which source files gomock
depends on (at least for source mode), then the Go library could depend on those source files and the mocks. But that wouldn't work well with Gazelle, I think. Being disciplined about splitting tests out to a separate package would help -- but I'd like to keep this PR fairly backportable without doing a bunch of restructuring.
I'm going to merge this as-is for now, and we can revisit this setup later.
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @rickystewart and @tbg)
pkg/kv/kvclient/rangefeed/BUILD.bazel, line 47 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
Reading through https://github.com/jmhodges/bazel_gomock and https://github.com/golang/mock, and the corresponding
mockgen
invocation in this package://go:generate mockgen -package=rangefeed -source rangefeed.go -destination=mocks_generated.go .
We're using the "source mode", which is "Required if source is not set, and ignored if source is set". I can't speak to whether or not using source mode is what we want, but this is not a bug.
Why wouldn't this be a bug? source
and interfaces
are mutually exclusive, so requiring interfaces
when source
is set doesn't make any sense.
bors r=miretskiy,tbg,aliher1911,rickystewart |
When do you plan to backport this? It requires my other backport to land first. |
We should get these backported soon. Let's discuss elsewhere. |
Build succeeded: |
rangecache: consider FailedPrecondition a retryable error
In #66199, the gRPC error
FailedPrecondition
(returned when contactinga decommissioned node) was incorrectly considered a permanent error
during range descriptor lookups, via
IsRangeLookupErrorRetryable()
.These lookups are all trying to reach a meta-range (addressed by key),
so when encountering a decommissioned node they should evict the cache
token and retry the lookup.
This patch reverses that change, making only authentication errors
permanent (i.e. when the local node is decommissioned or otherwise not
part of the cluster).
Release note: None
kvcoord: fix rangefeed retries on transport errors
DistSender.RangeFeed()
was meant to retry transport errors afterrefreshing the range descriptor (invalidating the cached entry).
However, due to an incorrect error type check (
*sendError
vssendError
), these errors failed the range feed without invalidatingthe cached range descriptor.
This was particularly severe in cases where a large number of nodes had
been decommissioned, where some stale range descriptors on some nodes
contained only decommissioned nodes. Since change feeds set up range
feeds across many nodes and ranges in the cluster, they are likely to
encounter these decommissioned nodes and return an error -- and since
the descriptor cache wasn't invalidated they would keep erroring until
the nodes were restarted such that the caches were flushed (often
requiring a full cluster restart).
Resolves #66636, touches #66586.
Release note (bug fix): Change feeds now properly invalidate cached
range descriptors and retry when encountering decommissioned nodes.
/cc @cockroachdb/kv