server: return FailedPrecondition when talking to decom node #66199
Conversation
Force-pushed from 2f69434 to fb2e855
Looks good! Can't think of anything else we need to handle. Except maybe check that it doesn't break anything for clients when connecting to decommissioned nodes (e.g. the cockroach CLI, both SQL and admin RPCs), although I don't think it should affect them.
Reviewed 5 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)
pkg/kv/kvclient/kvcoord/dist_sender.go, line 1919 at r1 (raw file):
// in-between.
//
if withCommit && !grpcutil.RequestDidNotStart(err) && !grpcutil.IsFailedPreconditionError(err) {
I'd argue that RequestDidNotStart should return true for FailedPrecondition (if we take "precondition" to mean "pre-evaluation" for the entire request). In this specific case it seems to apply, at least, since what we care about here is whether any writes could have gone through despite the error.
pkg/rpc/context.go, line 1062 at r1 (raw file):
ctx.masterCtx, "rpc.Context: grpc heartbeat", func(masterCtx context.Context) {
	err := ctx.runHeartbeat(conn, target, redialChan)
	if err != nil && !grpcutil.IsClosedConnection(err) && !grpcutil.IsAuthError(err) && !grpcutil.IsFailedPreconditionError(err) {
nit: Break lines at 100 characters. I don't really mind, but I know others do (surprised the linter isn't saying anything).
pkg/server/connectivity_test.go, line 394 at r1 (raw file):
// Trying to scan the scratch range from the decommissioned node should
// now result in a permission denied error.
_, err := decomSrv.DB().Scan(ctx, scratchKey, keys.MaxKey, 1)
Let's try to do a scan the other way as well (from clusterSrv to decomSrv), to test that the operation doesn't hang because of infinite retries.
pkg/server/server.go, line 286 at r1 (raw file):
nodeTombStorage := &nodeTombstoneStorage{engs: engines}
checkPingFor := func(ctx context.Context, nodeID roachpb.NodeID, c codes.Code) error {
This parameter could use a more descriptive name.
pkg/util/grpcutil/grpc_util.go, line 108 at r1 (raw file):
// IsFailedPreconditionError returns true if err's Cause is an error produced by
// gRPC due to failed precondition. This error is used internally to shortcut
// known unavailable nodes and is used by kv retry loop. When outside loop its
I'd be even more specific here and say that it's returned for decommissioned peers -- we want to retry unavailable nodes, because they could become available again. Maybe specify which KV retry loop as well and that we want to retry range-addressed requests, since there's retry loops all over the place. :)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)
pkg/server/connectivity_test.go, line 394 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Let's try to do a scan the other way as well (from clusterSrv to decomSrv), to test that the operation doesn't hang because of infinite retries.
Hm, maybe this doesn't make sense; if we put a range solely on the decommissioned node then either that range would have been moved off of the node first, or the request would indeed hang (because we're specifically retrying these requests in the DistSender).
I think we should have a test for the scenario we're trying to solve here though, but I'm not sure how to set it up. Maybe something like setting up a range across three nodes, moving the leaseholder to the node we want to decommission, quiesce the range, shut the node down, decommission it, and then try to run a scan over it -- this should eventually succeed once the leaseholder is moved. It's a bit involved though. Any better ideas @tbg?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)
pkg/server/connectivity_test.go, line 394 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Hm, maybe this doesn't make sense; if we put a range solely on the decommissioned node then either that range would have been moved off of the node first, or the request would indeed hang (because we're specifically retrying these requests in the DistSender).
I think we should have a test for the scenario we're trying to solve here though, but I'm not sure how to set it up. Maybe something like setting up a range across three nodes, moving the leaseholder to the node we want to decommission, quiesce the range, shut the node down, decommission it, and then try to run a scan over it -- this should eventually succeed once the leaseholder is moved. It's a bit involved though. Any better ideas @tbg?
Maybe it'd be easier to just stop the node and set the decommissioned status for the node directly in KV (without actually going through the whole decommission process).
pkg/util/grpcutil/grpc_util.go (Outdated)
@@ -103,6 +103,20 @@ func IsAuthError(err error) bool {
	return false
}

// IsFailedPreconditionError returns true if err's Cause is an error produced by
// gRPC due to failed precondition. This error is used internally to shortcut
// known unavailable nodes and is used by kv retry loop. When outside loop its
both "known unavailable nodes" "is used by kv retry loop" is ambiguous, can you be more specific? For example
This error is returned when trying to connect to nodes that are decommissioned (and thus will never be reachable again, at least under their old NodeID). The recipient of this error will want to handle this error by either contacting a different node (DistSender does this) when possible, or by returning the error to the client (handling it similar to an AuthError).
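For concreteness, a minimal sketch of what a helper carrying that comment might look like, assuming the standard google.golang.org/grpc status/codes packages and cockroachdb/errors for unwrapping (the real pkg/util/grpcutil implementation may unwrap differently):

```go
package grpcutil

import (
	"github.com/cockroachdb/errors"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// IsFailedPreconditionError returns true if err's cause is a gRPC
// FailedPrecondition status. It is returned when trying to connect to nodes
// that are decommissioned (and thus will never be reachable again, at least
// under their old NodeID). Recipients either contact a different node when
// possible (as DistSender does) or return the error to the client, handling
// it similarly to an AuthError.
func IsFailedPreconditionError(err error) bool {
	if s, ok := status.FromError(errors.UnwrapAll(err)); ok {
		return s.Code() == codes.FailedPrecondition
	}
	return false
}
```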
@@ -1916,7 +1916,7 @@ func (ds *DistSender) sendToReplicas(
 		// evaluating twice, overwriting another unrelated write that fell
 		// in-between.
 		//
-		if withCommit && !grpcutil.RequestDidNotStart(err) {
+		if withCommit && !grpcutil.RequestDidNotStart(err) && !grpcutil.IsFailedPreconditionError(err) {
shouldn't RequestDidNotStart be taught to recognize both AuthError and FailedPreconditionError as guaranteeing that the request didn't start?
I think somehow we have to make IsAuthError and IsFailedPreconditionError and RequestDidNotStart play together better. In all cases the caller is asking something about the request. They're either asking "should I try again?" or "is there any chance this actually got executed despite my getting an error"? (And maybe something else I'm not aware of). Rather than sprinkling if !A && !b { around, can we clean this up by giving the callers methods that describe what they are actually trying to find out? (Sketched below.)
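As a hedged illustration of such an intent-revealing method (the name IsConnectionRejected matches the helper that appears later in this review, but the exact composition below is an assumption), something along these lines could live next to the existing predicates in pkg/util/grpcutil:

```go
// Sketch only: answer the caller's actual question ("was this connection
// refused outright, so that retrying the same peer is pointless?") in one
// place instead of combining the low-level predicates at every call site.
func IsConnectionRejected(err error) bool {
	// An auth error means the local node is not allowed to talk to the
	// cluster; a failed-precondition error means the remote peer is
	// decommissioned. Either way the request never started.
	return IsAuthError(err) || IsFailedPreconditionError(err)
}
```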
Btw I went through the callers of AuthError and I think there are some that we need to update. For example
cockroach/pkg/ccl/kvccl/kvtenantccl/connector.go
Lines 332 to 335 in 65c30a3
		if grpcutil.IsAuthError(err) {
			// Authentication or authorization error. Propagate.
			return nil, nil, err
		}
Then this
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go
Lines 1864 to 1870 in fb2e855
		if grpcutil.IsAuthError(err) {
			// Authentication or authorization error. Propagate.
			if ambiguousError != nil {
				return nil, roachpb.NewAmbiguousResultErrorf("error=%s [propagate]", ambiguousError)
			}
			return nil, err
		}
this
cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
Lines 262 to 265 in 0cea9dc
		if grpcutil.IsAuthError(err) {
			// Authentication or authorization error. Propagate.
			return args.Timestamp, err
		}
this
cockroach/pkg/kv/kvserver/liveness/liveness.go
Lines 338 to 340 in dd325f0
		if grpcutil.IsAuthError(err) {
			return err
		}
and the list continues; basically you need to audit all of the callers (I like the "Call Hierarchy" feature in GoLand for this).
I was going through the call sites where we check the auth error to abort early, and what's happening is that we only have two places where it makes sense to handle it. Outside of those places we should never get FailedPrecondition, since we keep iterating until we exhaust the replicas or hit some other error. I agree about replacing IsAuthError with a better name, because callers don't care about the type of error; they only care whether they should bail, in all those cases or most of them. Let me try to change that.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911)
I think connector.go should retry: it goes through the nodes in a loop, and FailedPrecondition should evict the node from its list like other retryable errors do. This is actually the second place I mentioned where retry makes sense. The ambiguous error is handled fine, but I'll add a precondition check to RequestDidNotStart as advised.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911)
Force-pushed from fb2e855 to 8a1b5d2
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/kv/kvclient/kvcoord/dist_sender.go, line 1919 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
shouldn't RequestDidNotStart be taught to recognize both AuthError and FailedPreconditionError as guaranteeing that the request didn't start?
In practice it never reaches this branch, but for the sake of consistency it should.
pkg/kv/kvclient/kvcoord/dist_sender.go, line 1919 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I'd argue that RequestDidNotStart should return true for FailedPrecondition (if we take "precondition" to mean "pre-evaluation" for the entire request). In this specific case it seems to apply, at least, since what we care about here is whether any writes could have gone through despite the error.
Done.
pkg/rpc/context.go, line 1062 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: Break lines at 100 characters. I don't really mind, but I know others do (surprised the linter isn't saying anything).
Done.
pkg/server/connectivity_test.go, line 394 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Maybe it'd be easier to just stop the node and set the decommissioned status for the node directly in KV (without actually going through the whole decommission process).
While it is possible here, I think this is the wrong place to test this change in functionality. It should be verified at the DistSender level, as part of the replica eviction logic.
pkg/server/server.go, line 286 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This parameter could use a more descriptive name.
Done.
pkg/util/grpcutil/grpc_util.go, line 108 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I'd be even more specific here and say that it's returned for decommissioned peers -- we want to retry unavailable nodes, because they could become available again. Maybe specify which KV retry loop as well and that we want to retry range-addressed requests, since there's retry loops all over the place. :)
Done.
pkg/util/grpcutil/grpc_util.go, line 108 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
both "known unavailable nodes" "is used by kv retry loop" is ambiguous, can you be more specific? For example
This error is returned when trying to connect to nodes that are decommissioned (and thus will never be reachable again, at least under their old NodeID). The recipient of this error will want to handle this error by either contacting a different node (DistSender does this) when possible, or by returning the error to the client (handling it similar to an AuthError).
Done.
Force-pushed from 9625a9d to a9e0179
Reviewed 7 of 8 files at r2, 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)
pkg/kv/kvclient/rangecache/range_cache.go, line 1329 at r3 (raw file):
// For now, all errors are retryable except transport errors where request was
// rejected.
return !grpcutil.IsConnectionRejected(err)
I think we want to retry FailedPrecondition here. In principle, if a range lookup fails then we want to retry that on the new leaseholder for the range. However, this is a bit hard to reason about, because there's cache eviction tokens and recursive DistSender calls involved, so please double-check this.
pkg/server/connectivity_test.go, line 394 at r1 (raw file):
Previously, aliher1911 (Oleg) wrote…
While it is possible here I think this is a wrong place to test this change in functionality. It should be verified on dist sender level for replica eviction logic.
I agree, a DistSender unit test seems more appropriate here. Might be worth adding one for the range cache lookup retry as well (see my other comment).
pkg/server/server.go, line 312 at r3 (raw file):
Settings: cfg.Settings,
OnOutgoingPing: func(req *rpc.PingRequest) error {
	return checkPingFor(ctx, req.TargetNodeID, codes.FailedPrecondition)
I think we should have a comment here explaining why we return different error codes for incoming and outgoing pings.
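For context, a hedged sketch of the asymmetry being discussed, extending the snippet quoted above (OnIncomingPing and the OriginNodeID field are assumptions about the surrounding rpc.ContextOptions wiring; only OnOutgoingPing and checkPingFor are visible in the quoted code):

```go
rpcCtxOpts := rpc.ContextOptions{
	Settings: cfg.Settings,
	// Outgoing ping: the *target* node is decommissioned. FailedPrecondition
	// lets the caller's retry loop (e.g. DistSender) evict this replica and
	// try another one instead of aborting.
	OnOutgoingPing: func(req *rpc.PingRequest) error {
		return checkPingFor(ctx, req.TargetNodeID, codes.FailedPrecondition)
	},
	// Incoming ping: the *origin* node is decommissioned. PermissionDenied
	// tells the decommissioned caller it has been permanently removed from
	// the cluster and should stop retrying.
	OnIncomingPing: func(req *rpc.PingRequest) error {
		return checkPingFor(ctx, req.OriginNodeID, codes.PermissionDenied)
	},
}
```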
Force-pushed from a9e0179 to 367dcbb
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/kv/kvclient/rangecache/range_cache.go, line 1329 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I think we want to retry FailedPrecondition here. In principle, if a range lookup fails then we want to retry that on the new leaseholder for the range. However, this is a bit hard to reason about, because there's cache eviction tokens and recursive DistSender calls involved, so please double-check this.
There are 3 call sites for this.
RangeIterator.Seek - it always relies on the underlying implementation of either DistSender or the tenant Client. Both of them provide internal retries to handle this case.
DistSender.sendPartialBatch - this method sits on top of sendToReplicas, where we perform the retry.
DistSender.partialRangeFeed - it handles errors from the routing info, which uses the range cache sitting on top of the same DistSender, so it also retries internally.
The comment on the method states that its intended use is on range lookup calls; since range lookup or its underlying components are responsible for retries, this should be sufficient imho.
pkg/server/server.go, line 312 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I think we should have a comment here explaining why we return different error codes for incoming and outgoing pings.
Done.
Force-pushed from df6204c to 84af023
Reviewed 4 of 4 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)
pkg/kv/kvclient/rangecache/range_cache.go, line 1329 at r3 (raw file):
Previously, aliher1911 (Oleg) wrote…
There are 3 call sites for this.
RangeIterator.Seek - it always relies on the underlying implementation of either DistSender or the tenant Client. Both of them provide internal retries to handle this case.
DistSender.sendPartialBatch - this method sits on top of sendToReplicas, where we perform the retry.
DistSender.partialRangeFeed - it handles errors from the routing info, which uses the range cache sitting on top of the same DistSender, so it also retries internally.
The comment on the method states that its intended use is on range lookup calls; since range lookup or its underlying components are responsible for retries, this should be sufficient imho.
Makes sense, thanks for verifying.
pkg/server/server.go, line 312 at r3 (raw file):
Previously, aliher1911 (Oleg) wrote…
Done.
s/unavailable/decommissioned/
Also, maybe reword this to clarify that the error code is only returned when the given node is decommissioned. The way it's written it sounds like it's always blocked/rejected. :)
pkg/util/grpcutil/grpc_util.go, line 152 at r4 (raw file):
errors.HasType(err, (*netutil.InitialHeartbeatFailedError)(nil)) ||
	errors.Is(err, circuit.ErrBreakerOpen) ||
	IsFailedPreconditionError(err) {
Shouldn't this check IsConnectionRejectedError? Let's expand the test with this as well.
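A hedged sketch of the kind of test extension being requested (the test name and the use of testify/require are assumptions; the point is only that RequestDidNotStart should treat both rejection codes as a guarantee that the request never started):

```go
package grpcutil

import (
	"testing"

	"github.com/stretchr/testify/require"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func TestRequestDidNotStartOnRejectedConnection(t *testing.T) {
	// A decommissioned peer (FailedPrecondition) and an auth rejection
	// (PermissionDenied) are both produced before the RPC is dispatched,
	// so the request is guaranteed not to have started.
	for _, code := range []codes.Code{codes.FailedPrecondition, codes.PermissionDenied} {
		err := status.Error(code, "connection rejected")
		require.True(t, RequestDidNotStart(err), "expected RequestDidNotStart for %s", code)
	}
}
```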
Force-pushed from 84af023 to 09ddf15
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)
pkg/server/server.go, line 312 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
s/unavailable/decommissioned/
Also, maybe reword this to clarify that the error code is only returned when the given node is decommissioned. The way it's written it sounds like it's always blocked/rejected. :)
Done.
pkg/util/grpcutil/grpc_util.go, line 152 at r4 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Shouldn't this check IsConnectionRejectedError? Let's expand the test with this as well.
That makes sense for consistency even if we should never reach here with auth errors. We need some better clarity on which layers handle which errors to make those trivial changes less intrusive to the whole kv codebase.
Previously, when the server detected that an outgoing call would go to a decommissioned node, it would shortcut with codes.PermissionDenied to skip the attempt. That would result in the retry loop immediately aborting without trying another replica or refreshing the leaseholder. One consequence is that writes to system.eventlog would lose events when trying to reach a decommissioned node.

To address that, the server now returns codes.FailedPrecondition, which results in the replica being evicted and the call retried.

Release note: None
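To make the change concrete, a hedged sketch of the retry decision from the sender's point of view (shouldTryNextReplica is a hypothetical helper used only for illustration; the grpcutil predicates are the ones discussed above):

```go
// Hypothetical illustration: how the two codes steer the retry loop.
func shouldTryNextReplica(sendErr error) bool {
	switch {
	case grpcutil.IsAuthError(sendErr):
		// PermissionDenied: the local node itself is not allowed to talk to
		// the cluster. Under the old behavior a decommissioned *peer* also
		// produced this code, so the loop aborted immediately.
		return false
	case grpcutil.IsFailedPreconditionError(sendErr):
		// FailedPrecondition: only the peer is decommissioned. Evict the
		// replica from the cache and move on to the next one.
		return true
	default:
		// Other transport errors keep their existing handling.
		return true
	}
}
```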
Force-pushed from 09ddf15 to d11a42c
Reviewed 2 of 2 files at r5.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)
Reviewed 1 of 5 files at r1, 5 of 8 files at r2, 1 of 1 files at r3, 3 of 4 files at r4, 2 of 2 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)
pkg/kv/kvserver/liveness/liveness.go, line 338 at r5 (raw file):
	log.Infof(ctx, "attempting to set liveness draining status to %v: %v", drain, err)
}
if grpcutil.IsConnectionRejected(err) {
We can't get FailedPrecondition here, right? Because we are going through a DistSender. Just checking my understanding. If we got FailedPrecondition here we should retry it (but we will never).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)
pkg/kv/kvserver/liveness/liveness.go, line 338 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We can't get FailedPrecondition here, right? Because we are going through a DistSender. Just checking my understanding. If we got FailedPrecondition here we should retry it (but we will never).
It uses DB.Txn to commit the change, so my understanding is that it sits above the DistSender. I think it is safe.
Think we may have encountered this in the wild: https://github.com/cockroachlabs/support/issues/1028#issuecomment-862585630 In this PR, when we send a request to a decommissioned node and that request fails, do we invalidate/update the relevant caches such that we don't try using that replica again?
We do evict the lease when this happens. Without this fix we just bail without touching the lease token.
Great, thanks. I'm guessing that also evicts the range descriptor from its cache?
bors r=erikgrinaker,tbg
Build succeeded:
How do we feel about backporting this to 21.1?
I think we need to brew it on master. I traced all the code paths where this error could percolate, but let's wait at least for the weekly to run, maybe? I don't think it would be hard technically to backport.
@aliher1911 Mind prepping a 21.1 backport so that this is at least ready to go when we feel confident about it? A customer got hit pretty badly by this (blocked all changefeed creation) and would like a near-term fix.
In cockroachdb#66199, the gRPC error `FailedPrecondition` (returned when contacting a decommissioned node) was incorrectly considered a permanent error during range descriptor lookups, via `IsRangeLookupErrorRetryable()`. These lookups are all trying to reach a meta-range (addressed by key), so when encountering a decommissioned node they should evict the cache token and retry the lookup. This patch reverses that change, making only authentication errors permanent (i.e. when the local node is decommissioned or otherwise not part of the cluster). Release note: None
66910: kvcoord: fix rangefeed retries on transport errors r=miretskiy,tbg,aliher1911,rickystewart a=erikgrinaker

### rangecache: consider FailedPrecondition a retryable error

In #66199, the gRPC error `FailedPrecondition` (returned when contacting a decommissioned node) was incorrectly considered a permanent error during range descriptor lookups, via `IsRangeLookupErrorRetryable()`. These lookups are all trying to reach a meta-range (addressed by key), so when encountering a decommissioned node they should evict the cache token and retry the lookup.

This patch reverses that change, making only authentication errors permanent (i.e. when the local node is decommissioned or otherwise not part of the cluster).

Release note: None

### kvcoord: fix rangefeed retries on transport errors

`DistSender.RangeFeed()` was meant to retry transport errors after refreshing the range descriptor (invalidating the cached entry). However, due to an incorrect error type check (`*sendError` vs `sendError`), these errors failed the range feed without invalidating the cached range descriptor.

This was particularly severe in cases where a large number of nodes had been decommissioned, where some stale range descriptors on some nodes contained only decommissioned nodes. Since change feeds set up range feeds across many nodes and ranges in the cluster, they are likely to encounter these decommissioned nodes and return an error -- and since the descriptor cache wasn't invalidated they would keep erroring until the nodes were restarted such that the caches were flushed (often requiring a full cluster restart).

Resolves #66636, touches #66586.

Release note (bug fix): Change feeds now properly invalidate cached range descriptors and retry when encountering decommissioned nodes.

/cc @cockroachdb/kv

Co-authored-by: Erik Grinaker <[email protected]>
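A hedged, self-contained illustration of the `*sendError` vs `sendError` pitfall mentioned above (the type and wrapping here are simplified stand-ins, not the actual kvcoord definitions):

```go
package main

import (
	"errors"
	"fmt"
)

// sendError is a simplified stand-in for kvcoord's transport error type,
// stored in the error chain as a value (not a pointer).
type sendError struct{ cause error }

func (e sendError) Error() string { return "send error: " + e.cause.Error() }

func main() {
	err := fmt.Errorf("rangefeed failed: %w", sendError{cause: errors.New("node decommissioned")})

	// Buggy check: looks for *sendError, which never appears in the chain,
	// so the "refresh descriptor and retry" path is skipped.
	var ptrTarget *sendError
	fmt.Println(errors.As(err, &ptrTarget)) // false

	// Fixed check: match the value type that is actually wrapped, so the
	// cached descriptor gets invalidated and the rangefeed retries.
	var valTarget sendError
	fmt.Println(errors.As(err, &valTarget)) // true
}
```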
Previously, when the server detected that an outgoing call would go to a
decommissioned node, it would shortcut with codes.PermissionDenied
to skip the attempt. That would result in the retry loop immediately aborting
without trying another replica or refreshing the leaseholder. One
consequence is that writes to system.eventlog would lose events
when trying to reach a decommissioned node.
To address that, the server now returns codes.FailedPrecondition, which
results in the replica being evicted and the call retried.
Release note: None
Fixes #65589