server: return FailedPrecondition when talking to decom node #66199
Conversation
Force-pushed from 2f69434 to fb2e855
Looks good! Can't think of anything else we need to handle. Except maybe check that it doesn't break anything for clients when connecting to decommissioned nodes (e.g. the cockroach CLI, both SQL and admin RPCs), although I don't think it should affect them.
Reviewed 5 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)
pkg/kv/kvclient/kvcoord/dist_sender.go, line 1919 at r1 (raw file):
// in-between.
//
if withCommit && !grpcutil.RequestDidNotStart(err) && !grpcutil.IsFailedPreconditionError(err) {
I'd argue that RequestDidNotStart should return true for FailedPrecondition (if we take "precondition" to mean "pre-evaluation" for the entire request). In this specific case it seems to apply, at least, since what we care about here is whether any writes could have gone through despite the error.
pkg/rpc/context.go, line 1062 at r1 (raw file):
ctx.masterCtx, "rpc.Context: grpc heartbeat", func(masterCtx context.Context) {
	err := ctx.runHeartbeat(conn, target, redialChan)
	if err != nil && !grpcutil.IsClosedConnection(err) && !grpcutil.IsAuthError(err) && !grpcutil.IsFailedPreconditionError(err) {
nit: Break lines at 100 characters. I don't really mind, but I know others do (surprised the linter isn't saying anything).
pkg/server/connectivity_test.go, line 394 at r1 (raw file):
// Trying to scan the scratch range from the decommissioned node should
// now result in a permission denied error.
_, err := decomSrv.DB().Scan(ctx, scratchKey, keys.MaxKey, 1)
Let's try to do a scan the other way as well (from clusterSrv to decomSrv), to test that the operation doesn't hang because of infinite retries.
pkg/server/server.go, line 286 at r1 (raw file):
nodeTombStorage := &nodeTombstoneStorage{engs: engines}
checkPingFor := func(ctx context.Context, nodeID roachpb.NodeID, c codes.Code) error {
This parameter could use a more descriptive name.
pkg/util/grpcutil/grpc_util.go, line 108 at r1 (raw file):
// IsFailedPreconditionError returns true if err's Cause is an error produced by
// gRPC due to failed precondition. This error is used internally to shortcut
// known unavailable nodes and is used by kv retry loop. When outside loop its
I'd be even more specific here and say that it's returned for decommissioned peers -- we want to retry unavailable nodes, because they could become available again. Maybe specify which KV retry loop as well and that we want to retry range-addressed requests, since there's retry loops all over the place. :)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)
pkg/server/connectivity_test.go, line 394 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Let's try to do a scan the other way as well (from clusterSrv to decomSrv), to test that the operation doesn't hang because of infinite retries.
Hm, maybe this doesn't make sense; if we put a range solely on the decommissioned node then either that range would have been moved off of the node first, or the request would indeed hang (because we're specifically retrying these requests in the DistSender).
I think we should have a test for the scenario we're trying to solve here though, but I'm not sure how to set it up. Maybe something like setting up a range across three nodes, moving the leaseholder to the node we want to decommission, quiesce the range, shut the node down, decommission it, and then try to run a scan over it -- this should eventually succeed once the leaseholder is moved. It's a bit involved though. Any better ideas @tbg?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)
pkg/server/connectivity_test.go, line 394 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Hm, maybe this doesn't make sense; if we put a range solely on the decommissioned node then either that range would have been moved off of the node first, or the request would indeed hang (because we're specifically retrying these requests in the DistSender).
I think we should have a test for the scenario we're trying to solve here though, but I'm not sure how to set it up. Maybe something like setting up a range across three nodes, moving the leaseholder to the node we want to decommission, quiesce the range, shut the node down, decommission it, and then try to run a scan over it -- this should eventually succeed once the leaseholder is moved. It's a bit involved though. Any better ideas @tbg?
Maybe it'd be easier to just stop the node and set the decommissioned status for the node directly in KV (without actually going through the whole decommission process).
pkg/util/grpcutil/grpc_util.go (Outdated)
@@ -103,6 +103,20 @@ func IsAuthError(err error) bool {
	return false
}

// IsFailedPreconditionError returns true if err's Cause is an error produced by
// gRPC due to failed precondition. This error is used internally to shortcut
// known unavailable nodes and is used by kv retry loop. When outside loop its
both "known unavailable nodes" "is used by kv retry loop" is ambiguous, can you be more specific? For example
This error is returned when trying to connect to nodes that are decommissioned (and thus will never be reachable again, at least under their old NodeID). The recipient of this error will want to handle this error by either contacting a different node (DistSender does this) when possible, or by returning the error to the client (handling it similar to an AuthError).
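For concreteness, a minimal sketch of what a helper carrying that comment might look like, assuming the standard google.golang.org/grpc status/codes packages and cockroachdb/errors for unwrapping (the real pkg/util/grpcutil implementation may unwrap differently):

```go
package grpcutil

import (
	"github.com/cockroachdb/errors"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// IsFailedPreconditionError returns true if err's cause is a gRPC
// FailedPrecondition status. It is returned when trying to connect to nodes
// that are decommissioned (and thus will never be reachable again, at least
// under their old NodeID). Recipients either contact a different node when
// possible (as DistSender does) or return the error to the client, handling
// it similarly to an AuthError.
func IsFailedPreconditionError(err error) bool {
	if s, ok := status.FromError(errors.UnwrapAll(err)); ok {
		return s.Code() == codes.FailedPrecondition
	}
	return false
}
```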
@@ -1916,7 +1916,7 @@ func (ds *DistSender) sendToReplicas(
 		// evaluating twice, overwriting another unrelated write that fell
 		// in-between.
 		//
-		if withCommit && !grpcutil.RequestDidNotStart(err) {
+		if withCommit && !grpcutil.RequestDidNotStart(err) && !grpcutil.IsFailedPreconditionError(err) {
shouldn't RequestDidNotStart be taught to recognize both AuthError and FailedPreconditionError as guaranteeing that the request didn't start?
I think somehow we have to make IsAuthError and IsFailedPreconditionError and RequestDidNotStart play together better. In all cases the caller is asking something about the request. They're either asking "should I try again?" or "is there any chance this actually got executed despite my getting an error"? (And maybe something else I'm not aware of). Rather than sprinkling if !A && !b { around, can we clean this up by giving the callers methods that describe what they are actually trying to find out? (Sketched below.)
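As a hedged illustration of such an intent-revealing method (the name IsConnectionRejected matches the helper that appears later in this review, but the exact composition below is an assumption), something along these lines could live next to the existing predicates in pkg/util/grpcutil:

```go
// Sketch only: answer the caller's actual question ("was this connection
// refused outright, so that retrying the same peer is pointless?") in one
// place instead of combining the low-level predicates at every call site.
func IsConnectionRejected(err error) bool {
	// An auth error means the local node is not allowed to talk to the
	// cluster; a failed-precondition error means the remote peer is
	// decommissioned. Either way the request never started.
	return IsAuthError(err) || IsFailedPreconditionError(err)
}
```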
Btw I went through the callers of AuthError and I think there are some that we need to update. For example
cockroach/pkg/ccl/kvccl/kvtenantccl/connector.go
Lines 332 to 335 in 65c30a3
		if grpcutil.IsAuthError(err) {
			// Authentication or authorization error. Propagate.
			return nil, nil, err
		}
Then this
cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go
Lines 1864 to 1870 in fb2e855
		if grpcutil.IsAuthError(err) {
			// Authentication or authorization error. Propagate.
			if ambiguousError != nil {
				return nil, roachpb.NewAmbiguousResultErrorf("error=%s [propagate]", ambiguousError)
			}
			return nil, err
		}
this
cockroach/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
Lines 262 to 265 in 0cea9dc
		if grpcutil.IsAuthError(err) {
			// Authentication or authorization error. Propagate.
			return args.Timestamp, err
		}
this
cockroach/pkg/kv/kvserver/liveness/liveness.go
Lines 338 to 340 in dd325f0
		if grpcutil.IsAuthError(err) {
			return err
		}
and the list continues; basically you need to audit all of the callers (I like the "Call Hierarchy" feature in GoLand for this).
I was going through the call sites where we check the auth error to abort early, and what's happening is that we only have two places where it makes sense to handle it. Outside of those places we should never get FailedPrecondition, since we keep iterating until we exhaust the replicas or hit some other error. I agree about replacing IsAuthError with a better name, because callers don't care about the type of error; they only care whether they should bail, in all those cases or most of them. Let me try to change that.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911)
I think connector.go should retry: it goes through the nodes in a loop, and FailedPrecondition should evict the node from its list like other retryable errors do. This is actually the second place I mentioned where retry makes sense. The ambiguous error is handled fine, but I'll add a precondition check to RequestDidNotStart as advised.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911)
Force-pushed from fb2e855 to 8a1b5d2
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/kv/kvclient/kvcoord/dist_sender.go, line 1919 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
shouldn't RequestDidNotStart be taught to recognize both AuthError and FailedPreconditionError as guaranteeing that the request didn't start?
In practice it never reaches this branch, but for the sake of consistency it should.
pkg/kv/kvclient/kvcoord/dist_sender.go, line 1919 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I'd argue that RequestDidNotStart should return true for FailedPrecondition (if we take "precondition" to mean "pre-evaluation" for the entire request). In this specific case it seems to apply, at least, since what we care about here is whether any writes could have gone through despite the error.
Done.
pkg/rpc/context.go, line 1062 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: Break lines at 100 characters. I don't really mind, but I know others do (surprised the linter isn't saying anything).
Done.
pkg/server/connectivity_test.go, line 394 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Maybe it'd be easier to just stop the node and set the decommissioned status for the node directly in KV (without actually going through the whole decommission process).
While it is possible here, I think this is the wrong place to test this change in functionality. It should be verified at the DistSender level, as part of the replica eviction logic.
pkg/server/server.go, line 286 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This parameter could use a more descriptive name.
Done.
pkg/util/grpcutil/grpc_util.go, line 108 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I'd be even more specific here and say that it's returned for decommissioned peers -- we want to retry unavailable nodes, because they could become available again. Maybe specify which KV retry loop as well and that we want to retry range-addressed requests, since there's retry loops all over the place. :)
Done.
pkg/util/grpcutil/grpc_util.go, line 108 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
both "known unavailable nodes" "is used by kv retry loop" is ambiguous, can you be more specific? For example
This error is returned when trying to connect to nodes that are decommissioned (and thus will never be reachable again, at least under their old NodeID). The recipient of this error will want to handle this error by either contacting a different node (DistSender does this) when possible, or by returning the error to the client (handling it similar to an AuthError).
Done.
Force-pushed from 9625a9d to a9e0179
Reviewed 7 of 8 files at r2, 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)
pkg/kv/kvclient/rangecache/range_cache.go, line 1329 at r3 (raw file):
// For now, all errors are retryable except transport errors where request was
// rejected.
return !grpcutil.IsConnectionRejected(err)
I think we want to retry FailedPrecondition here. In principle, if a range lookup fails then we want to retry that on the new leaseholder for the range. However, this is a bit hard to reason about, because there's cache eviction tokens and recursive DistSender calls involved, so please double-check this.
pkg/server/connectivity_test.go, line 394 at r1 (raw file):
Previously, aliher1911 (Oleg) wrote…
While it is possible here I think this is a wrong place to test this change in functionality. It should be verified on dist sender level for replica eviction logic.
I agree, a DistSender unit test seems more appropriate here. Might be worth adding one for the range cache lookup retry as well (see my other comment).
pkg/server/server.go, line 312 at r3 (raw file):
Settings: cfg.Settings,
OnOutgoingPing: func(req *rpc.PingRequest) error {
	return checkPingFor(ctx, req.TargetNodeID, codes.FailedPrecondition)
I think we should have a comment here explaining why we return different error codes for incoming and outgoing pings.
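For context, a hedged sketch of the asymmetry being discussed, extending the snippet quoted above (OnIncomingPing and the OriginNodeID field are assumptions about the surrounding rpc.ContextOptions wiring; only OnOutgoingPing and checkPingFor are visible in the quoted code):

```go
rpcCtxOpts := rpc.ContextOptions{
	Settings: cfg.Settings,
	// Outgoing ping: the *target* node is decommissioned. FailedPrecondition
	// lets the caller's retry loop (e.g. DistSender) evict this replica and
	// try another one instead of aborting.
	OnOutgoingPing: func(req *rpc.PingRequest) error {
		return checkPingFor(ctx, req.TargetNodeID, codes.FailedPrecondition)
	},
	// Incoming ping: the *origin* node is decommissioned. PermissionDenied
	// tells the decommissioned caller it has been permanently removed from
	// the cluster and should stop retrying.
	OnIncomingPing: func(req *rpc.PingRequest) error {
		return checkPingFor(ctx, req.OriginNodeID, codes.PermissionDenied)
	},
}
```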
Force-pushed from a9e0179 to 367dcbb
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/kv/kvclient/rangecache/range_cache.go, line 1329 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I think we want to retry FailedPrecondition here. In principle, if a range lookup fails then we want to retry that on the new leaseholder for the range. However, this is a bit hard to reason about, because there's cache eviction tokens and recursive DistSender calls involved, so please double-check this.
There are 3 call sites for this.
RangeIterator.Seek - it always relies on the underlying implementation of either DistSender or the tenant Client. Both of them provide internal retries to handle this case.
DistSender.sendPartialBatch - this method sits on top of sendToReplicas, where we perform the retry.
DistSender.partialRangeFeed - it handles errors from the routing info, which uses the range cache sitting on top of the same DistSender, so it also retries internally.
The comment on the method states that its intended use is on range lookup calls; since range lookup or its underlying components are responsible for retries, this should be sufficient imho.
pkg/server/server.go, line 312 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I think we should have a comment here explaining why we return different error codes for incoming and outgoing pings.
Done.
Force-pushed from df6204c to 84af023
Reviewed 4 of 4 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)
pkg/kv/kvclient/rangecache/range_cache.go, line 1329 at r3 (raw file):
Previously, aliher1911 (Oleg) wrote…
There are 3 call sites for this.
RangeIterator.Seek - it always relies on the underlying implementation of either DistSender or the tenant Client. Both of them provide internal retries to handle this case.
DistSender.sendPartialBatch - this method sits on top of sendToReplicas, where we perform the retry.
DistSender.partialRangeFeed - it handles errors from the routing info, which uses the range cache sitting on top of the same DistSender, so it also retries internally.
The comment on the method states that its intended use is on range lookup calls; since range lookup or its underlying components are responsible for retries, this should be sufficient imho.
Makes sense, thanks for verifying.
pkg/server/server.go, line 312 at r3 (raw file):
Previously, aliher1911 (Oleg) wrote…
Done.
s/unavailable/decommissioned/
Also, maybe reword this to clarify that the error code is only returned when the given node is decommissioned. The way it's written it sounds like it's always blocked/rejected. :)
pkg/util/grpcutil/grpc_util.go, line 152 at r4 (raw file):
errors.HasType(err, (*netutil.InitialHeartbeatFailedError)(nil)) ||
	errors.Is(err, circuit.ErrBreakerOpen) ||
	IsFailedPreconditionError(err) {
Shouldn't this check IsConnectionRejectedError? Let's expand the test with this as well.
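A hedged sketch of the kind of test extension being requested (the test name and the use of testify/require are assumptions; the point is only that RequestDidNotStart should treat both rejection codes as a guarantee that the request never started):

```go
package grpcutil

import (
	"testing"

	"github.com/stretchr/testify/require"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func TestRequestDidNotStartOnRejectedConnection(t *testing.T) {
	// A decommissioned peer (FailedPrecondition) and an auth rejection
	// (PermissionDenied) are both produced before the RPC is dispatched,
	// so the request is guaranteed not to have started.
	for _, code := range []codes.Code{codes.FailedPrecondition, codes.PermissionDenied} {
		err := status.Error(code, "connection rejected")
		require.True(t, RequestDidNotStart(err), "expected RequestDidNotStart for %s", code)
	}
}
```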
Force-pushed from 84af023 to 09ddf15
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)
pkg/server/server.go, line 312 at r3 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
s/unavailable/decommissioned/
Also, maybe reword this to clarify that the error code is only returned when the given node is decommissioned. The way it's written it sounds like it's always blocked/rejected. :)
Done.
pkg/util/grpcutil/grpc_util.go, line 152 at r4 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Shouldn't this check IsConnectionRejectedError? Let's expand the test with this as well.
That makes sense for consistency even if we should never reach here with auth errors. We need some better clarity on which layers handle which errors to make those trivial changes less intrusive to the whole kv codebase.
Previously, when the server detected that an outgoing call would go to a decommissioned node, it would shortcut with codes.PermissionDenied to skip the attempt. That would result in the retry loop immediately aborting without trying another replica or refreshing the leaseholder. One consequence is that writes to system.eventlog would lose events when trying to reach a decommissioned node.

To address that, the server now returns codes.FailedPrecondition, which results in the replica being evicted and the call retried.

Release note: None
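To make the change concrete, a hedged sketch of the retry decision from the sender's point of view (shouldTryNextReplica is a hypothetical helper used only for illustration; the grpcutil predicates are the ones discussed above):

```go
// Hypothetical illustration: how the two codes steer the retry loop.
func shouldTryNextReplica(sendErr error) bool {
	switch {
	case grpcutil.IsAuthError(sendErr):
		// PermissionDenied: the local node itself is not allowed to talk to
		// the cluster. Under the old behavior a decommissioned *peer* also
		// produced this code, so the loop aborted immediately.
		return false
	case grpcutil.IsFailedPreconditionError(sendErr):
		// FailedPrecondition: only the peer is decommissioned. Evict the
		// replica from the cache and move on to the next one.
		return true
	default:
		// Other transport errors keep their existing handling.
		return true
	}
}
```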
Force-pushed from 09ddf15 to d11a42c
Reviewed 2 of 2 files at r5.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @tbg)
Reviewed 1 of 5 files at r1, 5 of 8 files at r2, 1 of 1 files at r3, 3 of 4 files at r4, 2 of 2 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)
pkg/kv/kvserver/liveness/liveness.go, line 338 at r5 (raw file):
	log.Infof(ctx, "attempting to set liveness draining status to %v: %v", drain, err)
}
if grpcutil.IsConnectionRejected(err) {
We can't get FailedPrecondition here, right? Because we are going through a DistSender. Just checking my understanding. If we got FailedPrecondition here we should retry it (but we will never).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @tbg)
pkg/kv/kvserver/liveness/liveness.go, line 338 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We can't get FailedPrecondition here, right? Because we are going through a DistSender. Just checking my understanding. If we got FailedPrecondition here we should retry it (but we will never).
It uses DB.Txn to commit the change, so my understanding is that it sits above the DistSender. I think it is safe.
Think we may have encountered this in the wild: https://github.com/cockroachlabs/support/issues/1028#issuecomment-862585630 In this PR, when we send a request to a decommissioned node and that request fails, do we invalidate/update the relevant caches such that we don't try using that replica again?
We do evict the lease when this happens. Without this fix we just bail without touching the lease token.
Great, thanks. I'm guessing that also evicts the range descriptor from its cache?
bors r=erikgrinaker,tbg
Build succeeded:
How do we feel about backporting this to 21.1?
I think we need to brew it on master. I traced all the code paths where this error could percolate, but let's wait at least for the weekly to run, maybe? I don't think it would be hard technically to backport.
@aliher1911 Mind prepping a 21.1 backport so that this is at least ready to go when we feel confident about it? A customer got hit pretty badly by this (blocked all changefeed creation) and would like a near-term fix.
In cockroachdb#66199, the gRPC error `FailedPrecondition` (returned when contacting a decommissioned node) was incorrectly considered a permanent error during range descriptor lookups, via `IsRangeLookupErrorRetryable()`. These lookups are all trying to reach a meta-range (addressed by key), so when encountering a decommissioned node they should evict the cache token and retry the lookup. This patch reverses that change, making only authentication errors permanent (i.e. when the local node is decommissioned or otherwise not part of the cluster). Release note: None
66910: kvcoord: fix rangefeed retries on transport errors r=miretskiy,tbg,aliher1911,rickystewart a=erikgrinaker

### rangecache: consider FailedPrecondition a retryable error

In #66199, the gRPC error `FailedPrecondition` (returned when contacting a decommissioned node) was incorrectly considered a permanent error during range descriptor lookups, via `IsRangeLookupErrorRetryable()`. These lookups are all trying to reach a meta-range (addressed by key), so when encountering a decommissioned node they should evict the cache token and retry the lookup.

This patch reverses that change, making only authentication errors permanent (i.e. when the local node is decommissioned or otherwise not part of the cluster).

Release note: None

### kvcoord: fix rangefeed retries on transport errors

`DistSender.RangeFeed()` was meant to retry transport errors after refreshing the range descriptor (invalidating the cached entry). However, due to an incorrect error type check (`*sendError` vs `sendError`), these errors failed the range feed without invalidating the cached range descriptor.

This was particularly severe in cases where a large number of nodes had been decommissioned, where some stale range descriptors on some nodes contained only decommissioned nodes. Since change feeds set up range feeds across many nodes and ranges in the cluster, they are likely to encounter these decommissioned nodes and return an error -- and since the descriptor cache wasn't invalidated they would keep erroring until the nodes were restarted such that the caches were flushed (often requiring a full cluster restart).

Resolves #66636, touches #66586.

Release note (bug fix): Change feeds now properly invalidate cached range descriptors and retry when encountering decommissioned nodes.

/cc @cockroachdb/kv

Co-authored-by: Erik Grinaker <[email protected]>
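A hedged, self-contained illustration of the `*sendError` vs `sendError` pitfall mentioned above (the type and wrapping here are simplified stand-ins, not the actual kvcoord definitions):

```go
package main

import (
	"errors"
	"fmt"
)

// sendError is a simplified stand-in for kvcoord's transport error type,
// stored in the error chain as a value (not a pointer).
type sendError struct{ cause error }

func (e sendError) Error() string { return "send error: " + e.cause.Error() }

func main() {
	err := fmt.Errorf("rangefeed failed: %w", sendError{cause: errors.New("node decommissioned")})

	// Buggy check: looks for *sendError, which never appears in the chain,
	// so the "refresh descriptor and retry" path is skipped.
	var ptrTarget *sendError
	fmt.Println(errors.As(err, &ptrTarget)) // false

	// Fixed check: match the value type that is actually wrapped, so the
	// cached descriptor gets invalidated and the rangefeed retries.
	var valTarget sendError
	fmt.Println(errors.As(err, &valTarget)) // true
}
```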
Previously, when the server detected that an outgoing call would go to a
decommissioned node, it would shortcut with codes.PermissionDenied
to skip the attempt. That would result in the retry loop immediately aborting
without trying another replica or refreshing the leaseholder. One
consequence is that writes to system.eventlog would lose events
when trying to reach a decommissioned node.
To address that, the server now returns codes.FailedPrecondition, which
results in the replica being evicted and the call retried.
Release note: None
Fixes #65589