
roachtest: decommission/randomized failed [waiting for backport] #65877

Closed
cockroach-teamcity opened this issue May 29, 2021 · 22 comments
Labels: C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team

@cockroach-teamcity (Member)

roachtest.decommission/randomized failed with artifacts on release-21.1 @ 4837b15a513fe9f0283b1583fece7fe8d3cd49ae:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	decommission.go:848,retry.go:197,decommission.go:794,decommission.go:61,test_runner.go:767: unexpected diff(matrix, expMatrix):
		[[][]string[19] != [][]string[20]]
		[[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioned]]
		vs.
		[[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioning] [node_decommissioned]]
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh decommission/randomized

Same failure on other branches

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 29, 2021
@cockroach-teamcity cockroach-teamcity added this to the 21.1 milestone May 29, 2021
@cockroach-teamcity (Member, Author)

roachtest.decommission/randomized failed with artifacts on release-21.1 @ 587aaedda1a00fe03dd16317ec6da329f69a3ac9:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	decommission.go:848,retry.go:197,decommission.go:794,decommission.go:61,test_runner.go:767: unexpected diff(matrix, expMatrix):
		[[][]string[19] != [][]string[20]]
		[[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioned]]
		vs.
		[[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioning] [node_decommissioned]]
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh decommission/randomized

Same failure on other branches

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

@tbg tbg changed the title roachtest: decommission/randomized failed roachtest: decommission/randomized failed [missing event] May 31, 2021
@aliher1911 aliher1911 self-assigned this Jun 1, 2021
@pbardea pbardea removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jun 1, 2021
@aliher1911 (Contributor)

The issue comes from the way we assert on command output.
We execute commands with CombinedOutput(), which redirects stderr into stdout and merges the two streams. The command result goes to stdout and warnings go to stderr. Commands write warnings before executing the action, but in the merged output the warnings sometimes get reordered relative to the result. Our assertion matches the expected lines against the tail of the output, so if a warning slips below the result the test fails.
Having err and out in the same buffer is not ideal, as it complicates verification, but changing that would touch a lot of code. As a fix I suggest we filter out known warnings for now.

@tbg (Member) commented Jun 3, 2021

Interesting. Do you have more details? What does the output look like in the error case? I agree that making the output parsing in this test more robust is a good immediate fix.

@aliher1911 (Contributor)

Here are two cases that failed, for example:
https://gist.github.com/aliher1911/fc94d7ba65fbd519f8d3dd8c19002943
And the warning in question is:
^warning: node [0-9]+ is already decommissioning or decommissioned$
The failure sometimes happens on one of the two asserts where we try to decommission a node once again and expect a no-op.
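
For illustration, a minimal sketch of the warning filtering suggested above, assuming the test compares the expected lines against the tail of the combined output line by line (the helper name and package are hypothetical; the regexp is the one quoted above):

package roachtestutil // hypothetical location for this sketch

import (
	"regexp"
	"strings"
)

// Warning emitted when a node is decommissioned a second time; harmless for
// the assertions in this test.
var alreadyDecomRe = regexp.MustCompile(
	`^warning: node [0-9]+ is already decommissioning or decommissioned$`)

// filterKnownWarnings drops known-harmless warning lines from the combined
// stdout/stderr output so that reordering between the two streams cannot
// shift the tail that the test asserts on.
func filterKnownWarnings(lines []string) []string {
	filtered := make([]string, 0, len(lines))
	for _, l := range lines {
		if alreadyDecomRe.MatchString(strings.TrimSpace(l)) {
			continue
		}
		filtered = append(filtered, l)
	}
	return filtered
}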

@aliher1911 (Contributor)

There are other issues with the test: it expects nodes to be live after decommission, but because of the aforementioned flake it waits too long for the warnings to appear in the right place, and the node goes dark because the circuit breaker on RPC kills it.

@aliher1911 (Contributor)

Okay, this is another failure in the same test. It was frequently failing with what was described above, and now that I've changed that it started failing with the error shown in the issue description. So that fix addresses a different issue, not this one.
The failure described here is reproducible now, and it looks like the events table is missing one state.

@aliher1911 (Contributor)

What it looks like is: when the node has lots of replicas, after ./cockroach node decommission we see a draining sequence starting from:
6,false,28,true,decommissioning,false
and going to
6,false,0,true,decommissioning,false
in 20+ seconds, and the test passes.
When the sequence instead starts from:
2,false,3,true,decommissioning,false
it takes about 10 seconds to finish, and the test fails without seeing the node_decommissioning status next to last.
I'm going to add more logging to see which node is in each state in that query, to maybe get more insight.

@aliher1911 (Contributor)

OK, I found the culprit:
E210603 22:03:19.644374 191965 1@server/server.go:2377 ⋮ [-] 1793 unable to record event: ‹common:<timestamp:1622757789936846453 event_type:"node_decommissioning" > node:<requesting_node_id:1 target_node_id:5 >›: ‹log-event›: failed to connect to n2 at ‹10.142.0.54:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = PermissionDenied desc = n2 was permanently removed from the cluster at 2021-06-03 22:02:25.848597039 +0000 UTC; it is not allowed to rejoin the cluster›
I can see that on the node that registers the decommissioning event.
Node 5 was killed and we are trying to decommission it while it's unavailable; node 2 was also decommissioned as part of the test prior to that. We are running with only 2 nodes, which should be... still OK?

@tbg (Member) commented Jun 4, 2021

Interesting. So there are two cases here, and you're seeing the second one.

case 1: we're decommissioning (running) node X through node X. We'll set the decommissioned status and next we'll try to log the event. This may fail, since by decommissioning itself the node is about to be rejected by all other nodes. The write to the event log will race with the other nodes realizing (through gossip) to reject us.

case 2: this is what we see above. We're decommissioning node X through node Y (!= X) but we still get this error and fail to log the event. It seems that our node has n2 cached as a leaseholder for the range that the event log write is addressed to. The RPC layer (on node Y) is already aware that X has been removed, and so it returns a hard error from trying to dial it. The error is generated here:

if !ts.IsZero() {
	// The node was decommissioned.
	return grpcstatus.Errorf(codes.PermissionDenied,
		"n%d was permanently removed from the cluster at %s; it is not allowed to rejoin the cluster",
		nodeID, ts,
	)
}

As a general rule, we don't retry on these errors, see for example

if grpcutil.IsAuthError(err) {
	// Authentication or authorization error. Propagate.
	if ambiguousError != nil {
		return nil, roachpb.NewAmbiguousResultErrorf("error=%s [propagate]", ambiguousError)
	}
	return nil, err
}

This is because otherwise, operations that go through a decommissioned node hang forever (since anywhere you try to connect, you'll get the auth error). But I think we made this too symmetrical: when healthy nodes try to dial a removed node (which they shouldn't, but it's hard to avoid entirely and at once due to caches etc.), they also get this permission denied error, but it really indicates something different from the case where the decommissioned node tries to dial someone. I think we need to break the symmetry (when you dial a removed node, you get a different error that we consider retriable, the idea being that you won't actually retry).

@erikgrinaker mind checking that plan? We'd split up

OnOutgoingPing: func(req *rpc.PingRequest) error {
	return checkPingFor(ctx, req.TargetNodeID)
},
OnIncomingPing: func(req *rpc.PingRequest) error {
	return checkPingFor(ctx, req.OriginNodeID)
}}
so that when trying to connect to a removed node (OnOutgoingPing), we get a FailedPrecondition (is that the right one? Or NotFound?) back, which will by default be retriable. We reserve the PermissionDenied for the case in which the remote actually rejects our connection attempt (OnIncomingPing, where we are the remote and we're serving the error to the removed node).
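
A rough sketch of that split, assuming checkPingFor were extended to take the gRPC code to return for a removed node (that extra parameter is hypothetical here, not the existing signature):

// Outgoing ping: we are dialing the target node. If it was removed, surface
// FailedPrecondition so that callers such as the DistSender can treat it like
// an ordinary connection failure (evict caches, try another replica).
OnOutgoingPing: func(req *rpc.PingRequest) error {
	return checkPingFor(ctx, req.TargetNodeID, codes.FailedPrecondition)
},
// Incoming ping: we are the remote rejecting a removed node's attempt to
// rejoin. Keep PermissionDenied so the decommissioned node fails fast instead
// of retrying forever.
OnIncomingPing: func(req *rpc.PingRequest) error {
	return checkPingFor(ctx, req.OriginNodeID, codes.PermissionDenied)
}}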

@aliher1911 (Contributor)

We definitely have case 2 here; the only difference is that we have node X that we decommission, node Y that we go through, and node Z that is the cached leaseholder, which was also decommissioned in a previous step of the test (just for clarity, so it matches the error above). The observed behaviour is no different from what you describe.

Looking at the recommended use of the error codes, FailedPrecondition is intended for when the system state needs to be explicitly fixed before retrying, while Unavailable seems to be the code you are supposed to retry immediately. So if we return FailedPrecondition at the transport layer, it should trigger a lease refresh and a retry. Sounds logical to me.

I think we should split the fixes to the test from the fixes to the error handling here. By test fixes I mean what I mentioned earlier: filtering stderr and maybe a few other minor things there unrelated to this particular history mismatch.

@erikgrinaker (Contributor)

Right, this is an interesting state of affairs. I get that we want to retry the log write after a lease refresh, but if we make all RPCs to decommissioned nodes return a retryable error, isn't there a risk that we can end up in retry loops for some operations?

@tbg (Member) commented Jun 4, 2021

You're right, I think we need to be selective about it. What I should've said is: in DistSender, we should treat FailedPrecondition as retriable, in effect by not special casing it as an auth error - so we'll treat it like an opaque SendError (leading to a cache eviction + retry, if memory serves correctly). When a node in the cluster happens to run an RPC targeting that particular decommissioned node directly - it should not retry, i.e. retain the current behavior.
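
Put differently, a sketch of the intended classification on the DistSender side (a hypothetical helper, not the actual code; it assumes the gRPC status code is still visible on the error at this point):

package distsketch // hypothetical

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// shouldPropagateWithoutRetry sketches the intended split: only "we were
// rejected" errors stop the send loop, while "the node we dialed was removed"
// is handled like any other connection failure, leading to a cache eviction
// and a retry against the next replica.
func shouldPropagateWithoutRetry(err error) bool {
	switch status.Code(err) {
	case codes.PermissionDenied, codes.Unauthenticated:
		return true // we were rejected by the remote: fail fast
	case codes.FailedPrecondition:
		return false // the dialed node was removed: retriable elsewhere
	default:
		return false // other connection errors already follow the retry path
	}
}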

@erikgrinaker (Contributor)

in DistSender, we should treat FailedPrecondition as retriable, in effect by not special casing it as an auth error - so we'll treat it like an opaque SendError (leading to a cache eviction + retry, if memory serves correctly). When a node in the cluster happens to run an RPC targeting that particular decommissioned node directly - it should not retry, i.e. retain the current behavior.

Aren't these mutually exclusive? If "a node in the cluster happens to run an RPC targeting that particular decommissioned node", wouldn't that RPC typically go through the DistSender, who we're saying should retry it?

@erikgrinaker (Contributor) commented Jun 4, 2021

Actually, I guess that's not true. The DistSender primarily uses key/range addressing, not node addressing, so we'd want it to eventually get to that range. I.e., key/range-addressed operations should retry, but node-addressed operations shouldn't.

@aliher1911 (Contributor)

So this is the interesting part where we handle the auth error, which currently does not allow retries, and which would allow them if we changed the outgoing heartbeat check to return codes.FailedPrecondition:

br, err = transport.SendNext(ctx, ba)
ds.maybeIncrementErrCounters(br, err)
if err != nil {
	if grpcutil.IsAuthError(err) {
		// Authentication or authorization error. Propagate.
		if ambiguousError != nil {
			return nil, roachpb.NewAmbiguousResultErrorf("error=%s [propagate]", ambiguousError)
		}
		return nil, err
	}
	// For most connection errors, we cannot tell whether or not the request
	// may have succeeded on the remote server (exceptions are captured in the
	// grpcutil.RequestDidNotStart function). We'll retry the request in order
	// to attempt to eliminate the ambiguity; see below. If there's a commit
	// in the batch, we track the ambiguity more explicitly by setting
	// ambiguousError. This serves two purposes:
	// 1) the higher-level retries in the DistSender will not forget the
	// ambiguity, like they forget it for non-commit batches. This in turn
	// will ensure that TxnCoordSender-level retries don't happen across
	// commits; that'd be bad since requests are not idempotent across
	// commits.
	// TODO(andrei): This higher-level does things too bluntly, retrying only
	// in case of sendError. It should also retry in case of
	// AmbiguousRetryError as long as it makes sure to not forget about the
	// ambiguity.
	// 2) SQL recognizes AmbiguousResultErrors and gives them a special code
	// (StatementCompletionUnknown).
	// TODO(andrei): The use of this code is inconsistent because a) the
	// DistSender tries to only return the code for commits, but it'll happily
	// forward along AmbiguousResultErrors coming from the replica and b) we
	// probably should be returning that code for non-commit statements too.
	//
	// We retry requests in order to avoid returning errors (in particular,
	// AmbiguousResultError). Retrying the batch will either:
	// a) succeed if the request had not been evaluated the first time.
	// b) succeed if the request also succeeded the first time, but is
	// idempotent (i.e. it is internal to a txn, without a commit in the
	// batch).
	// c) fail if it succeeded the first time and the request is not
	// idempotent. In the case of EndTxn requests, this is ensured by the
	// tombstone keys in the timestamp cache. The retry failing does not
	// prove that the request did not succeed the first time around, so we
	// can't claim success (and even if we could claim success, we still
	// wouldn't have the complete result of the successful evaluation).
	//
	// Case a) is great - the retry made the request succeed. Case b) is also
	// good; due to idempotency we managed to swallow a communication error.
	// Case c) is not great - we'll end up returning an error even though the
	// request might have succeeded (an AmbiguousResultError if withCommit is
	// set).
	//
	// TODO(andrei): Case c) is broken for non-transactional requests: nothing
	// prevents them from double evaluation. This can result in, for example,
	// an increment applying twice, or more subtle problems like a blind write
	// evaluating twice, overwriting another unrelated write that fell
	// in-between.
	//
	if withCommit && !grpcutil.RequestDidNotStart(err) {
		ambiguousError = err
	}
	log.VErrEventf(ctx, 2, "RPC error: %s", err)
	// If the error wasn't just a context cancellation and the down replica
	// is cached as the lease holder, evict it. The only other eviction
	// happens below on NotLeaseHolderError, but if the next replica is the
	// actual lease holder, we're never going to receive one of those and
	// will thus pay the price of trying the down node first forever.
	//
	// NB: we should consider instead adding a successful reply from the next
	// replica into the cache, but without a leaseholder (and taking into
	// account that the local node can't be down) it won't take long until we
	// talk to a replica that tells us who the leaseholder is.
	if ctx.Err() == nil {
		if lh := routing.Leaseholder(); lh != nil && *lh == curReplica {
			routing.EvictLease(ctx)
		}
	}

I actually tried the change and was not able to repro the test failure with it. But I haven't run the full test suite yet to verify whether anything else fails.

@erikgrinaker (Contributor)

Great! As outlined, we need to be careful to handle FailedPrecondition as a non-retryable error for RPC calls outside of the DistSender -- otherwise they may end up in ~infinite retry loops. Looking at IsAuthError() call sites and code paths would be a good start. And we should make sure there are ample comments and tests for this, since it's a bit of a footgun.

@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@tbg (Member) commented Jun 17, 2021

This will be fixed with the backport of #66199.

@tbg (Member) commented Jun 22, 2021

@aliher1911 could you backport your change? Or are we intentionally holding off to let it bake on master for a bit?

@aliher1911 (Contributor)

@tbg letting it bake on master for a bit was my plan. I'm also running some local tests to verify that it addresses some other issues with talking to decommissioned nodes that were raised against 19.1 (#47460 and #47459), and checking that nothing else is broken by this fix.

@tbg tbg changed the title roachtest: decommission/randomized failed [missing event] roachtest: decommission/randomized failed [waiting for backport] Jun 22, 2021
@cockroach-teamcity (Member, Author)

roachtest.decommission/randomized failed with artifacts on release-21.1 @ db3cdb8913ba56599cc52766411893438b5c4b54:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	decommission.go:848,retry.go:197,decommission.go:794,decommission.go:61,test_runner.go:728: unexpected diff(matrix, expMatrix):
		[[][]string[19] != [][]string[20]]
		[[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioned]]
		vs.
		[[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioning] [node_decommissioned]]
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh decommission/randomized

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

@tbg (Member) commented Jul 13, 2021

It's been 26 days, should we go ahead with the backport after the local verifications you mention above @aliher1911?

@aliher1911 (Contributor)

Actually, we already backported it to 21.1: #66831
As for previous versions, it is not possible to do so because the code to short-circuit decommissioned nodes didn't exist before 21.1.
So I'm going to close this as fixed.
