roachtest: decommission/randomized failed [waiting for backport] #65877
roachtest.decommission/randomized failed with artifacts on release-21.1 @ 587aaedda1a00fe03dd16317ec6da329f69a3ac9:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh decommission/randomized

Same failure on other branches
The issue is coming from the way we assert on command output.
Interesting. Do you have more details? What does the output look like in the error case? I agree that making the output parsing in this test more robust is a good immediate fix.
Here are two example cases that failed:
There are other issues with the test, where the test expects nodes to be live after decommission, but because of the aforementioned flake it waits too long for warnings to appear in the right place, and the node goes dark because the RPC circuit breaker kills it.
Okay, this is another failure in the same test. It was frequently failing in the way described above, and now that I changed that, it started failing with the error in the issue description. So that fix addresses a different issue, but not this one.
What it looks like is that the node has lots of replicas, and after ./cockroach node decommission we see the draining sequence starting from:
Ok, I found the culprit:
Interesting. So there are two cases here, and you're posting the second one.

Case 1: we're decommissioning (running) node X through node X itself. We'll set the decommissioned status and next we'll try to log the event. This may fail, since by decommissioning itself the node is about to be rejected by all other nodes. The write to the event log races with the other nodes realizing (through gossip) that they should reject us.

Case 2: this is what we see above. We're decommissioning node X through node Y (!= X), but we still get this error and fail to log the event. It seems that our node has n2 cached as the leaseholder for the range that the event log write is addressed to. The RPC layer (on node Y) is already aware that X has been removed, and so it returns a hard error from trying to dial it. The error is generated here:

cockroach/pkg/server/server.go Lines 291 to 297 in 9048076
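For readers following along, here is a minimal Go sketch of the symmetric check being described; it is not the actual server.go code, and the function name and error message are illustrative. Any ping involving a decommissioned node ID produces the same codes.PermissionDenied error, whether we are the removed node or we are merely dialing one.

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkPing is a hypothetical stand-in for the check referenced above in
// pkg/server/server.go: it consults a set of decommissioned node IDs and
// rejects the ping with a hard auth-style error if the peer has been removed.
func checkPing(decommissioned map[int]bool, peerNodeID int) error {
	if decommissioned[peerNodeID] {
		// Both "we are the removed node" and "we are dialing a removed node"
		// collapse into the same PermissionDenied error -- the symmetry the
		// comment above wants to break.
		return status.Errorf(codes.PermissionDenied,
			"n%d was permanently removed from the cluster", peerNodeID)
	}
	return nil
}

func main() {
	decommissioned := map[int]bool{2: true}
	// A healthy node dialing removed n2 (via a stale leaseholder cache) gets
	// the same error a removed node would get when dialing anyone.
	fmt.Println(checkPing(decommissioned, 2)) // PermissionDenied
	fmt.Println(checkPing(decommissioned, 3)) // <nil>
}
```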
As a general rule, we don't retry on these errors; see for example:

cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go Lines 1864 to 1870 in 9048076
This is because otherwise, operations that go through a decommissioned node hang forever (since wherever you try to connect, you'll get the auth error). But I think we made this too symmetrical: when healthy nodes try to dial a removed node (which they shouldn't, but that's hard to avoid entirely and all at once due to caches etc.), they also get this permission denied error, even though it indicates something different from a decommissioned node trying to dial someone. I think we need to break the symmetry: when you dial a removed node, you get a different error that we consider retriable, the idea being that you won't actually retry that node. @erikgrinaker mind checking that plan? We'd split up

cockroach/pkg/server/server.go Lines 309 to 314 in 9048076

so that the outgoing check returns FailedPrecondition (is that the right one? Or NotFound?), which will by default be retriable. We reserve PermissionDenied for the case in which the remote actually rejects our connection attempt (OnIncomingPing, where we are the remote and we're serving the error to the removed node).
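A rough sketch of what that split could look like, under the assumptions above; checkOutgoingPing and checkIncomingPing are hypothetical names, not the actual cockroach rpc/server functions. The outgoing check returns FailedPrecondition when the remote was removed, while the incoming handler keeps PermissionDenied for rejecting a removed caller.

```go
// Package pingcheck sketches the proposed asymmetric handling; all names are
// illustrative, not the actual cockroach rpc/server code.
package pingcheck

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkOutgoingPing: we are healthy but the node we are about to dial was
// removed. Returning FailedPrecondition lets the sender treat this as "evict
// the cache and go elsewhere" rather than as a fatal auth error.
func checkOutgoingPing(decommissioned map[int]bool, targetNodeID int) error {
	if decommissioned[targetNodeID] {
		return status.Errorf(codes.FailedPrecondition,
			"n%d was removed from the cluster; try another replica", targetNodeID)
	}
	return nil
}

// checkIncomingPing: we are the remote and the caller itself was removed.
// PermissionDenied stays reserved for this case, so a decommissioned node's
// own RPCs still fail fast instead of hanging in retry loops.
func checkIncomingPing(decommissioned map[int]bool, callerNodeID int) error {
	if decommissioned[callerNodeID] {
		return status.Errorf(codes.PermissionDenied,
			"n%d was permanently removed from the cluster; it is not allowed to rejoin", callerNodeID)
	}
	return nil
}
```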
We definitely have case 2 here; the only difference is that we have node X that we decommission, Y that we go through, and Z that is the cached leaseholder, but Z was also decommissioned in a previous step of the test. Just for clarity, so it matches the error above. The observed behaviour is no different from what is described. Looking at the recommended use of error codes, I think we should split the fixes to the test from the fixes to error handling here. By test fixes I mean what I mentioned earlier: filtering stderr and maybe a few other minor things there, unrelated to this particular history mismatch.
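As a rough illustration of the stderr-filtering idea (a hypothetical helper, not the actual roachtest code): run the CLI with stdout and stderr captured separately, so warnings never end up in the table the test parses and asserts on.

```go
// Package decomtest sketches the stderr-filtering fix; decommissionOutput is
// a hypothetical helper, not the actual roachtest code.
package decomtest

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// decommissionOutput runs `./cockroach node decommission ...` and returns only
// the stdout lines. Warnings printed to stderr (e.g. about connections to
// nodes that are going dark) stay out of the parsed table and are only
// surfaced if the command fails.
func decommissionOutput(args ...string) ([]string, error) {
	cmd := exec.Command("./cockroach", append([]string{"node", "decommission"}, args...)...)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout // the status table the test asserts on
	cmd.Stderr = &stderr // warnings land here, not in the parsed output
	if err := cmd.Run(); err != nil {
		return nil, fmt.Errorf("decommission failed: %v\nstderr: %s", err, stderr.String())
	}
	return strings.Split(strings.TrimSpace(stdout.String()), "\n"), nil
}
```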
Right, this is an interesting state of affairs. I get that we want to retry the log write after a lease refresh, but if we make all RPCs to decommissioned nodes return a retryable error, isn't there a risk that we can end up in retry loops for some operations? |
You're right, I think we need to be selective about it. What I should've said is: in DistSender, we should treat FailedPrecondition as retriable, in effect by not special-casing it as an auth error, so we'll treat it like an opaque SendError (leading to a cache eviction + retry, if memory serves correctly). When a node in the cluster happens to run an RPC targeting that particular decommissioned node directly, it should not retry, i.e. retain the current behavior.
Aren't these mutually exclusive? If "a node in the cluster happens to run an RPC targeting that particular decommissioned node", wouldn't that RPC typically go through the DistSender, which we're saying should retry it?
Actually, I guess that's not true. The DistSender primarily uses key/range addressing, not node addressing, so we'd want it to eventually get to that range. I.e., key/range-addressed operations should retry, but node-addressed operations shouldn't.
So this is the interesting part, where we handle the auth error: it currently doesn't allow retries, and would allow them if we change the outgoing heartbeat check to return codes.FailedPrecondition:

cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go Lines 1860 to 1938 in 9048076
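For illustration, a hedged sketch of the retry classification that change would enable; the names are illustrative and do not mirror the actual DistSender code. PermissionDenied/Unauthenticated stays terminal, while FailedPrecondition is treated like an opaque send error that leads to cache eviction and trying the next replica.

```go
// Package sendretry sketches the classification discussed above; names are
// illustrative and do not mirror the actual DistSender code.
package sendretry

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type action int

const (
	failFast      action = iota // propagate the error: we are likely the decommissioned node
	evictAndRetry               // drop the cached leaseholder and try the next replica
)

// classifySendError decides what a key/range-addressed sender should do with
// an error from dialing a replica.
func classifySendError(err error) action {
	switch status.Code(err) {
	case codes.PermissionDenied, codes.Unauthenticated:
		// A genuine auth rejection: the dialing node itself was removed, so
		// retrying elsewhere would loop forever.
		return failFast
	case codes.FailedPrecondition:
		// The node we dialed was removed; our leaseholder cache was merely
		// stale, so evict and retry against another replica.
		return evictAndRetry
	default:
		// Other send errors keep their existing handling; retrying is just
		// this sketch's placeholder.
		return evictAndRetry
	}
}
```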
I actually tried the change and was not able to repro the test failure with it. But I didn't run the full test suite to verify that nothing else fails.
Great! As outlined, we need to be careful to handle …
This will be fixed with the backport of #66199. |
@aliher1911 could you backport your change? Or are we intentionally holding off to let it bake on master for a bit? |
roachtest.decommission/randomized failed with artifacts on release-21.1 @ db3cdb8913ba56599cc52766411893438b5c4b54:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh decommission/randomized
It's been 26 days; should we go ahead with the backport after the local verifications you mention above, @aliher1911?
Actually we backported it to 21.1 already: #66831 |
roachtest.decommission/randomized failed with artifacts on release-21.1 @ 4837b15a513fe9f0283b1583fece7fe8d3cd49ae:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh decommission/randomized
Same failure on other branches