Combination of loss of quorum on timeseries ranges, and node restart renders cluster inaccessible #82916
Labels
A-observability-inf
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
no-issue-activity
X-stale
Normally, the loss of quorum on the timeseries ranges doesn't cause any major issues other than the Metrics page not loading graphs on the DBConsole. However, restarting a node afterwards causes that node to fail in all sorts of unusual ways which are not logged and are very hard to diagnose.
My tests showed it affects 21.2.12 (latest at the time), but 22.1 is unaffected. However, I'm putting this reproduction here for posterity.
Start a 5 node cluster on 21.2.12 in the usual way:
On a new 5 node cluster, all the existing ranges are replicated 5 times except for the timeseries data, which is replicated 3 times. This makes it simple to pick 2 nodes whose sudden loss breaks quorum on the timeseries data while everything else survives:
Confirm that there is now only one replica:
Find the node ids of those replicas:
So I'm going to hard shutdown nodes 3 and 5:
node 3+5# sudo poweroff
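The reason this takes out only the timeseries ranges comes down to majority quorum arithmetic. A minimal sketch, assuming each range needs a strict majority of its replicas to stay available (as with Raft); the helper names `quorum` and `survives` are hypothetical and are not part of any CockroachDB API:

```python
def quorum(replicas: int) -> int:
    """Smallest strict majority of a replica group of the given size."""
    return replicas // 2 + 1


def survives(replicas: int, lost: int) -> bool:
    """True if a range still has a majority after losing `lost` replicas."""
    return replicas - lost >= quorum(replicas)


# Worst case for this reproduction: both powered-off nodes held a replica.
for name, replicas in [("timeseries ranges (3 replicas)", 3),
                       ("all other ranges (5 replicas)", 5)]:
    status = "keeps quorum" if survives(replicas, lost=2) else "loses quorum"
    print(f"{name}: {status}")
```

With 3 replicas the majority is 2, so losing 2 leaves a single replica and the range becomes unavailable. With 5 replicas the majority is 3, so losing 2 still leaves enough replicas to keep serving.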
After giving the cluster some time to treat those two nodes as dead and up-replicate the underreplicated ranges - the DBConsole Overview now looks like this:
With timeseries losing quorum, none of the graphs work:
However, losing these graphs is somewhat expected, and not the issue this ticket is reporting. Overall functionality of the cluster remains intact. In particular, I can log in with cockroach sql to any node without issue:
But what happens if I restart a node and try to connect to it:
It now just times out. If I go further and restart the entire cluster, then I'm unable to log into the cluster at all. Interestingly, DBConsole keeps working and, other than the unavailable ranges, appears healthy.
In addition to not being able to log in, nodes will no longer write a socket file if using --socket-dir. New nodes also cannot join the cluster.
Once the unavailable timeseries ranges are repaired by the LoQ tool, everything reverts to a fully working condition.
Jira issue: CRDB-16726