
Combination of loss of quorum on timeseries ranges, and node restart renders cluster inaccessible #82916

Closed
smcvey opened this issue Jun 15, 2022 · 5 comments
Labels
A-observability-inf, C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), no-issue-activity, X-stale

Comments

smcvey (Contributor) commented Jun 15, 2022

Normally, the loss of quorum on the timeseries ranges doesn't cause any major issues other than the DBConsole Metrics page not loading graphs. However, restarting a node then causes that node to fail in all sorts of unusual ways that are not logged and are very hard to diagnose.

My tests showed it affects 21.2.12 (the latest at the time), while 22.1 is unaffected. However, I'm recording this reproduction here for posterity.

Start a 5 node cluster on 21.2.12 in the usual way:

(on each node)
# cockroach start  --advertise-addr=<ip> --join=<ip>,<ip>,... --cache=.25 --max-sql-memory=.25 --background
(on one node)
# cockroach init
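
Before going further, it can be worth confirming that all five nodes have joined. One quick check (not part of the original steps; add --insecure or the certificate flags to match however the cluster was started) is:

(on any node)
# cockroach node status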

On a new 5-node cluster, all existing ranges are replicated 5 times except for the timeseries data, which is replicated 3 times. This makes it simple to identify two nodes which, when suddenly destroyed, lose quorum on the timeseries data while everything else survives:

root@:26257/defaultdb> select range_id, replicas from crdb_internal.ranges_no_leases;
  range_id |  replicas
-----------+--------------
         1 | {1,2,3,4,5}
         2 | {1,2,3,4,5}
         3 | {1,2,3,4,5}
         4 | {2,3,5}
         5 | {1,2,3,4,5}
         6 | {1,2,3,4,5}
         7 | {1,2,3,4,5}
....
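
Rather than scanning the whole list, a query along the following lines should surface the 3x-replicated timeseries range directly once up-replication has settled (a sketch, not taken from the original report; array_length counts the replicas per range):

root@:26257/defaultdb> select range_id, start_pretty, replicas from crdb_internal.ranges_no_leases where array_length(replicas, 1) < 5;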

Confirm that there is now only one replica:

root@:26257/defaultdb> select range_id, start_pretty, end_pretty, replicas from crdb_internal.ranges_no_leases where range_id = 4;
  range_id | start_pretty |  end_pretty   | replicas
-----------+--------------+---------------+-----------
         4 | /System/tsd  | /System/"tse" | {4}
(1 row)

Find the node ids of those replicas:

root@:26257/defaultdb> select node_id, store_id from crdb_internal.kv_store_status where store_id in (2, 3, 5);
  node_id | store_id
----------+-----------
        2 |        2
        3 |        3
        5 |        5
(3 rows)
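
The two lookups above can also be folded into a single query if preferred (a sketch; it assumes range 4 is the timeseries range, as in this run):

root@:26257/defaultdb> select node_id, store_id from crdb_internal.kv_store_status where store_id in (select unnest(replicas) from crdb_internal.ranges_no_leases where range_id = 4);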

So I'm going to do a hard shutdown of nodes 3 and 5:

node 3+5# sudo poweroff

After giving the cluster some time to mark those two nodes as dead and up-replicate the under-replicated ranges, the DBConsole Overview looks like this:

[Screenshot: DBConsole Overview after nodes 3 and 5 are declared dead]

With timeseries losing quorum, none of the graphs work:

[Screenshot: DBConsole Metrics page with no graphs rendering]
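
To confirm from SQL which ranges have actually lost quorum, rather than inferring it from the graphs, a rough check is to count how many of each range's replicas sit on the dead stores (stores 3 and 5 here; adjust the IDs for your own run, and treat this as a sketch rather than the exact query used):

root@:26257/defaultdb> select range_id, start_pretty, replicas from crdb_internal.ranges_no_leases where 2 * ((case when 3 = any(replicas) then 1 else 0 end) + (case when 5 = any(replicas) then 1 else 0 end)) > array_length(replicas, 1);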

However, losing these graphs is somewhat expected and is not the issue this ticket is reporting. Overall cluster functionality remains intact. In particular, I can log in with cockroach sql to any node without issue:

[vagrant@node-1 ~]$ cockroach sql
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
# Server version: CockroachDB CCL v22.1.1 (x86_64-pc-linux-gnu, built 2022/06/06 16:38:56, go1.17.6) (same version as client)
# Cluster ID: d00c2ca7-e6b2-44df-a9fc-8f1782981ccc
# Organization: Support
#
# Enter \? for a brief introduction.
#
root@:26257/defaultdb> show databases;
  database_name | owner | primary_region | regions | survival_goal
----------------+-------+----------------+---------+----------------
  defaultdb     | root  | NULL           | {}      | NULL
  postgres      | root  | NULL           | {}      | NULL
  system        | node  | NULL           | {}      | NULL
(3 rows)


Time: 2ms total (execution 1ms / network 0ms)

But here is what happens if I restart a node and then try to connect to it:

[vagrant@node-1 ~]$ cockroach sql
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.

read tcp [::1]:43690 -> [::1]:26257: i/o timeout
Failed running "sql"
[vagrant@node-1 ~]$ 

It now just times out. If I go further and restart the entire cluster, I'm unable to log into the cluster at all. Interestingly, the DBConsole keeps working and, aside from the unavailable ranges, appears healthy.

In addition to logins failing, nodes no longer write a socket file when started with --socket-dir, and new nodes cannot join the cluster.
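
For reference, the socket behaviour above refers to nodes started with the --socket-dir flag, along these lines (the directory is only illustrative):

(on each node)
# cockroach start --advertise-addr=<ip> --join=<ip>,<ip>,... --socket-dir=/var/run/cockroach --cache=.25 --max-sql-memory=.25 --background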

Once the unavailable timeseries ranges are repaired with the loss-of-quorum (LoQ) recovery tool, everything returns to a fully working state.
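
For completeness, on 21.2 that repair is roughly the following, run against each store that still holds a surviving replica while its node is stopped (the store path is illustrative, and the exact procedure is version-dependent; newer releases replace this with cockroach debug recover, so check the docs for your version):

(on each node holding a surviving replica, while that node is stopped)
# cockroach debug unsafe-remove-dead-replicas --dead-store-ids=3,5 <store-directory>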

Jira issue: CRDB-16726

smcvey added the C-bug label on Jun 15, 2022
rafiss changed the title from "Combination of loss of quorum on timeseries ranges, and node restart renders node inaccessible" to "Combination of loss of quorum on timeseries ranges, and node restart renders cluster inaccessible" on Jul 13, 2022
rafiss (Collaborator) commented Jul 26, 2022

Link to internal ticket where this was discovered: https://github.com/cockroachlabs/support/issues/1646

andreimatei (Contributor) commented:

> My tests showed it affects 21.2.12 (latest at the time), but 22.1 is unaffected.

This reproduces reliably with 21.2.12, and you've tried the steps with 22.1 where things were good?

knz (Contributor) commented Jul 26, 2022

I think what's happening here is that a timeseries epoch marker gets written on the critical path of server startup. This vaguely rings a bell. We should put a timeout on that write, or move it off the critical path into a separate goroutine.
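
One way to check whether startup really is blocked on such a write would be to grab a goroutine dump from a hung node's HTTP endpoint, which stays responsive, and look for startup goroutines stuck under pkg/ts or waiting on a KV write to the timeseries range (this assumes the default HTTP port 8080 and an insecure cluster, neither of which is stated above):

$ curl 'http://<node-address>:8080/debug/pprof/goroutine?debug=2' > goroutines.txt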

smcvey (Contributor, Author) commented Jul 27, 2022

Has anyone been able to reproduce this in 22.1? If not, the issue may no longer actually exist.

github-actions (bot) commented:

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 5, 2024