
Combination of loss of quorum on timeseries ranges, and node restart renders cluster inaccessible #82916

Closed
smcvey opened this issue Jun 15, 2022 · 5 comments
Labels
A-observability-inf, C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), no-issue-activity, X-stale

Comments

smcvey (Contributor) commented Jun 15, 2022

Normally, the loss of quorum on the timeseries ranges doesn't cause any major issues other than the DBConsole Metrics page not loading graphs. However, restarting a node then causes that node to fail in all sorts of unusual ways that are not logged and are very hard to diagnose.

My tests showed it affects 21.2.12 (the latest at the time), while 22.1 is unaffected. However, I'm recording this reproduction here for posterity.

Start a 5 node cluster on 21.2.12 in the usual way:

(on each node)
# cockroach start  --advertise-addr=<ip> --join=<ip>,<ip>,... --cache=.25 --max-sql-memory=.25 --background
(on one node)
# cockroach init
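
Before going further, it can be worth confirming that all five nodes have joined. One quick check (not part of the original steps; add --insecure or the certificate flags to match however the cluster was started) is:

(on any node)
# cockroach node status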

On a new 5-node cluster, all existing ranges are replicated 5 times except for the timeseries data, which is replicated 3 times. This makes it simple to identify two nodes which, when suddenly destroyed, lose quorum on the timeseries data while everything else survives:

root@:26257/defaultdb> select range_id, replicas from crdb_internal.ranges_no_leases;
  range_id |  replicas
-----------+--------------
         1 | {1,2,3,4,5}
         2 | {1,2,3,4,5}
         3 | {1,2,3,4,5}
         4 | {2,3,5}
         5 | {1,2,3,4,5}
         6 | {1,2,3,4,5}
         7 | {1,2,3,4,5}
....
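
Rather than scanning the whole list, a query along the following lines should surface the 3x-replicated timeseries range directly once up-replication has settled (a sketch, not taken from the original report; array_length counts the replicas per range):

root@:26257/defaultdb> select range_id, start_pretty, replicas from crdb_internal.ranges_no_leases where array_length(replicas, 1) < 5;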

Confirm that there is now only one replica:

root@:26257/defaultdb> select range_id, start_pretty, end_pretty, replicas from crdb_internal.ranges_no_leases where range_id = 4;
  range_id | start_pretty |  end_pretty   | replicas
-----------+--------------+---------------+-----------
         4 | /System/tsd  | /System/"tse" | {4}
(1 row)

Find the node ids of those replicas:

root@:26257/defaultdb> select node_id, store_id from crdb_internal.kv_store_status where store_id in (2, 3, 5);
  node_id | store_id
----------+-----------
        2 |        2
        3 |        3
        5 |        5
(3 rows)
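
The two lookups above can also be folded into a single query if preferred (a sketch; it assumes range 4 is the timeseries range, as in this run):

root@:26257/defaultdb> select node_id, store_id from crdb_internal.kv_store_status where store_id in (select unnest(replicas) from crdb_internal.ranges_no_leases where range_id = 4);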

So I'm going to do a hard shutdown of nodes 3 and 5:

node 3+5# sudo poweroff

After giving the cluster some time to mark those two nodes as dead and up-replicate the under-replicated ranges, the DBConsole Overview looks like this:

[Screenshot: DBConsole Overview after nodes 3 and 5 are declared dead]

With timeseries losing quorum, none of the graphs work:

[Screenshot: DBConsole Metrics page with no graphs rendering]
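
To confirm from SQL which ranges have actually lost quorum, rather than inferring it from the graphs, a rough check is to count how many of each range's replicas sit on the dead stores (stores 3 and 5 here; adjust the IDs for your own run, and treat this as a sketch rather than the exact query used):

root@:26257/defaultdb> select range_id, start_pretty, replicas from crdb_internal.ranges_no_leases where 2 * ((case when 3 = any(replicas) then 1 else 0 end) + (case when 5 = any(replicas) then 1 else 0 end)) > array_length(replicas, 1);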

However, losing these graphs is somewhat expected and is not the issue this ticket is reporting. Overall cluster functionality remains intact. In particular, I can log in with cockroach sql to any node without issue:

[vagrant@node-1 ~]$ cockroach sql
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
# Server version: CockroachDB CCL v22.1.1 (x86_64-pc-linux-gnu, built 2022/06/06 16:38:56, go1.17.6) (same version as client)
# Cluster ID: d00c2ca7-e6b2-44df-a9fc-8f1782981ccc
# Organization: Support
#
# Enter \? for a brief introduction.
#
root@:26257/defaultdb> show databases;
  database_name | owner | primary_region | regions | survival_goal
----------------+-------+----------------+---------+----------------
  defaultdb     | root  | NULL           | {}      | NULL
  postgres      | root  | NULL           | {}      | NULL
  system        | node  | NULL           | {}      | NULL
(3 rows)


Time: 2ms total (execution 1ms / network 0ms)

But here is what happens if I restart a node and then try to connect to it:

[vagrant@node-1 ~]$ cockroach sql
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.

read tcp [::1]:43690 -> [::1]:26257: i/o timeout
Failed running "sql"
[vagrant@node-1 ~]$ 

It now just times out. If I go further and restart the entire cluster, I'm unable to log into the cluster at all. Interestingly, the DBConsole keeps working and, aside from the unavailable ranges, appears healthy.

In addition to logins failing, nodes no longer write a socket file when started with --socket-dir, and new nodes cannot join the cluster.
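
For reference, the socket behaviour above refers to nodes started with the --socket-dir flag, along these lines (the directory is only illustrative):

(on each node)
# cockroach start --advertise-addr=<ip> --join=<ip>,<ip>,... --socket-dir=/var/run/cockroach --cache=.25 --max-sql-memory=.25 --background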

Once the unavailable timeseries ranges are repaired with the loss-of-quorum (LoQ) recovery tool, everything returns to a fully working state.
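
For completeness, on 21.2 that repair is roughly the following, run against each store that still holds a surviving replica while its node is stopped (the store path is illustrative, and the exact procedure is version-dependent; newer releases replace this with cockroach debug recover, so check the docs for your version):

(on each node holding a surviving replica, while that node is stopped)
# cockroach debug unsafe-remove-dead-replicas --dead-store-ids=3,5 <store-directory>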

Jira issue: CRDB-16726

smcvey added the C-bug label on Jun 15, 2022
rafiss changed the title from "Combination of loss of quorum on timeseries ranges, and node restart renders node inaccessible" to "Combination of loss of quorum on timeseries ranges, and node restart renders cluster inaccessible" on Jul 13, 2022
rafiss (Collaborator) commented Jul 26, 2022

Link to internal ticket where this was discovered: https://github.com/cockroachlabs/support/issues/1646

andreimatei (Contributor) commented:

> My tests showed it affects 21.2.12 (latest at the time), but 22.1 is unaffected.

This reproduces reliably with 21.2.12, and you've tried the steps with 22.1 where things were good?

knz (Contributor) commented Jul 26, 2022

I think what's happening here is that a timeseries epoch marker gets written on the critical path of server startup. This vaguely rings a bell. We should put a timeout on that write, or move it off the critical path into a separate goroutine.
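
One way to check whether startup really is blocked on such a write would be to grab a goroutine dump from a hung node's HTTP endpoint, which stays responsive, and look for startup goroutines stuck under pkg/ts or waiting on a KV write to the timeseries range (this assumes the default HTTP port 8080 and an insecure cluster, neither of which is stated above):

$ curl 'http://<node-address>:8080/debug/pprof/goroutine?debug=2' > goroutines.txt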

smcvey (Contributor, Author) commented Jul 27, 2022

Has anyone been able to reproduce this in 22.1? If not, the issue may no longer actually exist.

github-actions (bot) commented:

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Feb 5, 2024