
server: draining hangs when quorum is lost #14620

Closed

jseldess opened this issue Apr 4, 2017 · 5 comments
jseldess (Contributor) commented Apr 4, 2017

This isn't an issue for production clusters, which will be upgraded in a rolling fashion, but it is a usability issue for quick test clusters.

Once you lose quorum, the remaining nodes can't be shut down with cockroach quit. Instead, you have to force-kill them.

~/src/github.com/cockroachdb/cockroach$ cockroach start --background --store=repdemo-node1
CockroachDB node starting at 2017-04-04 18:08:22.607233193 -0400 EDT
build:      CCL 274f7e5 @ 2017/04/04 04:16:48 (go1.8)
admin:      http://localhost:8080
sql:        postgresql://root@localhost:26257?sslmode=disable
logs:       repdemo-node1/logs
store[0]:   path=repdemo-node1
status:     initialized new cluster
clusterID:  5f41d0b5-c814-40b8-a356-6def69281b92
nodeID:     1
~/src/github.com/cockroachdb/cockroach$ cockroach start --background --store=repdemo-node2 --port=26258 --http-port=8081 --join=localhost:26257
CockroachDB node starting at 2017-04-04 18:08:32.159974578 -0400 EDT
build:      CCL 274f7e5 @ 2017/04/04 04:16:48 (go1.8)
admin:      http://localhost:8081
sql:        postgresql://root@localhost:26258?sslmode=disable
logs:       repdemo-node2/logs
store[0]:   path=repdemo-node2
status:     initialized new node, joined pre-existing cluster
clusterID:  5f41d0b5-c814-40b8-a356-6def69281b92
nodeID:     2
~/src/github.com/cockroachdb/cockroach$ cockroach start --background --store=repdemo-node3 --port=26259 --http-port=8082 --join=localhost:26257
CockroachDB node starting at 2017-04-04 18:08:39.068806601 -0400 EDT
build:      CCL 274f7e5 @ 2017/04/04 04:16:48 (go1.8)
admin:      http://localhost:8082
sql:        postgresql://root@localhost:26259?sslmode=disable
logs:       repdemo-node3/logs
store[0]:   path=repdemo-node3
status:     initialized new node, joined pre-existing cluster
clusterID:  5f41d0b5-c814-40b8-a356-6def69281b92
nodeID:     3
~/src/github.com/cockroachdb/cockroach$ cockroach quit --port=26259
initiating graceful shutdown of server
server drained and shutdown completed
ok
~/src/github.com/cockroachdb/cockroach$ cockroach quit --port=26258
initiating graceful shutdown of server
ok
server drained and shutdown completed
~/src/github.com/cockroachdb/cockroach$ cockroach quit --port=26257 --logtostderr

Note that this third quit never completes: the last remaining node won't shut down. Here's what you see toward the end of the logs:

I170404 18:09:49.786408 390 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26258: getsockopt: connection refused"; Reconnecting to {localhost:26258 <nil>}
I170404 18:09:50.644194 576 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26259: getsockopt: connection refused"; Reconnecting to {localhost:26259 <nil>}
I170404 18:09:50.840000 390 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26258: getsockopt: connection refused"; Reconnecting to {localhost:26258 <nil>}
I170404 18:09:51.706244 576 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26259: getsockopt: connection refused"; Reconnecting to {localhost:26259 <nil>}
I170404 18:09:51.983215 390 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26258: getsockopt: connection refused"; Reconnecting to {localhost:26258 <nil>}
jseldess added this to the 1.0 milestone Apr 4, 2017
asubiotto (Contributor) commented:

This is due to the last node attempting to update its node liveness record, failing because quorum is lost, and therefore retrying indefinitely.

To work around this for now, send a SIGTERM to the process: it triggers the same draining but proceeds to a hard shutdown after a one-minute timeout (a second SIGTERM proceeds to a hard shutdown right away). The proper fix is to add a timeout to the quit endpoint; a rough sketch of that pattern follows.
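
For reference, a minimal sketch in Go of the timeout pattern described above: race the graceful drain against a deadline and fall back to a hard shutdown when it expires. All names here (quitWithTimeout, drain, hardShutdown) are hypothetical; this is not the actual runQuit code, only an illustration of the approach the fix below takes.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// quitWithTimeout races a graceful drain against a deadline and falls back to a
// hard shutdown if the drain hangs (for example because quorum is lost and the
// node-liveness update keeps retrying). All names are illustrative.
func quitWithTimeout(parent context.Context, timeout time.Duration,
	drain func(context.Context) error, hardShutdown func() error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	errCh := make(chan error, 1)
	go func() { errCh <- drain(ctx) }()

	select {
	case err := <-errCh:
		return err // graceful drain finished (or failed) within the deadline
	case <-ctx.Done():
		fmt.Println("graceful drain timed out; initiating hard shutdown")
		return hardShutdown()
	}
}

func main() {
	// Simulate a drain that never completes, as when quorum is lost.
	stuckDrain := func(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }
	hard := func() error { fmt.Println("hard shutdown issued"); return nil }

	// The merged fix used a one-minute timeout; a short one is used here for the demo.
	_ = quitWithTimeout(context.Background(), 2*time.Second, stuckDrain, hard)
}
```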

xphoniex added a commit to xphoniex/cockroach that referenced this issue Apr 6, 2017:
err was wrongfully asserted against pointer

xphoniex added a commit to xphoniex/cockroach that referenced this issue Apr 7, 2017:
added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated (error log is misleading)

xphoniex added a commit to xphoniex/cockroach that referenced this issue Apr 11, 2017:
… lost
added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated

xphoniex added a commit to xphoniex/cockroach that referenced this issue Apr 11, 2017:
… lost
added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated

xphoniex added a commit to xphoniex/cockroach that referenced this issue Apr 13, 2017:
Fixes cockroachdb#14620
added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated

asubiotto added a commit that referenced this issue Apr 18, 2017:
fix indefinite retrying for `cockroach quit` when quorum is lost #14620
asubiotto (Contributor) commented:

Reopening: although #14708 fixed this particular issue, the server should still time out writes that could otherwise retry indefinitely while draining.

asubiotto reopened this Apr 18, 2017
asubiotto changed the title from "cli: can't use cockroach quit once quorum is lost" to "server: draining hangs when quorum is lost" Apr 18, 2017
cuongdo modified the milestones: 1.1, 1.0 Apr 18, 2017
vivekmenezes removed this from the 1.1 milestone Aug 10, 2017
dianasaur323 added this to the 1.2 milestone Sep 17, 2017
rjnn (Contributor) commented Mar 1, 2018

@asubiotto is this fixed? It appears fixed to me: after approximately 30 seconds, the final nodes give up and quit when issued the quit command. I've been using this pattern reliably for months now. Closing this issue; please reopen if I'm missing something.

rjnn closed this as completed Mar 1, 2018
asubiotto (Contributor) commented:

The remaining work was to time out the liveness update (the quit command already forces a shutdown after a minute) so that draining leases can proceed. However, this is not strictly necessary. It's a small change that I'll probably get to for 2.0, so I'll reopen. A rough sketch of the idea follows.
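
For illustration only, a sketch of what bounding the liveness update during drain could look like, assuming a hypothetical tryUpdateLiveness helper. This is not CockroachDB's drain code; the next comment explains why the idea was ultimately dropped.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// drainWithBoundedLiveness retries the liveness update until it succeeds or the
// deadline expires, then proceeds with draining leases either way. The helper
// tryUpdateLiveness is a made-up stand-in for the real liveness heartbeat.
func drainWithBoundedLiveness(parent context.Context, timeout time.Duration, tryUpdateLiveness func() error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	for ctx.Err() == nil {
		if err := tryUpdateLiveness(); err == nil {
			break // liveness update succeeded; drain proceeds normally
		}
		select {
		case <-ctx.Done():
		case <-time.After(time.Second): // wait a bit before retrying
		}
	}
	if ctx.Err() != nil {
		fmt.Println("giving up on the liveness update; draining leases anyway")
	}
	fmt.Println("draining leases and shutting down")
}

func main() {
	// Simulate a liveness update that can never succeed because quorum is lost.
	noQuorum := func() error { return errors.New("liveness heartbeat failed: no quorum") }
	drainWithBoundedLiveness(context.Background(), 3*time.Second, noQuorum)
}
```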

asubiotto reopened this Mar 1, 2018
asubiotto (Contributor) commented:

Actually, thinking about this more, I'm not sure that canceling a node liveness update is the way to go. Timeouts are implemented by users (as you pointed out) at a higher level, and we never want to sacrifice correctness for a quicker drain. Closing this issue.
