
server: draining hangs when quorum is lost #14620

Closed

jseldess opened this issue Apr 4, 2017 · 5 comments
jseldess (Contributor) commented Apr 4, 2017

This isn't an issue for production clusters, which will be upgraded in a rolling fashion, but it is a usability issue for quick test clusters.

Once you lose quorum, the remaining nodes can't be shut down with cockroach quit. Instead, you have to force-kill them.

~/src/github.com/cockroachdb/cockroach$ cockroach start --background --store=repdemo-node1
CockroachDB node starting at 2017-04-04 18:08:22.607233193 -0400 EDT
build:      CCL 274f7e5 @ 2017/04/04 04:16:48 (go1.8)
admin:      http://localhost:8080
sql:        postgresql://root@localhost:26257?sslmode=disable
logs:       repdemo-node1/logs
store[0]:   path=repdemo-node1
status:     initialized new cluster
clusterID:  5f41d0b5-c814-40b8-a356-6def69281b92
nodeID:     1
~/src/github.com/cockroachdb/cockroach$ cockroach start --background --store=repdemo-node2 --port=26258 --http-port=8081 --join=localhost:26257
CockroachDB node starting at 2017-04-04 18:08:32.159974578 -0400 EDT
build:      CCL 274f7e5 @ 2017/04/04 04:16:48 (go1.8)
admin:      http://localhost:8081
sql:        postgresql://root@localhost:26258?sslmode=disable
logs:       repdemo-node2/logs
store[0]:   path=repdemo-node2
status:     initialized new node, joined pre-existing cluster
clusterID:  5f41d0b5-c814-40b8-a356-6def69281b92
nodeID:     2
~/src/github.com/cockroachdb/cockroach$ cockroach start --background --store=repdemo-node3 --port=26259 --http-port=8082 --join=localhost:26257
CockroachDB node starting at 2017-04-04 18:08:39.068806601 -0400 EDT
build:      CCL 274f7e5 @ 2017/04/04 04:16:48 (go1.8)
admin:      http://localhost:8082
sql:        postgresql://root@localhost:26259?sslmode=disable
logs:       repdemo-node3/logs
store[0]:   path=repdemo-node3
status:     initialized new node, joined pre-existing cluster
clusterID:  5f41d0b5-c814-40b8-a356-6def69281b92
nodeID:     3
~/src/github.com/cockroachdb/cockroach$ cockroach quit --port=26259
initiating graceful shutdown of server
server drained and shutdown completed
ok
~/src/github.com/cockroachdb/cockroach$ cockroach quit --port=26258
initiating graceful shutdown of server
ok
server drained and shutdown completed
~/src/github.com/cockroachdb/cockroach$ cockroach quit --port=26257 --logtostderr

Note that this third quit never completes: the last remaining node won't shut down. Here's what you see toward the end of the logs:

I170404 18:09:49.786408 390 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26258: getsockopt: connection refused"; Reconnecting to {localhost:26258 <nil>}
I170404 18:09:50.644194 576 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26259: getsockopt: connection refused"; Reconnecting to {localhost:26259 <nil>}
I170404 18:09:50.840000 390 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26258: getsockopt: connection refused"; Reconnecting to {localhost:26258 <nil>}
I170404 18:09:51.706244 576 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26259: getsockopt: connection refused"; Reconnecting to {localhost:26259 <nil>}
I170404 18:09:51.983215 390 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:26258: getsockopt: connection refused"; Reconnecting to {localhost:26258 <nil>}
jseldess added this to the 1.0 milestone Apr 4, 2017
asubiotto (Contributor) commented:

This is due to the last node attempting to update its node liveness record, failing because quorum is lost, and therefore retrying indefinitely.

To work around this for now, send a SIGTERM to the process: it triggers the same draining but proceeds to a hard shutdown after a one-minute timeout (a second SIGTERM proceeds to a hard shutdown right away). The proper fix is to add a timeout to the quit endpoint; a rough sketch of that pattern follows.
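
For reference, a minimal sketch in Go of the timeout pattern described above: race the graceful drain against a deadline and fall back to a hard shutdown when it expires. All names here (quitWithTimeout, drain, hardShutdown) are hypothetical; this is not the actual runQuit code, only an illustration of the approach the fix below takes.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// quitWithTimeout races a graceful drain against a deadline and falls back to a
// hard shutdown if the drain hangs (for example because quorum is lost and the
// node-liveness update keeps retrying). All names are illustrative.
func quitWithTimeout(parent context.Context, timeout time.Duration,
	drain func(context.Context) error, hardShutdown func() error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	errCh := make(chan error, 1)
	go func() { errCh <- drain(ctx) }()

	select {
	case err := <-errCh:
		return err // graceful drain finished (or failed) within the deadline
	case <-ctx.Done():
		fmt.Println("graceful drain timed out; initiating hard shutdown")
		return hardShutdown()
	}
}

func main() {
	// Simulate a drain that never completes, as when quorum is lost.
	stuckDrain := func(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }
	hard := func() error { fmt.Println("hard shutdown issued"); return nil }

	// The merged fix used a one-minute timeout; a short one is used here for the demo.
	_ = quitWithTimeout(context.Background(), 2*time.Second, stuckDrain, hard)
}
```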

xphoniex added a commit to xphoniex/cockroach that referenced this issue Apr 6, 2017:
err was wrongfully asserted against pointer

xphoniex added a commit to xphoniex/cockroach that referenced this issue Apr 7, 2017:
added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated (error log is misleading)

xphoniex added a commit to xphoniex/cockroach that referenced this issue Apr 11, 2017:
… lost
added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated

xphoniex added a commit to xphoniex/cockroach that referenced this issue Apr 11, 2017:
… lost
added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated

xphoniex added a commit to xphoniex/cockroach that referenced this issue Apr 13, 2017:
Fixes cockroachdb#14620
added a timeout of 1 minute inside runQuit, after which a hard shutdown is initiated

asubiotto added a commit that referenced this issue Apr 18, 2017:
fix indefinite retrying for `cockroach quit` when quorum is lost #14620
asubiotto (Contributor) commented:

Reopening: although #14708 fixed this particular issue, the server should still time out writes that could otherwise retry indefinitely while draining.

asubiotto reopened this Apr 18, 2017
asubiotto changed the title from "cli: can't use cockroach quit once quorum is lost" to "server: draining hangs when quorum is lost" Apr 18, 2017
cuongdo modified the milestones: 1.1, 1.0 Apr 18, 2017
vivekmenezes removed this from the 1.1 milestone Aug 10, 2017
dianasaur323 added this to the 1.2 milestone Sep 17, 2017
rjnn (Contributor) commented Mar 1, 2018

@asubiotto is this fixed? It appears fixed to me: after approximately 30 seconds, the final nodes give up and quit when issued the quit command. I've been using this pattern reliably for months now. Closing this issue; please reopen if I'm missing something.

rjnn closed this as completed Mar 1, 2018
asubiotto (Contributor) commented:

The remaining work was to time out the liveness update (the quit command already forces a shutdown after a minute) so that draining leases can proceed. However, this is not strictly necessary. It's a small change that I'll probably get to for 2.0, so I'll reopen. A rough sketch of the idea follows.
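
For illustration only, a sketch of what bounding the liveness update during drain could look like, assuming a hypothetical tryUpdateLiveness helper. This is not CockroachDB's drain code; the next comment explains why the idea was ultimately dropped.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// drainWithBoundedLiveness retries the liveness update until it succeeds or the
// deadline expires, then proceeds with draining leases either way. The helper
// tryUpdateLiveness is a made-up stand-in for the real liveness heartbeat.
func drainWithBoundedLiveness(parent context.Context, timeout time.Duration, tryUpdateLiveness func() error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	for ctx.Err() == nil {
		if err := tryUpdateLiveness(); err == nil {
			break // liveness update succeeded; drain proceeds normally
		}
		select {
		case <-ctx.Done():
		case <-time.After(time.Second): // wait a bit before retrying
		}
	}
	if ctx.Err() != nil {
		fmt.Println("giving up on the liveness update; draining leases anyway")
	}
	fmt.Println("draining leases and shutting down")
}

func main() {
	// Simulate a liveness update that can never succeed because quorum is lost.
	noQuorum := func() error { return errors.New("liveness heartbeat failed: no quorum") }
	drainWithBoundedLiveness(context.Background(), 3*time.Second, noQuorum)
}
```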

asubiotto reopened this Mar 1, 2018
asubiotto (Contributor) commented:

Actually, thinking about this more, I'm not sure that canceling a node liveness update is the way to go. Timeouts are implemented by users (as you pointed out) at a higher level, and we never want to sacrifice correctness for a quicker drain. Closing this issue.
