Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: schema change fails on "breaker open" error #36771

Closed
thoszhang opened this issue Apr 11, 2019 · 1 comment
Closed

sql: schema change fails on "breaker open" error #36771

thoszhang opened this issue Apr 11, 2019 · 1 comment
Labels
A-schema-changes C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.

Comments

@thoszhang
Copy link
Contributor

A CREATE INDEX statement failed and rolled back for me with the error breaker open: unable to dial n6. We should be retrying on these errors (or at least not returning this error to the client).

Logs for slightly more context:

E190409 17:59:13.483781 4720006 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
I190409 17:59:13.485224 49 storage/node_liveness.go:775  [n1,hb] retrying liveness update after storage.errRetryLiveness: result is ambiguous (context done during DistSender.Send: context deadline exceeded)
W190409 17:59:13.485266 49 storage/node_liveness.go:518  [n1,hb] slow heartbeat took 4.8s
W190409 17:59:13.485318 49 storage/node_liveness.go:458  [n1,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
E190409 17:59:13.490952 4719982 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
W190409 17:59:13.492808 72 gossip/gossip.go:1501  [n1] first range unavailable; trying remaining resolvers
E190409 17:59:13.497300 4719956 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
E190409 17:59:13.498556 4720005 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
E190409 17:59:13.499433 4719983 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
E190409 17:59:13.499718 4719955 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
W190409 17:59:13.499935 1001 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {lucy-test-20-0006:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190409 17:59:13.501054 369 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {lucy-test-20-0003:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190409 17:59:13.501192 369 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: context canceled
I190409 17:59:13.501767 4746503 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257->6 tripped: failed to check for ready connection to n6 at lucy-test-20-0006:26257: connection not ready: TRANSIENT_FAILURE
I190409 17:59:13.501794 4746503 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc [::]:26257->6 event: BreakerTripped
I190409 17:59:13.501826 4746503 rpc/nodedialer/nodedialer.go:143  [ct-client] unable to connect to n6: failed to check for ready connection to n6 at lucy-test-20-0006:26257: connection not ready: TRANSIENT_FAILURE
E190409 17:59:13.502022 4719954 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
W190409 17:59:13.502101 3426 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {lucy-test-20-0020:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
E190409 17:59:13.503430 4719986 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
E190409 17:59:13.503760 4719999 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
E190409 17:59:13.504040 4720000 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
E190409 17:59:13.504575 4719984 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
E190409 17:59:13.508697 4719985 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
E190409 17:59:13.523948 4720001 sql/distsqlrun/server.go:614  [n1] communication error: rpc error: code = Canceled desc = context canceled
W190409 17:59:13.526441 1001 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: context canceled
W190409 17:59:13.529628 3426 vendor/google.golang.org/grpc/clientconn.go:1440  grpc: addrConn.transportMonitor exits due to: context canceled
I190409 17:59:13.536403 4746568 gossip/client.go:128  [n1] started gossip client to lucy-test-20-0003:26257
W190409 17:59:13.548976 3819160 sql/schema_changer.go:872  [n1,client=127.0.0.1:45426,user=root,scExec] reversing schema change 441560562452955137 due to irrecoverable error: unable to dial n6: breaker open
I190409 17:59:13.549147 4746568 gossip/client.go:133  [n1] closing client to n3 (lucy-test-20-0003:26257): received forward from n3 to 12 (lucy-test-20-0012:26257)
I190409 17:59:13.549209 72 gossip/gossip.go:1510  [n1] node has connected to cluster via gossip
I190409 17:59:13.549392 4746616 gossip/client.go:128  [n1] started gossip client to lucy-test-20-0012:26257
W190409 17:59:13.560846 3819160 sql/schema_changer.go:1147  [n1,client=127.0.0.1:45426,user=root,scExec] reverse schema change mutation: {Descriptor_:0xc03d20dee0 State:DELETE_AND_WRITE_ONLY Direction:ADD MutationID:1 Rollback:false XXX_NoUnkeyedLiteral:{} XXX_sizecache:0}
I190409 17:59:13.560871 3819160 sql/lease.go:335  [n1,client=127.0.0.1:45426,user=root,scExec] publish: descID=53 (bulkingest) version=4 mtime=2019-04-09 17:59:13.559678045 +0000 UTC
I190409 17:59:13.616321 3819160 sql/event_log.go:135  [n1,client=127.0.0.1:45426,user=root,scExec] Event: "reverse_schema_change", target: 53, info: {Error:breaker open
unable to dial n6
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).dial
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:137
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).Dial
        /go/src/github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:92
github.com/cockroachdb/cockroach/pkg/sql.runnerRequest.run
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:70
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).initRunners.func1
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:97
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1
        /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1333 MutationID:1}
@awoods187 awoods187 added S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. labels Apr 15, 2019
@ajwerner
Copy link
Contributor

This breaker error was a symptom of the underlying connection being closed which happened due to a race in gRPC (grpc/grpc-go#1882). gRPC was upgraded in #39041 so I'm inclined to close this though would understand keeping it open if we think retry logic to deal with transient network outages are warranted here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-schema-changes C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
Projects
None yet
Development

No branches or pull requests

4 participants