Dead node was re-added to cluster, resulted in fatal error #36487

Closed
roncrdb opened this issue Apr 3, 2019 · 1 comment
Labels
  • A-kv-distribution: Relating to rebalancing and leasing.
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • S-1-stability: Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments


roncrdb commented Apr 3, 2019

Describe the problem

A node that had been dead for a few days was re-added to the cluster. The node then crashed, causing a drop in queries and an increase in 99th-percentile latency.

[Image: SQL queries graph]

The log files show a fatal error and an unstable cluster:

F190403 08:38:33.172235 18618 storage/replica_proposal.go:436  [n2,s2,r91733/5:/Table/86/7/"MT"/2019-07-…] while ingesting /var/lib/crdbinternal/ds1/auxiliary/sideloading/r9XXXX/r91733/i5782.t651.ingested: Corruption: external file have non zero sequence number
goroutine 18618 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc0000c8600, 0xc0000c8600, 0x5336d00, 0x1b)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5ada300, 0xc000000004, 0x5336dbd, 0x1b, 0x1b4, 0xc0077dca90, 0xc8)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x39d8080, 0xc0095d1aa0, 0x4, 0x2, 0x32914d2, 0x16, 0xc007c8e9c8, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x39d8080, 0xc0095d1aa0, 0x1, 0xc000000004, 0x32914d2, 0x16, 0xc007c8e9c8, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x39d8080, 0xc0095d1aa0, 0x32914d2, 0x16, 0xc007c8e9c8, 0x2, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/storage.addSSTablePreApply(0x39d8080, 0xc0095d1aa0, 0xc000013300, 0x3a24920, 0xc00168a600, 0x39fa7a0, 0xc00665e1c0, 0x28b, 0x1696, 0xc007eb3500, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go:436 +0xec1
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).processRaftCommand(0xc0060b1b00, 0x39d8080, 0xc0095d1aa0, 0xc00a12aa50, 0x8, 0x28b, 0x1696, 0x200000002, 0x5, 0xc38, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:1875 +0x15a0
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc0060b1b00, 0x39d8080, 0xc0095d1aa0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_raft.go:790 +0x13da
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x39d8080, 0xc0095d1aa0, 0xc0060b1b00, 0x39d8080)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3585 +0x120
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc001db4000, 0x39d8080, 0xc0095d1aa0, 0xc000f58360, 0xc0090c9ed0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3232 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc001db4000, 0x39d8080, 0xc0091acf30, 0x16655)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3573 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc000622a00, 0x39d8080, 0xc0091acf30)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:225 +0x21a
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x39d8080, 0xc0091acf30)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:165 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc006dbfd20, 0xc00053a510, 0xc006dbfd10)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:200 +0xe1
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:193 +0xa8

To Reproduce

Queries that were running when the node crashed:
Queries

Expected behavior
Re-adding a node should not cause the cluster to become unstable or the node to crash.

Environment:

  • CockroachDB v19.1.0-beta.20190318
awoods187 assigned tbg and unassigned vivekmenezes Apr 3, 2019
awoods187 added the A-kv-distribution, S-1-stability, and C-bug labels Apr 3, 2019
tbg (Member) commented Apr 3, 2019

This is one of our current release blockers (see the list there for a matching error message; the investigation is in a private repo): #35554 (comment)

@dt is working on a mitigation in #36445.

tbg closed this as completed Apr 3, 2019