kv: panic in metrics #63218

Closed
awoods187 opened this issue Apr 7, 2021 · 13 comments
Labels
A-kv-observability C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@awoods187
Contributor

Yesterday, in demo, I hit a panic in metrics collection.

demo@127.0.0.1:26257/movr> *
* ERROR: [n1,summaries] a panic has occurred!
* runtime error: invalid memory address or nil pointer dereference
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/panic.go:969
*   | runtime.panicmem
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/panic.go:212
*   | runtime.sigpanic
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/signal_unix.go:742
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Count
*   | 	<autogenerated>:1
*   | github.com/cockroachdb/cockroach/pkg/server/status.extractValue
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:563
*   | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:595
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/metric/registry.go:153
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Inspect
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/metric/metric.go:356
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/metric/registry.go:152
*   | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:574
*   | github.com/cockroachdb/cockroach/pkg/server/status.(*MetricsRecorder).GenerateNodeStatus
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:455
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:753
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:313
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:752
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:736
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:351
*   | runtime.goexit
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/asm_amd64.s:1374
* Wraps: (2) runtime error: invalid memory address or nil pointer dereference
* Error types: (1) *withstack.withStack (2) runtime.errorString
*
*
* ERROR: [n1,summaries] a panic has occurred!
* runtime error: invalid memory address or nil pointer dereference
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/panic.go:969
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:233
*   | runtime.gopanic
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/panic.go:969
*   | runtime.panicmem
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/panic.go:212
*   | runtime.sigpanic
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/signal_unix.go:742
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Count
*   | 	<autogenerated>:1
*   | github.com/cockroachdb/cockroach/pkg/server/status.extractValue
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:563
*   | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:595
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/metric/registry.go:153
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Inspect
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/metric/metric.go:356
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/metric/registry.go:152
*   | github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:574
*   | github.com/cockroachdb/cockroach/pkg/server/status.(*MetricsRecorder).GenerateNodeStatus
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:455
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:753
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:313
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:752
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:736
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:351
*   | runtime.goexit
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/asm_amd64.s:1374
* Wraps: (2) runtime error: invalid memory address or nil pointer dereference
* Error types: (1) *withstack.withStack (2) runtime.errorString
*
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x5101caf]

goroutine 1014 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc000d16d00, 0x9528720, 0xc00464d710)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:233 +0x126
panic(0x801ed00, 0xbe88070)
	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/panic.go:969 +0x1b9
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc000d16d00, 0x9528720, 0xc00464d710)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:233 +0x126
panic(0x801ed00, 0xbe88070)
	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/panic.go:969 +0x1b9
github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Count(0xc0000dbe60, 0x8655f80)
	<autogenerated>:1 +0x2f
github.com/cockroachdb/cockroach/pkg/server/status.extractValue(0x8655f80, 0xc0000dbe60, 0x0, 0x0, 0xc015b75170)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:563 +0x19a
github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue.func1(0xc0009d2810, 0x26, 0x8655f80, 0xc0000dbe60)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:595 +0x191
github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each.func1(0x8655f80, 0xc0000dbe60)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/metric/registry.go:153 +0x6c
github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Inspect(0xc0000dbe60, 0xc04df290a0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/metric/metric.go:356 +0x3c
github.com/cockroachdb/cockroach/pkg/util/metric.(*Registry).Each(0xc0010fc440, 0xc04df068a0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/metric/registry.go:152 +0x125
github.com/cockroachdb/cockroach/pkg/server/status.eachRecordableValue(0xc0010fc440, 0xc04df06890)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:574 +0x65
github.com/cockroachdb/cockroach/pkg/server/status.(*MetricsRecorder).GenerateNodeStatus(0xc000103500, 0x9528720, 0xc00464d710, 0x0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:455 +0x62c
github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus.func1(0x9528720, 0xc00464d710)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:753 +0x8b
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTask(0xc000d16d00, 0x9528720, 0xc00464d710, 0x8731684, 0x1a, 0xc015b75e20, 0x0, 0x0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:313 +0xb2
github.com/cockroachdb/cockroach/pkg/server.(*Node).writeNodeStatus(0xc00165c580, 0x9528720, 0xc00464d710, 0x4a817c800, 0x1, 0x0, 0x0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:752 +0xbd
github.com/cockroachdb/cockroach/pkg/server.(*Node).startWriteNodeStatus.func1(0x9528720, 0xc00464d710)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:736 +0x16d
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc000d16d00, 0x9528720, 0xc00464d710, 0x0, 0xc0011bb7c0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:351 +0xb9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:346 +0xfc
@awoods187 awoods187 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Apr 7, 2021
@awoods187
Contributor Author

I hit this again today doing a live demo

demo@127.0.0.1:26257/movr> *
* ERROR: [n7] a panic has occurred!
* runtime error: invalid memory address or nil pointer dereference
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/panic.go:969
*   | runtime.panicmem
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/panic.go:212
*   | runtime.sigpanic
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/signal_unix.go:742
*   | go.etcd.io/etcd/raft/v3.(*raft).hardState
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/v3/raft.go:374
*   | go.etcd.io/etcd/raft/v3.getBasicStatus
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/v3/status.go:61
*   | go.etcd.io/etcd/raft/v3.getStatus
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/v3/status.go:70
*   | go.etcd.io/etcd/raft/v3.(*RawNode).Status
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/v3/rawnode.go:184
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).raftStatusRLocked
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica.go:1109
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Metrics
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_metrics.go:56
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).updateReplicationGauges.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:2543
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*storeReplicaVisitor).Visit
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:376
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).updateReplicationGauges
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:2542
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).ComputeMetrics
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:2642
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).computePeriodicMetrics.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:676
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:148
*   | github.com/cockroachdb/cockroach/pkg/util/syncutil.(*IntMap).Range
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/int_map.go:352
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:147
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).computePeriodicMetrics
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:675
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).startComputePeriodicMetrics.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:662
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1
*   | 	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:351
*   | runtime.goexit
*   | 	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/asm_amd64.s:1374
* Wraps: (2) runtime error: invalid memory address or nil pointer dereference
* Error types: (1) *withstack.withStack (2) runtime.errorString
*
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x48 pc=0x59dbbef]

goroutine 2436 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc00372c800, 0x9535560, 0xc0039185d0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:233 +0x126
panic(0x802a6c0, 0xbe9a370)
	/usr/local/Cellar/go/1.15.4/libexec/src/runtime/panic.go:969 +0x1b9
go.etcd.io/etcd/raft/v3.(*raft).hardState(...)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/v3/raft.go:374
go.etcd.io/etcd/raft/v3.getBasicStatus(...)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/v3/status.go:61
go.etcd.io/etcd/raft/v3.getStatus(0xc02aa1b900, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/v3/status.go:70 +0xaf
go.etcd.io/etcd/raft/v3.(*RawNode).Status(...)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/vendor/go.etcd.io/etcd/raft/v3/rawnode.go:184
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).raftStatusRLocked(0xc0033d5600, 0xc05ff068d0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica.go:1109 +0xa5
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Metrics(0xc0033d5600, 0x9535560, 0xc05ff068d0, 0x16758555d06ffd60, 0x0, 0xc05fbd5f80, 0x9, 0x0, 0x0, 0x0, ...)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_metrics.go:56 +0x7c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).updateReplicationGauges.func1(0xc0033d5600, 0xc05fbd5f01)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:2543 +0x148
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*storeReplicaVisitor).Visit(0xc05fbd5fb0, 0xc0234739e8)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:376 +0x151
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).updateReplicationGauges(0xc008149000, 0x9535560, 0xc05ff068d0, 0x0, 0x0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:2542 +0x32c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).ComputeMetrics(0xc008149000, 0x9535560, 0xc0039185d0, 0xef, 0x0, 0x0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store.go:2642 +0xd8
github.com/cockroachdb/cockroach/pkg/server.(*Node).computePeriodicMetrics.func1(0xc008149000, 0x4014a45, 0xc0a784fd48)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:676 +0x57
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1(0x7, 0xc008149000, 0xc0a784fd48)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:148 +0x38
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*IntMap).Range(0xc000f9eab0, 0xc023473dd8)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/int_map.go:352 +0x130
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores(0xc000f9ea80, 0xc0a784fe20, 0x0, 0x0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:147 +0x75
github.com/cockroachdb/cockroach/pkg/server.(*Node).computePeriodicMetrics(0xc003b20000, 0x9535560, 0xc0039185d0, 0xef, 0x1, 0x0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:675 +0x77
github.com/cockroachdb/cockroach/pkg/server.(*Node).startComputePeriodicMetrics.func1(0x9535560, 0xc0039185d0)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:662 +0x165
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc00372c800, 0x9535560, 0xc0039185d0, 0x0, 0xc00802d820)
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:351 +0xb9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/Users/andrewwoods/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:346 +0xfc

@tbg
Member

tbg commented Apr 14, 2021

This is a different panic. It's interesting that you can hit these at will. Is there anything particular you're doing? Does this happen "early" in the lifetime of the cluster or randomly in the middle?

What version?

@tbg tbg assigned irfansharif and unassigned tbg Apr 14, 2021
@tbg
Member

tbg commented Apr 14, 2021

@irfansharif could you look into this (doesn't have to be during breather week, but next week would be good)?

@awoods187
Contributor Author

I have been running ./cockroach demo movr --nodes 9 and hitting this somewhat randomly. I hit the first one above within 10 minutes, and the one from yesterday about 30 minutes into the cluster's life. The last one was on master built on Monday night. Is there anything I should collect next time it happens? Since it's demo, I don't think I have access to logs?

@knz
Contributor

knz commented Apr 19, 2021

The stack trace says that one of the metric objects added to the registry is a nil reference. There's possibly a bug in the code. That it doesn't happen right away is related to the update code not running for the first X minutes. I'll run a test with a shortened update loop and stress this.

@knz
Contributor

knz commented Apr 19, 2021

The two stack traces above appear unrelated:

  • the stack trace at the top reflects a nil pointer inside the metric registry
  • the stack trace that comes later is a nil pointer inside the raft group code
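As an illustration of the first trace only, not a diagnosis: a frame such as (*Counter).Count at <autogenerated>:1 is what Go emits for a method promoted from an embedded field. The sketch below assumes a wrapper struct that embeds an interface; the names valuer, simple, and safeCount are hypothetical and are not taken from CockroachDB. It shows that a non-nil wrapper whose embedded interface was never set panics with the same runtime error.

```go
package main

import "fmt"

// valuer is a hypothetical stand-in for the interface a metric wrapper embeds.
type valuer interface {
	Count() int64
}

// simple is a trivial valuer implementation used for the healthy case.
type simple struct{ n int64 }

func (s *simple) Count() int64 { return s.n }

// Counter embeds valuer, so Count is a promoted method and the compiler
// generates the forwarding wrapper for (*Counter).Count.
type Counter struct {
	Name string
	valuer
}

// safeCount calls Count and converts a panic into an error, so that one bad
// metric does not take the whole process down.
func safeCount(c *Counter) (n int64, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("metric %q: %v", c.Name, r)
		}
	}()
	return c.Count(), nil
}

func main() {
	ok := &Counter{Name: "ok", valuer: &simple{n: 3}}
	bad := &Counter{Name: "bad"} // embedded interface left nil

	fmt.Println(safeCount(ok))
	fmt.Println(safeCount(bad))
}
```

Recovering here is only for demonstration; whether a recorder should skip a broken metric or fail loudly is a separate design question.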

@tbg
Member

tbg commented Apr 19, 2021

The second failure is perplexing. We start here

if rg := r.mu.internalRaftGroup; rg != nil {
	s := rg.Status()
	return &s
}

so rg (which is a *RawNode) is not nil. We then explode while accessing rg.raft. But literally the only way we make a RawNode is here

raftGroup, err := raft.NewRawNode(newRaftConfig(
	raft.Storage((*replicaRaftStorage)(r)),
	uint64(r.mu.replicaID),
	r.mu.state.RaftAppliedIndex,
	r.store.cfg,
	&raftLogger{ctx: ctx},
))

and if you look inside, it's quite clear that its .raft will not be nil (and it is never set to nil). The only remotely plausible explanation is that we're hitting a panic inside of newRaft (there is one explicit reference to panic(err)). But this seems unlikely to result in this symptom. The node would've crashed earlier in all likelihood.
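To make the shape of the second failure concrete, here is a minimal sketch with hypothetical stand-in types (raftState, rawNode, replica are illustrative names, not the etcd/raft or kvserver definitions). It shows that a nil check on the outer pointer says nothing about the pointer stored inside it, which is why the inner field would have to be nil for the trace above to occur.

```go
package main

import "fmt"

// raftState and rawNode are hypothetical stand-ins for the inner state
// machine and the wrapper that holds a pointer to it.
type raftState struct{ term uint64 }

type rawNode struct{ raft *raftState }

// status dereferences the inner pointer without checking it.
func (rn *rawNode) status() uint64 { return rn.raft.term }

type replica struct{ group *rawNode }

// raftStatus guards only the outer pointer, mirroring the snippet above.
func (r *replica) raftStatus() (term uint64, ok bool) {
	if rg := r.group; rg != nil {
		return rg.status(), true
	}
	return 0, false
}

// didPanic reports whether f panicked.
func didPanic(f func()) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	healthy := &replica{group: &rawNode{raft: &raftState{term: 7}}}
	broken := &replica{group: &rawNode{}} // outer pointer set, inner pointer nil

	fmt.Println(healthy.raftStatus())
	fmt.Println(didPanic(func() { broken.raftStatus() }))
}
```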

@tbg
Member

tbg commented Apr 19, 2021

As for the metrics crash, there is some nil handling here:

switch vfield.Kind() {
case reflect.Array:
	for i := 0; i < vfield.Len(); i++ {
		velem := vfield.Index(i)
		telemName := fmt.Sprintf("%s[%d]", tname, i)
		// Permit elements in the array to be nil.
		const skipNil = true
		r.addMetricValue(ctx, velem, telemName, skipNil)
	}
default:
	// No metric fields should be nil.
	const skipNil = false
	r.addMetricValue(ctx, vfield, tname, skipNil)
}

I don't know if that would save us. But also, given that we never see this outside of this issue, and that the raft nil thing is basically impossible given a sane env, I am putting my money on this being something weird about Andy's setup, though unclear what.
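For readers unfamiliar with this pattern, the following is a rough, self-contained sketch of reflection-based registration with the same skipNil distinction. Everything in it is an assumption made for illustration: the collect function, the storeMetrics struct, and the choice to return an error for a nil non-array field are not the registry's actual behavior, which is not shown in this thread.

```go
package main

import (
	"fmt"
	"reflect"
)

// Counter is a hypothetical metric type.
type Counter struct{ n int64 }

// storeMetrics is a hypothetical metric struct with plain and array fields.
type storeMetrics struct {
	Reads  *Counter
	Writes *Counter
	PerCPU [3]*Counter
}

// collect walks the exported fields of a struct pointer and returns the names
// it would register. Nil array elements are skipped; a nil plain field is
// reported as an error (an assumed policy for this sketch).
func collect(s interface{}) ([]string, error) {
	v := reflect.ValueOf(s).Elem()
	t := v.Type()
	var names []string

	add := func(val reflect.Value, name string, skipNil bool) error {
		if val.Kind() == reflect.Ptr && val.IsNil() {
			if skipNil {
				return nil
			}
			return fmt.Errorf("metric field %s is nil", name)
		}
		names = append(names, name)
		return nil
	}

	for i := 0; i < v.NumField(); i++ {
		vfield, tname := v.Field(i), t.Field(i).Name
		switch vfield.Kind() {
		case reflect.Array:
			for j := 0; j < vfield.Len(); j++ {
				elemName := fmt.Sprintf("%s[%d]", tname, j)
				if err := add(vfield.Index(j), elemName, true); err != nil {
					return nil, err
				}
			}
		default:
			if err := add(vfield, tname, false); err != nil {
				return nil, err
			}
		}
	}
	return names, nil
}

func main() {
	m := &storeMetrics{Reads: &Counter{}, Writes: &Counter{}}
	m.PerCPU[1] = &Counter{}
	fmt.Println(collect(m))

	m.Writes = nil
	fmt.Println(collect(m))
}
```

A check like this only guards registration time. It would not catch a field that is valid when registered and becomes invalid later.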

@knz
Contributor

knz commented Apr 19, 2021

The first step I'd like to suggest is for @awoods187 to upgrade Go to 1.15.10 or later. The version used here (1.15.4) has known bugs which we need to exclude first from this analysis.
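A quick way to confirm which toolchain a binary was built with is runtime.Version(). The sketch below compares it against the 1.15.10 threshold mentioned above; parseGoVersion and atLeast are hypothetical helpers written for this example, and the parser deliberately gives up on development or pre-release version strings.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// parseGoVersion parses strings like "go1.15.4" into [major, minor, patch].
// It returns false for anything it does not understand (e.g. devel builds).
func parseGoVersion(v string) ([3]int, bool) {
	var out [3]int
	if !strings.HasPrefix(v, "go") {
		return out, false
	}
	parts := strings.SplitN(strings.TrimPrefix(v, "go"), ".", 3)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, false
		}
		out[i] = n
	}
	return out, true
}

// atLeast reports whether v >= min, comparing components numerically.
func atLeast(v, min [3]int) bool {
	for i := range v {
		if v[i] != min[i] {
			return v[i] > min[i]
		}
	}
	return true
}

func main() {
	min := [3]int{1, 15, 10}
	v, ok := parseGoVersion(runtime.Version())
	if !ok {
		fmt.Println("unrecognized Go version:", runtime.Version())
		return
	}
	fmt.Println(runtime.Version(), "meets minimum:", atLeast(v, min))
}
```

The comparison is numeric on purpose: a plain string comparison would rank "1.15.4" above "1.15.10".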

@irfansharif
Contributor

@knz: Are you thinking of golang/go#44614?

@knz
Contributor

knz commented Apr 19, 2021

yes

@irfansharif
Contributor

That does sound plausible for the second panic. It looks like in go1.15.4 we can accidentally GC memory referenced in the same manner rg.raft is.

@knz
Contributor

knz commented Apr 19, 2021

Ok I'm going to mark this issue as resolved when #63837 merges.
