
oddity while running extended version of tpcc/nodes=3/w=max #32058

Closed · danhhz opened this issue Oct 31, 2018 · 4 comments

Labels: A-benchmarking · A-testing (Testing tools and infrastructure) · C-investigation (Further steps needed to qualify. C-label will change.)

Comments


danhhz commented Oct 31, 2018

I ran the tpcc/nodes=3/w=max roachtest overnight (local change to eliminate the 2h duration specified by the test). There was a blip around 06:56 that someone may want to look into. (My run was manually ended before the end of the screenshot, so don't worry about that cliff at the end.)

[screenshot: screen shot 2018-10-31 at 10 53 20 am]

Transaction restarts were elevated during the badness. There was also a small blip of leaders w/o leaseholders. I have a copy of the logs. (Where do I put them?) They're basically just tons of "handle raft ready: 1.7s [processed=0]" lines over and over around the badness.
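
Not part of the original report, but if you want to locate the window of those warnings without the UI, a minimal sketch along these lines can bucket the "handle raft ready" log lines by minute. The log path cockroach.log is only an illustrative assumption; adjust it to wherever the node's log was copied.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"sort"
)

func main() {
	// Matches lines like:
	// W181031 06:49:33.845919 256 storage/store.go:3732 [n1,s1] handle raft ready: 1.2s [processed=0]
	// Capture group 1 is the day plus hour:minute, used as the bucket key.
	re := regexp.MustCompile(`^W(\d{6} \d{2}:\d{2}):\d{2}\.\d+ .*handle raft ready: [\d.]+s`)

	f, err := os.Open("cockroach.log") // hypothetical path to the node's log
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Count slow-ready warnings per minute.
	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if m := re.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}

	// Print the per-minute counts in chronological order.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s  %d slow raft readies\n", k, counts[k])
	}
}
```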

Full disclosure: my binary had a few local changes to the changefeedccl code (I commented out Kafka flushing and added some metrics), but changefeeds don't run in this test and I can't imagine a world in which my changes were related.

danhhz commented Oct 31, 2018

knz added the C-investigation (Further steps needed to qualify. C-label will change.), A-testing (Testing tools and infrastructure), and A-benchmarking labels on Nov 12, 2018

tbg commented Nov 12, 2018

I'm noticing that the number of goroutines (on n1) goes up quite a bit when the badness starts:

I181031 06:49:02.822001 329 server/status/runtime.go:465 [n1] runtime stats: 6.1 GiB RSS, 1902 goroutines, 683 MiB/194 MiB/1.0 GiB GO alloc/idle/total, 4.0 GiB/5.1 GiB CGO alloc/total, 103768.4 CGO/sec, 1126.1/128.5 %(u/s)time, 0.2 %gc (13x), 113 MiB/112 MiB (r/w)net
I181031 06:49:12.851888 329 server/status/runtime.go:465 [n1] runtime stats: 6.1 GiB RSS, 1932 goroutines, 659 MiB/237 MiB/1.0 GiB GO alloc/idle/total, 4.0 GiB/5.1 GiB CGO alloc/total, 102444.4 CGO/sec, 1123.2/128.0 %(u/s)time, 0.2 %gc (12x), 109 MiB/108 MiB (r/w)net
I181031 06:49:22.893057 329 server/status/runtime.go:465 [n1] runtime stats: 6.2 GiB RSS, 3930 goroutines, 724 MiB/182 MiB/1.0 GiB GO alloc/idle/total, 4.0 GiB/5.1 GiB CGO alloc/total, 131635.9 CGO/sec, 991.0/119.2 %(u/s)time, 0.1 %gc (9x), 101 MiB/97 MiB (r/w)net
I181031 06:49:32.910471 329 server/status/runtime.go:465 [n1] runtime stats: 6.2 GiB RSS, 3902 goroutines, 630 MiB/142 MiB/1.1 GiB GO alloc/idle/total, 4.0 GiB/5.1 GiB CGO alloc/total, 170590.2 CGO/sec, 1012.7/115.8 %(u/s)time, 0.1 %gc (11x), 103 MiB/103 MiB (r/w)net
W181031 06:49:33.845919 256 storage/store.go:3732 [n1,s1] handle raft ready: 1.2s [processed=0]
W181031 06:49:33.845929 263 storage/store.go:3732 [n1,s1] handle raft ready: 1.2s [processed=0]
W181031 06:49:33.845964 293 storage/store.go:3732 [n1,s1] handle raft ready: 1.2s [processed=0]

Other than that, the runtime stats don't show a red flag (I also think I convinced myself that RocksDB wasn't writing much more to disk than before). The network I/O drops by half, but it's unclear whether that's cause or symptom (probably the latter).
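
A minimal sketch (not from the original thread) of how one might pull the goroutine counts out of those periodic runtime stats lines to spot the jump without the UI. It reads a log from stdin, and the line format is assumed to match the excerpts above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	// Matches lines like:
	// I181031 06:49:22.893057 329 server/status/runtime.go:465 [n1] runtime stats: 6.2 GiB RSS, 3930 goroutines, ...
	// Capture group 1 is the timestamp, group 2 the goroutine count.
	re := regexp.MustCompile(`^I(\d{6} \d{2}:\d{2}:\d{2})\.\d+ .*runtime stats: .*?, (\d+) goroutines`)

	// Read a node's log from stdin, e.g. `go run . < cockroach.log`.
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if m := re.FindStringSubmatch(sc.Text()); m != nil {
			fmt.Printf("%s  %s goroutines\n", m[1], m[2])
		}
	}
}
```

Piping n1's log through something like this yields a per-sample goroutine series in which the jump from ~1900 to ~3900 at 06:49 stands out immediately.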

Also, most of the slow readies don't process any entries (i.e. they're not applying commands). The durations are extremely uniform (i.e. all readies take 0.7s, then that slowly increases, drops back down, etc.). There are no snapshots, and n2 and n3 look just fine.

Unfortunately I don't see much I can infer from this, but hopefully we'll catch similar badness in longer-running perf tests and can investigate it there.

tbg closed this as completed on Nov 12, 2018

danhhz commented Nov 12, 2018

Thanks for taking a look! I've been running tpcc overnight a bunch while digging into #32104. If I see this again, what should I save to help debug?


tbg commented Nov 12, 2018

Unfortunately the only real help will be leaving the cluster up so that I (or someone else) can poke around in the UI. (./cockroach debug tsdump in theory exports that data, but there's no good way to look at it afterwards.)
