
oddity while running extended version of tpcc/nodes=3/w=max #32058

Closed · danhhz opened this issue Oct 31, 2018 · 4 comments

Labels: A-benchmarking · A-testing (Testing tools and infrastructure) · C-investigation (Further steps needed to qualify. C-label will change.)

Comments


danhhz commented Oct 31, 2018

I ran the tpcc/nodes=3/w=max roachtest overnight (local change to eliminate the 2h duration specified by the test). There was a blip around 06:56 that someone may want to look into. (My run was manually ended before the end of the screenshot, so don't worry about that cliff at the end.)

[screenshot: screen shot 2018-10-31 at 10 53 20 am]

Transaction restarts were elevated during the badness. There was also a small blip of leaders w/o leaseholders. I have a copy of the logs. (Where do I put them?) They're basically just tons of "handle raft ready: 1.7s [processed=0]" lines over and over around the badness.
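
Not part of the original report, but if you want to locate the window of those warnings without the UI, a minimal sketch along these lines can bucket the "handle raft ready" log lines by minute. The log path cockroach.log is only an illustrative assumption; adjust it to wherever the node's log was copied.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"sort"
)

func main() {
	// Matches lines like:
	// W181031 06:49:33.845919 256 storage/store.go:3732 [n1,s1] handle raft ready: 1.2s [processed=0]
	// Capture group 1 is the day plus hour:minute, used as the bucket key.
	re := regexp.MustCompile(`^W(\d{6} \d{2}:\d{2}):\d{2}\.\d+ .*handle raft ready: [\d.]+s`)

	f, err := os.Open("cockroach.log") // hypothetical path to the node's log
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Count slow-ready warnings per minute.
	counts := map[string]int{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if m := re.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}

	// Print the per-minute counts in chronological order.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s  %d slow raft readies\n", k, counts[k])
	}
}
```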

Full disclosure: my binary had a few local changes to the changefeedccl code (I commented out Kafka flushing and added some metrics), but changefeeds don't run in this test and I can't imagine a world in which my changes were related.

danhhz commented Oct 31, 2018

knz added the C-investigation (Further steps needed to qualify. C-label will change.), A-testing (Testing tools and infrastructure), and A-benchmarking labels on Nov 12, 2018

tbg commented Nov 12, 2018

I'm noticing that the number of goroutines (on n1) goes up quite a bit when the badness starts:

I181031 06:49:02.822001 329 server/status/runtime.go:465 [n1] runtime stats: 6.1 GiB RSS, 1902 goroutines, 683 MiB/194 MiB/1.0 GiB GO alloc/idle/total, 4.0 GiB/5.1 GiB CGO alloc/total, 103768.4 CGO/sec, 1126.1/128.5 %(u/s)time, 0.2 %gc (13x), 113 MiB/112 MiB (r/w)net
I181031 06:49:12.851888 329 server/status/runtime.go:465 [n1] runtime stats: 6.1 GiB RSS, 1932 goroutines, 659 MiB/237 MiB/1.0 GiB GO alloc/idle/total, 4.0 GiB/5.1 GiB CGO alloc/total, 102444.4 CGO/sec, 1123.2/128.0 %(u/s)time, 0.2 %gc (12x), 109 MiB/108 MiB (r/w)net
I181031 06:49:22.893057 329 server/status/runtime.go:465 [n1] runtime stats: 6.2 GiB RSS, 3930 goroutines, 724 MiB/182 MiB/1.0 GiB GO alloc/idle/total, 4.0 GiB/5.1 GiB CGO alloc/total, 131635.9 CGO/sec, 991.0/119.2 %(u/s)time, 0.1 %gc (9x), 101 MiB/97 MiB (r/w)net
I181031 06:49:32.910471 329 server/status/runtime.go:465 [n1] runtime stats: 6.2 GiB RSS, 3902 goroutines, 630 MiB/142 MiB/1.1 GiB GO alloc/idle/total, 4.0 GiB/5.1 GiB CGO alloc/total, 170590.2 CGO/sec, 1012.7/115.8 %(u/s)time, 0.1 %gc (11x), 103 MiB/103 MiB (r/w)net
W181031 06:49:33.845919 256 storage/store.go:3732 [n1,s1] handle raft ready: 1.2s [processed=0]
W181031 06:49:33.845929 263 storage/store.go:3732 [n1,s1] handle raft ready: 1.2s [processed=0]
W181031 06:49:33.845964 293 storage/store.go:3732 [n1,s1] handle raft ready: 1.2s [processed=0]

Other than that, the runtime stats don't show a red flag (I also think I convinced myself that RocksDB wasn't writing much more to disk than before). The network I/O drops by half, but it's unclear whether that's cause or symptom (probably the latter).
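
A minimal sketch (not from the original thread) of how one might pull the goroutine counts out of those periodic runtime stats lines to spot the jump without the UI. It reads a log from stdin, and the line format is assumed to match the excerpts above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	// Matches lines like:
	// I181031 06:49:22.893057 329 server/status/runtime.go:465 [n1] runtime stats: 6.2 GiB RSS, 3930 goroutines, ...
	// Capture group 1 is the timestamp, group 2 the goroutine count.
	re := regexp.MustCompile(`^I(\d{6} \d{2}:\d{2}:\d{2})\.\d+ .*runtime stats: .*?, (\d+) goroutines`)

	// Read a node's log from stdin, e.g. `go run . < cockroach.log`.
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if m := re.FindStringSubmatch(sc.Text()); m != nil {
			fmt.Printf("%s  %s goroutines\n", m[1], m[2])
		}
	}
}
```

Piping n1's log through something like this yields a per-sample goroutine series in which the jump from ~1900 to ~3900 at 06:49 stands out immediately.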

Also, most of the slow readies don't process any entries (i.e. they're not applying commands). The durations are extremely uniform (i.e. all readies take 0.7s, then that slowly increases, drops back down, etc.). There are no snapshots, and n2 and n3 look just fine.

Unfortunately I don't see much I can infer from this, but hopefully we'll catch similar badness in longer-running perf tests and can investigate it there.

tbg closed this as completed on Nov 12, 2018

danhhz commented Nov 12, 2018

Thanks for taking a look! I've been running tpcc overnight a bunch while digging into #32104. If I see this again, what should I save to help debug?


tbg commented Nov 12, 2018

Unfortunately the only real help will be leaving the cluster up so that I (or someone else) can poke around in the UI. (./cockroach debug tsdump in theory exports that data, but there's no good way to look at it afterwards.)
