oddity while running extended version of tpcc/nodes=3/w=max #32058
Comments
I'm noticing that the number of goroutines (on n1) goes up quite a bit when the badness starts:
The runtime stats don't show any other red flags (I also convinced myself that RocksDB wasn't writing much more to disk than it did before). Network I/O drops in half, but it's unclear whether that's cause or symptom (probably the latter). Most of the slow readies don't process any entries (i.e., they're not applying commands), and the durations are extremely uniform (all readies take 0.7s, then that slowly increases, drops back down, and so on). There are no snapshots, and n2 and n3 look just fine. Unfortunately I don't see much I can infer from this, but hopefully we'll catch similar badness in longer-running perf tests and can investigate it there.
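For anyone unfamiliar with the two signals mentioned above, here's a minimal sketch (not CockroachDB's actual instrumentation, just plain Go runtime calls and a made-up `handleReady` helper) of how the goroutine count and the per-ready processing time could be surfaced:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// handleReady is a stand-in for a per-replica Raft ready loop; the real
// CockroachDB code is different. It only illustrates the kind of timing
// that produces "handle raft ready: 1.7s [processed=0]" log lines.
func handleReady(processEntries func() int) {
	start := time.Now()
	processed := processEntries()
	if elapsed := time.Since(start); elapsed > 500*time.Millisecond {
		fmt.Printf("handle raft ready: %.1fs [processed=%d]\n",
			elapsed.Seconds(), processed)
	}
}

func main() {
	// Sample the goroutine count, the runtime stat that spiked on n1.
	fmt.Printf("goroutines: %d\n", runtime.NumGoroutine())

	// Simulate a slow ready that applies no entries (processed=0), the
	// pattern observed during the badness.
	handleReady(func() int {
		time.Sleep(700 * time.Millisecond)
		return 0
	})
}
```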
Thanks for taking a look! I've been running tpcc overnight a bunch while digging into #32104. If I see this again, what should I save to help debug?
Unfortunately, the only real help will be leaving the cluster up so that I (or someone else) can poke around in the UI.
I ran the tpcc/nodes=3/w=max roachtest overnight (with a local change to eliminate the 2h duration specified by the test). There was a blip around 06:56 that someone may want to look into. (My run was manually ended before the end of the screenshot, so don't worry about that cliff at the end.)
Transaction restarts were elevated during the badness. There was also a small blip of leaders w/o leaseholders. I have a copy of the logs. (Where do I put them?) They're basically just tons of "handle raft ready: 1.7s [processed=0]" lines over and over around the badness.
Full disclosure: my binary had a few local changes to changefeedccl code (commented out flushing kafka and added some metrics), but changefeeds do not run in this test and I can't imagine a world in which my changes were related.
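Not an official tool, just a throwaway sketch: assuming the saved logs contain lines in the format quoted above, something like this could summarize how many slow readies there were, how many applied no entries, and how the durations were distributed:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// Matches lines like "handle raft ready: 1.7s [processed=0]".
var readyRE = regexp.MustCompile(`handle raft ready: ([0-9.]+)s \[processed=(\d+)\]`)

func main() {
	f, err := os.Open(os.Args[1]) // path to a saved cockroach log file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var count, empty int
	var total, max float64
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		m := readyRE.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		secs, _ := strconv.ParseFloat(m[1], 64)
		count++
		total += secs
		if secs > max {
			max = secs
		}
		if m[2] == "0" {
			empty++ // slow readies that applied no entries
		}
	}
	if count == 0 {
		fmt.Println("no slow-ready lines found")
		return
	}
	fmt.Printf("slow readies: %d (%d with processed=0), avg %.2fs, max %.2fs\n",
		count, empty, total/float64(count), max)
}
```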