stability: performance problems when adding a lot of nodes at once #12741
I've noticed in recent experiments that performance suffers when significant rebalancing is taking place. For example, in a 10-node cluster, if you take one node down for longer than 5m, performance takes a hit when rebalancing away from the down node kicks in. There might be something similar going on in this case. If you add a single node to a cluster, only one rebalance operation can take place at a time, which is a fairly small drain on background resources. If you add 32 nodes, then 32 concurrent rebalance operations can occur.
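To illustrate the point about concurrency, here is a minimal sketch (not CockroachDB's actual rebalancing code) of how a per-store "one snapshot at a time" limit makes total background work scale with the number of newly added stores rather than with cluster size. The `store` type and its channel-based slot are purely illustrative:

```go
// Illustrative sketch only: each newly added store accepts at most one
// rebalance snapshot at a time, so total concurrent background work scales
// with the number of stores being filled.
package main

import (
	"fmt"
	"sync"
	"time"
)

// store models a node's store with a one-snapshot-at-a-time limit,
// roughly mirroring the "one rebalance operation at a time" behavior
// described above.
type store struct {
	id       int
	snapshot chan struct{} // capacity 1: at most one in-flight rebalance
}

func newStore(id int) *store {
	return &store{id: id, snapshot: make(chan struct{}, 1)}
}

// receiveRebalance simulates receiving and applying one rebalance snapshot.
func (s *store) receiveRebalance(rangeID int) {
	s.snapshot <- struct{}{}          // acquire the per-store slot
	defer func() { <-s.snapshot }()   // release it when done
	time.Sleep(10 * time.Millisecond) // stand-in for snapshot transfer/apply
	fmt.Printf("store %d applied snapshot for range %d\n", s.id, rangeID)
}

func main() {
	for _, added := range []int{1, 32} {
		stores := make([]*store, added)
		for i := range stores {
			stores[i] = newStore(i + 1)
		}
		start := time.Now()
		var wg sync.WaitGroup
		const rangesPerStore = 5
		for _, s := range stores {
			for r := 0; r < rangesPerStore; r++ {
				wg.Add(1)
				go func(s *store, r int) {
					defer wg.Done()
					s.receiveRebalance(r)
				}(s, r)
			}
		}
		wg.Wait()
		// With 1 new store the work is serialized; with 32 new stores up to
		// 32 snapshots are in flight at once, multiplying background load.
		fmt.Printf("added %d store(s): %v elapsed\n", added, time.Since(start))
	}
}
```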
Assigning to @petermattis. The performance hit we take when one node goes down seems to merit some exploration for 1.0, and that fix might also help with the performance impact of adding many nodes at the same time.
The performance hit is likely attenuated by #14718. It would be interesting to re-run the experiment above once that change lands.
Alright, lots of pretty graphs to share. The tl;dr is that things worked well except when moving from 8 to 16 nodes, where performance was terrible for about 5 minutes (starting just before 18:20 in the graphs). It's not obvious from the graphs why that transition was worse than the others; it doesn't look particularly different in terms of what background work was being done at the time. The other two concerning things:
I'm going to shut down the nodes in 30 minutes or so unless anyone needs them for anything. I'll save the logs beforehand.
I'm going to wipe and try again with only block writer running, but I just noticed that performance has jumped quite noticeably since the original screenshots. The most correlated event appears to be two node liveness epochs being incremented, which could have led to a better distribution of leases across the nodes. The leases were perfectly well balanced before the epoch increments, but perhaps all the hot ranges had their leases on just one or two nodes.
Things were less rocky when running with block writer while ramping up the load. The worst thing that happened was a 25% dip in QPS when moving from 8 to 16 nodes, which is a bit worse than what we were seeing when the original issue was filed. That's obviously not great, but it's significantly better than what happened with photos. What is a bit more concerning is that the cluster was seemingly falling behind while running in its steady state at 64 nodes: note the number of under-replicated ranges and the replicas behind on their raft log. The cluster was still functioning, but not totally catching up.
I don't think there's anything actionable here for 1.0. The short 25% dip when going from 8 to 16 nodes isn't ideal, but certainly isn't a release blocker and will fall under our ruggedization goals for 1.1. I've also reopened our old gossip thrashing issue (#9819) to track the reappearance of that problem.
When running @mberhault's scalability test, which doubles the size of the cluster every half hour while running a block writer on each member node, it becomes clear that adding a lot of nodes to the cluster at once can cause some problems (particularly the jump from 32 to 64 nodes). This isn't urgently critical, since adding 32 nodes to an existing 32-node cluster within a matter of seconds isn't a top use case right now, but it is a little concerning.
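For readers unfamiliar with the test, the following is a rough sketch of its shape as described above, not the actual test harness; `startNode` and `startBlockWriter` are hypothetical stand-ins for whatever the harness does to bring up a node and point a block writer at it. The concrete symptoms observed during the run follow.

```go
// Rough sketch of the scaling test's shape: double the cluster every half
// hour while a block writer runs against every node. The helpers below are
// hypothetical stand-ins, not the real harness.
package main

import (
	"fmt"
	"time"
)

func startNode(i int)        { fmt.Printf("starting node %d\n", i) }
func startBlockWriter(i int) { fmt.Printf("starting block writer against node %d\n", i) }

func main() {
	size := 1
	startNode(1)
	startBlockWriter(1)

	const maxSize = 64
	for size < maxSize {
		time.Sleep(30 * time.Minute) // let the cluster settle under load
		// Double the cluster: e.g. 32 -> 64 adds 32 nodes at nearly the same
		// moment, which is the step that triggers the symptoms below.
		for i := size + 1; i <= size*2; i++ {
			startNode(i)
			startBlockWriter(i)
		}
		size *= 2
	}
}
```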
A bunch of node liveness epochs got incremented, indicating that nodes weren't able to heartbeat their liveness entries in time (which can lead to other problems; a sketch of the heartbeat/epoch mechanism follows this list of symptoms):
Mutex critical sections slow down:
Queue failures jump up:
Raft performance gets worse before it gets better:
As does user-visible performance, along with aborted transactions:
Most concerning of all, a bunch of lease/distsender requests apparently got stuck (and seemingly never got unstuck) when the number of nodes jumped up:
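To make the liveness-epoch observation above more concrete, here is a simplified, self-contained sketch of epoch-based node liveness. It is not CockroachDB's actual implementation; the types and method names are invented for illustration. The idea it shows: a node periodically heartbeats a liveness record with an expiration, and if the heartbeat arrives too late (for example because the node is saturated with rebalancing work), a peer can increment the epoch, after which the node's stale-epoch heartbeats fail and any leases tied to the old epoch are lost.

```go
// Simplified sketch of epoch-based node liveness (illustrative only).
package main

import (
	"fmt"
	"sync"
	"time"
)

type livenessRecord struct {
	epoch      int
	expiration time.Time
}

type livenessStore struct {
	mu      sync.Mutex
	records map[int]*livenessRecord
}

func newLivenessStore() *livenessStore {
	return &livenessStore{records: map[int]*livenessRecord{}}
}

// heartbeat extends the node's liveness record. It fails if someone has
// already incremented the epoch past what the node expects.
func (ls *livenessStore) heartbeat(nodeID, expectedEpoch int, ttl time.Duration) bool {
	ls.mu.Lock()
	defer ls.mu.Unlock()
	rec, ok := ls.records[nodeID]
	if !ok {
		rec = &livenessRecord{epoch: expectedEpoch}
		ls.records[nodeID] = rec
	}
	if rec.epoch != expectedEpoch {
		return false // our epoch is stale; leases held under it are gone
	}
	rec.expiration = time.Now().Add(ttl)
	return true
}

// incrementEpoch is what a peer does when it observes an expired record;
// this is the "epoch got incremented" event visible in the graphs.
func (ls *livenessStore) incrementEpoch(nodeID int) bool {
	ls.mu.Lock()
	defer ls.mu.Unlock()
	rec, ok := ls.records[nodeID]
	if !ok || time.Now().Before(rec.expiration) {
		return false // still live; can't increment
	}
	rec.epoch++
	return true
}

func main() {
	ls := newLivenessStore()
	const ttl = 100 * time.Millisecond

	ls.heartbeat(1, 1, ttl)

	// Simulate a heartbeat arriving too late because the node is busy with
	// snapshots, Raft traffic, etc.
	time.Sleep(150 * time.Millisecond)

	if ls.incrementEpoch(1) {
		fmt.Println("node 1's liveness expired; epoch incremented by a peer")
	}
	if !ls.heartbeat(1, 1, ttl) {
		fmt.Println("node 1's late heartbeat failed: its epoch is stale")
	}
}
```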
Metrics at https://monitoring.gce.cockroachdb.com/dashboard/db/cockroach-sql?from=1483630200000&to=1483641000000&var-cluster=sky&var-node=All&var-rate_interval=1m
I can upload the logs somewhere as well if anyone would like them.