stability: performance problems when adding a lot of nodes at once #12741
I've noticed in recent experiments that performance suffers when significant rebalancing is taking place. For example, in a 10-node cluster, if you take one node down for longer than 5m, performance takes a hit when rebalancing away from the down node kicks in. There might be something similar going on in this case. If you add a single node to a cluster, only one rebalance operation can take place at a time, which is a fairly small drain on background resources. If you add 32 nodes, then 32 concurrent rebalance operations can occur.
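To illustrate the point about concurrency, here is a minimal sketch (not CockroachDB's actual rebalancing code) of how a per-store "one snapshot at a time" limit makes total background work scale with the number of newly added stores rather than with cluster size. The `store` type and its channel-based slot are purely illustrative:

```go
// Illustrative sketch only: each newly added store accepts at most one
// rebalance snapshot at a time, so total concurrent background work scales
// with the number of stores being filled.
package main

import (
	"fmt"
	"sync"
	"time"
)

// store models a node's store with a one-snapshot-at-a-time limit,
// roughly mirroring the "one rebalance operation at a time" behavior
// described above.
type store struct {
	id       int
	snapshot chan struct{} // capacity 1: at most one in-flight rebalance
}

func newStore(id int) *store {
	return &store{id: id, snapshot: make(chan struct{}, 1)}
}

// receiveRebalance simulates receiving and applying one rebalance snapshot.
func (s *store) receiveRebalance(rangeID int) {
	s.snapshot <- struct{}{}          // acquire the per-store slot
	defer func() { <-s.snapshot }()   // release it when done
	time.Sleep(10 * time.Millisecond) // stand-in for snapshot transfer/apply
	fmt.Printf("store %d applied snapshot for range %d\n", s.id, rangeID)
}

func main() {
	for _, added := range []int{1, 32} {
		stores := make([]*store, added)
		for i := range stores {
			stores[i] = newStore(i + 1)
		}
		start := time.Now()
		var wg sync.WaitGroup
		const rangesPerStore = 5
		for _, s := range stores {
			for r := 0; r < rangesPerStore; r++ {
				wg.Add(1)
				go func(s *store, r int) {
					defer wg.Done()
					s.receiveRebalance(r)
				}(s, r)
			}
		}
		wg.Wait()
		// With 1 new store the work is serialized; with 32 new stores up to
		// 32 snapshots are in flight at once, multiplying background load.
		fmt.Printf("added %d store(s): %v elapsed\n", added, time.Since(start))
	}
}
```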
Assigning to @petermattis. The performance hit we take when one node goes down seems to merit some exploration for 1.0, and that fix might also help with the performance impact of adding many nodes at the same time.
The performance hit is likely attenuated by #14718. It would be interesting to re-run the experiment above once that change lands.
Alright, lots of pretty graphs to share. The tl;dr is that things worked well except when moving from 8 to 16 nodes, where performance was terrible for about 5 minutes (starting just before 18:20 in the graphs). It's not obvious from the graphs why that transition was worse than the others; it doesn't look particularly different in terms of what background work was being done at the time. The other two concerning things:
I'm going to shut down the nodes in 30 minutes or so unless anyone needs them for anything. I'll save the logs beforehand.
I'm going to wipe and try again with only block writer running, but I just noticed that performance has jumped quite noticeably since the original screenshots. The most correlated event appears to be two node liveness epochs being incremented, which could have led to a better distribution of leases across the nodes. The leases were perfectly well balanced before the epoch increments, but perhaps all the hot ranges had their leases on just one or two nodes.
Things were less rocky when running with block writer while ramping up the load. The worst thing that happened was a 25% dip in QPS when moving from 8 to 16 nodes, which is a bit worse than what we were seeing when the original issue was filed. That's obviously not great, but it's significantly better than what happened with photos. What is a bit more concerning is that the cluster was seemingly falling behind while running in its steady state at 64 nodes: note the number of under-replicated ranges and the replicas behind on their raft log. The cluster was still functioning, but not totally catching up.
I don't think there's anything actionable here for 1.0. The short 25% dip when going from 8 to 16 nodes isn't ideal, but certainly isn't a release blocker and will fall under our ruggedization goals for 1.1. I've also reopened our old gossip thrashing issue (#9819) to track the reappearance of that problem.
When running @mberhault's scalability test, which doubles the size of the cluster every half hour while running a block writer on each member node, it becomes clear that adding a lot of nodes to the cluster at once can cause some problems (particularly the jump from 32 to 64 nodes). This isn't urgently critical, since adding 32 nodes to an existing 32-node cluster within a matter of seconds isn't a top use case right now, but it is a little concerning.
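For readers unfamiliar with the test, the following is a rough sketch of its shape as described above, not the actual test harness; `startNode` and `startBlockWriter` are hypothetical stand-ins for whatever the harness does to bring up a node and point a block writer at it. The concrete symptoms observed during the run follow.

```go
// Rough sketch of the scaling test's shape: double the cluster every half
// hour while a block writer runs against every node. The helpers below are
// hypothetical stand-ins, not the real harness.
package main

import (
	"fmt"
	"time"
)

func startNode(i int)        { fmt.Printf("starting node %d\n", i) }
func startBlockWriter(i int) { fmt.Printf("starting block writer against node %d\n", i) }

func main() {
	size := 1
	startNode(1)
	startBlockWriter(1)

	const maxSize = 64
	for size < maxSize {
		time.Sleep(30 * time.Minute) // let the cluster settle under load
		// Double the cluster: e.g. 32 -> 64 adds 32 nodes at nearly the same
		// moment, which is the step that triggers the symptoms below.
		for i := size + 1; i <= size*2; i++ {
			startNode(i)
			startBlockWriter(i)
		}
		size *= 2
	}
}
```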
A bunch of node liveness epochs got incremented, indicating that nodes weren't able to heartbeat their liveness entries in time (which can lead to other problems; a sketch of the heartbeat/epoch mechanism follows this list of symptoms):
Mutex critical sections slow down:
Queue failures jump up:
Raft performance gets worse before it gets better:
As does user-visible performance, along with aborted transactions:
Most concerning of all, a bunch of lease/distsender requests apparently got stuck (and seemingly never got unstuck) when the number of nodes jumped up:
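To make the liveness-epoch observation above more concrete, here is a simplified, self-contained sketch of epoch-based node liveness. It is not CockroachDB's actual implementation; the types and method names are invented for illustration. The idea it shows: a node periodically heartbeats a liveness record with an expiration, and if the heartbeat arrives too late (for example because the node is saturated with rebalancing work), a peer can increment the epoch, after which the node's stale-epoch heartbeats fail and any leases tied to the old epoch are lost.

```go
// Simplified sketch of epoch-based node liveness (illustrative only).
package main

import (
	"fmt"
	"sync"
	"time"
)

type livenessRecord struct {
	epoch      int
	expiration time.Time
}

type livenessStore struct {
	mu      sync.Mutex
	records map[int]*livenessRecord
}

func newLivenessStore() *livenessStore {
	return &livenessStore{records: map[int]*livenessRecord{}}
}

// heartbeat extends the node's liveness record. It fails if someone has
// already incremented the epoch past what the node expects.
func (ls *livenessStore) heartbeat(nodeID, expectedEpoch int, ttl time.Duration) bool {
	ls.mu.Lock()
	defer ls.mu.Unlock()
	rec, ok := ls.records[nodeID]
	if !ok {
		rec = &livenessRecord{epoch: expectedEpoch}
		ls.records[nodeID] = rec
	}
	if rec.epoch != expectedEpoch {
		return false // our epoch is stale; leases held under it are gone
	}
	rec.expiration = time.Now().Add(ttl)
	return true
}

// incrementEpoch is what a peer does when it observes an expired record;
// this is the "epoch got incremented" event visible in the graphs.
func (ls *livenessStore) incrementEpoch(nodeID int) bool {
	ls.mu.Lock()
	defer ls.mu.Unlock()
	rec, ok := ls.records[nodeID]
	if !ok || time.Now().Before(rec.expiration) {
		return false // still live; can't increment
	}
	rec.epoch++
	return true
}

func main() {
	ls := newLivenessStore()
	const ttl = 100 * time.Millisecond

	ls.heartbeat(1, 1, ttl)

	// Simulate a heartbeat arriving too late because the node is busy with
	// snapshots, Raft traffic, etc.
	time.Sleep(150 * time.Millisecond)

	if ls.incrementEpoch(1) {
		fmt.Println("node 1's liveness expired; epoch incremented by a peer")
	}
	if !ls.heartbeat(1, 1, ttl) {
		fmt.Println("node 1's late heartbeat failed: its epoch is stale")
	}
}
```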
Metrics at https://monitoring.gce.cockroachdb.com/dashboard/db/cockroach-sql?from=1483630200000&to=1483641000000&var-cluster=sky&var-node=All&var-rate_interval=1m
I can upload the logs somewhere as well if anyone would like them.