
stability: performance problems when adding a lot of nodes at once #12741

Closed

a-robinson opened this issue Jan 6, 2017 · 7 comments

Labels
A-kv-distribution Relating to rebalancing and leasing. C-performance Perf of queries or internals. Solution not expected to change functional behavior. X-stale

Comments

@a-robinson
Contributor

When running @mberhault's scalability test, which doubles the size of the cluster every half hour while running a block writer on each member node, it becomes clear that adding a lot of nodes to the cluster at once can cause some problems (particularly the jump from 32 to 64 nodes). This isn't urgently critical, since adding 32 nodes to an existing 32-node cluster within a matter of seconds isn't a top use case right now, but it is a little concerning.
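For anyone unfamiliar with the test, its shape is roughly the following. This is only a minimal sketch of the doubling procedure, not @mberhault's actual harness; `startNode` and `startBlockWriter` are hypothetical stand-ins for however the real scripts launch `cockroach` processes and block_writer load generators.

```go
// Sketch of a cluster-doubling scalability test. startNode and
// startBlockWriter are hypothetical helpers standing in for however the
// real harness launches cockroach processes and block_writer load
// generators; they are not part of any CockroachDB API.
package main

import "time"

func startNode(id int)        { /* e.g. launch `cockroach start --join=...` for node id */ }
func startBlockWriter(id int) { /* e.g. launch a block_writer pointed at node id */ }

func main() {
	size := 1
	startNode(0)
	startBlockWriter(0)

	for size < 64 {
		time.Sleep(30 * time.Minute) // let the cluster settle for half an hour
		// Double the cluster: add `size` new nodes all at once, each with
		// its own block writer, mirroring the 32 -> 64 jump described above.
		for i := size; i < 2*size; i++ {
			startNode(i)
			startBlockWriter(i)
		}
		size *= 2
	}
}
```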

A bunch of node liveness epochs got incremented, indicating that nodes weren't able to properly heartbeat their liveness entries (which can lead to other problems):

[screenshot]
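For context on why the epoch counter matters: each node periodically rewrites its liveness record to push out its expiration, and a node that observes an expired record can increment its epoch, which invalidates the slow node's epoch-based leases. The sketch below is purely conceptual and is not CockroachDB's actual liveness code; all types and names are illustrative.

```go
// Conceptual sketch of epoch-based node liveness; the real CockroachDB
// implementation lives in the storage package and differs in detail.
package main

import (
	"fmt"
	"time"
)

// LivenessRecord is an illustrative stand-in for a node's liveness entry.
type LivenessRecord struct {
	NodeID     int
	Epoch      int64
	Expiration time.Time
}

// Heartbeat extends the record's expiration; under heavy load (e.g. many
// new nodes rebalancing at once) this write can miss its deadline.
func (l *LivenessRecord) Heartbeat(interval time.Duration) {
	l.Expiration = time.Now().Add(interval)
}

// MaybeIncrementEpoch is what a peer does when it observes an expired
// record: bumping the epoch invalidates the node's epoch-based leases,
// which is why a jump in the epoch counter is a sign of trouble.
func MaybeIncrementEpoch(l *LivenessRecord) bool {
	if time.Now().After(l.Expiration) {
		l.Epoch++
		return true
	}
	return false
}

func main() {
	rec := &LivenessRecord{NodeID: 1, Epoch: 1}
	rec.Heartbeat(4500 * time.Millisecond)
	// Simulate a node that then fails to heartbeat before expiration.
	rec.Expiration = time.Now().Add(-time.Second)
	if MaybeIncrementEpoch(rec) {
		fmt.Printf("n%d epoch incremented to %d\n", rec.NodeID, rec.Epoch)
	}
}
```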

Mutex critical sections slow down:

[screenshots]

Queue failures jump up:

[screenshot]

Raft performance gets worse before it gets better:

[screenshots]

Along with user-visible performance degradation and aborted transactions:

[screenshots]

Most concerningly, a bunch of lease/distsender requests apparently got stuck (and seemingly never unstuck) when the number of nodes jumped up:

[screenshots]

Metrics at https://monitoring.gce.cockroachdb.com/dashboard/db/cockroach-sql?from=1483630200000&to=1483641000000&var-cluster=sky&var-node=All&var-rate_interval=1m

I can upload the logs somewhere as well if anyone would like them.

@petermattis
Collaborator

I've noticed in recent experiments that performance suffers when significant rebalancing is taking place. For example, in a 10-node cluster, if you take one node down for longer than 5m, performance takes a hit when rebalancing away from the down node kicks in. There might be something similar going on in this case. If you add a single node to a cluster, only one rebalance operation can take place at a time, which is a fairly small drain on background resources. If you add 32 nodes, then 32 concurrent rebalance operations can occur.
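One way to picture a mitigation, purely as an illustration (this is not how the allocator or snapshot queue is actually written): bound the number of rebalance operations in flight with a semaphore, so that adding 32 nodes queues up the work instead of running 32 transfers concurrently.

```go
// Illustrative sketch of throttling concurrent rebalance operations with a
// semaphore; not CockroachDB's actual allocator or snapshot code.
package main

import (
	"fmt"
	"sync"
)

func rebalanceRange(rangeID int) {
	// Placeholder for sending a snapshot and updating the range descriptor;
	// in reality this is a large, expensive operation.
	fmt.Printf("rebalancing r%d\n", rangeID)
}

func main() {
	const maxConcurrent = 2 // cap on background rebalance work
	sem := make(chan struct{}, maxConcurrent)

	var wg sync.WaitGroup
	for rangeID := 1; rangeID <= 32; rangeID++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot; blocks once maxConcurrent are running
			defer func() { <-sem }() // release the slot
			rebalanceRange(id)
		}(rangeID)
	}
	wg.Wait()
}
```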

@petermattis added this to the 1.0 milestone Feb 23, 2017
@dianasaur323
Contributor

dianasaur323 commented Apr 10, 2017

Assigning to @petermattis. It seems like the performance hit we take when one node goes down merits some exploration for 1.0, and a fix there might also help with the performance impact of adding too many nodes at the same time?

@petermattis
Collaborator

The performance hit is likely attenuated by #14718. It would be interesting to re-run the experiment above when that change lands.

@a-robinson
Contributor Author

Alright, lots of pretty graphs to share. The tl;dr is that things worked well except when moving from 8 to 16 nodes, where performance was terrible for about 5 minutes (starting just before 18:20 in the graphs). It's not obvious from the graphs why that transition was worse than any of the others. It doesn't look particularly different from the others in terms of what background work was being done at the time.

The other two concerning things:

  • Gossip thrashing in large clusters appears to have returned, and I'm not sure what gossip changes since we fixed it could have caused it to come back.
  • Performance didn't really improve much as nodes were added, and actually got worse when we moved from 32 to 64 nodes. This might just be due to the workload, though -- the only load being generated is from a single photos instance on each node with --users=10.

I'm going to shut down the nodes in 30 minutes or so unless anyone needs them for anything. I'll save the logs beforehand.

[screenshots]

@a-robinson
Contributor Author

a-robinson commented Apr 24, 2017

I'm gonna wipe and try again with only block writer running, but I just noticed that the performance has jumped quite noticeably since the original screenshots:

[screenshot]

The most closely correlated event appears to be two node liveness epochs being incremented, which could have led to a better distribution of leases across the nodes. Lease counts were well balanced before the epoch increments, but perhaps all the hot ranges had their leases on just one or two nodes.

[screenshot]
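For anyone trying to reproduce this, one way to eyeball the lease distribution is to count leaseholders per node. The sketch below assumes a CockroachDB version where crdb_internal.ranges exposes a lease_holder column (the table's schema has changed over time) and uses the lib/pq driver; note that a balanced count still wouldn't reveal which leases are hot, which is exactly the caveat above.

```go
// Sketch: count how many range leases each node holds, assuming
// crdb_internal.ranges exposes a lease_holder column (true on recent
// CockroachDB versions; older releases may differ).
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver used to talk to CockroachDB
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(
		"SELECT lease_holder, count(*) FROM crdb_internal.ranges GROUP BY lease_holder ORDER BY lease_holder")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var nodeID, leases int
		if err := rows.Scan(&nodeID, &leases); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("n%d holds %d leases\n", nodeID, leases)
	}
}
```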

@a-robinson
Contributor Author

Things were less rocky when ramping up the cluster with only block writer running. The worst thing that happened was a 25% dip in QPS when moving from 8 to 16 nodes, which is a bit worse than what we were seeing when the original issue was filed. That's obviously not great, but it's significantly better than what happened with photos.

What is a bit more concerning is that the cluster was seemingly falling behind while running in its steady state at 64 nodes. Note the number of under-replicated ranges and replicas behind on their raft log. The cluster was still functioning, but not totally catching up.

[screenshots]
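The same signals can also be pulled off each node's Prometheus endpoint rather than eyeballed in the graphs. A minimal scrape sketch, assuming the /_status/vars path and the ranges_underreplicated / raftlog_behind metric names, which may differ across versions:

```go
// Minimal sketch that scrapes a node's Prometheus endpoint and prints the
// under-replicated-ranges and raft-log-behind gauges. The endpoint path and
// metric names are assumptions based on recent CockroachDB versions.
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://localhost:8080/_status/vars")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	wanted := []string{"ranges_underreplicated", "raftlog_behind"}
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		for _, m := range wanted {
			// Prometheus text format: "name value" or "name{labels} value".
			if strings.HasPrefix(line, m+" ") || strings.HasPrefix(line, m+"{") {
				fmt.Println(line)
			}
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```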

@a-robinson
Contributor Author

I don't think there's anything actionable here for 1.0. The short 25% dip when going from 8 to 16 nodes isn't ideal, but certainly isn't a release blocker and will fall under our ruggedization goals for 1.1. I've also reopened our old gossip thrashing issue (#9819) to track the reappearance of that problem.

@a-robinson modified the milestones: 1.1, 1.0 Apr 26, 2017
@a-robinson modified the milestones: 1.2, 1.1 Aug 14, 2017
@cuongdo modified the milestones: Later, 1.2 Sep 7, 2017
@a-robinson removed their assignment Sep 7, 2017
@knz added the C-performance label Apr 27, 2018
@tbg added the A-coreperf label Aug 21, 2018
@petermattis removed this from the Later milestone Oct 5, 2018
@nvanbenschoten added the A-kv-distribution label and removed the A-coreperf label Oct 16, 2018