
storage: break up giant raftMu critical section downstream of raft #19156

Closed
a-robinson opened this issue Oct 10, 2017 · 16 comments
Labels
  • A-kv-replication: Relating to Raft, consensus, and coordination.
  • C-cleanup: Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.
  • C-performance: Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@a-robinson (Contributor) commented Oct 10, 2017

In testing our single-range write throughput (#18657), it's become clear that raftMu is the primary bottleneck. We hold it for a very long time as handleRaftReadyRaftMuLocked executes.

In my testing in #18657 it's not unusual for raftMu to be held for many tens of milliseconds at a time, and occasionally more than 100ms. And from adding some poor man's tracing to the code, it appears as though the bulk of the time is spent sending messages and applying committed entries, with the latter being the worst (and least predictable) offender.

There seem to be three ways to attack this:

  1. Try to make the message sending and the applying of committed entries faster. There may be fewer big wins available here, but long term it'd likely reduce CPU usage the most.
  2. Parallelize the sending of messages (and maybe also the applying of committed entries?). This would help single range throughput, but potentially at the expense of total cluster throughput in workloads that aren't skewed toward a single range. Load-based splitting (storage: Load-based splitting #19154) could probably get similar results but in a more generalized way.
  3. Break up the big critical section. This would mean unlocking and re-locking raftMu as necessary rather than holding it throughout the entire process. I'll need some guidance (from @bdarnell? @petermattis?) on whether this is even remotely feasible, since it's possible that all of that work really does need to be done as part of a single critical section. If we can break it up, though, it opens up opportunities for pipelining, for example having one goroutine persisting new log entries, one sending messages, and one applying committed entries.

According to the etcd-raft usage instructions, it'd be safe for us to advance the raft node as soon as we've persisted the new raft entries to disk (i.e. before the two slow steps referenced above), but those steps as written today do appear to rely on some things protected by raftMu. That would seemingly make option 3 viable in the sense that handleRaftReadyRaftMuLocked could return as soon as it had synced new entries to disk and Advanced the raft group, with the message-sending and applying continuing in the background so long as we ensure that all messages/entries are processed in the order they were returned by raft.
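
To make that shape concrete, here is a minimal sketch of the idea. All names and types are hypothetical stand-ins, not the actual CockroachDB or etcd-raft APIs: raftMu is held only for the sync-and-advance step, and a single worker goroutine does the sending and applying in Ready order.

```go
package raftready

import "sync"

// ready is a simplified stand-in for raft.Ready; the field types are placeholders.
type ready struct {
	entries          [][]byte // new log entries that must be synced to disk
	committedEntries [][]byte // entries that are now safe to apply
	messages         [][]byte // outgoing raft messages
}

// replica is a hypothetical stand-in for the real Replica; the function
// fields stand in for the real storage, raft, transport, and apply code.
type replica struct {
	raftMu  sync.Mutex
	readyCh chan ready // drained FIFO by a single worker goroutine

	persist func([][]byte) error // sync new entries to the raft log
	advance func(ready)          // acknowledge the Ready to the raft group
	send    func([][]byte)       // deliver outgoing raft messages
	apply   func([][]byte)       // apply committed entries
}

// handleRaftReady holds raftMu only long enough to persist the new entries
// and advance the raft group; the slow work happens on the worker.
func (r *replica) handleRaftReady(rd ready) error {
	r.raftMu.Lock()
	err := r.persist(rd.entries)
	if err == nil {
		r.advance(rd)
	}
	r.raftMu.Unlock()
	if err != nil {
		return err
	}
	// A single producer and a single consumer on one channel preserve FIFO
	// order, so each Ready is still sent/applied in the order raft returned it.
	r.readyCh <- rd
	return nil
}

// runReadyWorker does the message sending and entry application that used to
// happen inside the raftMu critical section.
func (r *replica) runReadyWorker() {
	for rd := range r.readyCh {
		r.send(rd.messages)
		r.apply(rd.committedEntries)
	}
}
```

If the channel fills up, handleRaftReady still blocks on the send, but it does so without holding raftMu, which is the point of the restructuring.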

Sorry for the giant block of text, but I wanted to explain where I'm at with single-range performance and get informed input on the viability of breaking up the big raftMu critical section.

a-robinson added the C-performance and refactoring labels on Oct 10, 2017
@a-robinson (Contributor Author):

And cc #18779, since this could be a bit of a refactoring project.

@petermattis (Collaborator):

I think we need mutual exclusion on handling of Raft ready messages. But perhaps that could be achieved with the addition of another mutex (raftReadyMu?). Gah, my head is hurting looking at this locking again.

@a-robinson (Contributor Author):

Do you think that we need that due to how raft works or due to how our various above-raft mechanisms work? My reading of raft itself is that it would be ok to pipeline the different stages of handling the Raft ready messages, as long as we:

  • Finish persisting new entries before doing anything else with a given Raft ready
  • Advance the Raft state before moving on to a new Raft ready
  • Process each Raft ready in order at each stage of the pipeline (see the sketch below this list)
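
As a rough sketch of those three constraints (hypothetical names; each stage is its own goroutine fed by an ordered channel):

```go
package raftpipeline

// ready is a simplified stand-in for raft.Ready; the field types are
// placeholders, not the real raftpb types.
type ready struct {
	entries          [][]byte // new entries that must be persisted first
	committedEntries [][]byte // entries that may now be applied
	messages         [][]byte // outgoing raft messages
}

// pipeline wires the stages together; the function fields stand in for the
// real persistence, raft, transport, and apply code.
type pipeline struct {
	persist func([][]byte) error // fsync new entries to the raft log
	advance func(ready)          // acknowledge the Ready to the raft group
	send    func([][]byte)       // deliver outgoing messages
	apply   func([][]byte)       // apply committed entries to the state machine
}

// run drains Readys in order. Constraint 1: persist before anything else is
// done with a Ready. Constraint 2: advance before pulling the next Ready.
// Constraint 3: sendC and applyC are each consumed by a single goroutine, so
// every stage sees Readys in the order raft produced them.
func (p *pipeline) run(readyC <-chan ready) {
	sendC := make(chan ready, 16)
	applyC := make(chan ready, 16)

	go func() { // message-sending stage
		for rd := range sendC {
			p.send(rd.messages)
		}
	}()
	go func() { // entry-application stage
		for rd := range applyC {
			p.apply(rd.committedEntries)
		}
	}()

	for rd := range readyC {
		if err := p.persist(rd.entries); err != nil {
			return // a real implementation would mark the replica as corrupted
		}
		p.advance(rd)
		sendC <- rd
		applyC <- rd
	}
	close(sendC)
	close(applyC)
}
```

The channel capacities are arbitrary; they only determine how far the persist stage can run ahead before it blocks on a slow downstream stage.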

Gah, my head is hurting looking at this locking again.

Yeah, it is possible that load-based splitting and optimizing the work being done (rather than the locking/ordering) would be a better place to focus efforts.

@petermattis (Collaborator):

My statement was based on how our various above-Raft mechanisms work. I think it would be problematic for multiple routines to be running processRaftCommand simultaneously.

I can understand why raftMu is held for tens of milliseconds: the call to synchronously commit a batch can be slow. What I'm surprised at is that this causes problems because the only call I see to lock raftMu is in Replica.propose (the one in maybeInitializeRaftGroup should rarely be called). And you said that removing the raftMu.Lock in Replica.propose didn't help performance. So is there some other usage of raftMu that I'm not seeing that is causing problems?

@tbg (Member) commented Oct 11, 2017

Where exactly are we spending significant amounts of time when sending messages? sendRaftMessage itself looks like it should be rather fast, and the remaining stuff there deals with sideloading and doesn't (shouldn't) actually do anything in your testing.

I can understand why raftMu is held for tens of milliseconds: the call to synchronously commit a batch can be slow. What I'm surprised at is that this causes problems because the only call I see to lock raftMu is in Replica.propose

Can we revisit #18657? I'm curious how the single-range case compares against the multi-range one (assuming no raftMu acquired during proposals). Naively the amount of syncing going on should be about the same (assuming enough concurrency). Each proposal should eat about 1-2 sync latencies (since it has to wait out the ongoing sync, and then its own sync) and throughput should be the same. Probably both of my statements are wrong, but I think it's going to be helpful to find out. Also, how does min_wal_sync_interval=1ms or even 5ms influence things?

@a-robinson (Contributor Author) commented Oct 11, 2017

My statement was based on how our various above-Raft mechanisms work. I think it would be problematic for multiple routines to be running processRaftCommand simultaneously.

Ack, thanks.

I can understand why raftMu is held for tens of milliseconds: the call to synchronously commit a batch can be slow.

In my printf-style tracing, the synchronous committing of new raft log entries never took more than a few milliseconds. The truly slow part was doing the rest of the processing after that to respond to raft messages (occasionally) and apply committed commands (most frequently).

So is there some other usage of raftMu that I'm not seeing that is causing problems?

Not that I've noticed. The problem I've observed is that handleRaftReadyRaftMuLocked for a given batch takes a very long time, preventing the next batch of requests from beginning processing (the backlog of these requests getting too large is what causes dropped raft messages). Hence the suggestion to pipeline the different stages of handleRaftReadyRaftMuLocked.

@a-robinson (Contributor Author):

Where exactly are we spending significant amounts of time when sending messages?

The time here doesn't seem to get spent in any large outliers, rather it just takes a while to send so many messages when a large batch of requests has built up. For example, some trace outputs include:

  • Sending ~125 messages taking 4.2ms
  • Sending ~70 messages taking 2.7ms
  • Sending ~175 messages taking 3.5ms
  • Sending ~125 messages taking 7.9ms

It appears to be more of an issue on the leader than on the followers. It's certainly not a ton of time, but every millisecond that we're holding the lock to send messages is blocking new requests from making progress. Although it'd be less impactful than speeding up the applying of committed commands, moving the message-sending out of the critical section would be the easiest change from a safety perspective.

Can we revisit #18657? I'm curious how the single-range case compares against the multi-range one (assuming no raftMu acquired during proposals).

I'll re-test it today, but I can't remember a single configuration of #18657 that I've tested in which a split didn't give a considerable spike in throughput (and a corresponding decrease in dropped raft messages). The bottleneck for single-range writes is waiting for the previous raft ready to finish processing before new messages can be processed.

Each proposal should eat about 1-2 sync latencies (since it has to wait out the ongoing sync, and then its own sync) and throughput should be the same.

That might be true if we didn't build up such a big backlog (100 raft requests) that we started dropping new ones, requiring them to be re-sent. But even so, it's not so much about sync latencies at this point (which are only a few milliseconds) as it is about handleRaftReady processing time as a whole, since the applying of commands is dwarfing the syncing of entries.

Probably both of my statements are wrong, but I think it's going to be helpful to find out. Also, how does min_wal_sync_interval=1ms or even 5ms influence things?

For the single-range case, switching to 1ms had no noticeable effect (which makes sense given that the time taken by apply, which doesn't require syncing, is the main bottleneck). Switching to 5ms hurt throughput and latency by a small but noticeable amount, around 5-10%.

@a-robinson (Contributor Author) commented Oct 11, 2017

Can we revisit #18657? I'm curious how the single-range case compares against the multi-range one

Here's a single split. Ignore the absolute values since they're being affected by my very verbose logging; just focus on the relative changes:

[screenshot: metrics graph, 2017-10-11 10:25:33 am]

[screenshot: metrics graph, 2017-10-11 10:25:48 am]

And here's splitting from 2 ranges to 21 ranges (by setting max range size down to 8MiB):

[screenshot: metrics graph, 2017-10-11 10:31:06 am]

[screenshot: metrics graph, 2017-10-11 10:31:13 am]

@tbg (Member) commented Oct 11, 2017

The time here doesn't seem to get spent in any large outliers, rather it just takes a while to send so many messages when a large batch of requests has built up.

Hmm, that just seems like... a lot, but I missed that sendRaftMessage acquires replicaMu, which it would then do 125x, and we do lock that mutex a bunch during proposing. It seems like an obvious improvement here would be to avoid re-locking over and over again (I bet you could just skip that lock without things blowing up too much for your testing). Once that's done, if it's still not where we'd like it to be, we could have sendRaftMessage be called in a goroutine, or better, have an on-demand worker goroutine that handles message sending (just as you proposed).
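
A hedged sketch of the hoisting idea, using hypothetical names and a single stand-in field for whatever state sendRaftMessage actually reads under replica.mu:

```go
package sendmsgs

import "sync"

// message and replica are hypothetical stand-ins; the real sendRaftMessage
// also deals with sideloading, transport queues, etc.
type message struct{ to uint64 }

type replica struct {
	mu struct {
		sync.Mutex
		destroyed bool // stand-in for state read under replica.mu per message
	}
}

// sendMessages takes replica.mu once for the whole batch instead of once per
// message, avoiding ~125 lock/unlock pairs for a large Ready.
func (r *replica) sendMessages(msgs []message, transportSend func(message)) {
	r.mu.Lock()
	destroyed := r.mu.destroyed // snapshot the shared state once, up front
	r.mu.Unlock()
	if destroyed {
		return
	}
	for _, m := range msgs {
		transportSend(m) // no per-message replica.mu acquisition
	}
}
```

Whether this is actually safe depends on what sendRaftMessage needs replica.mu for; the sketch only shows the lock acquisition moving out of the per-message loop.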

Although it'd be less impactful than speeding up the applying of committed commands

Oh, I see. Sorry, I had made the misguided assumption that by "applying committed entries" you meant the sync, but I'm not sure why I thought that.

You're right that it's unfortunate that we're still holding on to raftMu while we apply committed entries, and your initial idea of moving the Advance up to right after the sync and passing the rest off to a worker goroutine seems like the right way to address this. (The way I imagine these goroutines is that they're singletons that shut down when they've been idle for O(seconds) and are recreated on demand.) I think it shouldn't be too hard to prototype this to see what the impact would be (or rather, to confirm that it gets you where you think we should be).

In carrying out the steps above, we may have to change the mutex the sideloaded storage is tied to (currently raftMu, but there's no deeper reason for that other than it being convenient at the time), but let's see about that when it becomes relevant. Certainly doesn't matter for workloads that don't RESTORE.

@bdarnell (Contributor):

Transplanting my comments from #18657 (comment):

If raftMu is only used to guard against concurrent replica GC (which was the original idea but I'm not sure if it's stayed true as things have evolved), maybe there's room to turn raftMu into a RWMutex and lock it in read mode everywhere but in creation and destruction of the raft group.

The RWMutex suggested at

    // TODO(bdarnell): a Replica could be destroyed immediately after
    // Store.Send finds the Replica and releases the lock. We need
    // another RWMutex to be held by anything using a Replica to ensure
    // that everything is finished before releasing it. #7169

could be a replacement for raftMu.
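
For illustration, the locking pattern being proposed would look roughly like this sketch (hypothetical names, not the actual Replica code):

```go
package rwsketch

import "sync"

// replica is a hypothetical stand-in; raftMu here is the proposed RWMutex.
type replica struct {
	raftMu sync.RWMutex
}

// handleRaftReady and other steady-state raft work would take the lock in
// shared (read) mode, so they no longer serialize against each other just to
// guard against replica destruction.
func (r *replica) handleRaftReady(process func()) {
	r.raftMu.RLock()
	defer r.raftMu.RUnlock()
	process()
}

// destroyRaftGroup (e.g. replica GC) would take the exclusive lock, waiting
// out all in-flight readers before tearing the raft group down.
func (r *replica) destroyRaftGroup(teardown func()) {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()
	teardown()
}
```

As the follow-up comment below notes, the split/snapshot race still requires exclusive access, so this pattern alone wouldn't be sufficient.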

@a-robinson (Contributor Author):

Thanks for helping to clarify things @bdarnell and @tschottdorf! I'm excited to continue this line of testing soon.

@bdarnell (Contributor):

As I rediscovered in #19172 (comment), raftMu has an important use in handling the split/snapshot race (and this use requires an exclusive lock, so it won't be easy to replace it with an RWMutex; a sync.Cond with an explicit state variable might work).
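
A minimal sketch of the sync.Cond alternative, with a hypothetical state variable standing in for whatever the split/snapshot race actually needs to track:

```go
package condsketch

import "sync"

// replica guards an explicit "a snapshot/split is being applied" flag with a
// sync.Cond instead of relying on exclusive ownership of raftMu.
type replica struct {
	mu               sync.Mutex
	cond             *sync.Cond
	applyingSnapshot bool // hypothetical state variable
}

func newReplica() *replica {
	r := &replica{}
	r.cond = sync.NewCond(&r.mu)
	return r
}

// beginSnapshot blocks until no other snapshot/split application is in
// flight, then claims the state.
func (r *replica) beginSnapshot() {
	r.mu.Lock()
	for r.applyingSnapshot {
		r.cond.Wait()
	}
	r.applyingSnapshot = true
	r.mu.Unlock()
}

// endSnapshot releases the state and wakes any waiters.
func (r *replica) endSnapshot() {
	r.mu.Lock()
	r.applyingSnapshot = false
	r.cond.Broadcast()
	r.mu.Unlock()
}
```

The explicit flag confines the exclusivity requirement to the snapshot/split path, rather than forcing every raft operation to contend on one big mutex.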

@a-robinson (Contributor Author):

Hmm, that just seems like... a lot, but I missed that sendRaftMessage acquires replicaMu, which it would then do 125x, and we do lock that mutex a bunch during proposing. It seems like an obvious improvement here would be to avoid re-locking over and over again (I bet you could just skip that lock without things blowing up too much for your testing).

Just moving the locking of replica.mu outside the loop over all messages has a pretty negligible impact on throughput and latency. It looks like it may be slightly better, but not by much:

Locking/unlocking in every sendRaftMessage call:
[screenshot: metrics graph, 2017-10-25 11:31:05 am]

Locking/unlocking just once, outside the loop over messages:
[screenshot: metrics graph, 2017-10-25 11:37:54 am]

@a-robinson (Contributor Author):

Naively parallelizing the sending of messages with the applying of committed commands using a sync.WaitGroup had a small but somewhat more noticeable effect on throughput:

[screenshot: metrics graph, 2017-10-25 12:58:43 pm]

Probably not worth spending too much time on in isolation.
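
For reference, the naive parallelization described above could look roughly like this sketch (hypothetical function arguments; not the actual patch that produced the numbers above):

```go
package wgsketch

import "sync"

// processReady sends messages and applies committed entries concurrently
// instead of sequentially, still within the same raftMu critical section.
func processReady(sendMessages, applyCommitted func()) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		sendMessages() // runs concurrently with the apply loop below
	}()
	applyCommitted() // reuse the current goroutine for the larger chunk of work
	wg.Wait()        // don't return (and release raftMu) until both are done
}
```

The wg.Wait() keeps the existing invariant that raftMu isn't released until both halves finish, which is why this only overlaps the two slow steps rather than shrinking the critical section itself.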

knz added the C-cleanup label and removed the refactoring label on Apr 24, 2018
nvanbenschoten added this to the 2.1 milestone on Jul 22, 2018
@nvanbenschoten (Member):

Moving this to 2.2 because we're not going to be able to make any changes here for 2.1.

nvanbenschoten added the A-kv-replication label and removed the A-coreperf label on Oct 16, 2018
@nvanbenschoten (Member):

Closing, as the learnings and ideas presented here have diffused into a number of other issues, many of which have been addressed.
