
storage: break up giant raftMu critical section downstream of raft #19156

Closed
a-robinson opened this issue Oct 10, 2017 · 16 comments
Labels
  • A-kv-replication: Relating to Raft, consensus, and coordination.
  • C-cleanup: Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior.
  • C-performance: Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@a-robinson (Contributor) commented Oct 10, 2017

In testing our single-range write throughput (#18657), it's become clear that raftMu is the primary bottleneck. We hold it for a very long time as handleRaftReadyRaftMuLocked executes.

In my testing in #18657 it's not unusual for raftMu to be held for many tens of milliseconds at a time, and occasionally more than 100ms. And from adding some poor man's tracing to the code, it appears as though the bulk of the time is spent sending messages and applying committed entries, with the latter being the worst (and least predictable) offender.

There seem to be three ways to attack this:

  1. Try to make the message sending and the applying of committed entries faster. There may be fewer big wins available here, but long term it'd likely reduce CPU usage the most.
  2. Parallelize the sending of messages (and maybe also the applying of committed entries?). This would help single range throughput, but potentially at the expense of total cluster throughput in workloads that aren't skewed toward a single range. Load-based splitting (storage: Load-based splitting #19154) could probably get similar results but in a more generalized way.
  3. Break up the big critical section. This would mean unlocking and re-locking raftMu as necessary rather than holding it throughout the entire process. I'll need some guidance (from @bdarnell? @petermattis?) on whether this is even remotely feasible, since it's possible that all of that work really does need to be done as part of a single critical section. If we can break it up, though, it opens up opportunities for pipelining, for example having one goroutine persisting new log entries, one sending messages, and one applying committed entries.

According to the etcd-raft usage instructions, it'd be safe for us to advance the raft node as soon as we've persisted the new raft entries to disk (i.e. before the two slow steps referenced above), but those steps as written today do appear to rely on some things protected by raftMu. That would seemingly make option 3 viable in the sense that handleRaftReadyRaftMuLocked could return as soon as it had synced new entries to disk and Advanced the raft group, with the message-sending and applying continuing in the background so long as we ensure that all messages/entries are processed in the order they were returned by raft.
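
To make that shape concrete, here is a minimal sketch of the idea. All names and types are hypothetical stand-ins, not the actual CockroachDB or etcd-raft APIs: raftMu is held only for the sync-and-advance step, and a single worker goroutine does the sending and applying in Ready order.

```go
package raftready

import "sync"

// ready is a simplified stand-in for raft.Ready; the field types are placeholders.
type ready struct {
	entries          [][]byte // new log entries that must be synced to disk
	committedEntries [][]byte // entries that are now safe to apply
	messages         [][]byte // outgoing raft messages
}

// replica is a hypothetical stand-in for the real Replica; the function
// fields stand in for the real storage, raft, transport, and apply code.
type replica struct {
	raftMu  sync.Mutex
	readyCh chan ready // drained FIFO by a single worker goroutine

	persist func([][]byte) error // sync new entries to the raft log
	advance func(ready)          // acknowledge the Ready to the raft group
	send    func([][]byte)       // deliver outgoing raft messages
	apply   func([][]byte)       // apply committed entries
}

// handleRaftReady holds raftMu only long enough to persist the new entries
// and advance the raft group; the slow work happens on the worker.
func (r *replica) handleRaftReady(rd ready) error {
	r.raftMu.Lock()
	err := r.persist(rd.entries)
	if err == nil {
		r.advance(rd)
	}
	r.raftMu.Unlock()
	if err != nil {
		return err
	}
	// A single producer and a single consumer on one channel preserve FIFO
	// order, so each Ready is still sent/applied in the order raft returned it.
	r.readyCh <- rd
	return nil
}

// runReadyWorker does the message sending and entry application that used to
// happen inside the raftMu critical section.
func (r *replica) runReadyWorker() {
	for rd := range r.readyCh {
		r.send(rd.messages)
		r.apply(rd.committedEntries)
	}
}
```

If the channel fills up, handleRaftReady still blocks on the send, but it does so without holding raftMu, which is the point of the restructuring.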

Sorry for the giant block of text, but I wanted to explain where I'm at with single-range performance and get informed input on the viability of breaking up the big raftMu critical section.

a-robinson added the C-performance and refactoring labels on Oct 10, 2017
@a-robinson (Contributor Author):

And cc #18779, since this could be a bit of a refactoring project.

@petermattis (Collaborator):

I think we need mutual exclusion on handling of Raft ready messages. But perhaps that could be achieved with the addition of another mutex (raftReadyMu?). Gah, my head is hurting looking at this locking again.

@a-robinson (Contributor Author):

Do you think that we need that due to how raft works or due to how our various above-raft mechanisms work? My reading of raft itself is that it would be ok to pipeline the different stages of handling the Raft ready messages, as long as we:

  • Finish persisting new entries before doing anything else with a given Raft ready
  • Advance the Raft state before moving on to a new Raft ready
  • Process each Raft ready in order at each stage of the pipeline (see the sketch below this list)
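
As a rough sketch of those three constraints (hypothetical names; each stage is its own goroutine fed by an ordered channel):

```go
package raftpipeline

// ready is a simplified stand-in for raft.Ready; the field types are
// placeholders, not the real raftpb types.
type ready struct {
	entries          [][]byte // new entries that must be persisted first
	committedEntries [][]byte // entries that may now be applied
	messages         [][]byte // outgoing raft messages
}

// pipeline wires the stages together; the function fields stand in for the
// real persistence, raft, transport, and apply code.
type pipeline struct {
	persist func([][]byte) error // fsync new entries to the raft log
	advance func(ready)          // acknowledge the Ready to the raft group
	send    func([][]byte)       // deliver outgoing messages
	apply   func([][]byte)       // apply committed entries to the state machine
}

// run drains Readys in order. Constraint 1: persist before anything else is
// done with a Ready. Constraint 2: advance before pulling the next Ready.
// Constraint 3: sendC and applyC are each consumed by a single goroutine, so
// every stage sees Readys in the order raft produced them.
func (p *pipeline) run(readyC <-chan ready) {
	sendC := make(chan ready, 16)
	applyC := make(chan ready, 16)

	go func() { // message-sending stage
		for rd := range sendC {
			p.send(rd.messages)
		}
	}()
	go func() { // entry-application stage
		for rd := range applyC {
			p.apply(rd.committedEntries)
		}
	}()

	for rd := range readyC {
		if err := p.persist(rd.entries); err != nil {
			return // a real implementation would mark the replica as corrupted
		}
		p.advance(rd)
		sendC <- rd
		applyC <- rd
	}
	close(sendC)
	close(applyC)
}
```

The channel capacities are arbitrary; they only determine how far the persist stage can run ahead before it blocks on a slow downstream stage.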

Gah, my head is hurting looking at this locking again.

Yeah, it is possible that load-based splitting and optimizing the work being done (rather than the locking/ordering) would be a better place to focus efforts.

@petermattis (Collaborator):

My statement was based on how our various above-Raft mechanisms work. I think it would be problematic for multiple routines to be running processRaftCommand simultaneously.

I can understand why raftMu is held for tens of milliseconds: the call to synchronously commit a batch can be slow. What I'm surprised at is that this causes problems because the only call I see to lock raftMu is in Replica.propose (the one in maybeInitializeRaftGroup should rarely be called). And you said that removing the raftMu.Lock in Replica.propose didn't help performance. So is there some other usage of raftMu that I'm not seeing that is causing problems?

@tbg (Member) commented Oct 11, 2017

Where exactly are we spending significant amounts of time when sending messages? sendRaftMessage itself looks like it should be rather fast, and the remaining stuff there deals with sideloading and doesn't (shouldn't) actually do anything in your testing.

I can understand why raftMu is held for tens of milliseconds: the call to synchronously commit a batch can be slow. What I'm surprised at is that this causes problems because the only call I see to lock raftMu is in Replica.propose

Can we revisit #18657? I'm curious how the single-range case compares against the multi-range one (assuming no raftMu acquired during proposals). Naively the amount of syncing going on should be about the same (assuming enough concurrency). Each proposal should eat about 1-2 sync latencies (since it has to wait out the ongoing sync, and then its own sync) and throughput should be the same. Probably both of my statements are wrong, but I think it's going to be helpful to find out. Also, how does min_wal_sync_interval=1ms or even 5ms influence things?

@a-robinson (Contributor Author) commented Oct 11, 2017

My statement was based on how our various above-Raft mechanisms work. I think it would be problematic for multiple routines to be running processRaftCommand simultaneously.

Ack, thanks.

I can understand why raftMu is held for tens of milliseconds: the call to synchronously commit a batch can be slow.

In my printf-style tracing, the synchronous committing of new raft log entries never took more than a few milliseconds. The truly slow part was doing the rest of the processing after that to respond to raft messages (occasionally) and apply committed commands (most frequently).

So is there some other usage of raftMu that I'm not seeing that is causing problems?

Not that I've noticed. The problem I've observed is that handleRaftReadyRaftMuLocked for a given batch takes a very long time, preventing the next batch of requests from beginning processing (the backlog of these requests getting too large is what causes dropped raft messages). Hence the suggestion to pipeline the different stages of handleRaftReadyRaftMuLocked.

@a-robinson (Contributor Author):

Where exactly are we spending significant amounts of time when sending messages?

The time here doesn't seem to get spent in any large outliers, rather it just takes a while to send so many messages when a large batch of requests has built up. For example, some trace outputs include:

  • Sending ~125 messages taking 4.2ms
  • Sending ~70 messages taking 2.7ms
  • Sending ~175 messages taking 3.5ms
  • Sending ~125 messages taking 7.9ms

It appears to be more of an issue on the leader than on the followers. It's certainly not a ton of time, but every millisecond that we're holding the lock to send messages is blocking new requests from making progress. Although it'd be less impactful than speeding up the applying of committed commands, moving the message-sending out of the critical section would be the easiest change from a safety perspective.

Can we revisit #18657? I'm curious how the single-range case compares against the multi-range one (assuming no raftMu acquired during proposals).

I'll re-test it today, but I can't remember a single configuration of #18657 that I've tested in which a split didn't give a considerable spike in throughput (and a corresponding decrease in dropped raft messages). The bottleneck for single-range writes is waiting for the previous raft ready to finish processing before new messages can be processed.

Each proposal should eat about 1-2 sync latencies (since it has to wait out the ongoing sync, and then its own sync) and throughput should be the same.

That might be true if we didn't build up such a big backlog (100 raft requests) that we started dropping new ones, requiring them to be re-sent. But even so, it's not so much about sync latencies at this point (which are only a few milliseconds) as it is about handleRaftReady processing time as a whole, since the applying of commands is dwarfing the syncing of entries.

Probably both of my statements are wrong, but I think it's going to be helpful to find out. Also, how does min_wal_sync_interval=1ms or even 5ms influence things?

For the single-range case, switching to 1ms had no noticeable effect (which makes sense given that the time taken by apply, which doesn't require syncing, is the main bottleneck). Switching to 5ms hurt throughput and latency by a small but noticeable amount, around 5-10%.

@a-robinson (Contributor Author) commented Oct 11, 2017

Can we revisit #18657? I'm curious how the single-range case compares against the multi-range one

Here's a single split. Ignore the absolute values since they're being affected by my very verbose logging; just focus on the relative changes:

[screenshot: metrics graph, 2017-10-11 10:25:33 am]

[screenshot: metrics graph, 2017-10-11 10:25:48 am]

And here's splitting from 2 ranges to 21 ranges (by setting max range size down to 8MiB):

[screenshot: metrics graph, 2017-10-11 10:31:06 am]

[screenshot: metrics graph, 2017-10-11 10:31:13 am]

@tbg (Member) commented Oct 11, 2017

The time here doesn't seem to get spent in any large outliers, rather it just takes a while to send so many messages when a large batch of requests has built up.

Hmm, that just seems like... a lot, but I missed that sendRaftMessage acquires replicaMu, which it would then do 125x, and we do lock that mutex a bunch during proposing. It seems like an obvious improvement here would be to avoid re-locking over and over again (I bet you could just skip that lock without things blowing up too much for your testing). Once that's done, if it's still not where we'd like it to be, we could have sendRaftMessage be called in a goroutine, or better, have an on-demand worker goroutine that handles message sending (just as you proposed).
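
A hedged sketch of the hoisting idea, using hypothetical names and a single stand-in field for whatever state sendRaftMessage actually reads under replica.mu:

```go
package sendmsgs

import "sync"

// message and replica are hypothetical stand-ins; the real sendRaftMessage
// also deals with sideloading, transport queues, etc.
type message struct{ to uint64 }

type replica struct {
	mu struct {
		sync.Mutex
		destroyed bool // stand-in for state read under replica.mu per message
	}
}

// sendMessages takes replica.mu once for the whole batch instead of once per
// message, avoiding ~125 lock/unlock pairs for a large Ready.
func (r *replica) sendMessages(msgs []message, transportSend func(message)) {
	r.mu.Lock()
	destroyed := r.mu.destroyed // snapshot the shared state once, up front
	r.mu.Unlock()
	if destroyed {
		return
	}
	for _, m := range msgs {
		transportSend(m) // no per-message replica.mu acquisition
	}
}
```

Whether this is actually safe depends on what sendRaftMessage needs replica.mu for; the sketch only shows the lock acquisition moving out of the per-message loop.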

Although it'd be less impactful than speeding up the applying of committed commands

Oh, I see. Sorry, I had made the misguided assumption that by "applying committed entries" you meant the sync, but I'm not sure why I thought that.

You're right that it's unfortunate that we're still holding on to raftMu while we apply committed entries, and your initial idea of moving the Advance up to right after the sync and passing the rest off to a worker goroutine seems like the right way to address this. (The way I imagine these goroutines is that they're singletons that shut down when they've been idle for O(seconds) and are recreated on demand.) I think it shouldn't be too hard to prototype this to see what the impact would be (or rather, to confirm that it gets you where you think we should be).

In carrying out the steps above, we may have to change the mutex the sideloaded storage is tied to (currently raftMu, but there's no deeper reason for that other than it being convenient at the time), but let's see about that when it becomes relevant. Certainly doesn't matter for workloads that don't RESTORE.

@bdarnell (Contributor):

Transplanting my comments from #18657 (comment):

If raftMu is only used to guard against concurrent replica GC (which was the original idea but I'm not sure if it's stayed true as things have evolved), maybe there's room to turn raftMu into a RWMutex and lock it in read mode everywhere but in creation and destruction of the raft group.

The RWMutex suggested at

    // TODO(bdarnell): a Replica could be destroyed immediately after
    // Store.Send finds the Replica and releases the lock. We need
    // another RWMutex to be held by anything using a Replica to ensure
    // that everything is finished before releasing it. #7169

could be a replacement for raftMu.
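
For illustration, the locking pattern being proposed would look roughly like this sketch (hypothetical names, not the actual Replica code):

```go
package rwsketch

import "sync"

// replica is a hypothetical stand-in; raftMu here is the proposed RWMutex.
type replica struct {
	raftMu sync.RWMutex
}

// handleRaftReady and other steady-state raft work would take the lock in
// shared (read) mode, so they no longer serialize against each other just to
// guard against replica destruction.
func (r *replica) handleRaftReady(process func()) {
	r.raftMu.RLock()
	defer r.raftMu.RUnlock()
	process()
}

// destroyRaftGroup (e.g. replica GC) would take the exclusive lock, waiting
// out all in-flight readers before tearing the raft group down.
func (r *replica) destroyRaftGroup(teardown func()) {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()
	teardown()
}
```

As the follow-up comment below notes, the split/snapshot race still requires exclusive access, so this pattern alone wouldn't be sufficient.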

@a-robinson (Contributor Author):

Thanks for helping to clarify things @bdarnell and @tschottdorf! I'm excited to continue this line of testing soon.

@bdarnell (Contributor):

As I rediscovered in #19172 (comment), raftMu has an important use in handling the split/snapshot race (and this use requires an exclusive lock, so it won't be easy to replace it with an RWMutex; a sync.Cond with an explicit state variable might work).
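
A minimal sketch of the sync.Cond alternative, with a hypothetical state variable standing in for whatever the split/snapshot race actually needs to track:

```go
package condsketch

import "sync"

// replica guards an explicit "a snapshot/split is being applied" flag with a
// sync.Cond instead of relying on exclusive ownership of raftMu.
type replica struct {
	mu               sync.Mutex
	cond             *sync.Cond
	applyingSnapshot bool // hypothetical state variable
}

func newReplica() *replica {
	r := &replica{}
	r.cond = sync.NewCond(&r.mu)
	return r
}

// beginSnapshot blocks until no other snapshot/split application is in
// flight, then claims the state.
func (r *replica) beginSnapshot() {
	r.mu.Lock()
	for r.applyingSnapshot {
		r.cond.Wait()
	}
	r.applyingSnapshot = true
	r.mu.Unlock()
}

// endSnapshot releases the state and wakes any waiters.
func (r *replica) endSnapshot() {
	r.mu.Lock()
	r.applyingSnapshot = false
	r.cond.Broadcast()
	r.mu.Unlock()
}
```

The explicit flag confines the exclusivity requirement to the snapshot/split path, rather than forcing every raft operation to contend on one big mutex.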

@a-robinson (Contributor Author):

Hmm, that just seems like... a lot, but I missed that sendRaftMessage acquires replicaMu, which it would then do 125x, and we do lock that mutex a bunch during proposing. It seems like an obvious improvement here would be to avoid re-locking over and over again (I bet you could just skip that lock without things blowing up too much for your testing).

Just moving the locking of replica.mu outside the loop over all messages has a pretty negligible impact on throughput and latency. It looks like it may be slightly better, but not by much:

Locking/unlocking in every sendRaftMessage call:
[screenshot: metrics graph, 2017-10-25 11:31:05 am]

Locking/unlocking just once, outside the loop over messages:
[screenshot: metrics graph, 2017-10-25 11:37:54 am]

@a-robinson (Contributor Author):

Naively parallelizing the sending of messages with the applying of committed commands using a sync.WaitGroup had a small but somewhat more noticeable effect on throughput:

[screenshot: metrics graph, 2017-10-25 12:58:43 pm]

Probably not worth spending too much time on in isolation.
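
For reference, the naive parallelization described above could look roughly like this sketch (hypothetical function arguments; not the actual patch that produced the numbers above):

```go
package wgsketch

import "sync"

// processReady sends messages and applies committed entries concurrently
// instead of sequentially, still within the same raftMu critical section.
func processReady(sendMessages, applyCommitted func()) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		sendMessages() // runs concurrently with the apply loop below
	}()
	applyCommitted() // reuse the current goroutine for the larger chunk of work
	wg.Wait()        // don't return (and release raftMu) until both are done
}
```

The wg.Wait() keeps the existing invariant that raftMu isn't released until both halves finish, which is why this only overlaps the two slow steps rather than shrinking the critical section itself.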

knz added the C-cleanup label and removed the refactoring label on Apr 24, 2018
nvanbenschoten added this to the 2.1 milestone on Jul 22, 2018
@nvanbenschoten (Member):

Moving this to 2.2 because we're not going to be able to make any changes here for 2.1.

nvanbenschoten added the A-kv-replication label and removed the A-coreperf label on Oct 16, 2018
@nvanbenschoten (Member):

Closing, as the learnings and ideas presented here have diffused into a number of other issues, many of which have been addressed.
