storage: break up giant raftMu critical section downstream of raft #19156
Comments
And cc #18779, since this could be a bit of a refactoring project.
I think we need mutual exclusion on handling of Raft ready messages. But perhaps that could be achieved with the addition of another mutex (…).
Do you think that we need that due to how raft works or due to how our various above-raft mechanisms work? My reading of raft itself is that it would be ok to pipeline the different stages of handling the Raft ready messages, as long as we …
Yeah, it is possible that load-based splitting and optimizing the work being done (rather than the locking/ordering) would be a better place to focus efforts.
My statement was based on how our various above-Raft mechanisms work. I think it would be problematic for multiple routines to be running …. I can understand why …
Where exactly are we spending significant amounts of time when sending messages?
Can we revisit #18657? I'm curious how the single-range case compares against the multi-range one (assuming no …).
Ack, thanks.
In my printf-style tracing, the synchronous committing of new raft log entries never took more than a few milliseconds. The truly slow part was doing the rest of the processing after that to respond to raft messages (occasionally) and apply committed commands (most frequently).
Not that I've noticed. The problem I've observed is that …
The time here doesn't seem to get spent in any large outliers; rather, it just takes a while to send so many messages when a large batch of requests has built up. For example, some trace outputs include: …
It appears to be more of an issue on the leader than on the followers. It's certainly not a ton of time, but every millisecond that we're holding the lock to send messages is blocking new requests from making progress. Although it'd be less impactful than speeding up the applying of committed commands, moving the message-sending out of the critical section would be the easiest change from a safety perspective.
I'll re-test it today, but I can't remember a single configuration of #18657 that I've tested in which a split didn't give a considerable spike to throughput (and a corresponding decrease in dropped raft messages). The bottleneck for single-range writes is waiting for the previous raft ready to finish processing before new messages can be processed.
That might be true if we didn't build up so big a backlog (100 raft requests) that we started dropping new ones, requiring them to be re-sent. But even so, it's not so much about sync latencies at this point (which are only a few milliseconds) as it is about …
For the single-range case, switching to 1ms had no noticeable effect (which makes sense given that the time taken by apply, which doesn't require syncing, is the main bottleneck). Switching to 5ms hurt throughput and latency by a small but noticeable amount, around 5-10%.
Here's a single split. Ignore the absolute values since they're being affected by my very verbose logging; just focus on the relative changes:
And here's splitting from 2 ranges to 21 ranges (by setting the max range size down to 8MiB):
Hmm, that just seems like... a lot, but I missed that …
Oh, I see. Sorry, I had made the misguided assumption that by "applying committed entries" you meant the sync, but I'm not sure why I thought that. You're right that it's unfortunate that we're still holding on to raftMu. In carrying out the steps above, we may have to change the mutex the sideloaded storage is tied to (currently …).
Transplanting my comments from #18657 (comment): If raftMu is only used to guard against concurrent replica GC (which was the original idea, but I'm not sure if it's stayed true as things have evolved), maybe there's room to turn raftMu into an RWMutex and lock it in read mode everywhere but in creation and destruction of the raft group. The RWMutex suggested at cockroach/pkg/storage/store.go, lines 440 to 443 (in 1811c61) …
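To make the RWMutex suggestion concrete, here is a minimal sketch with hypothetical names (not the actual Replica/Store types): ready handling and message processing would take the lock in read mode, while replica GC/destruction would take it in write mode. As the follow-up comment below points out, this doesn't cover the split/snapshot race, so treat it as illustrative only.

```go
package main

import "sync"

// replica is a hypothetical stand-in for the real Replica type; the point is
// only to show which paths would use which lock mode.
type replica struct {
	raftMu    sync.RWMutex // hypothetical RWMutex replacement for the current sync.Mutex
	destroyed bool
}

// handleRaftReady takes raftMu in read mode, so ready handling would no
// longer serialize on raftMu itself (any mutual exclusion still needed for
// ready processing would have to come from elsewhere).
func (r *replica) handleRaftReady() {
	r.raftMu.RLock()
	defer r.raftMu.RUnlock()
	if r.destroyed {
		return
	}
	// ... persist entries, send messages, apply committed commands.
}

// destroy is the rare path (replica GC) that needs the lock exclusively.
func (r *replica) destroy() {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()
	r.destroyed = true
	// ... delete the replica's data.
}

func main() {
	r := &replica{}
	r.handleRaftReady()
	r.destroy()
}
```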
Thanks for helping to clarify things @bdarnell and @tschottdorf! I'm excited to continue this line of testing soon.
As I rediscovered in #19172 (comment), raftMu has an important use in handling the split/snapshot race (and this use requires an exclusive lock; it won't be easy to replace it with an RWMutex, though a sync.Cond with an explicit state variable might work).
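A minimal sketch of what the sync.Cond alternative might look like, with a hypothetical explicit state flag for "a snapshot/split is being applied": ordinary ready handling registers itself and proceeds unless that flag is set, while the snapshot/split path waits for in-flight handlers to drain before doing its exclusive work. The names and structure here are assumptions, not the actual code.

```go
package main

import "sync"

// replicaState is a hypothetical holder for the explicit state variable.
type replicaState struct {
	mu           sync.Mutex
	cond         *sync.Cond
	applyingSnap bool // a snapshot or split is currently being applied
	activeReady  int  // number of in-flight ready handlers
}

func newReplicaState() *replicaState {
	s := &replicaState{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// beginReady blocks while a snapshot/split is in progress, then registers an
// in-flight ready handler.
func (s *replicaState) beginReady() {
	s.mu.Lock()
	for s.applyingSnap {
		s.cond.Wait()
	}
	s.activeReady++
	s.mu.Unlock()
}

// endReady deregisters a ready handler and wakes anyone waiting for the
// count to drop.
func (s *replicaState) endReady() {
	s.mu.Lock()
	s.activeReady--
	s.cond.Broadcast()
	s.mu.Unlock()
}

// applySnapshot waits for in-flight ready handling to drain, runs the
// exclusive work, then lets everything resume. (A real version would also
// need to guard against starvation by newly arriving ready handlers.)
func (s *replicaState) applySnapshot(work func()) {
	s.mu.Lock()
	for s.applyingSnap || s.activeReady > 0 {
		s.cond.Wait()
	}
	s.applyingSnap = true
	s.mu.Unlock()

	work() // exclusive section: apply the snapshot / handle the split

	s.mu.Lock()
	s.applyingSnap = false
	s.cond.Broadcast()
	s.mu.Unlock()
}

func main() {
	s := newReplicaState()
	s.beginReady()
	s.endReady()
	s.applySnapshot(func() {})
}
```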
Moving this to 2.2 because we're not going to be able to make any changes here for 2.1.
Closing, as the learnings and ideas presented here have diffused into a number of other issues, many of which have been addressed.
In testing our single-range write throughput (#18657), it's become clear that `raftMu` is the primary bottleneck. We hold it for a very long time as `handleRaftReadyRaftMuLocked` executes. In my testing in #18657 it's not unusual for `raftMu` to be held for many tens of milliseconds at a time, and occasionally more than 100ms. And from adding some poor man's tracing to the code, it appears as though the bulk of the time is spent sending messages and applying committed entries, with the latter being the worst (and least predictable) offender.

There seem to be three ways to attack this. One is to hold `raftMu` only as necessary rather than holding it throughout the entire process. I'll need some guidance (from @bdarnell? @petermattis?) on whether this is even remotely feasible, since it's possible that all of that work really does need to be done as part of a single critical section. If we can break it up, though, it opens up opportunities for pipelining, for example having one goroutine persisting new log entries, one sending messages, and one applying committed entries.

According to the etcd-raft usage instructions, it'd be safe for us to advance the raft node as soon as we've persisted the new raft entries to disk (i.e. before the two slow steps referenced above), but those steps as written today do appear to rely on some things protected by `raftMu`. That would seemingly make option 3 viable in the sense that `handleRaftReadyRaftMuLocked` could return as soon as it had synced new entries to disk and `Advance`d the raft group, with the message-sending and applying continuing in the background, so long as we ensure that all messages/entries are processed in the order they were returned by raft.

Sorry for the giant block of text, but I wanted to explain where I'm at with single-range performance and make it possible to get the information I want about the viability of breaking up the big `raftMu` critical section.
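To illustrate the shape of that last option, here is a minimal sketch with hypothetical types and helpers (not the real raft.Ready or Replica code): the ready handler syncs the new entries, hands the slow work off to two background workers, and returns. Because each worker drains a single FIFO channel, messages and committed entries are still processed in the order raft produced them.

```go
package main

import "sync"

// ready is a stand-in for the parts of raft.Ready that matter here.
type ready struct {
	entries          []string // new log entries to persist (hypothetical representation)
	messages         []string // outgoing raft messages
	committedEntries []string // entries to apply to the state machine
}

type pipeline struct {
	sendCh  chan []string
	applyCh chan []string
	wg      sync.WaitGroup
}

func newPipeline() *pipeline {
	p := &pipeline{
		sendCh:  make(chan []string, 64),
		applyCh: make(chan []string, 64),
	}
	p.wg.Add(2)
	go func() { // message sender, runs outside the critical section
		defer p.wg.Done()
		for msgs := range p.sendCh {
			sendMessages(msgs)
		}
	}()
	go func() { // command applier, also outside the critical section
		defer p.wg.Done()
		for ents := range p.applyCh {
			applyCommitted(ents)
		}
	}()
	return p
}

// handleReady is the part that would still run under raftMu: persist the new
// entries (the sync), enqueue the slow work for the background workers, and
// return so the next Ready can be processed sooner.
func (p *pipeline) handleReady(rd ready) {
	persistEntries(rd.entries) // must complete before advancing raft
	p.sendCh <- rd.messages
	p.applyCh <- rd.committedEntries
	// Advance the raft group here (e.g. RawNode.Advance) before returning.
}

// Hypothetical stand-ins for the real work.
func persistEntries(ents []string) {}
func sendMessages(msgs []string)   {}
func applyCommitted(ents []string) {}

func main() {
	p := newPipeline()
	p.handleReady(ready{
		entries:          []string{"e1"},
		messages:         []string{"m1"},
		committedEntries: []string{"c1"},
	})
	close(p.sendCh)
	close(p.applyCh)
	p.wg.Wait()
}
```

A real version would still have to coordinate with replica destruction and protect whatever apply-side state the current code relies on `raftMu` for, which is exactly the open question this issue raises.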