storage: avoid acquiring raftMu in Replica.propose #24990

Merged
merged 1 commit into cockroachdb:master from nvanbenschoten/raftMu on Jul 30, 2018

Conversation

nvanbenschoten
Member

Related to #9827.

This addresses a TODO to avoid acquiring Replica.raftMu in Replica.propose
in the common case where the replica's internal raft group is non-nil.

This showed up in blocking profiles when running sysbench's oltp_insert
workload. Specifically, the locking of raftMu in Replica.propose was
responsible for 29.3% of blocking in the profile:

![before](https://user-images.githubusercontent.com/5438456/39137377-638396d4-46eb-11e8-8a95-876e3deb91f1.png)
_raftMu locking in Replica.propose is responsible for the second large pillar_
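
Roughly, the change has the following shape: submit the proposal while holding only `Replica.mu`, and fall back to `raftMu` only when the raft group still has to be created. The snippet below is a minimal sketch with simplified stand-in types and a hypothetical `newRaftGroup` helper, not the actual implementation:

```go
package storage

import "sync"

// raftGroup abstracts the single RawNode method this sketch needs.
type raftGroup interface {
	Propose(data []byte) error
}

// Replica is a pared-down stand-in for storage.Replica.
type Replica struct {
	// raftMu serializes raft processing (Ready handling, log writes).
	raftMu sync.Mutex
	mu     struct {
		sync.Mutex
		internalRaftGroup raftGroup // created lazily; nil until first use
	}
	// newRaftGroup stands in for the lazy raft.NewRawNode construction.
	newRaftGroup func() (raftGroup, error)
}

// propose submits a command while holding only Replica.mu when the raft
// group already exists, and acquires raftMu only when the group still has
// to be created.
func (r *Replica) propose(data []byte) error {
	r.mu.Lock()
	if rg := r.mu.internalRaftGroup; rg != nil {
		err := rg.Propose(data) // common case: raftMu is never touched
		r.mu.Unlock()
		return err
	}
	r.mu.Unlock()

	// Rare case: create the group lazily under the original lock ordering,
	// raftMu before Replica.mu, then re-check in case another goroutine
	// created it in the meantime.
	r.raftMu.Lock()
	defer r.raftMu.Unlock()
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.mu.internalRaftGroup == nil {
		rg, err := r.newRaftGroup()
		if err != nil {
			return err
		}
		r.mu.internalRaftGroup = rg
	}
	return r.mu.internalRaftGroup.Propose(data)
}
```

The slow path re-checks the group after reacquiring the locks precisely because the fast path gave them up.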

This change increases throughput of the sysbench workload by 11% and reduces
the reported average latency by 10%. This was averaged over a few runs.
Below is the output of a representative run before and after.

Before

```
SQL statistics:
    queries performed:
        read:                            0
        write:                           202253
        other:                           0
        total:                           202253
    transactions:                        202253 (3368.51 per sec.)
    queries:                             202253 (3368.51 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      3368.5102
    time elapsed:                        60.0423s
    total number of events:              202253

Latency (ms):
         min:                                    0.99
         avg:                                   37.98
         max:                                  141.98
         95th percentile:                       58.92
         sum:                              7681469.72

Threads fairness:
    events (avg/stddev):           1580.1016/4.11
    execution time (avg/stddev):   60.0115/0.01
```

After

```
SQL statistics:
    queries performed:
        read:                            0
        write:                           228630
        other:                           0
        total:                           228630
    transactions:                        228630 (3808.46 per sec.)
    queries:                             228630 (3808.46 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      3808.4610
    time elapsed:                        60.0321s
    total number of events:              228630

Latency (ms):
         min:                                    3.59
         avg:                                   33.60
         max:                                  156.34
         95th percentile:                       51.02
         sum:                              7680843.03

Threads fairness:
    events (avg/stddev):           1786.1719/3.89
    execution time (avg/stddev):   60.0066/0.01
```

After this change, the blocking profile shows that Replica.propose is no longer
a problem:

![after](https://user-images.githubusercontent.com/5438456/39137437-8d4a53ea-46eb-11e8-8faa-eff3ed01914f.png)

Release note (performance improvement): Reduce lock contention in
Replica write path.

@nvanbenschoten nvanbenschoten requested review from petermattis and a team April 23, 2018 15:44
@cockroach-teamcity
Member

This change is Reviewable

@jordanlewis
Member

This is sweet - can't wait to hear whether it affects TPCC.

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, all discussions resolved.


pkg/storage/replica.go, line 564 at r1 (raw file):

		// creation if we gossiped our store descriptor more than the election
		// timeout in the past.
		canCampaign := r.store.canCampaignIdleReplica()

#24920 is getting rid of this logic. You and Ben will have to work out the sequencing of these PRs.


pkg/storage/replica.go, line 3209 at r1 (raw file):

	// optimize for the common case where Replica.mu.internalRaftGroup is
	// non-nil, but that doesn't currently seem worth it. Always locking raftMu
	// has a tiny (~1%) performance hit for single-node block_writer testing.

It's interesting that you are seeing a benefit. Perhaps the rest of the system has evolved enough to make this a bottleneck now when it wasn't previously.



@petermattis
Collaborator

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, 2 unresolved discussions.



@petermattis
Collaborator

Reminder, we're in the stabilization period before the next 2.1 alpha release. We should consider the risk vs reward for this change and consider merging next week (after the 2.1 alpha has been cut) instead of this week.


Review status: 0 of 1 files reviewed at latest revision, 2 unresolved discussions.



@nvanbenschoten
Member Author

TFTR. I'll hold off on merging this until next week and will coordinate with #24920.

@a-robinson
Contributor

This is also related to #19156. Thanks @nvanbenschoten!

@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/replica.go, line 528 at r1 (raw file):

// replicas.
//
// Requires that Replica.mu is held. Also requires that Replica.raftMu is held

This is a weird and subtle rule. I don't like introducing non-standard synchronization rules because they're hard to reason about. How confident are we that the old comment "It appears that we only need to hold Replica.raftMu when calling raft.NewRawNode. We could get fancier with the locking here to" is still true? The race detector can't help us here because raftMu isn't protecting access to any particular piece of memory.


pkg/storage/replica.go, line 3209 at r1 (raw file):

> Previously, petermattis (Peter Mattis) wrote…
>
> It's interesting that you are seeing a benefit. Perhaps the rest of the system has evolved enough to make this a bottleneck now when it wasn't previously.

I bet that comment came before we were fsyncing all log writes. handleRaftReadyRaftMuLocked writes to the log while holding raftMu but not r.mu; it is one of the few places where we hold raftMu by itself.

I wonder if we could refactor handleRaftReady so we didn't have to hold raftMu over the sync.
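
For context, the pattern under discussion looks roughly like the following. This is a heavily simplified, self-contained sketch with stand-in types and a hypothetical `syncLog` helper; the real `handleRaftReadyRaftMuLocked` does far more work:

```go
package storage

import "sync"

// ready is a stand-in for raft.Ready: the entries and state produced by a
// raft step that must be persisted and applied.
type ready struct{ entries [][]byte }

// readyGroup abstracts the RawNode methods this sketch needs.
type readyGroup interface {
	HasReady() bool
	Ready() ready
	Advance(ready)
}

// replica is an independent, pared-down stand-in for storage.Replica.
type replica struct {
	raftMu sync.Mutex
	mu     struct {
		sync.Mutex
		internalRaftGroup readyGroup
	}
	// syncLog stands in for the durable (fsynced) raft log append.
	syncLog func(entries [][]byte) error
}

// handleRaftReady shows why raftMu is held for so long: the Ready is pulled
// out under Replica.mu, but the fsynced log write happens while only raftMu
// is held, so anything else waiting on raftMu blocks for the duration of
// the sync.
func (r *replica) handleRaftReady() error {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()

	r.mu.Lock()
	if r.mu.internalRaftGroup == nil || !r.mu.internalRaftGroup.HasReady() {
		r.mu.Unlock()
		return nil
	}
	rd := r.mu.internalRaftGroup.Ready()
	r.mu.Unlock()

	// The expensive part: durably append the new entries (fsync) while
	// holding raftMu but not Replica.mu.
	if err := r.syncLog(rd.entries); err != nil {
		return err
	}

	r.mu.Lock()
	r.mu.internalRaftGroup.Advance(rd)
	r.mu.Unlock()
	return nil
}
```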



@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions.


pkg/storage/replica.go, line 3209 at r1 (raw file):
@bdarnell Heh, the TODO(peter) comment was written in 7fb9543 (Oct 5, 2016). That was a lifetime ago.

> I wonder if we could refactor handleRaftReady so we didn't have to hold raftMu over the sync.

I did an audit a long time ago about long holds of raftMu. Probably worthwhile to do that again.



@a-robinson
Contributor

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions.


pkg/storage/replica.go, line 3209 at r1 (raw file):

> Previously, petermattis (Peter Mattis) wrote…
>
> @bdarnell Heh, the TODO(peter) comment was written in 7fb9543 (Oct 5, 2016). That was a lifetime ago.
>
> > I wonder if we could refactor handleRaftReady so we didn't have to hold raftMu over the sync.
>
> I did an audit a long time ago about long holds of raftMu. Probably worthwhile to do that again.

As called out in #19156, handleRaftReadyRaftMuLocked is a huge offender.



@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/raftMu branch from 1008280 to 4b83913 May 1, 2018 19:51
@nvanbenschoten
Member Author

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions.


pkg/storage/replica.go, line 528 at r1 (raw file):

> Previously, bdarnell (Ben Darnell) wrote…
>
> This is a weird and subtle rule. I don't like introducing non-standard synchronization rules because they're hard to reason about. How confident are we that the old comment "It appears that we only need to hold Replica.raftMu when calling raft.NewRawNode. We could get fancier with the locking here to" is still true? The race detector can't help us here because raftMu isn't protecting access to any particular piece of memory.

All external mutations to r.mu.internalRaftGroup are performed under Replica.mu lock, so it looks like your suspicion was correct that we never need to hold Replica.raftMu when calling withRaftGroupLocked. This allowed me to simplify things considerably. PTAL.


pkg/storage/replica.go, line 564 at r1 (raw file):

> Previously, petermattis (Peter Mattis) wrote…
>
> #24920 is getting rid of this logic. You and Ben will have to work out the sequencing of these PRs.

#24920 is a much larger change, so that should go in first. It looks like it will be merged tomorrow.



@nvanbenschoten nvanbenschoten changed the title storage: avoid acquiring raftMu in Replica.propose in common case storage: avoid acquiring raftMu in Replica.propose May 1, 2018
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/raftMu branch from 4b83913 to c943275 May 2, 2018 23:41
@bdarnell
Contributor

bdarnell commented May 3, 2018

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions.


pkg/storage/replica.go, line 528 at r1 (raw file):

> Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
>
> All external mutations to r.mu.internalRaftGroup are performed under Replica.mu lock, so it looks like your suspicion was correct that we never need to hold Replica.raftMu when calling withRaftGroupLocked. This allowed me to simplify things considerably. PTAL.

It's certainly simpler, but is it safe? I didn't mean to say that we never need to hold raftMu when calling withRaftGroupLocked; I was asking whether even the lesser claim from that old comment (that we could avoid holding raftMu while calling withRaftGroupLocked if the group already exists) still holds.

I think we need a fresh analysis of what exactly raftMu is needed for so we can figure out whether it's safe to make this change. I can't remember all the subtleties here and I'm not confident that the tests we have are enough to expose any issues.



@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/raftMu branch from c943275 to 5ec656c July 28, 2018 06:52
Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 528 at r1 (raw file):

> Previously, bdarnell (Ben Darnell) wrote…
>
> It's certainly simpler, but is it safe? I didn't mean to say that we never need to hold raftMu when calling withRaftGroupLocked; I was asking whether even the lesser claim from that old comment (that we could avoid holding raftMu while calling withRaftGroupLocked if the group already exists) still holds.
>
> I think we need a fresh analysis of what exactly raftMu is needed for so we can figure out whether it's safe to make this change. I can't remember all the subtleties here and I'm not confident that the tests we have are enough to expose any issues.

I went back through and audited the use of Replica.mu and Replica.raftMu, as well as all manipulation of Replica.mu.internalRaftGroup. I'm now pretty convinced that the original version of this change was correct, so I've reverted the code to that.

If the internalRaftGroup is non-nil and we hold Replica.mu, then withRaftGroupLocked should be safe to call because anyone who resets it to nil (initRaftMuLockedReplicaMuLocked, setReplicaIDRaftMuLockedMuLocked, removeReplicaImpl) also holds Replica.mu and handles local proposals in the same critical section. Additionally, calling Propose on the raft group should be safe even without Replica.raftMu because it's fine if it ends up being concurrent with a call to handleRaftReadyRaftMuLocked: either the raft.Ready will contain the new proposals or it won't.
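
To make that invariant concrete, here is a minimal, self-contained sketch with a hypothetical `destroyRaftGroupLocked` standing in for the teardown paths named above (simplified types, not the actual code):

```go
package storage

import "sync"

// proposalResult is a stand-in for the result a waiting proposer receives.
type proposalResult struct{ err error }

// Replica is an independent, pared-down stand-in; the real type tracks
// pending proposals keyed by command ID.
type Replica struct {
	mu struct {
		sync.Mutex
		internalRaftGroup interface{}                    // nil once destroyed
		proposals         map[string]chan proposalResult // buffered in practice
	}
}

// destroyRaftGroupLocked illustrates the invariant: the group is only reset
// to nil while Replica.mu is held, and the proposals it was tracking are
// resolved in the same critical section. A concurrent propose is serialized
// against this by Replica.mu: either it ran first and its proposal is handed
// back here, or it runs later, observes a nil group, and takes the slow path
// that recreates it.
func (r *Replica) destroyRaftGroupLocked(err error) {
	for id, ch := range r.mu.proposals {
		ch <- proposalResult{err: err} // unblock the waiting proposer
		delete(r.mu.proposals, id)
	}
	r.mu.internalRaftGroup = nil
}
```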

The only thing that looks fishy to me is that this will allow us to call Replica.unquiesceLocked() without the raftMu, but we seem to do that elsewhere as well. Do you have any insight into this?

Contributor

@bdarnell bdarnell left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 528 at r1 (raw file):

> Do you have any insight into this?

Not really. Unquiescing without raftMu seems about the same as Proposing without it.

@nvanbenschoten
Member Author

TFTR.

bors r+

@craig
Contributor

craig bot commented Jul 30, 2018

Timed out

@nvanbenschoten
Member Author

bors r+

craig bot pushed a commit that referenced this pull request Jul 30, 2018
24990: storage: avoid acquiring raftMu in Replica.propose r=nvanbenschoten a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig
Contributor

craig bot commented Jul 30, 2018

Build succeeded

@craig craig bot merged commit 5ec656c into cockroachdb:master Jul 30, 2018
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/raftMu branch July 31, 2018 19:24