storage: avoid acquiring raftMu in Replica.propose #24990

Merged
merged 1 commit into cockroachdb:master from nvanbenschoten/raftMu on Jul 30, 2018

Conversation

nvanbenschoten
Member

Related to #9827.

This addresses a TODO to avoid acquiring Replica.raftMu in Replica.propose
in the common case where the replica's internal raft group is non-nil.

This showed up in blocking profiles when running sysbench's oltp_insert
workload. Specifically, the locking of raftMu in Replica.propose was
responsible for 29.3% of blocking in the profile:

![before](https://user-images.githubusercontent.com/5438456/39137377-638396d4-46eb-11e8-8a95-876e3deb91f1.png)
_raftMu locking in Replica.propose is responsible for the second large pillar_
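
Roughly, the change has the following shape: submit the proposal while holding only `Replica.mu`, and fall back to `raftMu` only when the raft group still has to be created. The snippet below is a minimal sketch with simplified stand-in types and a hypothetical `newRaftGroup` helper, not the actual implementation:

```go
package storage

import "sync"

// raftGroup abstracts the single RawNode method this sketch needs.
type raftGroup interface {
	Propose(data []byte) error
}

// Replica is a pared-down stand-in for storage.Replica.
type Replica struct {
	// raftMu serializes raft processing (Ready handling, log writes).
	raftMu sync.Mutex
	mu     struct {
		sync.Mutex
		internalRaftGroup raftGroup // created lazily; nil until first use
	}
	// newRaftGroup stands in for the lazy raft.NewRawNode construction.
	newRaftGroup func() (raftGroup, error)
}

// propose submits a command while holding only Replica.mu when the raft
// group already exists, and acquires raftMu only when the group still has
// to be created.
func (r *Replica) propose(data []byte) error {
	r.mu.Lock()
	if rg := r.mu.internalRaftGroup; rg != nil {
		err := rg.Propose(data) // common case: raftMu is never touched
		r.mu.Unlock()
		return err
	}
	r.mu.Unlock()

	// Rare case: create the group lazily under the original lock ordering,
	// raftMu before Replica.mu, then re-check in case another goroutine
	// created it in the meantime.
	r.raftMu.Lock()
	defer r.raftMu.Unlock()
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.mu.internalRaftGroup == nil {
		rg, err := r.newRaftGroup()
		if err != nil {
			return err
		}
		r.mu.internalRaftGroup = rg
	}
	return r.mu.internalRaftGroup.Propose(data)
}
```

The slow path re-checks the group after reacquiring the locks precisely because the fast path gave them up.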

This change increases throughput of the sysbench workload by 11% and reduces
the reported average latency by 10%. This was averaged over a few runs.
Below is the output of a representative run before and after.

Before

```
SQL statistics:
    queries performed:
        read:                            0
        write:                           202253
        other:                           0
        total:                           202253
    transactions:                        202253 (3368.51 per sec.)
    queries:                             202253 (3368.51 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      3368.5102
    time elapsed:                        60.0423s
    total number of events:              202253

Latency (ms):
         min:                                    0.99
         avg:                                   37.98
         max:                                  141.98
         95th percentile:                       58.92
         sum:                              7681469.72

Threads fairness:
    events (avg/stddev):           1580.1016/4.11
    execution time (avg/stddev):   60.0115/0.01
```

After

```
SQL statistics:
    queries performed:
        read:                            0
        write:                           228630
        other:                           0
        total:                           228630
    transactions:                        228630 (3808.46 per sec.)
    queries:                             228630 (3808.46 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      3808.4610
    time elapsed:                        60.0321s
    total number of events:              228630

Latency (ms):
         min:                                    3.59
         avg:                                   33.60
         max:                                  156.34
         95th percentile:                       51.02
         sum:                              7680843.03

Threads fairness:
    events (avg/stddev):           1786.1719/3.89
    execution time (avg/stddev):   60.0066/0.01
```

After this change, the blocking profile shows that Replica.propose is no longer
a problem:

![after](https://user-images.githubusercontent.com/5438456/39137437-8d4a53ea-46eb-11e8-8faa-eff3ed01914f.png)

Release note (performance improvement): Reduce lock contention in
Replica write path.

@nvanbenschoten nvanbenschoten requested review from petermattis and a team April 23, 2018 15:44
@cockroach-teamcity
Member

This change is Reviewable

@jordanlewis
Member

This is sweet - can't wait to hear whether it affects TPCC.

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, all discussions resolved.


pkg/storage/replica.go, line 564 at r1 (raw file):

		// creation if we gossiped our store descriptor more than the election
		// timeout in the past.
		canCampaign := r.store.canCampaignIdleReplica()

#24920 is getting rid of this logic. You and Ben will have to work out the sequencing of these PRs.


pkg/storage/replica.go, line 3209 at r1 (raw file):

	// optimize for the common case where Replica.mu.internalRaftGroup is
	// non-nil, but that doesn't currently seem worth it. Always locking raftMu
	// has a tiny (~1%) performance hit for single-node block_writer testing.

It's interesting that you are seeing a benefit. Perhaps the rest of the system has evolved enough to make this a bottleneck now when it wasn't previously.



@petermattis
Collaborator

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, 2 unresolved discussions.



@petermattis
Collaborator

Reminder, we're in the stabilization period before the next 2.1 alpha release. We should consider the risk vs reward for this change and consider merging next week (after the 2.1 alpha has been cut) instead of this week.


Review status: 0 of 1 files reviewed at latest revision, 2 unresolved discussions.



@nvanbenschoten
Member Author

TFTR. I'll hold off on merging this until next week and will coordinate with #24920.

@a-robinson
Contributor

This is also related to #19156. Thanks @nvanbenschoten!

@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 2 unresolved discussions.


pkg/storage/replica.go, line 528 at r1 (raw file):

// replicas.
//
// Requires that Replica.mu is held. Also requires that Replica.raftMu is held

This is a weird and subtle rule. I don't like introducing non-standard synchronization rules because they're hard to reason about. How confident are we that the old comment "It appears that we only need to hold Replica.raftMu when calling raft.NewRawNode. We could get fancier with the locking here to" is still true? The race detector can't help us here because raftMu isn't protecting access to any particular piece of memory.


pkg/storage/replica.go, line 3209 at r1 (raw file):

> Previously, petermattis (Peter Mattis) wrote…
>
> It's interesting that you are seeing a benefit. Perhaps the rest of the system has evolved enough to make this a bottleneck now when it wasn't previously.

I bet that comment came before we were fsyncing all log writes. handleRaftReadyRaftMuLocked writes to the log while holding raftMu but not r.mu; it is one of the few places where we hold raftMu by itself.

I wonder if we could refactor handleRaftReady so we didn't have to hold raftMu over the sync.
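
For context, the pattern under discussion looks roughly like the following. This is a heavily simplified, self-contained sketch with stand-in types and a hypothetical `syncLog` helper; the real `handleRaftReadyRaftMuLocked` does far more work:

```go
package storage

import "sync"

// ready is a stand-in for raft.Ready: the entries and state produced by a
// raft step that must be persisted and applied.
type ready struct{ entries [][]byte }

// readyGroup abstracts the RawNode methods this sketch needs.
type readyGroup interface {
	HasReady() bool
	Ready() ready
	Advance(ready)
}

// replica is an independent, pared-down stand-in for storage.Replica.
type replica struct {
	raftMu sync.Mutex
	mu     struct {
		sync.Mutex
		internalRaftGroup readyGroup
	}
	// syncLog stands in for the durable (fsynced) raft log append.
	syncLog func(entries [][]byte) error
}

// handleRaftReady shows why raftMu is held for so long: the Ready is pulled
// out under Replica.mu, but the fsynced log write happens while only raftMu
// is held, so anything else waiting on raftMu blocks for the duration of
// the sync.
func (r *replica) handleRaftReady() error {
	r.raftMu.Lock()
	defer r.raftMu.Unlock()

	r.mu.Lock()
	if r.mu.internalRaftGroup == nil || !r.mu.internalRaftGroup.HasReady() {
		r.mu.Unlock()
		return nil
	}
	rd := r.mu.internalRaftGroup.Ready()
	r.mu.Unlock()

	// The expensive part: durably append the new entries (fsync) while
	// holding raftMu but not Replica.mu.
	if err := r.syncLog(rd.entries); err != nil {
		return err
	}

	r.mu.Lock()
	r.mu.internalRaftGroup.Advance(rd)
	r.mu.Unlock()
	return nil
}
```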



@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions.


pkg/storage/replica.go, line 3209 at r1 (raw file):
@bdarnell Heh, the TODO(peter) comment was written in 7fb9543 (Oct 5, 2016). That was a lifetime ago.

> I wonder if we could refactor handleRaftReady so we didn't have to hold raftMu over the sync.

I did an audit a long time ago about long holds of raftMu. Probably worthwhile to do that again.



@a-robinson
Contributor

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions.


pkg/storage/replica.go, line 3209 at r1 (raw file):

> Previously, petermattis (Peter Mattis) wrote…
>
> @bdarnell Heh, the TODO(peter) comment was written in 7fb9543 (Oct 5, 2016). That was a lifetime ago.
>
> > I wonder if we could refactor handleRaftReady so we didn't have to hold raftMu over the sync.
>
> I did an audit a long time ago about long holds of raftMu. Probably worthwhile to do that again.

As called out in #19156, handleRaftReadyRaftMuLocked is a huge offender.



@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/raftMu branch from 1008280 to 4b83913 May 1, 2018 19:51
@nvanbenschoten
Member Author

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions.


pkg/storage/replica.go, line 528 at r1 (raw file):

> Previously, bdarnell (Ben Darnell) wrote…
>
> This is a weird and subtle rule. I don't like introducing non-standard synchronization rules because they're hard to reason about. How confident are we that the old comment "It appears that we only need to hold Replica.raftMu when calling raft.NewRawNode. We could get fancier with the locking here to" is still true? The race detector can't help us here because raftMu isn't protecting access to any particular piece of memory.

All external mutations to r.mu.internalRaftGroup are performed under Replica.mu lock, so it looks like your suspicion was correct that we never need to hold Replica.raftMu when calling withRaftGroupLocked. This allowed me to simplify things considerably. PTAL.


pkg/storage/replica.go, line 564 at r1 (raw file):

> Previously, petermattis (Peter Mattis) wrote…
>
> #24920 is getting rid of this logic. You and Ben will have to work out the sequencing of these PRs.

#24920 is a much larger change, so that should go in first. It looks like it will be merged tomorrow.



@nvanbenschoten nvanbenschoten changed the title storage: avoid acquiring raftMu in Replica.propose in common case storage: avoid acquiring raftMu in Replica.propose May 1, 2018
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/raftMu branch from 4b83913 to c943275 May 2, 2018 23:41
@bdarnell
Contributor

bdarnell commented May 3, 2018

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions.


pkg/storage/replica.go, line 528 at r1 (raw file):

> Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
>
> All external mutations to r.mu.internalRaftGroup are performed under Replica.mu lock, so it looks like your suspicion was correct that we never need to hold Replica.raftMu when calling withRaftGroupLocked. This allowed me to simplify things considerably. PTAL.

It's certainly simpler, but is it safe? I didn't mean to say that we never need to hold raftMu when calling withRaftGroupLocked; I was asking whether even the lesser claim from that old comment (that we could avoid holding raftMu while calling withRaftGroupLocked if the group already exists) still holds.

I think we need a fresh analysis of what exactly raftMu is needed for so we can figure out whether it's safe to make this change. I can't remember all the subtleties here and I'm not confident that the tests we have are enough to expose any issues.



@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/raftMu branch from c943275 to 5ec656c July 28, 2018 06:52
Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 528 at r1 (raw file):

> Previously, bdarnell (Ben Darnell) wrote…
>
> It's certainly simpler, but is it safe? I didn't mean to say that we never need to hold raftMu when calling withRaftGroupLocked; I was asking whether even the lesser claim from that old comment (that we could avoid holding raftMu while calling withRaftGroupLocked if the group already exists) still holds.
>
> I think we need a fresh analysis of what exactly raftMu is needed for so we can figure out whether it's safe to make this change. I can't remember all the subtleties here and I'm not confident that the tests we have are enough to expose any issues.

I went back through and audited the use of Replica.mu and Replica.raftMu, as well as all manipulation of Replica.mu.internalRaftGroup. I'm now pretty convinced that the original version of this change was correct, so I've reverted the code to that.

If the internalRaftGroup is non-nil and we hold Replica.mu, then withRaftGroupLocked should be safe to call because anyone who resets it to nil (initRaftMuLockedReplicaMuLocked, setReplicaIDRaftMuLockedMuLocked, removeReplicaImpl) also holds Replica.mu and handles local proposals in the same critical section. Additionally, calling Propose on the raft group should be safe even without Replica.raftMu because it's fine if it ends up being concurrent with a call to handleRaftReadyRaftMuLocked: either the raft.Ready will contain the new proposals or it won't.
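
To make that invariant concrete, here is a minimal, self-contained sketch with a hypothetical `destroyRaftGroupLocked` standing in for the teardown paths named above (simplified types, not the actual code):

```go
package storage

import "sync"

// proposalResult is a stand-in for the result a waiting proposer receives.
type proposalResult struct{ err error }

// Replica is an independent, pared-down stand-in; the real type tracks
// pending proposals keyed by command ID.
type Replica struct {
	mu struct {
		sync.Mutex
		internalRaftGroup interface{}                    // nil once destroyed
		proposals         map[string]chan proposalResult // buffered in practice
	}
}

// destroyRaftGroupLocked illustrates the invariant: the group is only reset
// to nil while Replica.mu is held, and the proposals it was tracking are
// resolved in the same critical section. A concurrent propose is serialized
// against this by Replica.mu: either it ran first and its proposal is handed
// back here, or it runs later, observes a nil group, and takes the slow path
// that recreates it.
func (r *Replica) destroyRaftGroupLocked(err error) {
	for id, ch := range r.mu.proposals {
		ch <- proposalResult{err: err} // unblock the waiting proposer
		delete(r.mu.proposals, id)
	}
	r.mu.internalRaftGroup = nil
}
```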

The only thing that looks fishy to me is that this will allow us to call Replica.unquiesceLocked() without the raftMu, but we seem to do that elsewhere as well. Do you have any insight into this?

Contributor

@bdarnell bdarnell left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 528 at r1 (raw file):

> Do you have any insight into this?

Not really. Unquiescing without raftMu seems about the same as Proposing without it.

@nvanbenschoten
Member Author

TFTR.

bors r+

@craig
Contributor

craig bot commented Jul 30, 2018

Timed out

@nvanbenschoten
Member Author

bors r+

craig bot pushed a commit that referenced this pull request Jul 30, 2018
24990: storage: avoid acquiring raftMu in Replica.propose r=nvanbenschoten a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig
Contributor

craig bot commented Jul 30, 2018

Build succeeded

@craig craig bot merged commit 5ec656c into cockroachdb:master Jul 30, 2018
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/raftMu branch July 31, 2018 19:24