
kv: re-tune Raft knobs for high throughput WAN replication #49619

Merged 1 commit into cockroachdb:master on May 28, 2020

Conversation

nvanbenschoten (Member)

Fixes #49564.

The knobs were tuned four years ago with a focus on stability during the code yellow (see 89e8fe3 and 36b7640). They were tuned primarily in response to observed instability due to long handleRaftReady pauses. Since then, a lot has changed:

  • we now batch the application of Raft entries in the same raft.Ready struct
  • we now acknowledge the proposers of Raft entries before their application
  • we now set a MaxCommittedSizePerReady value to prevent the committed entries in a single raft.Ready struct from growing too large. This knob, introduced a few years ago, appears on its own to invalidate the motivating reason for tuning the other knobs so low
  • we now default to 512 MB ranges, so we expect roughly 1/8 as many total ranges on a node

In response to these database changes, this commit makes the following adjustments to the replication knobs:
- increase RaftMaxSizePerMsg from 16 KB to 32 KB
- increase RaftMaxInflightMsgs from 64 to 128
- increase RaftLogTruncationThreshold from 4 MB to 8 MB
- increase RaftProposalQuota from 1 MB to 4 MB

Combined, these changes increase the per-replica replication window size (RaftMaxInflightMsgs × RaftMaxSizePerMsg) from 1 MB to 4 MB. This should allow for higher-throughput replication, especially over high-latency links.
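
For intuition, here is a minimal Go sketch (not code from this PR) of the window arithmetic and the per-follower throughput ceiling that window implies; the 100 ms RTT is an assumed figure for a cross-continent link, not a measurement from this test:

```go
// Sketch: per-follower replication window implied by the Raft knobs, and
// the throughput ceiling that window imposes over a given round-trip time.
// The RTT below is an assumed value, not taken from the experiment.
package main

import "fmt"

func main() {
	const (
		oldMaxSizePerMsg   = 16 << 10 // 16 KB per append message
		oldMaxInflightMsgs = 64
		newMaxSizePerMsg   = 32 << 10 // 32 KB per append message
		newMaxInflightMsgs = 128

		rtt = 0.100 // assumed leaseholder<->follower round-trip time in seconds
	)

	oldWindow := oldMaxSizePerMsg * oldMaxInflightMsgs // 1 MB un-acked per follower
	newWindow := newMaxSizePerMsg * newMaxInflightMsgs // 4 MB un-acked per follower

	// With at most `window` bytes outstanding to a follower, sustained
	// replication throughput to that follower is bounded by window / RTT.
	fmt.Printf("old: %d MB window, ~%.0f MB/s ceiling\n", oldWindow>>20, float64(oldWindow)/(1<<20)/rtt)
	fmt.Printf("new: %d MB window, ~%.0f MB/s ceiling\n", newWindow>>20, float64(newWindow)/(1<<20)/rtt)
}
```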

To test this, we run a global cluster (nodes in us-east1, us-west1, and europe-west1) and write 10 KB blocks as fast as possible to a single Range. This is similar to a workload we see customers run in testing and production environments.

[screenshot: Screen Shot 2020-05-27 at 7 10 50 PM]

```
# Setup cluster
roachprod create nathan-geo -n=4 --gce-machine-type=n1-standard-16 --geo --gce-zones='us-east1-b,us-west1-b,europe-west1-b'
roachprod stage nathan-geo cockroach
roachprod start nathan-geo:1-3

# Setup dataset
roachprod run nathan-geo:4 -- './cockroach workload init kv {pgurl:1}'
roachprod sql nathan-geo:1 -- -e "ALTER TABLE kv.kv CONFIGURE ZONE USING constraints = COPY FROM PARENT, lease_preferences = '[[+region=us-east1]]'"

# Run workload before tuning
roachprod stop nathan-geo:1-3 && roachprod start nathan-geo:1-3
roachprod run nathan-geo:4 -- './cockroach workload run kv --ramp=15s --duration=3m --sequential --min-block-bytes=10000 --max-block-bytes=10000 --concurrency=128 {pgurl:1}'

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0         115524          641.8    199.3    201.3    251.7    285.2    604.0  write

# Run workload after tuning
roachprod stop nathan-geo:1-3 && COCKROACH_RAFT_MAX_INFLIGHT_MSGS=128 COCKROACH_RAFT_MAX_SIZE_PER_MSG=32768 COCKROACH_RAFT_LOG_TRUNCATION_THRESHOLD=16777216 roachprod start nathan-geo:1-3
roachprod run nathan-geo:4 -- './cockroach workload run kv --ramp=15s --duration=3m --sequential --min-block-bytes=10000 --max-block-bytes=10000 --concurrency=128 --write-seq=S123829 {pgurl:1}'

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
  180.0s        0         288512         1602.9     79.8     75.5    104.9    209.7    738.2  write
```

Before the change, we see p50 latency at 3x the expected replication latency. This is due to throttling on the leader replica. After the change, we see p50 latencies exactly where we expect them to be. Higher percentile latencies improve accordingly.

[screenshot: Screen Shot 2020-05-27 at 7 10 31 PM]

We also see a 150% increase in throughput on the workload (641.8 ops/sec to 1602.9 ops/sec). This is reflected in the rate at which we write to disk, which jumps from ~45 MB/s on each node to ~120 MB/s on each node.

[screenshot: Screen Shot 2020-05-27 at 7 10 21 PM]

Finally, we do not see a corresponding increase in Raft ready latency, which was the driving reason for the knobs being tuned so low.

Release note (performance improvement): default replication configurations have been tuned to support higher replication throughput in high-latency replication quorums.


@petermattis (Collaborator) left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @petermattis)

@irfansharif (Contributor) left a comment


LGTM. We could add some text for how one would determine appropriate values for these knobs. I imagine it's some function of the amount of data you'd expect to be on the wire, in transit, between the leaseholder and follower replicas at any given point in time. You'd then use that limit as the threshold after which you'd want to start throttling at the leaseholder. Is that right?

@nvanbenschoten force-pushed the nvanbenschoten/raftTune branch from 3d89041 to 258b965 on May 28, 2020 16:40
@nvanbenschoten (Member, Author)

TFTRs!

> We could add some text for how one would determine appropriate values for these knobs. ...

I added a bit more around RaftMaxSizePerMsg because it was under-commented. There was already some detail written about this on the other fields in the RaftConfig struct. Specifically, it discussed the combination of RaftMaxInflightMsgs and RaftMaxSizePerMsg as the window size between leaders and followers, which I think answers your question. This is helpful for understanding, but I don't think there's a substitute for reading how these configs are used, considering system-level constraints, and running tests when tuning them.
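
As a rough illustration of that sizing intuition, here is a hypothetical helper (not code from this PR; the bandwidth and RTT figures are assumptions) that derives RaftMaxInflightMsgs from a bandwidth-delay product:

```go
// Hypothetical sizing helper (not part of CockroachDB): choose
// RaftMaxInflightMsgs so that MaxInflightMsgs * MaxSizePerMsg covers the
// bandwidth-delay product of the leaseholder->follower link, i.e. the
// data you expect to have in flight before the leaseholder starts throttling.
package main

import "fmt"

func inflightMsgsFor(targetBytesPerSec, rttSeconds float64, maxSizePerMsg int) int {
	bdp := targetBytesPerSec * rttSeconds // bytes on the wire at any instant
	return int(bdp)/maxSizePerMsg + 1
}

func main() {
	// Assumed targets: 40 MB/s of replication traffic over a 100 ms RTT link,
	// with the new 32 KB RaftMaxSizePerMsg.
	fmt.Println(inflightMsgsFor(40<<20, 0.100, 32<<10)) // prints 129
}
```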

bors r+

@craig (craig bot) commented May 28, 2020

Build failed (retrying...)

@craig (craig bot) commented May 28, 2020

Build succeeded

@craig (craig bot) merged commit 163affa into cockroachdb:master on May 28, 2020
@nvanbenschoten deleted the nvanbenschoten/raftTune branch on June 1, 2020 14:46