
perf: investigate using RocksDB 2PC mechanism for Raft log #16948

Closed · petermattis opened this issue Jul 9, 2017 · 13 comments
Labels: A-storage (Relating to our storage engine (Pebble) on-disk storage.) · C-performance (Perf of queries or internals. Solution not expected to change functional behavior.)

@petermattis (Collaborator)

RocksDB supports a two-phase commit (2PC) mechanism for use by RocksDB transactions (which we do not currently use).

The 2PC mechanism allows persisting a WriteBatch (via a prepare operation) to the RocksDB WAL without the mutations being applied to the MemTable. A subsequent WriteBatch can be applied which either commits or rolls back the prepared mutations. The upshot of this mechanism is that a set of mutations can be written to the WAL for persistence and only later added to the MemTable. This differs from the normal mode of applying a WriteBatch where the batch is atomically written to the log and the operations are immediately added to the MemTable.
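For reference, here is a minimal sketch of how these hooks are exposed through RocksDB's TransactionDB API; the database path, key, value, and transaction name are placeholders, and this is not CockroachDB code:

```cpp
#include <cassert>
#include "rocksdb/utilities/transaction_db.h"

using namespace rocksdb;

int main() {
  Options options;
  options.create_if_missing = true;
  TransactionDBOptions txn_db_options;
  TransactionDB* db;
  Status s = TransactionDB::Open(options, txn_db_options, "/tmp/2pc-demo", &db);
  assert(s.ok());

  WriteOptions write_options;
  write_options.sync = true;  // make the prepare/commit writes durable in the WAL

  // 2PC requires a named transaction.
  Transaction* txn = db->BeginTransaction(write_options);
  txn->SetName("raft-entry-42");
  txn->Put("key", "value");

  // Prepare: the mutations are written to the WAL, but not to the MemTable.
  s = txn->Prepare();
  assert(s.ok());

  // Commit: only a commit marker hits the WAL; the prepared mutations are now
  // inserted into the MemTable (Rollback() would discard them instead).
  s = txn->Commit();
  assert(s.ok());

  delete txn;
  delete db;
  return 0;
}
```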

Currently, the Raft leaders and followers have essentially the same behavior with respect to the Raft log. The Raft state machine tells us to append some entries to the Raft log and sometime later it tells us to "commit" those entries (i.e. apply them). In the steady/happy state, Raft log entries are appended to the log and almost immediately committed. A short while later, a heuristic triggers truncation of the Raft log, removing entries that have been committed on all of the replicas. Currently that heuristic allows a modest number of entries to build up before truncation, but there isn't a strict need for that. We could truncate the Raft log on followers immediately after an entry has been committed. Doing so could make catch-up following a change in leadership more expensive, but let's ignore that for now.

So the steady/happy state on a follower essentially looks like:

  1. Write Raft log entry X
  2. Apply Raft log entry X
  3. Delete Raft log entry X

And these operations happen in quick succession. The time between writing a Raft log entry and applying it is measured in milliseconds, and deletion happens quickly as well. Under the hood, the above operations look like:

  1. Write Raft log entry X (write Raft log entry to RocksDB WAL and insert into MemTable)
  2. Apply Raft log entry X (apply batch inside Raft log entry: write to RocksDB WAL and insert into MemTable)
  3. Delete Raft log entry X (add tombstone for Raft log entry: write to RocksDB WAL and insert into MemTable)

The contents of the Raft log entry are written twice to the WAL and twice to the MemTable. The MemTable is often flushed before the deletion occurs, so we experience 4x write amplification even before we start talking about normal RocksDB write amplification.
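To make the amplification concrete, here is a rough picture of those three steps expressed as plain RocksDB writes. The helper raftLogKey and the serialized entry/batch arguments are hypothetical stand-ins for what the real code does:

```cpp
#include <cstdint>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

// Hypothetical helper: the key under which entry `index` of range `rangeID`
// is stored in the Raft log.
std::string raftLogKey(uint64_t rangeID, uint64_t index);

// The three steps of today's steady state, each a separate synced write that
// goes to both the WAL and the MemTable.
void appendApplyTruncate(rocksdb::DB* db, uint64_t rangeID, uint64_t index,
                         const std::string& serializedEntry,
                         const std::string& serializedCommandBatch) {
  rocksdb::WriteOptions wo;
  wo.sync = true;

  // 1. Append: the entry payload is written to the WAL and the MemTable.
  rocksdb::WriteBatch append;
  append.Put(raftLogKey(rangeID, index), serializedEntry);
  db->Write(wo, &append);

  // 2. Apply: the command's own WriteBatch is replayed, so the payload hits
  //    the WAL and the MemTable a second time.
  rocksdb::WriteBatch apply(serializedCommandBatch);
  db->Write(wo, &apply);

  // 3. Truncate: a tombstone for the entry, again WAL + MemTable.
  rocksdb::WriteBatch truncate;
  truncate.Delete(raftLogKey(rangeID, index));
  db->Write(wo, &truncate);
}
```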

Can we use the RocksDB 2PC mechanism to eliminate part of this overhead? The high-level idea is to "prepare" a Raft log entry to RocksDB causing it to be written to the WAL. If the entry is committed we then write the commit marker to the WAL causing the entry to be both applied and the Raft log entry deleted. On startup, RocksDB scans the WAL and gives us access to the prepared-but-not-committed transactions which correspond to uncommitted Raft log entries. On leadership change, we'd want to rebuild the indexable Raft log so that we can fall back to the existing code. I haven't put much thought into how that would work.
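Very roughly, and purely as a sketch of the idea rather than of any real code, the proposal maps onto the transaction API like this (the transaction-naming helper and the single key/value mutation are illustrative):

```cpp
#include <cstdint>
#include <string>
#include "rocksdb/utilities/transaction_db.h"

// Hypothetical name for the prepared transaction carrying entry `index`.
std::string raftEntryTxnName(uint64_t rangeID, uint64_t index);

// Append: prepare the command's effects. Only the WAL is written; nothing is
// added to the MemTable yet, and no separate Raft-log key exists.
rocksdb::Transaction* prepareEntry(rocksdb::TransactionDB* db,
                                   uint64_t rangeID, uint64_t index,
                                   const rocksdb::Slice& key,
                                   const rocksdb::Slice& value) {
  rocksdb::WriteOptions wo;
  wo.sync = true;
  rocksdb::Transaction* txn = db->BeginTransaction(wo);
  txn->SetName(raftEntryTxnName(rangeID, index));
  txn->Put(key, value);  // stand-in for the command's WriteBatch contents
  txn->Prepare();
  return txn;
}

// Commit of the Raft entry: a commit marker is written to the WAL and the
// prepared mutations reach the MemTable. Since the entry itself was never
// written as a key, there is nothing to delete at truncation time.
// (txn->Rollback() would be the path for an entry that loses a log conflict.)
void commitEntry(rocksdb::Transaction* txn) {
  txn->Commit();
  delete txn;
}
```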

This would be a large and tricky change. There are likely gaping holes in this proposal. The migration story is absent. The expected performance gains need to be verified via experimentation. What does the system do in the unhappy state where one of the replicas is significantly behind?

Cc @irfansharif, @bdarnell, @tschottdorf

@petermattis added the C-performance label Jul 9, 2017
@petermattis added this to the 1.2 milestone Jul 9, 2017
@bdarnell (Contributor)

> The MemTable is often flushed before the deletion occurs

With @irfansharif's work we can tune the raft log rocksdb instance separately, making this less likely.

> Can we use the RocksDB 2PC mechanism to eliminate part of this overhead? The high-level idea is to "prepare" a Raft log entry to RocksDB causing it to be written to the WAL. If the entry is committed we then write the commit marker to the WAL causing the entry to be both applied and the Raft log entry deleted.

I'm not sure I follow. Is the idea that raft log entries are persisted solely in the WAL and never written to the memtable (and therefore never written to on-disk sstables)?

I think that using a separate rocksdb instance for the raft log gives us most of the potential benefit here. Separately, we could truncate more aggressively on followers to minimize the number of log entries that live long enough to be written to sstables (instead of driving truncation from the leader and the TruncateLog rpc), but that increases the risk that when leadership changes, we'll have to send snapshots instead of just the tail of the log.

@petermattis (Collaborator, Author)

> > The MemTable is often flushed before the deletion occurs
>
> With @irfansharif's work we can tune the raft log rocksdb instance separately, making this less likely.

I'm not sure if tuning RocksDB will help. I think we'd need to bump up how aggressively we truncate the Raft log. One thought I just had is that we could indicate that the Raft log should be truncated in quiesce heartbeat messages.

> I'm not sure I follow. Is the idea that raft log entries are persisted solely in the WAL and never written to the memtable (and therefore never written to on-disk sstables)?

The idea is that Raft log entries are solely persisted to the WAL and the eventual application of the entry does not involve writing the contents of the Raft log (the WriteBatch) to the WAL again.

@irfansharif (Contributor)

> I think we'd need to bump up how aggressively we truncate the Raft log

As mentioned here, whenever we truncate we need to first sync the primary RocksDB instance. I repeated this detail in the implementation here, and my first thought was actually to reduce how often we truncate.

@petermattis (Collaborator, Author)

> I think that using a separate rocksdb instance for the raft log gives us most of the potential benefit here.

With the separate rocksdb instance for the Raft log, we're still writing the WriteBatch for each command: once to the Raft log rocksdb instance WAL and once to the normal rocksdb instance WAL. The proposal here would allow us to write it only once. If this wasn't clear from the above, I need to work on my writing clarity.

@bdarnell (Contributor)

Oh, so the "prepare" is not writing the entry to the log, it's effectively applying the command but leaving the transaction open. Interesting. I think that could work and would reduce our write amplification by a lot. But there are a lot of open questions about how we'd reconstruct the log entries after leadership changes, and the increased use of snapshots might be too big of a downside. It's worth exploring further.

@petermattis (Collaborator, Author)

Oh, so the "prepare" is not writing the entry to the log, it's effectively applying the command but leaving the transaction open. Interesting. I think that could work and would reduce our write amplification by a lot. But there are a lot of open questions about how we'd reconstruct the log entries after leadership changes, and the increased use of snapshots might be too big of a downside. It's worth exploring further.

It's actually the reverse: the "prepare" writes the entry to the WAL but does not add it to the memtable. The subsequent "commit" adds to the memtable but does not write the mutations to the WAL (only a "commit" marker is written). I need a whiteboard to diagram exactly how this works.

Yes, there are a ton of open questions here.

@bdarnell (Contributor)

> It's actually the reverse: the "prepare" writes the entry to the WAL but does not add it to the memtable.

Yeah, that's what I meant. We'd go through most of the applyRaftCommand code path when "appending the entry to the log" (which is mostly applying the write batch, but there's a little more to it) and prepare that as a transaction, then commit it (and run all the proposal side effects) where we currently call applyRaftCommand.

@petermattis (Collaborator, Author)

> Yeah, that's what I meant. We'd go through most of the applyRaftCommand code path when "appending the entry to the log" (which is mostly applying the write batch, but there's a little more to it) and prepare that as a transaction, then commit it (and run all the proposal side effects) where we currently call applyRaftCommand.

Yeah, something like that. I'm not sure if we'd want to actually use a RocksDB transaction, though. There appears to be some associated lock manager that we might want to avoid. We might need to add a separate blob of "log data" to the WriteBatch so we can reconstruct the Raft log entry if the happy/fast path fails.
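PutLogData is an existing WriteBatch facility that could carry such a blob. A small illustration (the key, value, and function name are invented for the example):

```cpp
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

void writeWithLogData(rocksdb::DB* db, const std::string& serializedRaftEntry) {
  rocksdb::WriteBatch batch;
  // Normal mutation: replayed into the MemTable when the batch is applied.
  batch.Put("some/key", "some/value");
  // LogData: written to the WAL alongside the batch but never applied to the
  // MemTable or sstables; it is only visible when iterating the WAL (e.g.
  // during recovery), which makes it a candidate for stashing the raw entry.
  batch.PutLogData(serializedRaftEntry);

  rocksdb::WriteOptions wo;
  wo.sync = true;
  db->Write(wo, &batch);
}
```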

@petermattis (Collaborator, Author)

RocksDB guarantees it won't garbage collect log files with prepared-but-uncommitted transactions. When appending the Raft log entry to the WAL, it would be great if we could add a key indicating the position of the entry in the WAL. This isn't currently possible with RocksDB but seems like a feasible enhancement. We'd want to eventually abort these Raft log entry transactions in order to allow the WAL files to get GC'd and so that startup time isn't severely impacted (I believe RocksDB has to scan all of the WAL files that contain such transactions). So perhaps that makes this unattractive. Oh well, more food for thought.
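Resolving those stale prepared transactions would presumably look something like the following sketch, which assumes the transaction was given a name at prepare time (the helper and name are hypothetical):

```cpp
#include <string>
#include "rocksdb/utilities/transaction_db.h"

// A prepared-but-never-committed transaction pins the WAL file that contains
// it. To let that file be garbage collected (and to keep recovery-time WAL
// scans short), the transaction eventually has to be resolved, e.g. rolled
// back once the corresponding Raft entry is known to be obsolete.
void abortStaleEntry(rocksdb::TransactionDB* db, const std::string& txnName) {
  rocksdb::Transaction* txn = db->GetTransactionByName(txnName);
  if (txn != nullptr) {
    txn->Rollback();
    delete txn;
  }
}
```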

@petermattis (Collaborator, Author) commented Aug 18, 2017

Here's my plan for experimentation:

  • Replace the current code that appends a Raft log entry with code which creates a RocksDB WriteBatch and stores the entry in the WAL using WriteBatch::PutLogData. LogData entries are not written to sstables or otherwise processed by RocksDB.
  • Increase the size of the in-memory Raft log entry cache which will need to handle all reads from the Raft log.
  • Set the WriteOptions::disableWAL flag when applying the Raft command when a Raft log entry is committed.
  • I'll likely need a cluster setting to enable this mode of operation because bootstrapping relies on being able to read persisted Raft log entries.

Together I think these changes (which are relatively small) will provide the same performance as using 2PC. We'd do a synchronous commit of the Raft log entry to the WAL during append and only write to the memtable/sstables (not the WAL) when applying the entry.
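In RocksDB terms, the two writes in that plan would look roughly like the following; the real change lives in CockroachDB's Go/cgo storage layer, so this C++ fragment only shows the corresponding RocksDB calls, with placeholder names:

```cpp
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

// Append path: the Raft entry only goes to the WAL, as uninterpreted LogData.
void appendEntry(rocksdb::DB* db, const std::string& serializedRaftEntry) {
  rocksdb::WriteOptions wo;
  wo.sync = true;  // synchronous commit of the entry to the WAL
  rocksdb::WriteBatch batch;
  batch.PutLogData(serializedRaftEntry);  // never reaches the MemTable or sstables
  db->Write(wo, &batch);
}

// Apply path: the command's mutations skip the WAL entirely.
void applyEntry(rocksdb::DB* db, const std::string& serializedCommandBatch) {
  rocksdb::WriteOptions wo;
  wo.disableWAL = true;  // MemTable/sstables only
  rocksdb::WriteBatch batch(serializedCommandBatch);
  db->Write(wo, &batch);
}
```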

@petermattis (Collaborator, Author)

I implemented the plan described above: https://github.com/cockroachdb/cockroach/compare/master...petermattis:pmattis/2pc-experiment?expand=1

The TL;DR is that kv.experimental_2pc yields a drop in disk bandwidth for a fixed ops/sec, but when running at capacity it causes a reduction in throughput.

Running kv --concurrency 384 --splits 1000 --min-block-bytes 128 --max-block-bytes 129 --max-rate 7000 on a 6-node cluster:

[screenshot: write bandwidth graph]

[screenshot: disk IOs per device graph]

At 18:38 I enabled the new kv.experimental_2pc cluster setting. There is a 15-20% drop in write bandwidth. The --max-rate 7000 setting to kv was used to guarantee that we're looking at apples-to-apples write bandwidth for the same throughput. The disk IOs per device are unchanged, which is to be expected: we're still performing the same number of synchronous writes to disk, but we're writing somewhat less data by skipping the WAL when applying a Raft command.

Using the debug rocksdb dump_wal command, I verified that the contents of the WAL were as expected. ~1/2 of the WAL entries contained only uninterpreted LogData. The other 1/2 contained puts and deletes for RaftHardState.

Running kv --concurrency 384 --splits 1000 --min-block-bytes 128 --max-block-bytes 129 on a 6-node cluster (i.e. the same load as above, but without the rate limit):

[screenshot: ops/sec graph]

In the above graph, I enabled kv.experimental_2pc shortly after 18:56. Ops/sec decreases by ~10%. I'm not actually sure why. It may have something to do with the way I implemented disableWAL, which disables grouping of write batches in Go for batches which have that flag enabled. Or it could be something else.

The results here are surprising. I was expecting a more dramatic decrease in write bandwidth and an improvement in performance. Perhaps there is a bug in my experimental code. I did do some verification that the changes are doing what I think they are doing, but perhaps I missed something.

@petermattis modified the milestones: 1.2, 1.3 Nov 2, 2017
@nvanbenschoten added the A-storage label Apr 24, 2018
@petermattis modified the milestones: 2.1, 2.2 Jul 12, 2018
@tbg added the A-coreperf label and removed the A-disaster-recovery and A-kv-transactions labels Jul 31, 2018
@tbg removed the A-kv-distribution, A-kv-client, A-storage, and A-kv-replication labels Jul 31, 2018
@petermattis removed this from the 2.2 milestone Oct 5, 2018
@nvanbenschoten added the A-storage label and removed the A-coreperf label Oct 16, 2018
@nvanbenschoten (Member)

@petermattis do you mind if I close this? I doubt we're ever going to implement a similar 2PC mechanism in Pebble and I suspect that we'd be better served by pushing on #38322.

@petermattis (Collaborator, Author)

@nvanbenschoten I agree. Feel free to close.
