perf: investigate using RocksDB 2PC mechanism for Raft log #16948
With @irfansharif's work we can tune the raft log rocksdb instance separately, making this less likely.
I'm not sure I follow. Is the idea that raft log entries are persisted solely in the WAL and never written to the memtable (and therefore never written to on-disk sstables)? I think that using a separate rocksdb instance for the raft log gives us most of the potential benefit here. Separately, we could truncate more aggressively on followers to minimize the number of log entries that live long enough to be written to sstables (instead of driving truncation from the leader and the TruncateLog rpc), but that increases the risk that when leadership changes, we'll have to send snapshots instead of just the tail of the log.
I'm not sure if tuning RocksDB will help. I think we'd need to bump up how aggressively we truncate the Raft log. One thought I just had is that we could indicate that the Raft log should be truncated in quiesce heartbeat messages.
The idea is that Raft log entries are solely persisted to the WAL and the eventual application of the entry does not involve writing the contents of the Raft log (the …
With the separate rocksdb instance for the Raft log, we're still writing the …
Oh, so the "prepare" is not writing the entry to the log, it's effectively applying the command but leaving the transaction open. Interesting. I think that could work and would reduce our write amplification by a lot. But there are a lot of open questions about how we'd reconstruct the log entries after leadership changes, and the increased use of snapshots might be too big of a downside. It's worth exploring further.
It's actually the reverse: the "prepare" writes the entry to the WAL but does not add it to the memtable. The subsequent "commit" adds to the memtable but does not write the mutations to the WAL (only a "commit" marker is written). I need a whiteboard to diagram exactly how this works. Yes, there are a ton of open questions here.
Yeah, that's what I meant. We'd go through most of the …
Yeah, something like that. I'm not sure if we'd want to actually use a RocksDB transaction, though. There appears to be some associated lock manager that we might want to avoid. We might need to add a separate blob of "log data" to the …
RocksDB guarantees it won't garbage collect log files with prepared-but-uncommitted transactions. When appending the Raft log entry to the WAL, it would be great if we could add a key indicating the position of the entry in the WAL. This isn't currently possible with RocksDB but seems like a feasible enhancement. We'd want to eventually abort these Raft log entry transactions in order to allow the WAL files to get GC'd and so that startup time isn't severely impacted (I believe RocksDB has to scan all of the WAL files that contain such transactions). So perhaps that makes this unattractive. Oh well, more food for thought.
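For reference, a rough sketch of what that startup resolution could look like with the existing API: `GetAllPreparedTransactions` hands back the transactions that were prepared but never committed, and each one pins its WAL file until it is committed or rolled back. The `still_needed` predicate and the naming convention are hypothetical, and error handling is elided.

```cpp
#include <functional>
#include <string>
#include <vector>
#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"

// After reopening the TransactionDB, enumerate entries that were prepared
// (appended) but never committed (applied). RocksDB keeps every WAL file
// that still contains such a transaction, so anything we no longer need
// should be rolled back to let those files be garbage collected.
std::vector<rocksdb::Transaction*> ResolvePreparedEntries(
    rocksdb::TransactionDB* db,
    const std::function<bool(const std::string&)>& still_needed) {
  std::vector<rocksdb::Transaction*> prepared;
  db->GetAllPreparedTransactions(&prepared);
  std::vector<rocksdb::Transaction*> kept;
  for (rocksdb::Transaction* txn : prepared) {
    if (still_needed(txn->GetName())) {
      kept.push_back(txn);  // leave it prepared; it still pins its WAL file
    } else {
      txn->Rollback();  // releases the WAL reference for this entry
      delete txn;
    }
  }
  return kept;
}
```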
Here's my plan for experimentation:
Together I think these changes (which are relatively small) will provide the same performance as using 2PC. We'd do a synchronous commit of the Raft log entry to the WAL during append and only write to the memtable/sstables (not the WAL) when applying the entry.
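In RocksDB terms, that amounts to roughly the following two write paths (a sketch only; the key argument and error handling are placeholders):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

// Append: the Raft log entry is synchronously committed to the WAL (and the
// MemTable) before we acknowledge the append.
void AppendEntry(rocksdb::DB* db, const rocksdb::Slice& log_key,
                 const rocksdb::Slice& entry) {
  rocksdb::WriteBatch batch;
  batch.Put(log_key, entry);
  rocksdb::WriteOptions opts;
  opts.sync = true;
  rocksdb::Status s = db->Write(opts, &batch);
  (void)s;  // error handling elided in this sketch
}

// Apply: the command's effects go to the MemTable/sstables only. The WAL is
// skipped because the entry itself is already durable and can be re-applied
// after a crash.
void ApplyEntry(rocksdb::DB* db, rocksdb::WriteBatch* effects) {
  rocksdb::WriteOptions opts;
  opts.disableWAL = true;
  rocksdb::Status s = db->Write(opts, effects);
  (void)s;  // error handling elided in this sketch
}
```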
I implemented the plan described above: https://github.com/cockroachdb/cockroach/compare/master...petermattis:pmattis/2pc-experiment?expand=1

The TL;DR is that there is a drop in disk bandwidth for a fixed ops/sec, but running at capacity … [the graphs of the fixed-rate and at-capacity runs, and the settings enabled for each, are omitted here]

The results here are surprising. I was expecting a more dramatic decrease in write bandwidth and an improvement in performance. Perhaps there is a bug in my experimental code. I did do some verification that the changes are doing what I think they are doing, but perhaps I missed something.
@petermattis do you mind if I close this? I doubt we're ever going to implement a similar 2PC mechanism in Pebble and I suspect that we'd be better served by pushing on #38322. |
@nvanbenschoten I agree. Feel free to close. |
RocksDB supports a 2 phase commit mechanism for use by RocksDB transactions (which we do not currently use). The 2PC mechanism allows persisting a `WriteBatch` (via a prepare operation) to the RocksDB WAL without the mutations being applied to the `MemTable`. A subsequent `WriteBatch` can be applied which either commits or rolls back the prepared mutations. The upshot of this mechanism is that a set of mutations can be written to the WAL for persistence and only later added to the `MemTable`. This differs from the normal mode of applying a `WriteBatch`, where the batch is atomically written to the log and the operations are immediately added to the `MemTable`.
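A minimal sketch of that flow using the `TransactionDB` API (the path, key, and transaction name here are placeholders, and errors are only asserted on):

```cpp
#include <cassert>
#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::TransactionDBOptions txn_db_options;
  rocksdb::TransactionDB* db;
  rocksdb::Status s = rocksdb::TransactionDB::Open(
      options, txn_db_options, "/tmp/2pc-demo", &db);
  assert(s.ok());

  rocksdb::WriteOptions write_options;
  write_options.sync = true;  // the prepare should be durable in the WAL

  // Phase 1: prepare. The mutations are appended to the WAL but are not
  // inserted into the MemTable (and are not visible to readers).
  rocksdb::Transaction* txn = db->BeginTransaction(write_options);
  s = txn->SetName("entry-42");  // 2PC requires a named transaction
  assert(s.ok());
  s = txn->Put("key", "value");
  assert(s.ok());
  s = txn->Prepare();
  assert(s.ok());

  // Phase 2: commit (possibly much later). Only a commit marker is written
  // to the WAL; the prepared mutations are now inserted into the MemTable.
  s = txn->Commit();
  assert(s.ok());

  delete txn;
  delete db;
}
```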
Currently, the Raft leaders and followers have essentially the same behavior with respect to the Raft log. The Raft state machine tells us to append some entries to the Raft log and some time later it tells us to "commit" those entries (i.e. apply them). In the steady/happy state, Raft log entries are appended to the log and almost immediately committed. A short while later, a heuristic triggers truncation of Raft log entries that have been committed on all of the replicas. Currently that heuristic allows a modest number of entries to build up before truncation, but there isn't a strict need for that. We could truncate the Raft log on followers immediately after an entry has been committed. Doing so could make catch-up following a change in leadership more expensive, but let's ignore that for now.
So the steady/happy state on a follower essentially looks like:

1. Append the entry to the Raft log.
2. Apply (commit) the entry.
3. Truncate the Raft log, deleting the entry.
And these operations happen in quick succession. The time between writing a Raft log entry and applying it is measured in milliseconds. And deletion happens rapidly as well. Under the hood, the above operations look like:
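Roughly, in RocksDB terms (a sketch; `RaftLogKey` and the write options shown are placeholders for our actual key encoding and settings, and error handling is elided):

```cpp
#include <cstdint>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

// Placeholder key encoding for the Raft log, for illustration only.
std::string RaftLogKey(uint64_t index) {
  return "raftlog/" + std::to_string(index);
}

// The three steps above, as they hit RocksDB today. Every batch goes to both
// the WAL and the MemTable, so the entry's contents (which include the
// command's effects) are written twice to each.
void SteadyState(rocksdb::DB* db, uint64_t index, const rocksdb::Slice& entry,
                 rocksdb::WriteBatch* effects) {
  rocksdb::WriteOptions sync_opts;
  sync_opts.sync = true;

  // 1. Append the entry: WAL + MemTable.
  rocksdb::WriteBatch append;
  append.Put(RaftLogKey(index), entry);
  db->Write(sync_opts, &append);

  // 2. Apply the entry's effects under the replicated keys: WAL + MemTable,
  //    writing the same contents a second time.
  db->Write(rocksdb::WriteOptions(), effects);

  // 3. Truncate the log entry: a deletion tombstone, again WAL + MemTable.
  rocksdb::WriteBatch truncate;
  truncate.Delete(RaftLogKey(index));
  db->Write(rocksdb::WriteOptions(), &truncate);
}
```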
The contents of the Raft log entry are written twice to the WAL and twice to the MemTable. The MemTable is often flushed before the deletion occurs, so we experience 4x write amplification even before we start talking about normal RocksDB write amplification.
Can we use the RocksDB 2PC mechanism to eliminate part of this overhead? The high-level idea is to "prepare" a Raft log entry to RocksDB, causing it to be written to the WAL. If the entry is committed, we then write the commit marker to the WAL, causing the entry to be applied and the Raft log entry to be deleted. On startup, RocksDB scans the WAL and gives us access to the prepared-but-not-committed transactions, which correspond to uncommitted Raft log entries. On leadership change, we'd want to rebuild the indexable Raft log so that we can fall back to the existing code. I haven't put much thought into how that would work.
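To make the mapping concrete, here is a hypothetical shape of the per-entry flow. The transaction naming convention is invented, error handling is elided, and the effects are passed as key/value pairs for illustration; in reality the command's effects arrive as a serialized `WriteBatch`, which is one of the open questions above.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>
#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"

// Appending an entry becomes preparing a named transaction: the effects are
// made durable in the WAL but do not touch the MemTable, and there is no
// separate copy of the log entry to delete later.
rocksdb::Transaction* PrepareRaftEntry(
    rocksdb::TransactionDB* db, uint64_t range_id, uint64_t index,
    const std::vector<std::pair<std::string, std::string>>& effects) {
  rocksdb::WriteOptions wo;
  wo.sync = true;
  rocksdb::Transaction* txn = db->BeginTransaction(wo);
  // The name would need to encode enough to find the entry again after a
  // restart or a leadership change (a hypothetical convention).
  txn->SetName(std::to_string(range_id) + "/" + std::to_string(index));
  for (const auto& kv : effects) {
    txn->Put(kv.first, kv.second);
  }
  txn->Prepare();
  return txn;
}

// Applying the entry is just committing the prepared transaction: only a
// commit marker is appended to the WAL and the effects land in the MemTable.
void ApplyRaftEntry(rocksdb::Transaction* txn) {
  txn->Commit();
  delete txn;
}
```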
This would be a large and tricky change. There are likely gaping holes in this proposal. The migration story is absent. The expected performance gains need to be verified via experimentation. What does the system do in the unhappy state where one of the replicas is significantly behind?
Cc @irfansharif, @bdarnell, @tschottdorf