raft: initial import #120041
Merged
Conversation
chore: run unit test with matrix strategy
Signed-off-by: Benjamin Wang <[email protected]>
Signed-off-by: Benjamin Wang <[email protected]>
cleanup unnecessary tools/plugins in genproto.sh and tools/mod
This commit introduces an intermediate state that delays the acknowledgement of a node's self-vote during an election until that vote has been durably persisted (i.e. on the next call to Advance). This change can be viewed as the election counterpart to cockroachdb#14413. This is an intermediate state that limits code movement for the rest of the async storage writes change. Signed-off-by: Nathan VanBenschoten <[email protected]>
This commit adds a mechanism to the unstable struct to track "in-progress" log writes that are not yet stable. In-progress writes are still tracked in the unstable struct, which is necessary because they are not yet guaranteed to be present in `Storage` implementations. However, they are not included in the Entries field of future Ready structs, which avoids redundant raft log writes. The commit also does the same thing for the optional unstable snapshots. For now, entries and snapshots are immediately considered stable by `raft.advance`. A future commit will make it possible to accept multiple Ready structs without immediately stabilizing entries and snapshots from earlier Ready structs. This all works towards async Raft log writes, where log writes are decoupled from Ready iterations. Signed-off-by: Nathan VanBenschoten <[email protected]>
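As a minimal sketch of this idea (types and field names below are illustrative, not the library's exact definitions), the unstable structure keeps every entry that is not yet known to be in stable `Storage`, but remembers how much of that prefix has already been handed to the storage write path so it is not re-emitted in later Ready structs:

```go
// Illustrative sketch only; the real unstable struct in the library differs.
type entry struct {
    Index uint64
    Term  uint64
    Data  []byte
}

type unstableSketch struct {
    entries          []entry // entries not yet known to be in stable Storage
    offset           uint64  // index of entries[0]
    offsetInProgress uint64  // entries below this index have been handed to storage
}

// nextEntries returns the unstable entries that have not yet been handed to
// the storage write path, i.e. what should appear in the next Ready.
func (u *unstableSketch) nextEntries() []entry {
    start := int(u.offsetInProgress - u.offset)
    if start >= len(u.entries) {
        return nil
    }
    return u.entries[start:]
}

// acceptInProgress marks all currently unstable entries as in-progress. They
// remain tracked (they are not yet stable), but future Ready structs will not
// include them again.
func (u *unstableSketch) acceptInProgress() {
    if n := len(u.entries); n > 0 {
        u.offsetInProgress = u.entries[n-1].Index + 1
    }
}
```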
This commit adds a mechanism to the raft log struct to track in progress log application that has not yet completed. For now, committed entries are immediately considered applied by `raft.advance`. A future commit will make it possible to accept multiple Ready structs without immediately applying committed entries from earlier Ready structs. This all works towards async Raft log writes, where log application is decoupled from Ready iterations. Signed-off-by: Nathan VanBenschoten <[email protected]>
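In the same spirit, a hedged sketch of the bookkeeping (field and method names are illustrative): the log separates the highest index already handed out for application from the highest index confirmed applied, so committed entries are not handed out twice across Ready structs:

```go
// Illustrative sketch only; field names do not necessarily match the library.
type raftLogSketch struct {
    committed uint64 // highest index known to be committed
    applying  uint64 // highest index already handed to the apply path
    applied   uint64 // highest index confirmed applied by the state machine
}

// hasNextToApply reports whether there are committed entries that have not
// yet been handed out for application.
func (l *raftLogSketch) hasNextToApply() bool {
    return l.applying < l.committed
}

// acceptApplying records that entries up through index i have been handed to
// the apply path, so later Ready structs skip them even before they are applied.
func (l *raftLogSketch) acceptApplying(i uint64) {
    if i > l.applying {
        l.applying = i
    }
}

// appliedTo records confirmation from the state machine that entries up
// through index i have been applied.
func (l *raftLogSketch) appliedTo(i uint64) {
    if i > l.applied {
        l.applied = i
    }
}
```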
This allows callers to configure whether they want to allow entries that are not already in stable storage to be returned from the method. This will be used in a future commit. Signed-off-by: Nathan VanBenschoten <[email protected]>
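For illustration, extending the sketch above with the rough shape this option could take (the signature is a sketch, not necessarily the exact method in the library):

```go
// nextCommittedIndexLimit returns the highest committed index that may be
// handed out for application. With allowUnstable=false, the caller only
// receives entries that are already in stable storage.
func (l *raftLogSketch) nextCommittedIndexLimit(allowUnstable bool, lastStable uint64) uint64 {
    hi := l.committed
    if !allowUnstable && lastStable < hi {
        hi = lastStable
    }
    return hi
}
```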
This commit adds new proto fields and message types for the upcoming async storage writes functionality. These proto changes are not yet used. Signed-off-by: Nathan VanBenschoten <[email protected]>
Refer to: https://avd.aquasec.com/nvd/2022/cve-2022-41717/ Signed-off-by: Benjamin Wang <[email protected]>
deps: bump golang.org/x/net to 0.4.0 to address CVE-2022-41717
Fixes cockroachdb#12257.

This change adds opt-in support to raft to perform local storage writes asynchronously from the raft state machine handling loop.

A new AsyncStorageWrites configuration instructs the raft node to write to its local storage (raft log and state machine) using a request/response message passing interface instead of the default `Ready`/`Advance` function call interface. Local storage messages can be pipelined and processed asynchronously (with respect to `Ready` iteration), facilitating reduced interference between Raft proposals and increased batching of log appends and state machine application. As a result, use of asynchronous storage writes can reduce end-to-end commit latency and increase maximum throughput.

When AsyncStorageWrites is enabled, the `Ready.Messages` slice will include new `MsgStorageAppend` and `MsgStorageApply` messages. The messages will target a `LocalAppendThread` and a `LocalApplyThread`, respectively. Messages to the same target must be reliably processed in order. In other words, they can't be dropped (like messages over the network) and those targeted at the same thread can't be reordered. Messages to different targets can be processed in any order.

`MsgStorageAppend` carries Raft log entries to append, election votes to persist, and snapshots to apply. All writes performed in response to a `MsgStorageAppend` are expected to be durable. The message assumes the role of the Entries, HardState, and Snapshot fields in Ready.

`MsgStorageApply` carries committed entries to apply. The message assumes the role of the CommittedEntries field in Ready.

Local messages each carry one or more response messages which should be delivered after the corresponding storage write has been completed. These responses may target the same node or may target other nodes. The storage threads are not responsible for understanding the response messages, only for delivering them to the correct target after performing the storage write.

## Design Considerations

- There must be no regression for existing users that do not enable `AsyncStorageWrites`. For instance, CommittedEntries must not wait on unstable entries to be stabilized in cases where a follower is given committed entries in a MsgApp.
- Asynchronous storage work should use a message passing interface, like the rest of this library.
- The Raft leader and followers should behave symmetrically. Both should be able to use asynchronous storage writes for log appends and entry application.
- The LocalAppendThread on a follower should be able to send MsgAppResp messages directly to the leader without passing back through the raft state machine handling loop.
- The `unstable` log should remain true to its name. It should hold entries until they are stable and should not rely on an intermediate reliable cache.
- Pseudo-targets should be assigned to messages that target the local storage systems to denote required ordering guarantees.
- Code should be maximally unified across `AsyncStorageWrites=false` and `AsyncStorageWrites=true`. `AsyncStorageWrites=false` should be a special case of `AsyncStorageWrites=true` where the library hides the possibility of asynchrony.
- It should be possible to apply snapshots asynchronously, even though a snapshot touches both the Raft log state and the state machine. The library should make this easy for users to handle by delaying all committed entries until after the snapshot has applied, so snapshot application can be handled by 1) flushing the apply thread, 2) sending the `MsgStorageAppend` that contains a snapshot to the `LocalAppendThread` to be applied.

## Usage

When asynchronous storage writes is enabled, the responsibility of code using the library is different from what is presented in raft/doc.go (which has been updated to include a section about async storage writes).

Users still read from the Node.Ready() channel. However, they process the updates it contains in a different manner. Users no longer consult the HardState, Entries, and Snapshot fields (steps 1 and 3 in doc.go). They also no longer call Node.Advance() to indicate that they have processed all entries in the Ready (step 4 in doc.go). Instead, all local storage operations are also communicated through messages present in the Ready.Messages slice.

The local storage messages come in two flavors. The first flavor is log append messages, which target a LocalAppendThread and carry Entries, HardState, and a Snapshot. The second flavor is entry application messages, which target a LocalApplyThread and carry CommittedEntries. Messages to the same target must be reliably processed in order. Messages to different targets can be processed in any order. Each local storage message carries a slice of response messages that must be delivered after the corresponding storage write has been completed.

With Asynchronous Storage Writes enabled, the total state machine handling loop will look something like this:

```go
for {
    select {
    case <-s.Ticker:
        n.Tick()
    case rd := <-s.Node.Ready():
        for _, m := range rd.Messages {
            switch m.To {
            case raft.LocalAppendThread:
                toAppend <- m
            case raft.LocalApplyThread:
                toApply <- m
            default:
                sendOverNetwork(m)
            }
        }
    case <-s.done:
        return
    }
}
```

Usage of Asynchronous Storage Writes will typically also contain a pair of storage handler threads, one for log writes (append) and one for entry application to the local state machine (apply). Those will look something like:

```go
// append thread
go func() {
    for {
        select {
        case m := <-toAppend:
            saveToStorage(m.State, m.Entries, m.Snapshot)
            send(m.Responses)
        case <-s.done:
            return
        }
    }
}()
```

```go
// apply thread
go func() {
    for {
        select {
        case m := <-toApply:
            for _, entry := range m.CommittedEntries {
                process(entry)
                if entry.Type == raftpb.EntryConfChange {
                    var cc raftpb.ConfChange
                    cc.Unmarshal(entry.Data)
                    s.Node.ApplyConfChange(cc)
                }
            }
            send(m.Responses)
        case <-s.done:
            return
        }
    }
}()
```

## Compatibility

The library remains backwards compatible with existing users and the change does not introduce any breaking changes. Users that do not set `AsyncStorageWrites` to true in the `Config` struct will not notice a difference with this change. This is despite the fact that the existing "synchronous storage writes" interface was adapted to share a majority of the same code. For instance, `Node.Advance` has been adapted to transparently acknowledge an asynchronous log append attempt and an asynchronous state machine application attempt, internally using the same message passing mechanism introduced in this change.

The change has no cross-version compatibility concerns. All changes are local to a process and nodes using asynchronous storage writes appear to behave no differently from the outside. Clusters are free to mix nodes running with and without asynchronous storage writes.
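For reference, opting in (as described in the Usage section above) is a single knob on `Config`; the snippet below is a sketch in the style of the doc.go example, with illustrative values for the other fields:

```go
storage := raft.NewMemoryStorage()
c := &raft.Config{
    ID:                 0x01,
    ElectionTick:       10,
    HeartbeatTick:      1,
    Storage:            storage,
    MaxSizePerMsg:      4096,
    MaxInflightMsgs:    256,
    AsyncStorageWrites: true, // opt in to the message-passing storage interface
}
n := raft.StartNode(c, []raft.Peer{{ID: 0x01}})
```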
## Performance

The bulk of the performance evaluation of this functionality thus far has been done with [rafttoy](https://github.com/nvanbenschoten/rafttoy), a benchmarking harness developed to experiment with Raft proposal pipeline optimization. The harness can be used to run single-node benchmarks or multi-node benchmarks. It supports pluggable raft logs, storage engines, network transports, and pipeline implementations.

To evaluate this change, we fixed the raft log (`etcd/wal`), storage engine (`pebble`), and network transport (`grpc`). We then built (nvanbenschoten/rafttoy#3) a pipeline implementation on top of the new asynchronous storage writes functionality and compared it against two other pipeline implementations. The three pipeline implementations we compared were:

- **basic** (P1): baseline stock raft usage, similar to the code in `doc.go`.
- **parallel append + early ack** (P2): CockroachDB's current pipeline, which includes two significant variations on the basic pipeline. The first is that it sends MsgApp messages to followers before writing to the local Raft log (see [commit](cockroachdb@b67eb69) for explanation), allowing log appends to occur in parallel across replicas. The second is that it acknowledges committed log entries before applying them (see [commit](cockroachdb@87aaea7) for explanation).
- **async append + async apply + early ack** (P3): a pipeline using asynchronous storage writes with a separate append thread and a separate apply thread. It also uses the same early acknowledgement optimization from above to ack committed entries before handing them to the apply thread.

All testing was performed on a 3 node AWS cluster of m5.4xlarge instances with gp3 EBS volumes (16000 IOPS, 1GB/s throughput).

![Throughput vs latency of Raft proposal pipeline implementations](https://user-images.githubusercontent.com/5438456/197925200-11352c09-569b-460c-ae42-effbf407c4e5.svg)

The comparison demonstrates two different benefits of asynchronous storage writes.

The first is that it reduces end-to-end latency of proposals by 20-25%. For instance, when serving 16MB/s of write traffic, P1's average latency was 13.2ms, P2's average latency was 7.3ms, and P3's average latency was 5.24ms. This is a reduction in average latency of 28% from the optimized pipeline that does not use asynchronous storage writes. This matches expectations outlined in cockroachdb#17500.

The second is that it increases the maximum throughput at saturation. This is because asynchronous storage writes can improve batching for both log appends and log application. In this experiment, we saw the average append batch size under saturation increase from 928 to 1542, which is a similar ratio to the increase in peak throughput. We see a similar difference for apply batch sizes.

There is more benchmarking to do. For instance, we'll need to thoroughly verify that this change does not negatively impact the performance of users of this library that do not use asynchronous storage writes.

Signed-off-by: Nathan VanBenschoten <[email protected]>
This commit makes it more clear that the asyncStorageWrites handling is entirely local to RawNode and that the raft object always operates in "async storage" mode. Signed-off-by: Nathan VanBenschoten <[email protected]>
This commit adds a new data-driven test that reproduces a scenario similar to the one described in newStorageAppendRespMsg, exercising a few interesting interactions between asynchronous storage writes, term changes, and log truncation. Signed-off-by: Nathan VanBenschoten <[email protected]>
Pure code movement. Eliminates asyncStorageWrites handling in node.go. Signed-off-by: Nathan VanBenschoten <[email protected]>
This commit removes certain cases where `MsgStorageAppendResp` messages were attached as responses to a `MsgStorageAppend` message, even when the response contained no useful information. The most common case where this comes up is when the HardState changes but no new entries are appended to the log. Avoiding the response in these cases eliminates useless work. Additionally, if the HardState does not include a new vote and only includes a new Commit, then there will be no response messages on the `MsgStorageAppend`. Users of this library can use this condition to determine when an fsync is not necessary, similar to how they previously used the `Ready.MustSync` flag. Signed-off-by: Nathan VanBenschoten <[email protected]>
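A sketch of how an append thread could act on this, extending the append-thread example from the async storage writes commit above (`syncStorage` is a hypothetical helper, not part of the library):

```go
// m is a MsgStorageAppend taken from Ready.Messages. If it carries no
// responses, nothing is waiting on the durability of this write, so the
// fsync can be skipped (the analogue of the old Ready.MustSync flag).
mustSync := len(m.Responses) > 0
saveToStorage(m.State, m.Entries, m.Snapshot) // as in the append thread example
if mustSync {
    syncStorage() // hypothetical: flush the raft log to durable storage
}
send(m.Responses)
```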
This avoids a call to stable `Storage`. It turns a regression in firstIndex/op from 2 to 3 (or 5 to 7) into an improvement from 2 to 1 (or 5 to 3).

```
name                     old firstIndex/op  new firstIndex/op  delta
RawNode/single-voter-10        3.00 ± 0%          1.00 ± 0%   -66.67%  (p=0.000 n=10+10)
RawNode/two-voters-10          7.00 ± 0%          3.00 ± 0%   -57.14%  (p=0.000 n=10+10)
```

Signed-off-by: Nathan VanBenschoten <[email protected]>
This commit fixes the interactions between commit entry pagination and async storage writes. The pagination now properly applies across multiple Ready structs, acting as a limit on outstanding committed entries that have yet to be acked through a MsgStorageApplyResp message. The commit also resolves an abuse of the LogTerm field in MsgStorageApply{Resp}. Signed-off-by: Nathan VanBenschoten <[email protected]>
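A sketch of the knob this refers to, assuming the pagination limit is the pre-existing `MaxCommittedSizePerReady` field on `Config` (the value shown is illustrative):

```go
c := &raft.Config{
    // ... other fields as in the earlier Config example ...
    AsyncStorageWrites:       true,
    MaxCommittedSizePerReady: 64 << 20, // cap on committed bytes handed out but not yet acked via MsgStorageApplyResp
}
```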
This commit replaces the HardState field in Message with a Vote. For MsgStorageAppends, the term, vote, and commit fields will either all be set (to facilitate the construction of a HardState) if any of the fields have changed or will all be unset if none of the fields have changed. Signed-off-by: Nathan VanBenschoten <[email protected]>
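A sketch of how a storage thread can rebuild a `HardState` from these message fields, given the convention above that term, vote, and commit are either all set or all unset on a `MsgStorageAppend` (`persistHardState` is a hypothetical stand-in for the user's storage code):

```go
// If any of the three fields is set, they all are, and together they form the
// HardState that must be persisted with this append.
if m.Term != 0 || m.Vote != 0 || m.Commit != 0 {
    hs := raftpb.HardState{Term: m.Term, Vote: m.Vote, Commit: m.Commit}
    persistHardState(hs) // hypothetical stand-in for the user's durable-write code
}
```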
…syncRaftLogMsg Fixes etcd-io/etcd#12257
Signed-off-by: Kezhi Xiong <[email protected]>
In the becomeLeader method (raft.go) we reduce the uncommitted size for the empty entry, which is basically a no-op since the payload size is always 0 for empty entries. Added a test case to check that as well. Signed-off-by: Daman <[email protected]>
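A tiny illustration of why this is a no-op (a generic sketch, not the library's accounting code): the uncommitted-size accounting counts entry payload bytes, and the empty entry appended on becoming leader carries no payload:

```go
// The empty entry appended in becomeLeader has no payload.
emptyEnt := raftpb.Entry{Data: nil}
fmt.Println(len(emptyEnt.Data)) // 0: reducing the uncommitted size by it changes nothing
```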
Signed-off-by: Paco Xu <[email protected]>
Signed-off-by: Benjamin Wang <[email protected]>
security: add dependabot.yml
Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 3.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@v2...v3)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
…ons/actions/checkout-3 build(deps): bump actions/checkout from 2 to 3
Bumps [actions/setup-go](https://github.com/actions/setup-go) from 2 to 3.
- [Release notes](https://github.com/actions/setup-go/releases)
- [Commits](actions/setup-go@v2...v3)

---
updated-dependencies:
- dependency-name: actions/setup-go
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Bumps [github.com/golang/protobuf](https://github.com/golang/protobuf) from 1.5.2 to 1.5.4.
- [Release notes](https://github.com/golang/protobuf/releases)
- [Commits](golang/protobuf@v1.5.2...v1.5.4)

---
updated-dependencies:
- dependency-name: github.com/golang/protobuf
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
…/github.com/golang/protobuf-1.5.4 build(deps): bump github.com/golang/protobuf from 1.5.2 to 1.5.4
…h-assert-(raft_paper_test.go) Test: Replace t.Error/Fatal with assert/require in [raft_paper_test.go]
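For illustration, the general shape of that conversion (a generic example, not an actual assertion from raft_paper_test.go):

```go
// before: manual failure reporting
if got := r.Term; got != 2 {
    t.Fatalf("term = %d, want %d", got, 2)
}

// after: testify assertions
assert.Equal(t, uint64(2), r.Term)  // records the failure and continues
require.Equal(t, uint64(2), r.Term) // records the failure and stops the test
```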
rickystewart approved these changes Mar 15, 2024
bors r=bdarnell,erikgrinaker,rickystewart,nvanbenschoten,sumeerbhola

Build succeeded:
This PR imports the `etcd-io/raft` repository into the `pkg/raft` package, and makes our repository depend on `pkg/raft` directly. The dependency on `etcd-io/raft` is removed. The import retains the commit history, for better historical navigation on this code.

The import was done from commit etcd-io/raft@a76fbc0, with the following command:

To merge changes from `etcd-io/raft` in the future, one should run the following command and resolve merge conflicts:

Resolves #120194
Release note: none
Epic: none