Doc: explain ReplicationSessionId in detail #1326

Merged 1 commit on Feb 7, 2025
4 changes: 4 additions & 0 deletions openraft/src/docs/data/mod.rs
@@ -23,3 +23,7 @@ pub mod extended_membership {
pub mod effective_membership {
#![doc = include_str!("effective-membership.md")]
}

pub mod replication_session {
#![doc = include_str!("replication-session.md")]
}
119 changes: 119 additions & 0 deletions openraft/src/docs/data/replication-session.md
@@ -0,0 +1,119 @@
# `ReplicationSessionId`

In OpenRaft, data often needs to be replicated from one node to multiple target
nodes—a process known as **replication**. To guarantee the correctness and
consistency of the replication process, the `ReplicationSessionId` is introduced
as an identifier that distinguishes individual replication sessions. This
ensures that replication states are correctly handled during leader changes or
membership modifications.

When a Leader starts replicating log entries to a set of target nodes, it
initiates a replication session that is uniquely identified by the
`ReplicationSessionId`. The session comprises the following key elements:
- The identity of the Leader (`leader_id`)
- The set of replication targets (`targets: BTreeSet<NodeId>`)

The `ReplicationSessionId` uniquely identifies a replication stream from Leader
to target nodes. When replication progresses (e.g., Node A receives log entry
10), the replication module sends an update message (`{target=A,
matched=log_id(10)}`) with the corresponding `ReplicationSessionId` to track
progress accurately.

The conditions that trigger the creation of a new `ReplicationSessionId` include:
- A change in the Leader’s identity.
- A change in the cluster’s membership.

This mechanism ensures that replication states remain isolated between sessions.

## Structure of `ReplicationSessionId`

The Rust implementation defines the structure as follows:

```rust,ignore
pub struct ReplicationSessionId {
/// The Leader this replication belongs to.
pub(crate) leader_vote: CommittedVote,

/// The log id of the membership log this replication works for.
pub(crate) membership_log_id: Option<LogId>,
}
```

The structure contains two core elements:

1. **`leader_vote`**
   This field identifies the Leader that owns this replication session. When a
   new Leader is elected, the previous replication session becomes invalid,
   preventing state mixing between different Leaders.

2. **`membership_log_id`**
This field stores the ID of the membership log for this replication session.
When membership changes via a log entry, a new replication session is created.
The `membership_log_id` ensures old replication states are not reused with
new membership configurations.
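The role of these two fields can be illustrated with a small, self-contained
sketch. The `Vote`, `LogId`, and `SessionId` types below are simplified
stand-ins, not the actual openraft types (which are generic over a
`RaftTypeConfig`); the point is only the equality semantics:

```rust
/// Simplified stand-in for a leader's committed vote: (term, node_id).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Vote {
    pub term: u64,
    pub node_id: u64,
}

/// Simplified stand-in for a log id: (term, index).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LogId {
    pub term: u64,
    pub index: u64,
}

/// A new leader OR a new membership log yields a distinct session id.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SessionId {
    pub leader_vote: Vote,
    pub membership_log_id: Option<LogId>,
}

fn main() {
    let leader = Vote { term: 2, node_id: 1 };

    // Session created for the membership log at index 1.
    let s1 = SessionId {
        leader_vote: leader,
        membership_log_id: Some(LogId { term: 1, index: 1 }),
    };

    // Same leader, but membership changed again at index 10: a new session.
    let s2 = SessionId {
        leader_vote: leader,
        membership_log_id: Some(LogId { term: 2, index: 10 }),
    };

    // Progress reported under s1 must never be applied under s2.
    assert_ne!(s1, s2);
}
```

Because `SessionId` derives `Eq`, a stale progress update can be rejected with
a plain equality comparison, even when the leader itself has not changed.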

## Isolation Example

Consider a scenario with three membership logs:

1. `log_id=1`: members = {a, b, c}
2. `log_id=5`: members = {a, b}
3. `log_id=10`: members = {a, b, c}

The process occurs as follows:

- When `log_id=1` is appended, OpenRaft starts replicating to Node `c`. After
  log entry `log_id=1` has been replicated to Node `c`, an update message
  `{target=c, matched=log_id(1)}` is enqueued (into a channel) for processing by the
  Raft core.

- When `log_id=5` is appended, due to the membership change, replication to Node
`c` is terminated.

- When `log_id=10` is appended, a new replication session to Node `c` is
initiated. At this point, Node `c` is considered a newly joined node without
any logs.

Without proper session isolation, a delayed state update message from a previous
session (e.g., `{target=c, matched=log_id(1)}`) could be mistakenly applied to a
new replication session. This would cause the Raft core to incorrectly believe
Node `c` has received and committed certain log entries. The `ReplicationSessionId`
prevents this by strictly isolating replication sessions when membership or Leader
changes occur.
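The filtering step can be sketched as follows. The types and the `accept`
helper here are hypothetical simplifications for illustration, not the actual
openraft API:

```rust
/// Simplified session id: (leader term, index of the membership log).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SessionId {
    pub leader_term: u64,
    pub membership_log_index: u64,
}

/// A progress update reported by a replication stream.
pub struct ProgressUpdate {
    pub session: SessionId,
    pub matched_index: u64,
}

/// Accept the update only if it belongs to the current session;
/// otherwise it is a delayed message from an old session and is dropped.
pub fn accept(current: SessionId, update: &ProgressUpdate) -> Option<u64> {
    if update.session == current {
        Some(update.matched_index)
    } else {
        None
    }
}

fn main() {
    // Current session: the membership log at log_id=10.
    let current = SessionId { leader_term: 2, membership_log_index: 10 };

    // A delayed update from the earlier session (membership log_id=1),
    // claiming node `c` has matched log_id=1.
    let stale = ProgressUpdate {
        session: SessionId { leader_term: 2, membership_log_index: 1 },
        matched_index: 1,
    };

    // The stale update is ignored instead of corrupting the new session.
    assert_eq!(accept(current, &stale), None);
}
```

In this sketch the delayed `{target=c, matched=log_id(1)}` message carries the
old session id and is simply discarded, so the new session never records a
match that node `c` does not actually have.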

## Consequences of Not Distinguishing Membership Configurations

If the replication session only distinguishes based on the Leader and not the
`membership_log_id`, the Leader at `log_id=10` may incorrectly believe Node `c`
has received `log_id=1`. While this does not cause data loss due to Raft's
membership change algorithm, it creates engineering problems: Node `c` will
return errors when receiving logs starting from `log_id=2` since it's missing
`log_id=1`. The Leader may misinterpret these errors as data corruption and
trigger protective actions like service shutdown.

### Why Data Loss Does Not Occur

Raft itself provides specific guarantees ensuring no committed data will be lost:

- Before proposing a membership configuration with `log_id=10` (denoted as
`c10`), the previous membership configuration must have been committed.
- If the previous configuration were `c8` (associated with `log_id=8`), then
`c8` would have been committed to a quorum within the cluster. This implies
that the preceding `log_id=1` had already been accepted by a quorum.
- Raft’s joint configuration change mechanism ensures that there is an
overlapping quorum between `c8` and `c10`. Consequently, any new Leader,
whether elected under `c8` or `c10`, will have visibility of the already
committed `log_id=1`.
- Based on these principles, committed data is safeguarded against loss.
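For this concrete example, the overlap argument can be checked by brute force:
every majority of `c8 = {a, b}` shares at least one node with every majority of
`c10 = {a, b, c}`. The sketch below enumerates the majorities directly; it
verifies only this example, not the general joint-consensus proof:

```rust
/// All subsets of `members` that form a strict majority.
fn majorities(members: &[char]) -> Vec<Vec<char>> {
    let n = members.len();
    let mut result = Vec::new();
    for mask in 0u32..(1 << n) {
        let subset: Vec<char> = (0..n)
            .filter(|&i| mask & (1 << i) != 0)
            .map(|i| members[i])
            .collect();
        if subset.len() * 2 > n {
            result.push(subset);
        }
    }
    result
}

/// True if the two quorums share at least one node.
fn overlaps(a: &[char], b: &[char]) -> bool {
    a.iter().any(|x| b.contains(x))
}

fn main() {
    let c8 = ['a', 'b'];
    let c10 = ['a', 'b', 'c'];

    // Every majority of c8 intersects every majority of c10, so a leader
    // elected under either config sees the entries committed under c8.
    for qa in majorities(&c8) {
        for qb in majorities(&c10) {
            assert!(overlaps(&qa, &qb), "quorums {:?} and {:?} do not overlap", qa, qb);
        }
    }
}
```

Since the only majority of `{a, b}` is `{a, b}` itself, and every two-node
majority of `{a, b, c}` contains `a` or `b`, the intersection is never empty in
this scenario.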

### Potential Engineering Issues

Even though data loss is avoided, improper session isolation can still lead to
engineering challenges:

- If the Leader erroneously assumes that `log_id=1` has been replicated on Node
`c`, it continues to send further log entries. Node `c`, however, will respond
with errors due to the missing `log_id=1`.
- The Leader may then mistakenly conclude that Node `c` has experienced data
loss, prompting the triggering of protective operations, which might, in turn,
lead to service downtime.
24 changes: 6 additions & 18 deletions openraft/src/replication/replication_session_id.rs
@@ -9,26 +9,14 @@ use crate::RaftTypeConfig;

  /// Uniquely identifies a replication session.
  ///
- /// A replication session is a group of replication stream, e.g., the leader(node-1) in a cluster of
- /// 3 nodes may have a replication config `{2,3}`.
+ /// A replication session represents a set of replication streams from a leader to its followers.
+ /// For example, in a cluster of 3 nodes where node-1 is the leader, it maintains replication
+ /// streams to nodes {2,3}.
  ///
- /// Replication state can not be shared between two leaders(different `vote`) or between two
- /// membership configs: E.g. given 3 membership log:
- /// - `log_id=1, members={a,b,c}`
- /// - `log_id=5, members={a,b}`
- /// - `log_id=10, members={a,b,c}`
+ /// A replication session is uniquely identified by the leader's vote and the membership
+ /// configuration. When either changes, a new replication session is created.
  ///
- /// When log_id=1 is appended, openraft spawns a replication to node `c`.
- /// Then log_id=1 is replicated to node `c`.
- /// Then a replication state update message `{target=c, matched=log_id-1}` is piped in message
- /// queue(`tx_api`), waiting the raft core to process.
- ///
- /// Then log_id=5 is appended, replication to node `c` is dropped.
- ///
- /// Then log_id=10 is appended, another replication to node `c` is spawned.
- /// Now node `c` is a new empty node, no log is replicated to it.
- /// But the delayed message `{target=c, matched=log_id-1}` may be process by raft core and make raft
- /// core believe node `c` already has `log_id=1`, and commit it.
+ /// See: [ReplicationSession](crate::docs::data::replication_session)
  #[derive(Debug, Clone)]
  #[derive(PartialEq, Eq)]
  pub(crate) struct ReplicationSessionId<C>