Merge #140233
140233: raft: send MsgAppResp to latest leader after handling snapshot r=pav-kv a=hakuuww

raft [does not](https://github.com/cockroachdb/cockroach/blob/55cf17041236ea300c38f50bb55628d28297642f/pkg/raft/tracker/progress.go#L388) send MsgApp probes to a peer if its flow is in StateSnapshot. This stalls replication to this peer until the outstanding snapshot has been streamed.
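For context, here is a minimal sketch of why a flow in StateSnapshot receives no probes. The names and exact conditions are illustrative, loosely modeled on the progress tracker, not the actual cockroachdb/raft code:

```go
package tracker

// StateType is the leader's view of a follower's replication flow.
type StateType int

const (
	StateProbe StateType = iota
	StateReplicate
	StateSnapshot
)

// Progress is a simplified stand-in for the leader's per-follower tracker.
type Progress struct {
	State              StateType
	MsgAppProbesPaused bool
	InflightsFull      bool
}

// IsPaused reports whether the leader should refrain from sending MsgApp to
// this follower. A flow in StateSnapshot is always paused, so the follower
// receives nothing until the outstanding snapshot completes (or, with this
// change, until a MsgAppResp proves it has already caught up).
func (pr *Progress) IsPaused() bool {
	switch pr.State {
	case StateProbe:
		return pr.MsgAppProbesPaused
	case StateReplicate:
		return pr.InflightsFull
	case StateSnapshot:
		return true
	default:
		return true
	}
}
```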

Previously, when a slow follower (Node 3) received a snapshot from the previous leader (Node 1), it would only send a MsgAppResp back to the original sender (Node 1) (inferred from [raft message.From](https://github.com/cockroachdb/cockroach/blob/145f8b978819e1e137f5a330baa83ce78385300f/pkg/raft/raftpb/raft.proto#L77C24-L77C28)), even if it was aware that leadership had changed to Node 2 ([raft.lead](https://github.com/cockroachdb/cockroach/blob/2083bac24d1453392955dac8bff50cbc6f64cd9d/pkg/raft/raft.go#L365)). This delayed informing the new leader (Node 2) of Node 3's updated state, which can be improved.

To address this issue, when Node 3 is aware of the new leader (Node 2), the slow follower now sends MsgAppResp to both Node 1 and Node 2 (if their peer IDs differ) upon receiving and applying a snapshot from Node 1.

If Node 3 has already acknowledged Node 2 as the new leader, then Node 2 has likely already marked Node 3 as StateSnapshot (transitioning from StateProbe after sending MsgApp and receiving MsgAppResp).

Note: it is possible that Node 3 knows Node 2 is the new leader, but Node 3's initial response to Node 2 failed to deliver. The test case included in this PR assumes Node 2 received Node 3's response to the probe. If Node 3's initial response to Node 2 failed to deliver, and the leader Node 2 later sees another MsgAppResp from Node 3 with an up-to-date commit index (resulting from Node 3 processing the snapshot from Node 1), then Node 2 simply transitions Node 3 to StateReplicate. So the outcome is still correct and desired.

With this PR change, Node 2 can now transition Node 3 back to StateReplicate upon receiving the MsgAppResp for the snapshot Node 3 received from Node 1.

This optimization prevents unnecessary delays in replication progress and helps reduce potential write latency.
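Roughly, the leader-side effect of the extra MsgAppResp looks like the following sketch (hedged: the helper, field names, and the exact recovery condition are illustrative of the progress-tracker logic, not the actual cockroachdb/raft code):

```go
package tracker

// Sketch types; names are illustrative only.
type StateType int

const (
	StateReplicate StateType = iota + 1
	StateSnapshot
)

type Progress struct {
	State           StateType
	Match, Next     uint64
	PendingSnapshot uint64
}

// maybeRecoverFromSnapshot sketches how a leader reacts to a MsgAppResp from
// a follower it currently tracks in StateSnapshot. If the acknowledged index
// shows the follower's log now overlaps the leader's own log (so ordinary
// MsgApp can serve it), the leader stops waiting for the in-flight snapshot
// and resumes replication.
func maybeRecoverFromSnapshot(pr *Progress, respIndex, leaderFirstIndex uint64) {
	// MaybeUpdate-style bookkeeping: advance Match/Next if the response
	// acknowledges a newer index than previously recorded.
	if respIndex > pr.Match {
		pr.Match = respIndex
		if respIndex+1 > pr.Next {
			pr.Next = respIndex + 1
		}
	}
	// The follower no longer needs a snapshot: everything it is missing is
	// still present in the leader's log, so switch back to StateReplicate.
	if pr.State == StateSnapshot && pr.Match+1 >= leaderFirstIndex {
		pr.State = StateReplicate
		pr.PendingSnapshot = 0
	}
}
```

In the test below, Node 2 tracks Node 3 at match=15 while its own first index is 16 after compaction, so the recovery check passes and Node 3 moves back to StateReplicate without waiting for Node 2's own snapshot.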

A significant issue (a few seconds of write unavailability or a latency spike) can arise if a slow follower (Node 3) becomes the new leaseholder during a leadership change but remains unaware of its lease status.
(This is an unlikely corner case, but it has happened in perturbation/metamorphic/backfill tests before, causing roughly 5 seconds of write unavailability on the range experiencing the leaseholder and raft leader change. [roachtest link](#133155 (comment)))

A simple example scenario to explain why there can be write unavailability (precondition: Node 3 is a new learner or slow follower):
- If Node 3 is made the new leaseholder but has not yet received/applied the lease entry, it cannot start serving reads or writes until it learns it has become the leaseholder, which only happens after applying the lease entry. The rest of the cluster can still consider Node 3 the leaseholder once a quorum has applied the new lease entry, and will therefore forward read and write requests to Node 3.
- Node 3 must wait for Node 2 (the new Raft leader) to replicate new entries, which include the lease entry, before it recognizes itself as the leaseholder.
- However, Node 2 will only replicate new entries once it sees Node 3 in StateReplicate, which can be delayed if Node 3 remains in StateSnapshot longer than necessary.

Snapshot transfers are relatively slow (a few seconds) due to their large size and network overhead.
This change eliminates the wait for an additional snapshot transfer (from Node 2 to Node 3) by allowing the new leader (Node 2) to transition Node 3 back to StateReplicate sooner and start sending MsgApp messages instead of waiting for its own snapshot's response. The optimization applies when the previous leader already sent a snapshot to Node 3.

Since sending, processing, and responding to MsgApp is much faster than transferring a snapshot, Node 3 will receive and apply the lease entry sooner, allowing it to recognize its leaseholder status and begin serving reads and writes to the upper layers more quickly.

In conclusion, this optimization reduces the potential write and read latency spike or unavailability in the scenario above.

The problem is not completely fixed, since we still wait for at least one snapshot transfer.
Ideally we would avoid giving the lease to a slow follower or learner in the first place, but that is an issue of its own.

potential fix to #134257

Epic: None

Release note: None

Co-authored-by: Anthony Xu <[email protected]>
craig[bot] and hakuuww committed Feb 20, 2025
2 parents 7b04347 + c8c1b56 commit 3690e14
Showing 2 changed files with 228 additions and 0 deletions.
32 changes: 32 additions & 0 deletions pkg/raft/raft.go
@@ -2467,8 +2467,40 @@ func (r *raft) handleSnapshot(m pb.Message) {
if r.restore(s) {
r.logger.Infof("%x [commit: %d] restored snapshot [index: %d, term: %d]",
r.id, r.raftLog.committed, id.index, id.term)

// To send MsgAppResp to any leader, we must be sure that our log is
// consistent with that leader's log.
//
// After restore(s), our log ends at the entryID of this snapshot.
// The snapshot came from a node at m.Term (typically the leader).
// Everything in this snapshot has been committed during m.Term or earlier,
// and will be present in all future term logs. It is thus safe to send
// MsgAppResp to any leader at term >= m.Term.
//
// From section 5.4 of the Raft paper:
// if a log entry is committed in a given term, then that entry will be
// present in the logs of the leaders for all higher-numbered terms.
r.send(pb.Message{To: m.From, Type: pb.MsgAppResp, Index: r.raftLog.lastIndex(),
Commit: r.raftLog.committed})

// A leadership change may have happened while the snapshot was in flight.
// Therefore we need to send the MsgAppResp to the new leader as well.
//
// If we don't send a response to the new leader, it will not know that
// we have committed up to the snapshot index. This delays the new
// leader's transition of this follower to StateReplicate (since the new
// leader may send its own snapshot to this follower and wait for the
// MsgAppResp of that snapshot), which may ultimately cause unwanted
// commit delay for the client.
//
// Sending this response to the new leader allows the new leader to know
// this follower is caught up ASAP.
//
// TODO(pav-kv): consider if we can only send to r.lead.
if r.lead != None && r.lead != m.From {
r.send(pb.Message{To: r.lead, Type: pb.MsgAppResp, Index: r.raftLog.lastIndex(),
Commit: r.raftLog.committed})
}
} else {
r.logger.Infof("%x [commit: %d] ignored snapshot [index: %d, term: %d]",
r.id, r.raftLog.committed, id.index, id.term)
196 changes: 196 additions & 0 deletions pkg/raft/testdata/snapshot_sent_and_leadership_change.txt
@@ -0,0 +1,196 @@
# This test sets up a 3-node Raft cluster to verify redundant snapshot handling
# across leadership changes. It ensures that a slow follower (Node 3) correctly
# processes snapshots from both the previous (Node 1) and current (Node 2)
# leader, and that the new leader (Node 2) avoids spending unnecessary time
# tracking Node 3 in StateSnapshot.

# In the case that this slow follower happens to become the leaseholder,
# it is crucial for the leader node's progress tracker to have the leaseholder
# node transition from StateSnapshot -> StateReplicate ASAP

# Summary of steps:
# 1. Node 1 (previous leader) sends a snapshot to Node 3 (slow follower).
# 2. Leadership changes from Node 1 → Node 2.
# 3. Node 2 starts sending MsgApp to all followers, including Node 3.
# 4. Node 3, now aware of the leader change, responds with MsgAppResp to Node 2.
# 5. Node 2 discovers it has already compacted past Node 3's match index, so it
# sends a new snapshot to Node 3.
# 6. Node 3 receives the snapshot from the previous-term leader (Node 1) first.
#    It sends MsgAppResp to both Node 1 and Node 2.
# 7. Node 2 had previously marked Node 3 as StateSnapshot when sending its
#    own snapshot 2->3. When Node 3 responds to snapshot 1->3, Node 2
#    transitions Node 3 back to StateReplicate, even though its own snapshot
#    is still in flight.


# This allows the new leader (Node 2) to move Node 3 from StateSnapshot back
# to StateReplicate earlier.

# Turn off output during the setup of the test.
log-level none
----
ok

# Start cluster of 3
add-nodes 3 voters=(1,2,3) index=10
----
ok

# Elect 1 as leader.
campaign 1
----
ok

stabilize
----
ok

propose 1 m13
----
ok

propose 1 m14
----
ok

propose 1 m15
----
ok

propose 1 m16
----
ok

# Create a network partition of (1,2) and (3)
stabilize 1 2
----
ok

deliver-msgs drop=(3)
----
ok

log-level debug
----
ok

# Follower 3 is far behind, so it will need a snapshot.
raft-log 3
----
1/11 EntryNormal ""

# Leader1 sends a snapshot to 3
send-snapshot 1 3
----
1->3 MsgSnap Term:1 Log:0/0
Snapshot: Index:15 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[] AutoLeave:false

# Transfer leadership to 2, without 3 hearing about it.
transfer-leadership from=1 to=2
----
INFO 1 [term 1] starts to transfer leadership to 2
INFO 1 sends MsgTimeoutNow to 2 immediately as 2 already has up-to-date log
DEBUG 1 setting election elapsed to start from 3 ticks after store liveness support expired
INFO 1 became follower at term 1
DEBUG 1 reset election elapsed to 0

log-level none
----
ok

stabilize 1 2
----
ok

log-level debug
----
ok

# Deliver MsgApp 2->3 first
deliver-msgs type=MsgApp 3
----
2->3 MsgApp Term:2 Log:1/15 Commit:15 Entries:[2/16 EntryNormal ""]
INFO 3 [term: 1] received a MsgApp message with higher term from 2 [term: 2], new leader indicated, advancing term
DEBUG 3 setting election elapsed to start from 3 ticks after store liveness support expired
INFO 3 became follower at term 2
DEBUG 3 reset election elapsed to 0
DEBUG 3 [logterm: 0, index: 15] rejected MsgApp [logterm: 1, index: 15] from 2

# Deliver MsgSnap from 1->3 second
deliver-msgs type=MsgSnap 3
----
1->3 MsgSnap Term:1 Log:0/0
Snapshot: Index:15 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[] AutoLeave:false
INFO log [committed=11, applied=11, applying=11, unstable.offset=12, unstable.offsetInProgress=12, len(unstable.Entries)=0] starts to restore snapshot [index: 15, term: 1]
INFO 3 switched to configuration voters=(1 2 3)
INFO 3 [commit: 15, lastindex: 15, lastterm: 1] restored snapshot [index: 15, term: 1]
INFO 3 [commit: 15] restored snapshot [index: 15, term: 1]

# 3 first responds to MsgApp 2->3, with hint = 11
# 3 then responds to MsgSnap 1->3, with commit index = 15
stabilize 3
----
> 3 handling Ready
Ready MustSync=true:
HardState Term:2 Commit:15 Lead:2 LeadEpoch:0
Snapshot Index:15 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[] AutoLeave:false
Messages:
3->2 MsgAppResp Term:2 Log:1/15 Rejected (Hint: 11) Commit:11
3->1 MsgAppResp Term:2 Log:0/15 Commit:15
3->2 MsgAppResp Term:2 Log:0/15 Commit:15
> 3 receiving messages
2->3 MsgVote Term:2 Log:1/15
INFO 3 [logterm: 1, index: 15, vote: 0] rejected MsgVote from 2 [logterm: 1, index: 15] at term 2
2->3 MsgFortifyLeader Term:2 Log:0/0
> 3 handling Ready
Ready MustSync=true:
HardState Term:2 Commit:15 Lead:2 LeadEpoch:1
Messages:
3->2 MsgVoteResp Term:2 Log:0/0 Rejected (Hint: 0)
3->2 MsgFortifyLeaderResp Term:2 Log:0/0 LeadEpoch:1

# 3 has applied the snapshot
raft-log 3
----
log is empty: first index=16, last index=15

# Do compaction on 2
compact 2 15
----
2/16 EntryNormal ""


# 2 processes MsgAppResp from 3 with hint 11,
# since compaction already happened on 2, 2 sends MsgSnap 2->3 up to index 16
# 2 processes MsgAppResp from 3 with commit index 15
# This is different from the snapshot just sent by 2, but 2 still sees 3 has
# caught up
# 2 changes follower 3's state in progress tracker to StateReplicate
# from StateSnapshot
# 2 then sends MsgApp 2->3 for entry 16, since 3 is caught up and in state
# replicate from the perspective of 2
stabilize 2
----
> 2 receiving messages
3->2 MsgAppResp Term:2 Log:1/15 Rejected (Hint: 11) Commit:11
DEBUG 2 received MsgAppResp(rejected, hint: (index 11, term 1)) from 3 for index 15
DEBUG 2 decreased progress of 3 to [StateProbe match=0 next=12 sentCommit=11 matchCommit=11]
DEBUG 2 [firstindex: 16, commit: 16] sent snapshot[index: 16, term: 2] to 3 [StateProbe match=0 next=12 sentCommit=11 matchCommit=11]
DEBUG 2 paused sending replication messages to 3 [StateSnapshot match=0 next=17 sentCommit=16 matchCommit=11 paused pendingSnap=16]
3->2 MsgAppResp Term:2 Log:0/15 Commit:15
DEBUG 2 recovered from needing snapshot, resumed sending replication messages to 3 [StateSnapshot match=15 next=17 sentCommit=16 matchCommit=15 paused pendingSnap=16]
3->2 MsgVoteResp Term:2 Log:0/0 Rejected (Hint: 0)
3->2 MsgFortifyLeaderResp Term:2 Log:0/0 LeadEpoch:1
> 2 handling Ready
Ready MustSync=false:
Messages:
2->3 MsgSnap Term:2 Log:0/0
Snapshot: Index:16 Term:2 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[] AutoLeave:false
2->3 MsgApp Term:2 Log:1/15 Commit:16 Entries:[2/16 EntryNormal ""]
2->3 MsgApp Term:2 Log:2/16 Commit:16

status 2
----
1: StateReplicate match=16 next=17 sentCommit=16 matchCommit=16
2: StateReplicate match=16 next=17 sentCommit=15 matchCommit=15
3: StateReplicate match=15 next=17 sentCommit=16 matchCommit=15 inflight=1
