
raft: send MsgAppResp to latest leader after handling snapshot #140233

Merged

Conversation


@hakuuww hakuuww commented Jan 31, 2025

raft does not send MsgApp probes to a peer whose flow is in StateSnapshot. This stalls replication to that peer until the outstanding snapshot has been streamed.
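As a minimal illustration of why a StateSnapshot peer stalls, here is a self-contained Go sketch. The names (`ProgressState`, `Progress`, `maybeSendAppend`) are simplified stand-ins for pkg/raft's per-peer progress tracking, not the actual implementation:

```go
package main

// ProgressState models the three flow states a leader tracks per peer.
// The names mirror pkg/raft's tracker states, but this is only a sketch.
type ProgressState int

const (
	StateProbe ProgressState = iota
	StateReplicate
	StateSnapshot
)

// Progress is a stripped-down stand-in for the per-peer progress tracker.
type Progress struct {
	State ProgressState
	Match uint64 // highest log index known to be replicated to the peer
}

// maybeSendAppend sketches the stall: while a snapshot is in flight the
// leader sends no MsgApp (probes or entries) to the peer, so replication
// to it is paused until the snapshot is streamed and acknowledged.
func maybeSendAppend(pr *Progress, sendMsgApp func()) bool {
	if pr.State == StateSnapshot {
		return false // no MsgApp traffic until the snapshot completes
	}
	sendMsgApp()
	return true
}
```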

Previously, when a slow follower (Node 3) received a snapshot from the previous leader (Node 1), it would only send a MsgAppResp back to the original sender (Node 1) (inferred from the raft message's From field), even if it was aware that leadership had changed to Node 2 (via raft.lead). This delayed informing the new leader (Node 2) of Node 3's updated state, which this PR improves.

To address this, when Node 3 is aware of the new leader (Node 2), the slow follower now sends a MsgAppResp to both Node 1 and Node 2 (if their peer IDs differ) upon receiving and applying a snapshot from Node 1.
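A minimal, self-contained sketch of the follower-side change, assuming a simplified node with `lead` and `outbox` fields; the types and field names are illustrative, not the actual pkg/raft code:

```go
package main

const None uint64 = 0 // no known leader

// Message is a stripped-down stand-in for raftpb.Message.
type Message struct {
	Type   string
	From   uint64
	To     uint64
	Index  uint64
	Commit uint64
}

// node models only the fields this sketch needs.
type node struct {
	lead      uint64    // leader this follower currently knows of (raft.lead)
	lastIndex uint64    // last log index after applying the snapshot
	committed uint64    // commit index after applying the snapshot
	outbox    []Message // messages queued for sending
}

func (n *node) send(m Message) { n.outbox = append(n.outbox, m) }

// handleSnapshot applies a snapshot and acks it. Previously only m.From
// (the snapshot's sender, Node 1) was acked; the change also acks the
// current leader (Node 2) when it differs from the sender.
func (n *node) handleSnapshot(m Message) {
	// ... restore the snapshot, updating n.lastIndex and n.committed ...
	resp := Message{Type: "MsgAppResp", From: m.To, To: m.From,
		Index: n.lastIndex, Commit: n.committed}
	n.send(resp) // ack the original sender (Node 1)
	if n.lead != None && n.lead != m.From {
		resp.To = n.lead
		n.send(resp) // also ack the current leader (Node 2)
	}
}
```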

If Node 3 has already acknowledged Node 2 as the new leader, then Node 2 has likely already marked Node 3 as being in StateSnapshot (transitioning from StateProbe after sending a MsgApp and receiving the corresponding MsgAppResp).

Note: it is possible that Node 3 knows Node 2 is the new leader but Node 3's initial response to Node 2 failed to deliver. The test case included in this PR assumes Node 2 received Node 3's response to its probe. Even if that initial response was lost, once leader Node 2 sees another MsgAppResp from Node 3 with an up-to-date commit index (resulting from Node 3 processing the snapshot from Node 1), Node 2 will simply transition Node 3's state to StateReplicate. So the outcome is still correct and desired.

With this change, Node 2 can transition Node 3 back to StateReplicate upon receiving Node 3's MsgAppResp for the snapshot it received from Node 1.
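The leader-side effect can be sketched as follows (again with simplified stand-in types rather than the real stepLeader code): an up-to-date MsgAppResp from a peer in StateSnapshot lets the leader resume normal replication immediately.

```go
package main

type ProgressState int

const (
	StateProbe ProgressState = iota
	StateReplicate
	StateSnapshot
)

type Progress struct {
	State ProgressState
	Match uint64 // highest index known to be replicated to the peer
	Next  uint64 // next index to send
}

// onAppResp sketches the leader handling a MsgAppResp that carries the
// index the follower reached by applying Node 1's snapshot. If that index
// confirms the peer has caught up, the leader moves it out of StateSnapshot
// and can start sending MsgApp again instead of waiting to complete a
// second snapshot transfer of its own.
func onAppResp(pr *Progress, respIndex uint64) {
	if respIndex > pr.Match {
		pr.Match = respIndex
		pr.Next = respIndex + 1
		if pr.State == StateSnapshot {
			pr.State = StateReplicate // resume log replication immediately
		}
	}
}
```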

This optimization prevents unnecessary delays in replication progress and helps reduce potential write latency.

A significant issue (a few seconds of write unavailability or a latency spike) can arise if a slow follower (Node 3) becomes the new leaseholder during a leadership change but remains unaware of its lease status.
(This is an unlikely corner case, but it has happened before in perturbation/metamorphic/backfill tests: about 5 seconds of write unavailability on the range experiencing the leaseholder and raft leader change. roachtest link)

A simple example scenario explains why there can be write unavailability.
Precondition: Node 3 is a new learner or a slow follower.

  • If Node 3 is made the new leaseholder but has not yet received/applied the lease entry, it cannot start serving reads or writes until it applies that entry and learns it is the leaseholder. Meanwhile, the rest of the cluster can already consider Node 3 the leaseholder (once a quorum has applied the new lease entry), so reads and writes are forwarded to Node 3.
  • Node 3 must wait for Node 2 (the new Raft leader) to replicate new entries, including the lease entry, before it recognizes itself as the leaseholder.
  • However, Node 2 only replicates new entries to Node 3 once it sees Node 3 in StateReplicate, which is delayed if Node 3 remains in StateSnapshot longer than necessary.

Snapshot transfers are relatively slow (a few seconds) due to their large size and network overhead.
This change eliminates the wait for an additional snapshot transfer (from Node 2 to Node 3) by allowing the new leader (Node 2) to transition Node 3 back to StateReplicate sooner and start sending MsgApp messages instead of waiting on its own snapshot response. The optimization applies when the previous leader has already sent a snapshot to Node 3.

Since sending, processing, and responding to a MsgApp is much faster than transferring a snapshot, Node 3 will receive and apply the lease entry sooner, allowing it to recognize its leaseholder status and begin serving reads and writes to the upper layers more quickly.

In conclusion, this optimization reduces the potential read/write latency spike or unavailability in the scenario above.

The problem is not completely fixed, since we still wait for at least one snapshot transfer.
Ideally we would avoid giving the lease to a slow follower or learner in the first place, but that is an issue of its own.

Potential fix for #134257

Epic: None

Release note: None


blathers-crl bot commented Jan 31, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member)

This change is Reviewable

@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch 2 times, most recently from 1f96b02 to 7ad78e7 on January 31, 2025 21:23

hakuuww commented Feb 2, 2025

bors ping


craig bot commented Feb 2, 2025

pong

@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch 5 times, most recently from 88c53fd to b48a2d9 on February 4, 2025 00:05
@hakuuww hakuuww requested a review from pav-kv February 4, 2025 00:05
@hakuuww hakuuww changed the title raft: experiment follower sending MsgAppResp back to what it thinks i… raft: Send MsgAppResp to both sender and current leader after applyinf snapshot Feb 4, 2025
@hakuuww hakuuww changed the title raft: Send MsgAppResp to both sender and current leader after applyinf snapshot raft: Send MsgAppResp to both sender and current leader after applying snapshot Feb 4, 2025
@hakuuww hakuuww changed the title raft: Send MsgAppResp to both sender and current leader after applying snapshot raft: Follower sends MsgAppResp to both sender and current leader after applying snapshot Feb 4, 2025
@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch from b48a2d9 to 2083bac on February 4, 2025 14:49
@hakuuww hakuuww changed the title raft: Follower sends MsgAppResp to both sender and current leader after applying snapshot raft: Ensure proper state transition for current leader when follower receives prior leader snapshot Feb 4, 2025
@hakuuww hakuuww changed the title raft: Ensure proper state transition for current leader when follower receives prior leader snapshot (wip) raft: Ensure proper state transition for current leader when follower receives prior leader snapshot Feb 4, 2025
@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch from 2083bac to 3431934 on February 4, 2025 16:10
@@ -243,6 +243,7 @@ stabilize 3
Snapshot Index:15 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[] AutoLeave:false
Messages:
3->2 MsgAppResp Term:2 Log:1/15 Rejected (Hint: 11) Commit:11
3->2 MsgAppResp Term:2 Log:0/15 Commit:15
@hakuuww (Contributor Author) commented:

The slow follower Node 3 sends MsgAppResp to the current leader after applying the snapshot from previous leader.

@hakuuww hakuuww changed the title (wip) raft: Ensure proper state transition for current leader when follower receives prior leader snapshot raft: Ensure proper state transition for current leader when follower receives prior leader snapshot Feb 4, 2025
@hakuuww hakuuww marked this pull request as ready for review February 4, 2025 16:23
@hakuuww hakuuww requested a review from a team as a code owner February 4, 2025 16:23
1->2 MsgAppResp Term:2 Log:0/16 Commit:15
1->2 MsgAppResp Term:2 Log:0/16 Commit:16

# Drop unnecessary msgs
A collaborator commented:

nit: these messages are not unnecessary; maybe be more specific about what we're trying to do here?

Do we need to drop MsgVote btw? Would the test pass with it being delivered? Same question re MsgFortifyLeader.

Generally, it's good to not drop messages in these tests unless absolutely necessary to achieve what we're trying to achieve. We want to be as close to reality as possible: most messages are delivered, but maybe some are delayed.

@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch from 3431934 to 212c3da on February 7, 2025 00:04
@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch 2 times, most recently from 7eea2e7 to edc66cd on February 14, 2025 15:49
@hakuuww hakuuww changed the title raft: Ensure proper state transition for current leader when follower receives prior leader snapshot raft: send MsgAppResp to latest leader after handling snapshot Feb 18, 2025
@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch from edc66cd to 2626cf5 on February 18, 2025 18:52

hakuuww commented Feb 18, 2025

TYFTR!

bors r=pav-kv


hakuuww commented Feb 19, 2025

bors retry


craig bot commented Feb 19, 2025

try

Already running a review

…and original sender

Previously, in the special scenario of a leader sending a snapshot to a follower, followed by a leadership change, the receiver of the snapshot did not send a MsgAppResp to the new leader. This could delay the new leader in catching up this follower.

This commit resolves that issue by having the follower (snapshot receiver) send MsgAppResp to both the original sender of the snapshot and the new leader.

References: cockroachdb#134257

Epic: None
Release note: None
@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch from 2626cf5 to c8c1b56 on February 20, 2025 17:02

hakuuww commented Feb 20, 2025

bors retry


craig bot commented Feb 20, 2025

try

Already running a review


hakuuww commented Feb 20, 2025

bors r=pav-kv

@craig craig bot merged commit 3690e14 into cockroachdb:master Feb 20, 2025
24 checks passed