raft: send MsgAppResp to latest leader after handling snapshot #140233
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from 1f96b02 to 7ad78e7
bors ping
pong
Force-pushed from 88c53fd to b48a2d9
Force-pushed from b48a2d9 to 2083bac
Force-pushed from 2083bac to 3431934
@@ -243,6 +243,7 @@ stabilize 3
Snapshot Index:15 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[] AutoLeave:false
Messages:
3->2 MsgAppResp Term:2 Log:1/15 Rejected (Hint: 11) Commit:11
3->2 MsgAppResp Term:2 Log:0/15 Commit:15
The slow follower (Node 3) sends MsgAppResp to the current leader after applying the snapshot from the previous leader.
1->2 MsgAppResp Term:2 Log:0/16 Commit:15
1->2 MsgAppResp Term:2 Log:0/16 Commit:16

# Drop unnecessary msgs
nit: these messages are not unnecessary; maybe be more specific about what we're trying to do here?
Do we need to drop MsgVote btw? Would the test pass with it being delivered? Same question re MsgFortifyLeader.
Generally, it's good to not drop messages in these tests unless absolutely necessary to achieve what we're trying to achieve. We want to be as close to reality as possible: most messages are delivered, but maybe some are delayed.
Force-pushed from 3431934 to 212c3da
Force-pushed from 7eea2e7 to edc66cd
Force-pushed from edc66cd to 2626cf5
TYFTR! bors r=pav-kv
bors retry
Already running a review
Epic: None
Release note: None
…and original sender Previously, in the special scenario of a leader sending a snapshot to a follower, followed by a leadership change, the receiver of the snapshot did not send MsgAppResp to the new leader. This could delay the new leader in catching up this follower. This commit resolves that issue by having the follower (the snapshot receiver) send MsgAppResp to both the original sender of the snapshot and the new leader.
References: cockroachdb#134257
Epic: None
Release note: None
Force-pushed from 2626cf5 to c8c1b56
bors retry
Already running a review
bors r=pav-kv
raft does not send MsgApp probes to a peer if its flow is in StateSnapshot. This stalls replication to this peer until the outstanding snapshot has been streamed.
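For illustration, here is a minimal, self-contained Go sketch of that pausing behavior (ProgressState, Progress, and maybeSendAppend are simplified stand-ins, not the actual pkg/raft types): a follower whose flow is in StateSnapshot is treated as paused and receives no MsgApp until the snapshot completes.

```go
// Minimal, self-contained sketch (not the actual pkg/raft code) of why a
// follower in StateSnapshot receives no MsgApp probes: the leader checks the
// follower's flow state before appending, and a snapshot flow counts as paused.
package main

import "fmt"

type ProgressState int

const (
	StateProbe ProgressState = iota
	StateReplicate
	StateSnapshot
)

// Progress is a simplified stand-in for the leader's per-follower tracker.
type Progress struct {
	State ProgressState
}

// isPaused mirrors the idea that a StateSnapshot flow accepts no MsgApp.
func (p *Progress) isPaused() bool { return p.State == StateSnapshot }

// maybeSendAppend sketches the leader-side gate: no MsgApp while the
// outstanding snapshot is being streamed.
func maybeSendAppend(to int, pr *Progress) bool {
	if pr.isPaused() {
		fmt.Printf("peer %d is in StateSnapshot; skipping MsgApp\n", to)
		return false
	}
	fmt.Printf("sending MsgApp to peer %d\n", to)
	return true
}

func main() {
	maybeSendAppend(3, &Progress{State: StateSnapshot}) // replication to 3 stalls
	maybeSendAppend(2, &Progress{State: StateReplicate})
}
```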
Previously, when a slow follower (Node 3) received a snapshot from the previous leader (Node 1), it would only send a MsgAppResp back to the original sender (Node 1), inferred from the raft message.From field, even if it already knew that leadership had changed to Node 2 (raft.lead). This delayed informing the new leader (Node 2) of Node 3's updated state, which can be improved.
To address this, when Node 3 is aware of the new leader (Node 2), the slow follower now sends MsgAppResp to both Node 1 and Node 2 (if their peer ids differ) upon receiving and applying a snapshot from Node 1.
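As a rough sketch of the follower-side behavior described above (handleSnapshot, raftNode, and Message here are hypothetical, simplified stand-ins rather than the real pkg/raft code): after applying the snapshot, the follower acknowledges both the original sender and, if known and different, the current leader.

```go
// Hedged sketch of the behavior this PR describes: after applying a snapshot,
// the follower sends MsgAppResp to the snapshot's sender and, when it already
// knows of a different leader, to that leader as well.
package main

import "fmt"

const None uint64 = 0

type Message struct {
	From, To uint64
	Type     string
	Index    uint64
}

type raftNode struct {
	id       uint64
	lead     uint64 // latest known leader
	lastIdx  uint64 // last index after applying the snapshot
	outbound []Message
}

func (r *raftNode) send(m Message) { r.outbound = append(r.outbound, m) }

// handleSnapshot applies the snapshot and responds with MsgAppResp to the
// sender; with this change, it also responds to the current leader when the
// leader is known and differs from the sender.
func (r *raftNode) handleSnapshot(m Message, snapshotIndex uint64) {
	r.lastIdx = snapshotIndex // pretend the snapshot was applied
	r.send(Message{From: r.id, To: m.From, Type: "MsgAppResp", Index: r.lastIdx})
	if r.lead != None && r.lead != m.From {
		r.send(Message{From: r.id, To: r.lead, Type: "MsgAppResp", Index: r.lastIdx})
	}
}

func main() {
	// Node 3 got a snapshot from the old leader (1) but already knows 2 leads.
	n3 := &raftNode{id: 3, lead: 2}
	n3.handleSnapshot(Message{From: 1, To: 3, Type: "MsgSnap"}, 15)
	for _, m := range n3.outbound {
		fmt.Printf("%d->%d %s Index:%d\n", m.From, m.To, m.Type, m.Index)
	}
	// Prints two acks, 3->1 and 3->2, matching the new test expectation above.
}
```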
If Node 3 has already acknowledged Node 2 as the new leader, then Node 2 has likely already marked Node 3 as StateSnapshot (transitioning from StateProbe after sending MsgApp and receiving the MsgAppResp).
Note: it is possible that Node 3 knows Node 2 is the new leader but Node 3's initial response to Node 2 failed to deliver. For the test case included in this PR, we assume Node 2 received Node 3's response to the probe. Even if Node 3's initial response to Node 2 failed to deliver, when the leader Node 2 later sees a MsgAppResp from Node 3 with an up-to-date commit index (resulting from Node 3 processing the snapshot from Node 1), it will simply transition Node 3's state to StateReplicate, so the outcome is still correct and desired.
With this change, Node 2 can now transition Node 3 back to StateReplicate upon receiving the MsgAppResp for the snapshot that Node 3 received from Node 1.
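A hedged sketch of that leader-side effect, again with simplified stand-in types rather than the real progress tracker: an MsgAppResp showing the follower has caught up to the pending snapshot index lets the leader move it from StateSnapshot back to StateReplicate.

```go
// Hedged sketch (simplified, not the actual pkg/raft progress tracker) of the
// leader-side effect: an up-to-date MsgAppResp from a StateSnapshot follower
// lets the leader resume normal replication without waiting for its own
// snapshot transfer to finish.
package main

import "fmt"

type ProgressState int

const (
	StateProbe ProgressState = iota
	StateReplicate
	StateSnapshot
)

type Progress struct {
	State           ProgressState
	Match           uint64
	PendingSnapshot uint64
}

// onAppResp sketches the transition: if the acked index shows the follower has
// caught up to (or past) the snapshot the leader was waiting on, switch it
// back to StateReplicate so MsgApp can be streamed again.
func (p *Progress) onAppResp(index uint64) {
	if index > p.Match {
		p.Match = index
	}
	if p.State == StateSnapshot && p.Match >= p.PendingSnapshot {
		p.State = StateReplicate
	}
}

func main() {
	// Leader 2 had marked follower 3 as needing a snapshot up to index 15.
	pr3 := &Progress{State: StateSnapshot, PendingSnapshot: 15}
	// Follower 3 acks index 15 after applying the snapshot it got from node 1.
	pr3.onAppResp(15)
	fmt.Println(pr3.State == StateReplicate) // true: MsgApp can flow again
}
```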
This optimization prevents unnecessary delays in replication progress and helps reduce potential write latency.
A significant issue (a few seconds of write unavailability/latency spike) can arise if a slow follower (Node 3) becomes the new leaseholder during a leadership change but remains unaware of its lease status.
(This is an unlikely corner case, but it has happened in perturbation/metamorphic/backfill tests before: write unavailability of around 5 seconds on the range experiencing the leaseholder and raft leader change. roachtest link)
A simple example scenario to explain why there can be a write unavailability:
Some pre-conditions: Node 3 is a new learner or a slow follower.
Snapshot transfers are relatively slow (a few seconds) due to their large size and network overhead.
This change eliminates the need to wait for an additional snapshot transfer (from Node 2 to Node 3) by allowing the new leader (Node 2) to transition Node 3 back to StateReplicate sooner and start sending MsgApp messages instead of waiting for its own snapshot response. The optimization applies if the previous leader sent a snapshot to Node 3.
Since sending, processing, and responding to MsgApp is much faster than transferring a snapshot, Node 3 will receive and apply the lease entry sooner, allowing it to recognize its leaseholder status and begin serving reads and writes to the upper layers more quickly.
In conclusion, this optimization reduces potential write and read latency spikes/unavailability in the above scenario.
The problem is not completely fixed, since we are still waiting for at least one snapshot to transfer.
Ideally, we would avoid giving the lease to a slow follower/learner in the first place; that is an issue of its own.
Potential fix for #134257
Epic: None
Release note: None