
raft: send MsgAppResp to latest leader after handling snapshot #140233

Merged

Conversation


@hakuuww hakuuww commented Jan 31, 2025

raft does not send MsgApp probes to a peer whose flow is in StateSnapshot. This stalls replication to that peer until the outstanding snapshot has been streamed.
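As a minimal illustration of why a StateSnapshot peer stalls, here is a self-contained Go sketch. The names (`ProgressState`, `Progress`, `maybeSendAppend`) are simplified stand-ins for pkg/raft's per-peer progress tracking, not the actual implementation:

```go
package main

// ProgressState models the three flow states a leader tracks per peer.
// The names mirror pkg/raft's tracker states, but this is only a sketch.
type ProgressState int

const (
	StateProbe ProgressState = iota
	StateReplicate
	StateSnapshot
)

// Progress is a stripped-down stand-in for the per-peer progress tracker.
type Progress struct {
	State ProgressState
	Match uint64 // highest log index known to be replicated to the peer
}

// maybeSendAppend sketches the stall: while a snapshot is in flight the
// leader sends no MsgApp (probes or entries) to the peer, so replication
// to it is paused until the snapshot is streamed and acknowledged.
func maybeSendAppend(pr *Progress, sendMsgApp func()) bool {
	if pr.State == StateSnapshot {
		return false // no MsgApp traffic until the snapshot completes
	}
	sendMsgApp()
	return true
}
```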

Previously, when a slow follower (Node 3) received a snapshot from the previous leader (Node 1), it would only send a MsgAppResp back to the original sender (Node 1) (inferred from the raft message's From field), even if it was aware that leadership had changed to Node 2 (via raft.lead). This delayed informing the new leader (Node 2) of Node 3's updated state, which this PR improves.

To address this, when Node 3 is aware of the new leader (Node 2), the slow follower now sends a MsgAppResp to both Node 1 and Node 2 (if their peer IDs differ) upon receiving and applying a snapshot from Node 1.
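A minimal, self-contained sketch of the follower-side change, assuming a simplified node with `lead` and `outbox` fields; the types and field names are illustrative, not the actual pkg/raft code:

```go
package main

const None uint64 = 0 // no known leader

// Message is a stripped-down stand-in for raftpb.Message.
type Message struct {
	Type   string
	From   uint64
	To     uint64
	Index  uint64
	Commit uint64
}

// node models only the fields this sketch needs.
type node struct {
	lead      uint64    // leader this follower currently knows of (raft.lead)
	lastIndex uint64    // last log index after applying the snapshot
	committed uint64    // commit index after applying the snapshot
	outbox    []Message // messages queued for sending
}

func (n *node) send(m Message) { n.outbox = append(n.outbox, m) }

// handleSnapshot applies a snapshot and acks it. Previously only m.From
// (the snapshot's sender, Node 1) was acked; the change also acks the
// current leader (Node 2) when it differs from the sender.
func (n *node) handleSnapshot(m Message) {
	// ... restore the snapshot, updating n.lastIndex and n.committed ...
	resp := Message{Type: "MsgAppResp", From: m.To, To: m.From,
		Index: n.lastIndex, Commit: n.committed}
	n.send(resp) // ack the original sender (Node 1)
	if n.lead != None && n.lead != m.From {
		resp.To = n.lead
		n.send(resp) // also ack the current leader (Node 2)
	}
}
```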

If Node 3 has already acknowledged Node 2 as the new leader, then Node 2 has likely already marked Node 3 as being in StateSnapshot (transitioning from StateProbe after sending a MsgApp and receiving the corresponding MsgAppResp).

Note: it is possible that Node 3 knows Node 2 is the new leader but Node 3's initial response to Node 2 failed to deliver. The test case included in this PR assumes Node 2 received Node 3's response to its probe. Even if that initial response was lost, once leader Node 2 sees another MsgAppResp from Node 3 with an up-to-date commit index (resulting from Node 3 processing the snapshot from Node 1), Node 2 will simply transition Node 3's state to StateReplicate. So the outcome is still correct and desired.

With this change, Node 2 can transition Node 3 back to StateReplicate upon receiving Node 3's MsgAppResp for the snapshot it received from Node 1.
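The leader-side effect can be sketched as follows (again with simplified stand-in types rather than the real stepLeader code): an up-to-date MsgAppResp from a peer in StateSnapshot lets the leader resume normal replication immediately.

```go
package main

type ProgressState int

const (
	StateProbe ProgressState = iota
	StateReplicate
	StateSnapshot
)

type Progress struct {
	State ProgressState
	Match uint64 // highest index known to be replicated to the peer
	Next  uint64 // next index to send
}

// onAppResp sketches the leader handling a MsgAppResp that carries the
// index the follower reached by applying Node 1's snapshot. If that index
// confirms the peer has caught up, the leader moves it out of StateSnapshot
// and can start sending MsgApp again instead of waiting to complete a
// second snapshot transfer of its own.
func onAppResp(pr *Progress, respIndex uint64) {
	if respIndex > pr.Match {
		pr.Match = respIndex
		pr.Next = respIndex + 1
		if pr.State == StateSnapshot {
			pr.State = StateReplicate // resume log replication immediately
		}
	}
}
```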

This optimization prevents unnecessary delays in replication progress and helps reduce potential write latency.

A significant issue (a few seconds of write unavailability or a latency spike) can arise if a slow follower (Node 3) becomes the new leaseholder during a leadership change but remains unaware of its lease status.
(This is an unlikely corner case, but it has happened before in perturbation/metamorphic/backfill tests: about 5 seconds of write unavailability on the range experiencing the leaseholder and raft leader change. roachtest link)

A simple example scenario explains why there can be write unavailability.
Precondition: Node 3 is a new learner or a slow follower.

  • If Node 3 is made the new leaseholder but has not yet received/applied the lease entry, it cannot start serving reads or writes until it applies that entry and learns it is the leaseholder. Meanwhile, the rest of the cluster can already consider Node 3 the leaseholder (once a quorum has applied the new lease entry), so reads and writes are forwarded to Node 3.
  • Node 3 must wait for Node 2 (the new Raft leader) to replicate new entries, including the lease entry, before it recognizes itself as the leaseholder.
  • However, Node 2 only replicates new entries to Node 3 once it sees Node 3 in StateReplicate, which is delayed if Node 3 remains in StateSnapshot longer than necessary.

Snapshot transfers are relatively slow (a few seconds) due to their large size and network overhead.
This change eliminates the wait for an additional snapshot transfer (from Node 2 to Node 3) by allowing the new leader (Node 2) to transition Node 3 back to StateReplicate sooner and start sending MsgApp messages instead of waiting on its own snapshot response. The optimization applies when the previous leader has already sent a snapshot to Node 3.

Since sending, processing, and responding to a MsgApp is much faster than transferring a snapshot, Node 3 will receive and apply the lease entry sooner, allowing it to recognize its leaseholder status and begin serving reads and writes to the upper layers more quickly.

In conclusion, this optimization reduces the potential read/write latency spike or unavailability in the scenario above.

The problem is not completely fixed, since we still wait for at least one snapshot transfer.
Ideally we would avoid giving the lease to a slow follower or learner in the first place, but that is an issue of its own.

Potential fix for #134257

Epic: None

Release note: None


blathers-crl bot commented Jan 31, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member)

This change is Reviewable

@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch 2 times, most recently from 1f96b02 to 7ad78e7 on January 31, 2025 21:23

hakuuww commented Feb 2, 2025

bors ping


craig bot commented Feb 2, 2025

pong

@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch 5 times, most recently from 88c53fd to b48a2d9 on February 4, 2025 00:05
@hakuuww hakuuww requested a review from pav-kv February 4, 2025 00:05
@hakuuww hakuuww changed the title raft: experiment follower sending MsgAppResp back to what it thinks i… raft: Send MsgAppResp to both sender and current leader after applyinf snapshot Feb 4, 2025
@hakuuww hakuuww changed the title raft: Send MsgAppResp to both sender and current leader after applyinf snapshot raft: Send MsgAppResp to both sender and current leader after applying snapshot Feb 4, 2025
@hakuuww hakuuww changed the title raft: Send MsgAppResp to both sender and current leader after applying snapshot raft: Follower sends MsgAppResp to both sender and current leader after applying snapshot Feb 4, 2025
@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch from b48a2d9 to 2083bac on February 4, 2025 14:49
@hakuuww hakuuww changed the title raft: Follower sends MsgAppResp to both sender and current leader after applying snapshot raft: Ensure proper state transition for current leader when follower receives prior leader snapshot Feb 4, 2025
@hakuuww hakuuww changed the title raft: Ensure proper state transition for current leader when follower receives prior leader snapshot (wip) raft: Ensure proper state transition for current leader when follower receives prior leader snapshot Feb 4, 2025
@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch from 2083bac to 3431934 on February 4, 2025 16:10
@@ -243,6 +243,7 @@ stabilize 3
Snapshot Index:15 Term:1 ConfState:Voters:[1 2 3] VotersOutgoing:[] Learners:[] LearnersNext:[] AutoLeave:false
Messages:
3->2 MsgAppResp Term:2 Log:1/15 Rejected (Hint: 11) Commit:11
3->2 MsgAppResp Term:2 Log:0/15 Commit:15
@hakuuww (Contributor Author) commented:

The slow follower Node 3 sends MsgAppResp to the current leader after applying the snapshot from previous leader.

@hakuuww hakuuww changed the title (wip) raft: Ensure proper state transition for current leader when follower receives prior leader snapshot raft: Ensure proper state transition for current leader when follower receives prior leader snapshot Feb 4, 2025
@hakuuww hakuuww marked this pull request as ready for review February 4, 2025 16:23
@hakuuww hakuuww requested a review from a team as a code owner February 4, 2025 16:23
1->2 MsgAppResp Term:2 Log:0/16 Commit:15
1->2 MsgAppResp Term:2 Log:0/16 Commit:16

# Drop unnecessary msgs
A collaborator commented:

nit: these messages are not unnecessary; maybe be more specific about what we're trying to do here?

Do we need to drop MsgVote btw? Would the test pass with it being delivered? Same question re MsgFortifyLeader.

Generally, it's good to not drop messages in these tests unless absolutely necessary to achieve what we're trying to achieve. We want to be as close to reality as possible: most messages are delivered, but maybe some are delayed.

@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch from 3431934 to 212c3da on February 7, 2025 00:04
@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch 2 times, most recently from 7eea2e7 to edc66cd on February 14, 2025 15:49
@hakuuww hakuuww changed the title raft: Ensure proper state transition for current leader when follower receives prior leader snapshot raft: send MsgAppResp to latest leader after handling snapshot Feb 18, 2025
@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch from edc66cd to 2626cf5 on February 18, 2025 18:52

hakuuww commented Feb 18, 2025

TYFTR!

bors r=pav-kv


hakuuww commented Feb 19, 2025

bors retry


craig bot commented Feb 19, 2025

try

Already running a review

…and original sender

Previously, in the special scenario of a leader sending a snapshot to a follower, followed by a leadership change, the receiver of the snapshot did not send a MsgAppResp to the new leader. This could delay the new leader in catching up this follower.

This commit resolves that issue by having the follower (snapshot receiver) send MsgAppResp to both the original sender of the snapshot and the new leader.

References: cockroachdb#134257

Epic: None
Release note: None
@hakuuww hakuuww force-pushed the experimentLeaderHandlingOfMsgAppRespSnapshot branch from 2626cf5 to c8c1b56 on February 20, 2025 17:02

hakuuww commented Feb 20, 2025

bors retry


craig bot commented Feb 20, 2025

try

Already running a review


hakuuww commented Feb 20, 2025

bors r=pav-kv

@craig craig bot merged commit 3690e14 into cockroachdb:master Feb 20, 2025
24 checks passed