roachtest: perturbation/metamorphic/backfill failed #133155
Removing the release-blocker label, given that earlier today we decided not to treat these metamorphic perturbation test failures as release blockers until the test is a bit better stabilized. @andrewbaptist I'm also assigning this to you, given that we bumped the memory on the machines that run this test yesterday and wanted to come back to this backfill test if it failed again.
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ f9918d8f81a1829df63ac734fd6d21c60141e338:
Parameters:
The test had no write throughput for about 5s in the middle of the test (12:30:21 - 12:30:26).
An example of a slow trace - it appears that r255 is supposed to be on n4, but doesn't make it there successfully. I'm not sure why this doesn't show up in the output as a failure, but that isn't overly important. Here is the log related to the range at the time:
Specifically, n5 (the leaseholder) is trying to remove itself and add n2 as the new replica. n5 sends the lease to n4 at 12:30:19, and we get the warning that it expired before the epoch upgrade. n4 is healthy during this period (no AC throttling, reasonable scheduler latency, ...), and n5 is only slightly less healthy due to AC CPU throttling; its Go scheduler latency is in the microseconds and runnable goroutines are low. Other raft requests are processed very quickly.
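To make the "expired before the epoch upgrade" failure mode concrete, here is a minimal Go sketch, assuming a simplified lease model with hypothetical names (`lease`, `usableAt`) rather than the real kvserver types: a transferred expiration-based lease that is not upgraded to an epoch-based lease in time simply stops being usable.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified lease model (not the real kvserver types): a lease
// transfer hands the target a short expiration-based lease, which the target
// is expected to upgrade to an epoch-based lease before the expiration hits.
type lease struct {
	holder     string
	expiration time.Time
	epochBased bool
}

// usableAt reports whether the lease can serve requests at time t.
func (l lease) usableAt(t time.Time) bool {
	if l.epochBased {
		return true // validity is tied to the node-liveness epoch, not a timestamp
	}
	return t.Before(l.expiration)
}

func main() {
	// Illustrative timestamps only: the transfer lands at 12:30:19 and the
	// expiration-based lease lasts a few seconds.
	transferred := time.Date(2024, 10, 22, 12, 30, 19, 0, time.UTC)
	l := lease{holder: "n4", expiration: transferred.Add(6 * time.Second)}

	// If n4 only gets around to applying/upgrading the lease after the
	// expiration, the transfer is wasted and the range has no usable lease
	// in the meantime, which matches the ~5s write stall observed.
	applied := transferred.Add(8 * time.Second)
	fmt.Println("usable when finally applied:", l.usableAt(applied)) // false
}
```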
To clarify the timeline of concern, the following sequence of events happens:
The major question is why that lease didn't transfer successfully: both n4 and n5 were healthy, and other traffic between them was flowing fine. The network connections seemed healthy, and n4 had no compactions or long GC during that interval. One possibility is that we "forgot" to schedule r255 on n4 during this interval. It is certainly possible that it was quiesced at 12:30:19 (although I don't know how to tell conclusively).
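As a rough illustration of the quiescence hypothesis, here is a minimal sketch, assuming a toy `replica` type rather than the actual kvserver implementation: if r255 was quiesced on n4 at 12:30:19, nothing local would have driven it forward until some raft message arrived to unquiesce it.

```go
package main

import "fmt"

// Hypothetical replica type (not the real kvserver code) illustrating the
// quiescence hypothesis: a quiesced replica skips raft ticks, so no local
// timer drives it until an incoming raft message unquiesces it.
type replica struct {
	quiescent bool
	ticks     int
}

func (r *replica) tick() {
	if r.quiescent {
		return // no ticking: no election timeouts, nothing nudging the range
	}
	r.ticks++
}

func (r *replica) handleRaftMessage() {
	// Receiving any raft message unquiesces the replica so it resumes ticking.
	r.quiescent = false
}

func main() {
	r := &replica{quiescent: true}
	for i := 0; i < 5; i++ {
		r.tick()
	}
	fmt.Println("ticks while quiesced:", r.ticks) // 0

	r.handleRaftMessage()
	r.tick()
	fmt.Println("ticks after unquiescing:", r.ticks) // 1
}
```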
The last log lines on
So,
From the trace that @andrewbaptist linked, the previous lease before
And then:
I wonder though why in this trace we see a bunch of
Based on this line:
I assume n5 has the lease at 12:30:19.758 and then transferred it to
Maybe n5 is no longer the leaseholder at this point; however, it's still finishing its change replicas from before. So this could be a
(NB: this is after it got another snapshot)
@andrewbaptist See also the lease history in |
Recap of timeline:
The interesting bit is:
It could be due to pkg/kv/kvserver/replica_command.go, lines 1402 to 1414 (at 6666054):
(traced back to #89340) The comment matches our scenario: we sent an initial snapshot to
One bit that does not check out is:
So, an alternative hypothesis:
However, between 3 and 4, the lease transfer goes through the proposal buffer, which should reject it based on the same
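For reference, here is a hedged sketch of the kind of check being discussed, using illustrative names (`okToTransferLease`, `followerProgress`) rather than the actual proposal buffer code: the transfer target should be rejected if, from the leader's view, it may still need a Raft snapshot.

```go
package main

import "fmt"

// Illustrative follower-tracking state, loosely mirroring raft's
// StateProbe/StateReplicate/StateSnapshot.
type progressState int

const (
	stateProbe progressState = iota
	stateReplicate
	stateSnapshot
)

type followerProgress struct {
	state progressState
	match uint64 // highest log index known to be replicated on the follower
}

// okToTransferLease sketches the safety condition: don't transfer the lease to
// a follower that is in StateSnapshot or is behind the leader's first index
// (i.e. can only be caught up via a snapshot).
func okToTransferLease(p followerProgress, leaderFirstIndex uint64) bool {
	if p.state == stateSnapshot {
		return false
	}
	return p.match+1 >= leaderFirstIndex
}

func main() {
	// Scenario from the thread: the target got an initial snapshot, but the
	// leader subsequently tracked it as needing another one.
	target := followerProgress{state: stateSnapshot, match: 100}
	fmt.Println("transfer allowed:", okToTransferLease(target, 150)) // false
}
```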
A step back. Looking again at the second snapshot, it actually consists of 2 phases: "queued" (8.34s) and "send" (8.30s):
So the second snapshot was initiated approximately at the same time as the first snapshot was sent from
This explains why the 2 snapshots are just a couple of entries apart. Also, it seems that the first snapshot got there first, and the second one had to wait until the first one completed and the receiver accepted it.
140233: raft: send MsgAppResp to latest leader after handling snapshot r=pav-kv a=hakuuww

raft [does not](https://github.com/cockroachdb/cockroach/blob/55cf17041236ea300c38f50bb55628d28297642f/pkg/raft/tracker/progress.go#L388) send MsgApp probes to a peer whose flow is in StateSnapshot. This stalls replication to that peer until the outstanding snapshot has been streamed.

Previously, when a slow follower (Node 3) received a snapshot from the previous leader (Node 1), it would only send a MsgAppResp back to the original sender (Node 1) (inferred from [raft message.From](https://github.com/cockroachdb/cockroach/blob/145f8b978819e1e137f5a330baa83ce78385300f/pkg/raft/raftpb/raft.proto#L77C24-L77C28)), even if it was aware that leadership had changed to Node 2 ([raft.lead](https://github.com/cockroachdb/cockroach/blob/2083bac24d1453392955dac8bff50cbc6f64cd9d/pkg/raft/raft.go#L365)). This delayed informing the new leader (Node 2) of Node 3's updated state, which can be improved.

To address this, when Node 3 is aware of the new leader (Node 2), the slow follower now sends MsgAppResp to both Node 1 and Node 2 (if their peer IDs differ) upon receiving and applying a snapshot from Node 1. If Node 3 has already acknowledged Node 2 as the new leader, then Node 2 has likely already marked Node 3 as StateSnapshot (transitioning from StateProbe after sending MsgApp and receiving MsgAppResp).

Note: it is possible that Node 3 knows Node 2 is the new leader but Node 3's initial response to Node 2 failed to deliver. The test case included in this PR assumes Node 2 received Node 3's response to its probe. If Node 3's initial response to Node 2 failed to deliver, and the leader Node 2 later sees another MsgAppResp from Node 3 with an up-to-date commit index (resulting from Node 3 processing the snapshot from Node 1), then Node 2 will simply transition Node 3's state to StateReplicate, so everything is still ok and as desired.

With this change, Node 2 can transition Node 3 back to StateReplicate upon receiving the MsgAppResp for the snapshot Node 3 received from Node 1. This optimization prevents unnecessary delays in replication progress and helps reduce potential write latency.

A significant issue (a few seconds of write unavailability/latency spike) can arise if a slow follower (Node 3) becomes the new leaseholder during a leadership change but remains unaware of its lease status. (This is an unlikely corner case, but it has happened in the perturbation/metamorphic/backfill test before: write unavailability of around 5 seconds on the range experiencing the leaseholder and raft leader change. [roachtest link](#133155 (comment)))

A simple example scenario to explain why there can be write unavailability, with the pre-condition that Node 3 is a new learner or slow follower:

- If Node 3 is made the new leaseholder but has not yet received/applied the lease entry, it cannot start serving reads or writes until it learns it became the leaseholder by applying the lease entry. The rest of the cluster can still think Node 3 is the leaseholder if a quorum has applied the new lease entry, and would therefore forward read and write requests to Node 3.
- Node 3 must wait for Node 2 (the new Raft leader) to replicate new entries, which include the lease entry, before it recognizes itself as the leaseholder.
- However, Node 2 will only replicate new entries once it sees Node 3 in StateReplicate, which could be delayed if Node 3 remains in StateSnapshot longer than necessary. Snapshot transfers are relatively slow (a few seconds) due to their large size and network overhead.

This change eliminates the need to wait for an additional snapshot transfer (from Node 2 to Node 3) by allowing the new leader (Node 2) to transition Node 3 back to StateReplicate sooner and start sending MsgApp messages instead of waiting for its own snapshot's response. The optimization applies when the previous leader sent a snapshot to Node 3. Since sending/processing/responding to MsgApp is much faster than sending a snapshot, Node 3 will receive and apply the lease entry sooner, allowing it to recognize its leaseholder status and begin serving reads and writes to the upper layers more quickly.

In conclusion, this optimization reduces the potential read/write latency spike/unavailability in the above scenario. The problem is not completely fixed, since we are still waiting for at least one snapshot to transfer. It would be ideal if we could avoid giving the lease to a slow follower/learner, but that is an issue of its own.

Potential fix for #134257

Epic: None

Release note: None

Co-authored-by: Anthony Xu <[email protected]>
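As a rough illustration of the change described above, here is a simplified Go sketch, assuming toy `raftNode`/`msg` types rather than the actual pkg/raft code: after applying a snapshot from the previous leader, the follower also sends its MsgAppResp to the current leader when the two differ.

```go
package main

import "fmt"

type msg struct {
	typ      string
	from, to uint64
	index    uint64
}

// raftNode is a toy stand-in for a raft peer; only the fields needed for this
// illustration are modeled.
type raftNode struct {
	id     uint64
	lead   uint64 // current known leader (0 if unknown)
	commit uint64
	outbox []msg
}

// handleSnapshot applies a snapshot (elided) and acknowledges it. The new
// behavior: if we already know of a different, newer leader, send the
// MsgAppResp to it as well, so it can move us out of StateSnapshot without
// waiting for its own snapshot to finish.
func (r *raftNode) handleSnapshot(m msg) {
	r.commit = m.index

	resp := msg{typ: "MsgAppResp", from: r.id, to: m.from, index: r.commit}
	r.outbox = append(r.outbox, resp)

	if r.lead != 0 && r.lead != m.from {
		resp.to = r.lead
		r.outbox = append(r.outbox, resp)
	}
}

func main() {
	// Node 3 receives a snapshot from the old leader (Node 1) while already
	// knowing that Node 2 is the new leader.
	n3 := &raftNode{id: 3, lead: 2}
	n3.handleSnapshot(msg{typ: "MsgSnap", from: 1, to: 3, index: 117})
	for _, m := range n3.outbox {
		fmt.Printf("%s -> node %d (index %d)\n", m.typ, m.to, m.index)
	}
}
```

On the leader's side, receiving that MsgAppResp with an up-to-date index is what lets it transition Node 3 back to StateReplicate and start sending MsgApp (including the lease entry) without waiting for a second snapshot.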
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 1e5b3c212b45419c960038718c48a5dd75a111a0:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=16
ROACHTEST_encrypted=false
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_runtimeAssertionsBuild=true
ROACHTEST_ssd=1
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
This test on roachdash
Jira issue: CRDB-43480