
roachtest: perturbation/metamorphic/backfill failed #133155

Closed
cockroach-teamcity opened this issue Oct 22, 2024 · 13 comments
Assignees
Labels
B-runtime-assertions-enabled branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 22, 2024

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ 1e5b3c212b45419c960038718c48a5dd75a111a0:

(assertions.go:363).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:857
	            				main/pkg/cmd/roachtest/test_runner.go:1281
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/metamorphic/backfill
	Messages:   	FAILURE: follower-read  : Increase 69.8096 > 40.0000 BASE: 5.460481ms SCORE: 381.194166ms
	            	
	            	FAILURE: read           : Increase 67.4791 > 40.0000 BASE: 5.402082ms SCORE: 364.527724ms
	            	
	            	FAILURE: follower-read  : Increase 659282.5797 > 40.0000 BASE: 5.460481ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 666409.7287 > 40.0000 BASE: 5.402082ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 311381.0554 > 40.0000 BASE: 11.561397ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1
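For reference, the pass/fail arithmetic here appears to be a simple ratio check: assuming the reported Increase is SCORE/BASE (which matches the numbers above, e.g. 381.194166 / 5.460481 ≈ 69.81), a minimal Go sketch of the check looks like this (an illustration, not the roachtest code):

package main

import "fmt"

func main() {
	base := 5.460481    // ms, baseline follower-read latency (BASE)
	score := 381.194166 // ms, latency during the perturbation (SCORE)
	threshold := 40.0   // allowed multiple of the baseline

	increase := score / base
	fmt.Printf("increase=%.4f threshold=%.4f fail=%v\n", increase, threshold, increase > threshold)
	// Prints approximately: increase=69.8096 threshold=40.0000 fail=true
}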

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=true
  • ROACHTEST_ssd=1
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-43480

@cockroach-teamcity cockroach-teamcity added B-runtime-assertions-enabled branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Oct 22, 2024
@arulajmani arulajmani removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Oct 23, 2024
@arulajmani
Collaborator

Removing the release-blocker label, given that earlier today we decided not to treat these metamorphic perturbation test failures as release blockers until we have them a bit better stabilized. @andrewbaptist I'm also assigning this to you, given that we bumped the memory on the machines that run this test yesterday and wanted to come back to this backfill test if it failed again.

@cockroach-teamcity
Member Author

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.perturbation/metamorphic/backfill failed with artifacts on master @ f9918d8f81a1829df63ac734fd6d21c60141e338:

(assertions.go:363).Fail: 
	Error Trace:	pkg/cmd/roachtest/tests/admission_control_latency.go:913
	            				pkg/cmd/roachtest/test_runner.go:1287
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/metamorphic/backfill
	Messages:   	FAILURE: follower-read  : Increase 220029.2101 > 40.0000 BASE: 16.361464ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 150945.0481 > 40.0000 BASE: 23.849739ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 194491.6507 > 40.0000 BASE: 18.509792ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/backfill/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=true
  • ROACHTEST_ssd=1
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@andrewbaptist
Contributor

andrewbaptist commented Nov 1, 2024

The test had no write throughput for about 5s in the middle of the test (12:30:21 - 12:30:26)

{"time":"2024-11-01T12:30:21.870819182Z","errs":0,"avgt":104.0,"avgl":3996.1,"p50l":1.7,"p95l":17.8,"p99l":24.1,"maxl":25.2,"type":"write"}
{"time":"2024-11-01T12:30:22.870930461Z","errs":0,"avgt":0.0,"avgl":3990.9,"p50l":0.0,"p95l":0.0,"p99l":0.0,"maxl":0.0,"type":"write"}
{"time":"2024-11-01T12:30:23.870720455Z","errs":0,"avgt":0.0,"avgl":3985.8,"p50l":0.0,"p95l":0.0,"p99l":0.0,"maxl":0.0,"type":"write"}
{"time":"2024-11-01T12:30:24.871413192Z","errs":0,"avgt":0.0,"avgl":3980.7,"p50l":0.0,"p95l":0.0,"p99l":0.0,"maxl":0.0,"type":"write"}
{"time":"2024-11-01T12:30:25.87117462Z","errs":0,"avgt":0.0,"avgl":3975.6,"p50l":0.0,"p95l":0.0,"p99l":0.0,"maxl":0.0,"type":"write"}
{"time":"2024-11-01T12:30:26.871456672Z","errs":0,"avgt":760.8,"avgl":3971.4,"p50l":2.2,"p95l":6710.9,"p99l":6979.3,"maxl":6979.3,"type":"write"}
{"time":"2024-11-01T12:30:27.87243092Z","errs":0,"avgt":3510.6,"avgl":3970.9,"p50l":2.1,"p95l":805.3,"p99l":7516.2,"maxl":8053.1,"type":"write"}
{"time":"2024-11-01T12:30:28.871128407Z","errs":0,"avgt":3991.2,"avgl":3970.9,"p50l":2.0,"p95l":6.3,"p99l":23.1,"maxl":8053.1,"type":"write"}
{"time":"2024-11-01T12:30:29.871054931Z","errs":0,"avgt":4041.3,"avgl":3971.0,"p50l":2.1,"p95l":6.0,"p99l":12.6,"maxl":37.7,"type":"write"}

https://grafana.testeng.crdb.io/d/crdb-home/crdb-home?orgId=1&var-DS_PROMETHEUS=Prometheus&var-test_run_id=teamcity-17542157&var-test_name=perturbation-metamorphic-backfill&from=1730464080000&to=1730464320000

An example of a slow trace: it appears that r255 is supposed to be on n4, but doesn't make it there successfully.
https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_RoachtestNightlyGceBazel/17542157:id/perturbation/metamorphic/backfill/run_1/artifacts.zip!/stmtbundle/2024-11-01T12_30_27Z-UPSERT-7570.817ms.zip!/trace.txt
2024-11-01T12_30_27Z-UPSERT-7570.817ms.zip

I'm not sure why this doesn't show up in the output as a failure, but that isn't overly important. Here is the log related to the range at the time:

241101 12:30:11.365505 2487255 13@kv/kvserver/allocator/plan/replicate.go:841 ⋮ [T1,Vsystem,n5,replicate,s5,r255/5:‹/Table/109/1/-1{22482…-}›] 297  rebalancing voter n5,s5 to n2,s2: [1:17587, 5*:17587, 4:17587]
241101 12:30:11.365830 2487255 13@kv/kvserver/replica_command.go:2494 ⋮ [T1,Vsystem,n5,replicate,s5,r255/5:‹/Table/109/1/-1{22482…-}›] 298  change replicas (add [(n2,s2):6LEARNER] remove []): existing descriptor r255:‹/Table/109/1/-1{224827050299356085-}› [(n3,s3):1, (n5,s5):5, (n4,s4):4, next=6, gen=69]
241101 12:30:19.707332 2487323 13@kv/kvserver/store_snapshot.go:2137 ⋮ [T1,Vsystem,n5,s5,r255/5:‹/Table/109/1/-1{22482…-}›] 299  streamed snapshot 64bea5dd at applied index 17604 to (n2,s2):6LEARNER with 266 MiB in 8.29s @ 32 MiB/s: kvs=510638 rangeKVs=0 sharedSSTs=0, externalSSTs=0, rate-limit: 32 MiB/s, queued: 0.01s
241101 12:30:19.758280 2487255 13@kv/kvserver/replica_command.go:2494 ⋮ [T1,Vsystem,n5,replicate,s5,r255/5:‹/Table/109/1/-1{22482…-}›] 300  change replicas (add [(n2,s2):6VOTER_INCOMING] remove [(n5,s5):5VOTER_DEMOTING_LEARNER]): existing descriptor r255:‹/Table/109/1/-1{224827050299356085-}› [(n3,s3):1, (n5,s5):5, (n4,s4):4, (n2,s2):6LEARNER, next=7, gen=70]
241101 12:30:25.789415 247 2@kv/kvserver/replica_proposal.go:649 ⋮ [T1,Vsystem,n4,s4,r255/4:‹/Table/109/1/-1{22482…-}›,raft] 22782  lease expired before epoch/leader lease upgrade, client traffic may have been delayed [lease=repl=(n4,s4):4 seq=19 start=1730464225.763279802,1 epo=1 min-exp=0,0 pro=1730464225.786728250,0 prev=repl=(n2,s2):6VOTER_INCOMING seq=18 start=1730464219.763694984,0 exp=1730464225.763279802,0 pro=1730464219.763279802,0 acquisition-type=Request]
241101 12:30:26.917010 2487255 13@kv/kvserver/replica_command.go:2494 ⋮ [T1,Vsystem,n5,replicate,s5,r255/5:‹/Table/109/1/-1{22482…-}›] 301  change replicas (add [] remove []): existing descriptor r255:‹/Table/109/1/-1{224827050299356085-}› [(n3,s3):1, (n5,s5):5VOTER_DEMOTING_LEARNER, (n4,s4):4, (n2,s2):6VOTER_INCOMING, next=7, gen=71]
241101 12:30:26.919790 2487255 13@kv/kvserver/replica_command.go:2494 ⋮ [T1,Vsystem,n5,replicate,s5,r255/5:‹/Table/109/1/-1{22482…-}›] 302  change replicas (add [] remove []): existing descriptor r255:‹/Table/109/1/-1{224827050299356085-}› [(n3,s3):1, (n5,s5):5VOTER_DEMOTING_LEARNER, (n4,s4):4, (n2,s2):6VOTER_INCOMING, next=7, gen=71]
241101 12:30:26.923875 4066606 13@kv/kvserver/replica_raft.go:403 ⋮ [T1,Vsystem,n4,s4,r255/4:‹/Table/109/1/-1{22482…-}›] 784  proposing LEAVE_JOINT: after=[(n3,s3):1 (n5,s5):5LEARNER (n4,s4):4 (n2,s2):6] next=7
241101 12:30:26.925587 2487255 13@kv/kvserver/replica_command.go:2494 ⋮ [T1,Vsystem,n5,replicate,s5,r255/5:‹/Table/109/1/-1{22482…-}›] 303  change replicas (add [] remove [(n5,s5):5LEARNER]): existing descriptor r255:‹/Table/109/1/-1{224827050299356085-}› [(n3,s3):1, (n5,s5):5LEARNER, (n4,s4):4, (n2,s2):6, next=7, gen=72]
241101 12:30:26.932677 4066658 13@kv/kvserver/replica_raft.go:403 ⋮ [T1,Vsystem,n4,s4,r255/4:‹/Table/109/1/-1{22482…-}›] 785  proposing SIMPLE(r5) [(n5,s5):5LEARNER]: after=[(n3,s3):1 (n2,s2):6 (n4,s4):4] next=7
241101 12:30:26.935771 243 kv/kvserver/store_remove_replica.go:152 ⋮ [T1,Vsystem,n5,s5,r255/5:‹/Table/109/1/-1{22482…-}›,raft] 15694  removing replica r255/5
241101 12:30:28.053291 4047364 13@kv/kvserver/store_snapshot.go:2137 ⋮ [T1,Vsystem,n4,s4,r255/4:‹/Table/109/1/-1{22482…-}›] 786  streamed snapshot 9bd9fdeb at applied index 17606 to (n2,s2):6LEARNER with 266 MiB in 8.30s @ 32 MiB/s: kvs=510640 rangeKVs=0 sharedSSTs=0, externalSSTs=0, rate-limit: 32 MiB/s, queued: 8.34s

Specifically, n5 (the leaseholder) is trying to remove itself and add n2 as the new replica. n5 sends the lease to n4 at 12:30:19, and we get the warning that it expired before the epoch upgrade. n4 is healthy during this period (no AC throttling, reasonable scheduler latency, ...), while n5 is slightly less healthy due to AC CPU throttling; however, its Go scheduler latency is in the microseconds and runnable goroutines are low. Other raft requests are processed very quickly.

@andrewbaptist
Contributor

To clarify the timeline of concern, the following sequence of events happens:

12:30:11.365505 - n5 - (leaseholder) allocator decides to transfer replica from n5 -> n2, begins replica transfer
12:30:19.707332 - n5 - completes replica transfer to n2 after ~8s - begins lease transfer to n4
12:30:25.789415 - n4 - reports lease expired before lease upgrade

The major question is why that lease didn't transfer successfully: both n4 and n5 were healthy, and other traffic between them was flowing fine. The network connections seemed healthy, and n4 had no compactions or long GC during that interval. One possibility is that we "forgot" to schedule r255 on n4 during this interval. It is certainly possible that it was quiesced at 12:30:19 (although I don't know how to tell conclusively).

@pav-kv
Collaborator

pav-kv commented Nov 1, 2024

n2 held the last lease before n4:

n2 lease
prev=repl=(n2,s2):6VOTER_INCOMING seq=18 start=1730464219.763694984,0 exp=1730464225.763279802,0 pro=1730464219.763279802

Proposed: 12:30:19.763
Start: 12:30:19.763
Expiration: 12:30:25.763
n4 lease (requested)
repl=(n4,s4):4 seq=19 start=1730464225.763279802,1 epo=1 min-exp=0,0 pro=1730464225.786728250,0

Proposed: 12:30:25.786
Start: 12:30:25.763

The last log lines on n2 for this r255 range:

12:30:19.707332 [n5] streamed snapshot 64bea5dd at applied index 17604 to (n2,s2):6LEARNER with 266 MiB in 8.29s @ 32 MiB/s: kvs=510638 rangeKVs=0 sharedSSTs=0, externalSSTs=0, rate-limit: 32 MiB/s, queued: 0.01s
12:30:19.709180 applying snapshot 64bea5dd from (n5,s5):5 at applied index 17604
12:30:19.754974 applied snapshot 64bea5dd from (n5,s5):5 at applied index 17604 (total=46ms data=259 MiB excise=true ingestion=8@44ms)
12:30:19.758280 [n5] change replicas (add [(n2,s2):6VOTER_INCOMING] remove [(n5,s5):5VOTER_DEMOTING_LEARNER]): existing descriptor r255:‹/Table/109/1/-1{224827050299356085-}› [(n3,s3):1, (n5,s5):5, (n4,s4):4, (n2,s2):6LEARNER, next=7, gen=70]
12:30:28.055656 applying snapshot 9bd9fdeb from (n4,s4):4 at applied index 17606

@pav-kv
Collaborator

pav-kv commented Nov 1, 2024

So, n2 received a snapshot at 12:30:19.709180. A little later, at 12:30:19.763 it got a lease. But, for some reason, it was unavailable for a few seconds (did not see/apply the lease?). So n4 requested a lease and sent a snapshot to n2.

@pav-kv
Collaborator

pav-kv commented Nov 1, 2024

From the trace that @andrewbaptist linked, the previous lease before n2 was:

n4 lease
repl=(n4,s4):4 seq=17 start=1730464211.367732900,0 epo=1 min-exp=1730464217.365993389,0 pro=1730464211.395416908

Proposed: 12:30:11.395
Start: 12:30:11.367
Min-Exp: 12:30:17.365

And then:

n2 lease (transfer)
prev=repl=(n2,s2):6VOTER_INCOMING seq=18 start=1730464219.763694984,0 exp=1730464225.763279802,0 pro=1730464219.763279802

Proposed: 12:30:19.763
Start: 12:30:19.763
Expiration: 12:30:25.763
n4 lease (requested)
repl=(n4,s4):4 seq=19 start=1730464225.763279802,1 epo=1 min-exp=0,0 pro=1730464225.786728250,0

Proposed: 12:30:25.786
Start: 12:30:25.763

I wonder, though, why in this trace we see a bunch of NotLeaseHolderErrors from n2 redirecting to n4, while the retries are nevertheless routed to n2 again and again. Update: probably because n2 returns a stale lease for n4, which it still thinks is active.

@andrewbaptist
Contributor

andrewbaptist commented Nov 1, 2024

r255 is a fairly large range, which is why it took ~8s to transfer from n5 -> n2 (32 MiB/s * 8s ≈ 256 MiB). It might also mean that n2 was ~8s behind on raft traffic after it got the snapshot, so if it immediately got a lease transfer it wouldn't see it for a little bit. I'm a little confused how n4 got the lease at 12:30:11.

Based on this line:

241101 12:30:19.758280 2487255 13@kv/kvserver/replica_command.go:2494 ⋮ [T1,Vsystem,n5,replicate,s5,r255/5:‹/Table/109/1/-1{22482…-}›] 300  change replicas (add [(n2,s2):6VOTER_INCOMING] remove [(n5,s5):5VOTER_DEMOTING_LEARNER]): existing descriptor r255:‹/Table/109/1/-1{224827050299356085-}› [(n3,s3):1, (n5,s5):5, (n4,s4):4, (n2,s2):6LEARNER, next=7, gen=70]

I assume n5 has the lease at 12:30:19.758 and then transferred it to n2.

Or maybe n5 is no longer the leaseholder at this point but is still finishing its change-replicas operation from before, so this could be an admin request that is processed by a different node.

@pav-kv
Collaborator

pav-kv commented Nov 1, 2024

n2 only applied the lease at 12:30:28:

W241101 12:30:28.108552 251 2@kv/kvserver/replica_proposal.go:635 ⋮ [T1,Vsystem,n2,s2,r255/6:‹/Table/109/1/-1{22482…-}›,raft] 40161  applied lease after ~8.35s replication lag, client traffic may have been delayed [lease=repl=(n2,s2):6VOTER_INCOMING seq=18 start=1730464219.763694984,0 exp=1730464225.763279802,0 pro=1730464219.763279802,0 prev=repl=(n4,s4):4 seq=17 start=1730464211.367732900,0 epo=1 min-exp=1730464217.365993389,0 pro=1730464211.395416908,0 acquisition-type=Transfer]

(NB: this is after it got another snapshot)

@pav-kv
Collaborator

pav-kv commented Nov 1, 2024

@andrewbaptist n5 never had the lease. It was n4, then it transferred it to n2, and then n4 acquired it again.

See also the lease history in ranges.json; it matches this sequence.

@pav-kv
Collaborator

pav-kv commented Nov 2, 2024

Recap of timeline:

12:30:11.367 n4 lease (seq=16) is started

~12:30:19.713 [n4] starts streaming snapshot 9bd9fdeb to n2 (blocking MsgApp stream)

# n4 proposes a joint config with n2 as VOTER_INCOMING
12:30:19.761565 [n4] proposing ENTER_JOINT(r5 l5 v6) [(n2,s2):6VOTER_INCOMING], [(n5,s5):5VOTER_DEMOTING_LEARNER]: after=[(n3,s3):1 (n5,s5):5VOTER_DEMOTING_LEARNER (n4,s4):4 (n2,s2):6VOTER_INCOMING] next=7

12:30:19.763 [n4] proposes lease transfer to n2 (seq=17)

12:30:25.786 [n4] reacquires the lease (seq=18)
12:30:25.789415 [n4] lease expired before epoch/leader lease upgrade, client traffic may have been delayed ...

12:30:26.923875 [n4] proposing LEAVE_JOINT: after=[(n3,s3):1 (n5,s5):5LEARNER (n4,s4):4 (n2,s2):6] next=7 ...

# the snapshot is finally delivered to n2 in 8.34s
12:30:28.053291 [n4] streamed snapshot 9bd9fdeb at applied index 17606 to
(n2,s2):6LEARNER with 266 MiB in 8.30s @ 32 MiB/s:
kvs=510640 rangeKVs=0 sharedSSTs=0, externalSSTs=0, rate-limit: 32 MiB/s,
queued: 8.34s

# n2 finally applies its expired lease
# (either from the snapshot, or from a follow-up MsgApp/commit)
12:30:28.108552 [n2] applied lease after ~8.35s replication lag, client traffic may have been delayed [lease=repl=(n2,s2):6VOTER_INCOMING seq=18 start=1730464219.763694984,0 exp=1730464225.763279802,0 pro=1730464219.763279802,0 prev=repl=(n4,s4):4 seq=17 start=1730464211.367732900,0 epo=1 min-exp=1730464217.365993389,0 pro=1730464211.395416908,0 acquisition-type=Transfer]

The interesting bit is:

~12:30:19.713 [n4] starts streaming snapshot 9bd9fdeb to n2 (blocking MsgApp stream)
12:30:19.763 [n4] proposes lease transfer to n2 (seq=17)
# the latter should have not happened because n2 needs a snapshot

It could be due to

// We bypass safety checks when transferring the lease to the VOTER_INCOMING.
// We do so because we could get stuck without a path to exit the joint
// configuration if we rejected this lease transfer while waiting to confirm
// that the target is up-to-date on its log. That confirmation may never
// arrive if the target is dead or partitioned away, and while we'd rather not
// transfer the lease to a dead node, at least we have a mechanism to recovery
// from that state. We also just sent the VOTER_INCOMING a snapshot (as a
// LEARNER, before promotion), so it is unlikely that the replica is actually
// dead or behind on its log.
// TODO(nvanbenschoten): this isn't great. Instead of bypassing safety checks,
// we should build a mechanism to choose an alternate lease transfer target
// after some amount of time.
err = r.store.DB().AdminTransferLeaseBypassingSafetyChecks(ctx, r.startKey, voterIncomingTarget.StoreID)

(traced back to #89340)

The comment matches our scenario: we sent an initial snapshot to n2, then made it VOTER_INCOMING, and transferred the lease to it. This "bypass checks" flag gets plumbed all the way to here. So we aren't checking whether n2 needs a snapshot (and in our case it does need one, probably because there was a racing raft log compaction).

One bit that does not check out: n4 is not demoting itself, so we might actually not be hitting this path. In that case it's unclear why n4 chose n2 as the lease transfer target. It might have been due to the ReplicaMayNeedSnapshot check racing with log compaction in an unfortunate way (something similar to #81978).

So, an alternative hypothesis:

  1. The lease transfer was proposed after passing the !ReplicaMayNeedSnapshot check.
  2. Shortly after that, log compaction happens, and a snapshot is sent to n2.
  3. When the lease transfer is evaluated, it passes CheckCanReceiveLease because it doesn't check the log.
  4. So the lease transfer is committed, but it can't be replicated to n2 yet because we're still streaming the snapshot.
  5. Only when the snapshot completes, we send a follow-up MsgApp with a committed lease transfer to n2.

However, between 3 and 4, the lease transfer goes through the proposal buffer, which should reject based on the same ReplicaMayNeedSnapshot check. So this again does not fully check out...
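To make the hypothesized race concrete, here is a toy illustration (all names hypothetical; this is not the CockroachDB implementation) of steps 1-2 above, where the "may need snapshot" check passes at propose time and a racing log truncation then invalidates it:

package main

import "fmt"

// replica is a toy stand-in for the leaseholder's view of the range.
type replica struct {
	firstIndex uint64            // lowest raft log index still present (raised by truncation)
	matchIndex map[string]uint64 // how far each follower is known to have advanced
}

// mayNeedSnapshot mimics the kind of check done before proposing a lease
// transfer: the target needs a snapshot if its log position is below the
// first index we still have entries for.
func (r *replica) mayNeedSnapshot(target string) bool {
	return r.matchIndex[target] < r.firstIndex
}

func main() {
	r := &replica{firstIndex: 17000, matchIndex: map[string]uint64{"n2": 17500}}

	// Step 1: the safety check passes, so the lease transfer to n2 is proposed.
	fmt.Println("n2 may need snapshot?", r.mayNeedSnapshot("n2")) // false

	// Step 2: a racing log truncation raises firstIndex past n2's position.
	r.firstIndex = 17600

	// Steps 3-5: the already-proposed transfer commits, but n2 can only learn
	// about it after the multi-second snapshot stream completes.
	fmt.Println("n2 may need snapshot?", r.mayNeedSnapshot("n2")) // true
}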

@pav-kv
Collaborator

pav-kv commented Nov 2, 2024

A step back.

Looking again at the second snapshot, it actually consists of 2 phases: "queued" (8.34s) and "send" (8.30s):

12:30:11.367 n4 lease (seq=16) is started
~12:30:11.413 [n4] snapshot to n2 initiated (12:30:28.053291 minus 8.30+8.34s)
# why was it queued for 8.34s? probably because of the other snapshot (from n5) below
~12:30:19.753 [n4] snapshot to n2 sent (12:30:28.053291 minus 8.30s)

12:30:19.761565 [n4] proposing ENTER_JOINT(r5 l5 v6) [(n2,s2):6VOTER_INCOMING], [(n5,s5):5VOTER_DEMOTING_LEARNER]: after=[(n3,s3):1 (n5,s5):5VOTER_DEMOTING_LEARNER (n4,s4):4 (n2,s2):6VOTER_INCOMING] next=7
12:30:19.763 n4 proposed lease transfer to n2 (seq=17)

12:30:28.053291 [n4] streamed snapshot 9bd9fdeb at applied index 17606 to
(n2,s2):6LEARNER with 266 MiB in 8.30s @ 32 MiB/s:
kvs=510640 rangeKVs=0 sharedSSTs=0, externalSSTs=0, rate-limit: 32 MiB/s,
queued: 8.34s

So the second snapshot was initiated approximately at the same time as the first snapshot was sent from n5 (just a few ms after):

12:30:11.365830 [n5] change replicas (add [(n2,s2):6LEARNER] remove []): existing descriptor r255:‹/Table/109/1/-1{224827050299356085-}› [(n3,s3):1, (n5,s5):5, (n4,s4):4, next=6, gen=69]

12:30:11.395 [n5] propose lease transfer to n4

~12:30:11.407 [n5] snapshot initiated/sent to n2 (12:30:19.707332 minus 8.29+0.01s)

12:30:19.707332 [n5] streamed snapshot 64bea5dd at applied index 17604 to (n2,s2):6LEARNER
with 266 MiB in 8.29s @ 32 MiB/s:
kvs=510638 rangeKVs=0 sharedSSTs=0, externalSSTs=0, rate-limit: 32 MiB/s,
queued: 0.01s

12:30:19.758280 [n5] change replicas (add [(n2,s2):6VOTER_INCOMING] remove [(n5,s5):5VOTER_DEMOTING_LEARNER])

This explains why the 2 snapshots are just a couple of entries apart.

Also, it seems that the first snapshot got there first, and the second one was waiting until the first one completed and the receiver accepted it.

@andrewbaptist
Contributor

@pav-kv It seems like this issue is relatively well understood, and with #134257 filed, I'm closing it. If there are additional similar failures, we should look at a way to more easily root-cause them to this same issue, but let's leave all future discussion there.

craig bot pushed a commit that referenced this issue Feb 20, 2025
140233: raft: send MsgAppResp to latest leader after handling snapshot r=pav-kv a=hakuuww

raft [does not](https://github.com/cockroachdb/cockroach/blob/55cf17041236ea300c38f50bb55628d28297642f/pkg/raft/tracker/progress.go#L388) send MsgApp probes to a peer if its flow is in StateSnapshot. This stalls replication to this peer until the outstanding snapshot has been streamed.

Previously, when a slow follower (Node 3) received a snapshot from the previous leader (Node 1), it would only send a MsgAppResp back to the original sender (Node 1) (inferred from [raft message.From](https://github.com/cockroachdb/cockroach/blob/145f8b978819e1e137f5a330baa83ce78385300f/pkg/raft/raftpb/raft.proto#L77C24-L77C28)), even if it was aware that leadership had changed to Node 2 ([raft.lead](https://github.com/cockroachdb/cockroach/blob/2083bac24d1453392955dac8bff50cbc6f64cd9d/pkg/raft/raft.go#L365)). This behavior delayed informing the new leader (Node 2) of Node 3's updated state, which can be improved.

To address this issue, in cases where Node 3 is aware of the new leader (Node 2), the slow follower now sends MsgAppResp to both Node 1 and Node 2 (if their peer IDs differ) upon receiving and applying a snapshot from Node 1.
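A simplified, self-contained sketch of the idea (hypothetical types and names; not the actual pkg/raft change): upon applying a snapshot, the follower acks both the snapshot's sender and the peer it currently believes to be leader:

package main

import "fmt"

// All types here are hypothetical, illustrating the idea of the change
// rather than the real raft implementation.
type message struct {
	from, to uint64
	kind     string // "MsgSnap" or "MsgAppResp"
	index    uint64 // last index covered by the snapshot / acked index
}

type follower struct {
	id        uint64
	lead      uint64 // peer this follower currently believes is leader (0 = unknown)
	lastIndex uint64
	outbox    []message
}

// handleSnapshot applies a snapshot and acks both the sender and the current
// leader, so a new leader can move this follower out of StateSnapshot without
// waiting for its own (redundant) snapshot to finish streaming.
func (f *follower) handleSnapshot(m message) {
	f.lastIndex = m.index // snapshot application elided

	ack := message{from: f.id, to: m.from, kind: "MsgAppResp", index: f.lastIndex}
	f.outbox = append(f.outbox, ack)

	if f.lead != 0 && f.lead != m.from {
		ack.to = f.lead
		f.outbox = append(f.outbox, ack)
	}
}

func main() {
	// Node 3 gets a snapshot from the old leader (Node 1) while it already
	// knows Node 2 is the new leader: both peers get an MsgAppResp.
	n3 := &follower{id: 3, lead: 2}
	n3.handleSnapshot(message{from: 1, to: 3, kind: "MsgSnap", index: 17606})
	for _, m := range n3.outbox {
		fmt.Printf("%s -> node %d (index %d)\n", m.kind, m.to, m.index)
	}
}

The extra MsgAppResp is cheap, and a duplicate ack is harmless to whichever peer is no longer leader.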

If Node 3 has already acknowledged Node 2 as the new leader, then Node 2 has likely already marked Node 3 as StateSnapshot (transitioning from StateProbe after sending a MsgApp and receiving a MsgAppResp).

Note: it's possible that Node 3 knows Node 2 is the new leader but Node 3's initial response to Node 2 failed to deliver. For the test case included in this PR, we assumed Node 2 received Node 3's response to its probe. If Node 3's initial response to Node 2 failed to deliver, and the leader Node 2 later sees another MsgAppResp from Node 3 with an up-to-date commit index (resulting from Node 3 processing the snapshot from Node 1), then Node 2 will simply transition Node 3's state to StateReplicate. So everything is okay and desired.

With this change, Node 2 can now transition Node 3 back to StateReplicate upon receiving the MsgAppResp for the snapshot Node 3 received from Node 1.

This is great because the optimization prevents unnecessary delays in replication progress and helps reduce potential write latency. 

A significant issue (a few seconds of write unavailability or a latency spike) can arise if a slow follower (Node 3) becomes the new leaseholder during a leadership change but remains unaware of its lease status.
(This is an unlikely corner case, but it has happened in the perturbation/metamorphic/backfill test before: write unavailability for around 5 seconds on the range experiencing the leaseholder and raft leader change. [roachtest link](#133155 (comment)))

A simple example scenario to explain why there can be write unavailability:
Some preconditions: Node 3 is a new learner or slow follower.
- If Node 3 is made the new leaseholder but has not yet received/applied the lease entry, it cannot start serving reads or writes until it learns it became the leaseholder by applying the lease entry. The rest of the cluster can still think Node 3 is the leaseholder if a quorum has applied the new lease entry, and would therefore forward read and write requests to Node 3.
- Node 3 must wait for Node 2 (the new Raft leader) to replicate new entries, which include the lease entry, before it recognizes itself as the leaseholder.
- However, Node 2 will only replicate new entries when it sees Node 3 in StateReplicate, which could be delayed if Node 3 remains in StateSnapshot longer than necessary.

Snapshot transfers are relatively slow (a few seconds) due to their large size and network overhead.
This change eliminates the wait for an additional snapshot transfer (from Node 2 to Node 3) by allowing the new leader (Node 2) to transition Node 3 back to StateReplicate sooner and start sending MsgApp messages instead of waiting for its own snapshot's response. The optimization applies when the previous leader sent a snapshot to Node 3.

Since sending, processing, and responding to MsgApp is much faster than sending a snapshot, Node 3 will receive and apply the lease entry sooner, allowing it to recognize its leaseholder status and begin serving reads and writes to the upper layers more quickly.

In conclusion, this optimization reduces the potential read and write latency spikes/unavailability in the above scenario.

The problem is not completely fixed, since we are still waiting for at least one snapshot to transfer.
It would be ideal if we could avoid giving the lease to a slow follower/learner in the first place; that would be an issue of its own.

potential fix to #134257

Epic: None

Release note: None

Co-authored-by: Anthony Xu <[email protected]>