Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: log if lease applies with a delay #96257

Merged
merged 1 commit into from
Feb 8, 2023

Conversation

tbg
Copy link
Member

@tbg tbg commented Jan 31, 2023

When we transfer a lease to a lagging follower, there's often a latency
blip that we get asked to investigate. This is time consuming; it's
often very subtle to even figure out that it happened. We try to be
better about not doing it, but at least on 22.1 we know it's possible,
and we can't backport the rather involved fixes.

This warning makes it fairly obvious when it happens.

W230131 [...] [T1,n2,s2,r23/3:‹/Table/2{1-2}›,raft] 165 lease repl=(n2,s2):3 seq=5 start=1675153630.108829000,0 epo=3 pro=1675153630.108829000,0 active after replication lag of ~0.58s; foreground traffic may have been impacted [prev=repl=(n3,s3):2 seq=4 start=1675153407.528408000,0 epo=2 pro=1675153419.837642000,0]

Addresses #95991.

Epic: none
Release note: None

@blathers-crl
Copy link

blathers-crl bot commented Jan 31, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg requested a review from nvanbenschoten January 31, 2023 08:40
@tbg tbg marked this pull request as ready for review January 31, 2023 08:40
@tbg tbg requested a review from a team January 31, 2023 08:40
Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we transfer expiration-based leases, can we also warn if the previous unclaimed lease expired because the new leaseholder failed to apply it? Otherwise, this log message won't fire if the delay is greater than 6 seconds.

@tbg tbg force-pushed the slow-lease-warning branch from a96eb2d to 2705d7b Compare February 6, 2023 08:25
@tbg
Copy link
Member Author

tbg commented Feb 6, 2023

That's a good point, doesn't the current code handle this though? If, say, a recipient of a lease transfer applies the lease via the log only after, say, 120 minutes, it will still see the lease and log the warning. I don't think there's any condition here that checks that the lease is still valid at apply time.

The case where we do miss the logging is if the recipient eventually gets caught up via a snapshot and the snapshot has another leaseholder already. In that case we can't rely on the recipient of the lease logging anything since it never even learns about the lease transfer.

@tbg tbg requested a review from erikgrinaker February 6, 2023 08:26
@erikgrinaker
Copy link
Contributor

That's a good point, doesn't the current code handle this though? If, say, a recipient of a lease transfer applies the lease via the log only after, say, 120 minutes, it will still see the lease and log the warning. I don't think there's any condition here that checks that the lease is still valid at apply time.

That's true, but that could be so late that we no longer care. Let's say we transfer a lease from A to C, C fails to apply the lease and it expires, then after 10 seconds B acquires a new lease (without logging anything), and only once C recovers 2 hours later do we log a warning. Meanwhile, the range was unavailable for 10 seconds and people are upset.

Can we easily add a condition where we log a lapsed expiration lease that was never applied (i.e. converted to an epoch lease)?

The case where we do miss the logging is if the recipient eventually gets caught up via a snapshot and the snapshot has another leaseholder already. In that case we can't rely on the recipient of the lease logging anything since it never even learns about the lease transfer.

Right, I think we can ignore this case for our purposes here.

@tbg tbg force-pushed the slow-lease-warning branch from 271d7e0 to 2c029f0 Compare February 6, 2023 12:04
@tbg
Copy link
Member Author

tbg commented Feb 6, 2023

Can we easily add a condition where we log a lapsed expiration lease that was never applied (i.e. converted to an epoch lease)?

Good idea, I think we can, see new commit.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that should do the trick, thanks!

When we transfer a lease to a lagging follower, there's often a latency
blip that we get asked to investigate. This is time consuming; it's
often very subtle to even figure out that it happened. We try to be
better about not doing it, but at least on 22.1 we know it's possible,
and we can't backport the rather involved fixes.

This warning makes it fairly obvious when it happens.

> W230131 [...] [T1,n2,s2,r23/3:‹/Table/2{1-2}›,raft] 165  lease repl=(n2,s2):3 seq=5 start=1675153630.108829000,0 epo=3 pro=1675153630.108829000,0 active after replication lag of ~0.58s; foreground traffic may have been impacted [prev=repl=(n3,s3):2 seq=4 start=1675153407.528408000,0 epo=2 pro=1675153419.837642000,0]

Also log if we're acquiring an epoch-based lease following a
non-cooperatively expired expiration-based lease, which would suggest
that a lease transfer went to a node that couldn't service the lease.
This would likely have caused an outage, and the log message will
provide a way to pinpoint its end timestamp

Addresses cockroachdb#95991.

Epic: none
Release note: None
@tbg
Copy link
Member Author

tbg commented Feb 8, 2023

bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Feb 8, 2023

Build succeeded:

@tbg
Copy link
Member Author

tbg commented Feb 9, 2023

blathers backport 22.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants