-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: log if lease applies with a delay #96257
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we transfer expiration-based leases, can we also warn if the previous unclaimed lease expired because the new leaseholder failed to apply it? Otherwise, this log message won't fire if the delay is greater than 6 seconds.
a96eb2d
to
2705d7b
Compare
That's a good point, doesn't the current code handle this though? If, say, a recipient of a lease transfer applies the lease via the log only after, say, 120 minutes, it will still see the lease and log the warning. I don't think there's any condition here that checks that the lease is still valid at apply time. The case where we do miss the logging is if the recipient eventually gets caught up via a snapshot and the snapshot has another leaseholder already. In that case we can't rely on the recipient of the lease logging anything since it never even learns about the lease transfer. |
That's true, but that could be so late that we no longer care. Let's say we transfer a lease from A to C, C fails to apply the lease and it expires, then after 10 seconds B acquires a new lease (without logging anything), and only once C recovers 2 hours later do we log a warning. Meanwhile, the range was unavailable for 10 seconds and people are upset. Can we easily add a condition where we log a lapsed expiration lease that was never applied (i.e. converted to an epoch lease)?
Right, I think we can ignore this case for our purposes here. |
271d7e0
to
2c029f0
Compare
Good idea, I think we can, see new commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that should do the trick, thanks!
When we transfer a lease to a lagging follower, there's often a latency blip that we get asked to investigate. This is time consuming; it's often very subtle to even figure out that it happened. We try to be better about not doing it, but at least on 22.1 we know it's possible, and we can't backport the rather involved fixes. This warning makes it fairly obvious when it happens. > W230131 [...] [T1,n2,s2,r23/3:‹/Table/2{1-2}›,raft] 165 lease repl=(n2,s2):3 seq=5 start=1675153630.108829000,0 epo=3 pro=1675153630.108829000,0 active after replication lag of ~0.58s; foreground traffic may have been impacted [prev=repl=(n3,s3):2 seq=4 start=1675153407.528408000,0 epo=2 pro=1675153419.837642000,0] Also log if we're acquiring an epoch-based lease following a non-cooperatively expired expiration-based lease, which would suggest that a lease transfer went to a node that couldn't service the lease. This would likely have caused an outage, and the log message will provide a way to pinpoint its end timestamp Addresses cockroachdb#95991. Epic: none Release note: None
2c029f0
to
548a4be
Compare
bors r=erikgrinaker |
Build succeeded: |
blathers backport 22.1 |
When we transfer a lease to a lagging follower, there's often a latency
blip that we get asked to investigate. This is time consuming; it's
often very subtle to even figure out that it happened. We try to be
better about not doing it, but at least on 22.1 we know it's possible,
and we can't backport the rather involved fixes.
This warning makes it fairly obvious when it happens.
Addresses #95991.
Epic: none
Release note: None