Skip to content

Commit

Permalink
kvserver: log if lease applies with a delay
Browse files Browse the repository at this point in the history
When we transfer a lease to a lagging follower, there's often a latency
blip that we get asked to investigate. This is time consuming; it's
often very subtle to even figure out that it happened. We try to be
better about not doing it, but at least on 22.1 we know it's possible,
and we can't backport the rather involved fixes.

This warning makes it fairly obvious when it happens.

> W230131 [...] [T1,n2,s2,r23/3:‹/Table/2{1-2}›,raft] 165  lease repl=(n2,s2):3 seq=5 start=1675153630.108829000,0 epo=3 pro=1675153630.108829000,0 active after replication lag of ~0.58s; foreground traffic may have been impacted [prev=repl=(n3,s3):2 seq=4 start=1675153407.528408000,0 epo=2 pro=1675153419.837642000,0]

Addresses #95991.

Epic: none
Release note: None
  • Loading branch information
tbg committed Feb 6, 2023
1 parent 8e24570 commit 2705d7b
Showing 1 changed file with 24 additions and 1 deletion.
25 changes: 24 additions & 1 deletion pkg/kv/kvserver/replica_proposal.go
Original file line number Diff line number Diff line change
Expand Up @@ -351,10 +351,33 @@ func (r *Replica) leasePostApplyLocked(
requiresExpirationBasedLease := r.requiresExpiringLeaseRLocked()
hasExpirationBasedLease := newLease.Type() == roachpb.LeaseExpiration

now := r.store.Clock().NowAsClockTimestamp()

// If we hold the lease now and the lease was proposed "earlier", there
// must have been replication lag, and possibly reads and/or writes were
// delayed.
//
// We see this most commonly with lease transfers targeting a behind replica,
// or, in the worst case, a snapshot. We are constantly improving our
// heuristics for avoiding that[^1] but if it does happen it's good to know
// from the logs.
//
// In the case of a lease transfer, the two timestamps compared below are from
// different clocks, so there could be skew. We just pretend this is not the
// case, which is good enough here.
//
// [^1]: https://github.com/cockroachdb/cockroach/pull/82758
const slowLeaseWarnThreshold = 500 * time.Millisecond
if delta := time.Duration(now.WallTime - newLease.ProposedTS.WallTime); iAmTheLeaseHolder && leaseChangingHands && delta > slowLeaseWarnThreshold {
log.Warningf(ctx,
"lease %v active after replication lag of ~%.2fs; foreground traffic may have been impacted [prev=%v]",
newLease, delta.Seconds(), prevLease,
)
}

// Gossip the first range whenever its lease is acquired. We check to make
// sure the lease is active so that a trailing replica won't process an old
// lease request and attempt to gossip the first range.
now := r.store.Clock().NowAsClockTimestamp()
if leaseChangingHands && iAmTheLeaseHolder && r.IsFirstRange() && r.ownsValidLeaseRLocked(ctx, now) {
r.gossipFirstRangeLocked(ctx)
}
Expand Down

0 comments on commit 2705d7b

Please sign in to comment.