
storage: Repeated node liveness failures #16565

Closed
bdarnell opened this issue Jun 16, 2017 · 7 comments
Comments

@bdarnell
Contributor

A user is reporting problems related to range leases: the number of ranges with valid leases grows gradually over time, then crashes to 1/3 or less of its peak value. When this happens, a backlog of requests starts to build up on other nodes and eventually runs them out of memory (and the cycle repeats). The logs show repeated errors from the node liveness subsystem (logged by replica_range_lease.go, with messages "heartbeat failed on epoch increment" and "mismatch incrementing epoch"). These messages are milliseconds apart, and the "mismatch" errors show exactly the same "actual" value.

I think what's going on is that we can queue up a large number of IncrementEpoch calls in the range lease code (proportional to the number of ranges, when the rest of the node liveness system expects activity proportional to the number of nodes), and because of the two-level locking (sem and mu), many of them can race to issue their ConditionalPuts while holding sem.
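
To make the suspected interaction concrete, here's a minimal, self-contained sketch (hypothetical names, not the actual node liveness code): many goroutines snapshot the same cached record under mu, then take turns on sem issuing a ConditionalPut-style update, so the first one succeeds and the rest fail with mismatch errors that all report the same "actual" value.

```go
package main

import (
	"fmt"
	"sync"
)

// liveness is a stand-in for a node liveness record (hypothetical type).
type liveness struct {
	epoch int64
}

type nodeLiveness struct {
	mu  sync.Mutex    // guards the cached record
	sem chan struct{} // serializes writes to the liveness range
	rec liveness      // cached copy of the record
}

// incrementEpoch mimics the check-then-act pattern: snapshot the cached
// record under mu, then perform a "ConditionalPut" while holding sem.
// Every caller that read the same cached epoch races to write epoch+1;
// only the first succeeds, and the rest fail with the same "actual" value.
func (nl *nodeLiveness) incrementEpoch() error {
	nl.mu.Lock()
	expected := nl.rec // snapshot taken before acquiring sem
	nl.mu.Unlock()

	nl.sem <- struct{}{}        // acquire
	defer func() { <-nl.sem }() // release

	nl.mu.Lock()
	defer nl.mu.Unlock()
	if nl.rec != expected {
		// Analogous to a ConditionalPut condition failure.
		return fmt.Errorf("mismatch incrementing epoch: actual %+v", nl.rec)
	}
	nl.rec.epoch = expected.epoch + 1
	return nil
}

func main() {
	nl := &nodeLiveness{sem: make(chan struct{}, 1)}
	var wg sync.WaitGroup
	// One IncrementEpoch per range, not per node: the queue can be huge.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := nl.incrementEpoch(); err != nil {
				fmt.Println(err)
			}
		}()
	}
	wg.Wait()
}
```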

#16564 is an attempt at a quick fix by using separate semaphores so that incrementing another node's epoch doesn't impair our ability to heartbeat our own liveness. A slightly less quick fix would be to improve synchronization so that multiple ranges don't try to act on the same node's liveness.
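
For illustration, a rough sketch of the separate-semaphore idea from #16564 (names are made up; the real change is to the node liveness code): our own heartbeats wait only on their own semaphore, so a range-proportional backlog of epoch increments can no longer starve them.

```go
package livenesssketch

// nodeLivenessSems sketches splitting the single semaphore in two so that
// a flood of IncrementEpoch calls for other nodes cannot block heartbeats
// of our own liveness record.
type nodeLivenessSems struct {
	heartbeatSem chan struct{} // serializes updates to our own record
	incrementSem chan struct{} // serializes epoch increments for other nodes
}

func newNodeLivenessSems() *nodeLivenessSems {
	return &nodeLivenessSems{
		heartbeatSem: make(chan struct{}, 1),
		incrementSem: make(chan struct{}, 1),
	}
}

// heartbeat waits only on heartbeatSem, so it no longer queues behind a
// potentially range-proportional pile of epoch increments.
func (nl *nodeLivenessSems) heartbeat(update func() error) error {
	nl.heartbeatSem <- struct{}{}
	defer func() { <-nl.heartbeatSem }()
	return update()
}

// incrementEpoch waits only on incrementSem.
func (nl *nodeLivenessSems) incrementEpoch(update func() error) error {
	nl.incrementSem <- struct{}{}
	defer func() { <-nl.incrementSem }()
	return update()
}
```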

@a6802739
Contributor

a6802739 commented Jun 20, 2017

@bdarnell, this is a production issue, right? Could I take a look at it? And if I do, could you point me to some references? Thank you very much.

@bdarnell
Contributor Author

I'm going to make some PRs today with the important changes from #16564.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jun 20, 2017
When there is a lot of garbage to collect, a single run of
MVCCGarbageCollect on the node liveness range can last long enough
that leases expire, leading to cascading unavailability.

Fixes cockroachdb#16565
@bdarnell
Contributor Author

The semaphore issue that I mentioned above was not the real issue (it's not good, but it only had a minor impact in this case). The real problem was that because this cluster was running with an increased TTL in its default zone config, it was accumulating so many old MVCC versions on its liveness records that the MVCC GC operation (which blocks heartbeats) would last long enough that leases would expire (which would in turn prevent the GC operation itself from applying, so we'd never make progress).

#16637 fixes this by limiting how much work we do in a single GC pass. You may also need to change the zone configs so that the system ranges don't have an increased TTL (by setting a config with a TTL of 24 hours for the zone .system). We want to do that by default in #14990.
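
As a rough illustration of the "limit work per pass" idea (hypothetical names and limit value, not the actual #16637 change): cap how many versions a single GC command clears and let the GC queue come back for the rest, so no single pass blocks liveness heartbeats for anywhere near a lease duration.

```go
package main

import "fmt"

// gcKey is a stand-in for one garbage-collectable MVCC version.
type gcKey struct{ key string }

// collectBatch stands in for one MVCCGarbageCollect command over a batch.
func collectBatch(batch []gcKey) {
	fmt.Printf("GC'ed %d versions in one command\n", len(batch))
}

// runGC splits the garbage into bounded batches instead of issuing one
// giant command. Each batch is small enough to apply well within a lease
// period, so liveness heartbeats are only briefly blocked.
func runGC(garbage []gcKey, maxKeysPerBatch int) {
	for len(garbage) > 0 {
		n := maxKeysPerBatch
		if n > len(garbage) {
			n = len(garbage)
		}
		collectBatch(garbage[:n])
		garbage = garbage[n:]
	}
}

func main() {
	// e.g. a day's worth of liveness versions for a handful of nodes.
	garbage := make([]gcKey, 100000)
	runGC(garbage, 4096) // the batch limit and its value are assumptions
}
```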

@a6802739
Contributor

@bdarnell If the increased TTL setting leads to huge GC runs on the liveness records, why didn't we see the same problem on user-space records?

@bdarnell
Contributor Author

The liveness records are updated every 4 seconds (once for each node), so you'll have a lot of versions of them no matter what else you do with your cluster. You'll only have similar GC problems with other tables if you're also updating other records that often (not inserting new records, but updating or deleting existing ones).
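
A quick back-of-the-envelope illustration of why the TTL matters here (assumed numbers: one update per liveness record roughly every 4 seconds, and the TTLs are examples):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const heartbeatInterval = 4 * time.Second // each node updates its record ~every 4s
	for _, ttl := range []time.Duration{24 * time.Hour, 7 * 24 * time.Hour} {
		// Versions younger than the TTL cannot be collected, so each
		// record carries roughly ttl/interval live MVCC versions.
		versionsPerRecord := int(ttl / heartbeatInterval)
		fmt.Printf("gc TTL %v -> roughly %d versions per liveness record\n",
			ttl, versionsPerRecord)
	}
}
```

With a 24-hour TTL that is about 21,600 versions per record, and an increased TTL scales the number linearly, which is why the liveness range hits this long before ordinary tables do.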

A GC that takes a long time on user-space records isn't good either, but it's much less severe. It will block everything else on that range for a few seconds, but then it will finish and go back to normal (and it will be a long time before the next big GC). The problem with the node liveness records is that while they're being garbage collected, nodes can't send heartbeats to update their leases. If the leases expire while the GC is still going on, another node can steal the lease and the GC will fail to apply, so it will have to be retried soon, starting the process again.

@a6802739
Contributor

@bdarnell, what is the connection between node liveness records and their leases? And How other nodes will notify the leases expire? And also does the TransferLeaseRequest will be blocked by the GC?
Thank you for your explanation.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jun 21, 2017
When there is a lot of garbage to collect, a single run of
MVCCGarbageCollect on the node liveness range can last long enough
that leases expire, leading to cascading unavailability.

Fixes cockroachdb#16565
@bdarnell
Contributor Author

Node liveness records and their relationship to leases are described in this RFC. Node liveness records are used to control leases on other ranges, but the range where the liveness records themselves live uses an "expiration-based" lease (an older system). If the node that holds that lease can't refresh it, another node can "steal" the lease by proposing a RequestLease command (there are no TransferLease requests involved in this case; TransferLease is only used when the rebalancing system decides to move a lease).
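
Roughly what "stealing" an expiration-based lease looks like (a simplified sketch with hypothetical types and an assumed lease duration, not the replica_range_lease.go code): any replica that observes the incumbent's expiration in the past may propose RequestLease for itself.

```go
package main

import (
	"fmt"
	"time"
)

// lease is a stand-in for an expiration-based range lease.
type lease struct {
	holder     string
	expiration time.Time
}

// maybeRequestLease models a replica deciding to propose RequestLease: it
// only does so once the incumbent's lease has expired, e.g. because a long
// MVCC GC kept the holder from refreshing it in time.
func maybeRequestLease(cur lease, me string, now time.Time) (lease, bool) {
	if now.Before(cur.expiration) {
		return cur, false // incumbent still holds a valid lease
	}
	// Propose RequestLease naming ourselves; in the real system this goes
	// through Raft and may race with other replicas doing the same.
	// The 9-second duration here is an assumption for the sketch.
	return lease{holder: me, expiration: now.Add(9 * time.Second)}, true
}

func main() {
	cur := lease{holder: "n1", expiration: time.Now().Add(-time.Second)}
	if newLease, stolen := maybeRequestLease(cur, "n2", time.Now()); stolen {
		fmt.Printf("n2 took the lease previously held by %s: %+v\n", cur.holder, newLease)
	}
}
```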

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jun 27, 2017
When there is a lot of garbage to collect, a single run of
MVCCGarbageCollect on the node liveness range can last long enough
that leases expire, leading to cascading unavailability.

Fixes cockroachdb#16565
bdarnell added a commit to bdarnell/cockroach that referenced this issue Jul 6, 2017
There's no reason to block our own liveness updates when incrementing
another node's epoch; doing so could cause cascading failures when
the liveness span gets slow.

This was originally suspected as the cause of cockroachdb#16565 (and was proposed
in cockroachdb#16564). That issue turned out to have other causes, but this
change seems like a good idea anyway.
bdarnell added a commit to bdarnell/cockroach that referenced this issue Jul 10, 2017
There's no reason to block our own liveness updates when incrementing
another node's epoch; doing so could cause cascading failures when
the liveness span gets slow.

This was originally suspected as the cause of cockroachdb#16565 (and was proposed
in cockroachdb#16564). That issue turned out to have other causes, but this
change seems like a good idea anyway.