storage: Repeated node liveness failures #16565
Comments
@bdarnell, this is the issue we hit in production, right? Could I take a look at it? If I do, could you point me to some references for it? Thank you very much.
I'm going to make some PRs today with the important changes from #16564.
When there is a lot of garbage to collect, a single run of MVCCGarbageCollect on the node liveness range can last long enough that leases expire, leading to cascading unavailability. Fixes cockroachdb#16565
The semaphore issue that I mentioned above was not the real issue (it's not good, but it only had a minor impact in this case). The real problem was that because this cluster was running with an increased TTL in its default zone config, it was accumulating so many old MVCC versions on its liveness records that the MVCC GC operation (which blocks heartbeats) would last long enough that leases would expire (which would in turn prevent the GC operation itself from applying, so we'd never make progress). #16637 fixes this by limiting how much work we do in a single GC pass. You may also need to change the zone configs so that the system ranges don't have an increased TTL (by setting a config with a TTL of 24 hours for the zone
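To make the shape of that fix concrete, here is a minimal sketch of capping the work done in a single GC pass. The names (`gcKeyBatchSize`, `collectGarbage`, `version`) are invented for illustration and this is not the actual code in #16637; the point is only that each pass deletes a bounded batch and leaves the rest for a later pass.

```go
package main

import "fmt"

// gcKeyBatchSize is a hypothetical cap on how many garbage MVCC versions a
// single GC pass may delete. A tiny value is used here so the example's
// output shows multiple passes; the idea is that bounding each pass keeps
// it short enough that lease-renewing heartbeats are not starved.
const gcKeyBatchSize = 3

// version stands in for one garbage MVCC version of a key.
type version struct {
	key       string
	timestamp int64
}

// collectGarbage deletes at most gcKeyBatchSize versions per call and
// reports whether more work remains, so the caller can reschedule the
// range instead of holding it for one long, lease-expiring run.
func collectGarbage(garbage []version) (remaining []version, done bool) {
	n := len(garbage)
	if n > gcKeyBatchSize {
		n = gcKeyBatchSize
	}
	for _, v := range garbage[:n] {
		fmt.Printf("deleting %s@%d\n", v.key, v.timestamp)
	}
	return garbage[n:], n == len(garbage)
}

func main() {
	garbage := []version{
		{"liveness/1", 10}, {"liveness/1", 14}, {"liveness/1", 18},
		{"liveness/1", 22}, {"liveness/1", 26},
	}
	for done := false; !done; {
		garbage, done = collectGarbage(garbage)
	}
}
```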
@bdarnell If the increased TTL setting leads to huge GC on the liveness records, will the same problem also happen on other tables?
The liveness records are updated every 4 seconds (once for each node), so you'll have a lot of versions of them no matter what else you do with your cluster. You'll only have similar GC problems with other tables if you're also updating other records that often (not inserting new records, but updating or deleting existing ones). Even if a GC takes a long time on user-space records, it's not good, but it's much less severe. It will block everything else on that range for a few seconds, but then it will finish and go back to normal (and it will be a long time before the next big GC).

The problem with the node liveness records is that while they're being garbage collected, nodes can't send heartbeats to update their leases. If the leases expire while the GC is still going on, another node can steal the lease and the GC will fail to apply, so it will have to be retried soon, starting the process again.
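Some back-of-the-envelope arithmetic (mine, not from the thread) on why the liveness range accumulates so much garbage: at one update every 4 seconds, a single node's record picks up on the order of 22,500 versions over a 25-hour GC TTL, and an increased TTL multiplies that directly. The cluster size below is a made-up example.

```go
package main

import "fmt"

func main() {
	const (
		heartbeatIntervalSec = 4  // liveness record updated every 4 seconds per node
		ttlHours             = 25 // GC TTL in hours (default is about 25h; the cluster here used more)
		nodes                = 5  // hypothetical cluster size, just for scale
	)
	// Every update older than the TTL window becomes a garbage MVCC version
	// that a later GC pass on the liveness range has to delete.
	versionsPerNode := ttlHours * 3600 / heartbeatIntervalSec
	fmt.Printf("~%d garbage versions per node, ~%d across the liveness range\n",
		versionsPerNode, versionsPerNode*nodes)
}
```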
@bdarnell, what is the connection between node liveness records and leases?
Node liveness records and their relationship to leases are described in this RFC. Node liveness records are used to control leases on other ranges, but the range where the liveness records themselves live uses an "expiration-based" lease (an older system). If the node that holds the lease can't refresh that lease, another node can "steal" the lease by proposing a new lease for itself.
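To illustrate the "stealing" described above, here is a toy model of an expiration-based lease. The types, durations, and function names are made up; the real logic lives in `replica_range_lease.go` and is far more involved.

```go
package main

import (
	"fmt"
	"time"
)

// lease is a toy model of an expiration-based lease: whoever holds it must
// keep extending expiration before it passes, or any other replica may
// propose a new lease for itself.
type lease struct {
	holder     string
	expiration time.Time
}

// maybeSteal models another replica's view: once the current lease has
// expired (for example because the holder was stuck in a long GC pass and
// could not renew), it can propose itself as the new leaseholder.
func maybeSteal(l lease, me string, now time.Time) lease {
	if now.After(l.expiration) {
		return lease{holder: me, expiration: now.Add(9 * time.Second)} // arbitrary duration
	}
	return l
}

func main() {
	now := time.Now()
	l := lease{holder: "n1", expiration: now.Add(-1 * time.Second)} // already expired
	l = maybeSteal(l, "n2", now)
	fmt.Println("lease now held by", l.holder)
}
```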
There's no reason to block our own liveness updates when incrementing another node's epoch; doing so could cause cascading failures when the liveness span gets slow. This was originally suspected as the cause of cockroachdb#16565 (and was proposed in cockroachdb#16564). That issue turned out to have other causes, but this change seems like a good idea anyway.
A user is reporting problems related to range leases: the number of ranges with valid leases grows gradually over time, then crashes to 1/3 or less of its peak value. When this happens, a backlog of requests starts to build up on other nodes and eventually runs them out of memory (and the cycle repeats). The logs show repeated errors from the node liveness subsystem (logged by `replica_range_lease.go`), with messages "heartbeat failed on epoch increment" and "mismatch incrementing epoch". These messages are milliseconds apart, and the "mismatch" errors show exactly the same "actual" value.

I think what's going on is that we can queue up a large number of `IncrementEpoch` calls in the range lease code (proportional to the number of ranges, when the rest of the node liveness system expects activity proportional to the number of nodes), and because of the two-level locking (`sem` and `mu`), many of them can race to issue their ConditionalPuts while holding `sem`.

#16564 is an attempt at a quick fix by using separate semaphores so that incrementing another node's epoch doesn't impair our ability to heartbeat our own liveness. A slightly less quick fix would be to improve synchronization so that multiple ranges don't try to act on the same node's liveness.
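For a concrete picture of the quick fix proposed in #16564, here is a minimal sketch with invented names: give heartbeats and epoch increments separate semaphores so that a per-range pile-up of `IncrementEpoch` calls cannot starve this node's own heartbeats. This is not the actual `NodeLiveness` code, only the shape of the idea.

```go
package main

import "fmt"

// nodeLiveness is a toy model of the liveness subsystem's locking. In the
// problematic arrangement, a single semaphore serialized both our own
// heartbeats and epoch increments for other nodes; here the two paths get
// separate semaphores so one cannot starve the other.
type nodeLiveness struct {
	heartbeatSem chan struct{} // guards updates to our own liveness record
	incrementSem chan struct{} // guards epoch increments for other nodes
}

func newNodeLiveness() *nodeLiveness {
	return &nodeLiveness{
		heartbeatSem: make(chan struct{}, 1),
		incrementSem: make(chan struct{}, 1),
	}
}

// heartbeat updates our own liveness record. It only contends with other
// heartbeats, never with the (potentially per-range) epoch increments.
func (nl *nodeLiveness) heartbeat() {
	nl.heartbeatSem <- struct{}{}
	defer func() { <-nl.heartbeatSem }()
	fmt.Println("heartbeat: ConditionalPut on own liveness record")
}

// incrementEpoch bumps another node's epoch. Many ranges may request this
// at once; they queue here without blocking heartbeat().
func (nl *nodeLiveness) incrementEpoch(nodeID int) {
	nl.incrementSem <- struct{}{}
	defer func() { <-nl.incrementSem }()
	fmt.Printf("increment epoch of node %d: ConditionalPut on its record\n", nodeID)
}

func main() {
	nl := newNodeLiveness()
	nl.incrementEpoch(2)
	nl.heartbeat()
}
```

The design point is simply that the two paths no longer share a lock, matching the commit message quoted above: there's no reason to block our own liveness updates when incrementing another node's epoch.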