
storage: Repeated node liveness failures #16565

Closed
bdarnell opened this issue Jun 16, 2017 · 7 comments
Comments

@bdarnell
Contributor

A user is reporting problems related to range leases: the number of ranges with valid leases grows gradually over time, then crashes to 1/3 or less of its peak value. When this happens, a backlog of requests starts to build up on other nodes and eventually runs them out of memory (and the cycle repeats). The logs show repeated errors from the node liveness subsystem (logged by replica_range_lease.go, with messages "heartbeat failed on epoch increment" and "mismatch incrementing epoch"). These messages are milliseconds apart, and the "mismatch" errors show exactly the same "actual" value.

I think what's going on is that we can queue up a large number of IncrementEpoch calls in the range lease code (proportional to the number of ranges, when the rest of the node liveness system expects activity proportional to the number of nodes), and because of the two-level locking (sem and mu), many of them can race to issue their ConditionalPuts while holding sem.
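
To make the suspected interaction concrete, here's a minimal, self-contained sketch (hypothetical names, not the actual node liveness code): many goroutines snapshot the same cached record under mu, then take turns on sem issuing a ConditionalPut-style update, so the first one succeeds and the rest fail with mismatch errors that all report the same "actual" value.

```go
package main

import (
	"fmt"
	"sync"
)

// liveness is a stand-in for a node liveness record (hypothetical type).
type liveness struct {
	epoch int64
}

type nodeLiveness struct {
	mu  sync.Mutex    // guards the cached record
	sem chan struct{} // serializes writes to the liveness range
	rec liveness      // cached copy of the record
}

// incrementEpoch mimics the check-then-act pattern: snapshot the cached
// record under mu, then perform a "ConditionalPut" while holding sem.
// Every caller that read the same cached epoch races to write epoch+1;
// only the first succeeds, and the rest fail with the same "actual" value.
func (nl *nodeLiveness) incrementEpoch() error {
	nl.mu.Lock()
	expected := nl.rec // snapshot taken before acquiring sem
	nl.mu.Unlock()

	nl.sem <- struct{}{}        // acquire
	defer func() { <-nl.sem }() // release

	nl.mu.Lock()
	defer nl.mu.Unlock()
	if nl.rec != expected {
		// Analogous to a ConditionalPut condition failure.
		return fmt.Errorf("mismatch incrementing epoch: actual %+v", nl.rec)
	}
	nl.rec.epoch = expected.epoch + 1
	return nil
}

func main() {
	nl := &nodeLiveness{sem: make(chan struct{}, 1)}
	var wg sync.WaitGroup
	// One IncrementEpoch per range, not per node: the queue can be huge.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := nl.incrementEpoch(); err != nil {
				fmt.Println(err)
			}
		}()
	}
	wg.Wait()
}
```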

#16564 is an attempt at a quick fix by using separate semaphores so that incrementing another node's epoch doesn't impair our ability to heartbeat our own liveness. A slightly less quick fix would be to improve synchronization so that multiple ranges don't try to act on the same node's liveness.
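
For illustration, a rough sketch of the separate-semaphore idea from #16564 (names are made up; the real change is to the node liveness code): our own heartbeats wait only on their own semaphore, so a range-proportional backlog of epoch increments can no longer starve them.

```go
package livenesssketch

// nodeLivenessSems sketches splitting the single semaphore in two so that
// a flood of IncrementEpoch calls for other nodes cannot block heartbeats
// of our own liveness record.
type nodeLivenessSems struct {
	heartbeatSem chan struct{} // serializes updates to our own record
	incrementSem chan struct{} // serializes epoch increments for other nodes
}

func newNodeLivenessSems() *nodeLivenessSems {
	return &nodeLivenessSems{
		heartbeatSem: make(chan struct{}, 1),
		incrementSem: make(chan struct{}, 1),
	}
}

// heartbeat waits only on heartbeatSem, so it no longer queues behind a
// potentially range-proportional pile of epoch increments.
func (nl *nodeLivenessSems) heartbeat(update func() error) error {
	nl.heartbeatSem <- struct{}{}
	defer func() { <-nl.heartbeatSem }()
	return update()
}

// incrementEpoch waits only on incrementSem.
func (nl *nodeLivenessSems) incrementEpoch(update func() error) error {
	nl.incrementSem <- struct{}{}
	defer func() { <-nl.incrementSem }()
	return update()
}
```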

@a6802739
Contributor

a6802739 commented Jun 20, 2017

@bdarnell, this is a production issue, right? Could I take a look at it? And if I do, could you point me to some references? Thank you very much.

@bdarnell
Contributor Author

I'm going to make some PRs today with the important changes from #16564.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jun 20, 2017
When there is a lot of garbage to collect, a single run of
MVCCGarbageCollect on the node liveness range can last long enough
that leases expire, leading to cascading unavailability.

Fixes cockroachdb#16565
@bdarnell
Contributor Author

The semaphore issue that I mentioned above was not the real issue (it's not good, but it only had a minor impact in this case). The real problem was that because this cluster was running with an increased TTL in its default zone config, it was accumulating so many old MVCC versions on its liveness records that the MVCC GC operation (which blocks heartbeats) would last long enough that leases would expire (which would in turn prevent the GC operation itself from applying, so we'd never make progress).

#16637 fixes this by limiting how much work we do in a single GC pass. You may also need to change the zone configs so that the system ranges don't have an increased TTL (by setting a config with a TTL of 24 hours for the zone .system). We want to do that by default in #14990.
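
As a rough illustration of the "limit work per pass" idea (hypothetical names and limit value, not the actual #16637 change): cap how many versions a single GC command clears and let the GC queue come back for the rest, so no single pass blocks liveness heartbeats for anywhere near a lease duration.

```go
package main

import "fmt"

// gcKey is a stand-in for one garbage-collectable MVCC version.
type gcKey struct{ key string }

// collectBatch stands in for one MVCCGarbageCollect command over a batch.
func collectBatch(batch []gcKey) {
	fmt.Printf("GC'ed %d versions in one command\n", len(batch))
}

// runGC splits the garbage into bounded batches instead of issuing one
// giant command. Each batch is small enough to apply well within a lease
// period, so liveness heartbeats are only briefly blocked.
func runGC(garbage []gcKey, maxKeysPerBatch int) {
	for len(garbage) > 0 {
		n := maxKeysPerBatch
		if n > len(garbage) {
			n = len(garbage)
		}
		collectBatch(garbage[:n])
		garbage = garbage[n:]
	}
}

func main() {
	// e.g. a day's worth of liveness versions for a handful of nodes.
	garbage := make([]gcKey, 100000)
	runGC(garbage, 4096) // the batch limit and its value are assumptions
}
```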

@a6802739
Contributor

@bdarnell If the increased TTL setting leads to huge GC runs on the liveness records, why didn't we see the same problem on user-space records?

@bdarnell
Contributor Author

The liveness records are updated every 4 seconds (once for each node), so you'll have a lot of versions of them no matter what else you do with your cluster. You'll only have similar GC problems with other tables if you're also updating other records that often (not inserting new records, but updating or deleting existing ones).
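
A quick back-of-the-envelope illustration of why the TTL matters here (assumed numbers: one update per liveness record roughly every 4 seconds, and the TTLs are examples):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const heartbeatInterval = 4 * time.Second // each node updates its record ~every 4s
	for _, ttl := range []time.Duration{24 * time.Hour, 7 * 24 * time.Hour} {
		// Versions younger than the TTL cannot be collected, so each
		// record carries roughly ttl/interval live MVCC versions.
		versionsPerRecord := int(ttl / heartbeatInterval)
		fmt.Printf("gc TTL %v -> roughly %d versions per liveness record\n",
			ttl, versionsPerRecord)
	}
}
```

With a 24-hour TTL that is about 21,600 versions per record, and an increased TTL scales the number linearly, which is why the liveness range hits this long before ordinary tables do.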

A GC that takes a long time on user-space records isn't good either, but it's much less severe. It will block everything else on that range for a few seconds, but then it will finish and go back to normal (and it will be a long time before the next big GC). The problem with the node liveness records is that while they're being garbage collected, nodes can't send heartbeats to update their leases. If the leases expire while the GC is still going on, another node can steal the lease and the GC will fail to apply, so it will have to be retried soon, starting the process again.

@a6802739
Contributor

@bdarnell, what is the connection between node liveness records and their leases? And How other nodes will notify the leases expire? And also does the TransferLeaseRequest will be blocked by the GC?
Thank you for your explanation.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jun 21, 2017
When there is a lot of garbage to collect, a single run of
MVCCGarbageCollect on the node liveness range can last long enough
that leases expire, leading to cascading unavailability.

Fixes cockroachdb#16565
@bdarnell
Contributor Author

Node liveness records and their relationship to leases are described in this RFC. Node liveness records are used to control leases on other ranges, but the range where the liveness records themselves live uses an "expiration-based" lease (an older system). If the node that holds that lease can't refresh it, another node can "steal" the lease by proposing a RequestLease command (there are no TransferLease requests involved in this case; TransferLease is only used when the rebalancing system decides to move a lease).
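
Roughly what "stealing" an expiration-based lease looks like (a simplified sketch with hypothetical types and an assumed lease duration, not the replica_range_lease.go code): any replica that observes the incumbent's expiration in the past may propose RequestLease for itself.

```go
package main

import (
	"fmt"
	"time"
)

// lease is a stand-in for an expiration-based range lease.
type lease struct {
	holder     string
	expiration time.Time
}

// maybeRequestLease models a replica deciding to propose RequestLease: it
// only does so once the incumbent's lease has expired, e.g. because a long
// MVCC GC kept the holder from refreshing it in time.
func maybeRequestLease(cur lease, me string, now time.Time) (lease, bool) {
	if now.Before(cur.expiration) {
		return cur, false // incumbent still holds a valid lease
	}
	// Propose RequestLease naming ourselves; in the real system this goes
	// through Raft and may race with other replicas doing the same.
	// The 9-second duration here is an assumption for the sketch.
	return lease{holder: me, expiration: now.Add(9 * time.Second)}, true
}

func main() {
	cur := lease{holder: "n1", expiration: time.Now().Add(-time.Second)}
	if newLease, stolen := maybeRequestLease(cur, "n2", time.Now()); stolen {
		fmt.Printf("n2 took the lease previously held by %s: %+v\n", cur.holder, newLease)
	}
}
```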

bdarnell added a commit to bdarnell/cockroach that referenced this issue Jun 27, 2017
When there is a lot of garbage to collect, a single run of
MVCCGarbageCollect on the node liveness range can last long enough
that leases expire, leading to cascading unavailability.

Fixes cockroachdb#16565
bdarnell added a commit to bdarnell/cockroach that referenced this issue Jul 6, 2017
There's no reason to block our own liveness updates when incrementing
another node's epoch; doing so could cause cascading failures when
the liveness span gets slow.

This was originally suspected as the cause of cockroachdb#16565 (and was proposed
in cockroachdb#16564). That issue turned out to have other causes, but this
change seems like a good idea anyway.
bdarnell added a commit to bdarnell/cockroach that referenced this issue Jul 10, 2017
There's no reason to block our own liveness updates when incrementing
another node's epoch; doing so could cause cascading failures when
the liveness span gets slow.

This was originally suspected as the cause of cockroachdb#16565 (and was proposed
in cockroachdb#16564). That issue turned out to have other causes, but this
change seems like a good idea anyway.