kvserver: don't always refuse to take leases when draining #55624
Conversation
Do y'all like this?
Regarding the main contribution of this patch:
I am ok with this change, as long as we ensure that the drain process in charge of moving leases+leadership away continues to be successful in doing so (which I believe is the case here). However, the rest of the motivation in the commit message is suspicious:
I don't understand what you write here; what is the "failure" you're talking about? We have a liveness-related bug which I address in #55460: the drain process is iterative but mistakenly fails to re-wake up the stores upon every lookup of the liveness record, which can cause a deadlock. I introduced this bug, but it's easy enough to fix without changing the lease transfer policy.
How so is it best effort? If it currently fails to do so, that is a bug that needs to be corrected.
I somewhat agree that it's acceptable, but I would still not recommend it. My expectation as an operator is that any drain is going to preserve cluster availability. So my expectation is also that a drain does not complete as long as there's no quorum on the range. So my expectation is that if my replication factor is N, then when I start draining floor(N/2)+1 nodes concurrently, some or all of the drains will never complete. This is by design. We could argue that we need a separate shutdown API to bring down an entire cluster gracefully, with the understanding that it breaks availability. This API does not exist currently and the drain API is not meant to support it (nor would I want it to).
What I'm trying to say is that, if all replicas of the liveness range are draining, then all nodes in the cluster become unable to heartbeat their records. This is just a special case of the general problem - which is that we can get into these situations where, although there's quorum, ranges get stuck until one of the draining nodes restarts.
You're referring to the discussion here. As I'm saying over there, it seems to me that what that patch is trying to do is just paper over the problem, whereas the better thing to do is this patch. That patch improves the situation for the liveness range, but doesn't really do anything for other ranges because the period where nodes temporarily stop draining is very short.
Well I meant that it's best effort because it is governed by various timeouts, but the reality is that the draining code actually doesn't deal with leadership explicitly at all (although it certainly pretends to) - see #55619.
Ideally, all the drains would not complete at once, but some should complete. And as the first completes, the respective node would get restarted, and then a second drain can complete, and so on. So, ideally, I should be able to start draining all the nodes at once, and then restart them one by one as individual drains complete, and eventually all the nodes should get restarted. There are some other pieces missing for this story, but we should strive towards it. This patch helps as we've discussed above - it lets the basic functions of a cluster work even when all nodes (or more specifically, all replicas of a range) are draining at the same time.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)
My personal preference would be for draining replicas to continue heartbeating. I don't see why it's useful that they stop.
Yes, this is what I was referring to, but I don't have any opinion on this that's better than yours. (And that's also what I replied on that thread.)
Ok yes I understand this now. But how is this PR helping with this?
That "ideally" does not work for me. We'd get the behavior you describe here if we were to do things the way you envision. But that's a tail wagging its dog! What argument do you have that this ideal behavior is also what our operators want? It looks, from a usability perspective, to be super hard to teach and understand, and it has extremely obscure failure scenarios.
I can see how this PR moves you towards the "ideal" world you would like to see. But I do not buy into the idea that this is desirable for CockroachDB and its users (yet). I can see why we want to merge this PR (and I think I'm OK with the change, as I wrote first thing above) but I'd like a more compelling explanation in the commit message.
My personal preference would be for draining replicas to continue heartbeating. I don't see why it's useful that they stop.
I think there's some misunderstanding. Who said anything about the draining nodes stopping their heartbeats? Quite the contrary - the point of this PR is to let them succeed in heartbeating (by letting the liveness range get a lease).
Well I meant that it's best effort because it is governed by various timeouts, but the reality is that the draining code actually doesn't deal with leadership explicitly at all (although it certainly pretends to) - see #55619.
Ok yes I understand this now. But how is this PR helping with this?
What is "this"? If "this" is the fact that the draining code doesn't deal with leadership for lease-less ranges, then this patch helps because #55148 doesn't deadlock any more (with this patch, it becomes OK to always refuse to take a lease if you're not the leader).
If "this" is #55460, then the conjecture is that, with this PR, you no longer need the s.node.SetDraining(false /* drain */)
in #55460. If I understand correctly, the reason why you had to add that call was that, without it, the s.nodeLiveness.SetDraining(ctx, true /* drain */) call would get stuck,
because the liveness range can't get a lease any more in various scenarios. With this patch, it'll get the needed lease.
Ideally, all the drains would not complete at once, but some should complete. And as the first completes, the respective node would get restarted, and then a second drain can complete, and so on. So, ideally, I should be able to start draining all the nodes at once, and then restart them one by one as individual drains complete, and eventually all the nodes should get restarted. There are some other pieces missing for this story, but we should strive towards it.
That "ideally" does not work for me. We'd get the behavior you describe here if we were to do things the way you envision. But that's a tail wagging its dog! What argument do you have that this ideal behavior is also what our operators want? It looks, from a usability perspective, to be super hard to teach and understand, and it has extremely obscure failure scenarios.
Perhaps I haven't made my point very well; I don't want any tails wagging any dogs. I was trying to argue that this patch is a good idea even separately from #55148. Perhaps I should have talked about only why #55148 needs this, and not talk about anything else.
Anyway, if we're in agreement, I'll think about how to change the commit message. When it comes to draining multiple nodes at once, I'm not particularly interested in telling people to start doing that. What I am interested in is CRDB having reasonable behavior in as many scenarios as possible. If an operator drains multiple nodes at once (perhaps by mistake), they'd surely prefer the cluster to work rather than not work. Without this PR, the cluster most definitely does not work. With this PR, at least the KV layer works (the SQL layer might not, I'm not sure). I see this as a pretty clear win.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nvanbenschoten, and @tbg)
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz and @tbg)
I'm mildly conflicted about this change. It's desirable to guard against the complete failure of KV should (say) all replicas of the liveness range start draining simultaneously.
What's the behavior with this patch when we drain, say, all nodes? Lease transfers will still respect draining, so essentially no more leases are transferred. Raft leadership by and large will thus stay where it is, and the drains grind to a halt. That seems like an improvement.
The cost is that during a regular drain, it's more difficult to ascertain that all leases are actually gone. In effect, we need to transfer leases away and also wait for all raft leaderships to have moved away. And even then, it's possible that a raft leadership comes back last minute, a lease pops up where one wasn't held before, and we kill the process with a lease on the node, resulting in an up to 10s outage on the affected ranges. Regular draining is the much more common operation and we have to have a way to make it bulletproof.
We have looked at leases mostly so far. What if we prevented draining nodes from campaigning (or gave them a higher threshold for doing so, say 2x the configured one)? That seems like it would do the trick. Though maybe it's not necessary. Under normal operations (one single node getting drained at all times), do we ever expect random raft elections (or inbound leadership transfers)? I can only imagine that those could arise from ranges that become active again after a period of inactivity, but those should be quiesced and thus preserve leadership.
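To make that "higher threshold" idea concrete, here is a minimal sketch only; electionTicks is an invented helper for illustration and not an etcd/raft or CockroachDB API:

```go
// Invented helper for illustration only: stretch the election timeout of a
// draining node so it only campaigns when nobody else steps up for twice
// the usual interval (the "2x the configured one" idea above).
func electionTicks(baseTicks int, draining bool) int {
	if draining {
		return 2 * baseTicks
	}
	return baseTicks
}
```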
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz and @tbg)
Under normal operations (one single node getting drained at all times), do we ever expect random raft elections (or inbound leadership transfers)?
You mean incoming lease transfers coming to a draining node? I'm not sure... There's lease rebalancing based on load, manual SCATTERs, and I think there's got to be some rebalancing caused by nodes being added to the cluster. I'm not sure which ones look at draining and which don't; maybe they all do. Of course, information on who's draining comes from gossip, so it's not the most reliable.
And even then, it's possible that a raft leadership comes back last minute, a lease pops up where one wasn't held before, and we kill the process with a lease on the node, resulting in an up to 10s outage on the affected ranges.
What I think we should do about this is to split the lease draining process into two phases. Phase 1 is unbounded in duration; during it we continuously try to shed leases and leadership, and prefer not to take leases - but still take them if we have the leadership. The 2nd phase would run when the node is about to terminate - so it will have a bounded duration of a second or so. During this phase, we'd indeed refuse to take a lease (because we're quitting anyway), and do another pass at transferring leases and leadership away from replicas that have it.
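Roughly, the two phases could look like the following hedged sketch; the leaseDrainer interface and its methods are invented for illustration and do not correspond to the actual server drain code:

```go
package drainsketch

import (
	"context"
	"time"
)

// leaseDrainer is an invented abstraction over the store operations the two
// phases need; it is not a real CockroachDB interface.
type leaseDrainer interface {
	// ShedLeasesAndLeadership attempts to transfer leases and raft leadership
	// away and returns how many replicas still hold either.
	ShedLeasesAndLeadership(ctx context.Context) (remaining int)
	// RefuseNewLeases switches the policy so that even raft leaders stop
	// accepting new leases.
	RefuseNewLeases()
}

// drainLeases sketches the two-phase proposal: an unbounded phase 1 that keeps
// shedding leases/leadership while still letting leaders take leases, and a
// short phase 2 right before process exit during which new leases are refused
// and one last transfer pass is made.
func drainLeases(ctx context.Context, d leaseDrainer) {
	// Phase 1: unbounded duration; loop until everything is shed or the caller
	// signals that the process is about to terminate.
	for ctx.Err() == nil {
		if d.ShedLeasesAndLeadership(ctx) == 0 {
			break
		}
		time.Sleep(time.Second)
	}

	// Phase 2: bounded (about a second); refuse new leases since we're quitting
	// anyway, and do a final pass at transferring what's left.
	d.RefuseNewLeases()
	finalCtx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	d.ShedLeasesAndLeadership(finalCtx)
}
```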
What if we prevented draining nodes from campaigning (or gave them a higher threshold for doing so, say 2x the configured one)?
Right, I think this is the right line of thought here. We should discourage draining nodes from campaigning for new leadership, and we should be transferring the existing leadership away.
Just to make sure this hasn't gotten lost in the chatter, let's remember how we got here. The code as it stands is incompatible with #55148 (kvserver: don't allow raft forwarding of lease requests) - that PR makes everybody but the leader refuse to take the lease. The draining code, as is, refuses to take the lease on a draining node. This means that if the leader is draining, nobody takes the lease. So, we need to do something.
Doing what this patch does is, I think, very attractive because, in conjunction with #55148, it means that the lease acquisition code will not need to look at either the draining status or the leadership status above Raft: my plan is to merge this, and then change the requestLeaseLocked() code to simply remove the
if r.store.IsDraining() && (r.raftBasicStatusRLocked().RaftState != raft.StateLeader) {
block, because it won't be necessary any more - either the replica is the leader, in which case we want the lease regardless of draining status, or it isn't the leader, in which case we don't want it regardless of draining status. I'm gonna fix the failing tests here and merge this, and then we can resume talking about how to improve leadership management while draining.
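In other words, the combined policy collapses to something like the sketch below; shouldRequestLease is a made-up name and this is not the literal requestLeaseLocked code:

```go
// Made-up illustration of the combined policy from this PR plus #55148; the
// real logic lives in requestLeaseLocked and is not this literal function.
func shouldRequestLease(isLeader, isDraining bool) bool {
	if !isLeader {
		// With #55148, non-leaders never request the lease, draining or not.
		return false
	}
	// With this patch, a draining leader still takes the lease; refusing would
	// leave ranges like the liveness range without anyone willing to acquire
	// it. The isDraining input no longer affects the outcome, which is why the
	// explicit IsDraining() check can be deleted.
	return true
}
```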
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @knz and @tbg)
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @knz and @tbg)
I broadly agree with your suggested future steps, though I hope we can make do without an explicit second stage. Incoming leases should really not be expected any more once the draining liveness record has been waited out (or at least I think making all subsystems aware of the draining status is the first choice of strategy, rather than jumping to an extra stage). So we get something like this:
- mark as draining
- wait until cluster must be aware of draining (implies no new incoming leases, no incoming raft leadership transfers)
- lower raft campaigning aggressiveness (tbd what exactly makes sense here, if anything) (implies no more randomly acquired raft leaderships, except where they're truly necessary)
- for { transfer all leases away; if noMoreLeases && noMoreRaftLeaders { return } else { sleep(time.Second) } } (see the sketch below)
- final wait to improve distsender cache updates in rest of cluster
This is pretty close to what the draining code looks like at and before #55460.
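As a rough Go rendering of that outline (a sketch only; the storeDrainer interface and its method names are invented and are not the real Store drain code):

```go
package drainsketch

import (
	"context"
	"time"
)

// storeDrainer is an invented abstraction for the per-store operations the
// outline above needs.
type storeDrainer interface {
	MarkDraining()
	WaitForDrainingGossip(ctx context.Context) // cluster-wide awareness of draining
	LowerCampaignAggressiveness()
	TransferLeasesAway(ctx context.Context)
	RemainingLeases() int
	RemainingRaftLeaders() int
}

// drainStore walks through the five steps from the outline.
func drainStore(ctx context.Context, s storeDrainer) error {
	s.MarkDraining()                // 1. mark as draining
	s.WaitForDrainingGossip(ctx)    // 2. no new incoming leases or leadership after this
	s.LowerCampaignAggressiveness() // 3. avoid acquiring raft leadership randomly

	// 4. transfer leases away until no leases and no raft leaderships remain.
	for {
		s.TransferLeasesAway(ctx)
		if s.RemainingLeases() == 0 && s.RemainingRaftLeaders() == 0 {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}

	// 5. final wait to let DistSender caches in the rest of the cluster update.
	time.Sleep(time.Second)
	return nil
}
```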
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @knz and @tbg)
I've been working on a test here but it's taking me a while. Coming soon.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @knz and @tbg)
Force-pushed from d04d741 to b59c5d9.
Force-pushed from b59c5d9 to 1631b00.
Added a test, PTAL
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @bdarnell, @knz, @nvanbenschoten, and @tbg)
I like the test, although I think you can simplify it since we already have a testing thing to pause heartbeats.
I think you can also update the PR description at top using your commit messages, which are nicely explanatory.
Reviewed 4 of 4 files at r2, 3 of 3 files at r3, 3 of 3 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 2 stale) (waiting on @andreimatei and @tbg)
pkg/kv/kvserver/replica_test.go, line 1452 at r4 (raw file):
// rejectLivenessHeartbeats is a gRPC filter which returns errors on all
// attempts to heartbeat n1's liveness when suspendN1Heartbeats is set.
There is already a feature to do this: (*NodeLiveness) PauseHeartbeatLoopForTest().
I won't be able to look until Tuesday unfortunately. Can someone else sign off on the test? The other stuff was LGTM.
The only concerns I have beyond what knz said are aesthetic.
Reviewed 4 of 4 files at r2, 3 of 3 files at r3, 3 of 3 files at r4.
Reviewable status: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @knz, and @tbg)
pkg/kv/kvclient/kvcoord/dist_sender.go, line 263 at r2 (raw file):
clusterID *base.ClusterIDContainer
// disableFirstRangeUpdates disables updates of the first range via
Aren't all of these fields testing knobs as well? We should either pull everything into the testingKnobs struct or not introduce it.
pkg/kv/kvclient/kvcoord/testing_knobs.go, line 20 at r2 (raw file):
)
// RequestFilterType is type of functions used as RequestFilter.
"the type"?
pkg/kv/kvclient/kvcoord/testing_knobs.go, line 29 at r2 (raw file):
// any request. If either of the retvals is non-nil, then that's used as the
// RPC's result and the call is not performed any more.
RequestFilter RequestFilterType
nit: keep the formatting consistent here, otherwise the code quality deteriorates. In this case, we have a new line between each field.
pkg/kv/kvserver/node_liveness.go, line 262 at r3 (raw file):
}
// NodeLivenessTestingKnobs allows tests to override some node liveness
Shouldn't this live in pkg/kv/kvserver/testing_knobs.go alongside StoreTestingKnobs?
pkg/kv/kvserver/replica_range_lease.go, line 844 at r4 (raw file):
}
func (r *Replica) currentLeaseStatus(ctx context.Context) kvserverpb.LeaseStatus {
It seems like this should live right below leaseStatus and contain a comment referencing the relationship between this method and that one.
pkg/kv/kvserver/replica_test.go, line 1433 at r4 (raw file):
// Test that draining nodes only take the lease if they're the leader.
func TestReplicaDrainLease(t *testing.T) {
You stressed this test for a while, right?
pkg/kv/kvserver/replica_test.go, line 1452 at r4 (raw file):
Previously, knz (kena) wrote…
There is already a feature to do this: (*NodeLiveness) PauseHeartbeatLoopForTest().
+1
pkg/kv/kvserver/replica_test.go, line 1493 at r4 (raw file):
nit: you can use tc.AddReplicasOrFatal()
Reviewable status: complete! 2 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @knz, and @tbg)
pkg/kv/kvserver/replica_test.go, line 1498 at r4 (raw file):
s1 := tc.Server(0)
s2 := tc.Server(1)
store1, err := s1.GetStores().(*Stores).GetStore(s1.GetFirstStoreID())
nit: there is a convenience method tc.GetFirstStoreFromServer(t, 0)
pkg/kv/kvserver/replica_test.go, line 1503 at r4 (raw file):
require.NoError(t, err)
rd, err := tc.LookupRange(rngKey)
nit: there is a LookupRangeOrFatal as well
Force-pushed from 1631b00 to cde0e66.
This test either has completely rotted, or has always been confused. It tries to verify that different mechanisms trigger raft reproposals, but it doesn't seem to actually depend on reproposals. The test says that an increment depends on the reproposal of a previous lease request, but there's no such lease request. Also, technically reproposals of lease requests stopped being a thing a while ago. It also talks about Raft leadership changing, but that's not explicit in the test. The test fails with the next commit that changes how draining replicas handle lease requests. It's unclear to me what's salvageable from the test. Release note: None
Before this patch, a draining node would not take any new leases once it's draining. This is a problem in case all replicas of a range are draining at the same time (e.g. when draining a single-node cluster, or when draining multiple nodes at the same time perhaps by mistake) - nobody wants the lease. Particularly because the liveness range is expiration-based (and thus permanently in need of new leases to be granted), this quickly results in nodes failing their liveness.

It also becomes more of a problem with cockroachdb#55148, where we start refusing to take the lease on replicas that are not the leader - so if the leader is draining, we deadlock.

This patch makes an exception for leaders, which now no longer refuse the lease even when they're draining. The reasoning being that it's too easy to deadlock otherwise.

Release note: None
Force-pushed from cde0e66 to acc1ad1.
TFTR
I've prepended a commit deleting TestRefreshPendingCommands - the test we were discussing over email. I don't think there's anything to salvage there.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @knz, @lunevalex, @nvanbenschoten, and @tbg)
pkg/kv/kvclient/kvcoord/dist_sender.go, line 263 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Aren't all of these fields testing knobs as well? We should either pull everything into the testingKnobs struct or not introduce it.
removed the new struct
pkg/kv/kvclient/kvcoord/testing_knobs.go, line 20 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
"the type"?
removed
pkg/kv/kvclient/kvcoord/testing_knobs.go, line 29 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: keep the formatting consistent here, otherwise the code quality deteriorates. In this case, we have a new line between each field.
removed
pkg/kv/kvserver/node_liveness.go, line 262 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Shouldn't this live in pkg/kv/kvserver/testing_knobs.go alongside StoreTestingKnobs?
done
pkg/kv/kvserver/replica_range_lease.go, line 844 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It seems like this should live right below leaseStatus and contain a comment referencing the relationship between this method and that one.
done
pkg/kv/kvserver/replica_test.go, line 1433 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
You stressed this test for a while, right?
yes
pkg/kv/kvserver/replica_test.go, line 1452 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
+1
ok, switched to PauseAllHeartbeatsForTest.
I was aware of these methods but I don't like them very much - if I can, I'd rather not interfere with the nodes' operation. But if y'all like it, I don't care too much.
pkg/kv/kvserver/replica_test.go, line 1493 at r4 (raw file):
Previously, lunevalex wrote…
nit: you can use tc.AddReplicasOrFatal()
done
pkg/kv/kvserver/replica_test.go, line 1498 at r4 (raw file):
Previously, lunevalex wrote…
nit: there is a convenience method tc.GetFirstStoreFromServer(t, 0)
That's a method on TestCluster, but here we're using TestClusterInterface. We can't just put the method on the interface with the *kvserver.Store return type because the serverutils package can't depend on almost anything. So we'd have to introduce a version with a generic return type, which I don't think is worth it.
pkg/kv/kvserver/replica_test.go, line 1503 at r4 (raw file):
Previously, lunevalex wrote…
nit: there is a LookupRangeOrFatal as well
done
I think you can also update the PR description at top using your commit messages, which are nicely explanatory.
done
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @knz, @lunevalex, @nvanbenschoten, and @tbg)
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 3 stale) (waiting on @knz, @lunevalex, @nvanbenschoten, and @tbg)
Build succeeded.
Before this patch, a draining node would not take any new leases once
it's draining. This is a problem in case all replicas of a range are
draining at the same time (e.g. when draining a single-node cluster, or
when draining multiple nodes at the same time perhaps by mistake) -
nobody wants the lease. Particularly because the liveness range is
expiration-based (and thus permanently in need of new leases to be
granted), this quickly results in nodes failing their liveness.
It also becomes more of a problem with #55148, where we start refusing
to take the lease on replicas that are not the leader - so if the leader
is draining, we deadlock.
This patch makes an exception for leaders, which now no longer refuse
the lease even when they're draining. The reasoning being that it's too
easy to deadlock otherwise.
Release note: None