
kv: Start nodes in a new status to prevent lease transfers #96980

Closed

Conversation

andrewbaptist
Contributor

Previously, when a store started it immediately became a target for lease and replica transfers. This could cause problems if the store was still recovering from being down, because it was behind on Raft updates. This patch delays publishing a membership status of ACTIVE until the store has worked through its Raft backlog and cleaned up its LSM.

Epic: none
Release note (ops change): A node now transitions through an additional STARTING state on startup.
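
A rough sketch of the kind of membership gate this describes; the enum values and helper below are illustrative placeholders, not the actual CockroachDB implementation.

package main

import "fmt"

// MembershipStatus is an illustrative stand-in for the liveness membership
// enum this PR extends; the real type and values may differ.
type MembershipStatus int

const (
	// STARTING: the node has rejoined but is still working through its
	// Raft backlog and LSM cleanup; it should not receive lease transfers.
	STARTING MembershipStatus = iota
	// ACTIVE: the node has caught up and is a valid lease-transfer target.
	ACTIVE
)

// isValidLeaseTarget sketches how an allocator could gate transfers on the
// proposed status; the function name is a placeholder, not the real API.
func isValidLeaseTarget(s MembershipStatus) bool {
	return s == ACTIVE
}

func main() {
	fmt.Println(isValidLeaseTarget(STARTING)) // false: still catching up
	fmt.Println(isValidLeaseTarget(ACTIVE))   // true
}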

@blathers-crl

blathers-crl bot commented Feb 10, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@irfansharif irfansharif left a comment


Using a new enum for liveness membership feels a tad heavy-handed and stateful for what we want to achieve. It's persisted state that we'd be using to effectively inform remote nodes to not transfer leases to a freshly started node. But on remote stores we already observe this freshly restarted node's liveness epoch getting bumped, which happens as part of the liveness heartbeat loop:

incrementEpoch := true
heartbeatInterval := nl.livenessThreshold - nl.renewalDuration
ticker := time.NewTicker(heartbeatInterval)
defer ticker.Stop()
for {
	select {
	case <-nl.heartbeatToken:
	case <-nl.stopper.ShouldQuiesce():
		return
	}
	// Give the context a timeout approximately as long as the time we
	// have left before our liveness entry expires.
	if err := contextutil.RunWithTimeout(ctx, "node liveness heartbeat", nl.renewalDuration,
		func(ctx context.Context) error {
			// Retry heartbeat in the event the conditional put fails.
			for r := retry.StartWithCtx(ctx, retryOpts); r.Next(); {
				oldLiveness, ok := nl.Self()
				if !ok {
					nodeID := nl.gossip.NodeID.Get()
					liveness, err := nl.getLivenessFromKV(ctx, nodeID)
					if err != nil {
						log.Infof(ctx, "unable to get liveness record from KV: %s", err)
						if grpcutil.IsConnectionRejected(err) {
							return err
						}
						continue
					}
					oldLiveness = liveness
				}
				if err := nl.heartbeatInternal(ctx, oldLiveness, incrementEpoch); err != nil {

And as of aece96b, we're also gossiping each node's IO overload score periodically. It seems to me then that we have all the information we need on remote stores. Am I missing something? Also note that by only using node-level state to prevent lease transfers, we're making multi-store setups behave a tad awkwardly; a single IO-overloaded store would prevent lease transfers to replicas on other stores that could very well have taken them.
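
A rough sketch of the per-store gating this suggests; the type, field names, and threshold below are placeholders for illustration, not the real allocator API.

package main

import "fmt"

// storeHealth stands in for the per-store information a remote node already
// has available (e.g. the gossiped IO overload score); not the real descriptor.
type storeHealth struct {
	storeID         int
	ioOverloadScore float64
}

// okToTransferLease gates on each store individually, so a single overloaded
// store on a node does not block transfers to its healthy sibling stores.
func okToTransferLease(s storeHealth, threshold float64) bool {
	return s.ioOverloadScore < threshold
}

func main() {
	overloaded := storeHealth{storeID: 1, ioOverloadScore: 0.9}
	healthy := storeHealth{storeID: 2, ioOverloadScore: 0.1}
	fmt.Println(okToTransferLease(overloaded, 0.5)) // false: skip this store
	fmt.Println(okToTransferLease(healthy, 0.5))    // true: fine to transfer
}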

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@andrewbaptist
Contributor Author

It's not quite right that we have all the information we need, but I am going to put this PR on hold pending the results of the allocator changes in #97142. The underlying problem is that there is a window (10 sec to 5 min) before we detect IO overload and can start using that signal effectively. During this window it is STILL bad to transfer leases to this store. So when I bring this back, I think the better approach is to wait only for Raft catchup rather than for the IO overload signal. I also need to wait for #97044 to complete, which will make it possible to accurately know the state of Raft.

Finally, I agree that this is a little heavy-handed since it is a node-level signal rather than a store-level one. Raft catchup is also store-level, so I can switch to using that once it is in place.
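
A rough sketch of a store-level Raft-catchup gate along these lines; the field name and the zero-backlog condition are assumptions for illustration, not the planned implementation.

package main

import "fmt"

// storeRaftProgress is a hypothetical summary of a store's Raft state;
// the field is illustrative only.
type storeRaftProgress struct {
	// replicasBehind counts replicas on this store whose applied state
	// still trails the Raft entries they are known to be missing.
	replicasBehind int
}

// readyForLeases only advertises the store as a lease-transfer target once
// it has worked through its Raft backlog.
func readyForLeases(p storeRaftProgress) bool {
	return p.replicasBehind == 0
}

func main() {
	fmt.Println(readyForLeases(storeRaftProgress{replicasBehind: 12})) // false: still catching up
	fmt.Println(readyForLeases(storeRaftProgress{replicasBehind: 0}))  // true: caught up
}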
