Control pod routing in service objects with label #181

spilchen · 2022-03-18T21:03:59Z

This adds a new label to pods that controls when they are visible to Service objects. The label is separate from the ready state, and it lets us control client routing during scaling events. For instance, we can drain nodes before they are scaled down, or start routing to new nodes only after an add node has completed and we have verified that Vertica has finished rebalancing.
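
As a rough illustration of why a label can gate routing independently of readiness, here is a minimal Python model of how a Service with an equality-based label selector picks its backends. The label key `example.com/client-routing`, the `app` label, and the pod names are hypothetical and are not taken from this PR.

```python
from dataclasses import dataclass, field

# Hypothetical label key, used only for illustration.
ROUTING_LABEL = "example.com/client-routing"


@dataclass
class Pod:
    name: str
    ready: bool
    labels: dict = field(default_factory=dict)


def matches(selector: dict, pod: Pod) -> bool:
    """Equality-based selector: every key/value pair must be on the pod."""
    return all(pod.labels.get(key) == value for key, value in selector.items())


def endpoints(selector: dict, pods: list) -> list:
    """Names of pods that would receive new connections from the Service."""
    return [p.name for p in pods if p.ready and matches(selector, p)]


selector = {"app": "vertica", ROUTING_LABEL: "true"}
pods = [
    Pod("pod-0", ready=True, labels={"app": "vertica", ROUTING_LABEL: "true"}),
    # Ready, but the routing label has been removed so the pod can drain.
    Pod("pod-1", ready=True, labels={"app": "vertica"}),
    # Carries the label, but is not ready.
    Pod("pod-2", ready=False, labels={"app": "vertica", ROUTING_LABEL: "true"}),
]

print(endpoints(selector, pods))
```

In this model a pod has to be both ready and labeled to receive new connections, so removing the label takes a pod out of rotation without touching its ready state.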

The label isn't part of the StatefulSet spec. Rather, we add (or remove) it using a new reconcile actor.
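
A minimal sketch of what such a reconcile step might look like, written in Python for brevity rather than as the operator's actual code. The `PodFact` fields, the label key, and the rule "route only when rebalanced and not pending delete" are assumptions drawn from the description in this PR, not from the implementation.

```python
from dataclasses import dataclass, field

# Hypothetical label key, used only for illustration.
ROUTING_LABEL = "example.com/client-routing"


@dataclass
class PodFact:
    """Hypothetical subset of the facts gathered about a pod."""
    name: str
    pending_delete: bool  # the pod is about to be scaled down
    rebalanced: bool      # the node was added and rebalancing has finished
    labels: dict = field(default_factory=dict)


def wants_routing(pf: PodFact) -> bool:
    """Assumed rule: route clients only to settled pods that are staying."""
    return pf.rebalanced and not pf.pending_delete


def reconcile_labels(pod_facts: list) -> list:
    """Add or remove the routing label; return the names of changed pods."""
    changed = []
    for pf in pod_facts:
        has_label = pf.labels.get(ROUTING_LABEL) == "true"
        want_label = wants_routing(pf)
        if want_label and not has_label:
            pf.labels[ROUTING_LABEL] = "true"
            changed.append(pf.name)
        elif has_label and not want_label:
            del pf.labels[ROUTING_LABEL]
            changed.append(pf.name)
    return changed
```

Because the step compares desired and actual state before acting, running it a second time with unchanged facts makes no further changes, which is the usual property expected of a reconciler.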

This should close timing windows during scaling that are seen occasionally in the e2e tests (e.g. step 40 in the k-safety-0-scaling test).

The online upgrade reconciler got a bit of an overhaul. Adding the pending delete state to pod facts broke the handling of transient subclusters, so I am now adding transient subclusters to the VerticaDB during online upgrade. I think this is a cleaner solution anyway. For instance, it allowed me to clean up the transient handling in sc_finder.go.

roypaulin

looks good.

This adds drain logic to scale down so that we don't kill nodes that have active connections. It builds on the work done in #181, namely the pending delete state, which tells us whether a pod is going to be scaled down, and the client routing change, which stops the Service object from routing new connections to the pod we are draining.
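
The drain step described above could be sketched roughly as follows, in Python for brevity. The `count_sessions` callable, the polling interval, and the timeout are all assumptions made for illustration; the text does not say how active connections are counted or whether a timeout exists.

```python
import time


def wait_for_drain(count_sessions, timeout_s=60.0, poll_s=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the pod has no active connections or the timeout expires.

    count_sessions is a hypothetical callable that returns the number of
    active client connections on the pod being drained. Returns True when
    the pod is idle and can be removed, False if the timeout expired first.
    """
    deadline = clock() + timeout_s
    while True:
        if count_sessions() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The sketch assumes the routing label has already been removed from the pod, so no new connections arrive while the existing ones finish. `sleep` and `clock` are parameters only so that the loop can be exercised without real waiting.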

Matt Spilchen added 9 commits March 16, 2022 21:14

use client-access label

f33d631

More cases where we need to call labeler

963bc6d

Code cleanup

77fe612

Merge branch 'main' into routing-labeler

3871f3c

Code cleanup

7665f3a

Add transient into VerticaDB during online upgrade

74bc678

Merge branch 'main' into routing-labeler

b785913

Code cleanup

95fd780

Fix e2e test

3342399

spilchen self-assigned this Mar 18, 2022

spilchen requested a review from roypaulin March 21, 2022 11:24

roypaulin approved these changes Mar 21, 2022

spilchen merged commit 7882e1b into vertica:main Mar 21, 2022

spilchen deleted the routing-labeler branch March 21, 2022 17:06

spilchen mentioned this pull request Mar 22, 2022

Drain node during scale down #183

Merged
