node-group-auto-discovery support for oci #7403

Merged: 8 commits, Oct 25, 2024
Changes from 6 commits
10 changes: 9 additions & 1 deletion cluster-autoscaler/cloudprovider/oci/README.md
@@ -166,6 +166,13 @@ use-instance-principals = true

n/a

### Node Group Auto Discovery
`--node-group-auto-discovery` can be given in the pattern below. It discovers the node pools under the given compartment by matching the node pool tags (whether they are Freeform or Defined tags).
```
clusterId:<clusterId>,compartmentId:<compartmentId>,nodepoolTags:<tagKey1>=<tagValue1>&<tagKey2>=<tagValue2>,min:<min>,max:<max>
```

Contributor: Are all of the fields in this required, or are any optional? Can we specify that in the comment string above?

Contributor (author): All of the fields are mandatory. I've added a statement to the README to make that clear.
Auto discovery cannot be used together with static discovery (the `node` parameter), to prevent conflicts. An illustrative flag value is shown below.
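
For illustration, a complete flag value might look like the following (the OCIDs and tag names here are placeholders, not real resources):

```
--node-group-auto-discovery=clusterId:ocid1.cluster.oc1.test.aaa,compartmentId:ocid1.compartment.oc1.test.bbb,nodepoolTags:ca-managed=true&team=platform,min:1,max:10
```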

## Deployment

### Create OCI config secret (only if _not_ using Instance Principals)
@@ -271,7 +278,8 @@ kubectl apply -f ./cloudprovider/oci/examples/oci-nodepool-cluster-autoscaler-w-
correctly (`oci-cloud-controller-manager`).
- Avoid manually changing pools that are managed by the Cluster Autoscaler. For example, do not add or remove nodes
using kubectl, or using the Console (or the Oracle Cloud Infrastructure CLI or API).
- `--node-group-auto-discovery` and `--node-autoprovisioning-enabled=true` are not supported.
- `--node-autoprovisioning-enabled=true` is not supported.
- `--node-group-auto-discovery` and `node` parameters cannot be used together, as that can cause conflicts.
- We set a `nvidia.com/gpu:NoSchedule` taint on nodes in GPU-enabled pools.

## Helpful links
18 changes: 18 additions & 0 deletions cluster-autoscaler/cloudprovider/oci/common/oci_util.go
@@ -205,3 +205,21 @@ func GetAllPoolTypes(groups []string) (string, error) {
}
return ocidType, nil
}

// HasNodeGroupTags checks if nodepoolTags is provided
func HasNodeGroupTags(nodeGroupAutoDiscoveryList []string) (bool, bool, error) {
instancePoolTagsFound := false
nodePoolTagsFound := false
for _, arg := range nodeGroupAutoDiscoveryList {
if strings.Contains(arg, "nodepoolTags") {
nodePoolTagsFound = true
}
if strings.Contains(arg, "instancepoolTags") {
instancePoolTagsFound = true
}
}
if instancePoolTagsFound && nodePoolTagsFound {
return instancePoolTagsFound, nodePoolTagsFound, fmt.Errorf("can not use both instancepoolTags and nodepoolTags")
}
return instancePoolTagsFound, nodePoolTagsFound, nil
}
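
A quick sketch of the intended behavior (the discovery spec string below is made up for illustration):

```go
// Hypothetical --node-group-auto-discovery values passed to HasNodeGroupTags.
specs := []string{
	"clusterId:ocid1.cluster.oc1.test.aaa,compartmentId:ocid1.compartment.oc1.test.bbb,nodepoolTags:ca-managed=true,min:1,max:5",
}
instancePoolTagsFound, nodePoolTagsFound, err := HasNodeGroupTags(specs)
// instancePoolTagsFound == false, nodePoolTagsFound == true, err == nil.
// Passing one spec with nodepoolTags and another with instancepoolTags
// would instead return a non-nil error.
```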
@@ -153,8 +153,14 @@ func BuildOCI(opts config.AutoscalingOptions, do cloudprovider.NodeGroupDiscover
if err != nil {
klog.Fatalf("Failed to get pool type: %v", err)
}
if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) {
manager, err := nodepools.CreateNodePoolManager(opts.CloudConfig, do, createKubeClient(opts))
_, nodepoolTagsFound, err := ocicommon.HasNodeGroupTags(opts.NodeGroupAutoDiscovery)
if err != nil {
klog.Fatalf("Failed to get auto discovery tags: %v", err)
}
if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) && nodepoolTagsFound {
klog.Fatalf("-nodes and -node-group-auto-discovery parameters can not be used together.")
} else if strings.HasPrefix(ocidType, npconsts.OciNodePoolResourceIdent) || nodepoolTagsFound {
manager, err := nodepools.CreateNodePoolManager(opts.CloudConfig, opts.NodeGroupAutoDiscovery, do, createKubeClient(opts))
if err != nil {
klog.Fatalf("Could not create OCI OKE cloud provider: %v", err)
}
147 changes: 146 additions & 1 deletion cluster-autoscaler/cloudprovider/oci/nodepools/oci_manager.go
@@ -9,6 +9,7 @@ import (
"fmt"
"math"
"os"
"regexp"
"strconv"
"strings"
"time"
@@ -34,6 +35,11 @@ import (
const (
maxAddTaintRetries = 5
maxGetNodepoolRetries = 3
clusterId = "clusterId"
compartmentId = "compartmentId"
nodepoolTags = "nodepoolTags"
min = "min"
max = "max"
)

var (
@@ -75,10 +81,11 @@ type okeClient interface {
GetNodePool(context.Context, oke.GetNodePoolRequest) (oke.GetNodePoolResponse, error)
UpdateNodePool(context.Context, oke.UpdateNodePoolRequest) (oke.UpdateNodePoolResponse, error)
DeleteNode(context.Context, oke.DeleteNodeRequest) (oke.DeleteNodeResponse, error)
ListNodePools(ctx context.Context, request oke.ListNodePoolsRequest) (oke.ListNodePoolsResponse, error)
}

// CreateNodePoolManager creates a NodePoolManager that can manage autoscaling node pools
func CreateNodePoolManager(cloudConfigPath string, discoveryOpts cloudprovider.NodeGroupDiscoveryOptions, kubeClient kubernetes.Interface) (NodePoolManager, error) {
func CreateNodePoolManager(cloudConfigPath string, nodeGroupAutoDiscoveryList []string, discoveryOpts cloudprovider.NodeGroupDiscoveryOptions, kubeClient kubernetes.Interface) (NodePoolManager, error) {

var err error
var configProvider common.ConfigurationProvider
@@ -151,6 +158,20 @@ func CreateNodePoolManager(cloudConfigPath string, discoveryOpts cloudprovider.N
nodePoolCache: newNodePoolCache(&okeClient),
}

// auto discover nodepools from compartments with nodeGroupAutoDiscovery parameter
klog.Infof("checking node groups for autodiscovery ... ")
for _, arg := range nodeGroupAutoDiscoveryList {
nodeGroup, err := nodeGroupFromArg(arg)
if err != nil {
return nil, fmt.Errorf("unable to construct node group auto discovery from argument: %v", err)
}
nodeGroup.manager = manager
nodeGroup.kubeClient = kubeClient

manager.nodeGroups = append(manager.nodeGroups, *nodeGroup)
autoDiscoverNodeGroups(manager, manager.okeClient, *nodeGroup)
Contributor: Given that the auto discovery happens in CreateNodePoolManager, I assume that auto discovery only happens during startup. Is that right?

Contributor: Never mind; below in the forceRefresh function, we also call autoDiscoverNodeGroups.
}

Contributor (@jlamillan, Oct 16, 2024): It seems like node pools that were explicitly configured via --nodes should be added before (and take precedence over) node pools that were discovered via --node-group-auto-discovery.

Do you agree? That also raises the question of the expected behavior of, say, the max or min node setting when a pool is specified via --nodes=2:5:ocid1.nodepool.oc1.np-a and also discovered via --node-group-auto-discovery=clusterId:ocid1.cluster.oc1.c-1,compartmentId:ocid1.compartment.oc1..c1,nodepoolTags:cluster-autoscaler-also/enabled=true,min:0,max:10?

Contributor (author): nodeGroupAutoDiscovery actually overrides the nodes parameter with this implementation, which means the nodes parameter is ignored if nodeGroupAutoDiscovery is provided.

What I can think of as a solution:

  1. We could force the user to provide only one of them in the config, so the CA would fail on startup if both parameters were provided, and we would also log an error line stating the reason for the end user to see. The end user should then fix the configuration by removing one of them.
  2. If we want both parameters to work together, we need to decide which one has higher priority over the other. I would say the nodes parameter should override the nodeGroupAutoDiscovery min/max values.

Please let me know your thoughts and I will proceed accordingly.

Contributor: The convention seems to be to warn against using it in the docs [1,2], and/or disallow [1] it in the code.

I'm fine with either documenting it and/or erroring out. As you mentioned, currently the code quietly overrides any static node pools while also logging messages as it processes each static node pool, which could cause confusion.

Contributor (author): I added extra checks to prevent using both parameters together, and also documented it in oci/README.

// Contains all the specs from the args that give us the pools.
for _, arg := range discoveryOpts.NodeGroupSpecs {
np, err := nodePoolFromArg(arg)
@@ -180,6 +201,48 @@ func CreateNodePoolManager(cloudConfigPath string, discoveryOpts cloudprovider.N
return manager, nil
}

func autoDiscoverNodeGroups(m *ociManagerImpl, okeClient okeClient, nodeGroup nodeGroupAutoDiscovery) (bool, error) {
resp, reqErr := okeClient.ListNodePools(context.Background(), oke.ListNodePoolsRequest{
ClusterId: common.String(nodeGroup.clusterId),
CompartmentId: common.String(nodeGroup.compartmentId),
})
if reqErr != nil {
klog.Errorf("failed to fetch the nodepool list with clusterId: %s, compartmentId: %s. Error: %v", nodeGroup.clusterId, nodeGroup.compartmentId, reqErr)
return false, reqErr
}
for _, nodePoolSummary := range resp.Items {
if validateNodepoolTags(nodeGroup.tags, nodePoolSummary.FreeformTags, nodePoolSummary.DefinedTags) {
Contributor: There are a few types of tags, including defined tags and free-form tags.

As I understand it, user-defined tags on a Node Pool resource would appear in the free-form tags (i.e. nodePoolSummary.FreeformTags, not nodePoolSummary.DefinedTags). Is there a reason we're not checking all the tag namespaces for a match?

Contributor (author): It is not only Freeform tags. Users can also create their own namespace and defined tags. We check both of them to make sure we don't miss a tag applied by the user.

A Defined tag holds a namespace, but a Freeform tag does not:

  • Defined tag: namespace.tagKey=tagValue
  • Freeform tag: tagKey=tagValue

When we query the node pool through the API, the response returns them in separate fields:

  • FreeformTags is a map[string]string
  • DefinedTags is a map[string]map[string]interface{}

Contributor: I see.
nodepool := &nodePool{}
nodepool.id = *nodePoolSummary.Id
nodepool.minSize = nodeGroup.minSize
nodepool.maxSize = nodeGroup.maxSize

nodepool.manager = nodeGroup.manager
nodepool.kubeClient = nodeGroup.kubeClient

m.staticNodePools[nodepool.id] = nodepool
klog.V(5).Infof("auto discovered nodepool in compartment : %s , nodepoolid: %s", nodeGroup.compartmentId, nodepool.id)
} else {
klog.Warningf("nodepool ignored as the tags do not satisfy the requirement : %s , %v, %v", *nodePoolSummary.Id, nodePoolSummary.FreeformTags, nodePoolSummary.DefinedTags)
}
}
return true, nil
}

func validateNodepoolTags(nodeGroupTags map[string]string, freeFormTags map[string]string, definedTags map[string]map[string]interface{}) bool {
if nodeGroupTags != nil {
for tagKey, tagValue := range nodeGroupTags {
namespacedTagKey := strings.Split(tagKey, ".")
if len(namespacedTagKey) == 2 && tagValue != definedTags[namespacedTagKey[0]][namespacedTagKey[1]] {
return false
} else if len(namespacedTagKey) != 2 && tagValue != freeFormTags[tagKey] {
return false
}
}
}
return true
}
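
To make the matching rule concrete, here is a hypothetical tag set (not from the PR): a key containing a dot is looked up as a namespaced defined tag, and a plain key as a freeform tag.

```go
nodeGroupTags := map[string]string{"ns.team": "platform", "env": "prod"}
freeFormTags := map[string]string{"env": "prod"}
definedTags := map[string]map[string]interface{}{"ns": {"team": "platform"}}

// "ns.team" is checked against definedTags["ns"]["team"], and "env" against
// freeFormTags["env"]; both match, so this returns true.
ok := validateNodepoolTags(nodeGroupTags, freeFormTags, definedTags)
```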

// nodePoolFromArg parses a node group spec represented in the form of `<minSize>:<maxSize>:<ocid>` and produces a node group spec object
func nodePoolFromArg(value string) (*nodePool, error) {
tokens := strings.SplitN(value, ":", 3)
@@ -207,6 +270,78 @@ func nodePoolFromArg(value string) (*nodePool, error) {
return spec, nil
}

// nodeGroupFromArg parses a node group spec represented in the form of
// `clusterId:<clusterId>,compartmentId:<compartmentId>,nodepoolTags:<tagKey1>=<tagValue1>&<tagKey2>=<tagValue2>,min:<min>,max:<max>`
// and produces a node group auto discovery object,
// nodepoolTags are optional and CA will capture all nodes if no tags are provided.
func nodeGroupFromArg(value string) (*nodeGroupAutoDiscovery, error) {
// this regex will find the key-value pairs in any given order if separated with a colon
regexPattern := `(?:` + compartmentId + `:(?P<` + compartmentId + `>[^,]+)`
regexPattern = regexPattern + `|` + nodepoolTags + `:(?P<` + nodepoolTags + `>[^,]+)`
regexPattern = regexPattern + `|` + max + `:(?P<` + max + `>[^,]+)`
regexPattern = regexPattern + `|` + min + `:(?P<` + min + `>[^,]+)`
regexPattern = regexPattern + `|` + clusterId + `:(?P<` + clusterId + `>[^,]+)`
regexPattern = regexPattern + `)(?:,|$)`
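
// For readability, the assembled pattern expands to:
// (?:compartmentId:(?P<compartmentId>[^,]+)|nodepoolTags:(?P<nodepoolTags>[^,]+)|max:(?P<max>[^,]+)|min:(?P<min>[^,]+)|clusterId:(?P<clusterId>[^,]+))(?:,|$)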

re := regexp.MustCompile(regexPattern)

parametersMap := make(map[string]string)

// push key-value pairs into a map
for _, match := range re.FindAllStringSubmatch(value, -1) {
for i, name := range re.SubexpNames() {
if i != 0 && match[i] != "" {
parametersMap[name] = match[i]
}
}
}

spec := &nodeGroupAutoDiscovery{}

if parametersMap[clusterId] != "" {
spec.clusterId = parametersMap[clusterId]
} else {
return nil, fmt.Errorf("failed to set %s, it is missing in node-group-auto-discovery parameter", clusterId)
}

if parametersMap[compartmentId] != "" {
spec.compartmentId = parametersMap[compartmentId]
} else {
return nil, fmt.Errorf("failed to set %s, it is missing in node-group-auto-discovery parameter", compartmentId)
}

if size, err := strconv.Atoi(parametersMap[min]); err == nil {
spec.minSize = size
} else {
return nil, fmt.Errorf("failed to set %s size: %s, expected integer", min, parametersMap[min])
}

if size, err := strconv.Atoi(parametersMap[max]); err == nil {
spec.maxSize = size
} else {
return nil, fmt.Errorf("failed to set %s size: %s, expected integer", max, parametersMap[max])
}

if parametersMap[nodepoolTags] != "" {
Contributor: Is it valid not to specify node pool tags, or are they required? This will silently continue on if no nodepoolTags are specified.

Contributor (author): I had made nodepoolTags optional at the beginning, but then I noticed we may also need to support instancepoolTags after @jlamillan's feedback. In the final state, nodepool tags are not optional, since I already do a validation to decide whether an instance-pool manager or a node-pool manager will be initiated. In short, no, it's not optional, and I added a failure case to address your feedback.

nodepoolTags := parametersMap[nodepoolTags]

spec.tags = make(map[string]string)

pairs := strings.Split(nodepoolTags, "&")

for _, pair := range pairs {
parts := strings.Split(pair, "=")
if len(parts) == 2 {
spec.tags[parts[0]] = parts[1]
}
Contributor: Should we return an error if the length is not 2, or is that a valid use case? Right now we just silently continue.

Contributor (author, @gvnc, Oct 24, 2024): Yes, this would actually be a formatting error; I've added an else statement to fix it.
}
}

klog.Infof("node group auto discovery spec constructed: %+v", spec)

return spec, nil
}
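
As a usage note, a spec missing one of the mandatory fields fails fast (hypothetical input shown):

```go
// min is absent, so nodeGroupFromArg returns an error:
// "failed to set min size: , expected integer"
_, err := nodeGroupFromArg("clusterId:ocid1.cluster.oc1.test.aaa,compartmentId:ocid1.compartment.oc1.test.bbb,max:5")
```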

type ociManagerImpl struct {
cfg *ocicommon.CloudConfig
okeClient okeClient
@@ -215,6 +350,7 @@ type ociManagerImpl struct {
ociTagsGetter ocicommon.TagsGetter
registeredTaintsGetter RegisteredTaintsGetter
staticNodePools map[string]NodePool
nodeGroups []nodeGroupAutoDiscovery

lastRefresh time.Time

@@ -253,6 +389,15 @@ func (m *ociManagerImpl) TaintToPreventFurtherSchedulingOnRestart(nodes []*apiv1
}

func (m *ociManagerImpl) forceRefresh() error {
// auto discover node groups
if m.nodeGroups != nil {
// empty previous nodepool map to do an auto discovery
m.staticNodePools = make(map[string]NodePool)
for _, nodeGroup := range m.nodeGroups {
autoDiscoverNodeGroups(m, m.okeClient, nodeGroup)
}
}
// rebuild nodepool cache
err := m.nodePoolCache.rebuild(m.staticNodePools, maxGetNodepoolRetries)
if err != nil {
return err
71 changes: 71 additions & 0 deletions cluster-autoscaler/cloudprovider/oci/nodepools/oci_manager_test.go
@@ -336,6 +336,10 @@ func (c mockOKEClient) DeleteNode(context.Context, oke.DeleteNodeRequest) (oke.D
}, nil
}

func (c mockOKEClient) ListNodePools(context.Context, oke.ListNodePoolsRequest) (oke.ListNodePoolsResponse, error) {
return oke.ListNodePoolsResponse{}, nil
}

func TestRemoveInstance(t *testing.T) {
instanceId1 := "instance1"
instanceId2 := "instance2"
@@ -384,3 +388,70 @@
}
}
}

func TestNodeGroupFromArg(t *testing.T) {
var nodeGroupArg = "clusterId:testClusterId,compartmentId:testCompartmentId,nodepoolTags:ca-managed=true&namespace.foo=bar,min:1,max:5"
Contributor: Minor nit, but can we update the IDs to look like real OCIDs, just in case there are weird parsing bugs? For example:

ocid1.cluster.oc1.test-region.test
ocid1.compartment.oc1.test-region.test

Contributor (author): Done.

nodeGroupAutoDiscovery, err := nodeGroupFromArg(nodeGroupArg)
if err != nil {
t.Errorf("Error: #{err}")
}
if nodeGroupAutoDiscovery.clusterId != "testClusterId" {
t.Errorf("Error: clusterId should be testClusterId")
}
if nodeGroupAutoDiscovery.compartmentId != "testCompartmentId" {
t.Errorf("Error: compartmentId should be testCompartmentId")
}
if nodeGroupAutoDiscovery.minSize != 1 {
t.Errorf("Error: minSize should be 1")
}
if nodeGroupAutoDiscovery.maxSize != 5 {
t.Errorf("Error: maxSize should be 5")
}
if nodeGroupAutoDiscovery.tags["ca-managed"] != "true" {
t.Errorf("Error: ca-managed:true is missing in tags.")
}
if nodeGroupAutoDiscovery.tags["namespace.foo"] != "bar" {
t.Errorf("Error: namespace.foo:bar is missing in tags.")
}
}

func TestValidateNodePoolTags(t *testing.T) {

var nodeGroupTags map[string]string = nil
var nodePoolTags map[string]string = nil
var definedTags map[string]map[string]interface{} = nil

if validateNodepoolTags(nodeGroupTags, nodePoolTags, definedTags) == false {
t.Errorf("validateNodepoolTags shouldn't return false for empty tags map")
}

nodeGroupTags = make(map[string]string)
nodeGroupTags["test"] = "test"

if validateNodepoolTags(nodeGroupTags, nodePoolTags, definedTags) == true {
t.Errorf("validateNodepoolTags shouldn't return true for tags missing")
}

nodePoolTags = make(map[string]string)
nodePoolTags["foo"] = "bar"

if validateNodepoolTags(nodeGroupTags, nodePoolTags, definedTags) == true {
t.Errorf("validateNodepoolTags shouldn't return true for not matching tags")
}
Contributor: Any reason not to do something like a traditional table-driven test?

https://go.dev/wiki/TableDrivenTests

My main concern with the current test is that if someone adds a new test case in the middle of it, it will mess up every test that comes after it.

Contributor (author): I refactored this test to meet the table-driven test requirements; it now runs with test cases given in a map (see the sketch after this test function).


nodePoolTags["test"] = "test"

if validateNodepoolTags(nodeGroupTags, nodePoolTags, definedTags) == false {
t.Errorf("validateNodepoolTags shouldn't return false for matching tags")
}

nodeGroupTags["ns.tag1"] = "tag2"
definedTagsMap := make(map[string]interface{})
definedTagsMap["tag1"] = "tag2"
definedTags = make(map[string]map[string]interface{})
definedTags["ns"] = definedTagsMap

if validateNodepoolTags(nodeGroupTags, nodePoolTags, definedTags) == false {
t.Errorf("validateNodepoolTags shouldn't return false for namespaced tags")
}
}
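
For reference, a minimal table-driven sketch of the refactor discussed above might look like this (case names and structure are illustrative, not the PR's final code):

```go
func TestValidateNodepoolTagsTable(t *testing.T) {
	cases := map[string]struct {
		groupTags    map[string]string
		freeformTags map[string]string
		definedTags  map[string]map[string]interface{}
		want         bool
	}{
		"nil tags always match":  {want: true},
		"required tag missing":   {groupTags: map[string]string{"test": "test"}, want: false},
		"freeform tag matches":   {groupTags: map[string]string{"test": "test"}, freeformTags: map[string]string{"test": "test"}, want: true},
		"namespaced defined tag": {groupTags: map[string]string{"ns.tag1": "tag2"}, definedTags: map[string]map[string]interface{}{"ns": {"tag1": "tag2"}}, want: true},
	}
	for name, tc := range cases {
		t.Run(name, func(t *testing.T) {
			if got := validateNodepoolTags(tc.groupTags, tc.freeformTags, tc.definedTags); got != tc.want {
				t.Errorf("got %v, want %v", got, tc.want)
			}
		})
	}
}
```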
11 changes: 11 additions & 0 deletions cluster-autoscaler/cloudprovider/oci/nodepools/oci_node_pool.go
@@ -46,6 +46,17 @@ type nodePool struct {
maxSize int
}

type nodeGroupAutoDiscovery struct {
manager NodePoolManager
kubeClient kubernetes.Interface

clusterId string
compartmentId string
tags map[string]string
minSize int
maxSize int
}
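
For orientation, a spec parsed from the README's pattern would populate this struct roughly as follows (values are illustrative):

```go
// Hypothetical result of nodeGroupFromArg for
// "clusterId:ocid1.cluster.oc1.test.aaa,compartmentId:ocid1.compartment.oc1.test.bbb,nodepoolTags:ca-managed=true,min:1,max:5"
spec := nodeGroupAutoDiscovery{
	clusterId:     "ocid1.cluster.oc1.test.aaa",
	compartmentId: "ocid1.compartment.oc1.test.bbb",
	tags:          map[string]string{"ca-managed": "true"},
	minSize:       1,
	maxSize:       5,
}
```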

// MaxSize returns maximum size of the node group.
func (np *nodePool) MaxSize() int {
return np.maxSize