
How to handle dynamic parameters that need to be changed every time before generating works for member clusters? #3436

Open
mkloveyy opened this issue Apr 20, 2023 · 20 comments
Labels: kind/question (Indicates an issue that is a support question.)

Comments

@mkloveyy (Author) commented Apr 20, 2023:

Below is a basic scenario in our release process:

[figure: release process diagram]

A fort pod is the first instance deployed in a cluster. During the fort (bastion) deployment phase, no traffic is routed to it; during the canary deployment phase, traffic is routed to it.

We've implemented a deployment CRD called Rollout, based on OpenKruise, to handle the whole process. Part of the Rollout definition for the above case is as follows:

apiVersion:
kind: Rollout
metadata:
  name:
  namespace:
spec:
  fortPull: false
  replicas: 4
  selector:
    matchLabels:
      app: 
  strategy:
    canary:
      steps:
      - properties:  # fort
          delay: "-1"
          pull: "false"
        type: Fort
      - properties:  # canary
          delay: "-1"
          pull: "true"
        type: Fort
      - properties:  # first batch of rolling
          delay: "-1"
          maxUnavailable: "0"
          replicas: 50%
          timeout: "5"
        type: Batch
      - properties:  # second batch of rolling
          delay: "-1"
          maxUnavailable: "0"
          replicas: 100%
          timeout: "5"
        type: Batch
  subsetType: CloneSet
  template:
    ...

So in the multi-cluster case, every time a user creates a release, we need to:

  1. Generate canary.steps for all member clusters.
  2. Allocate a fort pod to one of the member clusters (the values in steps differ between the fort cluster and the non-fort clusters).

For example, the above Rollout needs to be updated in two clusters, cluster-x and cluster-z. The differences between the two clusters are as follows:

# cluster-x.yaml, fort pod in this cluster
strategy:
    canary:
      steps:
      - properties:  # fort
          delay: "-1"
          pull: "false"  # pull out fort
        type: Fort
      - properties:  # canary
          delay: "-1"
          pull: "true"  # pull out fort
        type: Fort
      ...

# cluster-z.yaml, no fort pod
strategy:
    canary:
      steps:
      - properties:  # fort
          delay: "-1"
          pull: nil  # skip
        type: Fort
      - properties:  # canary
          delay: "-1"
          pull: nil  # skip
        type: Fort
      ...

The fort pod is allocated according to the following rules (a sketch follows this list):

  1. If the app has no fort pod yet, allocate it to the cluster with the higher weight based on the ResourceBinding. If the weights are equal, allocate it to the first cluster in the ResourceBinding's cluster list by default.
  2. If the app already has a fort pod, keep it in the cluster where it currently resides.
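
A minimal sketch of that allocation rule, using hypothetical Go types (TargetCluster, currentFortCluster) rather than the real Karmada/ResourceBinding APIs:

// TargetCluster is a hypothetical stand-in for a scheduled cluster entry in the ResourceBinding.
type TargetCluster struct {
	Name   string
	Weight int64
}

// pickFortCluster keeps the fort pod where it already is; otherwise it picks the
// highest-weight target cluster, falling back to the first cluster on a tie.
func pickFortCluster(currentFortCluster string, targets []TargetCluster) string {
	if currentFortCluster != "" {
		return currentFortCluster // rule 2: keep the fort pod in its current cluster
	}
	if len(targets) == 0 {
		return ""
	}
	chosen := targets[0]
	for _, t := range targets[1:] {
		if t.Weight > chosen.Weight {
			chosen = t // rule 1: higher weight wins; on equal weights the first cluster is kept
		}
	}
	return chosen.Name
}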

Currently, we are trying to put this code in the reviseReplicas hook (#3375), but we encountered some issues during development, most of them related to the fort pod:

  1. If replicas is 0, the reviseReplicas hook will not be executed.
     We allow users to start a release with replicas set to 0, but in that case we cannot update the workload because the reviseReplicas hook is never reached:
// pkg/controllers/binding/common.go
func needReviseReplicas(replicas int32, placement *policyv1alpha1.Placement) bool {
	// The replicas > 0 guard means the ReviseReplica hook is skipped entirely
	// when the resource template declares zero replicas.
	return replicas > 0 && placement != nil && placement.ReplicaSchedulingType() == policyv1alpha1.ReplicaSchedulingTypeDivided
}
  2. We cannot get the cluster name in the reviseReplicas hook, so we cannot decide whether to place the fort pod.
     Usually we place the fort pod in the cluster with the higher weight. If the weights are evenly distributed, we select the first cluster in the ResourceBinding by default. But we cannot know which cluster we are revising for inside this hook, because the cluster name is not passed in:
// pkg/controllers/binding/common.go
for i := range targetClusters {
	...
	if needReviseReplicas(replicas, placement) {
		if resourceInterpreter.HookEnabled(clonedWorkload.GroupVersionKind(), configv1alpha1.InterpreterOperationReviseReplica) {
			clonedWorkload, err = resourceInterpreter.ReviseReplica(clonedWorkload, int64(targetCluster.Replicas))
			if err != nil {
				klog.Errorf("Failed to revise replica for %s/%s/%s in cluster %s, err is: %v",
					workload.GetKind(), workload.GetNamespace(), workload.GetName(), targetCluster.Name, err)
				return err
			}
		}
	}
	...
}

So, is there a suitable way to implement this kind of differentiated configuration, and where would be the appropriate place to put the code? Maybe expanding reviseReplica by adding the cluster name and removing the replicas > 0 check, or introducing a new hook point?
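
For illustration, one possible shape of such an expanded hook (a hypothetical signature only, not the current Karmada interpreter interface) could be:

// Hypothetical: ReviseReplica additionally receives the target cluster name so that
// per-cluster decisions (such as fort-pod placement) can be made inside the hook.
// *unstructured.Unstructured comes from k8s.io/apimachinery/pkg/apis/meta/v1/unstructured.
type ClusterAwareReplicaReviser interface {
	ReviseReplica(object *unstructured.Unstructured, replica int64, clusterName string) (*unstructured.Unstructured, error)
}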

Environment:

  • Karmada version: v1.4.1-with-lua
  • Kubernetes version: 1.19
  • Others:
mkloveyy added the kind/question label on Apr 20, 2023
@XiShanYongYe-Chang (Member) commented:

/cc @RainbowMango

@chaunceyjiang (Member) commented:

For example, the above Rollout needs to be updated in two clusters, cluster-x and cluster-z. The differences between the two clusters are as follows:

apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
spec:
  overrideRules:
    - targetCluster:
        clusterNames:
          - cluster-x
      overriders:
        plaintext:
          - path: /spec/strategy/canary/steps/0/pull
            operator: replace
            value:  "false"
          - path: /spec/strategy/canary/steps/1/pull
            operator: replace
            value:  "true"
    - targetCluster:
        clusterNames:
          - cluster-z
      overriders:
        plaintext:
          - path: /spec/strategy/canary/steps/0/pull
            operator: replace
            value:  nil
          - path: /spec/strategy/canary/steps/1/pull
            operator: replace
            value:  nil

Can this help you?

@mkloveyy (Author) commented Apr 20, 2023:

For example, the above Rollout needs to be updated in two clusters, cluster-x and cluster-z. The differences between the two clusters are as follows:

[OverridePolicy example quoted above]

Can this help you?

Thanks for your reply!

Currently we do not use an OverridePolicy to carry the canary.steps config; instead we implement the generating code in the reviseReplicas hook, and the result is applied directly to the works. But:

1. If replicas is 0, the reviseReplicas hook will not be executed.
2. We cannot get the cluster name in the reviseReplicas hook, so we cannot decide whether to place the fort pod.

So we still cannot decide which cluster to place the fort pod in, because of the issues above.

@chaunceyjiang (Member) commented:

Currently we do not use an OverridePolicy to carry the canary.steps config; instead we implement the generating code in the reviseReplicas hook, and the result is applied directly to the works.

I'm sorry, I don't understand: why do you want to use reviseReplicas instead of OverridePolicy?

@mkloveyy (Author) commented:

I'm sorry, I don't understand: why do you want to use reviseReplicas instead of OverridePolicy?

Mainly for two reasons:

  1. We need to regenerate /spec/strategy/canary/steps every time a user creates a release. If we put this dynamic config in an OverridePolicy, we need to ensure the OverridePolicy is updated before the user updates the global resource; I don't think we have that ordering guarantee today.
  2. We consider OverridePolicy not very suitable for dynamic configs. Currently we only put static config in an OverridePolicy:
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  annotations:
    versionid: "47677"
  name: 
  namespace: 
spec:
  resourceSelectors:
    - apiVersion: 
      kind: Rollout
      name:
      namespace:
  overrideRules:
    - targetCluster:
        clusterNames:
          - cluster-x
      overriders:
        plaintext:
          - path: /metadata/annotations/cdos~1az
            operator: replace
            value: X
          - path: /spec/template/spec/containers/0/env/11
            operator: replace
            value: 
              name: CDOS_K8S_CLUSTER
              value: cluster-x
    - targetCluster:
        clusterNames:
          - cluster-z
      overriders:
        plaintext:
          - path: /metadata/annotations/cdos~1az
            operator: replace
            value: Z
          - path: /spec/template/spec/containers/0/env/11
            operator: replace
            value: 
              name: CDOS_K8S_CLUSTER
              value: cluster-z

@Poor12 (Member) commented Apr 20, 2023:

ReviseReplicas is basically for dividing replicas when the replica scheduling policy is Divided. I don't think your scenario is suitable for it, because you are not changing the replicas in the resource template but changing its strategy per cluster.

@mkloveyy (Author) commented:

ReviseReplicas is basically for dividing replicas when the replica scheduling policy is Divided. I don't think your scenario is suitable for it, because you are not changing the replicas in the resource template but changing its strategy per cluster.

Thanks for your reply!

Where do you think would be the appropriate place to put this piece of logic? Or, to expand the scope a bit, how could we better implement such scenarios?

@chaunceyjiang (Member) commented:

We cannot get the cluster name in the reviseReplicas hook, so we cannot decide whether to place the fort pod.

Similar problem.
#3159

I'm thinking about whether we should introduce cluster info in the ReviseReplica step.

@mkloveyy (Author) commented:

We cannot get the cluster name in the reviseReplicas hook, so we cannot decide whether to place the fort pod.

Similar problem. #3159

I'm thinking about whether we should introduce cluster info in the ReviseReplica step.

Agreed. If we do so, we can implement custom logic that needs to be based on cluster info.

@RainbowMango (Member) commented:

If replicas is 0, the reviseReplicas hook will not be executed.

Do you mean you want to revise the Rollout by reviseReplicas hook point? Have you implemented InterpretReplica?

We need to regenerate /spec/strategy/canary/steps every time when user creates a release.

Does "creates a release" mean creating a Rollout?

I talked to @snowplayfire yesterday about this issue. Technically, we can do that by extending the resource interpreter, but it seems to belong essentially to the override area. Not sure yet; maybe we can have a chat at the community meeting.

@mkloveyy (Author) commented Apr 23, 2023:

Do you mean you want to revise the Rollout by reviseReplicas hook point? Have you implemented InterpretReplica?

Yes, we have implemented InterpretReplica. However, as we need the newest ResourceBinding, we need to revise the Rollout between BuildResourceBinding and ensureWork.

We need to regenerate /spec/strategy/canary/steps every time when user creates a release.

Does "creates a release" mean creating a Rollout?

Roughly, yes. A release here means a deployment action that creates or updates a Rollout.

I talked to @snowplayfire yesterday about this issue. Technically, we can do that by extending the resource interpreter, but it seems to belong essentially to the override area.

My pleasure. Today we had a meeting with @snowplayfire and I heard about this from her, and we've also communicated with @XiShanYongYe-Chang and @chaunceyjiang. Currently I think both solutions (expanding reviseReplica by adding the cluster name, and introducing a new hook point) have broader value and can solve our problems.

Not sure yet; maybe we can have a chat at the community meeting.

Agreed. If possible, we hope to have some concrete discussions and reach some conclusions at next Tuesday's meeting.
Anyway, thank you very much for your assistance!😊

@CharlesQQ (Member) commented:

Does this issue meet the requirements? #1567

@CharlesQQ (Member) commented:

Or add a pause field for PP (PropagationPolicy), as in issue #517.

@mkloveyy (Author) commented Apr 25, 2023:

Does this issue meet the requirements? #1567

Maybe this plan is more suitable for us. However, since OverridePolicy seems to be intended for static parameters, I am wondering whether dynamic parameters could be applied to the Work directly.

@mkloveyy (Author) commented Apr 26, 2023:

@RainbowMango I think there is another option: we could introduce a new hook point, perhaps called reviseObject, which runs after OverridePolicies have been applied to the works.

cops, ops, err := overrideManager.ApplyOverridePolicies(clonedWorkload, targetCluster.Name)
if err != nil {
	klog.Errorf("Failed to apply overrides for %s/%s/%s, err is: %v", clonedWorkload.GetKind(), clonedWorkload.GetNamespace(), clonedWorkload.GetName(), err)
	return err
}

if resourceInterpreter.HookEnabled(clonedWorkload.GroupVersionKind(), configv1alpha1.InterpreterOperationReviseObject) {
	clonedWorkload, err = resourceInterpreter.ReviseObject(clonedWorkload, targetCluster.Name, int64(targetCluster.Replicas))
	if err != nil {
		klog.Errorf("Failed to revise object for %s/%s/%s in cluster %s, err is: %v", workload.GetKind(), workload.GetNamespace(), workload.GetName(), targetCluster.Name, err)
		return err
	}
}

The purpose of this hook would be to allow users to directly modify the work for each member cluster.
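
As a rough sketch, the proposed hook could look like this on the interpreter side (names and signature are only illustrative, not an existing Karmada API):

// Hypothetical interface for the proposed hook: the interpreter receives the cluster
// name (and desired replicas) so users can modify the work for a specific member cluster
// after OverridePolicies have been applied.
// *unstructured.Unstructured comes from k8s.io/apimachinery/pkg/apis/meta/v1/unstructured.
type ObjectReviser interface {
	ReviseObject(object *unstructured.Unstructured, clusterName string, replica int64) (*unstructured.Unstructured, error)
}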

@XiShanYongYe-Chang (Member) commented:

I think there is another option: we could introduce a new hook point, perhaps called reviseObject, which runs after OverridePolicies have been applied to the works.

Asking a few folks to help take a look: @RainbowMango @chaunceyjiang @CharlesQQ @Poor12

@mkloveyy (Author) commented:

@shiyan2016 @snowplayfire @lxtywypc
We can have a look together.

@chaunceyjiang (Member) commented:

cops, ops, err := overrideManager.ApplyOverridePolicies(clonedWorkload, targetCluster.Name)
if err != nil {
	klog.Errorf("Failed to apply overrides for %s/%s/%s, err is: %v", clonedWorkload.GetKind(), clonedWorkload.GetNamespace(), clonedWorkload.GetName(), err)
	return err
}

if resourceInterpreter.HookEnabled(clonedWorkload.GroupVersionKind(), configv1alpha1.InterpreterOperationReviseObject) {
	clonedWorkload, err = resourceInterpreter.ReviseObject(workload, clonedWorkload, targetCluster.Name)
	if err != nil {
		klog.Errorf("Failed to revise object for %s/%s/%s in cluster %s, err is: %v", workload.GetKind(), workload.GetNamespace(), workload.GetName(), targetCluster.Name, err)
		return err
	}
}

My suggestion is resourceInterpreter.ReviseObject(workload, clonedWorkload, targetCluster.Name), i.e. also pass in the original workload.

@CharlesQQ (Member) commented Apr 28, 2023:

I think there is another option: we could introduce a new hook point, perhaps called reviseObject, which runs after OverridePolicies have been applied to the works.

Might reviseObject lead to the user not knowing which cluster's work object is being revised?

@mkloveyy (Author) commented Apr 28, 2023:

Might reviseObject lead to the user not knowing which cluster's work object is being revised?

That's why we need to pass targetCluster.Name to it.
