The replicas of the deployment are incorrect when the related HPA is abnormal #4109

Closed
Rains6 opened this issue Oct 9, 2023 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.


@Rains6
Contributor

Rains6 commented Oct 9, 2023

What happened:
The hpaReplicasSyncer controller is enabled. When the HPA delivered to the member cluster is abnormal, its desiredReplicas is 0. In this case, the replicas synchronized back to the control-plane Deployment are incorrect. currentReplicas is expected to be used instead of desiredReplicas as the calculated value when the HPA is abnormal.

What you expected to happen:
currentReplicas is expected to be used instead of desiredReplicas as the calculated value when the HPA is abnormal.
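A minimal Go sketch of the behaviour expected here (illustrative only, not the actual hpaReplicasSyncer code; the function name replicasToSync is made up):

package sketch

import (
	autoscalingv2 "k8s.io/api/autoscaling/v2"
)

// replicasToSync returns the replica count that should be written back to the
// control-plane workload.
func replicasToSync(hpa *autoscalingv2.HorizontalPodAutoscaler) int32 {
	desired := hpa.Status.DesiredReplicas
	current := hpa.Status.CurrentReplicas

	// When the HPA is abnormal (e.g. metrics unavailable), desiredReplicas can
	// be reported as 0 even though pods are still running. Fall back to
	// currentReplicas instead of scaling the control-plane workload to 0.
	if desired == 0 && current > 0 {
		return current
	}
	return desired
}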

How to reproduce it (as minimally and precisely as possible):
1. The hpaReplicasSyncer controller is enabled. Deliver a Deployment and an HPA to member cluster A.
2. The HPA in cluster A becomes abnormal. In this case, the desiredReplicas of the HPA is 0.
3. On the control plane, the replicas of the deployment becomes 0, while the expected value is 1.

Anything else we need to know?:

Environment:

  • Karmada version: v1.7.0.alpha3
  • kubectl-karmada or karmadactl version (the result of kubectl-karmada version or karmadactl version):
  • Others:
@chaunceyjiang
Member

/assign

@chaunceyjiang
Member

chaunceyjiang commented Oct 25, 2023

The root cause of this issue is the instability of HPA.

The current implementation of #4072 relies heavily on HPA. If HPA runs into an exception, the number of replicas synchronized from the member cluster to the control plane becomes meaningless. And since HPA itself also depends heavily on the stability of metrics-server, HPA becomes even more unstable.

There are two failures that can occur with HPA:

  1. Exception with metrics-server.
  2. Accidental deletion of HPA.

Therefore, we are trying to introduce a new mechanism to avoid a strong dependency on HPA:

Solution 1:
Directly query the scale sub-resource of workloads in the member cluster. This accurately obtains the number of replicas of a workload, but it cannot be used in Karmada's PULL mode.

Solution 2:
Aggregate the status of workloads in the control plane. However, for some custom resources there may be no replica-related info in their status field.
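A rough Go sketch of what Solution 1 could look like using client-go's scale client (illustrative only; the scale client for the member cluster and the function name getWorkloadReplicas are assumptions):

package sketch

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/scale"
)

// getWorkloadReplicas reads the replica count from the scale subresource of a
// workload in a member cluster. The scale client must talk to the member
// cluster's API server directly, which is why this approach cannot work in
// PULL mode.
func getWorkloadReplicas(ctx context.Context, scaleClient scale.ScalesGetter, namespace, name string) (int32, error) {
	gr := schema.GroupResource{Group: "apps", Resource: "deployments"}
	s, err := scaleClient.Scales(namespace).Get(ctx, gr, name, metav1.GetOptions{})
	if err != nil {
		return 0, fmt.Errorf("failed to get scale of %s/%s: %w", namespace, name, err)
	}
	return s.Status.Replicas, nil
}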

@chaunceyjiang
Member

@XiShanYongYe-Chang @jwcesign @lxtywypc @RainbowMango Do you have any other solutions?

@XiShanYongYe-Chang
Member

Let's look for more people's ideas.
/cc @GitHubxsy

@lxtywypc
Contributor

Hmm... In fact, we chose solution 2 for our own implementation. We introduced some 'parsers' to tell what the replicas are for each kind of workload.

We are also considering whether it is necessary to expand the InterpretStatus in resource-interpreter, or to introduce a new InterpreterOperation, to interpret some replica-related info into the status of the Work. We believe this info could help us do more in the future.

@XiShanYongYe-Chang
Member

We introduced some 'parsers' to tell what the replicas are for each kind of workload.

Doesn't this require a new component?

We are also considering whether it is necessary to expand the InterpretStatus in resource-interpreter, or to introduce a new InterpreterOperation, to interpret some replica-related info into the status of the Work. We believe this info could help us do more in the future.

Can you expand on what's relevant to the current issue? And we can start a new issue to talk about the rest.

@lxtywypc
Contributor

lxtywypc commented Nov 1, 2023

Doesn't this require a new component?

We hard-coded some parsers in our own project.

Can you expand on what's relevant to the current issue? And we can start a new issue to talk about the rest.

I mean that if we could introduce a new hook point to interpret the actual replica-related info from each member cluster into the status of the Work, we could use this info directly in hpaReplicasSyncer.

Like this:

apiVersion: work.karmada.io/v1alpha1
kind: Work
metadata:
  name: workload-example
  namespace: karmada-es-cluster1
spec:
  workload:
    # ...
status:
  manifestStatuses:
  - status:
      # ...
    replicas: 1      # new replica-related field, could be used in hpaReplicasSyncer
    readyReplicas: 1 # new replica-related field

@chaunceyjiang
Member

I mean that if we could introduce a new hook point to interpret the actual replica-related info from each member cluster into the status of the Work, we could use this info directly in hpaReplicasSyncer.

I think this is a good idea.

@RainbowMango
Member

I mean that if we could introduce a new hook point to interpret the actual replica-related info from each member cluster into the status of the Work, we could use this info directly in hpaReplicasSyncer.

I get it.
The first thing we need to do is extend ReflectStatus to get the replica-related info, such as replicas (the desired replicas) and readyReplicas (the currently ready replicas).

After that, we need to extend the Work API to record this info, which will then be used by hpaReplicasSyncer.
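Just as a sketch (the field names and their placement on the per-manifest status are assumptions, not a final API design), the Work extension could look like:

package sketch

import (
	"k8s.io/apimachinery/pkg/runtime"
)

// ManifestStatus is a hypothetical extension of the Work API's per-manifest
// status; only the replica-related fields are new, and their exact placement
// is still an open question.
type ManifestStatus struct {
	// Status is the resource status collected from the member cluster.
	// +optional
	Status *runtime.RawExtension `json:"status,omitempty"`

	// Replicas is the desired replica count reflected from the member cluster
	// via the extended ReflectStatus hook.
	// +optional
	Replicas *int32 `json:"replicas,omitempty"`

	// ReadyReplicas is the number of ready replicas in the member cluster.
	// +optional
	ReadyReplicas *int32 `json:"readyReplicas,omitempty"`
}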

All this work seems dedicated to hpaReplicasSyncer; can this info be used in other scenarios? I'm wondering whether it's worth doing it this way.

@lxtywypc
Contributor

All this work seems dedicated to hpaReplicasSyncer; can this info be used in other scenarios? I'm wondering whether it's worth doing it this way.

For now it seems dedicated to hpaReplicasSyncer, but I believe the replica-related info could help us do more in the future, especially in scheduling.

Maybe we could invite more people to share their thoughts.

@XiShanYongYe-Chang
Member

/close

@karmada-bot
Collaborator

@XiShanYongYe-Chang: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
