
helm: promote autoscaling to stable #7368

Open
3 of 9 tasks
dimitarvdimitrov opened this issue Feb 12, 2024 · 12 comments

Labels: helm, help wanted (Extra attention is needed)

@dimitarvdimitrov (Contributor) commented Feb 12, 2024

#7282 added autoscaling to Helm as an experimental feature. This issue tracks the remaining work needed to promote autoscaling to stable.

Bugs

Docs/Migration procedure

Remote URL UX

  • Support a remote different from the metamonitoring setup (feat: Adding global kedaAutoscaling section #7392)
  • Take the additional headers and auth from the metamonitoring setup. Currently, basic auth and extra headers are ignored.
  • Default to X-Scope-OrgID: metamonitoring if the config is already sending metrics to the same Mimir installation (the same way that metamonitoring computes it); see the sketch after this list.
  • Add validation that if the same Mimir cluster is used, then metamonitoring is also enabled.
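
A minimal sketch of how these defaults could look in values.yaml, reusing the prometheusAddress/customHeaders keys from the global kedaAutoscaling section proposed further down in this thread. The address is a placeholder and the note about inheriting auth describes desired rather than existing chart behavior:

kedaAutoscaling:
  # Placeholder query endpoint; not an existing default.
  prometheusAddress: http://mimir-nginx.mimir.svc/prometheus
  customHeaders:
    # Default tenant header when metrics go to the same Mimir installation.
    X-Scope-OrgID: metamonitoring
  # Basic auth and extra headers would ideally be inherited from the metamonitoring
  # setup rather than configured separately (assumption, not current behavior).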

Helm-jsonnet diffing

  • Add helm-jsonnet diffing so that the autoscaling configs don't get out of sync when we change one and forget to change the other. This is a matter of enabling autoscaling on select components in these two files and then making sure there are no differences between the rendered manifests. Minor differences can still be ignored via kustomizations like this one (see the sketch below).
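
To illustrate the "ignore minor differences" part, a kustomization along these lines could strip a known, acceptable difference from one rendered output before diffing; the resource name and patched label are made up for the example:

# kustomization.yaml (illustrative only; resource name and label path are placeholders)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - rendered-manifests.yaml
patches:
  - target:
      kind: ScaledObject
      name: mimir-distributor   # hypothetical KEDA object name
    patch: |-
      - op: remove
        path: /metadata/labels/app.kubernetes.io~1managed-by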

Make dashboards compatible with Helm-deployed KEDA objects

@beatkind (Contributor)

@dimitarvdimitrov if you want, you can assign me to both of these issues: #7368 & #7367

@dimitarvdimitrov (Contributor, Author)

thanks @beatkind 🙏 I'm not sure of the level of detail in these issues, so ask away if anything isn't 100% clear

@beatkind (Contributor) commented Feb 13, 2024

Some thoughts on "Support a remote different from the metamonitoring setup" (#7282 (comment)):

Basically it boils down to having a global kedaAutoscaling section:

kedaAutoscaling:
  prometheusAddress: http://...
  customHeaders: {}
  pollingInterval: 10

@QuentinBisson (Contributor)

@dimitarvdimitrov is there an issue for adding an HPA to the components that don't have it yet, like the ingester?

@dimitarvdimitrov (Contributor, Author) commented Apr 8, 2024

For ingesters I could only find an internal one, unfortunately: grafana/mimir-squad#1410. I see that @jhalterman was last working on that. Jonathan, is there a public issue for this work?

@jhalterman (Member)

The issue you cited is the only one. There's nothing public yet.

@ankense-cariad

@dimitarvdimitrov @jhalterman are there plans to add support for ingesters, or will that remain out of scope for KEDA autoscaling in the Helm chart? The last comment in April indicated that there might be internal information on how to configure an HPA for ingesters, but nothing has been made public. Can a public example be published?

@dimitarvdimitrov (Contributor, Author)

I think @pr00se has been working on ingester autoscaling. Patryk, do you have any plans for bringing this to the upstream jsonnet and Helm chart?

@ankense-cariad commented Jan 14, 2025

@dimitarvdimitrov @pr00se The ideal scenario would be for the Helm chart to support container lifecycle hooks, so that ingester pods terminated during scale events can leave the ring properly instead of getting stuck in an UNHEALTHY state. Something like:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "curl -X POST http://localhost:8080/shutdown"]

@dimitarvdimitrov (Contributor, Author)

The shutdown endpoint is not meant to be called on every pod lifecycle stop; it's meant to be called before the ingester is shut down for the last time. After calling POST /shutdown, the ingester isn't expected to come back up. That's the reason the rollout-operator was created in the first place.

@seanankenbruck commented Jan 24, 2025

Thanks @dimitarvdimitrov. We had actually implemented an HPA and modified the ingester StatefulSet with a custom script that calls the endpoints in the sequence described in the "Scaling down ingesters" section of the documentation. However, the lifecycle stop commands don't get executed and our ingesters end up in an unhealthy state.

I've read the rollout-operator documentation, and the section titled "Scaling based on reference resource" suggests that autoscaling can be achieved using a combination of an HPA and the rollout-operator. However, I can't find documentation or examples in either the mimir or rollout-operator repos that describe how to use the two components in tandem to achieve the desired behavior.

Is it possible (and recommended by the community) to manage ingester scaling using a combination of an HPA and the rollout-operator?
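
For orientation, a bare autoscaling/v2 HorizontalPodAutoscaler pointed at an ingester StatefulSet looks like the sketch below. This is only the plain-Kubernetes half of the picture; it omits the rollout-operator reference-resource wiring discussed above, and all names, replica counts, and thresholds are illustrative.

# Illustrative only: a plain HPA on the ingester StatefulSet, with no rollout-operator
# integration; the target name, replica bounds, and metric are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mimir-ingester
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: mimir-ingester
  minReplicas: 3
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70

The jsonnet linked in the following comment shows the actual upstream setup, which differs in detail from this plain HPA.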

@dimitarvdimitrov (Contributor, Author)

@seanankenbruck I can point you to the jsonnet that we use to set up the rollout-operator and HPA. This is the setup necessary for the rollout-operator (most of it should be present in the rollout-operator Helm chart). And this is the HPA autoscaling setup for the new Kafka-based ingest storage (where ingesters are deployed slightly differently than before).
