helm: autoscaling migration procedure #7367
Comments
@beatkind tagging you because I don't think I can assign you without you participating in the issue. Can you assign yourself by any chance? From the GitHub docs:
@dimitarvdimitrov And I need to actively write here :) to be participating - nope, I am not able to assign myself because I do not have any permissions inside the repo
@dimitarvdimitrov thanks for reopening, this was simply a mistake :) - I will add some documentation with my next PR
Hey @beatkind @dimitarvdimitrov, I just tried to follow
How long after deploying the HPA did this happen? Was the HPA already working?
This looks like a problem with load balancing. The ring doesn't determine distributor load. Perhaps the reverse proxy in front of the distributors didn't get the update quickly enough (or DNS was cached, etc.). What reverse proxy are you using - is it the nginx from the chart? Did this resolve itself eventually?
I'd expect that doing it after the
Good point, will validate it again with a test env |
Just wanted to leave a note that we also encountered our replicas scaling down to 1 when we set
#7282 added autoscaling to Helm as an experimental feature. This issue is about adding support in the helm chart for a smooth migration and adding documentation for the migration.
Why do we need a migration?
Migrating to a Mimir cluster with autoscaling requires a few intermediate steps to ensure that there are no disruptions to traffic. The major risk is that enabling autoscaling also removes the replicas field from Deployments. If KEDA/HPA hasn't started autoscaling the Deployment yet, then Kubernetes interprets a missing replicas field as meaning 1 replica, which can cause an outage.
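To make the risk concrete, here is a minimal sketch (assumed names and labels, not copied from the chart) of a rendered distributor Deployment once the replicas field is dropped. Kubernetes defaults spec.replicas to 1 for such a manifest, so until the KEDA-created HPA takes over, a Deployment that previously ran many replicas is scaled down to 1 on the next apply.

```yaml
# Hypothetical rendered manifest after enabling autoscaling.
# The chart no longer emits spec.replicas, so Kubernetes treats the
# Deployment as having 1 replica until an HPA starts managing it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mimir-distributor            # example name, not necessarily what the chart renders
spec:
  # replicas: 10                     # previously rendered; now omitted -> defaults to 1
  selector:
    matchLabels:
      app.kubernetes.io/component: distributor   # example labels
  template:
    metadata:
      labels:
        app.kubernetes.io/component: distributor
    spec:
      containers:
        - name: distributor
          image: grafana/mimir       # tag omitted for brevity
```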
Migration in a nutshell
1. Add a distributor.kedaAutoscaling.preserveReplicas: true field in the helm chart which doesn't delete the replicas field from the rendered manifests (feat(helm): Adding preserveReplicas option for kedaAutoscaling to preserve replicas even if keda is enabled #7431).
2. With preserveReplicas: true, deploy the chart (a values sketch follows this list).
3. Remove preserveReplicas from values.yaml and deploy the chart.
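A minimal values.yaml sketch for the intermediate step, assuming the distributor component. Only preserveReplicas is taken from #7431; the surrounding kedaAutoscaling fields (enabled, minReplicaCount, maxReplicaCount) are assumptions and should be checked against the chart's values reference.

```yaml
# Step 2: enable autoscaling while keeping spec.replicas rendered.
# preserveReplicas comes from #7431; the other fields are assumed
# KEDA-style chart options and may differ in the actual chart.
distributor:
  kedaAutoscaling:
    enabled: true
    preserveReplicas: true   # keep the replicas field in the rendered Deployment
    minReplicaCount: 3       # illustrative values only
    maxReplicaCount: 10
```

Once KEDA/HPA is actively managing the Deployment, dropping preserveReplicas from values.yaml and deploying again (step 3) lets the chart stop rendering spec.replicas without the Deployment ever falling back to 1 replica.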
Internal docs
I'm also pasting Grafana Labs-internal documentation that's specific to our deployment tooling with FluxCD. Perhaps it can be used by folks running FluxCD or as a starting point for proper docs:
remove_managed_replicas.sh
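The internal document and the remove_managed_replicas.sh script are not reproduced here. Purely as a generic illustration for FluxCD users, the intermediate preserveReplicas setting could be carried in a HelmRelease like the one below; names, namespace, interval, and chart source are placeholders, not our internal setup.

```yaml
# Generic FluxCD illustration only - not the internal doc or script above.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: mimir            # placeholder
  namespace: mimir       # placeholder
spec:
  interval: 10m
  chart:
    spec:
      chart: mimir-distributed
      sourceRef:
        kind: HelmRepository
        name: grafana    # placeholder; assumes the Grafana Helm repo is defined elsewhere
  values:
    distributor:
      kedaAutoscaling:
        enabled: true            # assumed field name, see note above
        preserveReplicas: true   # intermediate step; remove once the HPA manages replicas
```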