
[bitnami/redis]: Enhance sentinel resiliency, harmonize sentinel behaviour by using staticID as default behaviour #7278

Merged: 1 commit merged into bitnami:master on Aug 25, 2021

Conversation

@rsecob (Contributor) commented Aug 19, 2021

Description of the change

After upgrading from chart version 12.0.1 to the latest Redis chart version, we noticed that the sentinel behaviour was broken.

Initially, we made another PR that tackled one specific issue we wanted to fix. However, we then realized that the current sentinel implementation is much more fragile than we thought.

The current sentinel configuration exposes a staticID option. Depending on whether staticID: true is set, the chart follows one of two different behaviours for ensuring the system's stability and resiliency.

Having two different, functionally equivalent behaviours introduces fragility. Specifically, supporting two incompatible configurations adds the burden of ensuring that changes to one path do not introduce new bugs in the other.

  • When NOT using staticID: true, new sentinel nodes boot up with a randomly generated ID. As per the sentinel documentation, sentinels never forget already seen sentinels, even if they are unreachable for a long time. To forget dead nodes, we need to command the running nodes to clean their list of sentinel peers, which is what the chart currently does (see the sketch after this list).

  • When using staticID: true, new sentinel nodes boot up with a fixed ID based on their hostname, which is static when using a StatefulSet. When a new sentinel node boots up, it reaches out to the running sentinels on the network and announces itself with that constant ID. Because the other sentinels recognize this ID, they update the IP of the sentinel node rather than registering it as a new sentinel node. This effectively removes the need to command sentinels to clean their list of sentinel peers.
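
For context, here is a minimal sketch of the peer-cleanup step the non-staticID path relies on; the hostnames, port, and delay are illustrative placeholders, and the chart's actual script may differ:

```bash
#!/bin/bash
# Ask each running sentinel to discard its state about known peers.
# SENTINEL RESET makes a sentinel drop its list of known sentinels and
# replicas for the matched masters and re-discover them.
SENTINEL_PORT=26379
SLEEP_DELAY=5   # mirrors the sleepDelay default discussed below

for host in redis-node-0.redis-headless redis-node-1.redis-headless; do
  redis-cli -h "$host" -p "$SENTINEL_PORT" SENTINEL RESET "*"
  sleep "$SLEEP_DELAY"   # the per-iteration sleep this PR removes
done
```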

Although both systems work, we believe using staticID: true is the superior solution for the following reason:

  • Commanding sentinels to clean their list of sentinel peers is done in a loop, with a sleep command run at each iteration. The sleepDelay value is defined in the values file and defaults to 5 seconds. Sentinel nodes run with a readinessProbe that has a default timeout of 45 seconds before it fails and the container is restarted. This caps the number of replicas a deployment can have before the readinessProbe values must be overridden manually (see the back-of-the-envelope calculation below). Moreover, the more replicas there are, the longer it takes to boot up a new node. We believe this is not the most scalable system.
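
A back-of-the-envelope calculation of that limit, using the defaults quoted above (a 5-second sleepDelay and a 45-second readiness window):

```bash
# cleanup_time ≈ sleepDelay * replicas
sleepDelay=5    # seconds, chart default quoted above
replicas=9
echo $(( sleepDelay * replicas ))   # prints 45: nine replicas already
                                    # exhaust the 45-second probe budget
```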

This PR harmonizes the sentinel behaviours by making staticID: true the default behaviour. The value is removed from the values file and can no longer be overridden. The result is a much more resilient and stable sentinel system that should allow for self-recovery in all scenarios. A sketch of how such a static ID can be derived is shown below.
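
To illustrate, here is a minimal sketch of deriving a stable sentinel run ID from the pod hostname; `sentinel myid` is the real sentinel.conf directive, while the hashing recipe and config path are assumptions that may differ from the chart's actual scripts:

```bash
#!/bin/bash
# Derive a deterministic 40-hex-character ID from the pod hostname.
# With a StatefulSet the hostname is stable across restarts, so a
# restarted sentinel re-announces itself under the same ID.
STATIC_ID="$(printf '%s' "$HOSTNAME" | sha1sum | awk '{print $1}')"

# Persist the ID in sentinel.conf before starting redis-sentinel.
echo "sentinel myid ${STATIC_ID}" >> /opt/bitnami/redis-sentinel/etc/sentinel.conf
```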

Benefits

  • Removes the need to command the running sentinels on the network to clean their list of sentinel peers every time a new node joins the cluster, a loop that ran a sleep command at each iteration. Removing it makes the system more robust and scalable.
  • Removes the need to check for master availability, which currently leads to a CrashLoopBackOff state when enough nodes are lost that the quorum needed to elect and promote a new master is no longer met.

Possible drawbacks

  • When a new node joins the cluster while the previous master is down and a failover has not yet been performed, the new node will still consider the old node the current master. When the next failover runs and the quorum is met, a new master will be elected and promoted. This might be considered a drawback; however, it is no worse than the CrashLoopBackOff state, and now we can expect a full recovery. How fast a new master is elected depends on your sentinel.downAfterMilliseconds and sentinel.failoverTimeout values (see the example below).
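
For example, failover latency can be tuned through those two chart parameters; the release name and the values below are placeholders, not recommendations:

```bash
helm upgrade my-redis bitnami/redis \
  --set sentinel.enabled=true \
  --set sentinel.downAfterMilliseconds=5000 \
  --set sentinel.failoverTimeout=10000
```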

Applicable issues

Additional information

  • Removed sentinel.staticID and sentinel.cleanDelaySeconds from the values file. This should not be disruptive in any way.

Checklist

  • Chart version bumped in Chart.yaml according to semver.
  • Title of the PR starts with chart name (e.g. [bitnami/<name_of_the_chart>])

@github-actions (bot) commented Aug 19, 2021

Great PR! Please pay attention to the following items before merging:

Files matching bitnami/*/values.yaml:

  • Is the PR adding a new container? Please, reviewer, add it to the models (internal process)
  • Is the PR adding a new parameter? Please, ensure it’s documented in the README.md

This is an automatically generated QA checklist based on modified files

@rafariossaa (Contributor) commented:

Hi,
Thank you very much for this PR and your detailed explanation.
Please allow us some days to test it and provide feedback.
Also, I will bring in other teammates to take a look at it.

@rafariossaa (Contributor) commented:

I discussed this change internally with @miguelaeh and we agree on it.

@rsecob, as this changes the default behavior and removes a setting from the values file, we consider it a major change.
Could you bump the version in Chart.yaml?

@rsecob force-pushed the rb/improve-sentinel-resiliency branch from f758918 to 94ec328 on August 24, 2021 at 15:57
@rsecob (Contributor, Author) commented Aug 24, 2021

Great news, thank you @rafariossaa for taking the time with the team to review this.

I just bumped the version to 15.0.0

@rafariossaa (Contributor) commented:

Hi,
Thanks.

@rafariossaa (Contributor) left a review:

LGTM

@rafariossaa rafariossaa merged commit 9559497 into bitnami:master Aug 25, 2021
@marcandrews commented:

Thank you ❤️

Successfully merging this pull request may close these issues:

  • [bitnami/redis] All pods in CrashLoopBackOff status after rebooting worker nodes