
[bitnami/redis]: Enhance sentinel resiliency, harmonize sentinel behaviour by using staticID as default behaviour #7278

Merged: 1 commit merged into bitnami:master on Aug 25, 2021

Conversation

@rsecob (Contributor) commented Aug 19, 2021

Description of the change

After upgrading from chart version 12.0.1 to the latest Redis chart version, we noticed that the sentinel behaviour was broken.

Initially, we made another PR that tackled one specific issue we wanted to fix. However, we then realized that the current sentinel implementation is much more fragile than we thought.

The current sentinel configuration exposes a staticID option. Depending on whether staticID: true is set, the chart follows one of two different behaviours for ensuring the system's stability and resiliency.

Having two different, functionally equivalent behaviours introduces fragility. Specifically, supporting two incompatible configurations adds the burden of ensuring that changes to one path do not introduce new bugs in the other.

  • When NOT using staticID: true, new sentinel nodes boot up with a randomly generated ID. As per the sentinel documentation, sentinels never forget already seen sentinels, even if they are unreachable for a long time. To forget dead nodes, we need to command the running nodes to clean their list of sentinel peers, which is what the chart currently does (see the sketch after this list).

  • When using staticID: true, new sentinel nodes boot up with a fixed ID based on their hostname, which is static when using a StatefulSet. When a new sentinel node boots up, it reaches out to the running sentinels on the network and announces itself with that constant ID. Because the other sentinels recognize this ID, they update the IP of the sentinel node rather than registering it as a new sentinel node. This effectively removes the need to command sentinels to clean their list of sentinel peers.
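
For context, here is a minimal sketch of the peer-cleanup step the non-staticID path relies on; the hostnames, port, and delay are illustrative placeholders, and the chart's actual script may differ:

```bash
#!/bin/bash
# Ask each running sentinel to discard its state about known peers.
# SENTINEL RESET makes a sentinel drop its list of known sentinels and
# replicas for the matched masters and re-discover them.
SENTINEL_PORT=26379
SLEEP_DELAY=5   # mirrors the sleepDelay default discussed below

for host in redis-node-0.redis-headless redis-node-1.redis-headless; do
  redis-cli -h "$host" -p "$SENTINEL_PORT" SENTINEL RESET "*"
  sleep "$SLEEP_DELAY"   # the per-iteration sleep this PR removes
done
```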

Although both systems work, we believe using staticID: true is the superior solution for the following reason:

  • Commanding sentinels to clean their list of sentinel peers is done in a loop, with a sleep command run at each iteration. The sleepDelay value is defined in the values file and defaults to 5 seconds. Sentinel nodes run with a readinessProbe that has a default timeout of 45 seconds before it fails and the container is restarted. This caps the number of replicas a deployment can have before the readinessProbe values must be overridden manually (see the back-of-the-envelope calculation below). Moreover, the more replicas there are, the longer it takes to boot up a new node. We believe this is not the most scalable system.
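
A back-of-the-envelope calculation of that limit, using the defaults quoted above (a 5-second sleepDelay and a 45-second readiness window):

```bash
# cleanup_time ≈ sleepDelay * replicas
sleepDelay=5    # seconds, chart default quoted above
replicas=9
echo $(( sleepDelay * replicas ))   # prints 45: nine replicas already
                                    # exhaust the 45-second probe budget
```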

This PR harmonizes the sentinel behaviours by making staticID: true the default behaviour. The value is removed from the values file and can no longer be overridden. The result is a much more resilient and stable sentinel system that should allow for self-recovery in all scenarios. A sketch of how such a static ID can be derived is shown below.
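
To illustrate, here is a minimal sketch of deriving a stable sentinel run ID from the pod hostname; `sentinel myid` is the real sentinel.conf directive, while the hashing recipe and config path are assumptions that may differ from the chart's actual scripts:

```bash
#!/bin/bash
# Derive a deterministic 40-hex-character ID from the pod hostname.
# With a StatefulSet the hostname is stable across restarts, so a
# restarted sentinel re-announces itself under the same ID.
STATIC_ID="$(printf '%s' "$HOSTNAME" | sha1sum | awk '{print $1}')"

# Persist the ID in sentinel.conf before starting redis-sentinel.
echo "sentinel myid ${STATIC_ID}" >> /opt/bitnami/redis-sentinel/etc/sentinel.conf
```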

Benefits

  • Removes the need to command the running sentinels on the network to clean their list of sentinel peers every time a new node joins the cluster, a loop that ran a sleep command at each iteration. Removing it makes the system more robust and scalable.
  • Removes the need to check for master availability, which currently leads to a CrashLoopBackOff state when enough nodes are lost that the quorum needed to elect and promote a new master is no longer met.

Possible drawbacks

  • When a new node joins the cluster while the previous master is down and a failover has not yet been performed, the new node will still consider the old node the current master. When the next failover runs and the quorum is met, a new master will be elected and promoted. This might be considered a drawback; however, it is no worse than the CrashLoopBackOff state, and now we can expect a full recovery. How fast a new master is elected depends on your sentinel.downAfterMilliseconds and sentinel.failoverTimeout values (see the example below).
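
For example, failover latency can be tuned through those two chart parameters; the release name and the values below are placeholders, not recommendations:

```bash
helm upgrade my-redis bitnami/redis \
  --set sentinel.enabled=true \
  --set sentinel.downAfterMilliseconds=5000 \
  --set sentinel.failoverTimeout=10000
```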

Applicable issues

Additional information

  • Removed sentinel.staticID and sentinel.cleanDelaySeconds from the values file. This should not be disruptive in any way.

Checklist

  • Chart version bumped in Chart.yaml according to semver.
  • Title of the PR starts with chart name (e.g. [bitnami/<name_of_the_chart>])

@github-actions (bot) commented Aug 19, 2021

Great PR! Please pay attention to the following items before merging:

Files matching bitnami/*/values.yaml:

  • Is the PR adding a new container? Please, reviewer, add it to the models (internal process)
  • Is the PR adding a new parameter? Please, ensure it’s documented in the README.md

This is an automatically generated QA checklist based on modified files

@rafariossaa (Contributor) commented:

Hi,
Thank you very much for this PR and your detailed explanation.
Please allow us some days to test it and provide feedback.
Also, I will bring in other teammates to take a look at it.

@rafariossaa (Contributor) commented:

I discussed this change internally with @miguelaeh and we agree on it.

@rsecob, as this changes the default behavior and removes a setting from the values file, we consider it a major change.
Could you bump the version in Chart.yaml?

@rsecob force-pushed the rb/improve-sentinel-resiliency branch from f758918 to 94ec328 on August 24, 2021 at 15:57
@rsecob (Contributor, Author) commented Aug 24, 2021

Great news, thank you @rafariossaa for taking the time with the team to review this.

I just bumped the version to 15.0.0

@rafariossaa (Contributor) commented:

Hi,
Thanks.

@rafariossaa (Contributor) left a review:

LGTM

@rafariossaa rafariossaa merged commit 9559497 into bitnami:master Aug 25, 2021
@marcandrews commented:

Thank you ❤️

Successfully merging this pull request may close these issues:

  • [bitnami/redis] All pods in CrashLoopBackOff status after rebooting worker nodes