[bitnami/redis]: Enhance sentinel resiliency, harmonize sentinel behaviour by using staticID as default behaviour #7278
Description of the change
After upgrading from chart version 12.0.1 to the latest Redis chart version, we noticed that sentinel behaviour was broken.
Initially, we made another PR that tackled one specific issue we wanted to fix. However, we then realized that the current sentinel implementation is much more fragile than we thought it was.
In the current sentinel configuration there is an option to use staticID: true or not. This leads to two different behaviours, both meant to ensure the system's stability and resiliency. Maintaining two different but functionally equivalent behaviours can introduce fragility. Specifically, supporting two incompatible configurations adds the burden of ensuring that a change made for one path does not introduce new bugs in the other.
When NOT using staticID: true, new sentinel nodes boot up with a randomly generated ID. As per the sentinel’s documentation, "Sentinels never forget already seen Sentinels, even if they are not reachable for a long time". In order to forget dead nodes, we need to command the running nodes to clean their list of sentinel peers. This is what the chart currently does.

When using staticID: true, new sentinel nodes boot up with a fixed ID based on their hostname, which is static when using a StatefulSet. When a new sentinel node boots up, it reaches out to the running sentinels on the network and announces itself with a constant ID. Because the other sentinels recognize this ID, they update the IP of the sentinel node rather than registering it as a new sentinel node. This removes the need to command sentinels to clean their list of sentinel peers.

Although both systems work, we believe using
staticID: true is the superior solution for the following reason:

sleep command is run at each iteration. The sleepDelay value is defined in the values file; the default is 5 seconds. Sentinel nodes run with a readinessProbe that has a default timeout of 45 seconds before it fails and the container is restarted. This limits the number of replicas your deployment can have before you need to manually override the readinessProbe values. In addition, the more replicas there are, the longer it takes to boot up a new node. We believe this is not the most scalable system.

This PR harmonizes the sentinel behaviours by making
staticID: true the default behaviour. The value is removed from the values file and can no longer be overridden. This offers a much more resilient and stable sentinel system, which should allow for self-recovery in all scenarios.

Benefits
sleep command was run at each iteration. Removing this makes the system more robust and scalable.
CrashLoopBackOff state when we lose enough nodes so that the quorum is not met to elect & promote a new master.

Possible drawbacks
CrashLoopBackOff state and this, but now we can expect a full recovery. How fast a new master is elected will depend on your sentinel.downAfterMilliseconds and sentinel.failoverTimeout values.

Applicable issues
Additional information
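As an illustration of the static ID behaviour described above, here is a minimal Python sketch. It is not the chart's actual script: the SHA-1 hashing, the hostname `redis-node-0`, the IP addresses, and the toy peer table are all assumptions made for the example. It only shows why an ID derived from a stable hostname lets peers update an existing entry instead of accumulating stale ones.

```python
import hashlib
import secrets
from typing import Dict


def static_sentinel_id(hostname: str) -> str:
    """Deterministic 40-hex-character ID derived from the hostname.

    Assumption: SHA-1 is used here only because it yields 40 hex
    characters; the chart may derive its ID differently.
    """
    return hashlib.sha1(hostname.encode("utf-8")).hexdigest()


def random_sentinel_id() -> str:
    """Random 40-hex-character ID, standing in for the non-static mode."""
    return secrets.token_hex(20)


def register(peers: Dict[str, str], sentinel_id: str, ip: str) -> None:
    """Hypothetical peer table keyed by sentinel ID.

    A known ID only has its IP updated; an unknown ID adds a new peer.
    """
    peers[sentinel_id] = ip


# Same pod (hypothetical hostname) announces itself before and after a restart.
static_peers: Dict[str, str] = {}
register(static_peers, static_sentinel_id("redis-node-0"), "10.0.0.5")
register(static_peers, static_sentinel_id("redis-node-0"), "10.0.0.9")

# With random IDs, the restarted pod looks like a brand-new sentinel.
random_peers: Dict[str, str] = {}
register(random_peers, random_sentinel_id(), "10.0.0.5")
register(random_peers, random_sentinel_id(), "10.0.0.9")

print(len(static_peers), len(random_peers))
```

In this toy model the static table keeps a single entry whose IP is updated, while the random table grows by one entry per restart. Those stale entries are what the peer-list cleanup had to remove.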
sentinel.staticID and sentinel.cleanDelaySeconds from the values file. Should not be disruptive in any way.

Checklist
Chart.yaml according to semver.