docs: Fix rounding for determining max number of failing zones #6896

aknuds1 · 2023-12-12T10:45:00Z

What this PR does

In the docs, fix the formula for determining the maximum number of failing zones. The current phrasing ("fewer than") combined with rounding down (floor) understates how many zones can fail. For example, with a replication factor (RF) of 3, the docs require fewer than 1 failing zone, while the limit should be at most 1.

Examples:

replication factor 3, max 1 (floor(3 / 2)) failing zone
replication factor 6, max 3 (floor(6 / 2)) failing zones
replication factor 9, max 4 (floor(9 / 2)) failing zones

Which issue(s) this PR fixes or relates to

Checklist

[na] Tests updated.
Documentation added.
[na] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
[na] about-versioning.md updated with experimental features.

56quarters · 2023-12-12T15:00:57Z

docs/sources/mimir/configure/configure-zone-aware-replication.md

@@ -69,7 +69,7 @@ With a replication factor of 3, which is the default, deploy the Grafana Mimir c
 Deploying Grafana Mimir clusters to more zones than the configured replication factor does not have a negative impact.
 Deploying Grafana Mimir clusters to fewer zones than the configured replication factor can cause writes to the replica to be missed, or can cause writes to fail completely.

-If there are fewer than `floor(replication factor / 2)` zones with failing replicas, reads and writes can withstand zone failures.
+If there are fewer than `ceil(replication factor / 2)` zones with failing replicas, reads and writes can withstand zone failures.


This entire phrasing is confusing and backwards to me. Instead of saying "fewer than X zones with failing replicas" can we say "There can be at most X zones with failing replicas otherwise reads and writes will fail"?

Agree w/ your sentiment @56quarters. I'll try to revise into something more straightforward.

Rewrote it according to your suggestion, PTAL.

56quarters · 2024-01-16T15:02:08Z

docs/sources/mimir/configure/configure-zone-aware-replication.md

@@ -69,7 +69,7 @@ With a replication factor of 3, which is the default, deploy the Grafana Mimir c
 Deploying Grafana Mimir clusters to more zones than the configured replication factor does not have a negative impact.
 Deploying Grafana Mimir clusters to fewer zones than the configured replication factor can cause writes to the replica to be missed, or can cause writes to fail completely.

-If there are fewer than `floor(replication factor / 2)` zones with failing replicas, reads and writes can withstand zone failures.
+There can be at most `floor(replication factor / 2)` zones with failing replicas, otherwise reads and writes will fail.


I think this formula is still wrong. We need a majority of zones available to accept reads and writes.

With replication factor = 3, this formula gives floor(3 / 2) = 1, so one zone can fail. That's correct. However, with replication factor = 4, it gives floor(4 / 2) = 2, so two zones can fail. That's not correct: we need 3 zones available with a replication factor of 4 (even though we don't recommend even replication factors). With replication factor = 5, it gives floor(5 / 2) = 2, so two zones can fail. That's correct.

I believe the correct formula is "there can be at most floor((replication factor - 1) / 2) zones with failing replicas".
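A short derivation of why this formula matches the majority requirement. This is a sketch under one assumption: that a quorum of n replicas is floor(n / 2) + 1, where n stands for the replication factor. The symbols q (quorum) and f (maximum failing zones) are introduced here only for illustration.

```latex
\begin{align*}
q(n) &= \left\lfloor \frac{n}{2} \right\rfloor + 1 \\
f(n) &= n - q(n)
      = n - \left\lfloor \frac{n}{2} \right\rfloor - 1
      = \left\lceil \frac{n}{2} \right\rceil - 1
      = \left\lfloor \frac{n - 1}{2} \right\rfloor
\end{align*}
```

The last step uses the identity ceil(n / 2) = floor((n + 1) / 2) for integer n.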

Repeating the above scenario:

With replication factor = 3, this formula gives floor((3 - 1) / 2) = 1, so one zone can fail. That's correct. With replication factor = 4, it gives floor((4 - 1) / 2) = 1, so one zone can fail. Still correct. With replication factor = 5, it gives floor((5 - 1) / 2) = 2, so two zones can fail. Still correct.

Please double check my reasoning and math.
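One way to double check is to brute-force the majority requirement for small replication factors. The sketch below is illustrative only: the function names are made up for this comment and are not part of Mimir, and it assumes that reads and writes succeed only while a strict majority of zones is healthy.

```python
def tolerated_by_majority(rf: int) -> int:
    """Largest number of failing zones that still leaves a strict majority healthy."""
    failing = 0
    # Fail one more zone only while the remaining healthy zones
    # would still be a strict majority of the replication factor.
    while 2 * (rf - (failing + 1)) > rf:
        failing += 1
    return failing


def floor_half(rf: int) -> int:
    """Formula from the diff: floor(replication factor / 2)."""
    return rf // 2


def floor_half_minus_one(rf: int) -> int:
    """Proposed formula: floor((replication factor - 1) / 2)."""
    return (rf - 1) // 2


# Columns: replication factor, brute-force answer,
# whether floor(rf / 2) matches, whether floor((rf - 1) / 2) matches.
for rf in range(1, 10):
    expected = tolerated_by_majority(rf)
    print(rf, expected, floor_half(rf) == expected, floor_half_minus_one(rf) == expected)
```

For odd replication factors both formulas agree; they differ only for even ones (for example, 4), which matches the reasoning above.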

I've always thought of Mimir failure tolerance in terms of quorum because I first encountered this operating etcd. I remembered their docs being pretty good about this aspect so I dug them out:

https://etcd.io/docs/v3.3/faq/#why-an-odd-number-of-cluster-members:

An etcd cluster needs a majority of nodes, a quorum, to agree on updates to the cluster state.
For a cluster with n members, quorum is (n/2)+1.

https://etcd.io/docs/v3.3/faq/#what-is-failure-tolerance

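| Cluster Size | Majority | Failure Tolerance |
|--------------|----------|-------------------|
| 1            | 1        | 0                 |
| 2            | 2        | 0                 |
| 3            | 2        | 1                 |
| 4            | 3        | 1                 |
| 5            | 3        | 2                 |
| 6            | 4        | 2                 |
| 7            | 4        | 3                 |
| 8            | 5        | 3                 |
| 9            | 5        | 4                 |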

I think the table is a really handy way of covering both sides (quorum and failure tolerance) and is easy to read for those less comfortable with the formula.
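A table like this could also be generated rather than maintained by hand. A minimal sketch, assuming etcd's quorum definition with integer division (n // 2 + 1); `quorum_table` is a hypothetical helper, not something that exists in Mimir or etcd.

```python
def quorum_table(max_size: int) -> list[tuple[int, int, int]]:
    """Return (cluster size, majority, failure tolerance) rows for sizes 1..max_size."""
    rows = []
    for n in range(1, max_size + 1):
        majority = n // 2 + 1            # quorum: more than half of the members
        failure_tolerance = n - majority  # members that can fail while quorum holds
        rows.append((n, majority, failure_tolerance))
    return rows


print(f"{'Cluster Size':>12} {'Majority':>8} {'Failure Tolerance':>17}")
for n, majority, tolerance in quorum_table(9):
    print(f"{n:>12} {majority:>8} {tolerance:>17}")
```

The failure tolerance column computed this way equals floor((n - 1) / 2) for every row.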

Signed-off-by: Arve Knudsen <[email protected]>

56quarters · 2024-11-01T20:36:02Z

Superseded by #9512

aknuds1 requested review from a team as code owners December 12, 2023 10:45

aknuds1 added the type/docs (Improvements or additions to documentation) and bug (Something isn't working) labels Dec 12, 2023

56quarters reviewed Dec 12, 2023

aknuds1 force-pushed the arve/docs-failing-replicas branch from 478f8f4 to cd3891d Compare January 16, 2024 10:09

aknuds1 requested a review from 56quarters January 16, 2024 10:09

56quarters reviewed Jan 16, 2024

aknuds1 force-pushed the arve/docs-failing-replicas branch from cd3891d to b825793 Compare March 5, 2024 07:31

aknuds1 requested a review from jdbaldry as a code owner March 5, 2024 07:31

aknuds1 added 2 commits June 19, 2024 17:42

docs: Fix rounding for determining max number of failing zones

78893c6

Signed-off-by: Arve Knudsen <[email protected]>

Revise wording

aed7de7

Signed-off-by: Arve Knudsen <[email protected]>

aknuds1 force-pushed the arve/docs-failing-replicas branch from b825793 to aed7de7 Compare June 19, 2024 15:42

56quarters closed this Nov 1, 2024
