[NEW] Non-voting primaries, voting empty primaries, voting replicas #1600

zuiderkwast · 2025-01-21T21:41:20Z

The problem/use-case that the feature addresses

The cluster bus traffic overhead is huge for huge clusters because:

All nodes ping each other in a full mesh
Voting nodes = all primaries with slots

Description of the feature

Decouple right-to-vote from primary-serving-slots, so any node can be a voting or non-voting node.

This can allow non-voting primaries, voting replicas and voting empty nodes. Modifying these details may be easier than to rewrite the cluster protocol to Raft (#384).

By having only a few voting nodes (such as five) in a large cluster with hundreds of nodes, voting becomes faster.

Then we can also limit the ping-pong traffic between non-voting nodes. For example, non-voting nodes don't need to ping other nodes. They can just reply to pings from other nodes. This makes the voting nodes a team of leaders and the non-voters a team of followers.

The difficulty seems to be for all the nodes to agree about which nodes can vote and how to count the votes, and to do the switch in a rolling upgrade.

How to introduce it in a rolling upgrade

Idea: Once all the voting nodes in a cluster support the new rules, then automatically switch to the new voting rules. Add some flag bits to the cluster bus header:

Flag bit "I support the new voting rules" (yes/no)
Flag bit "I want to be in the quorum" (yes/no)
Flag bit "New voting rules are activated" (yes/no)

When a node wins a failover election, it checks if all nodes in the current quorum support the new voting rules. If they do, the new node activates the new voting rules and sets the 3rd bit "New voting rules are activated". This is spread to all nodes using broadcast and gossip.

Alternatives you've considered

Raft consensus protocol

Additional information

This is just an idea. We need to consider edge cases...

zuiderkwast · 2025-01-21T21:43:38Z

@enjoy-binbin Are the voting empty primaries the same idea as your arbiter nodes? You say @madolson doesn't like the idea of arbiter nodes? @PingXie also invited to discuss. Do you see any obvious catches with this approach?

enjoy-binbin · 2025-01-22T02:56:13Z

Are the voting empty primaries the same idea as your arbiter nodes?

Yes, it is the same idea, i would like to open source it if we accept.

You say @madolson doesn't like the idea of arbiter nodes?

madolson mention this thought in here #634 (comment). (maybe i got she wrong)

The entire issue is consistent with my idea, i am going to take a moment to talk about it. We actually have three mode:

arbiter mode, this was an early model of Tencent Cloud, both empty primaries and voting primarys has the voting right. We have one shard cluster, since we only have one primary, there is not enough vote, so we will add two arbiter node, that for one shard cluster, if the primary down, the remain two arbiter has the right to vote. We also use it to handle multi-AZ issue, like in this image, if the main AZ down, we still have enough right to promote the other replicas.

instance arbiter mode. As we can see in arbiter mode, as the number of shards increases, the number of arbiters also needs to increase, which brings a lot of overhead. So we add a instance atbier mode, in this mode, only instance arbiter has the right to vote, the non-empty primaries has no right to vote. In this case, we can only need a few instance arbiter nodes like you said in the top comment.

Then we can also limit the ping-pong traffic between non-voting nodes. For example, non-voting nodes don't need to ping other nodes. They can just reply to pings from other nodes. This makes the voting nodes a team of leaders and the non-voters a team of followers.

This is the last one, actually we are doing the about same thing, i was working on it last year and i have a demo branch in the internal fork, but then got stuck with other things and nerver finish it (haven't tested it widely). The main idea is, since we now have instance abiter, we will let arbiter node to ping other node to exchange the infomation. So that the other nodes does not need to ping other node, something like that, i may not describe it well, but the idea is similar to it.

madolson · 2025-01-22T06:23:31Z

madolson mention this thought in here #634 (comment). (maybe i got she wrong)

I agree with the general idea of adding non-voting notes into quorum. I'm concerned that getting these types of algorithms "correct" is really hard, and we should really use formal methods like TLA+ to verify the correctness. My main concern in the thread that was linked was that we should increase the epoch to change quorum (adding arbiters or other nodes), otherwise you can end up in split brains with divergent opinions about the cluster topology.

When a node wins a failover election, it checks if all nodes in the current quorum support the new voting rules.

For example, there is no easy way to do this since quorum can change in the distributed system while polling everyone. We can introduce a mechanism to get everyone to indicate their status at an epoch and retry on failures.

I was originally going to propose something like all nodes broadcast out "I support other nodes voting" and then a node can nominate for a new list of "additional voters" that include themselves. If a node in the "other voters" is marked as failed, nodes can vote to replace it. Nodes would get a new config for whether or not they are allowed to nominate themselves into the "other voters" list, or maybe can do it by default. If a conflicting failover happens, (we can still only vote for one failover per epoch) then we backoff and retry later.

So that the other nodes does not need to ping other node, something like that, i may not describe it well, but the idea is similar to it.

This seems like a good idea as well to limit the message passed around in the ping/pong messages about state.

hpatro · 2025-01-22T18:17:04Z

This can allow non-voting primaries, voting replicas and voting empty nodes. Modifying these details may be easier than to rewrite the cluster protocol to Raft (#384).

This has been a general concern I have heard from multiple folks at AWS around getting the Raft implementation right.

I'm currently reading a paper around gossip protocol for failure detection https://www.cs.cornell.edu/projects/Quicksilver/public_pdfs/SWIM.pdf which would likely reduce the ping-pong transfer across the clusterbus. Yet to apply the learnings to our implementation. After failure detection, I also think we need a smaller group as quorum compared to our existing quorum of all primaries in a cluster for failover.

hpatro · 2025-01-22T19:09:43Z

@enjoy-binbin With the arbiter mode approach, I have a few questions.

At what scale, do we start benefitting from it?
What's the avg. CPU utilization on the arbiter node with X server nodes?
What would be the mode of deployment for an end user if we build it in open? Do they need to provision additional capacity beyond the nodes to serve traffic and .

enjoy-binbin · 2025-01-23T04:14:57Z

At what scale, do we start benefitting from it?

i think it would benefit in every scale, one shard cluster or some cluster cross multi AZ. I didn't consider the failover or vote time here, this is more about the architectural benefits.

What's the avg. CPU utilization on the arbiter node with X server nodes?
What would be the mode of deployment for an end user if we build it in open? Do they need to provision additional capacity beyond the nodes to serve traffic and .

The arbiter node does not handle user request, artbier does not own any slots, so arbiter only do the gossip thing. The end user, if they want to get high availability of a single-shard cluster, or high availability at the multi-AZ level, they can use arbiter deployment. They don't need anything extra. The arbiter does not need additional capacity to serve traffic (except for gossip overhead). The cpu, i looked at one cluster (redis5 128 shards, 128 replica, 256 arbiters, so 512 nodes in total), the arbiter CPU avg usage is < 1% (while the primary is 30% since it is a in-use cluster), the cluster-timeout is the default, that is 15s.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NEW] Non-voting primaries, voting empty primaries, voting replicas #1600

[NEW] Non-voting primaries, voting empty primaries, voting replicas #1600

zuiderkwast commented Jan 21, 2025

zuiderkwast commented Jan 21, 2025 •

edited

Loading

enjoy-binbin commented Jan 22, 2025 •

edited

Loading

madolson commented Jan 22, 2025 •

edited

Loading

hpatro commented Jan 22, 2025

hpatro commented Jan 22, 2025

enjoy-binbin commented Jan 23, 2025 •

edited

Loading

[NEW] Non-voting primaries, voting empty primaries, voting replicas #1600

[NEW] Non-voting primaries, voting empty primaries, voting replicas #1600

Comments

zuiderkwast commented Jan 21, 2025

zuiderkwast commented Jan 21, 2025 • edited Loading

enjoy-binbin commented Jan 22, 2025 • edited Loading

madolson commented Jan 22, 2025 • edited Loading

hpatro commented Jan 22, 2025

hpatro commented Jan 22, 2025

enjoy-binbin commented Jan 23, 2025 • edited Loading

zuiderkwast commented Jan 21, 2025 •

edited

Loading

enjoy-binbin commented Jan 22, 2025 •

edited

Loading

madolson commented Jan 22, 2025 •

edited

Loading

enjoy-binbin commented Jan 23, 2025 •

edited

Loading