PartitionSelectorException in Lettuce Client Triggered by Addition of New Slave Nodes in Redis Cluster #2769
Comments
For your case, I suggest subclassing RedisClusterClient and overriding determinePartitions(…).
Thank you for your response. I understand that PartitionsConsensus is primarily aimed at avoiding split-brain scenarios. However, adding slave nodes is a very common operation for Redis clusters, and a PartitionSelectorException can occur during this process, even though the chance is small. (This happened twice at my company while we expanded our Redis cluster from a 1 master & 1 slave setup to a 1 master & 2 slaves configuration; one of our application servers kept throwing PartitionSelectorException for 15 seconds, which corresponds to our refresh period.) Given this, I believe enhancing Lettuce to better handle the addition of nodes during topology refreshes could benefit many users. Since PartitionsConsensus is aimed at preventing partition splits, we could add some filter logic in determinePartitions, something like this:

    protected Partitions determinePartitions(Partitions current, Map<RedisURI, Partitions> topologyViews) {

        // Filter out invalid topology views in which a master node reports 0 slots.
        Map<RedisURI, Partitions> filteredTopologyViews = new HashMap<>();
        for (Map.Entry<RedisURI, Partitions> entry : topologyViews.entrySet()) {
            Partitions partitions = entry.getValue();
            boolean isValid = true;
            for (RedisClusterNode node : partitions) {
                if (node.is(RedisClusterNode.NodeFlag.UPSTREAM) && node.getSlots().isEmpty()) {
                    isValid = false;
                    break;
                }
            }
            if (isValid) {
                filteredTopologyViews.put(entry.getKey(), partitions);
            }
        }

        // PartitionsConsensus logic follows, operating on filteredTopologyViews...
    }

In my testing, after implementing this modification, the PartitionSelectorException never occurred again, even when the topology was refreshed in the middle of adding nodes. I hope this suggestion is helpful, and I look forward to your thoughts.
A node can be valid even if it doesn't hold any slots, e.g. for Pub/Sub usage. I wondered whether it would make sense to make … Alternatively, you can enable adaptive topology refresh. Before the exception is thrown, the adaptive trigger …
I see.
It would be better to prevent the client from picking up incorrect cluster-nodes information from the newly added node in the first place; moreover, that incorrect view could even end up as the final determined partition.
We have enabled adaptive topology refresh by calling enableAllAdaptiveRefreshTriggers(), but it did not seem to help when the PartitionSelectorException was triggered by a periodic refresh. I'm not sure whether that is because they share the same timeout setting (we set both to 15 s).
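For reference, here is a minimal sketch of how the periodic and adaptive refresh settings discussed above might be configured. The 15-second values and the node address 192.168.21.32:6479 are taken from this report, and the class name RefreshConfigExample is made up for illustration:

    import java.time.Duration;

    import io.lettuce.core.RedisURI;
    import io.lettuce.core.cluster.ClusterClientOptions;
    import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
    import io.lettuce.core.cluster.RedisClusterClient;

    public class RefreshConfigExample {

        public static void main(String[] args) {

            // Seed node address taken from this report; adjust for your cluster.
            RedisClusterClient client = RedisClusterClient.create(RedisURI.create("redis://192.168.21.32:6479"));

            ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                    .enablePeriodicRefresh(Duration.ofSeconds(15))          // periodic refresh every 15 s, as described above
                    .enableAllAdaptiveRefreshTriggers()                     // MOVED/ASK redirects, persistent reconnects, uncovered slot, unknown node
                    .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(15)) // rate limit between adaptive refreshes
                    .build();

            client.setOptions(ClusterClientOptions.builder()
                    .topologyRefreshOptions(refreshOptions)
                    .build());

            // ... use the client, then client.shutdown() when done.
        }
    }

As far as I can tell, adaptiveRefreshTriggersTimeout only rate-limits how often adaptive triggers may start a refresh; it is independent of the periodic refresh interval, even when both are set to 15 seconds.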
Bug Report
Current Behavior
A PartitionSelectorException occurs under specific conditions while a Redis cluster is adding a new slave node. If the Lettuce client's periodic topology refresh coincides with the addition of the new slave node, there is a small chance of encountering io.lettuce.core.cluster.PartitionSelectorException: Cannot determine a partition to ...
Stack trace
Input Code
Expected behavior/code
The expected behavior is for the Lettuce client to handle the topology refresh seamlessly without throwing a PartitionSelectorException, even when a new slave node is being added to the Redis cluster.
Environment
Possible Solution
The proposed solution involves modifying the KnownMajority strategy within PartitionsConsensus. The adjustment accounts for master nodes with 0 assigned slots, which appear transiently while a new node is being added to the cluster. This change ensures that topology views containing such nodes are not voted for during topology consensus, preventing the PartitionSelectorException; an application-level sketch of the same idea follows below.
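As a workaround that does not require changing Lettuce itself, the filter shown in the comment above could be wired into a client subclass. This is only a sketch under a few assumptions: FilteringRedisClusterClient is a hypothetical name, and the protected constructor and the protected determinePartitions hook are assumed to be available as in the snippet earlier in this thread (they may differ across Lettuce versions):

    import java.util.HashMap;
    import java.util.Map;

    import io.lettuce.core.RedisURI;
    import io.lettuce.core.cluster.RedisClusterClient;
    import io.lettuce.core.cluster.models.partitions.Partitions;
    import io.lettuce.core.cluster.models.partitions.RedisClusterNode;
    import io.lettuce.core.resource.ClientResources;

    public class FilteringRedisClusterClient extends RedisClusterClient {

        protected FilteringRedisClusterClient(ClientResources clientResources, Iterable<RedisURI> redisURIs) {
            super(clientResources, redisURIs);
        }

        @Override
        protected Partitions determinePartitions(Partitions current, Map<RedisURI, Partitions> topologyViews) {

            // Drop topology views in which any master reports an empty slot set
            // (the transient state of a node that was just added to the cluster).
            Map<RedisURI, Partitions> filtered = new HashMap<>();
            for (Map.Entry<RedisURI, Partitions> entry : topologyViews.entrySet()) {
                boolean hasEmptyMaster = false;
                for (RedisClusterNode node : entry.getValue()) {
                    if (node.is(RedisClusterNode.NodeFlag.UPSTREAM) && node.getSlots().isEmpty()) {
                        hasEmptyMaster = true;
                        break;
                    }
                }
                if (!hasEmptyMaster) {
                    filtered.put(entry.getKey(), entry.getValue());
                }
            }

            // Fall back to the unfiltered views if every view was discarded,
            // then let the default consensus pick the partition set.
            return super.determinePartitions(current, filtered.isEmpty() ? topologyViews : filtered);
        }
    }

Note the trade-off already raised in this thread: a master with no slots can be legitimate (e.g. a Pub/Sub-only node), so discarding such views is only safe for clusters where every master is expected to serve slots.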
Additional context
The issue arises from the two-step process of adding a slave node to a Redis cluster: the node first joins as a master and only afterwards changes its role to slave. If a topology refresh runs during this window and connects to the newly added node, the returned cluster-nodes information may be incorrect, and a master node can be reported with 0 assigned slots, triggering the PartitionSelectorException. The proposed change to the KnownMajority logic prevents this by adjusting the voting mechanism to disregard these transient states.
The result shown in the red box (screenshot omitted here) causes the slot set of 192.168.21.32:6479 (master) in the topology obtained by Lettuce to be empty (slots size = 0).
How to reproduce the error
The a.sh script removes a node from the cluster, stops the Redis server, cleans up its data, and then re-adds the node as a slave of a specific master.
The cron.sh script repeatedly executes a.sh to simulate the node-addition process multiple times, introducing potential timing conflicts with the topology refresh.
My two shell scripts:
a.sh
cron.sh