You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I had searched in the feature and found no similar feature requirement.
Description
Now the engine is implemented by hazelcast, and hazelcast has a split-brain problem when the network is partitioned. How our engine solves this problem.
For example, there is currently a cluster of 5 nodes.
At this time, due to a network failure, the network is partitioned. Network disconnection between nodes 1 2 and nodes 3 4 5,this creates two clusters resulting in a split brain.
This will cause the same task to be started in both clusters, which will be problematic
Two solutions that come to mind:
1、Use the minimum number of running nodes
The cluster will have a minimum number of running nodes, which is the total number of nodes/2+1. Only the number of cluster nodes is greater than this number to run. This ensures that at most one cluster will function properly in the event of a network partition.
A disadvantage of this is that if the number of running nodes is less than 1/2 of the total number, it cannot continue to run. That is to say, if there are 5 nodes, when more than 2 nodes fail, the entire cluster cannot continue to run. And what we want is to work even if only one node is left.
2、Use an external file system
We will use an external file system to store checkpoint data in cluster mode,So we can use an external filesystem。When a network partition occurs, multiple clusters will be generated,At this time, each cluster reports its own number of nodes to the file system. The file system selects the cluster with the largest number of nodes to keep it running,and notify other clusters to stop running. This also ensures that only one of the largest sub-clusters is running in the event of a network partition.
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.
Search before asking
Description
Now the engine is implemented by hazelcast, and hazelcast has a split-brain problem when the network is partitioned. How our engine solves this problem.
For example, there is currently a cluster of 5 nodes.

At this time, due to a network failure, the network is partitioned. Network disconnection between nodes 1 2 and nodes 3 4 5,this creates two clusters resulting in a split brain.
This will cause the same task to be started in both clusters, which will be problematic
Two solutions that come to mind:
1、Use the minimum number of running nodes
The cluster will have a minimum number of running nodes, which is the total number of nodes/2+1. Only the number of cluster nodes is greater than this number to run. This ensures that at most one cluster will function properly in the event of a network partition.
A disadvantage of this is that if the number of running nodes is less than 1/2 of the total number, it cannot continue to run. That is to say, if there are 5 nodes, when more than 2 nodes fail, the entire cluster cannot continue to run. And what we want is to work even if only one node is left.
2、Use an external file system
We will use an external file system to store checkpoint data in cluster mode,So we can use an external filesystem。When a network partition occurs, multiple clusters will be generated,At this time, each cluster reports its own number of nodes to the file system. The file system selects the cluster with the largest number of nodes to keep it running,and notify other clusters to stop running. This also ensures that only one of the largest sub-clusters is running in the event of a network partition.
Usage Scenario
No response
Related issues
No response
Are you willing to submit a PR?
Code of Conduct
The text was updated successfully, but these errors were encountered: