
force new cluster, learner to leader #13213

Closed


yangxuanjia (Contributor)

When the business has cross-datacenter disaster recovery requirements for etcd, we want to build an etcd cluster in one datacenter and run a separate learner node in another datacenter as a cross-datacenter disaster recovery node. When the primary datacenter fails, we can manually force-promote the learner to leader so it keeps serving the business, which preserves high availability of the whole cluster through a datacenter-level failure. The reduced data consistency is acceptable for the business.
But when we tried this approach with etcd 3.5, we found that when the leader and follower nodes are down and the learner is restarted with --force-new-cluster, the following panic occurs:

panic: removed all voters

goroutine 149 [running]:
go.etcd.io/etcd/raft/v3.(*raft).applyConfChange(0xc00014e000, 0x0, 0xc000736350, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
github.com/go.etcd.io/etcd/raft/raft.go:1633 +0x21a
go.etcd.io/etcd/raft/v3.(*node).run(0xc0001a3440)
github.com/go.etcd.io/etcd/raft/node.go:360 +0x856
created by go.etcd.io/etcd/raft/v3.RestartNode
github.com/go.etcd.io/etcd/raft/node.go:244 +0x330

This panic prevents me from forcibly promoting the learner to leader and completing disaster recovery across datacenters.
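
For context, the panic appears to come from raft's rule that a configuration must always keep at least one voter: --force-new-cluster rewrites the membership so that only the local member remains, and if that member is only a learner the resulting configuration has zero voters. Below is a minimal, self-contained Go sketch of that rule (my own illustration of the behavior, not the raft source):

package main

import (
	"errors"
	"fmt"
)

// config models the membership left behind by --force-new-cluster,
// which keeps only the local member and drops everyone else.
type config struct {
	voters   map[uint64]struct{}
	learners map[uint64]struct{}
}

// checkVoters mirrors (roughly) the invariant enforced by raft's
// confchange package: a configuration with zero voters is rejected,
// and applyConfChange panics with the returned error.
func checkVoters(c config) error {
	if len(c.voters) == 0 {
		return errors.New("removed all voters")
	}
	return nil
}

func main() {
	// A follower restarted with --force-new-cluster keeps itself as the
	// only voter, so the check passes and the node can come up alone.
	follower := config{voters: map[uint64]struct{}{1: {}}}
	fmt.Println(checkVoters(follower)) // <nil>

	// A learner restarted the same way survives only as a learner, leaving
	// zero voters, which matches the "removed all voters" panic above.
	learner := config{learners: map[uint64]struct{}{4: {}}}
	fmt.Println(checkVoters(learner)) // removed all voters
}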

yangxuanjia (Contributor, Author)

@ptabor

lilic (Contributor) left a comment

Thanks for the PR! Do you mind explaining what you meant by the following:

when the leader and follower nodes are down and the learner is restarted with --force-new-cluster, the following panic occurs

What are some of the downsides of adding this?

@@ -646,5 +643,50 @@ func createConfigChangeEnts(lg *zap.Logger, ids []uint64, self uint64, term, ind
next++
}

promoteNodeFunc := func(id uint64) {
This seems like a big change to me, I would expect some tests at least for this.
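
For readers following along, here is a rough standalone sketch of the idea behind the promoteNodeFunc closure shown in the diff: append one more config-change entry that turns the surviving learner back into a voter, so the forced single-member cluster is not left with zero voters. The function name and package below are hypothetical; only the raftpb and zap APIs are real, and in raft applying ConfChangeAddNode to an existing learner is what promotes it to a voting member.

package illustration // hypothetical package, for illustration only

import (
	"go.etcd.io/etcd/raft/v3/raftpb"
	"go.uber.org/zap"
)

// promoteLearnerEntry builds a config-change log entry that promotes the
// local learner (id) to a voter at the given term/index. This mirrors how
// createConfigChangeEnts builds ConfChangeRemoveNode entries for the other
// members, but it is only a sketch of the approach, not the PR's actual diff.
func promoteLearnerEntry(lg *zap.Logger, id, term, index uint64) raftpb.Entry {
	cc := raftpb.ConfChange{
		Type:   raftpb.ConfChangeAddNode, // adding an existing learner promotes it
		NodeID: id,
	}
	d, err := cc.Marshal()
	if err != nil {
		lg.Panic("marshal ConfChange should never fail", zap.Error(err))
	}
	return raftpb.Entry{
		Type:  raftpb.EntryConfChange,
		Data:  d,
		Term:  term,
		Index: index,
	}
}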

yangxuanjia (Contributor, Author) commented Jul 22, 2021

Thanks for the PR! Do you mind explaining what you meant by the following:

when the leader and follower nodes are down and the learner is restarted with --force-new-cluster, the following panic occurs

What are some of the downsides of adding this?

I want to promote the learner to leader when all the other nodes are down, so I restart the learner with --force-new-cluster. I expected the node's role to change from learner to leader and the node to keep running, but when I restart it I get a panic:

panic: removed all voters

But a follower can be restarted with this flag without any problem, so I think this is a bug.

yangxuanjia (Contributor, Author)

I have 4 shell scripts; maybe they can help you test it.

1. start_cluster.sh

#!/bin/bash 

mkdir -p default.etcd/data
mkdir -p default.etcd/log

./bin/etcd --name infra0 --initial-advertise-peer-urls http://127.0.0.1:23800 \
  --listen-peer-urls http://127.0.0.1:23800 \
  --listen-client-urls http://127.0.0.1:23790 \
  --advertise-client-urls http://127.0.0.1:23790 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra1=http://127.0.0.1:23801,infra2=http://127.0.0.1:23802 \
  --data-dir 'default.etcd/data/node1' \
  --log-outputs 'default.etcd/log/node1.log' \
  --initial-cluster-state new &
 
./bin/etcd --name infra1 --initial-advertise-peer-urls http://127.0.0.1:23801 \
  --listen-peer-urls http://127.0.0.1:23801 \
  --listen-client-urls http://127.0.0.1:23791 \
  --advertise-client-urls http://127.0.0.1:23791 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra1=http://127.0.0.1:23801,infra2=http://127.0.0.1:23802 \
  --data-dir 'default.etcd/data/node2/' \
  --log-outputs 'default.etcd/log/node2.log' \
  --initial-cluster-state new &
 
./bin/etcd --name infra2 --initial-advertise-peer-urls http://127.0.0.1:23802 \
  --listen-peer-urls http://127.0.0.1:23802 \
  --listen-client-urls http://127.0.0.1:23792 \
  --advertise-client-urls http://127.0.0.1:23792 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra1=http://127.0.0.1:23801,infra2=http://127.0.0.1:23802 \
  --data-dir 'default.etcd/data/node3/' \
  --log-outputs 'default.etcd/log/node3.log' \
  --initial-cluster-state new &

2. add_learner.sh

#!/bin/bash

./bin/etcdctl --endpoints http://localhost:23790 member add infra100 --learner=true --peer-urls="http://127.0.0.1:23803"

./bin/etcd --name infra100 --initial-advertise-peer-urls http://127.0.0.1:23803 \
  --listen-peer-urls http://127.0.0.1:23803 \
  --listen-client-urls http://127.0.0.1:23793 \
  --advertise-client-urls http://127.0.0.1:23793 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra1=http://127.0.0.1:23801,infra2=http://127.0.0.1:23802,infra100=http://127.0.0.1:23803 \
  --data-dir 'default.etcd/data/node4/' \
  --log-outputs 'default.etcd/log/node4.log' \
  --initial-cluster-state existing &

3. learner_force_new_cluster.sh

#!/bin/bash

./bin/etcd --name infra100 --initial-advertise-peer-urls http://127.0.0.1:23803 \
  --listen-peer-urls http://127.0.0.1:23803 \
  --listen-client-urls http://127.0.0.1:23793 \
  --advertise-client-urls http://127.0.0.1:23793 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster learner1=http://127.0.0.1:23803 \
  --data-dir 'default.etcd/data/node4/' \
  --log-outputs 'default.etcd/log/node4.log' \
  --initial-cluster-state existing \
  --force-new-cluster &

4. add_other.sh

#!/bin/bash 

rm -rf default.etcd/data/node1
rm -rf default.etcd/log/node1.log
rm -rf default.etcd/data/node2
rm -rf default.etcd/log/node2.log
rm -rf default.etcd/data/node3
rm -rf default.etcd/log/node3.log

./bin/etcdctl --endpoints http://localhost:23793 member add infra0 --peer-urls="http://127.0.0.1:23800"
./bin/etcd --name infra0 --initial-advertise-peer-urls http://127.0.0.1:23800 \
  --listen-peer-urls http://127.0.0.1:23800 \
  --listen-client-urls http://127.0.0.1:23790 \
  --advertise-client-urls http://127.0.0.1:23790 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra100=http://127.0.0.1:23803 \
  --data-dir 'default.etcd/data/node1' \
  --log-outputs 'default.etcd/log/node1.log' \
  --initial-cluster-state existing &

./bin/etcdctl --endpoints http://localhost:23793 member add infra1 --peer-urls="http://127.0.0.1:23801"
./bin/etcd --name infra1 --initial-advertise-peer-urls http://127.0.0.1:23801 \
  --listen-peer-urls http://127.0.0.1:23801 \
  --listen-client-urls http://127.0.0.1:23791 \
  --advertise-client-urls http://127.0.0.1:23791 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra1=http://127.0.0.1:23801,infra100=http://127.0.0.1:23803 \
  --data-dir 'default.etcd/data/node2/' \
  --log-outputs 'default.etcd/log/node2.log' \
  --initial-cluster-state existing &


stale bot commented Jan 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jan 12, 2022
stale bot closed this Feb 3, 2022