
force new cluster, learner to leader #13213

Closed


yangxuanjia (Contributor)

When the business has cross-datacenter disaster recovery requirements for etcd, we want to build an etcd cluster in one datacenter and run a separate learner node in another datacenter as a cross-datacenter disaster recovery node. When the primary datacenter fails, we can manually force-promote the learner to leader so it keeps serving the business, which preserves high availability of the whole cluster through a datacenter-level failure. The reduced data consistency is acceptable for the business.
But when we tried this approach with etcd 3.5, we found that when the leader and follower nodes are down and the learner is restarted with --force-new-cluster, the following panic occurs:

panic: removed all voters

goroutine 149 [running]:
go.etcd.io/etcd/raft/v3.(*raft).applyConfChange(0xc00014e000, 0x0, 0xc000736350, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
github.com/go.etcd.io/etcd/raft/raft.go:1633 +0x21a
go.etcd.io/etcd/raft/v3.(*node).run(0xc0001a3440)
github.com/go.etcd.io/etcd/raft/node.go:360 +0x856
created by go.etcd.io/etcd/raft/v3.RestartNode
github.com/go.etcd.io/etcd/raft/node.go:244 +0x330

This panic prevents me from forcibly promoting the learner to leader and completing disaster recovery across datacenters.
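
For context, the panic appears to come from raft's rule that a configuration must always keep at least one voter: --force-new-cluster rewrites the membership so that only the local member remains, and if that member is only a learner the resulting configuration has zero voters. Below is a minimal, self-contained Go sketch of that rule (my own illustration of the behavior, not the raft source):

package main

import (
	"errors"
	"fmt"
)

// config models the membership left behind by --force-new-cluster,
// which keeps only the local member and drops everyone else.
type config struct {
	voters   map[uint64]struct{}
	learners map[uint64]struct{}
}

// checkVoters mirrors (roughly) the invariant enforced by raft's
// confchange package: a configuration with zero voters is rejected,
// and applyConfChange panics with the returned error.
func checkVoters(c config) error {
	if len(c.voters) == 0 {
		return errors.New("removed all voters")
	}
	return nil
}

func main() {
	// A follower restarted with --force-new-cluster keeps itself as the
	// only voter, so the check passes and the node can come up alone.
	follower := config{voters: map[uint64]struct{}{1: {}}}
	fmt.Println(checkVoters(follower)) // <nil>

	// A learner restarted the same way survives only as a learner, leaving
	// zero voters, which matches the "removed all voters" panic above.
	learner := config{learners: map[uint64]struct{}{4: {}}}
	fmt.Println(checkVoters(learner)) // removed all voters
}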

yangxuanjia (Contributor, Author)

@ptabor

lilic (Contributor) left a comment

Thanks for the PR! Do you mind explaining what you meant by the following:

when the leader and follower nodes are down and the learner is restarted with --force-new-cluster, the following panic occurs

What are some of the downsides of adding this?

@@ -646,5 +643,50 @@ func createConfigChangeEnts(lg *zap.Logger, ids []uint64, self uint64, term, ind
next++
}

promoteNodeFunc := func(id uint64) {
This seems like a big change to me, I would expect some tests at least for this.
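
For readers following along, here is a rough standalone sketch of the idea behind the promoteNodeFunc closure shown in the diff: append one more config-change entry that turns the surviving learner back into a voter, so the forced single-member cluster is not left with zero voters. The function name and package below are hypothetical; only the raftpb and zap APIs are real, and in raft applying ConfChangeAddNode to an existing learner is what promotes it to a voting member.

package illustration // hypothetical package, for illustration only

import (
	"go.etcd.io/etcd/raft/v3/raftpb"
	"go.uber.org/zap"
)

// promoteLearnerEntry builds a config-change log entry that promotes the
// local learner (id) to a voter at the given term/index. This mirrors how
// createConfigChangeEnts builds ConfChangeRemoveNode entries for the other
// members, but it is only a sketch of the approach, not the PR's actual diff.
func promoteLearnerEntry(lg *zap.Logger, id, term, index uint64) raftpb.Entry {
	cc := raftpb.ConfChange{
		Type:   raftpb.ConfChangeAddNode, // adding an existing learner promotes it
		NodeID: id,
	}
	d, err := cc.Marshal()
	if err != nil {
		lg.Panic("marshal ConfChange should never fail", zap.Error(err))
	}
	return raftpb.Entry{
		Type:  raftpb.EntryConfChange,
		Data:  d,
		Term:  term,
		Index: index,
	}
}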

yangxuanjia (Contributor, Author) commented Jul 22, 2021

Thanks for the PR! Do you mind explaining what you meant by the following:

when the leader and follower nodes are down and the learner is restarted with --force-new-cluster, the following panic occurs

What are some of the downsides of adding this?

I want to promote the learner to leader when all the other nodes are down, so I restart the learner with --force-new-cluster. I expected the node's role to change from learner to leader and the node to keep running, but when I restart it I get a panic:

panic: removed all voters

But a follower can be restarted with this flag without any problem, so I think this is a bug.

yangxuanjia (Contributor, Author)

I have 4 shell scripts; maybe they can help you test it.

1. start_cluster.sh

#!/bin/bash 

mkdir -p default.etcd/data
mkdir -p default.etcd/log

./bin/etcd --name infra0 --initial-advertise-peer-urls http://127.0.0.1:23800 \
  --listen-peer-urls http://127.0.0.1:23800 \
  --listen-client-urls http://127.0.0.1:23790 \
  --advertise-client-urls http://127.0.0.1:23790 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra1=http://127.0.0.1:23801,infra2=http://127.0.0.1:23802 \
  --data-dir 'default.etcd/data/node1' \
  --log-outputs 'default.etcd/log/node1.log' \
  --initial-cluster-state new &
 
./bin/etcd --name infra1 --initial-advertise-peer-urls http://127.0.0.1:23801 \
  --listen-peer-urls http://127.0.0.1:23801 \
  --listen-client-urls http://127.0.0.1:23791 \
  --advertise-client-urls http://127.0.0.1:23791 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra1=http://127.0.0.1:23801,infra2=http://127.0.0.1:23802 \
  --data-dir 'default.etcd/data/node2/' \
  --log-outputs 'default.etcd/log/node2.log' \
  --initial-cluster-state new &
 
./bin/etcd --name infra2 --initial-advertise-peer-urls http://127.0.0.1:23802 \
  --listen-peer-urls http://127.0.0.1:23802 \
  --listen-client-urls http://127.0.0.1:23792 \
  --advertise-client-urls http://127.0.0.1:23792 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra1=http://127.0.0.1:23801,infra2=http://127.0.0.1:23802 \
  --data-dir 'default.etcd/data/node3/' \
  --log-outputs 'default.etcd/log/node3.log' \
  --initial-cluster-state new &

2. add_learner.sh

#!/bin/bash

./bin/etcdctl --endpoints http://localhost:23790 member add infra100 --learner=true --peer-urls="http://127.0.0.1:23803"

./bin/etcd --name infra100 --initial-advertise-peer-urls http://127.0.0.1:23803 \
  --listen-peer-urls http://127.0.0.1:23803 \
  --listen-client-urls http://127.0.0.1:23793 \
  --advertise-client-urls http://127.0.0.1:23793 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra1=http://127.0.0.1:23801,infra2=http://127.0.0.1:23802,infra100=http://127.0.0.1:23803 \
  --data-dir 'default.etcd/data/node4/' \
  --log-outputs 'default.etcd/log/node4.log' \
  --initial-cluster-state existing &

3. learner_force_new_cluster.sh

#!/bin/bash

./bin/etcd --name infra100 --initial-advertise-peer-urls http://127.0.0.1:23803 \
  --listen-peer-urls http://127.0.0.1:23803 \
  --listen-client-urls http://127.0.0.1:23793 \
  --advertise-client-urls http://127.0.0.1:23793 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster learner1=http://127.0.0.1:23803 \
  --data-dir 'default.etcd/data/node4/' \
  --log-outputs 'default.etcd/log/node4.log' \
  --initial-cluster-state existing \
  --force-new-cluster &

4. add_other.sh

#!/bin/bash 

rm -rf default.etcd/data/node1
rm -rf default.etcd/log/node1.log
rm -rf default.etcd/data/node2
rm -rf default.etcd/log/node2.log
rm -rf default.etcd/data/node3
rm -rf default.etcd/log/node3.log

./bin/etcdctl --endpoints http://localhost:23793 member add infra0 --peer-urls="http://127.0.0.1:23800"
./bin/etcd --name infra0 --initial-advertise-peer-urls http://127.0.0.1:23800 \
  --listen-peer-urls http://127.0.0.1:23800 \
  --listen-client-urls http://127.0.0.1:23790 \
  --advertise-client-urls http://127.0.0.1:23790 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra100=http://127.0.0.1:23803 \
  --data-dir 'default.etcd/data/node1' \
  --log-outputs 'default.etcd/log/node1.log' \
  --initial-cluster-state existing &

./bin/etcdctl --endpoints http://localhost:23793 member add infra1 --peer-urls="http://127.0.0.1:23801"
./bin/etcd --name infra1 --initial-advertise-peer-urls http://127.0.0.1:23801 \
  --listen-peer-urls http://127.0.0.1:23801 \
  --listen-client-urls http://127.0.0.1:23791 \
  --advertise-client-urls http://127.0.0.1:23791 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra0=http://127.0.0.1:23800,infra1=http://127.0.0.1:23801,infra100=http://127.0.0.1:23803 \
  --data-dir 'default.etcd/data/node2/' \
  --log-outputs 'default.etcd/log/node2.log' \
  --initial-cluster-state existing &


stale bot commented Jan 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jan 12, 2022
stale bot closed this Feb 3, 2022