2nd etcd node not communicating with 1st etcd node: cluster ID mismatch error #13453

Closed
prakashmirji opened this issue Oct 29, 2021 · 8 comments
@prakashmirji

Etcd version: 3.5.0
Platform: SLES 15 SP2
Deployed as systemd

configuration:

systemctl cat etcd
# /etc/systemd/system/etcd.service
[Unit]
Description=etcd
Documentation=https://github.com/coreos
[Service]
Restart=on-failure
RestartSec=5
LimitNOFILE=40000
TimeoutStartSec=0
EnvironmentFile=/opt/ezkube/bootstrap/systemd/10-etcd.env
ExecStart=/usr/bin/etcd \
  --advertise-client-urls=https://${INTERNAL_IP}:${ETCD_PORT} \
  --cert-file=/etc/kubernetes/pki/etcd/server.crt \
  --client-cert-auth=true \
  --data-dir=/var/lib/etcd \
  --initial-advertise-peer-urls=https://${INTERNAL_IP}:${ETCD_PEER_PORT} \
  --initial-cluster=${INITIAL_CLUSTER} \
  --key-file=/etc/kubernetes/pki/etcd/server.key \
  --listen-client-urls=https://127.0.0.1:${ETCD_PORT},https://${INTERNAL_IP}:${ETCD_PORT} \
  --listen-peer-urls=https://${INTERNAL_IP}:${ETCD_PEER_PORT} \
  --name=${NAME} \
  --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt \
  --peer-client-cert-auth=true \
  --peer-key-file=/etc/kubernetes/pki/etcd/peer.key \
  --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt \
  --snapshot-count=${SNAPSHOT_COUNT} \
  --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

[Install]
WantedBy=multi-user.target

etcd log

We see a lot of these log messages:

-- Logs begin at Wed 2021-10-27 07:20:37 PDT, end at Fri 2021-10-29 04:30:52 PDT. --
Oct 28 23:14:28 etcdnod1.net etcd[26374]: {"level":"warn","ts":"2021-10-28T23:14:28.128-0700","caller":"rafthttp/stream.go:653","msg":"request sent was ignored by remote peer due to cluster ID mismatch","remote-peer-id":"abc63b82495af4b1","remote-peer-cluster-id":"8c300cb900906703","local-member-id":"840c5a8fcf5a4b8e","local-member-cluster-id":"6599178285423ae9","error":"cluster ID mismatch"}
Oct 28 23:14:28 etcdnod1.net etcd[26374]: {"level":"warn","ts":"2021-10-28T23:14:28.231-0700","caller":"rafthttp/stream.go:653","msg":"request sent was ignored by remote peer due to cluster ID mismatch","remote-peer-id":"abc63b82495af4b1","remote-peer-cluster-id":"8c300cb900906703","local-member-id":"840c5a8fcf5a4b8e","local-member-cluster-id":"6599178285423ae9","error":"cluster ID mismatch"}

Output of systemctl status etcd:

● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2021-10-29 02:27:12 PDT; 2h 8min ago
     Docs: https://github.com/coreos
 Main PID: 29931 (etcd)
    Tasks: 7
   CGroup: /system.slice/etcd.service
           └─29931 /usr/bin/etcd --advertise-client-urls=https://16.0.14.118:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt --client-cert-auth=true --data-dir=/var/lib/etcd --initial-advertise-peer-urls=https://16.0.14.118:2380 --initial-cluster=mip-bd-vm659.mip.storage.hpecorp.net=https://16.0.14.117:2380,mip-bd-vm660.mip.storage.hpecorp.net=https://16.0.14.118:2380 --key-file=/etc/kubernetes/pki/etcd/server.key --listen-client-urls=https://127.0.0.1:2379,https://16.0.14.118:2379 --listen-peer-urls=https://16.0.14.118:2380 --name=mip-bd-vm660.mip.storage.hpecorp.net --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt --peer-client-cert-auth=true --peer-key-file=/etc/kubernetes/pki/etcd/peer.key --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt --snapshot-count=10000 --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

Oct 29 04:35:31 mip-bd-vm660.mip.storage.hpecorp.net etcd[29931]: {"level":"error","ts":"2021-10-29T04:35:31.382-0700","caller":"rafthttp/util.go:99","msg":"request sent was ignored due to cluster ID mismatch","remote-peer-id":"2cb93120384b98dc","remote-peer-cluster-id":"8c300cb900906703","local-member-cluster-id":"633c49ff49c16784","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.checkPostResponse\n\t/home/prow/go/src/github.hpe.com/hpe/ezkube/projects/etcd/etcd/server/etcdserver/api/rafthttp/util.go:99\ngo.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*pipeline).post\n\t/home/prow/go/src/github.hpe.com/hpe/ezkube/projects/etcd/etcd/server/etcdserver/api/rafthttp/pipeline.go:163\ngo.etcd.io/etcd/server/v3/etcdserver/api/rafthttp.(*pipeline).handle\n\t/home/prow/go/src/github.hpe.com/hpe/ezkube/projects/etcd/etcd/server/etcdserver/api/rafthttp/pipeline.go:100"}
Oct 29 04:35:31 mip-bd-vm660.mip.storage.hpecorp.net etcd[29931]: {"level":"warn","ts":"2021-10-29T04:35:31.430-0700","caller":"rafthttp/stream.go:653","msg":"request sent was ignored by remote peer due to cluster ID mismatch","remote-peer-id":"2cb93120384b98dc","remote

We are planning to set up a second etcd node and join it to the existing first etcd node. Our use case is to expand the etcd cluster. Are there any pointers on how to handle the above errors?

@ahrtr
Member

ahrtr commented Nov 5, 2021

I see that lots of people have asked this question, so it's worthwhile to provide a summary.

Firstly, you need to understand how the cluster ID is generated. The workflow is depicted in the diagram.
[Image: diagram of the cluster ID generation workflow]

Secondly, once you understand the workflow in the diagram, the key point is the flag "--initial-cluster-state". If the member already has local data, the value of the flag does not matter. If there is no local data, as with a brand-new member, it does matter. When joining a new member to an existing cluster, you should usually set "--initial-cluster-state existing".
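
The rule described here can be sketched in shell. This is a minimal sketch under stated assumptions, not etcd's own logic: the function names `has_local_data`, `initial_cluster_state`, and `register_member` are hypothetical, and the check for `member/wal` under the data directory assumes etcd's default on-disk layout. `etcdctl member add` with `--peer-urls` is the etcdctl v3 command for registering a new member. TLS flags are omitted and would be needed for the setup in this issue.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical and not part of etcd.

# has_local_data DATA_DIR
# Assumption: etcd stores its write-ahead log under DATA_DIR/member/wal.
# If that directory exists, the member already carries a cluster ID.
has_local_data() {
  [ -d "$1/member/wal" ]
}

# initial_cluster_state DATA_DIR JOINING
# Prints the value to pass to --initial-cluster-state on stdout.
# JOINING is "yes" when the node is being added to a running cluster.
initial_cluster_state() {
  if has_local_data "$1"; then
    # With local data present, the flag value no longer decides the cluster ID.
    echo "note: $1/member/wal exists; the stored cluster ID will be reused" >&2
  fi
  if [ "$2" = "yes" ]; then
    echo "existing"
  else
    echo "new"
  fi
}

# register_member ENDPOINT NAME PEER_URL
# Run against a healthy member BEFORE starting etcd on the new node.
# Defined here but not called, so sourcing this file does not touch a cluster.
register_member() {
  etcdctl --endpoints="$1" member add "$2" --peer-urls="$3"
}

# Example usage (values taken from the report above; the client port of the
# first node is an assumption):
#   register_member https://16.0.14.117:2379 \
#     mip-bd-vm660.mip.storage.hpecorp.net https://16.0.14.118:2380
#   state=$(initial_cluster_state /var/lib/etcd yes)
#   # then start etcd on the new node with --initial-cluster-state="$state"
```

A node that has already started once with the wrong settings keeps its cluster ID in the data directory, so by the rule above, changing the flag alone would presumably not help until that directory is cleared.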

@stale

stale bot commented Feb 6, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 6, 2022
@serathius
Member

@ahrtr do you think it would be worth documenting --initial-cluster-state so it will be easier to understand for users?

@stale stale bot removed the stale label Feb 8, 2022
@ahrtr
Member

ahrtr commented Feb 8, 2022

> @ahrtr do you think it would be worth documenting --initial-cluster-state so it will be easier to understand for users?

Yes, it makes sense, but I am not sure where the best place to document this is. Perhaps I can write a blog post? There is already a FAQ item, What does the etcd warning “request ignored (cluster ID mismatch)” mean?, so we could add the blog post link to that FAQ item. What do you think?

@hmilkovi

I also have the same issue with Docker when I try to bootstrap a new cluster:

{"level":"warn","ts":"2022-02-15T14:43:59.005Z","caller":"rafthttp/stream.go:653","msg":"request sent was ignored by remote peer due to cluster ID mismatch","remote-peer-id":"2c506e52bc451d34","remote-peer-cluster-id":"dba92b3a17cfe072","local-member-id":"3e2342fa21204127","local-member-cluster-id":"905641018400954b","error":"cluster ID mismatch"}
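
When comparing warnings like this one across nodes, it can help to pull out just the two cluster IDs. The following is a rough sketch, assuming the JSON log format shown in this thread, where `remote-peer-cluster-id` appears before `local-member-cluster-id` on the same line. The function name `cluster_ids` is hypothetical.

```shell
#!/bin/sh
# cluster_ids: read etcd log lines on stdin and print one
# "remote=<id> local=<id>" line per cluster ID mismatch message.
# Assumes both fields are on one line, in the order seen in this thread.
cluster_ids() {
  grep 'cluster ID mismatch' |
    sed -n 's/.*"remote-peer-cluster-id":"\([0-9a-f]*\)".*"local-member-cluster-id":"\([0-9a-f]*\)".*/remote=\1 local=\2/p'
}

# Example usage on a systemd host:
#   journalctl -u etcd | cluster_ids | sort | uniq -c
```

If every line reports the same pair of IDs, the two nodes have most likely each bootstrapped their own cluster rather than one joining the other.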

@ahrtr
Member

ahrtr commented Feb 18, 2022

@hmilkovi Please provide detailed steps to reproduce.

@ahrtr
Member

ahrtr commented Mar 11, 2022

@stale

stale bot commented Jun 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 12, 2022
@stale stale bot closed this as completed Jul 10, 2022