
dropped internal Raft message since sending buffer is full (overloaded network) #19635

Open
4 tasks done
rahulbapumore opened this issue Mar 21, 2025 · 2 comments
Bug report criteria

What happened?

logs.txt

We are seeing the error messages below, and etcd is restarting because it could not maintain quorum.
2025-03-18T12:21:38.350+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:38.451+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:38.551+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:38.651+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:38.751+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:38.850+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:38.951+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:38.972+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:39.041+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:39.151+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:39.251+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:39.351+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:39.451+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:39.550+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:39.850+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:40.275+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:40.277+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:40.351+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:40.651+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:40.751+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:40.851+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:40.950+01:00 dropped internal Raft message since sending buffer is full (overloaded network)
2025-03-18T12:21:41.050+01:00 dropped internal Raft message since sending buffer is full (overloaded network)

From the etcd documentation, we found that this happens because too many client requests cause network congestion, delaying peer communication.
https://etcd.io/docs/v3.5/tuning/

The documentation gives a few manual steps to set traffic priority (see the sketch below), but we need an internal solution or workaround, for example a parameter that, when set, would prevent these etcd restarts.
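
For reference, the manual traffic-priority workaround from the tuning guide looks roughly like the sketch below. This is only a sketch: it assumes the peer-facing interface is eth0 and the default etcd ports 2380 (peer) and 2379 (client); adjust for the actual environment.

# Create a priority queueing discipline on the peer-facing interface (assumed eth0)
tc qdisc add dev eth0 root handle 1: prio bands 3

# Send peer traffic (port 2380) to the highest-priority band
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip sport 2380 0xffff flowid 1:1
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 2380 0xffff flowid 1:1

# Send client traffic (port 2379) to a lower-priority band
tc filter add dev eth0 parent 1: protocol ip prio 2 u32 match ip sport 2379 0xffff flowid 1:2
tc filter add dev eth0 parent 1: protocol ip prio 2 u32 match ip dport 2379 0xffff flowid 1:2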

Could you please help with this query?

Thanks in advance

What did you expect to happen?

No restarts in etcd.

How can we reproduce it (as minimally and precisely as possible)?

etcd is deployed as a container controlled by a StatefulSet, with 3 replicas.
We upgrade our chart by changing the certificates for etcd; during the upgrade we set the PEER_AUTO_TLS_ENABLED variable from true to false.
When pod-2 is restarted by the upgrade and starts trusting the siptls cert, it cannot join the old cluster because pod-0 and pod-1 still trust the self-signed certs. So pod-2 is out of the cluster, and pod-0/pod-1 continuously flood it with peer connection requests in order to bring pod-2 back into the existing cluster. This is the expected behavior from DCED, but due to the high traffic during the upgrade and the flood of peer requests, the sending buffer inside DCED pod-1 fills up, which restarts the etcd process inside pod-1.
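
For illustration only, the toggle that triggers this looks roughly like the commands below. The StatefulSet name (etcd), namespace placeholder, and label are hypothetical stand-ins for our chart; in practice the change is applied through a helm chart upgrade rather than kubectl.

# Hypothetical sketch: flip the chart's peer auto-TLS flag on the StatefulSet;
# with the default RollingUpdate strategy the highest ordinal (pod-2) restarts first
kubectl -n <namespace> set env statefulset/etcd PEER_AUTO_TLS_ENABLED=false

# Watch the rollout: pod-2 comes back trusting the siptls cert and cannot rejoin
# pod-0/pod-1, which still trust the self-signed peer certs
kubectl -n <namespace> rollout status statefulset/etcd
kubectl -n <namespace> get pods -l app=etcd -w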

Anything else we need to know?

No response

Etcd version (please run commands below)

bash-4.4$ etcd --version
etcd Version: 3.5.15
Git SHA: 9a55333
Go Version: go1.21.12
Go OS/Arch: linux/amd64
bash-4.4$ etcdctl version
etcdctl version: 3.5.15
API version: 3.5
bash-4.4$

Etcd configuration (command line flags or environment variables)

No response

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

No response

Relevant log output

@rahulbapumore
Contributor Author

rahulbapumore commented Mar 25, 2025

Hi @kumarlokesh @ahrtr @jmhbnz
We are already using etcd v3.5.12, and in that version pipelineBufSize is already set to 64, but we are still seeing the error above.
So do you mean that once this parameter is made dynamically configurable through the etcd config (via https://github.com//pull/19663), we would need to increase pipelineBufSize even further?
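
As a side note, a rough way we check whether the peer link is still failing sends is sketched below; it assumes the metrics endpoint is reachable over plain HTTP on localhost:2379 inside the pod (adjust for TLS), and the pod/container names are placeholders.

# Peer network health: round-trip time histogram and send/receive failure counters
curl -s http://localhost:2379/metrics \
  | grep -E 'etcd_network_peer_(round_trip_time_seconds|sent_failures_total|received_failures_total)'

# Count how often the drop message appears in the member's current logs
kubectl logs <etcd-pod-1> -c <etcd-container> \
  | grep -c 'dropped internal Raft message since sending buffer is full'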

Thanks

@rahulbapumore
Contributor Author

Hi @ahrtr @jmhbnz @kumarlokesh
Any updates?
