Sudden memory increase in ETCD members #7964
We ran into a sudden memory increase on etcd members in a Kubernetes cluster, and the etcd pods got OOM killed.
I did some investigation on the members and found that each member's disk (a network volume provided by SolidFire) was a little slow, and the member's memory kept increasing until it was OOM killed.
Then I switched from the network volume to a local disk, and the OOM problem went away.
Is this expected?
Comments
The etcd server is on …
Can you please provide the etcd server logs from 19:30 to 23:00 for all the members in your cluster?
The logs are in UTC; the screenshot is in Beijing time.
@xiang90 I think this may be the fundamental fix: #7981 (comment)
There are quite a few lines on the graph. What are they? And I can see interleaved logs. Were there two etcd processes concurrently writing to the same log file? It is very hard to understand what was going on from the log. But, yes, there are tons of warnings in your log. You need to make sure the warning rate is zero most of the time, or your etcd cluster will never be healthy.
No. There's only one etcd running on the node, and the etcd container is started by the kubelet.
The warnings are caused by disk sync, I guess. But we use a local SSD and give etcd enough resources.
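(A quick way to check whether disk sync latency is actually the problem is to look at etcd's disk-related Prometheus histograms. Below is a minimal Go sketch, assuming the metrics endpoint is reachable at http://127.0.0.1:2379/metrics; adjust the address for your deployment.)

```go
// Fetch etcd's /metrics endpoint and print the disk latency histograms
// (etcd_disk_wal_fsync_duration_seconds and
// etcd_disk_backend_commit_duration_seconds). High tail latencies here
// usually explain the "apply entries took too long" warnings.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumes the etcd client port is reachable locally (an assumption,
	// not from the original thread); adjust as needed.
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "etcd_disk_wal_fsync_duration_seconds") ||
			strings.Contains(line, "etcd_disk_backend_commit_duration_seconds") {
			fmt.Println(line)
		}
	}
}
```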
@xiang90 I use a local SSD and started a whole new etcd cluster without any load; there are still a bunch of warnings about applying entries.
Why is the raft index so high? Does any other application share the same SSD? If so, move them away. Otherwise there is an issue with your SSD; probably ask your hardware guys to get it fixed.
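(For reference, the raft index, raft term, and DB size can be read per endpoint with the maintenance Status API. A minimal Go clientv3 sketch, assuming a local endpoint and the 3.1-era import path:)

```go
// Query a member's status (raft index, raft term, DB size) via the
// etcd clientv3 maintenance API.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// The endpoint is illustrative; point it at one of your members.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Status reports per-endpoint information, including the raft index.
	status, err := cli.Status(ctx, "127.0.0.1:2379")
	if err != nil {
		panic(err)
	}
	fmt.Printf("raftIndex=%d raftTerm=%d dbSize=%d\n",
		status.RaftIndex, status.RaftTerm, status.DbSize)
}
```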
No. There are only two other applications running on the same node. One is Prometheus, which uses a network volume. The other is a Grafana dashboard, which doesn't use any disk I/O.
Try killing your other two applications. If the problem still exists, then you probably need to ask your hardware guys what is going on. If etcd is under no load, an apply should always be zero or one simple disk hits.
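(To rule out the SSD itself, a simple fsync latency probe in the spirit of etcd's WAL writes can help. A rough Go sketch, assuming a scratch file on the same disk etcd uses; the path and record size are illustrative, not from the thread:)

```go
// Rough fsync latency probe: append small records and fsync after each
// one, loosely mimicking etcd's WAL write pattern. On a healthy local
// SSD the per-sync latency should be well under a few milliseconds.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	// Illustrative path; put this on the same disk etcd's WAL uses.
	f, err := os.OpenFile("/var/lib/etcd/fsync-probe.tmp",
		os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, 512) // small record, roughly WAL-entry sized
	var worst time.Duration
	const rounds = 100

	start := time.Now()
	for i := 0; i < rounds; i++ {
		t := time.Now()
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		if err := f.Sync(); err != nil {
			panic(err)
		}
		if d := time.Since(t); d > worst {
			worst = d
		}
	}
	fmt.Printf("avg=%v worst=%v\n", time.Since(start)/rounds, worst)
}
```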
I'll check the hardware side of the performance issue.
@zbindenren I believe your issues should be fixed if you upgrade to 3.1.5+. There is a bug in etcd releases > 3.1 but < 3.1.5. I am going to close this issue since it is inactive and we do not have enough information to help you.
@xiang90 Thanks. I'll upgrade our cluster to v3.1.9.
The problem is fixed with 3.1.9.