Sudden memory increase in ETCD members #7964

Closed
armstrongli opened this issue May 22, 2017 · 15 comments

@armstrongli

We ran into a sudden memory increase on etcd members in our k8s cluster, and the etcd pods got OOM killed.

I did some investigation on the members and found that the member's disk (a network volume provided by SolidFire) was a little slow, and the member's memory kept increasing until it was OOM killed.
After I switched from the network volume to a local disk, the OOM kills stopped.

Is it expected?

[screenshot: etcd member memory usage, 2017-05-22 22:45]
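
For anyone hitting the same symptom, a rough way to check whether storage latency is the culprit from etcd's own metrics (a sketch; the endpoint and CA path are just examples matching the etcdctl command used later in this thread, and it assumes this etcd version exposes the etcd_disk_* histograms on /metrics):

curl -s --cacert /etc/ssl/kubernetes/ca.crt \
  https://tess-node-xq2sm-5915.51.tess.io:4001/metrics \
  | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds_(sum|count)'

Dividing each _sum by its _count gives the average WAL fsync / backend commit latency; values well above a few milliseconds usually point at the disk rather than at etcd itself.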

@armstrongli

The ETCD server is on v3.0.15

@xiang90

xiang90 commented May 22, 2017

Can you please provide the etcd server logs from 19:30 to 23:00 for all the members in your cluster?
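
A sketch of one way to cut that window out of a member's log, assuming the timestamps in etcd.log look like the ones quoted later in this thread ("2017-05-22 19:30:00.000000") and adjusting the date/times for the UTC offset if needed:

awk '($1 " " $2) >= "2017-05-22 19:30" && ($1 " " $2) < "2017-05-22 23:01"' etcd.log > etcd-window.log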

@armstrongli

The log is in UTC; the screenshot is in Beijing time.
etcd.log.2017.05.22.1.tar.gz

@armstrongli

@xiang90 I think this may address the root cause: #7981 (comment)

@xiang90

xiang90 commented May 26, 2017

There are quite a few lines on the graph. What are they?

Also, I can see interleaved logs. Were there two etcd processes writing to the same log file concurrently? It is very hard to understand what was going on from the log.

That said, there are tons of warnings in your log. You need to make sure the warning rate is 0 most of the time, or your etcd cluster will never be happy.
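
As a rough check, the warning rate per hour can be counted straight from the log (a sketch; it assumes the log format shown later in this thread):

grep ' W | etcdserver: apply entries took too long' etcd.log | cut -d: -f1 | sort | uniq -c

The cut keeps the date and hour from each timestamp, so the output is a per-hour count of slow-apply warnings; ideally most hours should not appear at all.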

@armstrongli

Were there two etcd processes writing to the same log file concurrently?

No. There is only one etcd running on the node, and the etcd container is started by the kubelet.

there are tons of warnings in your log

I guess the warnings are caused by disk sync. We use a local SSD and give etcd enough resources.
Do you have any recommendations for reducing the warnings or easing them?

@armstrongli

@xiang90 I am using a local SSD and started a whole new etcd cluster without any load, and there are still a bunch of warnings about applying entries.

/var/log # etcdctl --endpoints [$(echo $member_urls | sed -e "s/ /,/g")] --cacert=/etc/ssl/kubernetes/ca.crt endpoint status -w table;
+----------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                   ENDPOINT                   |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://tess-node-xq2sm-5915.51.tess.io:4001 | 5902a07919e43cdf | 3.0.15  | 25 kB   | true      |         4 |     108016 |
| https://tess-node-2o9hu-4886.51.tess.io:4001 | b620b4c395187fad | 3.0.15  | 25 kB   | false     |         4 |     108016 |
| https://tess-node-873t0-8911.51.tess.io:4001 | c2cb0e006c421dd3 | 3.0.15  | 25 kB   | false     |         4 |     108016 |
| https://tess-node-vh8tr-6888.51.tess.io:4001 | d49c2ef31cf07365 | 3.0.15  | 25 kB   | false     |         4 |     108016 |
| https://tess-node-ba93c-3871.51.tess.io:4001 | df3a73af6b1c2179 | 3.0.15  | 25 kB   | false     |         4 |     108017 |
+----------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
/var/log # tail etcd.log
2017-05-26 04:48:07.465530 W | etcdserver: apply entries took too long [70.226991ms for 1 entries]
2017-05-26 04:48:07.465587 W | etcdserver: avoid queries with large range/delete range!
2017-05-26 04:51:10.465522 W | etcdserver: apply entries took too long [69.872088ms for 1 entries]
2017-05-26 04:51:10.465573 W | etcdserver: avoid queries with large range/delete range!
2017-05-26 04:51:11.465581 W | etcdserver: apply entries took too long [70.156267ms for 1 entries]
2017-05-26 04:51:11.465624 W | etcdserver: avoid queries with large range/delete range!
2017-05-26 05:01:11.965522 W | etcdserver: apply entries took too long [70.038266ms for 1 entries]
2017-05-26 05:01:11.965589 W | etcdserver: avoid queries with large range/delete range!
2017-05-26 05:02:46.965542 W | etcdserver: apply entries took too long [28.357591ms for 1 entries]
2017-05-26 05:02:46.965601 W | etcdserver: avoid queries with large range/delete range!
/var/log #

@xiang90

xiang90 commented May 26, 2017

108016

Why is the raft index so high?

Does any other application share the same SSD? If yes, move them away. Otherwise there is an issue with your SSD; you should probably ask your hardware people to get it fixed.
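
One quick way to see what else is hitting the device (a sketch; both tools come from the sysstat package, and the intervals are just examples):

iostat -x 1 5    # per-device await/%util, sampled every second
pidstat -d 1 5   # per-process disk read/write rates

If nothing except etcd shows meaningful write traffic and the device still shows high await, the problem is likely below the application layer.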

@armstrongli

Does any other application share the same SSD?

No. There are only two other applications running on the same node. One is Prometheus, which uses a network volume. The other is a Grafana dashboard, which does not use any disk I/O.

@xiang90

xiang90 commented May 26, 2017

@armstrongli

Try killing your other two applications. If the problem still exists, then you probably need to ask your hardware people what is going on. If etcd is under no load, an apply should always be at most one simple disk hit.
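
To sanity-check the disk directly, a synthetic fdatasync benchmark against the filesystem that holds the etcd data directory can help (a sketch; the directory, size, and block size below are just example values, and fio needs to be installed):

mkdir -p /var/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/etcd/fio-test --size=22m --bs=2300 --name=wal-fsync-test

The fsync/fdatasync latency percentiles in fio's output should sit in the low single-digit milliseconds on a healthy local SSD; anything much slower lines up with the slow-apply warnings above.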

@armstrongli

I'll look into the performance issue at the hardware level.

@zbindenren

Hi

We have a similar problem with version 3.1.3.
[graph: memory usage]

[graph: leader elections]

It is always the leader that consumes the memory.

We have 3 nodes with around 1200 clients connected.
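
A quick way to compare how much memory each member is actually holding, from etcd's own metrics (a sketch; the endpoint is a placeholder for one of the three nodes):

curl -s http://etcd-1.example.com:2379/metrics \
  | grep -E '^(process_resident_memory_bytes|go_memstats_heap_inuse_bytes)'

Comparing these values across the three members should confirm whether it really is only the leader that grows.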

@xiang90

xiang90 commented Jun 14, 2017

@zbindenren I believe your issue should be fixed if you upgrade to 3.1.5+. There is a bug in etcd releases > 3.1 but < 3.1.5.

@armstrongli

I am going to close this issue since it is inactive and we do not have enough information to help you.

xiang90 closed this as completed Jun 14, 2017
@armstrongli

@xiang90 Thanks. I'll upgrade our cluster to v3.1.9.
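
After the upgrade, re-running the same status command used earlier in this thread should show 3.1.9 in the VERSION column for every member:

etcdctl --endpoints [$(echo $member_urls | sed -e "s/ /,/g")] --cacert=/etc/ssl/kubernetes/ca.crt endpoint status -w table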

@zbindenren

The problem is fixed with 3.1.9.
