Sudden memory increase in ETCD members #7964

Closed
armstrongli opened this issue May 22, 2017 · 15 comments

@armstrongli

We ran into a sudden memory increase on etcd members in our k8s cluster, and the etcd pods got OOM killed.

I did some investigation on the members and found that the member's disk (a network volume provided by SolidFire) was a little slow, and the member's memory kept increasing until it was OOM killed.
After I switched from the network volume to a local disk, the OOM kills stopped.

Is it expected?

[screenshot: etcd member memory usage, 2017-05-22 22:45]
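
For anyone hitting the same symptom, a rough way to check whether storage latency is the culprit from etcd's own metrics (a sketch; the endpoint and CA path are just examples matching the etcdctl command used later in this thread, and it assumes this etcd version exposes the etcd_disk_* histograms on /metrics):

curl -s --cacert /etc/ssl/kubernetes/ca.crt \
  https://tess-node-xq2sm-5915.51.tess.io:4001/metrics \
  | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds_(sum|count)'

Dividing each _sum by its _count gives the average WAL fsync / backend commit latency; values well above a few milliseconds usually point at the disk rather than at etcd itself.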

@armstrongli

The ETCD server is on v3.0.15

@xiang90

xiang90 commented May 22, 2017

Can you please provide the etcd server logs from 19:30 to 23:00 for all the members in your cluster?
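
A sketch of one way to cut that window out of a member's log, assuming the timestamps in etcd.log look like the ones quoted later in this thread ("2017-05-22 19:30:00.000000") and adjusting the date/times for the UTC offset if needed:

awk '($1 " " $2) >= "2017-05-22 19:30" && ($1 " " $2) < "2017-05-22 23:01"' etcd.log > etcd-window.log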

@armstrongli

The log is in UTC; the screenshot is in Beijing time.
etcd.log.2017.05.22.1.tar.gz

@armstrongli

@xiang90 I think this may address the root cause: #7981 (comment)

@xiang90

xiang90 commented May 26, 2017

There are quite a few lines on the graph. What are they?

Also, I can see interleaved logs. Were there two etcd processes writing to the same log file concurrently? It is very hard to understand what was going on from the log.

That said, there are tons of warnings in your log. You need to make sure the warning rate is 0 most of the time, or your etcd cluster will never be happy.
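
As a rough check, the warning rate per hour can be counted straight from the log (a sketch; it assumes the log format shown later in this thread):

grep ' W | etcdserver: apply entries took too long' etcd.log | cut -d: -f1 | sort | uniq -c

The cut keeps the date and hour from each timestamp, so the output is a per-hour count of slow-apply warnings; ideally most hours should not appear at all.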

@armstrongli

Were there two etcd processes writing to the same log file concurrently?

No. There is only one etcd running on the node, and the etcd container is started by the kubelet.

there are tons of warnings in your log

I guess the warnings are caused by disk sync. We use a local SSD and give etcd enough resources.
Do you have any recommendations for reducing the warnings or easing them?

@armstrongli

@xiang90 I am using a local SSD and started a whole new etcd cluster without any load, and there are still a bunch of warnings about applying entries.

/var/log # etcdctl --endpoints [$(echo $member_urls | sed -e "s/ /,/g")] --cacert=/etc/ssl/kubernetes/ca.crt endpoint status -w table;
+----------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
|                   ENDPOINT                   |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+----------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://tess-node-xq2sm-5915.51.tess.io:4001 | 5902a07919e43cdf | 3.0.15  | 25 kB   | true      |         4 |     108016 |
| https://tess-node-2o9hu-4886.51.tess.io:4001 | b620b4c395187fad | 3.0.15  | 25 kB   | false     |         4 |     108016 |
| https://tess-node-873t0-8911.51.tess.io:4001 | c2cb0e006c421dd3 | 3.0.15  | 25 kB   | false     |         4 |     108016 |
| https://tess-node-vh8tr-6888.51.tess.io:4001 | d49c2ef31cf07365 | 3.0.15  | 25 kB   | false     |         4 |     108016 |
| https://tess-node-ba93c-3871.51.tess.io:4001 | df3a73af6b1c2179 | 3.0.15  | 25 kB   | false     |         4 |     108017 |
+----------------------------------------------+------------------+---------+---------+-----------+-----------+------------+
/var/log # tail etcd.log
2017-05-26 04:48:07.465530 W | etcdserver: apply entries took too long [70.226991ms for 1 entries]
2017-05-26 04:48:07.465587 W | etcdserver: avoid queries with large range/delete range!
2017-05-26 04:51:10.465522 W | etcdserver: apply entries took too long [69.872088ms for 1 entries]
2017-05-26 04:51:10.465573 W | etcdserver: avoid queries with large range/delete range!
2017-05-26 04:51:11.465581 W | etcdserver: apply entries took too long [70.156267ms for 1 entries]
2017-05-26 04:51:11.465624 W | etcdserver: avoid queries with large range/delete range!
2017-05-26 05:01:11.965522 W | etcdserver: apply entries took too long [70.038266ms for 1 entries]
2017-05-26 05:01:11.965589 W | etcdserver: avoid queries with large range/delete range!
2017-05-26 05:02:46.965542 W | etcdserver: apply entries took too long [28.357591ms for 1 entries]
2017-05-26 05:02:46.965601 W | etcdserver: avoid queries with large range/delete range!
/var/log #

@xiang90

xiang90 commented May 26, 2017

108016

Why is the raft index so high?

Does any other application share the same SSD? If yes, move them away. Otherwise there is an issue with your SSD; you should probably ask your hardware people to get it fixed.
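
One quick way to see what else is hitting the device (a sketch; both tools come from the sysstat package, and the intervals are just examples):

iostat -x 1 5    # per-device await/%util, sampled every second
pidstat -d 1 5   # per-process disk read/write rates

If nothing except etcd shows meaningful write traffic and the device still shows high await, the problem is likely below the application layer.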

@armstrongli

Does any other application share the same SSD?

No. There are only two other applications running on the same node. One is Prometheus, which uses a network volume. The other is a Grafana dashboard, which does not use any disk I/O.

@xiang90

xiang90 commented May 26, 2017

@armstrongli

Try killing your other two applications. If the problem still exists, then you probably need to ask your hardware people what is going on. If etcd is under no load, an apply should always be at most one simple disk hit.
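
To sanity-check the disk directly, a synthetic fdatasync benchmark against the filesystem that holds the etcd data directory can help (a sketch; the directory, size, and block size below are just example values, and fio needs to be installed):

mkdir -p /var/etcd/fio-test
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/etcd/fio-test --size=22m --bs=2300 --name=wal-fsync-test

The fsync/fdatasync latency percentiles in fio's output should sit in the low single-digit milliseconds on a healthy local SSD; anything much slower lines up with the slow-apply warnings above.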

@armstrongli

I'll look into the performance issue at the hardware level.

@zbindenren

Hi

We have a similar problem with version 3.1.3.
[graph: memory usage]

[graph: leader elections]

It is always the leader that consumes the memory.

We have 3 nodes with around 1200 clients connected.
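
A quick way to compare how much memory each member is actually holding, from etcd's own metrics (a sketch; the endpoint is a placeholder for one of the three nodes):

curl -s http://etcd-1.example.com:2379/metrics \
  | grep -E '^(process_resident_memory_bytes|go_memstats_heap_inuse_bytes)'

Comparing these values across the three members should confirm whether it really is only the leader that grows.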

@xiang90

xiang90 commented Jun 14, 2017

@zbindenren I believe your issue should be fixed if you upgrade to 3.1.5+. There is a bug in etcd releases > 3.1 but < 3.1.5.

@armstrongli

I am going to close this issue since it is inactive and we do not have enough information to help you.

xiang90 closed this as completed Jun 14, 2017
@armstrongli

@xiang90 Thanks. I'll upgrade our cluster to v3.1.9.
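
After the upgrade, re-running the same status command used earlier in this thread should show 3.1.9 in the VERSION column for every member:

etcdctl --endpoints [$(echo $member_urls | sed -e "s/ /,/g")] --cacert=/etc/ssl/kubernetes/ca.crt endpoint status -w table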

@zbindenren

The problem is fixed with 3.1.9.
