CPU usage increased dramatically 0.8.1-RC1 -> master #1497
Comments
Turns out that I was using 0.22.0 as a base image, will try with 0.22.1. |
Nah, it's still bad: Elasticsearch has hot_threads api, do you have something similar so I can give more meaningful data? |
You can use a poor man's profiling tool: could you run jstack a couple of times on the Marathon process and send us the stack traces?
> jstack <MARATHON_PID> >stackX.txt
> jstack <MARATHON_PID> >stackX.txt
> jstack <MARATHON_PID> >stackX.txt
... You might have to enter the process space with |
@bobrik I couldn't reproduce this. I ran v0.8.2-RC2 in a docker and started 100 tasks without the Marathon process using considerably more than 10% CPU. What does your setup look like? |
@bobrik Would it be possible for you to change the health checks to COMMAND checks that call curl and see if the CPU usage is still that high then? |
I tried, but it didn't work:
healthChecks:
  - protocol: COMMAND
    command:
      value: "curl -f -X GET http://$HOST:$PORT0/?n=marathon_healthcheck"
    gracePeriodSeconds: 15
    maxConsecutiveFailures: 300
    intervalSeconds: 2
    timeoutSeconds: 5
No more info is provided to resolve the issue. Task is healthy and works with http check. Can be related to #1380. |
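To make the two variants easier to compare, here is a small sketch that builds both health check definitions as they would appear in a Marathon app JSON. The timing values and the curl command are taken from the config above; the HTTP variant (`path`, `portIndex`) is only an assumption about what the working check looked like, and the helper name `health_check` is hypothetical.

```python
import json


def health_check(protocol, *, path=None, command=None,
                 grace=15, interval=2, timeout=5, max_failures=300):
    """Build one entry for the "healthChecks" list of a Marathon app definition."""
    check = {
        "protocol": protocol,
        "gracePeriodSeconds": grace,
        "intervalSeconds": interval,
        "timeoutSeconds": timeout,
        "maxConsecutiveFailures": max_failures,
    }
    if protocol == "COMMAND":
        if command is None:
            raise ValueError("COMMAND checks need a command")
        # COMMAND checks carry the shell command under command.value.
        check["command"] = {"value": command}
    elif protocol == "HTTP":
        # Assumed shape of the HTTP check; port index 0 corresponds to $PORT0.
        check["path"] = path or "/"
        check["portIndex"] = 0
    else:
        raise ValueError("unsupported protocol: " + protocol)
    return check


command_check = health_check(
    "COMMAND",
    command="curl -f -X GET http://$HOST:$PORT0/?n=marathon_healthcheck",
)
http_check = health_check("HTTP", path="/?n=marathon_healthcheck")

print(json.dumps({"healthChecks": [command_check]}, indent=2))
```

Swapping `command_check` for `http_check` in the printed document is the only difference between the two setups, which keeps the comparison of CPU usage fair.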
Probably worth mentioning: each of 3 marathons receives 1 rps for |
I guess these queries are more or less the only thing I see in the stack traces that sticks out:
I don't know yet why this would have changed recently. |
Flamegraphs: https://gist.github.com/bobrik/969d322bb28c6a649cf7 https://github.com/jrudolph/perf-map-agent 0.8.1-RC1: 0.8.2-SNAPSHOT: Blue line, higher is 0.8.2: |
@drexin, were you using OpenJDK when you tried reproducing this? I wonder if that had something to do with it. https://gist.github.com/bobrik/87b8903cc3d502afe888 suggests that these numbers are all using that. |
I'm using Sun Java 1.7 not openjdk if that makes a difference |
Hi @bobrik, BTW I hope that it is clear that we really appreciate your detailed reporting. Unfortunately, we can't reproduce it so far. I could imagine that it has something to do with the Mesos Library changes. Did you try 0.8.1-RC1 with the 0.22.1 mesos libraries by any chance? I am not sure if that is a supported configuration but if that exhibits the same CPU pattern, the reason could lie in the new Mesos Library version. I think it might make sense to implement #1539 soon and check if your problems persist. What do you think? |
Should I just try 0.8.1-RC1 tag on top of Removing native code sounds like a good idea, too much is happening there. |
Hi @bobrik, if it's not a big hassle (at least in comparison to the things you have already done), trying 0.8.1-RC1 on top of mesosphere/mesos:0.22.1 would be grand. 👍 |
Ok, I'll try to collect metrics from master on 0.8.1-RC1 and 0.8.2-RC3 with Meanwhile, can you tell me what is needed from zk when I ask for |
Hi @bobrik, actually, reads currently go to Zookeeper as well. We want to change that. Basically, it is also a trade-off between looking at current user problems (which needs time) and rewriting some of the code (which needs time) which might actually solve these issues anyway. So, without wanting to sound smart, analyzing this issue actually prevents me from rewriting code. But I do not like to release 0.8.2 before we understand the implications. |
Hi @bobrik, I cannot really make sense of your graphs. What's the old, what's the new version? What makes you suspicious? Can you actually tell us the configuration parameters you start marathon with? I assume, they are the same between the old and the new version? Thanks. |
Sorry for not making it clear. Environment vars for marathon (in an ansible playbook):
MARATHON_MASTER: zk://web488:2181,web489:2181,web490:2181/mesos
MARATHON_ZK: zk://web488:2181,web489:2181,web490:2181/marathon-new
MARATHON_ZK_MAX_VERSIONS: 10
MARATHON_HOSTNAME: "{{ inventory_hostname }}"
They are the same for both versions. Now to the graphs, new ones this time; I hope they are clearer. Here I ran 0.8.2-RC3 for 40 minutes, then 0.8.1-RC1 for 40 minutes, then 0.8.1-RC1 on top of 0.22.1 libs for 10 minutes. Marathon cluster: Zookeeper cluster for marathon and mesos, same time: The enormous difference in the used bandwidth to zookeeper is suspicious: 200 kb/s vs 5000 kb/s. CPU load and packet rate are also higher with 0.8.2-RC3. Metrics for this, for 0.8.2 with Does it make sense now? Thank you for your patience. |
….statuses and make MarathonHealthCheckManager data structures more efficient
Hi @bobrik, thanks to your extensive reporting, we found the offender. The metrics in the gist helped us out. If you are really adventurous, you can checkout the |
Thanks, I'll test it tomorrow. Is there an issue to remove unnecessary zk read requests? |
Hi @bobrik, that is surprising. Can you export the metrics for us again? |
Metrics after 10 minutes: https://gist.github.com/bobrik/bb8b852eb1156624a3b8 |
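In case it helps with comparing two such exports, here is a minimal sketch that lines up the timers from two metrics dumps. It assumes the export is the usual Dropwizard-style JSON with a top-level "timers" object whose entries have "count" and "mean" fields; that layout and the helper name `compare_timers` are assumptions, not something confirmed in this thread.

```python
def compare_timers(old, new):
    """Pair up timers present in both exports, largest change in mean first."""
    old_timers = old.get("timers", {})
    new_timers = new.get("timers", {})
    rows = []
    # Only timers that exist in both exports can be compared.
    for name in sorted(old_timers.keys() & new_timers.keys()):
        before, after = old_timers[name], new_timers[name]
        rows.append({
            "name": name,
            "mean_old": before["mean"],
            "mean_new": after["mean"],
            "count_old": before["count"],
            "count_new": after["count"],
        })
    rows.sort(key=lambda r: abs(r["mean_new"] - r["mean_old"]), reverse=True)
    return rows
```

Loading both gists with `json.load` and printing the first few rows shows which endpoints changed most in mean time and whether the request counts differ between the runs. The values are compared in whatever unit the export uses.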
I'll try to summarize the findings (correct me if I am wrong). When
you see that
So the new version still uses more CPU than 0.8.1-RC1 but is otherwise fine. The increased CPU could be potentially explained by more requests. At least in the last comparison with full metrics for some reason we saw significantly more requests against the new version, maybe because of faster response times. The new version has a mean response time of 35ms for AppResource.index compared to 126ms for the old version. If that is correct, we would like to release a new RC with the fix. |
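The "faster responses lead to more requests" idea can be sanity-checked with a back-of-the-envelope model. The sketch below assumes closed-loop clients, meaning each client waits for a response (plus an optional pause) before sending the next request. Whether the real clients behave that way is not known from this thread; the client count and pause values are illustrative only.

```python
def closed_loop_rate(clients, response_s, think_s=0.0):
    """Requests per second that a fixed set of closed-loop clients can generate."""
    return clients / (response_s + think_s)


OLD_MEAN_S = 0.126  # mean AppResource.index time reported for the old version
NEW_MEAN_S = 0.035  # mean AppResource.index time reported for the new version

# Clients that fire again immediately scale their rate with the response time.
busy_ratio = closed_loop_rate(1, NEW_MEAN_S) / closed_loop_rate(1, OLD_MEAN_S)

# Clients that pause for a second between requests barely notice the change.
paced_ratio = (closed_loop_rate(1, NEW_MEAN_S, think_s=1.0)
               / closed_loop_rate(1, OLD_MEAN_S, think_s=1.0))

print(f"no pause: {busy_ratio:.2f}x more requests")
print(f"1s pause: {paced_ratio:.2f}x more requests")
```

Under this model the request rate only rises noticeably if the clients poll back to back. Clients on a fixed timer would send the same number of requests regardless of response time, so the higher request count would then need a different explanation.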
…_efficient Fixes #1497 - Do not query app versions in MarathonHealthCheckManager
0.8.1-RC1:
PR 1568:
Much better than master, but still worse than 0.8.1-RC1. With reduced background usage (only 0.8.1-RC1:
PR 1568:
Go ahead with the RC, I'll reduce the load with label selectors, SSE and probably the Mesos API as the source of truth. |
Hi @bobrik, the strange thing is that the metrics that you gave us earlier told a different story. If you want, you can still send us the related metrics and I'll have a look. We will not release master as an RC but the old RC with only this single fix. Maybe it works better, maybe not. |
….statuses and make MarathonHealthCheckManager data structures more efficient
I built and deployed 964e430. Running against 0.22.1 masters:
Upgrade started at 16:20, last node was updated at 16:31. Revert to 0.8.1-RC1 happened at 16:41.
I mentioned performance in #1472 as well.