
Nodes appear and disappear intermittently #897

Closed
2opremio opened this issue Feb 1, 2016 · 13 comments
Labels
accuracy: Incorrect information is being shown to the user; usually a bug
bug: Broken end user or developer functionality; not working as the developers intended it
performance: Excessive resource usage and latency; usually a bug or chore
Milestone

Comments

@2opremio
Contributor

2opremio commented Feb 1, 2016

While testing #889 (b103f93) with the ECS demo, after letting it run for a few hours I see containers appearing and disappearing intermittently.

There are 3 httpserver containers and 3 dataproducer containers that should appear at the same time in the UI, but they come and go erratically:

Here's the report, although I am not sure it will help here: https://gist.github.com/2opremio/b3b4f435b568fb836306

I don't see particularly high CPU or memory consumption, but the UI is sluggish and the topologies take a considerable time to load.

The ECS demo uses very small AWS instances (t2.micro), but Scope works just fine when it's freshly spawned.

@2opremio
Contributor Author

2opremio commented Feb 1, 2016

Related: #869 #827 (and maybe #854)

@2opremio 2opremio changed the title from "Nodes appear in a disappear intermittently" to "Nodes appear and disappear intermittently" Feb 3, 2016
@2opremio 2opremio added the "bug" label Feb 26, 2016
@2opremio 2opremio added this to the 0.14.0 milestone Mar 4, 2016
@2opremio 2opremio assigned 2opremio and unassigned 2opremio Apr 14, 2016
@paulbellamy
Contributor

paulbellamy commented Apr 15, 2016

Possibly the probes on different hosts are being slow and missing their deadlines.

@tomwilkie tomwilkie modified the milestones: 0.14.0, Pre-1.0 Apr 15, 2016
@tomwilkie
Contributor

Paul can't reproduce.

@tomwilkie
Contributor

@2opremio please try and reproduce.

@tomwilkie tomwilkie modified the milestones: 0.15.0, Pre-1.0 Apr 19, 2016
@2opremio
Contributor Author

2opremio commented Apr 25, 2016

I can reproduce.

Possibly the probes on different hosts are being slow and missing their deadlines.

I think @paulbellamy is right.

Looking at the logs I find a lot of:

<probe> WARN: 2016/04/25 16:25:22.817941 Topology tagger took longer than 1s
<probe> WARN: 2016/04/25 16:25:37.092299 Endpoint reporter took longer than 1s
<probe> WARN: 2016/04/25 16:25:38.651785 Docker reporter took longer than 1s
<probe> WARN: 2016/04/25 16:25:42.024260 Topology tagger took longer than 1s
<probe> WARN: 2016/04/25 16:25:55.008294 Endpoint reporter took longer than 1s
<probe> WARN: 2016/04/25 16:25:58.244352 Topology tagger took longer than 1s
<probe> WARN: 2016/04/25 16:26:11.398667 Endpoint reporter took longer than 1s
<probe> WARN: 2016/04/25 16:21:15.602084 docker container: dropping stats.
<app> ERRO: 2016/04/25 16:21:16.178981 Error on websocket: websocket: close 1006 (abnormal closure): unexpected EOF

Full logs: logs.txt.gz

It seems that the AWS micro instances used for the demo are not powerful enough for Scope. They used to work fine, so there must have been a performance degradation.
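
(For context: the warning format above suggests each reporter/tagger is timed against the probe's tick interval, with a warning logged whenever the work overruns it. The Go sketch below only illustrates that pattern; it is not Scope's actual code, and the reporter/run names are invented.)

package main

import (
	"log"
	"time"
)

// reporter stands in for components like the Endpoint or Docker reporter.
type reporter struct {
	name string
	fn   func() error // gathers one report
}

// run invokes the reporter and warns when it misses its deadline, which is the
// behaviour visible in the probe logs above on underpowered hosts.
func run(r reporter, interval time.Duration) {
	start := time.Now()
	if err := r.fn(); err != nil {
		log.Printf("ERRO: %s reporter failed: %v", r.name, err)
	}
	if took := time.Since(start); took > interval {
		log.Printf("WARN: %s reporter took longer than %v", r.name, interval)
	}
}

func main() {
	slow := reporter{
		name: "Endpoint",
		fn: func() error {
			time.Sleep(1500 * time.Millisecond) // simulate a CPU-starved t2.micro
			return nil
		},
	}
	run(slow, time.Second) // logs: WARN: Endpoint reporter took longer than 1s
}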

@2opremio
Contributor Author

Yep, the CPU consumption of the probe/app is pretty high considering the small number of containers and that there are only 3 hosts:

[Screenshot: CPU consumption of the probe and app containers]

Probe profile: pprof.localhost:4041.samples.cpu.002.pb.gz

[Image: probe CPU profile]

App profile: pprof.localhost:4040.samples.cpu.001.pb.gz

[Image: app CPU profile]

The garbage collector is dominating the CPU consumption.

Related: #1010
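
(Aside: CPU profiles like the ones attached can be collected over HTTP when a Go binary exposes the standard net/http/pprof handlers. The sketch below is only an illustration of that setup, not Scope's own wiring; the :4041 address is assumed from the profile filename above.)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running inside the process, a CPU profile can be fetched with:
	//   go tool pprof http://localhost:4041/debug/pprof/profile
	// and inspected with `top` or `web` to see, e.g., GC dominating CPU time.
	log.Fatal(http.ListenAndServe("localhost:4041", nil))
}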

@2opremio 2opremio added the "performance" label Apr 25, 2016
@2opremio
Contributor Author

2opremio commented Apr 25, 2016

So, the solution here is to:

  • Improve performance (obviously)
  • Maybe use larger instances for the ECS demos (I would like to avoid this if possible since larger instances are not covered by the free tier)
  • Notify users in the UI when probes are not meeting their deadlines. I have created #1379 ("Notify users when probes are not meeting their deadlines") for this.

@tomwilkie
Contributor

I think #1418 should help a lot with the app's CPU usage. We could probably close this one.

@tomwilkie
Contributor

#1418 is in, so I think we can close this.

@2opremio
Contributor Author

2opremio commented May 13, 2016

Reopening since it still happens with the 0.15 candidate:

All the probes are dropping reports:

[ec2-user@ip-172-31-0-6 ~]$ docker logs --since="5m"  weavescope |& grep Dropping | tail -n 20 && date
<probe> ERRO: 2016/05/13 10:45:36.602905 Dropping report to 10.32.0.2:4040
<probe> ERRO: 2016/05/13 10:45:36.602948 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:45:41.777678 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:45:46.752431 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:45:50.243380 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:45:50.243410 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:45:54.513633 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:45:59.237180 Dropping report to 10.32.0.2:4040
<probe> ERRO: 2016/05/13 10:45:59.237238 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:46:01.406421 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:46:01.406500 Dropping report to 10.32.0.2:4040
<probe> ERRO: 2016/05/13 10:46:01.406560 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:46:06.290269 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:46:06.290325 Dropping report to 10.32.0.2:4040
<probe> ERRO: 2016/05/13 10:46:06.290377 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:46:09.527878 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:46:09.527907 Dropping report to 10.32.0.2:4040
<probe> ERRO: 2016/05/13 10:46:13.921125 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:46:18.355134 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:46:18.355182 Dropping report to 127.0.0.1:4040
Fri May 13 10:46:22 UTC 2016

The apps are consuming ~40% CPU on two of the nodes, just like they did at the beginning of this issue.

[Screenshot: CPU usage of the apps on the nodes]

This means that #1418 doesn't seem to be helping.

App profile: pprof.localhost:4040.samples.cpu.001.pb.gz

[Image: app CPU profile]
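
(For reference: "Dropping report to <addr>" is the typical symptom of a non-blocking publish into a buffer that is full because the receiver cannot keep up. The Go sketch below is my own illustration of that back-pressure pattern, not Scope's actual publisher; the publisher/Publish names are invented.)

package main

import (
	"fmt"
	"log"
)

type report struct{ seq int }

// publisher buffers a small number of reports per app endpoint and drops on overflow.
type publisher struct {
	addr string
	buf  chan report
}

func newPublisher(addr string, depth int) *publisher {
	return &publisher{addr: addr, buf: make(chan report, depth)}
}

// Publish never blocks the probe's reporting loop: when the buffer is full the
// report is dropped and an error is logged, matching the output above.
func (p *publisher) Publish(r report) {
	select {
	case p.buf <- r:
	default:
		log.Printf("ERRO: Dropping report to %s", p.addr)
	}
}

func main() {
	p := newPublisher("127.0.0.1:4040", 1)
	for i := 1; i <= 3; i++ {
		p.Publish(report{seq: i}) // with no consumer and depth 1, reports 2 and 3 are dropped
	}
	fmt.Println("reports still buffered:", len(p.buf))
}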

@2opremio 2opremio reopened this May 13, 2016
@2opremio 2opremio modified the milestones: Pre-1.0, 0.15.0 May 13, 2016
@2opremio
Contributor Author

Related: #1457

@2opremio
Contributor Author

2opremio commented Aug 2, 2016

This may already be fixed after the recent CPU-consumption improvements. Worth reviewing again.

@2opremio 2opremio modified the milestones: August2016, July2016 Aug 2, 2016
@rade rade modified the milestones: 0.18/1.0, October2016 Sep 15, 2016
@rade rade added the "accuracy" label Jan 11, 2017
@rade
Member

rade commented Apr 13, 2017

Let's close this; I don't see anything in this issue that points to causes other than CPU usage, which is already covered by numerous issues.

@rade rade closed this as completed Apr 13, 2017
@rade rade modified the milestones: n/a, Backlog Apr 13, 2017