
Nodes appear and disappear intermittently #897

Closed
2opremio opened this issue Feb 1, 2016 · 13 comments
Labels
accuracy: Incorrect information is being shown to the user; usually a bug
bug: Broken end user or developer functionality; not working as the developers intended it
performance: Excessive resource usage and latency; usually a bug or chore
Milestone

Comments

@2opremio
Contributor

2opremio commented Feb 1, 2016

While testing #889 (b103f93) with the ECS demo, after letting it run for a few hours I see containers appearing and disappearing intermittently.

There are 3 httpserver containers and 3 dataproducer containers that should appear at the same time in the UI, but they come and go erratically:

Here's the report, although I am not sure it will help here: https://gist.github.com/2opremio/b3b4f435b568fb836306

I don't see particularly high CPU or memory consumption, but the UI is sluggish and the topologies take a considerable time to load.

The ECS demo uses very small AWS instances (t2.micro), but Scope works just fine when it's freshly spawned.

@2opremio
Contributor Author

2opremio commented Feb 1, 2016

Related: #869 #827 (and maybe #854)

@2opremio 2opremio changed the title from "Nodes appear in a disappear intermittently" to "Nodes appear and disappear intermittently" Feb 3, 2016
@2opremio 2opremio added the "bug" label Feb 26, 2016
@2opremio 2opremio added this to the 0.14.0 milestone Mar 4, 2016
@2opremio 2opremio assigned 2opremio and unassigned 2opremio Apr 14, 2016
@paulbellamy
Contributor

paulbellamy commented Apr 15, 2016

Possibly the probes on different hosts are being slow and missing their deadlines.

@tomwilkie tomwilkie modified the milestones: 0.14.0, Pre-1.0 Apr 15, 2016
@tomwilkie
Contributor

Paul can't reproduce.

@tomwilkie
Contributor

@2opremio please try and reproduce.

@tomwilkie tomwilkie modified the milestones: 0.15.0, Pre-1.0 Apr 19, 2016
@2opremio
Contributor Author

2opremio commented Apr 25, 2016

I can reproduce.

Possibly the probes on different hosts are being slow and missing their deadlines.

I think @paulbellamy is right.

Looking at the logs I find a lot of:

<probe> WARN: 2016/04/25 16:25:22.817941 Topology tagger took longer than 1s
<probe> WARN: 2016/04/25 16:25:37.092299 Endpoint reporter took longer than 1s
<probe> WARN: 2016/04/25 16:25:38.651785 Docker reporter took longer than 1s
<probe> WARN: 2016/04/25 16:25:42.024260 Topology tagger took longer than 1s
<probe> WARN: 2016/04/25 16:25:55.008294 Endpoint reporter took longer than 1s
<probe> WARN: 2016/04/25 16:25:58.244352 Topology tagger took longer than 1s
<probe> WARN: 2016/04/25 16:26:11.398667 Endpoint reporter took longer than 1s
<probe> WARN: 2016/04/25 16:21:15.602084 docker container: dropping stats.
<app> ERRO: 2016/04/25 16:21:16.178981 Error on websocket: websocket: close 1006 (abnormal closure): unexpected EOF

Full logs: logs.txt.gz

It seems that the AWS micro instances used for the demo are not powerful enough for Scope. They used to work fine, so there must have been a performance degradation.
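
(For context: the warning format above suggests each reporter/tagger is timed against the probe's tick interval, with a warning logged whenever the work overruns it. The Go sketch below only illustrates that pattern; it is not Scope's actual code, and the reporter/run names are invented.)

package main

import (
	"log"
	"time"
)

// reporter stands in for components like the Endpoint or Docker reporter.
type reporter struct {
	name string
	fn   func() error // gathers one report
}

// run invokes the reporter and warns when it misses its deadline, which is the
// behaviour visible in the probe logs above on underpowered hosts.
func run(r reporter, interval time.Duration) {
	start := time.Now()
	if err := r.fn(); err != nil {
		log.Printf("ERRO: %s reporter failed: %v", r.name, err)
	}
	if took := time.Since(start); took > interval {
		log.Printf("WARN: %s reporter took longer than %v", r.name, interval)
	}
}

func main() {
	slow := reporter{
		name: "Endpoint",
		fn: func() error {
			time.Sleep(1500 * time.Millisecond) // simulate a CPU-starved t2.micro
			return nil
		},
	}
	run(slow, time.Second) // logs: WARN: Endpoint reporter took longer than 1s
}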

@2opremio
Contributor Author

Yep, the CPU consumption of the probe/app is pretty high considering the small number of containers and that there are only 3 hosts:

[Screenshot: CPU consumption of the probe and app containers]

Probe profile: pprof.localhost:4041.samples.cpu.002.pb.gz

[Image: probe CPU profile]

App profile: pprof.localhost:4040.samples.cpu.001.pb.gz

[Image: app CPU profile]

The garbage collector is dominating the CPU consumption.

Related: #1010
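
(Aside: CPU profiles like the ones attached can be collected over HTTP when a Go binary exposes the standard net/http/pprof handlers. The sketch below is only an illustration of that setup, not Scope's own wiring; the :4041 address is assumed from the profile filename above.)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running inside the process, a CPU profile can be fetched with:
	//   go tool pprof http://localhost:4041/debug/pprof/profile
	// and inspected with `top` or `web` to see, e.g., GC dominating CPU time.
	log.Fatal(http.ListenAndServe("localhost:4041", nil))
}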

@2opremio 2opremio added the "performance" label Apr 25, 2016
@2opremio
Contributor Author

2opremio commented Apr 25, 2016

So, the solution here is to:

  • Improve performance (obviously)
  • Maybe use larger instances for the ECS demos (I would like to avoid this if possible since larger instances are not covered by the free tier)
  • Notify users in the UI when probes are not meeting their deadlines. I have created #1379 ("Notify users when probes are not meeting their deadlines") for this.

@tomwilkie
Contributor

I think #1418 should help a lot with the app's CPU usage. We could probably close this one.

@tomwilkie
Contributor

#1418 is in, so I think we can close this.

@2opremio
Contributor Author

2opremio commented May 13, 2016

Reopening since it still happens with the 0.15 candidate:

All the probes are dropping reports:

[ec2-user@ip-172-31-0-6 ~]$ docker logs --since="5m"  weavescope |& grep Dropping | tail -n 20 && date
<probe> ERRO: 2016/05/13 10:45:36.602905 Dropping report to 10.32.0.2:4040
<probe> ERRO: 2016/05/13 10:45:36.602948 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:45:41.777678 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:45:46.752431 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:45:50.243380 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:45:50.243410 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:45:54.513633 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:45:59.237180 Dropping report to 10.32.0.2:4040
<probe> ERRO: 2016/05/13 10:45:59.237238 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:46:01.406421 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:46:01.406500 Dropping report to 10.32.0.2:4040
<probe> ERRO: 2016/05/13 10:46:01.406560 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:46:06.290269 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:46:06.290325 Dropping report to 10.32.0.2:4040
<probe> ERRO: 2016/05/13 10:46:06.290377 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:46:09.527878 Dropping report to 127.0.0.1:4040
<probe> ERRO: 2016/05/13 10:46:09.527907 Dropping report to 10.32.0.2:4040
<probe> ERRO: 2016/05/13 10:46:13.921125 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:46:18.355134 Dropping report to 10.36.0.2:4040
<probe> ERRO: 2016/05/13 10:46:18.355182 Dropping report to 127.0.0.1:4040
Fri May 13 10:46:22 UTC 2016

The apps are consuming ~40% CPU on two of the nodes, just like they did at the beginning of this issue.

[Screenshot: CPU usage of the apps on the nodes]

This means that #1418 doesn't seem to be helping.

App profile: pprof.localhost:4040.samples.cpu.001.pb.gz

[Image: app CPU profile]
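
(For reference: "Dropping report to <addr>" is the typical symptom of a non-blocking publish into a buffer that is full because the receiver cannot keep up. The Go sketch below is my own illustration of that back-pressure pattern, not Scope's actual publisher; the publisher/Publish names are invented.)

package main

import (
	"fmt"
	"log"
)

type report struct{ seq int }

// publisher buffers a small number of reports per app endpoint and drops on overflow.
type publisher struct {
	addr string
	buf  chan report
}

func newPublisher(addr string, depth int) *publisher {
	return &publisher{addr: addr, buf: make(chan report, depth)}
}

// Publish never blocks the probe's reporting loop: when the buffer is full the
// report is dropped and an error is logged, matching the output above.
func (p *publisher) Publish(r report) {
	select {
	case p.buf <- r:
	default:
		log.Printf("ERRO: Dropping report to %s", p.addr)
	}
}

func main() {
	p := newPublisher("127.0.0.1:4040", 1)
	for i := 1; i <= 3; i++ {
		p.Publish(report{seq: i}) // with no consumer and depth 1, reports 2 and 3 are dropped
	}
	fmt.Println("reports still buffered:", len(p.buf))
}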

@2opremio 2opremio reopened this May 13, 2016
@2opremio 2opremio modified the milestones: Pre-1.0, 0.15.0 May 13, 2016
@2opremio
Contributor Author

Related: #1457

@2opremio
Contributor Author

2opremio commented Aug 2, 2016

This may already be fixed after the recent CPU-consumption improvements. Worth reviewing again.

@2opremio 2opremio modified the milestones: August2016, July2016 Aug 2, 2016
@rade rade modified the milestones: 0.18/1.0, October2016 Sep 15, 2016
@rade rade added the "accuracy" label Jan 11, 2017
@rade
Member

rade commented Apr 13, 2017

Let's close this; I don't see anything in this issue that points to causes other than CPU usage, which is already covered by numerous issues.

@rade rade closed this as completed Apr 13, 2017
@rade rade modified the milestones: n/a, Backlog Apr 13, 2017