
Using containers that mount /var/run/docker.sock causes No Route To Host in others #1846

Closed
alph486 opened this issue Dec 31, 2015 · 3 comments


@alph486

alph486 commented Dec 31, 2015

Background

This is similar to #1184 (and #1455 according to @squaremo ) . The troubleshooting history for this is here: https://groups.google.com/a/weave.works/forum/#!topic/weave-users/jYXOGyf3SOA.

Summary

When leveraging a container that mounts /var/run/docker.sock (e.g. cAdvisor, Logspout, ...), ARPs will not be received by containers on the Weave network, resulting in stale MAC addresses and ultimately ConnectionRefused exceptions. According to the Google Groups conversation linked above, this may have been fixed in later kernel versions, but the problem still occurs after upgrading.

Detail and Reproduction

I have a multi-host cluster with ServiceA, ServiceB, and Logspout all configured and launched by docker-compose. Services A/B are based on tag 5.1 of this image. Logspout is progrium/logspout and the configuration in compose is:

logspout:
  image: progrium/logspout
  volumes:
    - /var/run/docker.sock:/tmp/docker.sock
  command: "<some syslog things>"

All of these containers are on the same host in the cluster. Each weave node was started with weave launch --ipalloc-range=15.0.0.0/16 <other hosts>. All services are launched with docker-compose, with the DOCKER_HOST env var set to DOCKER_HOST=unix:///var/run/weave/weave.sock.
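
For concreteness, the per-host setup looks roughly like this (the peer host names and the docker-compose invocation are placeholders; the flags are the ones quoted above):

# On each Docker host, start the weave router, listing the other hosts as peers
weave launch --ipalloc-range=15.0.0.0/16 <other hosts>

# Point docker-compose at the weave proxy socket before bringing the services up
export DOCKER_HOST=unix:///var/run/weave/weave.sock
docker-compose up -d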

After launching:
1 - Run docker exec ServiceB ip addr and get the following for ethwe

978: ethwe@if979: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 06:de:15:5c:3f:07 brd ff:ff:ff:ff:ff:ff
    inet 15.0.0.9/16 scope global ethwe
       valid_lft forever preferred_lft forever
    inet6 fe80::4de:15ff:fe5c:3f07/64 scope link 
       valid_lft forever preferred_lft forever

2 - Run docker exec ServiceA ip neigh show and get the following entry for ServiceB:

15.0.0.9 dev ethwe lladdr 06:de:15:5c:3f:07 STALE

All is right with the world.

3 - After some time (sometimes also seen after restarting / recreating), ServiceA's cached MAC for ServiceB becomes out of sync with ServiceB's actual MAC and the two will differ. (I assume this can be correlated to something happening in docker logs weave.) A way to compare the two is sketched after this list.

4 - At this point, enter ServiceA's container with docker exec -it ServiceA bash and curl ServiceB:port; the request fails with a NoRouteToHost or ConnectionRefused error.

5 - The issue can be temporarily relieved by restarting or recreating the ServiceB container. Repeat step 4 and it will then work properly.
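
One way to spot the mismatch from step 3 (a sketch, assuming ServiceB still holds 15.0.0.9 as above):

# ServiceA's cached entry for ServiceB...
docker exec ServiceA ip neigh show | grep 15.0.0.9
# ...versus ServiceB's actual MAC on ethwe; the link/ether values differ when the entry is stale
docker exec ServiceB ip addr show ethwe | grep link/ether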

If there is a better or more reliable way to catch the ARPs that may be going missing, I'm all ears; this is simply how I've had to observe it.
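
One way to watch for the ARPs directly (a sketch; tcpdump is not part of my images and may need to be installed, and as far as I know the host bridge is named weave):

# Watch ARP traffic as ServiceB sees it
docker exec ServiceB tcpdump -ni ethwe arp

# Or watch it on the host's weave bridge
tcpdump -ni weave arp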

Environment

  • Ubuntu 15.10, kernel 4.2.0-18-generic
  • Weave 1.3.1
  • Docker 1.8.3
  • Docker-Compose 1.5.1
  • Docker hosts: 5
  • Containers: ~25

Conclusion

It is probably obvious, but the behavior I would like is to be able to use tools / containers that need the Docker socket alongside Weave. Logspout, cAdvisor, and others are common and prevalent tools in the Docker ecosystem, as is the Ubuntu 15.x distro.

From my conversations on Google Groups, my guess is that this has to do with the container that mounts /var/run/docker.sock ACKing the messages before weaveproxy can broadcast any updated MACs.

Please let me know if any more info is needed or if there is an obvious workaround or fix for this situation.

Thanks!

@bboreham
Contributor

bboreham commented Jan 7, 2016

@alph486 thank you for this report. Unfortunately, without specific information on how to reproduce it (e.g. valid substitutes for ServiceA and ServiceB) it would be somewhat hit-and-miss for us to troubleshoot.

The previous conversation on weave-users was all about the "connection refused" symptom. We have a script, listed at #1455 (comment), which will print out all the addresses and the namespaces using them; this let us track down a similar issue in the past. If you can recreate the problem on your set-up, run the script on all hosts involved and post the results; this may give us some hints.
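
Roughly speaking (this is only an approximation for orientation, not the actual script from #1455), it walks every container's network namespace and lists the addresses configured in it:

# Rough approximation of the diagnostic: list addresses per container network namespace
for c in $(docker ps -q); do
  pid=$(docker inspect --format '{{.State.Pid}}' "$c")
  echo "=== $(docker inspect --format '{{.Name}}' "$c") (pid $pid) ==="
  nsenter -t "$pid" -n ip addr show
done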

The "no route to host" symptom is different; similar to #1184, but we have (mostly) stopped using pcap since then so we need a new theory. If you can recreate this, doing weave status connections, weave report, and running the weave router with --log-level=debug may give some clues. (Beware that debug-level logging will be very verbose if the two routers cannot establish a "fast datapath" connection)

@alph486
Author

alph486 commented Jan 11, 2016

@bboreham Thank you for the response! I generalized ServiceA and ServiceB because in the configuration mentioned above, nearly every container in my stack (Python APIs using Flask, NodeJS apps, MongoDB, ElasticSearch, and so on) has experienced the issue at one time or another. This leads me to believe it is technology agnostic.

On that note, the only thing my images have in common is that they are all derived from the official ubuntu image.

Regarding the script you mentioned - Does my cluster need to be currently exhibiting the "Connection Refused" behavior for the script output to be useful?

@bboreham
Contributor

Does my cluster need to be currently exhibiting the "Connection Refused" behavior for the script output to be useful?

yes
