
Using containers that mount /var/run/docker.sock causes No Route To Host in others #1846

Closed
alph486 opened this issue Dec 31, 2015 · 3 comments


@alph486

alph486 commented Dec 31, 2015

Background

This is similar to #1184 (and #1455 according to @squaremo ) . The troubleshooting history for this is here: https://groups.google.com/a/weave.works/forum/#!topic/weave-users/jYXOGyf3SOA.

Summary

When leveraging a container that mounts /var/run/docker.sock (e.g. cAdvisor, Logspout, ...), ARPs will not be received by containers on the Weave network, resulting in stale MAC addresses and ultimately ConnectionRefused exceptions. According to the Google Groups conversation linked above, this may have been fixed in later kernel versions, but the problem still occurs after upgrading.

Detail and Reproduction

I have a multi-host cluster with ServiceA, ServiceB, and Logspout all configured and launched by docker-compose. Services A/B are based on tag 5.1 of this image. Logspout is progrium/logspout and the configuration in compose is:

logspout:
  image: progrium/logspout
  volumes:
    - /var/run/docker.sock:/tmp/docker.sock
  command: "<some syslog things>"

All of these containers are on the same host in the cluster. Each weave node was started with weave launch --ipalloc-range=15.0.0.0/16 <other hosts>. All services are launched with docker-compose, with the DOCKER_HOST env var set to DOCKER_HOST=unix:///var/run/weave/weave.sock.
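
For concreteness, the per-host setup looks roughly like this (the peer host names and the docker-compose invocation are placeholders; the flags are the ones quoted above):

# On each Docker host, start the weave router, listing the other hosts as peers
weave launch --ipalloc-range=15.0.0.0/16 <other hosts>

# Point docker-compose at the weave proxy socket before bringing the services up
export DOCKER_HOST=unix:///var/run/weave/weave.sock
docker-compose up -d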

After launching:
1 - Run docker exec ServiceB ip addr and get the following for ethwe

978: ethwe@if979: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1410 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 06:de:15:5c:3f:07 brd ff:ff:ff:ff:ff:ff
    inet 15.0.0.9/16 scope global ethwe
       valid_lft forever preferred_lft forever
    inet6 fe80::4de:15ff:fe5c:3f07/64 scope link 
       valid_lft forever preferred_lft forever

2 - Run docker exec ServiceA ip neigh show and get the following entry for ServiceB:

15.0.0.9 dev ethwe lladdr 06:de:15:5c:3f:07 STALE

All is right with the world.

3 - After some time (sometimes also seen after restarting / recreating), ServiceA's cached MAC for ServiceB becomes out of sync with ServiceB's actual MAC and the two will differ. (I assume this can be correlated to something happening in docker logs weave.) A way to compare the two is sketched after this list.

4 - At this point, enter ServiceA's container with docker exec -it ServiceA bash and curl ServiceB:port; the request fails with a NoRouteToHost or ConnectionRefused error.

5 - The issue can be temporarily relieved by restarting or recreating the ServiceB container. Repeat step 4 and it will then work properly.
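
One way to spot the mismatch from step 3 (a sketch, assuming ServiceB still holds 15.0.0.9 as above):

# ServiceA's cached entry for ServiceB...
docker exec ServiceA ip neigh show | grep 15.0.0.9
# ...versus ServiceB's actual MAC on ethwe; the link/ether values differ when the entry is stale
docker exec ServiceB ip addr show ethwe | grep link/ether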

If there is a better or more reliable way to catch the ARPs that may be going missing, I'm all ears; this is simply how I've had to observe it.
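
One way to watch for the ARPs directly (a sketch; tcpdump is not part of my images and may need to be installed, and as far as I know the host bridge is named weave):

# Watch ARP traffic as ServiceB sees it
docker exec ServiceB tcpdump -ni ethwe arp

# Or watch it on the host's weave bridge
tcpdump -ni weave arp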

Environment

  • Ubuntu 15.10, kernel 4.2.0-18-generic
  • Weave 1.3.1
  • Docker 1.8.3
  • Docker-Compose 1.5.1
  • Docker hosts: 5
  • Containers: ~25

Conclusion

It is probably obvious, but the behavior I would like is to be able to use tools / containers that need the Docker socket alongside Weave. Logspout, cAdvisor, and others are common and prevalent tools in the Docker ecosystem, as is the Ubuntu 15.x distro.

From my conversations on Google Groups, my guess is that this has to do with the container that mounts /var/run/docker.sock ACKing the messages before weaveproxy can broadcast any updated MACs.

Please let me know if any more info is needed or if there is an obvious workaround or fix for this situation.

Thanks!

@bboreham
Contributor

bboreham commented Jan 7, 2016

@alph486 thank you for this report. Unfortunately, without specific information on how to reproduce it (e.g. valid substitutes for ServiceA and ServiceB) it would be somewhat hit-and-miss for us to troubleshoot.

The previous conversation on weave-users was all about the "connection refused" symptom. We have a script, listed at #1455 (comment), which will print out all the addresses and the namespaces using them; this let us track down a similar issue in the past. If you can recreate the problem on your set-up, run the script on all hosts involved and post the results; this may give us some hints.
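
Roughly speaking (this is only an approximation for orientation, not the actual script from #1455), it walks every container's network namespace and lists the addresses configured in it:

# Rough approximation of the diagnostic: list addresses per container network namespace
for c in $(docker ps -q); do
  pid=$(docker inspect --format '{{.State.Pid}}' "$c")
  echo "=== $(docker inspect --format '{{.Name}}' "$c") (pid $pid) ==="
  nsenter -t "$pid" -n ip addr show
done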

The "no route to host" symptom is different; similar to #1184, but we have (mostly) stopped using pcap since then so we need a new theory. If you can recreate this, doing weave status connections, weave report, and running the weave router with --log-level=debug may give some clues. (Beware that debug-level logging will be very verbose if the two routers cannot establish a "fast datapath" connection)

@alph486
Author

alph486 commented Jan 11, 2016

@bboreham Thank you for the response! I generalized ServiceA and ServiceB because in the configuration mentioned above, nearly every container in my stack (Python APIs using Flask, NodeJS apps, MongoDB, ElasticSearch, and so on) has experienced the issue at one time or another. This leads me to believe it is technology agnostic.

On that note, the only thing my images have in common is that they are all derived from the official ubuntu image.

Regarding the script you mentioned - Does my cluster need to be currently exhibiting the "Connection Refused" behavior for the script output to be useful?

@bboreham
Contributor

Does my cluster need to be currently exhibiting the "Connection Refused" behavior for the script output to be useful?

yes
