Skip to content
This repository has been archived by the owner on Sep 12, 2024. It is now read-only.

[solo] Fix skydns - two wrongs make a right #899

Closed
wants to merge 3 commits into from

Conversation

davidxia
Copy link
Contributor

@davidxia davidxia commented Apr 6, 2016

No description provided.

davidxia added 2 commits April 6, 2016 17:01
Skydns now needs to be compiled with go 1.5+.
TL;DR When two DNS servers don't work, add one more!

When running some integration tests with HeliosSoloDeployment on Docker
hosts that use a local unbound instance as its DNS resolver (i.e.
specified in `/etc/resolv.conf` on the Docker host),
we saw tests failures due to failed SRV queries to skydns. Skydns is
running in the solo container and forwards DNS queries it doesn't know
about to the unbound instance via logic in `start.sh`.

The skydns error output from the helios solo container spawned by
HeliosSoloDeployment looked like:

```
skydns: failure to forward request "dns: failed to unpack truncated
message"
```

Our guess is that large UDP responses from the upstream unbound
have the "Message Truncated" DNS flag set. When this type of response
reaches skydns, skydns blows up and doesn't tell the client about the
error. The client times out without retrying in TCP mode. The client
would've retried if it had received an error message from skydns.

Running `dig` against skydns works. We think this is because `dig` adds
an OPT record to its query that sets "udp payload size: 4096".

Here's an outstanding issue in skydns that seem related:

* skynetservices/skydns#242
* skynetservices/skydns#45

Solution:

We start an unbound instance in the solo container and have it speak
only TCP to the upstream skydns (in the same container) with
`tcp-upstream: yes`. This forces skydns to speak only TCP with its
upstream. No UDP truncation shit. Things are fixed. :)

We admit this is super funky, but there's no way to force skydns to
speak only TCP right now.
@davidxia
Copy link
Contributor Author

davidxia commented Apr 6, 2016

@gimaker
Copy link
Contributor

gimaker commented Apr 6, 2016

👍

@davidxia
Copy link
Contributor Author

davidxia commented Apr 6, 2016

We should somehow pin our skydns build artifact. Right now skydns is rebuilt from master here https://github.com/skynetservices/skydns whenever someone manually builds helios solo base image. We have no way of knowing what kind of skydns we've built later.

@codecov-io
Copy link

Current coverage is 48.96%

Merging #899 into master will increase coverage by +0.05% as of 0d0e5c9

@@            master    #899   diff @@
======================================
  Files          271     271       
  Stmts        12706   12706       
  Branches      1634    1634       
  Methods          0       0       
======================================
+ Hit           6215    6221     +6
- Partial        470     473     +3
+ Missed        6021    6012     -9

Review entire Coverage Diff as of 0d0e5c9

Powered by Codecov. Updated on successful CI builds.

@mattnworb
Copy link
Member

Nice explanation in the commit message 👍🏻

Is it meant to commit the -verbose flag to skydns in here? I wonder if the output would become large if solo was running for a while and sending lots of DNS requests (I'm not sure how docker logs stores things).

For the unpredictable nature of building SkyDNS, is it available as a package to install with apt-get?

@gimaker
Copy link
Contributor

gimaker commented Apr 6, 2016

We admit this is super funky, but there's no way to force skydns to
speak only TCP right now.

And it shouldn't. What it should do is properly forward the response with the truncation flag set to the requester (who would then retry using TCP). Instead it just blows up and never responds to the requester at all :(

We add yet another unbound between skydns and the upstream DNS resolver.
We saw that the previous commit broke docker-machine on my OS X. For
some reason, the upstream DNS resolver doesn't speak TCP. So this second
unbound translates TCP from skydns back to UDP for this upstream.

I can't even...
@davidxia davidxia force-pushed the dxia/fix-solo-skydns branch from aade797 to e006705 Compare April 6, 2016 22:30
@davidxia
Copy link
Contributor Author

davidxia commented Apr 6, 2016

This doesn't work on my mac now. See #900 for an alternative approach

@davidxia davidxia closed this Apr 6, 2016
@davidxia davidxia deleted the dxia/fix-solo-skydns branch April 6, 2016 23:38
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants