DNS issues with dnsmasq on OpenShift #321

Closed
mikz opened this issue Mar 21, 2017 · 0 comments · Fixed by #324

mikz commented Mar 21, 2017

Sometimes the internal DNS resolver can't resolve a Service for some period of time (30s before, 10s now, see #318), even though the Service should be accessible.

The cause is a race condition in how dnsmasq handles replies.

When OpenShift deploys a service, it reloads the internal DNS server, which makes the first queries take longer. There are two DNS servers available: the cluster DNS and the global DNS (for public records).
dnsmasq returns the first response it receives, from whichever server answers first. When the cluster DNS takes longer, dnsmasq returns the response from the public server, which contains only an SOA record:

;; QUESTION SECTION:
;doo.bar.local.			IN	A
;; AUTHORITY SECTION:
.			86397	IN	SOA	a.root-servers.net. nstld.verisign-grs.com. 2017032002 1800 900 604800 86400

dnsmasq considers this a valid reply (even though it does not match the question section) and returns it. So when the cluster DNS server takes longer to reply than the public one, internal names can't be resolved.

Current Result

The first reply that contains ANY answer is used.

Expected Result

The first reply that contains the same class as the query should be used.
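
A minimal sketch of that check, written in Python for illustration rather than in the Lua the project would use. The dictionary layout for questions and replies is hypothetical, and "matches" is assumed here to mean the same name, class and record type. CNAME chains are not handled.

```python
def answers_question(question, reply):
    """Return True if the reply's answer section has a record for the question."""
    wanted = (question["name"].lower().rstrip("."), question["class"], question["type"])
    for record in reply.get("answer", []):
        got = (record["name"].lower().rstrip("."), record["class"], record["type"])
        if got == wanted:
            return True
    return False


question = {"name": "doo.bar.local.", "class": "IN", "type": "A"}

# Reply from the public server: empty answer section, only a root SOA
# record in the authority section (as in the output quoted in this issue).
public_reply = {
    "answer": [],
    "authority": [{"name": ".", "class": "IN", "type": "SOA"}],
}

# Reply the cluster DNS would eventually give (the address is made up).
cluster_reply = {
    "answer": [{"name": "doo.bar.local.", "class": "IN", "type": "A",
                "data": "172.30.0.10"}],
    "authority": [],
}

print(answers_question(question, public_reply))   # False
print(answers_question(question, cluster_reply))  # True
```

With a check like this, the SOA-only reply would be rejected instead of being treated as the final result.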

Proposal

We could query dnsmasq first and then the other servers defined in resolv.conf. That would make dnsmasq the first-layer cache, and in Lua we can easily verify that the answer matches the question, or otherwise ignore the result and continue to query the other servers. This would introduce some latency but increase correctness.
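
The proposed lookup order could be sketched roughly as follows. This is Python for illustration only, not the actual implementation. The `query` callable, the record tuples and all server addresses are hypothetical stand-ins for a real DNS client and the contents of resolv.conf.

```python
def resolve(name, servers, query, qtype="A"):
    """Try servers in order; return the first reply that answers the question.

    `query(server, name, qtype)` is a hypothetical callable returning a list of
    (name, type, data) answer records, or raising OSError on a network error.
    """
    wanted = name.lower().rstrip(".")
    for server in servers:
        try:
            answers = query(server, name, qtype)
        except OSError:
            continue  # server unreachable or timed out: try the next one
        matching = [a for a in answers
                    if a[0].lower().rstrip(".") == wanted and a[1] == qtype]
        if matching:
            return server, matching
        # The reply did not answer the question (e.g. SOA only): ignore it.
    return None, []


def fake_query(server, name, qtype):
    """Stand-in for real DNS queries, for demonstration only."""
    if server == "127.0.0.1":   # dnsmasq passing on the SOA-only public reply
        return []
    if server == "10.0.0.2":    # a server that is down
        raise OSError("timed out")
    return [(name, "A", "172.30.0.10")]  # cluster DNS with the real record


# dnsmasq first, then the remaining nameservers (addresses are made up).
servers = ["127.0.0.1", "10.0.0.2", "172.30.0.1"]
print(resolve("doo.bar.local.", servers, fake_query))
```

Because servers are tried one after another, a lookup that dnsmasq cannot answer correctly costs extra round trips, which is the latency trade-off mentioned above.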

@mikz mikz added this to the On-premise CR1 release milestone Mar 21, 2017
mikz added a commit that referenced this issue Mar 21, 2017
closes #321

we need to use all nameservers because we can't trust dnsmasq
as it returns the first answer even though it is not the same class
@ghost ghost assigned mikz Mar 21, 2017
@ghost ghost added the B-current label Mar 21, 2017
@octobot octobot added the T-obux label Mar 21, 2017
@mikz mikz closed this as completed in #324 Mar 21, 2017
@ghost ghost removed the B-current label Mar 21, 2017