Sometimes, for a period of time (30s before, 10s now, see #318), the internal DNS resolver can't resolve a Service even though the Service should already be accessible.
This happens because of dnsmasq and a race condition between the upstream DNS servers.
When OpenShift deploys a Service, it reloads the internal DNS server, which makes the first queries after the reload slower. There are two DNS servers available: the cluster DNS and the global DNS (for public records).
dnsmasq returns the first response it receives, from whichever server answers first. So when the cluster DNS is slower, dnsmasq returns the response from the public server, which is a negative answer carrying only an SOA record:
```
;; QUESTION SECTION:
;doo.bar.local. IN A

;; AUTHORITY SECTION:
. 86397 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2017032002 1800 900 604800 86400
```
dnsmasq considers this a valid reply (even though it does not answer the question) and returns it. So whenever the cluster DNS server takes longer to reply than the public one, internal names can't be resolved.
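A toy model of the race, in Python, assuming (as described above) that the query goes to both upstreams and the first reply wins. The delays, the `Reply` shape and the `172.30.0.10` address are hypothetical; this is not dnsmasq's actual forwarding code.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Reply:
    server: str
    answers: list = field(default_factory=list)    # records that answer the question
    authority: list = field(default_factory=list)  # e.g. the root SOA of a negative reply


async def upstream(delay: float, reply: Reply) -> Reply:
    # Stand-in for an upstream DNS server that needs `delay` seconds to answer.
    await asyncio.sleep(delay)
    return reply


async def first_reply_wins(qname: str, qtype: str) -> Reply:
    # Hypothetical timings: the cluster DNS is slow right after a reload,
    # while the public DNS quickly sends a negative (SOA-only) reply.
    cluster = upstream(0.05, Reply("cluster", answers=[(qname, qtype, "172.30.0.10")]))
    public = upstream(0.01, Reply("public", authority=[(".", "SOA", "a.root-servers.net.")]))
    tasks = [asyncio.create_task(cluster), asyncio.create_task(public)]
    # Whichever reply arrives first is returned, without checking
    # whether it actually answers the question.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()


reply = asyncio.run(first_reply_wins("doo.bar.local.", "A"))
print(reply.server, reply.answers)  # public []
```

The fast negative reply wins, so the client sees no A record for `doo.bar.local.` even though the cluster DNS would have answered a few milliseconds later.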
Current Result
The first reply received is used, whatever it contains.
Expected Result
The first reply that matches the query (contains a record of the same class as the question) should be used.
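A sketch of the expected selection rule, assuming replies are processed in arrival order. Note that the root SOA above is also class `IN`, so comparing the class alone would not reject it; this sketch therefore also compares the queried name and type. The `Record`/`Reply` types and the `172.30.0.10` address are hypothetical, and CNAME chains are not handled.

```python
from typing import NamedTuple, Optional, Sequence


class Record(NamedTuple):
    name: str
    rclass: str  # e.g. "IN"
    rtype: str   # e.g. "A", "SOA"
    data: str


class Reply(NamedTuple):
    server: str
    answers: tuple = ()
    authority: tuple = ()


def answers_question(qname: str, qclass: str, qtype: str, reply: Reply) -> bool:
    # Only the answer section counts: an SOA in the authority section
    # (a negative reply) never satisfies the question.
    return any(r.name == qname and r.rclass == qclass and r.rtype == qtype
               for r in reply.answers)


def pick_reply(qname: str, qclass: str, qtype: str,
               replies: Sequence[Reply]) -> Optional[Reply]:
    # `replies` is assumed to be in arrival order; skip the ones that don't match.
    for reply in replies:
        if answers_question(qname, qclass, qtype, reply):
            return reply
    return None


public = Reply("public", authority=(
    Record(".", "IN", "SOA",
           "a.root-servers.net. nstld.verisign-grs.com. 2017032002 1800 900 604800 86400"),))
cluster = Reply("cluster", answers=(Record("doo.bar.local.", "IN", "A", "172.30.0.10"),))

chosen = pick_reply("doo.bar.local.", "IN", "A", [public, cluster])
print(chosen.server)  # cluster
```

If no reply qualifies, a real resolver would still have to pass a negative answer back to the client; the sketch just returns `None`.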
Proposal
We could query dnsmasq first and then the other servers defined in resolv.conf. That would make dnsmasq a first-layer cache, and in Lua we can easily verify that the answer matches the question, or ignore the result and continue querying the other servers. This would introduce some latency but improve correctness.
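A Python sketch of the proposed fallback order (the real change would live in Lua). Assumptions, all hypothetical: dnsmasq listens on `127.0.0.1`, the resolv.conf contents and addresses are made up, and the actual network query plus wire-format parsing is replaced by an injected `query(server, qname, qtype)` callable returning `(name, class, type, data)` tuples.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Answer = Tuple[str, str, str, str]  # (name, class, type, data)
QueryFn = Callable[[str, str, str], List[Answer]]


def nameservers_from_resolv_conf(text: str) -> List[str]:
    # Collect `nameserver <addr>` lines; commented-out lines don't match.
    servers = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def resolve(qname: str, qtype: str, servers: Sequence[str], query: QueryFn,
            qclass: str = "IN") -> Optional[Tuple[str, List[Answer]]]:
    # Ask servers one at a time (dnsmasq first) and accept the first answer
    # that matches the question; otherwise move on to the next server.
    for server in servers:
        try:
            answers = query(server, qname, qtype)
        except OSError:
            continue  # unreachable server: try the next one
        if any(a[0] == qname and a[1] == qclass and a[2] == qtype for a in answers):
            return server, answers
    return None


DNSMASQ = "127.0.0.1"  # assumption: local dnsmasq address
resolv_conf = """\
# hypothetical resolv.conf
search bar.local
nameserver 172.30.0.1
nameserver 192.0.2.53
"""

FAKE_ANSWERS = {
    # dnsmasq relayed the public SOA-only reply, so it has no matching answer.
    (DNSMASQ, "doo.bar.local.", "A"): [],
    ("172.30.0.1", "doo.bar.local.", "A"): [("doo.bar.local.", "IN", "A", "172.30.0.10")],
}


def fake_query(server: str, qname: str, qtype: str) -> List[Answer]:
    return FAKE_ANSWERS.get((server, qname, qtype), [])


servers = [DNSMASQ] + nameservers_from_resolv_conf(resolv_conf)
result = resolve("doo.bar.local.", "A", servers, fake_query)
print(result)  # ('172.30.0.1', [('doo.bar.local.', 'IN', 'A', '172.30.0.10')])
```

The cost is the extra sequential round trip whenever dnsmasq's answer is rejected, which is the latency-for-correctness trade-off described above.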