DNS sync stops in presence of any Virtual Service that points to nonexistent gateway #3628

Closed
sedflix opened this issue May 24, 2023 · 4 comments · Fixed by #3686
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


sedflix commented May 24, 2023

What happened:

  1. We had some erroneous Virtual Services in our cluster that pointed to an Istio gateway that does not exist.
  2. We upgraded to external-dns v0.13.4.
  3. We started seeing the following error messages:
     time="2023-05-24T10:49:48Z" level=error msg="Failed retrieving gateway istio-system/istio-internal-game2 referenced by VirtualService podinfo-nomesh/podinfo-lab-wrong: gateways.networking.istio.io \"istio-internal-game2\" not found"
     time="2023-05-24T10:49:48Z" level=error msg="gateways.networking.istio.io \"istio-internal-game2\" not found"
  4. external_dns_source_endpoints_total dropped to 0 and did not increase. external_dns_controller_last_sync_timestamp_seconds showed a 1970 timestamp and never advanced.
  5. When we added a new, valid Virtual Service, a debug log showed that it had been added, but the domain was never configured.
  6. After we removed the erroneous Virtual Service, the domain was created.

What you expected to happen:
One erroneous Virtual Service should not block the entire external-dns sync.

How to reproduce it (as minimally and precisely as possible):

  1. Use external-dns v0.13.4.
  2. Create a Virtual Service that points to an Istio gateway that does not exist.
  3. Create a valid Ingress or Virtual Service.

Anything else we need to know?:

Environment:

  • External-DNS version (use external-dns --version): v0.13.4.
  • DNS provider: google
  • Others:
    • source: istio-virtualservice
    • istio version: 1.15
sedflix added the kind/bug label May 24, 2023
Author

sedflix commented May 24, 2023

I assume this issue was introduced in 0.13.2 by this PR: https://github.com/kubernetes-sigs/external-dns/pull/3140/files#diff-5046d74abd634825be6a08257d8fa7655b7af289ec075c0ce348e2800fa0ee5eR294 ?

Before this PR, https://github.com/kubernetes-sigs/external-dns/pull/3140/files#diff-5046d74abd634825be6a08257d8fa7655b7af289ec075c0ce348e2800fa0ee5eR205 also hit an error in this scenario, but the loop continued. Now "err" is passed downstream and the function returns.
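
To illustrate the difference in behavior described above, here is a minimal, self-contained sketch. It is not the external-dns code: `virtualService`, `lookupGateway`, `errGatewayNotFound`, the two `hosts...` functions, and the example hostnames are all hypothetical names made up for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// virtualService is a hypothetical, simplified stand-in for an Istio VirtualService.
type virtualService struct {
	name    string
	gateway string
	host    string
}

// errGatewayNotFound is a hypothetical sentinel for a missing gateway.
var errGatewayNotFound = errors.New("gateway not found")

// lookupGateway is a hypothetical lookup against a set of known gateways.
func lookupGateway(known map[string]bool, name string) error {
	if !known[name] {
		return fmt.Errorf("%s: %w", name, errGatewayNotFound)
	}
	return nil
}

// hostsReturnOnError stops at the first failed lookup, so one bad
// VirtualService yields no endpoints at all.
func hostsReturnOnError(known map[string]bool, vss []virtualService) ([]string, error) {
	var hosts []string
	for _, vs := range vss {
		if err := lookupGateway(known, vs.gateway); err != nil {
			return nil, err
		}
		hosts = append(hosts, vs.host)
	}
	return hosts, nil
}

// hostsContinueOnError logs the failed lookup and moves on to the next
// VirtualService.
func hostsContinueOnError(known map[string]bool, vss []virtualService) []string {
	var hosts []string
	for _, vs := range vss {
		if err := lookupGateway(known, vs.gateway); err != nil {
			fmt.Printf("skipping %s: %v\n", vs.name, err)
			continue
		}
		hosts = append(hosts, vs.host)
	}
	return hosts
}

func main() {
	known := map[string]bool{"istio-system/istio-internal": true}
	vss := []virtualService{
		{name: "podinfo-lab-wrong", gateway: "istio-system/istio-internal-game2", host: "wrong.example.com"},
		{name: "podinfo-lab", gateway: "istio-system/istio-internal", host: "ok.example.com"},
	}

	hosts, err := hostsReturnOnError(known, vss)
	fmt.Println(hosts, err)

	fmt.Println(hostsContinueOnError(known, vss))
}
```

With the return-on-error variant, the valid Virtual Service never contributes a host, which matches the symptom of the endpoint count staying at 0.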

cc: @ricoberger

sedflix changed the title from "DNS sync stops in presence of any invalid Virtual Service(with non-existent gateway)" to "DNS sync stops in presence of any Virtual Service that points to nonexistent gateway" May 24, 2023

brucec5 commented Jun 13, 2023

As of 0.13.5, this problem causes external-dns to exit here, which means a developer could accidentally take down external-dns by deploying a funky VirtualService.

Contributor

szuecs commented Jun 15, 2023

@brucec5 @sedflix @rumstead then I would say the proper fix would be to change the fatal log or if all other sources are safe we should merge the linked PR.

Contributor

rumstead commented Jun 15, 2023

> @brucec5 @sedflix @rumstead then I would say the proper fix would be to change the fatal log or if all other sources are safe we should merge the linked PR.

I don't disagree that the log.Fatal seems to break the "bubble up" interface for each provider, but from the issue, it looks to have been added to prevent A record deletion when the K8s API is unavailable. You definitely have more background on the issue than I do, but wouldn't changing it regress the fix for the A record deletion?

EDIT: Taking a closer look at the code: even if we change back the log.Fatal, do we want an incorrectly configured VirtualService to prevent other Gateways from being exposed? The not found "error" will bubble up, and it looks like it prevents other VirtualServices from being processed.
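
One way to reconcile the two concerns might be to treat "not found" differently from every other error: skip the VirtualService when its gateway is missing, but still abort the sync when the lookup fails for another reason, such as an unavailable API. The following is only a sketch under that assumption, not the actual fix. `ref`, `collectHosts`, `errNotFound`, and `errUnavailable` are hypothetical names; real code would more likely test the error with something like `IsNotFound` from `k8s.io/apimachinery/pkg/api/errors`.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for Kubernetes API errors.
var (
	errNotFound    = errors.New("not found")
	errUnavailable = errors.New("api unavailable")
)

// ref is a hypothetical pairing of a hostname with the gateway it references.
type ref struct {
	host    string
	gateway string
}

// collectHosts skips references whose gateway does not exist, but aborts on
// any other lookup error so that an API outage never produces a truncated
// (and therefore destructive) endpoint list.
func collectHosts(get func(gateway string) error, refs []ref) ([]string, error) {
	var hosts []string
	for _, r := range refs {
		err := get(r.gateway)
		switch {
		case err == nil:
			hosts = append(hosts, r.host)
		case errors.Is(err, errNotFound):
			fmt.Printf("skipping %s: gateway %s: %v\n", r.host, r.gateway, err)
		default:
			return nil, fmt.Errorf("aborting sync: %w", err)
		}
	}
	return hosts, nil
}

func main() {
	refs := []ref{
		{host: "wrong.example.com", gateway: "istio-internal-game2"},
		{host: "ok.example.com", gateway: "istio-internal"},
	}

	// Lookup where only one gateway exists.
	missing := func(gateway string) error {
		if gateway != "istio-internal" {
			return errNotFound
		}
		return nil
	}
	fmt.Println(collectHosts(missing, refs))

	// Lookup where the API is down for every request.
	down := func(string) error { return errUnavailable }
	fmt.Println(collectHosts(down, refs))
}
```

In this sketch a misconfigured VirtualService costs only its own record, while an API outage still stops the sync before anything can be deleted.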
