Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Controller first boot (or two) fails when combined with istio-proxy #1968

Closed
Stono opened this issue Jan 24, 2018 · 26 comments · Fixed by #2089
Closed

Controller first boot (or two) fails when combined with istio-proxy #1968

Stono opened this issue Jan 24, 2018 · 26 comments · Fixed by #2089

Comments

@Stono
Copy link
Contributor

Stono commented Jan 24, 2018

Hey,
I've observed this behaviour, which is the container failing to boot one or two times with every update we do.

❯ kn logs -p nginx-ingress-controller-58688765db-hxlzq nginx-ingress-controller
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.10.0
  Build:      git-0836cb30
  Repository: https://github.com/aledbf/ingress
-------------------------------------------------------------------------------

I0124 08:33:25.301191       6 main.go:181] Creating API client for https://10.178.0.1:443
F0124 08:33:25.303541       6 main.go:236] Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration). Reason: Get https://10.178.0.1:443/version: dial tcp 10.178.0.1:443: getsockopt: connection refused
Refer to the troubleshooting guide for more information: https://github.com/kubernetes/ingress-nginx/blob/master/docs/troubleshooting.md

Then it'll crash, and come back to life:

❯ kn logs nginx-ingress-controller-58688765db-hxlzq nginx-ingress-controller
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.10.0
  Build:      git-0836cb30
  Repository: https://github.com/aledbf/ingress
-------------------------------------------------------------------------------

I0124 08:33:27.712235       7 main.go:181] Creating API client for https://10.178.0.1:443
I0124 08:33:27.723704       7 main.go:193] Running in Kubernetes Cluster version v1.8+ (v1.8.6-gke.0) - git (clean) commit ee9a97661f14ee0b1ca31d6edd30480c89347c79 - platform linux/amd64

Any ideas?

@aledbf
Copy link
Member

aledbf commented Jan 24, 2018

@Stono please add the flag --v=10 to increase the verbosity and be able to see the reason

@aledbf
Copy link
Member

aledbf commented Jan 24, 2018

@Stono usually this means there's some networking issue

@Stono
Copy link
Contributor Author

Stono commented Jan 24, 2018

Hey @aledbf - I added that -v=10, it does look like connectivity, i just don't understand why after a crash it then works?

It's like that initial call to the k8 master to get the version fails?

❯ kn logs -p nginx-ingress-controller-5d978f6589-d6jn6 nginx-ingress-controller --follow
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.10.0
  Build:      git-0836cb30
  Repository: https://github.com/aledbf/ingress
-------------------------------------------------------------------------------

I0124 14:12:54.035114      11 main.go:181] Creating API client for https://10.192.0.1:443
I0124 14:12:54.035922      11 round_trippers.go:417] curl -k -v -XGET  -H "Accept: application/vnd.kubernetes.protobuf, */*" -H "User-Agent: nginx-ingress-controller/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer **omitted**" https://10.192.0.1:443/version
I0124 14:12:54.036319      11 round_trippers.go:436] GET https://10.192.0.1:443/version  in 0 milliseconds
I0124 14:12:54.036707      11 round_trippers.go:442] Response Headers:
F0124 14:12:54.036739      11 main.go:236] Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration). Reason: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
Refer to the troubleshooting guide for more information: https://github.com/kubernetes/ingress-nginx/blob/master/docs/troubleshooting.md

@aledbf
Copy link
Member

aledbf commented Jan 24, 2018

Please check if the node where it works it is the same where it fails

@Stono
Copy link
Contributor Author

Stono commented Jan 24, 2018

@aledbf I've confirmed the node is the same

@Stono
Copy link
Contributor Author

Stono commented Feb 14, 2018

@aledbf I've finally got to the bottom of this.
I'm using istio on our cluster, which wraps nginx-ingress-controller in envoy, what basically happens is this:

  • proxy_init container starts and adds iptables rules to intercept outbound traffic
  • nginx-ingress-controller and envoy start at the same time
  • nginx-ingress-controller attempts to read kubernetes.default/version before istio-proxy has become available, resulting in the connection refused message.

I've actually worked around this by having a custom entrypoint on the nginx-ingress-controller, but it's a bit hacky and I'm thinking it'd be nice if nginx-ingress-controller itself implemented a retry when checking the version of kubernetes, what do you think?

#!/bin/bash
echo "Starting..."
sleep 5

MAX_TRIES=10
CURRENT_ATTEMPT=0
echo "Waiting for Kubernetes API to be available (istio-proxy bollocks)...."
until $(curl -k --output /dev/null --silent --fail https://kubernetes.default/version); do
  CURRENT_ATTEMPT=$((CURRENT_ATTEMPT+1))
  printf '.'
  sleep 2
    if [ "$CURRENT_ATTEMPT" = "$MAX_TRIES" ]; then
      echo "Failed to get kubernetes version after $MAX_TRIES"
      exit 1
    fi
done

echo "Booting nginx...($*)"
exec $*

@aledbf
Copy link
Member

aledbf commented Feb 14, 2018

@Stono please try quay.io/aledbf/nginx-ingress-controller:0.327

@Stono
Copy link
Contributor Author

Stono commented Feb 14, 2018

No joy I'm afriad @aledbf:

As you can see, 1 restart on both instances of my controller:

ingress-nginx-external-controller-9c4d5dbf6-t5nv8    3/3       Running   1          4m
ingress-nginx-external-controller-9c4d5dbf6-vkksc    3/3       Running   1          1m

Looking at the first try logs:

❯ kn logs ingress-nginx-external-controller-9c4d5dbf6-t5nv8 -c nginx-ingress-controller -p
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.10.2
  Build:      git-c8f6af39
  Repository: https://github.com/aledbf/ingress-nginx
-------------------------------------------------------------------------------

I0214 19:09:31.897256       7 flags.go:159] Watching for ingress class: nginx-external
W0214 19:09:31.897299       7 flags.go:162] only Ingress with class "nginx-external" will be processed by this ingress controller
I0214 19:09:31.897919       7 main.go:183] Creating API client for https://10.192.0.1:443
F0214 19:09:31.898967       7 main.go:257] Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration). Reason: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
Refer to the troubleshooting guide for more information: https://github.com/kubernetes/ingress-nginx/blob/master/docs/troubleshooting.md

Then the second:

❯ kn logs ingress-nginx-external-controller-9c4d5dbf6-vkksc nginx-ingress-controller
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.10.2
  Build:      git-c8f6af39
  Repository: https://github.com/aledbf/ingress-nginx
-------------------------------------------------------------------------------

I0214 19:13:16.080976       7 flags.go:159] Watching for ingress class: nginx-external
W0214 19:13:16.081024       7 flags.go:162] only Ingress with class "nginx-external" will be processed by this ingress controller
I0214 19:13:16.081594       7 main.go:183] Creating API client for https://10.192.0.1:443
I0214 19:13:16.088368       7 main.go:214] Running in Kubernetes Cluster version v1.9+ (v1.9.2-gke.1) - git (clean) commit 4ce7af72d8d343ea2f7680348852db641ff573af - platform linux/amd64

This is marginally better than before (typically I had a couple of pod crashes)! From looking at your code though if i'm reading it right you're only doing 50ms backoffs? Probably needs to be a bit longer to give the other istio pod time to boot + pull its configuration from pilot

@aledbf
Copy link
Member

aledbf commented Feb 14, 2018

@Stono try quay.io/aledbf/nginx-ingress-controller:0.328

@Stono
Copy link
Contributor Author

Stono commented Feb 14, 2018

Still the same :'( @aledbf

ingress-nginx-external-controller-865d96c779-8gkv5   3/3       Running   1          1m
ingress-nginx-external-controller-865d96c779-8rstv   3/3       Running   1          1m
❯ kn logs -p ingress-nginx-external-controller-865d96c779-8rstv nginx-ingress-controller
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.10.2
  Build:      git-5cbf17b0
  Repository: https://github.com/aledbf/ingress-nginx
-------------------------------------------------------------------------------

I0214 19:25:20.052413       8 flags.go:159] Watching for ingress class: nginx-external
W0214 19:25:20.052467       8 flags.go:162] only Ingress with class "nginx-external" will be processed by this ingress controller
I0214 19:25:20.053087       8 main.go:183] Creating API client for https://10.192.0.1:443
F0214 19:25:20.054335       8 main.go:257] Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration). Reason: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
Refer to the troubleshooting guide for more information: https://github.com/kubernetes/ingress-nginx/blob/master/docs/troubleshooting.md

Looking at the time between the logs though Creating API client for was at 19:25:20.053087 and then the error was at 19:25:20.054335, i'm not sure that backoff is actually working? Looks like it's failing off the first try?

Perhaps some more logs showing "Trying..." or something?

@aledbf
Copy link
Member

aledbf commented Feb 14, 2018

@Stono please use quay.io/aledbf/nginx-ingress-controller:0.329 (log added)

@Stono
Copy link
Contributor Author

Stono commented Feb 14, 2018 via email

@Stono
Copy link
Contributor Author

Stono commented Feb 15, 2018

Yup, only seeing the error once @aledbf:

❯ kn logs ingress-nginx-external-controller-8469d889b6-bmpsp nginx-ingress-controller
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.10.2
  Build:      git-0050b556
  Repository: https://github.com/aledbf/ingress-nginx
-------------------------------------------------------------------------------

I0215 08:22:24.551884       9 flags.go:159] Watching for ingress class: nginx-external
W0215 08:22:24.551941       9 flags.go:162] only Ingress with class "nginx-external" will be processed by this ingress controller
I0215 08:22:24.552528       9 main.go:183] Creating API client for https://10.192.0.1:443
I0215 08:22:24.553173       9 main.go:202] trying to discover Kubernetes version
E0215 08:22:24.553547       9 main.go:208] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
F0215 08:22:24.553576       9 main.go:259] Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration). Reason: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused

I'd expect to be seeing unexpected error discovering Kubernetes version multiple times

@aledbf
Copy link
Member

aledbf commented Feb 15, 2018

@Stono please use quay.io/aledbf/nginx-ingress-controller:0.330

@aledbf
Copy link
Member

aledbf commented Feb 15, 2018

and also quay.io/aledbf/nginx-ingress-controller:0.331

@aledbf
Copy link
Member

aledbf commented Feb 15, 2018

@Stono I think 0.331 is the one. I am sorry to bother with the tests but I cannot reproduce this issue

Edit: the log about the retries only appears if you use --v=2

@Stono
Copy link
Contributor Author

Stono commented Feb 15, 2018

It's OK @aledbf - no joy with 331 either, this is running with --v=2 too:

❯ kn logs ingress-nginx-external-controller-78c899b4fd-9sq7w -p nginx-ingress-controller
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.10.2
  Build:      git-7b1ce6b8
  Repository: https://github.com/aledbf/ingress-nginx
-------------------------------------------------------------------------------

I0215 17:43:32.061905       8 flags.go:159] Watching for ingress class: nginx-external
W0215 17:43:32.061990       8 flags.go:162] only Ingress with class "nginx-external" will be processed by this ingress controller
I0215 17:43:32.062623       8 main.go:183] Creating API client for https://10.192.0.1:443
I0215 17:43:32.063314       8 main.go:195] trying to discover Kubernetes version
I0215 17:43:32.063717       8 main.go:201] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
F0215 17:43:32.063826       8 main.go:252] Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration). Reason: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
Refer to the troubleshooting guide for more information: https://github.com/kubernetes/ingress-nginx/blob/master/docs/troubleshooting.md

@aledbf
Copy link
Member

aledbf commented Feb 15, 2018

@Stono last test, I promise :)
quay.io/aledbf/nginx-ingress-controller:0.332

@Stono
Copy link
Contributor Author

Stono commented Feb 15, 2018

Your retry works now, it's just still a bit too fast haha.

I0215 18:01:28.731695      12 main.go:183] Creating API client for https://10.192.0.1:443
I0215 18:01:28.732774      12 main.go:196] trying to discover Kubernetes version
I0215 18:01:28.733456      12 main.go:204] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 18:01:29.233693      12 main.go:196] trying to discover Kubernetes version
I0215 18:01:29.234104      12 main.go:204] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 18:01:29.733624      12 main.go:196] trying to discover Kubernetes version
I0215 18:01:29.734010      12 main.go:204] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 18:01:30.233654      12 main.go:196] trying to discover Kubernetes version
I0215 18:01:30.233955      12 main.go:204] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 18:01:30.733637      12 main.go:196] trying to discover Kubernetes version
I0215 18:01:30.733953      12 main.go:204] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 18:01:30.733973      12 main.go:196] trying to discover Kubernetes version
I0215 18:01:30.734077      12 main.go:204] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
F0215 18:01:30.734091      12 main.go:255] Error while initializing connection to Kubernetes apiserver. This most likely means that the cluster is misconfigured (e.g., it has invalid apiserver certificates or service accounts configuration). Reason: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
Refer to the troubleshooting guide for more information: https://github.com/kubernetes/ingress-nginx/blob/master/docs/troubleshooting.md

@aledbf
Copy link
Member

aledbf commented Feb 15, 2018

quay.io/aledbf/nginx-ingress-controller:0.333

@Stono
Copy link
Contributor Author

Stono commented Feb 15, 2018

I've gotta nip out for a few hours but will test that when i'm back - sorry!

@Stono
Copy link
Contributor Author

Stono commented Feb 15, 2018

It worked, yay! Nice work @aledbf

❯ kn logs ingress-nginx-external-controller-75cb9967b4-7zl82 -c nginx-ingress-controller
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:    0.10.2
  Build:      git-08392422
  Repository: https://github.com/aledbf/ingress-nginx
-------------------------------------------------------------------------------

I0215 21:55:19.010824      11 flags.go:159] Watching for ingress class: nginx-external
W0215 21:55:19.010919      11 flags.go:162] only Ingress with class "nginx-external" will be processed by this ingress controller
I0215 21:55:19.011609      11 main.go:183] Creating API client for https://10.192.0.1:443
I0215 21:55:19.012289      11 main.go:203] trying to discover Kubernetes version
I0215 21:55:19.012642      11 main.go:211] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 21:55:20.099269      11 main.go:203] trying to discover Kubernetes version
I0215 21:55:20.099665      11 main.go:211] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 21:55:21.652031      11 main.go:203] trying to discover Kubernetes version
I0215 21:55:21.652529      11 main.go:211] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 21:55:24.006623      11 main.go:203] trying to discover Kubernetes version
I0215 21:55:24.006988      11 main.go:211] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 21:55:27.660130      11 main.go:203] trying to discover Kubernetes version
I0215 21:55:27.660612      11 main.go:211] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 21:55:33.170121      11 main.go:203] trying to discover Kubernetes version
I0215 21:55:33.170525      11 main.go:211] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 21:55:41.219573      11 main.go:203] trying to discover Kubernetes version
I0215 21:55:41.219952      11 main.go:211] unexpected error discovering Kubernetes version: Get https://10.192.0.1:443/version: dial tcp 10.192.0.1:443: getsockopt: connection refused
I0215 21:55:53.461305      11 main.go:203] trying to discover Kubernetes version
I0215 21:55:53.469122      11 main.go:218] Running in Kubernetes Cluster version v1.9+ (v1.9.2-gke.1) - git (clean) commit 4ce7af72d8d343ea2f7680348852db641ff573af - platform linux/amd6
4
I0215 21:55:53.471588      11 main.go:86] validated ingress-nginx/default-http-backend as the default backend
I0215 21:55:53.473840      11 main.go:107] service ingress-nginx/ingress-nginx-external validated as source of Ingress status
I0215 21:55:53.644723      11 stat_collector.go:77] starting new nginx stats collector for Ingress controller running in namespace  (class nginx-external)
I0215 21:55:53.644758      11 stat_collector.go:78] collector extracting information from port 18080
I0215 21:55:53.657144      11 nginx.go:245] starting Ingress controller

@aledbf
Copy link
Member

aledbf commented Feb 15, 2018

@Stono this takes too long. Is this "normal"?

@aledbf
Copy link
Member

aledbf commented Feb 15, 2018

@Stono if you remove the istio proxy this there is no problem right?

@Stono
Copy link
Contributor Author

Stono commented Feb 15, 2018

@aledbf actually that is a really good point, I didn't realise just how long that was (30 seconds). Maybe someone from the Istio team can offer us support (I've raised a cross issue)

  • Init container implements IPTables rules which route all traffic to istio-proxy:15001
  • nginx-ingress-controller boots and isn't able to speak to the kubernetes api server until istio-proxy is available to handle the traffic

But yes, to answer your question without istio-proxy it boots instantly, first time (we have two clusters, one is istio, one is not)

@aledbf
Copy link
Member

aledbf commented Feb 15, 2018

But yes, to answer your question without istio-proxy it boots instantly, first time (we have two clusters, one is istio, one is not)

👍

@Stono Stono changed the title Controller first boot (or two) usually fails Controller first boot (or two) fails when combined with istio-proxy Feb 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants