Nginx pod crashed when scaling up #4742

Closed
MyMirelHub opened this issue Nov 7, 2019 · 1 comment
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@MyMirelHub

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

NGINX Ingress controller version: 0.26.1

Kubernetes version (use kubectl version): v1.13.11-gke.9

Environment:

  • Cloud provider or hardware configuration: GKE

What happened:

When scaling the Nginx pods from 1 to 3 or more replicas, the last pod would get stuck in a crash loop (with 3 replicas, 2 pods were working; with 4 replicas, 3 pods were working). The logs of the faulty pod showed:

requeuing initial-sync, err Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused

Followed by a SIGTERM:

W1105 15:11:19.292480       6 nginx_status.go:172] unexpected error obtaining nginx status info: Get http://127.0.0.1:10246/nginx_status: dial tcp 127.0.0.1:10246: connect: connection refused
2019/11/05 15:11:19 Get http://127.0.0.1:10246/nginx_status: dial tcp 127.0.0.1:10246: connect: connection refused
W1105 15:11:19.302976       6 nginx_status.go:172] unexpected error obtaining nginx status info: Get http://127.0.0.1:10246/nginx_status: dial tcp 127.0.0.1:10246: connect: connection refused
I1105 15:11:34.193295       6 controller.go:150] Backend successfully reloaded.
I1105 15:11:34.193339       6 controller.go:159] Initial sync, sleeping for 1 second.
W1105 15:11:35.295501       6 controller.go:177] Dynamic reconfiguration failed: Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused
E1105 15:11:35.295543       6 controller.go:181] Unexpected failure reconfiguring NGINX:
Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused
W1105 15:11:35.295554       6 queue.go:130] requeuing initial-sync, err Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused
I1105 15:11:35.296059       6 controller.go:134] Configuration changes detected, backend reload required.
I1105 15:11:37.591419       6 main.go:153] Received SIGTERM, shutting down
I1105 15:11:37.591460       6 nginx.go:390] Shutting down controller queues
I1105 15:11:37.591479       6 status.go:117] updating status of Ingress rules (remove)
E1105 15:11:37.991484       6 controller.go:146] Unexpected failure reloading the backend:
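
To read the timeline in an excerpt like the one above, the klog-style lines can be parsed and the gap between the last `connection refused` entry and the SIGTERM measured. The following is only a minimal sketch: the function names are hypothetical, and because klog headers carry no year, 2019 is assumed from the date of this report. For the excerpt above, it gives a gap of roughly 2.3 seconds.

```python
import re
from datetime import datetime

# klog header: severity letter, MMDD, time with microseconds, thread id, file:line, message.
KLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ "
    r"(?P<source>[^\]]+)\] ?(?P<msg>.*)$"
)


def parse_klog(line, year=2019):
    """Return (severity, timestamp, source, message), or None for non-klog lines.

    The year is an assumption: klog headers do not include one.
    """
    m = KLOG_LINE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(
        f"{year}-{m['month']}-{m['day']} {m['time']}", "%Y-%m-%d %H:%M:%S.%f"
    )
    return m["sev"], ts, m["source"], m["msg"]


def seconds_from_last_refusal_to_sigterm(lines):
    """Seconds between the last 'connection refused' entry and the first SIGTERM after it."""
    last_refused = None
    for line in lines:
        parsed = parse_klog(line)
        if parsed is None:
            continue  # continuation lines such as "Post http://..." have no header
        _, ts, _, msg = parsed
        if "connection refused" in msg:
            last_refused = ts
        elif "Received SIGTERM" in msg and last_refused is not None:
            return (ts - last_refused).total_seconds()
    return None
```

Feeding it the output of `kubectl logs` for the faulty pod, split into lines, would give the same measurement for every restart cycle, which may help show whether the pod is always terminated after a similar delay.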

What you expected to happen:

When scaling the pods, if Nginx is failing to pick up /configuration/backends, I would expect either all the pods to fail or none of them.

How to reproduce it (as minimally and precisely as possible):
I was unable to reproduce the issue. Before I was able to troubleshoot in detail, the pods started correctly the next morning.

NAME                                                 READY   STATUS    RESTARTS   AGE
nginx-ingress-controller-services-5f8777589b-g7fql   1/1     Running   0          80m
nginx-ingress-controller-services-5f8777589b-pdntg   1/1     Running   0          81m
nginx-ingress-controller-services-5f8777589b-xqg2j   1/1     Running   287        18h

This is the moment the faulty pod (287 restarts) became healthy:

W 2019-11-06T08:55:09.792863Z Dynamic reconfiguration failed: Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused 
E 2019-11-06T08:55:09.792902Z Unexpected failure reconfiguring NGINX: 
W 2019-11-06T08:55:09.792912Z requeuing initial-sync, err Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused 
I 2019-11-06T08:55:09.793143Z Configuration changes detected, backend reload required. 
I 2019-11-06T08:55:12.791405Z Backend successfully reloaded. 
I 2019-11-06T08:55:12.791461Z Initial sync, sleeping for 1 second. 

No changes had been made to the Nginx deployment during this time period, but one of the service backends Nginx was picking up had a faulty docker tag. The time at which we fixed that tag corresponds to the time the Nginx pod became healthy.

I tried redeploying the backend with the bad docker tag and got:

W1105 15:11:35.295554       6 queue.go:130] requeuing initial-sync, err Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused

But then the log showed "Backend successfully reloaded." and I was unable to reproduce the issue.

Anything else we need to know:
Anything else we can view to further diagnose or reproduce this type of issue?
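
One thing that might help narrow this down if it happens again is recording exactly when 127.0.0.1:10246 (the address taken from the logs above) starts accepting connections in the faulty pod. The sketch below is only an illustration under assumptions: `probe_tcp` and `watch` are hypothetical helper names, and it presumes a Python interpreter is available in the pod or in a debug container sharing its network namespace.

```python
import socket
import time


def probe_tcp(host="127.0.0.1", port=10246, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" as well as timeouts.
        return False


def watch(host="127.0.0.1", port=10246, attempts=30, interval=1.0):
    """Probe repeatedly and return a list of (unix_time, reachable) samples."""
    samples = []
    for i in range(attempts):
        samples.append((time.time(), probe_tcp(host, port)))
        if i < attempts - 1:
            time.sleep(interval)
    return samples
```

Comparing the timestamps returned by `watch()` with the controller log could show whether the port ever opens before the SIGTERM arrives, or whether the pod is terminated while the port is still closed.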

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 5, 2020