Nginx pod crashed when scaling up #4742

MyMirelHub · 2019-11-07T15:49:20Z

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

NGINX Ingress controller version: 0.26.1

Kubernetes version (use kubectl version): v1.13.11-gke.9

Environment:

Cloud provider or hardware configuration: GKE

What happened:

When scaling the Nginx pods from 1 to 3 or more replicas, the last pod got stuck in a crash loop (with 3 replicas, 2 pods worked; with 4 replicas, 3 pods worked). The logs of the faulty pod showed:

requeuing initial-sync, err Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused

followed by a SIGTERM:

W1105 15:11:19.292480       6 nginx_status.go:172] unexpected error obtaining nginx status info: Get http://127.0.0.1:10246/nginx_status: dial tcp 127.0.0.1:10246: connect: connection refused
2019/11/05 15:11:19 Get http://127.0.0.1:10246/nginx_status: dial tcp 127.0.0.1:10246: connect: connection refused
W1105 15:11:19.302976       6 nginx_status.go:172] unexpected error obtaining nginx status info: Get http://127.0.0.1:10246/nginx_status: dial tcp 127.0.0.1:10246: connect: connection refused
I1105 15:11:34.193295       6 controller.go:150] Backend successfully reloaded.
I1105 15:11:34.193339       6 controller.go:159] Initial sync, sleeping for 1 second.
W1105 15:11:35.295501       6 controller.go:177] Dynamic reconfiguration failed: Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused
E1105 15:11:35.295543       6 controller.go:181] Unexpected failure reconfiguring NGINX:
Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused
W1105 15:11:35.295554       6 queue.go:130] requeuing initial-sync, err Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused
I1105 15:11:35.296059       6 controller.go:134] Configuration changes detected, backend reload required.
I1105 15:11:37.591419       6 main.go:153] Received SIGTERM, shutting down
I1105 15:11:37.591460       6 nginx.go:390] Shutting down controller queues
I1105 15:11:37.591479       6 status.go:117] updating status of Ingress rules (remove)
E1105 15:11:37.991484       6 controller.go:146] Unexpected failure reloading the backend:
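
To make the timing in the excerpt above easier to read, here is a minimal sketch that parses klog-style lines and measures the gap between two events. The helper names (`parse_klog`, `seconds_between`) are hypothetical and not part of ingress-nginx. The year is an assumption taken from the issue date, because the log header only carries month and day. The sample lines are copied from the log above.

```python
import re
from datetime import datetime

# klog header: severity letter, MMDD, HH:MM:SS.microseconds, thread id, file:line]
KLOG = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ "
    r"(?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)


def parse_klog(lines, year=2019):
    """Return (timestamp, severity, source, message) for each klog-style line.

    Lines that do not match the header format are skipped. `year` is an
    assumption, since the header has no year field.
    """
    events = []
    for line in lines:
        m = KLOG.match(line)
        if m is None:
            continue
        ts = datetime.strptime(f"{year}{m['date']} {m['time']}", "%Y%m%d %H:%M:%S.%f")
        events.append((ts, m["sev"], m["src"], m["msg"]))
    return events


def seconds_between(events, first_marker, second_marker):
    """Seconds from the first message containing first_marker to the next
    message (at or after it) containing second_marker."""
    start = next(ts for ts, _, _, msg in events if first_marker in msg)
    end = next(ts for ts, _, _, msg in events if ts >= start and second_marker in msg)
    return (end - start).total_seconds()


SAMPLE = [
    "W1105 15:11:19.292480       6 nginx_status.go:172] unexpected error obtaining nginx status info: Get http://127.0.0.1:10246/nginx_status: dial tcp 127.0.0.1:10246: connect: connection refused",
    "I1105 15:11:34.193295       6 controller.go:150] Backend successfully reloaded.",
    "W1105 15:11:35.295501       6 controller.go:177] Dynamic reconfiguration failed: Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused",
    "I1105 15:11:37.591419       6 main.go:153] Received SIGTERM, shutting down",
]

if __name__ == "__main__":
    events = parse_klog(SAMPLE)
    print(seconds_between(events, "connection refused", "Received SIGTERM"))
    print(seconds_between(events, "Backend successfully reloaded", "Received SIGTERM"))
```

On these four lines the SIGTERM arrives about 18.3 seconds after the first connection-refused warning and about 3.4 seconds after the reload. The log itself does not say what sent the SIGTERM, so the sketch only measures the gaps.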

What you expected to happen:

When scaling the pods, if Nginx is failing to pick up /configuration/backends, I would expect either all the pods to fail or none of them.

How to reproduce it (as minimally and precisely as possible):
I was unable to reproduce the issue. Before I could troubleshoot in detail, the pods started correctly the next morning.

NAME                                                 READY   STATUS    RESTARTS   AGE
nginx-ingress-controller-services-5f8777589b-g7fql   1/1     Running   0          80m
nginx-ingress-controller-services-5f8777589b-pdntg   1/1     Running   0          81m
nginx-ingress-controller-services-5f8777589b-xqg2j   1/1     Running   287        18h

This is the moment the faulty pod (287 restarts) became healthy:

W 2019-11-06T08:55:09.792863Z Dynamic reconfiguration failed: Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused 
E 2019-11-06T08:55:09.792902Z Unexpected failure reconfiguring NGINX: 
W 2019-11-06T08:55:09.792912Z requeuing initial-sync, err Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused 
I 2019-11-06T08:55:09.793143Z Configuration changes detected, backend reload required. 
I 2019-11-06T08:55:12.791405Z Backend successfully reloaded. 
I 2019-11-06T08:55:12.791461Z Initial sync, sleeping for 1 second.

No changes had been made to the Nginx deployment during this time period, but one of the service backends Nginx was picking up had a faulty Docker tag. The time we fixed it corresponds to the time the Nginx pod became healthy.

I tried redeploying that backend with the bad Docker tag and got:

W1105 15:11:35.295554       6 queue.go:130] requeuing initial-sync, err Post http://127.0.0.1:10246/configuration/backends: dial tcp 127.0.0.1:10246: connect: connection refused

But this was followed by "Backend successfully reloaded.", and I was unable to reproduce the issue.

Anything else we need to know:
Anything else we can view to further diagnose or reproduce this type of issue?

fejta-bot · 2020-02-05T16:11:21Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Feb 5, 2020

MyMirelHub closed this as completed Feb 5, 2020

steinarvk-oda mentioned this issue Apr 28, 2021

Fix buggy retry logic in syncIngress() #7086

Closed

8 tasks

davideshay mentioned this issue Mar 10, 2022

Fix for buggy ingress sync with retries #8325

Merged

9 tasks