
NGINX 502 Bad Gateway when using a single replication #4375

Closed
BuddhiWathsala opened this issue Jul 29, 2019 · 9 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@BuddhiWathsala

BuddhiWathsala commented Jul 29, 2019

Is this a request for help? No

What keywords did you search in NGINX Ingress controller issues before filing this one? nginx-ingress, zero-downtime, single replica


Is this a BUG REPORT or FEATURE REQUEST?: BUG REPORT

NGINX Ingress controller version: 0.23.0

Kubernetes version (use kubectl version): v1.15.0

Environment: minikube version: v1.2.0

What happened:
I need to deploy an HTTP app with zero downtime, and I have the restriction of using a single pod only. The problem is that some HTTP requests get a 502 Bad Gateway when I use the NGINX ingress.

I followed the answers given in two related issues (#489 and #322). Those answers work fine when I use more than a single replica, but with a single replica NGINX still has a slight downtime, which is less than 1 millisecond.

The lifecycle spec and rolling update spec of my deployment are set as below, according to the answers given in the above issues.

spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate
  ...
  template:
    ...
    spec:
      containers:
      - ...
        lifecycle:
          preStop:
            exec:
              command:
              - sleep
              - "30"

Note that I have ConfigMaps mounted into this deployment. I'm not sure whether that affects this downtime or not.

Also, I referred to the following blogs, but their approaches didn't work for this single-pod scenario.
[1]: https://blog.sebastian-daschner.com/entries/zero-downtime-updates-kubernetes
[2]: http://rahmonov.me/posts/zero-downtime-deployment-with-kubernetes/

What you expected to happen:
The pod receives HTTP requests without any downtime.

How to reproduce it:

  1. Create a deployment with one pod that runs an HTTP server.
  2. Create a service with a ClusterIP.
  3. Set up an NGINX ingress that routes to the above service (see the sketch after this list).
  4. Send continuous requests.
  5. Change the deployment spec and apply the changes using kubectl apply. You will then see failed HTTP requests.
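
For steps 2 and 3, a minimal sketch of the Service and Ingress (the names, labels, and ports are placeholders; networking.k8s.io/v1beta1 is the Ingress API version available on Kubernetes v1.15):

apiVersion: v1
kind: Service
metadata:
  name: http-app           # placeholder name
spec:
  type: ClusterIP
  selector:
    app: http-app          # must match the deployment's pod labels
  ports:
  - port: 80
    targetPort: 8080       # placeholder container port
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: http-app
spec:
  rules:
  - http:
      paths:
      - path: /
        backend:
          serviceName: http-app
          servicePort: 80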

Anything else we need to know:

According to blog [2], we can achieve zero downtime even with a single replica of the pod, without an ingress. So why can't it be achieved when I use the NGINX ingress?

@dcherniv

@BuddhiWathsala can you change the service to the LoadBalancer type and run two tests simultaneously, one hitting the ingress and the other hitting the load balancer endpoint?
It would be interesting to see whether the issue is the pod being unavailable or NGINX itself.
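
A sketch of the service I mean, reusing the placeholder names from above (adjust selector and ports to your app):

apiVersion: v1
kind: Service
metadata:
  name: http-app-lb        # placeholder; selects the same pods as the ClusterIP service
spec:
  type: LoadBalancer       # on minikube, `minikube tunnel` can expose this externally
  selector:
    app: http-app
  ports:
  - port: 80
    targetPort: 8080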

@BuddhiWathsala
Author

@dcherniv I ran the two tests. When using a single replica, the LoadBalancer configuration also has downtime. As far as I understand, the problem resides at the pod level. When we make some change to the deployment, the existing pod starts to terminate. The HTTP connections currently established with that pod get 502 Bad Gateway because the pod can no longer process requests.

But I am confused about why this problem does not arise when I have multiple pods. When I have multiple pods, does the NGINX controller intelligently redirect the failing requests to other available pods without returning an error to the user?

@dcherniv

dcherniv commented Aug 1, 2019

@BuddhiWathsala
Is this a deployment or a statefulset?
The behavior you are describing applies to statefulsets, where a new pod will not be brought up until the old one is terminated. Looking at your spec, you do have the update strategy set up properly. This strategy will bring up a new pod and ONLY then terminate the old one. If the current pod terminates before the new one is brought up, then that's the problem. That being said, I cannot reproduce this with the spec below. In any case, this is probably not an issue with the NGINX ingress controller, which we ruled out by hitting the LoadBalancer directly:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: echoserver
spec:
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
    type: RollingUpdate   
  replicas: 1
  template:
    metadata:
      labels:
        app: echoserver
    spec:
      containers:
      - image: gcr.io/google_containers/echoserver:1.0
        imagePullPolicy: Always
        name: echoserver
        ports:
        - containerPort: 8080


@BuddhiWathsala
Author

@dcherniv I have a deployment. Also, I agree with your argument. As far as I understand now, the problem is at the pod level: at termination time, the pod still has HTTP connections that are already established. When the pod receives SIGTERM, it terminates immediately and cannot send responses on those already-established connections. Therefore we get 502 responses.

Since this is not an issue in ingress-nginx, I can close the issue.

But I have one thing to clarify, as I asked previously: I don't understand why I did not receive this 502 response when I had 2 pods.

When I have 2 pods, does the NGINX controller intelligently redirect the failing requests to another available pod without returning an error to the user?

@nic-6443
Contributor

nic-6443 commented Aug 11, 2019

@BuddhiWathsala When an idempotent request (for example, one using the GET method) fails, NGINX can be configured to retry it on another upstream. In that scenario, the upstream field in the access log entry for a retried request will have multiple values. You can check it.
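
In ingress-nginx this can be tuned per ingress via annotations; a minimal sketch, reusing the placeholder http-app service from above (verify the annotation names and values against your controller version):

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: http-app
  annotations:
    # Conditions under which NGINX tries the next upstream instead of
    # returning the error to the client.
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502"
    # Cap how many upstreams are tried for a single request.
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
spec:
  rules:
  - http:
      paths:
      - path: /
        backend:
          serviceName: http-app
          servicePort: 80

Note that with a single replica there is no second upstream to retry against, which would explain why the 502s only surface in the one-pod case.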

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 9, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 9, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
