
Envoy http health checks missing #16958

Closed · madsholden opened this issue Apr 21, 2023 · 8 comments

@madsholden

Nomad version

1.5.0

Operating system and Environment details

Ubuntu 22.04.2
Consul 1.15.2

Issue

We have multiple web services running in a Nomad cluster, registering themselves with Consul with an http health check. They use Consul Connect for http communication between themselves. We are using blue/green deployments in Nomad by setting the canary count equal to the job count. Our services are configured to shut down gracefully when sent a kill signal: they start failing health checks right away, wait for 10 seconds, wait for all open connections to finish, then stop the process.

From what I understand from the Envoy docs, Envoy uses the Consul service catalog to add instances to its routing table, but it continues to route to instances that have simply disappeared from Consul; it relies on failing health checks to remove them.

When doing a redeployment of a service, we see some 503 responses from the old service instances. Those responses come exactly when the old instances finish shutting down. I believe this is caused by missing health checks in Envoy, so the old instances aren't removed from Envoy's routing table before they are shut down and all requests fail.

Are my assumptions correct? Is there any way to fix this problem?

Job file (if appropriate)

job "sample-job" {
  type   = "service"
  region = "eu-west-1"

  update {
    max_parallel      = 3
    canary            = 3
    auto_promote      = true
    auto_revert       = true
    min_healthy_time  = "10s"
    healthy_deadline  = "2m"
    progress_deadline = "0"
  }

  group "sample-job" {
    count = 3

    service {
      name = "sample-job-admin"
      port = "admin"

      check {
        type = "http"
        port = "admin"
        path = "/healthcheck"
        interval = "1s"
        timeout = "2s"
      }
    }

    service {
      name = "sample-job"
      port = "service"

      check {
        type = "http"
        port = "admin"
        path = "/healthcheck"
        interval = "1s"
        timeout = "2s"
      }

      connect {
        sidecar_task {
          config {
            image = "envoyproxy/envoy:v1.25-latest"

            args = [
              "-c",
              "$${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
              "--config-yaml",
              "{admin: {address: {socket_address: {port_value: $${NOMAD_PORT_envoy}, address: '0.0.0.0'}}}}",
              "-l",
              "$${meta.connect.log_level}",
              "--concurrency",
              "$${meta.connect.proxy_concurrency}",
              "--disable-hot-restart"
            ]
          }
        }

        sidecar_service {
          proxy {
            upstreams {
              destination_name = "another-job"
              local_bind_port  = 8000
            }
          }
        }
      }
    }

    network {
      mode = "bridge"

      port "envoy" {}
      port "admin" {}
      port "service" {}
    }

    task "samplejob" {
      driver = "docker"

      kill_timeout = "40s"

      env {
        CONFIG_FORCE_server_port      = "${NOMAD_PORT_service}"
        CONFIG_FORCE_server_adminPort = "${NOMAD_PORT_admin}"
      }

      config {
        image = "docker-image:v2"
        ports = ["service", "admin"]
      }
    }
  }
}
@shoenig (Contributor) commented Apr 24, 2023

Hi @madsholden, have you tried setting shutdown_delay on the service blocks?

https://developer.hashicorp.com/nomad/docs/job-specification/service#service-lifecycle

In doing so, there is a gap between de-registration of the service and when the initial kill_signal is issued, meaning traffic should be redirected elsewhere further in advance rather than waiting on Consul to detect (and propagate) the failing check statuses.
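For illustration, a minimal sketch of how such a delay could look in the job spec above; the group-level placement and the 15s value are assumptions, see the linked doc for where shutdown_delay can be set:

group "sample-job" {
  count = 3

  # Assumed example value: wait 15s between Consul de-registration and the
  # kill signal, so routing away from this instance can happen before it
  # begins its own graceful shutdown.
  shutdown_delay = "15s"

  # services, network, and task blocks unchanged from the job spec above
}

The intent is that by the time the application starts its 10-second drain, the service has already been de-registered from Consul for the length of the delay.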

@madsholden (Author)

Yes, I did try setting that as well; I had it at 20 seconds for a while. Unfortunately we saw the same thing then.

What I can see is that requests opening new connections through the proxies work fine; they go to the new instances. But our applications that use keep-alive connections end up staying with an old instance until it stops completely.

@shoenig (Contributor) commented Apr 26, 2023

@madsholden talking with the Consul team, one thing to try would be to configure the upstream as an http service using a Service Defaults config entry: https://developer.hashicorp.com/consul/docs/connect/config-entries/service-defaults. In that case Envoy should be able to send a Connection: close header or an HTTP/2 GOAWAY when draining: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining#draining
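For reference, a minimal sketch of what such a service-defaults config entry could look like; the service name is an assumption based on the "another-job" upstream in the job spec above:

# service-defaults.hcl -- apply with: consul config write service-defaults.hcl
Kind     = "service-defaults"
Name     = "another-job"
Protocol = "http"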

@madsholden (Author)

Thank you, that seems like it fixed it. I made a slightly different fix, but I guess it does the same thing. I changed this part of the job spec:

sidecar_service {
  proxy {
    config {
      protocol = "http"
    }
    upstreams {
      config {
        protocol = "http"
      }
      destination_name = "another-job"
      local_bind_port  = 8000
    }
  }
}

I'm not sure whether I need to set the protocol for both the proxy and the upstreams, though.

@madsholden (Author) commented Apr 28, 2023

After some more testing, it did indeed fix my 503 problem. However, after setting the protocol to http, websockets stopped working. I found this Consul issue, which matches what I see. It looks like websockets aren't supported in Consul Connect at the moment when using the http protocol.

hashicorp/consul#8283

@shoenig (Contributor) commented Apr 28, 2023

Thanks for the follow-up @madsholden. I'll go ahead and close this issue since the source of the 503s is understood. Be sure to give a 👍 on that Consul ticket, though it seems plenty of other folks are also asking for that feature.

shoenig closed this as completed Apr 28, 2023
@madsholden (Author)

Thanks for the help. Unfortunately I can't use Consul Connect at all because of this; I can't afford any downtime on redeployment.

Anyway, I would recommend adding both the protocol = "http" trick and the missing websocket support to the Nomad documentation. I spent quite a long time testing before opening this issue.

@github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Jan 11, 2025