
Envoy http health checks missing #16958

Closed · madsholden opened this issue Apr 21, 2023 · 8 comments

@madsholden

Nomad version

1.5.0

Operating system and Environment details

Ubuntu 22.04.2
Consul 1.15.2

Issue

We have multiple web services running in a Nomad cluster, registering themselves with Consul with an http health check. They use Consul Connect for http communication between themselves. We are using blue/green deployments in Nomad by setting the canary count equal to the job count. Our services are configured to shut down gracefully when sent a kill signal: they start failing health checks right away, wait for 10 seconds, wait for all open connections to finish, then stop the process.

From what I understand from the Envoy docs, Envoy uses the Consul service catalog to add instances to its routing table, but it continues to route to instances that have simply disappeared from Consul; it relies on failing health checks to remove them.

When doing a redeployment of a service, we see some 503 responses from the old service instances. Those responses come exactly when the old instances finish shutting down. I believe this is caused by missing health checks in Envoy, so the old instances aren't removed from Envoy's routing table before they are shut down and all requests fail.

Are my assumptions correct? Is there any way to fix this problem?

Job file (if appropriate)

job "sample-job" {
  type   = "service"
  region = "eu-west-1"

  update {
    max_parallel      = 3
    canary            = 3
    auto_promote      = true
    auto_revert       = true
    min_healthy_time  = "10s"
    healthy_deadline  = "2m"
    progress_deadline = "0"
  }

  group "sample-job" {
    count = 3

    service {
      name = "sample-job-admin"
      port = "admin"

      check {
        type = "http"
        port = "admin"
        path = "/healthcheck"
        interval = "1s"
        timeout = "2s"
      }
    }

    service {
      name = "sample-job"
      port = "service"

      check {
        type = "http"
        port = "admin"
        path = "/healthcheck"
        interval = "1s"
        timeout = "2s"
      }

      connect {
        sidecar_task {
          config {
            image = "envoyproxy/envoy:v1.25-latest"

            args = [
              "-c",
              "$${NOMAD_SECRETS_DIR}/envoy_bootstrap.json",
              "--config-yaml",
              "{admin: {address: {socket_address: {port_value: $${NOMAD_PORT_envoy}, address: '0.0.0.0'}}}}",
              "-l",
              "$${meta.connect.log_level}",
              "--concurrency",
              "$${meta.connect.proxy_concurrency}",
              "--disable-hot-restart"
            ]
          }
        }

        sidecar_service {
          proxy {
            upstreams {
              destination_name = "another-job"
              local_bind_port  = 8000
            }
          }
        }
      }
    }

    network {
      mode = "bridge"

      port "envoy" {}
      port "admin" {}
      port "service" {}
    }

    task "samplejob" {
      driver = "docker"

      kill_timeout = "40s"

      env {
        CONFIG_FORCE_server_port      = "${NOMAD_PORT_service}"
        CONFIG_FORCE_server_adminPort = "${NOMAD_PORT_admin}"
      }

      config {
        image = "docker-image:v2"
        ports = ["service", "admin"]
      }
    }
  }
}
@shoenig (Contributor) commented Apr 24, 2023

Hi @madsholden, have you tried setting shutdown_delay on the service blocks?

https://developer.hashicorp.com/nomad/docs/job-specification/service#service-lifecycle

In doing so, there is a gap between de-registration of the service and when the initial kill_signal is issued, meaning traffic should be redirected elsewhere further in advance rather than waiting on Consul to detect (and propagate) the failing check statuses.
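For illustration, a minimal sketch of how such a delay could look in the job spec above; the group-level placement and the 15s value are assumptions, see the linked doc for where shutdown_delay can be set:

group "sample-job" {
  count = 3

  # Assumed example value: wait 15s between Consul de-registration and the
  # kill signal, so routing away from this instance can happen before it
  # begins its own graceful shutdown.
  shutdown_delay = "15s"

  # services, network, and task blocks unchanged from the job spec above
}

The intent is that by the time the application starts its 10-second drain, the service has already been de-registered from Consul for the length of the delay.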

@madsholden (Author)

Yes, I did try setting that as well; I had it at 20 seconds for a while. Unfortunately we saw the same thing then.

What I can see is that requests opening new connections through the proxies work fine; they go to the new instances. But our applications that use keep-alive connections end up staying with an old instance until it stops completely.

@shoenig (Contributor) commented Apr 26, 2023

@madsholden talking with the Consul team, one thing to try would be to configure the upstream as an http service using a Service Defaults config entry: https://developer.hashicorp.com/consul/docs/connect/config-entries/service-defaults. In that case Envoy should be able to send a Connection: close header or an HTTP/2 GOAWAY when draining: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining#draining
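For reference, a minimal sketch of what such a service-defaults config entry could look like; the service name is an assumption based on the "another-job" upstream in the job spec above:

# service-defaults.hcl -- apply with: consul config write service-defaults.hcl
Kind     = "service-defaults"
Name     = "another-job"
Protocol = "http"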

@madsholden (Author)

Thank you, that seems like it fixed it. I made a slightly different fix, but I guess it does the same thing. I changed this part of the job spec:

sidecar_service {
  proxy {
    config {
      protocol = "http"
    }
    upstreams {
      config {
        protocol = "http"
      }
      destination_name = "another-job"
      local_bind_port  = 8000
    }
  }
}

I'm not sure whether I need to set the protocol for both the proxy and the upstreams, though.

@madsholden (Author) commented Apr 28, 2023

After some more testing, it did indeed fix my 503 problem. However, after setting the protocol to http, websockets stopped working. I found this Consul issue, which matches what I see. It looks like websockets aren't supported in Consul Connect at the moment when using the http protocol.

hashicorp/consul#8283

@shoenig (Contributor) commented Apr 28, 2023

Thanks for the follow-up @madsholden. I'll go ahead and close this issue since the source of the 503s is understood. Be sure to give a 👍 on that Consul ticket, though it seems plenty of other folks are also asking for that feature.

shoenig closed this as completed Apr 28, 2023
@madsholden (Author)

Thanks for the help. Unfortunately I can't use Consul Connect at all because of this; I can't afford any downtime on redeployment.

Anyway, I would recommend adding both the protocol = "http" trick and the missing websocket support to the Nomad documentation. I spent quite a long time testing before opening this issue.

@github-actions (bot)

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Jan 11, 2025