Envoy http health checks missing #16958
Comments
Hi @madsholden, have you tried the setting described at https://developer.hashicorp.com/nomad/docs/job-specification/service#service-lifecycle? Setting it leaves a gap between de-registration of the service and when the initial kill signal is sent to the task.
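The specific setting isn't named above; assuming it refers to the `shutdown_delay` described on that page, a minimal sketch might look like this (the group name and the 20-second value are illustrative):

```hcl
group "web" {
  # Assumed setting: delay the kill signal after Consul de-registration so
  # the mesh stops sending new requests before the application shuts down.
  shutdown_delay = "20s"
}
```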
Yes, I did try setting that as well; I had it at 20 seconds for a while. Unfortunately we saw the same thing. What I can see is that requests on new connections going through the proxies work fine and reach the new instances, but our applications that use keep-alive connections stay pinned to an old instance until it stops completely.
@madsholden, talking with the Consul team, one thing to try would be to configure the upstream with the `http` protocol.
Thank you, that seems like it fixed it. I made a slightly different fix, but I guess it does the same thing: I changed the protocol setting in this part of the job spec.
I'm not sure whether I need to set the protocol for both the proxy and the upstreams?
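A rough sketch of the kind of change being described, assuming it sets `protocol = "http"` in the Connect sidecar proxy's `config` block and, optionally, on the upstream as well; the service names and ports are placeholders:

```hcl
service {
  name = "frontend"
  port = "http"

  connect {
    sidecar_service {
      proxy {
        # Treat this service's traffic as HTTP (L7) in Envoy.
        config {
          protocol = "http"
        }

        upstreams {
          destination_name = "backend"
          local_bind_port  = 9090

          # Optionally repeat it per upstream; Consul normally derives the
          # upstream protocol from the destination service's own settings.
          config {
            protocol = "http"
          }
        }
      }
    }
  }
}
```

On the Consul side, the protocol is more commonly declared once per service with a `service-defaults` config entry (`Protocol = "http"`); when each destination service declares its protocol there, per-upstream overrides are usually unnecessary.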
After some more testing, it did indeed fix my 503 problem. However, when setting the protocol to http, websockets stopped working. I found this Consul issue which matches what I see. It looks like websockets aren't currently supported in Consul Connect when using the http protocol.
Thanks for the follow-up @madsholden - I'll go ahead and close this issue since the source of the 503s is understood. Be sure to give a 👍 on that Consul ticket - it seems plenty of other folks are asking for that feature as well.
Thanks for the help. Unfortunately I can't use Consul Connect at all because of this; I can't afford any downtime on redeployment. Anyway, I would recommend adding both the …
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
1.5.0
Operating system and Environment details
Ubuntu 22.04.2
Consul 1.15.2
Issue
We have multiple web services running in a Nomad cluster, registering themselves with Consul with an http health check. They use Consul Connect for http communication between themselves. We use blue/green deployments in Nomad by setting the canary count equal to the job count. Our services are configured to shut down gracefully when sent a kill signal: they start failing health checks right away, wait 10 seconds, wait for all open connections to finish, and then stop the process.
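A minimal sketch of the setup described above, with blue/green achieved by making the canary count equal to the group count and an http check registered in Consul; the names, counts, ports, and paths here are assumptions, not taken from the actual job:

```hcl
group "web" {
  count = 3

  update {
    # Canary count equal to the instance count gives blue/green behaviour:
    # a full new set is started before the old set is stopped.
    canary       = 3
    max_parallel = 3
    auto_promote = false
  }

  network {
    mode = "bridge"
    port "http" { to = 8080 }
  }

  service {
    name = "web"
    port = "http"

    check {
      type     = "http"
      path     = "/health"
      interval = "5s"
      timeout  = "2s"
      expose   = true   # route the agent's check through the Envoy sidecar
    }

    connect {
      sidecar_service {}
    }
  }
}
```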
From what I understand of the Envoy docs, Envoy uses the Consul service catalog to add instances to its routing table, but it continues to route to instances that have simply disappeared from Consul; instead, it relies on failing health checks to remove them.
When redeploying a service, we see some 503 responses from the old service instances. Those responses come exactly when the old instances finish shutting down. I believe this is caused by missing health checks in Envoy: the old instances aren't removed from Envoy's routing table before they shut down, so requests still routed to them fail.
Are my assumptions correct? Is there any way to fix this problem?
Job file (if appropriate)