[connect] race-condition registering multiple proxies, duplicate public_listener ports are assigned #8254
Comments
Minor update:
We've hit this issue and it's caused us a considerable amount of grief. We ended up moving to static Envoy ports.
Hey @Pavel-Chernikov and @dekimsey! This issue may have been fixed in a recent Consul version; may I ask what version you are running?
Hey @Amier3. We are running v1.10.3+ent and the issue was still there. We had to switch to static ports to work around it.
In our workflows, this only happens during bulk registrations on new hosts, which is currently pretty rare, and I believe our workflows were updated to perform registrations one at a time to work around the issue. So I cannot comment on whether it's still a problem. I'm (somewhat) glad to see @Pavel-Chernikov noticed it in an up-to-date release!
I think this is exactly it: the issue sporadically surfaces during bulk concurrent registrations.
@Amier3 Just to circle back to this: we are in the process of rebuilding some of our stack and are still running into this in 1.11.2.
Overview of the Issue
In our Consul Connect rollout we found that a subset of our proxies would appear to have started (running, reporting healthy) but would not have any `public_listener`s listed (detected via the Envoy admin API, `http://...:19000/listeners`). We found the Envoy proxies were failing to bind to their ports with `Address already in use`. Tracing it, we found another legitimate proxy registered to the same agent was already running on that port. Additionally, because the health check is a basic TCP check, Consul was unaware that the wrong proxy was responding on the port.
Reproduction Steps
Walking through the agent sidecar registration method, AgentRegisterService, we see roughly the following logic at play:
The port allocation occurs in ascending order (it used to be random, which makes collisions more likely now) and outside of the lock used to commit the service to the agent's state. This makes it easy for two concurrent API calls to obtain the same assigned `public_listener` port.
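The interleaving above can be sketched in a few lines of Go. This is a minimal model, not Consul's actual code: the helper names (`pickPort`, `commit`, `demoRace`) are illustrative, and the only assumptions carried over from the issue are the ascending scan of the default sidecar port range (21000-21255) and the fact that selection happens before the locked commit.

```go
package main

import (
	"fmt"
	"sync"
)

// A minimal model of the agent's sidecar port bookkeeping.
var (
	mu        sync.Mutex
	usedPorts = map[int]bool{}
)

// pickPort scans the sidecar range in ascending order and returns the first
// free port. Crucially, it reads usedPorts WITHOUT holding the lock,
// mirroring allocation happening before the service is committed to state.
func pickPort() int {
	for p := 21000; p <= 21255; p++ {
		if !usedPorts[p] {
			return p
		}
	}
	return -1
}

// commit records the assignment under the lock, mirroring the locked
// code path that persists the registration.
func commit(p int) {
	mu.Lock()
	defer mu.Unlock()
	usedPorts[p] = true
}

// demoRace interleaves two registrations the way two parallel API calls
// can: both select a port before either one commits.
func demoRace() (int, int) {
	a := pickPort() // caller 1 sees the first port in the range as free
	b := pickPort() // caller 2 sees the SAME port as free
	commit(a)
	commit(b)
	return a, b
}

func main() {
	a, b := demoRace()
	// prints caller1=21000 caller2=21000 duplicate=true
	fmt.Printf("caller1=%d caller2=%d duplicate=%v\n", a, b, a == b)
}
```

Because the read of `usedPorts` and the write that would make it visible to other callers are in separate critical sections, the ascending scan guarantees both callers converge on the same lowest free port.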
So a couple of options here:
- `validateService()` could test for proxy addr:port uniqueness and reject on conflict, perhaps requiring `sidecarServiceFromNodeService()` to reselect a port.
- Port selection/assignment should probably be delegated to `AddService` so that it happens inside the locked code.
I'm thinking the best solution would be to have `AddService` (perhaps via a callback) uniquely assign the connect proxy's port. Unfortunately, `AddService` isn't currently Connect-aware, and I don't see how to fix this without giving it some hint that this registration needs to happen while the `stateLock` is held.
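The second option can be sketched as follows. This is a hypothetical shape, not Consul's API: `allocatePortLocked` is an invented helper standing in for logic that would live inside `AddService`'s critical section; the port range is the default sidecar range.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu        sync.Mutex
	usedPorts = map[int]bool{}
)

// allocatePortLocked selects AND records the sidecar port within a single
// critical section, so no two registrations can ever observe the same free
// port. The name is hypothetical; the point is that selection and commit
// share one lock acquisition.
func allocatePortLocked() int {
	mu.Lock()
	defer mu.Unlock()
	for p := 21000; p <= 21255; p++ {
		if !usedPorts[p] {
			usedPorts[p] = true
			return p
		}
	}
	return -1 // range exhausted
}

func main() {
	// Even fully concurrent callers now receive distinct ports.
	var wg sync.WaitGroup
	ports := make([]int, 4)
	for i := range ports {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			ports[i] = allocatePortLocked()
		}(i)
	}
	wg.Wait()
	fmt.Println(ports) // four distinct ports from 21000-21003, order may vary
}
```

Folding selection into the locked commit trades a slightly longer critical section for correctness, which seems acceptable given that registration is not a hot path.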
Steps to reproduce this issue:
Register multiple services (`/v1/agent/service/register`) in parallel using default connect sidecar settings. Local agents will show duplicate port assignments for the connect proxies.
Consul info for both Client and Server
Client info
Server info
Operating system and Environment details
CentOS 7, x64.
Log Fragments
Envoy log excerpt:
Also reported here, though there is no activity on it at this time: https://discuss.hashicorp.com/t/consul-assigns-duplicate-public-listener-ports-to-envoy-proxies/11137