
[connect] race-condition registering multiple proxies, duplicate public_listener ports are assigned #8254

Closed
dekimsey opened this issue Jul 7, 2020 · 7 comments · Fixed by #14057
Labels
theme/connect (Anything related to Consul Connect, Service Mesh, Side Car Proxies) · theme/envoy/xds (Related to Envoy support) · type/bug (Feature does not function as expected)

Comments

dekimsey (Collaborator) commented Jul 7, 2020

Overview of the Issue

In our Consul Connect rollout we found that a subset of our proxies would appear to have started (running, reporting healthy) but would not have any public_listener listed (detected via the Envoy admin API, http://...:19000/listeners). We found the Envoy proxies were failing to bind to their ports with "Address already in use"; tracing it, we found another legitimate proxy registered to the same agent was already running on that port.

Additionally, because the sidecar health check is a basic TCP check, Consul was unaware that the wrong proxy was responding on the port.

Reproduction Steps

Walking through the agent sidecar registration method, AgentRegisterService, we see roughly the following logic at play:

// AgentRegisterService()
// ...
// See if we have a sidecar to register too (`sidecarServiceFromNodeService`)
  -> // Allocate port if needed (min and max inclusive).  <-- duplicate port is chosen here.
// Add the service (`AddService` or `AddServiceAndReplaceChecks`)
  -> // Lock
  -> // addServiceLocked
  -> // Unlock
// ...

The port allocation occurs in ascending order (it used to be random, which makes this more likely to happen now) and outside of the locking used to commit the service to the agent's state. This makes it easy for two concurrent API calls to obtain the same assigned public_listener port.
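To make the window concrete, here is a minimal, self-contained Go sketch of this check-then-act pattern (toy types and names, not Consul's actual code): the scan for a free port and the commit that records it run as two separate critical sections, so every concurrent caller can pick the same "free" port before any of them commits.

package main

import (
	"fmt"
	"sync"
)

// Toy agent: the used-port set plus the lock that guards commits.
type agent struct {
	mu   sync.Mutex
	used map[int]bool
}

// pickPort finds the first free port in [min, max]. It locks for the
// scan, but the lock is released before the caller commits, mirroring
// the gap between sidecarServiceFromNodeService and addServiceLocked.
func (a *agent) pickPort(min, max int) int {
	a.mu.Lock()
	defer a.mu.Unlock()
	for p := min; p <= max; p++ {
		if !a.used[p] {
			return p // chosen, but NOT yet recorded as used
		}
	}
	return -1
}

// commit records the port under the lock (the addServiceLocked step).
func (a *agent) commit(p int) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.used[p] = true
}

func main() {
	a := &agent{used: map[int]bool{}}
	ports := make(chan int, 10)
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			p := a.pickPort(21000, 21255) // port chosen here...
			a.commit(p)                   // ...committed later
			ports <- p
		}()
	}
	wg.Wait()
	close(ports)

	seen := map[int]int{}
	for p := range ports {
		seen[p]++
	}
	for p, n := range seen {
		if n > 1 {
			fmt.Printf("port %d assigned to %d proxies\n", p, n)
		}
	}
}

Run a few times and it typically reports the same port handed to several "proxies", which is exactly the duplicate public_listener assignment described above.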

So a couple of options here:

1. validateService() could test for proxy addr:port uniqueness and reject the registration, perhaps requiring sidecarServiceFromNodeService() to reselect a port.
2. Port selection/assignment should probably be delegated to AddService so that it happens inside the locked code.

I'm thinking the best solution would be to have AddService (perhaps via a callback) uniquely assign the connect proxy's port. Unfortunately, AddService isn't Connect-aware right now, and I don't see how to fix this without giving it some hint that this assignment needs to happen while the stateLock is held.
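A hedged sketch of what folding the assignment into the locked section could look like, reusing the toy agent from the sketch above (not Consul's actual internals):

// Fix sketch: select and record the port in ONE critical section, so no
// concurrent registration can observe the port as free in between.
func (a *agent) allocateAndCommit(min, max int) (int, error) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for p := min; p <= max; p++ {
		if !a.used[p] {
			a.used[p] = true // recorded before the lock is released
			return p, nil
		}
	}
	return 0, fmt.Errorf("no free sidecar ports in [%d, %d]", min, max)
}

In Consul terms, that means AddService (or a callback it invokes under the stateLock) would own both the scan and the registration.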

Steps to reproduce this issue, e.g.:

  1. Register many Connect-enabled services via the Consul agent API (/v1/agent/service/register) in parallel, using the default Connect sidecar settings (a minimal parallel driver is sketched after the payload below):
{"name": "foo", "connect": {"sidecar_service": {}}}

Local agents will show duplicate Port assignments for the connect proxies:

$ curl -s http://localhost:8500/v1/agent/services | python -m json.tool | grep '"Port": 21' -A5
        "Port": 21002,
        "Proxy": {
            "DestinationServiceID": "customer-merge-service",
            "DestinationServiceName": "customer-merge-service",
            "Expose": {},
            "LocalServiceAddress": "127.0.0.1",
--
        "Port": 21004,
        "Proxy": {
            "DestinationServiceID": "endpoint-common-service",
            "DestinationServiceName": "endpoint-common-service",
            "Expose": {},
            "LocalServiceAddress": "127.0.0.1",
--
        "Port": 21002,
        "Proxy": {
            "DestinationServiceID": "file-exchange-service",
            "DestinationServiceName": "file-exchange-service",
            "Expose": {},
            "LocalServiceAddress": "127.0.0.1",
--

Consul info for both Client and Server

Client info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 75
        services = 60
build:
        prerelease =
        revision = 1200f25e
        version = 1.6.2
consul:
        acl = enabled
        known_servers = 5
        server = false
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 444
        max_procs = 2
        os = linux
        version = go1.12.13
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 187
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 19910
        members = 347
        query_queue = 0
        query_time = 1
Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 1200f25e
        version = 1.6.2
consul:
        acl = enabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 10.70.249.210:8300
        server = true
raft:
        applied_index = 139808712
        commit_index = 139808712
        fsm_pending = 0
        last_contact = 24.054557ms
        last_log_index = 139808712
        last_log_term = 961
        last_snapshot_index = 139799574
        last_snapshot_term = 961
        latest_configuration = [{Suffrage:Voter ID:4d733416-bbb2-fd18-1c8f-12a71cd863ff Address:10.70.249.88:8300} {Suffrage:Voter ID:8fb426ab-8c42-aa6e-7950-4b3e1211d21a Address:10.70.249.210:8300} {Suffrage:Voter ID:083f7af1-6104-70f1-d119-073249808ec5 Address:10.70.249.173:8300} {Suffrage:Voter ID:abdb1913-5c78-ce4f-dadd-90e70ee9144b Address:10.70.249.89:8300} {Suffrage:Voter ID:1c378303-14d5-9962-3097-2389fb078442 Address:10.70.249.238:8300}]
        latest_configuration_index = 138808606
        num_peers = 4
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 961
runtime:
        arch = amd64
        cpu_count = 2
        goroutines = 1114
        max_procs = 2
        os = linux
        version = go1.12.13
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 187
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 19910
        members = 347
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 103
        members = 5
        query_queue = 0
        query_time = 1

Operating system and Environment details

CentOS 7, x64.

Log Fragments

Envoy log excerpt:

[info][config] [external/envoy/source/server/listener_manager_impl.cc:761] all dependencies initialized. starting workers
[debug][config] [external/envoy/source/common/config/grpc_mux_impl.cc:105] Resuming discovery requests for type.googleapis.com/envoy.api.v2.RouteConfiguration
[warning][config] [external/envoy/source/common/config/grpc_mux_subscription_impl.cc:72] gRPC config for type.googleapis.com/envoy.api.v2.Listener rejected: Error adding/updating listener(s) public_listener:0.0.0.0:21000: cannot bind '0.0.0.0:21000': Address already in use
[debug][main] [external/envoy/source/server/worker_impl.cc:101] worker entering dispatch loop

Also reported here, though there is no activity on it at this time: https://discuss.hashicorp.com/t/consul-assigns-duplicate-public-listener-ports-to-envoy-proxies/11137

dekimsey (Collaborator) commented Jul 7, 2020

Minor update: sidecarServiceFromNodeService makes an effort not to move the port around unnecessarily, so it's not possible to "reset" the assigned port without clearing the Connect settings, or perhaps deleting the registration and resubmitting it.
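Illustratively, reusing the toy agent from the sketches above (field and function names are hypothetical, not Consul's):

// Sticky-port sketch: an existing sidecar registration keeps its
// previously assigned port, so once a duplicate is chosen it persists
// across re-registrations until the service is deregistered or its
// Connect stanza is cleared.
type priorSidecar struct{ Port int }

func (a *agent) sidecarPort(prev *priorSidecar, min, max int) int {
	if prev != nil && prev.Port != 0 {
		return prev.Port // reused as-is, even if it now collides
	}
	return a.pickPort(min, max)
}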

blake added the theme/connect label on Jul 19, 2020
pchernikov-sxm commented
We've hit this issue and it's caused us a considerable amount of grief. We ended up moving to static Envoy ports.

Amier3 added the type/bug label on Nov 8, 2021
Amier3 (Contributor) commented Nov 12, 2021

Hey @Pavel-Chernikov and @dekimsey !

This issue may have been fixed in a recent Consul version; may I ask what version you're all running?

pchernikov-sxm commented
Hey @Amier3. We are running v1.10.3+ent and the issue was still there. We had to switch to static ports to work around it...

dekimsey (Collaborator) commented
In our workflows this only happens during bulk registrations on new hosts, which is currently pretty rare, and I believe our workflows were updated to perform the registrations one at a time to work around the issue. So I cannot comment on whether it's still a thing. I'm (somewhat) glad to see @Pavel-Chernikov noticed it in an up-to-date release!

pchernikov-sxm commented
I think this is exactly it - the issue sporadically surfaces during bulk concurrent registrations.

jkirschner-hashicorp added the theme/envoy/xds label on Nov 15, 2021
Amier3 changed the title on Jan 11, 2022 (prefixed "GH-1280", then reverted the same day)
dekimsey (Collaborator) commented Mar 9, 2022

@Amier3 Just to circle back on this: we are in the process of rebuilding some of our stack and are still running into this in 1.11.2.
