Difference in how connections are handled for draining listeners from v1.19.1 to v1.20.x,v1.21.x #20113

Closed · shashankram opened this issue Feb 24, 2022 · 30 comments · Fixed by #20702

Comments

@shashankram

shashankram commented Feb 24, 2022

Title: Difference in how connections are handled for draining listeners from v1.19.1 to v1.20.x,v1.21.x

Description:
I am observing a behavior difference in how HTTP connections are handled while the listener is draining in v1.20+ compared to v1.19.1.

Scenario: Egress listener is removed and is in draining state

In v1.19.1: curl http://foo.bar returns curl: (7) ... connection refused
In v1.21.1: curl http://foo.bar hangs, and returns only after the listener drain interval.

See #20113 (comment) for more details.

What has changed between these two versions to result in a change in behavior for connections during listener drain?

UPDATE:
I can confirm that this issue seems to be introduced by the change ba474ac
I cannot reproduce this issue with the previous commit e0380ad

shashankram added the triage (Issue requires triage) label on Feb 24, 2022
@shashankram
Author

@snowp @alyssawilk

@daixiang0
Member

Could you share the config or steps to reproduce it?

@shashankram
Author

@daixiang0 please refer to the following:

On v1.19.1:

  1. Remove outbound-listener on client (egress listener)
  2. From the client, curl the destination host service-ee146f3f-5f32-4625-8ed0-9c1fb1b55c13.server:
# curl http://service-ee146f3f-5f32-4625-8ed0-9c1fb1b55c13.server/
curl: (7) Failed to connect to service-ee146f3f-5f32-4625-8ed0-9c1fb1b55c13.server port 80: Connection refused

The outbound-listener is in draining state when the above request is made:

    {
     "name": "outbound-listener",
     "draining_state": {
      "version_info": "1",
      "listener": {
       "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
       "name": "outbound-listener",
listener_manager.total_filter_chains_draining: 4
listener_manager.total_listeners_draining: 1

Config:
envoy_v1.19.1_dump.txt

Stats:
envoy_v1.19.1_stats.txt

On v1.21.1

  1. Remove outbound-listener on client (egress listener)
  2. From the client, curl the destination host service-5c209bc6-6fbb-448a-a341-c079c4060c96.server, it hangs:
# curl http://service-5c209bc6-6fbb-448a-a341-c079c4060c96.server/
^C

The outbound-listener is in draining state when the above request is made:

    {
     "name": "outbound-listener",
     "draining_state": {
      "version_info": "1",
      "listener": {
       "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
       "name": "outbound-listener",
listener_manager.total_filter_chains_draining: 4
listener_manager.total_listeners_draining: 1

Config:
envoy_v1.21.1_dump.txt

Stats:
envoy_v1.21.1_stats.txt

@lambdai
Contributor

lambdai commented Feb 25, 2022

Since the listener is updated filter chain by filter chain, it's probably hitting the issue I found and analyzed here.

That PR addresses the minor issue in the sequence below:

1. Add listener 0.0.0.0:80. Open fds = {X}.
2. Complete an in-place listener update; the old listener is draining its filter chains. This draining state lasts for drain_timeout. The new listener creates listen fd = Y. Now open fds = {X, Y}.
3. Immediately stop the listener. Open fds = {X}, closed fds = {Y}.
4. After drain_timeout, closed fds = {X, Y}.

Expected:
After step 3 but before step 4, a TCP client connecting to :80 should see a reset.

Actual:
A TCP client connecting to :80 connects successfully because the old listener's fd X is still open. It's a minor issue because Envoy doesn't serve the TCP client.
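To make the expected vs. actual behavior concrete, here is a minimal sketch with plain Python sockets (not Envoy code; the port and request payload are arbitrary) of why a listen fd that stays open lets a client complete the TCP handshake and then hang, while a closed fd yields a refusal:

# Illustration only: a bound, listening socket that is never accept()ed from
# still completes TCP handshakes in the kernel backlog, so connect() succeeds
# but the request goes unanswered; once the listen fd is closed, clients are refused.
import socket

PORT = 8080  # arbitrary demo port

old_fd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # stand-in for the old listener's fd X
old_fd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
old_fd.bind(("127.0.0.1", PORT))
old_fd.listen(8)

client = socket.create_connection(("127.0.0.1", PORT), timeout=2)
print("connect() succeeded even though nothing accepts the connection")
client.sendall(b"GET / HTTP/1.1\r\nHost: example\r\n\r\n")
client.settimeout(2)
try:
    client.recv(1)  # nothing serves the request: this is the observed "hang"
except socket.timeout:
    print("request hangs while the listen fd stays open")
client.close()

old_fd.close()  # step 4 above: the last open fd is closed
try:
    socket.create_connection(("127.0.0.1", PORT), timeout=2)
except ConnectionRefusedError:
    print("after the listen fd is closed: connection refused")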

@shashankram
Author

shashankram commented Feb 26, 2022

> Since the listener is updated filter chain by filter chain, it's probably hitting the issue I found and analyzed here [...]

@lambdai thanks for sharing this. Based on #18677 (comment), it doesn't seem like there is an issue related to the bug you were fixing. Are you certain it's the same issue as this bug? Also note that I do not see this issue in v1.19.1, but consistently hit this when using v1.21.1.

/cc @mattklein123

@soulxu
Member

soulxu commented Feb 27, 2022

I can't reproduce this issue on 1.21.1 or the main branch.

I used a very simple listener, shown below:

 {
   "@type": "type.googleapis.com/envoy.admin.v3.ListenersConfigDump",
   "version_info": "1",
   "dynamic_listeners": [
    {
     "name": "listener_0",
     "active_state": {
      "version_info": "1",
      "listener": {
       "@type": "type.googleapis.com/envoy.config.listener.v3.Listener",
       "name": "listener_0",
       "address": {
        "socket_address": {
         "address": "0.0.0.0",
         "port_value": 10000
        }
       },
       "filter_chains": [
        {
         "filters": [
          {
           "name": "envoy.filters.network.http_connection_manager",
           "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager",
            "stat_prefix": "http",
            "rds": {
             "config_source": {
              "api_config_source": {
               "api_type": "GRPC",
               "grpc_services": [
                {
                 "envoy_grpc": {
                  "cluster_name": "xds_cluster"
                 }
                }
               ],
               "set_node_on_first_message_only": true,
               "transport_api_version": "V3"
              },
              "resource_api_version": "V3"
             },
             "route_config_name": "local_route"
            },
            "http_filters": [
             {
              "name": "envoy.filters.http.router"
             }
            ]
           }
          }
         ]
        }
       ]
      },
      "last_updated": "2022-02-27T15:52:02.773Z"
     }
    }
   ]
  },

Then I remove listener_0.

Before the drain timeout is reached, I connect to the listener port; the connection doesn't hang, it is refused directly:

$ curl -v http://127.0.0.1:10000
*   Trying 127.0.0.1:10000...
* TCP_NODELAY set
* connect to 127.0.0.1 port 10000 failed: Connection refused
* Failed to connect to 127.0.0.1 port 10000: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 127.0.0.1 port 10000: Connection refused

@shashankram I'm not sure I understand your case clearly; let me know if I didn't test the correct scenario.

config.txt

@shashankram
Author

shashankram commented Feb 28, 2022

listener_manager.total_filter_chains_draining: 4
listener_manager.total_listeners_draining: 1

@soulxu could you kindly use the exact build type and image? I noticed you are using a DEBUG build; please use the following in your test:

      "version": {
       "major_number": 1,
       "minor_number": 21,
       "patch": 1
      },
      "metadata": {
       "ssl.version": "BoringSSL",
       "revision.status": "Clean",
       "revision.sha": "af50070ee60866874b0a9383daf9364e884ded22",
       "build.type": "RELEASE"
      }

I am using the envoyproxy/envoy-alpine:v1.21.1@sha256:c2c0b7c136f9daf09c36ef550635eaf5db26465529a1354f725f11390e530879 image from Dockerhub.

I am also using https://github.com/envoyproxy/go-control-plane/releases/tag/v0.10.1 to program Envoy.

@lambdai
Contributor

lambdai commented Feb 28, 2022

@shashankram

I don't believe that issue is fixed. The PR was closed because the issue is not critical and a more elegant patch is desired.

@shashankram
Author

> I don't believe that issue is fixed. The PR was closed because the issue is not critical and a more elegant patch is desired.

@lambdai Are you certain the issue I am seeing (connection hang when listener is draining) is related to the issue you pointed out? Also why is this not seen in v1.19.1? With v1.21.1, I can consistently reproduce this in my internal environment.

@shashankram
Author

shashankram commented Feb 28, 2022

This seems like the issue described in #18616.
@lambdai is this issue not fully addressed in v1.21.1 by #18686? Is there a workaround for this?

@shashankram
Author

Based on packet captures, I can conclude that the outbound listener on port 15001 does not refuse connections when the listener is in draining state.

client: 10.244.2.198
local envoy: 127.0.0.1:15001
server: 10.96.40.153

Packet capture after HTTP GET (screenshot attached):

shashankram changed the title from "Difference in how connections are handled for draining listeners from v1.19.1 to v1.21.1" to "Difference in how connections are handled for draining listeners from v1.19.1 to v1.20.x,v1.21.x" on Feb 28, 2022
@shashankram
Author

I can confirm that this issue seems to be introduced by the change ba474ac (/cc @mattklein123)

I cannot reproduce this issue with the previous commit e0380ad

@mattklein123
Member

As @lambdai said, there is a known issue in the related issue. @lambdai I lost the thread on that since I thought you had fixed it. Should that issue be reopened so we can track it?

In terms of this issue, I can't say whether it's the same or not. Can you provide a self-contained repro and I can take a look at some point? Thanks.

@shashankram
Author

shashankram commented Mar 1, 2022

I tried to reproduce this in a standalone environment using filesystem-based dynamic configuration. I used the example https://www.envoyproxy.io/docs/envoy/latest/start/quick-start/configuration-dynamic-filesystem and removed the listener so it starts to drain. I couldn't reproduce the issue exactly as in my original environment, because the listener_manager.total_filter_chains_draining metric does not increment in the standalone environment.

As per #20113 (comment), when this issue happens, I see the following stats when the listener is removed:

listener_manager.total_filter_chains_draining: 4
listener_manager.total_listeners_draining: 1

While using filesystem-based configuration, I see:

listener_manager.total_filter_chains_draining: 0
listener_manager.total_listeners_draining: 1

Note: the CDS and LDS configs for the filesystem-based setup are taken from https://www.envoyproxy.io/docs/envoy/latest/start/quick-start/configuration-dynamic-filesystem.

@mattklein123 Is there a way I can force the total_filter_chains_draining metric to increment in a standalone environment using the quick-start example?
Also were you able to spot anything suspicious from the config dumps and stats as noted in #20113 (comment)?
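For reference, a small sketch of how I am checking these counters via the admin endpoint (assuming the admin interface is reachable at 127.0.0.1:9901; adjust the address/port to whatever your bootstrap uses):

import urllib.request

ADMIN = "http://127.0.0.1:9901"  # assumed admin address; change to match your bootstrap

# The admin /stats endpoint accepts a regex filter; print only draining-related counters.
with urllib.request.urlopen(f"{ADMIN}/stats?filter=draining", timeout=5) as resp:
    print(resp.read().decode().strip())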

@shashankram
Author

@mattklein123 I was able to repro this on a standalone environment.

Steps:

  1. Deploy Envoy locally exactly as described in https://www.envoyproxy.io/docs/envoy/latest/start/sandboxes/dynamic-configuration-filesystem

  2. Update the lds.yaml to have multiple filter_chains with filter_chain_match:

resources:
- "@type": type.googleapis.com/envoy.config.listener.v3.Listener
  name: listener_0
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 10000
  filter_chains:
  - filter_chain_match:
      destination_port: 10000 
    filters:
      name: envoy.filters.network.http_connection_manager
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
        stat_prefix: ingress_http
        http_filters:
        - name: envoy.filters.http.router
        route_config:
          name: local_route
          virtual_hosts:
          - name: local_service
            domains:
            - "*"
            routes:
            - match:
                prefix: "/"
              route:
                cluster: example_proxy_cluster
  - filter_chain_match:
      destination_port: 20000 
    filters:
      name: envoy.filters.network.http_connection_manager
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
        stat_prefix: ingress_http
        http_filters:
        - name: envoy.filters.http.router
        route_config:
          name: local_route
          virtual_hosts:
          - name: local_service
            domains:
            - "*"
            routes:
            - match:
                prefix: "/"
              route:
                cluster: example_proxy_cluster

Ensure curl succeeds:

$ curl http://localhost:10000/
Request served by service2

HTTP/1.1 GET /

Host: localhost:10000
User-Agent: curl/7.68.0
Accept: */*
X-Forwarded-Proto: http
X-Request-Id: faeb81a4-2f1f-4022-981f-a5e253826666
X-Envoy-Expected-Rq-Timeout-Ms: 15000

  3. Remove listener_0 by changing lds.yaml to:
resources: []
  4. Confirm curl hangs:
$ curl http://localhost:10000/
^C
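To tell the two behaviors apart without waiting on curl, here is a small probe sketch (plain Python, assuming the listener from the steps above on 127.0.0.1:10000): on v1.19.1 the connect itself is refused, while on v1.20.x/v1.21.x the connect succeeds but no response arrives.

import socket

def probe(host="127.0.0.1", port=10000, timeout=3.0):
    # Distinguish "refused at connect time" from "connected but never answered".
    try:
        s = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "connection refused (v1.19.1-style behavior)"
    try:
        s.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        s.settimeout(timeout)
        return f"got response: {s.recv(64)!r}"
    except socket.timeout:
        return "connected but no response (the hang reported here)"
    finally:
        s.close()

print(probe())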

@mattklein123
Member

OK thanks for taking the time to provide a repro. @lambdai are you interested in looking at this? Otherwise I can try to find time to take a look.

@lambdai
Contributor

lambdai commented Mar 2, 2022

@mattklein123 Sorry I have too many things on my plate this week. I can look early next week

@shashankram
Author

@mattklein123, @lambdai were you able to repro this?

@shashankram
Author

> @mattklein123, @lambdai were you able to repro this?

Still blocked on this to be able to update to the latest Envoy version. Please let me know if there is a workaround.

@soulxu
Member

soulxu commented Mar 30, 2022

> envoyproxy/envoy-alpine:v1.21.1

I got some bandwidth, but I still can't reproduce this yet.

Are you using this image directly https://hub.docker.com/r/envoyproxy/envoy-alpine/tags ?

@shashankram
Author

> I got some bandwidth, but I still can't reproduce this yet. Are you using this image directly https://hub.docker.com/r/envoyproxy/envoy-alpine/tags ?

See #20113 (comment) for repro steps.

@soulxu
Member

soulxu commented Mar 31, 2022

> See #20113 (comment) for repro steps.

Yes, I tried those steps, but I can't reproduce the issue.

I updated the config to remove the listener:

[2022-03-31 02:35:57.554][1136423][info][upstream] [source/server/lds_api.cc:58] lds: remove listener 'listener_0'
[2022-03-31 02:35:57.554][1136423][debug][config] [source/common/config/filesystem_subscription_impl.cc:88] Filesystem config update accepted for /home/hejiexu/go/src/github/soulxu/envoy-example/current.yaml:

Then the connection is refused directly:

$ curl -v http://127.0.0.1:10000
*   Trying 127.0.0.1:10000...
* TCP_NODELAY set
* connect to 127.0.0.1 port 10000 failed: Connection refused
* Failed to connect to 127.0.0.1 port 10000: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 127.0.0.1 port 10000: Connection refused

@shashankram
Author

I can still reproduce this using the latest-dev image exactly as described in #20113 (comment)

> I can confirm that this issue seems to be introduced by the change ba474ac (/cc @mattklein123)
>
> I cannot reproduce this issue with the previous commit e0380ad

@soulxu please confirm you are following the steps exactly as described, and please share the detailed steps you followed. Note that you need to have multiple filter chains on the listener first, and then remove the entire listener to reproduce the issue.

@shashankram
Author

I can reproduce this issue when dynamic lds.yaml is changed from lds-1 -> lds-2 -> lds-3:

  1. lds-1.yaml:
resources:
- "@type": type.googleapis.com/envoy.config.listener.v3.Listener
  name: listener_0
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 10000
  filter_chains:
  - filters:
      name: envoy.filters.network.http_connection_manager
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
        stat_prefix: ingress_http
        http_filters:
        - name: envoy.filters.http.router
        route_config:
          name: local_route
          virtual_hosts:
          - name: local_service
            domains:
            - "*"
            routes:
            - match:
                prefix: "/"
              route:
                cluster: example_proxy_cluster

Request does not hang:

$ curl -sv http://localhost:10000
*   Trying 127.0.0.1:10000...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 10000 (#0)
> GET / HTTP/1.1
> Host: localhost:10000
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: text/plain
< date: Thu, 31 Mar 2022 16:41:51 GMT
< content-length: 215
< x-envoy-upstream-service-time: 6
< server: envoy
< 
Request served by service1

HTTP/1.1 GET /

Host: localhost:10000
User-Agent: curl/7.68.0
Accept: */*
X-Forwarded-Proto: http
X-Request-Id: 75f17df0-3cb1-4f45-a47f-17fd4f5da9e2
X-Envoy-Expected-Rq-Timeout-Ms: 15000
* Connection #0 to host localhost left intact

  2. lds-2.yaml:
resources:
- "@type": type.googleapis.com/envoy.config.listener.v3.Listener
  name: listener_0
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 10000
  filter_chains:
  - filter_chain_match:
      destination_port: 10000 
    filters:
      name: envoy.filters.network.http_connection_manager
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
        stat_prefix: ingress_http
        http_filters:
        - name: envoy.filters.http.router
        route_config:
          name: local_route
          virtual_hosts:
          - name: local_service
            domains:
            - "*"
            routes:
            - match:
                prefix: "/"
              route:
                cluster: example_proxy_cluster
  - filter_chain_match:
      destination_port: 20000 
    filters:
      name: envoy.filters.network.http_connection_manager
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
        stat_prefix: ingress_http
        http_filters:
        - name: envoy.filters.http.router
        route_config:
          name: local_route
          virtual_hosts:
          - name: local_service
            domains:
            - "*"
            routes:
            - match:
                prefix: "/"
              route:
                cluster: example_proxy_cluster

Request does not hang:

$ curl -sv http://localhost:10000
*   Trying 127.0.0.1:10000...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 10000 (#0)
> GET / HTTP/1.1
> Host: localhost:10000
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-type: text/plain
< date: Thu, 31 Mar 2022 16:42:16 GMT
< content-length: 215
< x-envoy-upstream-service-time: 0
< server: envoy
< 
Request served by service1

HTTP/1.1 GET /

Host: localhost:10000
User-Agent: curl/7.68.0
Accept: */*
X-Forwarded-Proto: http
X-Request-Id: a123daa8-7774-4b72-8fae-5f280bdf84c4
X-Envoy-Expected-Rq-Timeout-Ms: 15000
* Connection #0 to host localhost left intact

  3. lds-3.yaml:
resources: []

Request hangs:

$ curl -sv http://localhost:10000
*   Trying 127.0.0.1:10000...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 10000 (#0)
> GET / HTTP/1.1
> Host: localhost:10000
> User-Agent: curl/7.68.0
> Accept: */*
> 
^C
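A small poll-loop sketch (assuming the same 127.0.0.1:10000 listener and the default drain timeout) that reports when the port finally starts refusing connections, i.e. when the draining listener's socket is actually closed:

import socket
import time

start = time.time()
while True:
    try:
        # Only test whether the TCP handshake still completes; don't send a request.
        socket.create_connection(("127.0.0.1", 10000), timeout=2).close()
        print(f"+{time.time() - start:6.1f}s: connect still succeeds (listener draining)")
    except ConnectionRefusedError:
        print(f"+{time.time() - start:6.1f}s: connection refused (drain complete)")
        break
    time.sleep(5)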

@soulxu
Member

soulxu commented Apr 6, 2022

Sorry for the late reply, and thanks for the reproduction steps! I can reproduce it now.

Previously I only started up with lds-2.yaml and then removed the listener, which did not reproduce the problem. With the extra lds-1.yaml step first, it reproduces now. I will dig into it.

@soulxu
Member

soulxu commented Apr 6, 2022

/assign @soulxu

@shashankram
Author

@soulxu, which release has this fix?

@shashankram
Author

> @soulxu, which release has this fix?

Seems like this is available on v1.22.0 and v1.22.2

@soulxu
Member

soulxu commented Jun 17, 2022

> @soulxu, which release has this fix?
>
> Seems like this is available on v1.22.0 and v1.22.2

You are right, it is in v1.22.0 and later releases.

shashankram added a commit to shashankram/osm that referenced this issue Jun 21, 2022
- Updates Envoy to its latest available version
  (v1.22.2 for Linux, v1.22.1 for Windows).
  The latest version includes the latest released
  security fix.
  We could not update Envoy previously due to a blocking bug:
  envoyproxy/envoy#20113

- Updates filter names to custom names as wellknown
  names are deprecated in Envoy (with 1 exception
  for the http.rbac filter). Envoy will use the
  TypeURL in the proto to determine which filter
  to use instead. Wellknown names are not required
  and using them is confusing because not all filters
  are defined in the legacy wellknown pkg (e.g.
  http.local_ratelimit).
  See:
  envoyproxy/envoy#21759
  envoyproxy/envoy#21763
  envoyproxy/go-control-plane#293
  envoyproxy/go-control-plane#552

- Uses the distroless image as the alpine image has been
  discontinued: envoyproxy/envoy#21758

- Updates tests to use custom filter names

- Adds `proto_types.go` to aid dynamic proto resolution
  for typed configs using `any.Any()`. This helps resolve
  protos where dynamic resolution is necessary.

- Updated Prometheus' ConfigMap to reflect changes to
  Envoy metrics prefixes

Signed-off-by: Shashank Ram <[email protected]>
shashankram added a commit to openservicemesh/osm that referenced this issue Jun 22, 2022, with the same commit message as above.