
Envoy v1.20.0 slower to re-initialize dynamic listeners #18616

Closed
sunjayBhatia opened this issue Oct 13, 2021 · 13 comments · Fixed by #18686

@sunjayBhatia
Member

Description:
It appears Envoy v1.20.0 is a bit slower to re-initialize dynamic listeners whose configuration has changed.

As part of regular dependency bumps, when bumping Contour's dependency on Envoy from v1.19.1 to v1.20.0 (see PR) we have seen our end-to-end tests fail or take quite a lot longer to run to completion. These tests repeatedly create/delete k8s Ingress/Contour HTTPProxy/Gateway API *Route objects, which the Contour controller uses to program Envoy dynamically over xDS. The tests update these k8s resources and then assert that requests to Envoy succeed. We've seen the time it takes for requests to succeed after a config update increase substantially when bumping to v1.20.0 with the same Contour controller code.

With a git bisect of commits between v1.19.0 and v1.20.0, it seems we can narrow this slowdown down to this commit, which changes the default for Envoy to use the SO_REUSEPORT socket option and also changes how listener updates work. Granted, this is not the most exact/minimal test case, but it appears this commit might be related.

Repro steps:
I can try to work up a smaller repro, but for now running the Contour e2e tests has been how we have experimented, capturing logs and fiddling with timeouts, etc.

Theories/mitigations

  • We are using the default Go HTTP client in tests; logging request errors in our test suite, we see TLS handshake timeout errors for HTTPS requests, and plain HTTP requests hang forever (the default Go client has no request timeout, oops on our part for using that!)
  • It appears the updated listener update/worker logic holds open connections that never succeed, maybe as part of what is described in the linked change as the effort to not drop connections on draining listeners
  • In the Contour tests, we can mitigate the slowdowns by adding stricter timeouts to HTTP requests, though anecdotally listener updates still look much slower
@howardjohn
Contributor

We are seeing the same in Istio. Essentially, it looks like when a listener drains, some connections (maybe 50%, but I didn't get a large enough sample size to say it's exactly 50/50) hang. netstat shows them stuck in Recv-Q.

Suspecting the reuseport stuff, we disabled it via runtime flag. When we did this, we got a different error - NACKs from address already in use.

cc @mattklein123 @lambdai

@sunjayBhatia
Member Author

sunjayBhatia commented Oct 13, 2021

Suspecting the reuseport stuff, we disabled it via runtime flag. When we did this, we got a different error - NACKs from address already in use.

Same for Contour, except we set the enable_reuse_port listener setting to false rather than using runtime config.

@lambdai
Contributor

lambdai commented Oct 13, 2021

Regarding enable reuse port:

The underlying expectation is that the kernel should re-queue the sockets from the dying passive socket (old listener) to another passive socket (new listener).
More importantly, libevent should somehow be notified and wake one or more worker threads.

I am not sure which piece is broken.

@mattklein123
Member

mattklein123 commented Oct 13, 2021

The underlying expectation is that the kernel should re-queue the sockets from the dying passive socket (old listener) to another passive socket (new listener). More importantly, libevent should somehow be notified and wake one or more worker threads.

Unfortunately this is not the way it works. Sockets are not re-queued. A lot of care was put into making sure that we don't wind up with dead sockets that are never accepted, and I'm surprised to hear this is happening. I'm happy to look into this if we can come up with a more self-contained repro?

mattklein123 added the area/listener, investigate, and no stalebot labels and removed the bug and triage labels Oct 13, 2021
mattklein123 self-assigned this Oct 13, 2021
@lambdai
Contributor

lambdai commented Oct 13, 2021

I am interested in the behavior with reuse_port disabled.

Since 1.20 the underlying fd is duplicated anyway, regardless of whether the old listener's listen fd is closed or not. See also ListenerManagerImpl::setNewOrDrainingSocketFactory.
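
For context, a minimal standalone sketch (plain POSIX sockets on Linux, not Envoy code; the port and names are illustrative) of what duplicating a listen fd means: the duplicate refers to the same underlying kernel socket and accept queue, so closing the original fd does not drop connections queued on that socket.

// Standalone sketch, error handling omitted: a dup()'d listen fd shares the
// same kernel socket object and accept queue as the original fd.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int listen_fd = ::socket(AF_INET, SOCK_STREAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(8080);
  ::bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  ::listen(listen_fd, 1024);

  int dup_fd = ::dup(listen_fd); // same socket, same accept queue
  ::close(listen_fd);            // connections queued on the socket are still acceptable via dup_fd

  std::printf("accepting on duplicated fd %d\n", dup_fd);
  int conn = ::accept(dup_fd, nullptr, nullptr); // blocks until a client connects
  if (conn >= 0) {
    ::close(conn);
  }
  ::close(dup_fd);
  return 0;
}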

@sunjayBhatia
Member Author

sunjayBhatia commented Oct 14, 2021

a more self-contained repro

Working on something smaller; I was able to get a little spike going that does what is happening with Contour, and was able to reproduce earlier in this smaller setup, but now it's not reproducible consistently enough to share yet 😅

@lambdai
Contributor

lambdai commented Oct 15, 2021

This is what I observed today. Ideally we want the new listener fds to always refer to the same underlying sockets, as with fd=64 and fd=58 below. However, I do see the unexpected scenario below.
To avoid confusion: the concurrency is 4 and reuse_port is enabled.

istio-proxy@istio-ingressgateway-85bfd47c66-rfv2r:/$ ss -nltp
State            Recv-Q           Send-Q                     Local Address:Port                       Peer Address:Port           Process           
// hitless listener update
LISTEN           0                1024                             0.0.0.0:8080                            0.0.0.0:*               users:(("envoy",pid=16,fd=64),("envoy",pid=16,fd=58))           
LISTEN           0                1024                             0.0.0.0:8080                            0.0.0.0:*               users:(("envoy",pid=16,fd=63),("envoy",pid=16,fd=57))           
LISTEN           0                1024                             0.0.0.0:8080                            0.0.0.0:*               users:(("envoy",pid=16,fd=62),("envoy",pid=16,fd=56))           
LISTEN           0                1024                             0.0.0.0:8080                            0.0.0.0:*               users:(("envoy",pid=16,fd=59),("envoy",pid=16,fd=55))           
// requests lost across the update: the 5 + 1 + 2 queued connections will not be served.
LISTEN           5                1024                             0.0.0.0:8080                            0.0.0.0:*               users:(("envoy",pid=16,fd=54))                                  
LISTEN           1                1024                             0.0.0.0:8080                            0.0.0.0:*               users:(("envoy",pid=16,fd=53))                                  
LISTEN           0                1024                             0.0.0.0:8080                            0.0.0.0:*               users:(("envoy",pid=16,fd=52))                                  
LISTEN           2                1024                             0.0.0.0:8080                            0.0.0.0:*               users:(("envoy",pid=16,fd=51))      

@mattklein123
Member

mattklein123 commented Oct 16, 2021

Ideally we want the new listener fds to always refer to the same underlying sockets, as with fd=64 and fd=58

I think what you are seeing here looks correct to me. Each worker has its own fd, which is duplicated.

@lambdai if you have a repro, can you check the following. In the case of a listener update, we should be "cloning" the socket factory which goes through this code:

for (auto& socket : factory_to_clone.sockets_) {
  // In the cloning case we always duplicate() the socket. This makes sure that during listener
  // update/drain we don't lose any incoming connections when using reuse_port. Specifically on
  // Linux the use of SO_REUSEPORT causes the kernel to allocate a separate queue for each socket
  // on the same address. Incoming connections are immediately assigned to one of these queues.
  // If connections are in the queue when the socket is closed, they are closed/reset, not sent to
  // another queue. So avoid making extra queues in the kernel, even temporarily.
  //
  // TODO(mattklein123): In the current code as long as the address matches, the socket factory
  // will be cloned, effectively ignoring any changed socket options. The code should probably
  // block any updates to listeners that use the same address but different socket options.
  // (It's possible we could handle changing some socket options, but this would be tricky and
  // probably not worth the difficulty.)
  sockets_.push_back(socket->duplicate());
}

Can you make sure this code is being invoked during the update and duplicating the socket for each worker? Assuming this is working, the only things I can think of are something like:

  1. There is some kernel bug around duplicating SO_REUSEPORT sockets such that connections are still getting dropped.
  2. Somehow events are getting dropped so that we aren't waking up the duplicated sockets to accept.

It's hard for me to tell from the above dump exactly what is going on (how many listeners, whether duplication occurred, etc.). If you can repro, perhaps you can add some more logging so we can get a better idea of what is going on?

@lambdai
Contributor

lambdai commented Oct 18, 2021

@lambdai if you have a repro, can you check the following. In the case of a listener update, we should be "cloning" the socket factory which goes through this code:

This piece seems good.

I may have found the root cause.

During the in-place update, a ListenerImpl is created and the ListenSocketFactory is cloned. The listen sockets are cloned, too.

However, with an in-place update the old listener is not entirely drained: the old ListenSocketFactory::closeAllSockets is not invoked immediately after the new listener is warmed up.

The old listen sockets are closed in RAII style, which happens only after the drain_timeout.

That is to say, the OS keeps assigning incoming connections to both the old listen fd and the new listen fd, but no handler is watching and accepting on the old listen fd. Those connections are lost.
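
To make the failure mode concrete, here is a minimal standalone sketch (plain POSIX sockets on Linux, not Envoy code; the port and names are illustrative) of two SO_REUSEPORT listeners on the same port where only one is being accepted from: the kernel spreads incoming connections across both accept queues, and the connections hashed to the unserviced queue just sit in Recv-Q, matching the ss output above.

// Standalone sketch, error handling omitted: two SO_REUSEPORT listeners on the
// same port each get their own kernel accept queue. Connections hashed to a
// queue that nobody accepts from show up as non-zero Recv-Q on that LISTEN
// socket in `ss -nltp` and are never served.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>

static int make_reuseport_listener(uint16_t port) {
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  int one = 1;
  ::setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)); // Linux-specific
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port);
  ::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  ::listen(fd, 1024);
  return fd;
}

int main() {
  int old_fd = make_reuseport_listener(8080); // stays open, but nobody accepts from it
  int new_fd = make_reuseport_listener(8080); // the only fd being serviced

  // Connections to 127.0.0.1:8080 are distributed between both accept queues;
  // the ones landing in old_fd's queue hang until old_fd is finally closed.
  for (int i = 0; i < 10; ++i) {
    int conn = ::accept(new_fd, nullptr, nullptr);
    if (conn >= 0) {
      ::close(conn);
    }
  }
  ::close(old_fd);
  ::close(new_fd);
  return 0;
}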

@lambdai
Contributor

lambdai commented Oct 18, 2021

I have an experimental fix locally that takes ownership of the listener socket factory if the update is an in-place update.
This fix makes the Istio integration tests pass.

@mattklein123
Member

During the in-place update, a ListenerImpl is created and the ListenSocketFactory is cloned. The listen sockets are cloned, too.

OK, this makes sense. I'm OK with a targeted fix, but I remember when we originally reviewed the in-place update changes, I commented on how it seemed strange that we were creating an entire new listener that we don't really use. Can we do additional cleanups here to prevent this type of issue? For an in-place update perhaps there should be no new socket factory at all?

@lambdai
Contributor

lambdai commented Oct 19, 2021

During the in-place update, a ListenerImpl is created and the ListenSocketFactory is cloned. The listen sockets are cloned, too.

OK, this makes sense. I'm OK with a targeted fix, but I remember when we originally reviewed the in-place update changes, I commented on how it seemed strange that we were creating an entire new listener that we don't really use. Can we do additional cleanups here to prevent this type of issue? For an in-place update perhaps there should be no new socket factory at all?

I transferred ownership of the listen socket factory to the in-place-updated new listener; all looks good until I hit the case of updating a listener that is already being updated. Still fixing...
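
Roughly, the idea looks like this (a toy sketch with hypothetical types, not Envoy's actual classes): the in-place-updated listener takes over the existing socket factory by move instead of receiving duplicated sockets, so only one set of listen fds and accept queues ever exists for the address.

// Toy illustration with hypothetical types of the ownership-transfer idea.
#include <memory>
#include <utility>
#include <vector>

struct ListenSocketFactory {
  std::vector<int> listen_fds; // in the real code these are closed via RAII when the factory is destroyed
};

struct Listener {
  std::unique_ptr<ListenSocketFactory> socket_factory;
};

// The in-place-updated listener takes over the old listener's factory; the old
// listener keeps draining connections it already accepted but no longer owns
// any listen fds of its own.
Listener inPlaceUpdate(Listener& old_listener) {
  Listener new_listener;
  new_listener.socket_factory = std::move(old_listener.socket_factory);
  return new_listener;
}

int main() {
  Listener old_listener;
  old_listener.socket_factory = std::make_unique<ListenSocketFactory>();

  // After the update, exactly one set of listen fds (and accept queues) remains.
  Listener new_listener = inPlaceUpdate(old_listener);
  return 0;
}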

@mattklein123
Member

I transferred ownership of the listen socket factory to the in-place-updated new listener; all looks good until I hit the case of updating a listener that is already being updated. Still fixing...

OK lmk if you need help fixing. Just ping me on Slack.
