fix: reduce Connection keep-alive timeout to 1 second fewer than the Solana RPC's keep-alive timeout #29130
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master   #29130      +/-   ##
==========================================
- Coverage    76.7%    76.5%     -0.2%
==========================================
  Files          55       54        -1
  Lines        3140     3119       -21
  Branches      472      468        -4
==========================================
- Hits         2410     2388       -22
- Misses        565      567        +2
+ Partials      165      164        -1
```
Thoughts, @dancamarg0, @linuskendall, @brianlong?
@steveluscher Yup, I can help test it out. I'd prefer a release-candidate package.
It's all yours, @0xCactus. You can …
Getting the following error, @steveluscher: …
Damnit. My bad. I'll publish a new package for you in a moment.
Sorry for the delay, @0xCactus. Try …
Alright, @0xCactus has this running in production at the moment. Let's give it a few days and see if there's improvement. Anyone else is free to pull down that version and give it a shot too!
@steveluscher FYI, our service has been relatively more stable since switching to the change from this PR, though it didn't seem to completely fix the issue. Yesterday, @dancamarg0 tweaked the keep-alive on their server to 19s and we saw the socket hang up issue occurring on the client side consistently, every minute.
Just to point out: the current keep-alive timeout is set to 60s; that's the setting we've been running for @0xCactus for weeks.
Wait, so here's what I think I got from the two messages above:
1. With the server's keep-alive set to 19s, you saw the socket hang up error on the client consistently, about every minute.
2. With the server's keep-alive set to 60s, the service has been relatively stable for weeks.
Are those two statements accurate?
So a higher timeout actually seems to fix the issue after your patch, apparently? I'll also collect some TCP data to see how often we see RST packets on the 0xCactus RPC now, and I'll get back if I have any interesting data.
Interesting! When it comes right down to it, the goal is for the client's timeout to be just a bit less than any of the other timeouts in the chain, so that the client always gives up on the free connection before any of the middle or end pieces do. Maybe, if both the client and the load balancer are set to 19s, there's a small window of time in which the load balancer has already dropped the connection but the client thinks it's still alive. You probably don't want to keep doing my testing in production for me, but it would be really interesting if you set the load balancer to match the RPC server (20s), @dancamarg0, and see if that performs just as well as 60s.
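To make that ordering concrete, here's a small sketch of the invariant being described. The 19s/20s values are the ones discussed in this thread; everything else is illustrative:

```ts
// Keep-alive timeouts along the request path. The client's idle-socket timeout
// must be strictly shorter than every hop further downstream, so the client is
// always the first to abandon an idle connection.
const RPC_KEEP_ALIVE_MS = 20_000;             // hyper's default on the Solana RPC
const LOAD_BALANCER_KEEP_ALIVE_MS = 20_000;   // suggestion above: match the RPC, not 19s
const CLIENT_FREE_SOCKET_TIMEOUT_MS = 19_000; // 1 second fewer than the shortest hop

console.assert(
  CLIENT_FREE_SOCKET_TIMEOUT_MS <
    Math.min(RPC_KEEP_ALIVE_MS, LOAD_BALANCER_KEEP_ALIVE_MS),
  'The client must give up on an idle socket before any upstream hop does',
);
```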
Alright. I'm shipping this. Thanks for everyone's contributions here.
Problem
When contacting an RPC that's behind a load balancer, clients can often send an RPC request down a free socket, only to discover that the socket has since been disposed of at the other end. These requests fail.
Based on this excellent article on tuning keep-alive, this is what I think I've learned.
- The Solana RPC's HTTP server (hyper) has a default keep-alive timeout of 20s.
- Node.js servers have a default keep-alive timeout of 5s.

I believe the solutions to be as follows:

- Make the client's keep-alive timeout a bit shorter than the RPC's (and than anything in between), so that the client always gives up on an idle socket before the other end does. A sketch of this follows below.
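Here's a minimal sketch of that solution, assuming the agentkeepalive package named in the Summary of Changes below; the numbers mirror the 20s/19s values discussed above:

```ts
// Assumes `esModuleInterop`; otherwise use `import Agent = require('agentkeepalive')`.
import Agent from 'agentkeepalive';

// Keep idle sockets around, but have the *client* close them after 19s:
// 1 second fewer than hyper's default 20s keep-alive on the Solana RPC,
// so the client always abandons a free socket before the server does.
const httpsAgent = new Agent.HttpsAgent({
  keepAlive: true,
  freeSocketTimeout: 19_000,
});
```

Any HTTP client that accepts a Node agent (the built-in `https` module, node-fetch, axios, and so on) can then be pointed at `httpsAgent`.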
Summary of Changes
- Reduce the Connection's keep-alive timeout to 1 second fewer than the Solana RPC's default keep-alive timeout (19s), using the agentkeepalive module.

Fixes #27859, hopefully.
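For anyone who wants to reproduce the before/after behaviour, here's a rough sketch of the kind of soak test described above: issue a request, let the socket sit idle for longer than the RPC's 20s keep-alive window, and repeat, watching for `socket hang up` errors. The endpoint and timings are placeholders, not values taken from this PR:

```ts
import { Connection } from '@solana/web3.js';

const connection = new Connection('https://api.mainnet-beta.solana.com');

async function soak(iterations = 10, idleMs = 25_000): Promise<void> {
  for (let i = 0; i < iterations; i++) {
    try {
      const slot = await connection.getSlot();
      console.log(`request ${i}: slot ${slot}`);
    } catch (err) {
      // Before this change, a request made after a >20s idle period could fail
      // with `socket hang up` because the far end had already closed the socket.
      console.error(`request ${i} failed:`, err);
    }
    // Idle for longer than the server's keep-alive so the next request has to
    // reuse (or replace) a socket the server may have already torn down.
    await new Promise((resolve) => setTimeout(resolve, idleMs));
  }
}

soak();
```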