fix: reduce Connection keep-alive timeout to 1 second fewer than the Solana RPC's keep-alive timeout #29130
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master   #29130      +/-   ##
==========================================
- Coverage    76.7%    76.5%     -0.2%
==========================================
  Files          55       54        -1
  Lines        3140     3119       -21
  Branches      472      468        -4
==========================================
- Hits         2410     2388       -22
- Misses        565      567        +2
+ Partials      165      164        -1
```
Thoughts, @dancamarg0, @linuskendall, @brianlong?
@steveluscher Yup, I can help test it out. I'd prefer a release-candidate package.
It's all yours, @0xCactus. You can …
Getting the following error, @steveluscher: …
Damnit. My bad. I'll publish a new package for you in a moment.
Sorry for the delay, @0xCactus. Try …
Alright, @0xCactus has this running in production at the moment. Let's give it a few days and see if there's improvement. Anyone else is free to pull down that version and give it a shot too!
@steveluscher FYI, our service has been relatively more stable since switching to the change from this PR, though it didn't seem to completely fix the issue. Yesterday, @dancamarg0 tweaked the keep-alive on their server to 19s and we saw the socket hang up issue occurring on the client side consistently, every minute.
Just to point out: the current keep-alive timeout is set to 60s; that's the setting we've been running for @0xCactus for weeks.
Wait, so here's what I think I got from the two messages above:
1. With the server's keep-alive set to 19s, you saw the socket hang up error on the client consistently, about every minute.
2. With the server's keep-alive set to 60s, the service has been relatively stable for weeks.
Are those two statements accurate?
So a higher timeout actually seems to fix the issue after your patch, apparently? I'll also collect some TCP data to see how often we see RST packets on the 0xCactus RPC now, and I'll get back if I have any interesting data.
Interesting! When it comes right down to it, the goal is for the client's timeout to be just a bit less than any of the other timeouts in the chain, so that the client always gives up on the free connection before any of the middle or end pieces do. Maybe, if both the client and the load balancer are set to 19s, there's a small window of time in which the load balancer has already dropped the connection but the client thinks it's still alive. You probably don't want to keep doing my testing in production for me, but it would be really interesting if you set the load balancer to match the RPC server (20s), @dancamarg0, and see if that performs just as well as 60s.
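To make that ordering concrete, here's a small sketch of the invariant being described. The 19s/20s values are the ones discussed in this thread; everything else is illustrative:

```ts
// Keep-alive timeouts along the request path. The client's idle-socket timeout
// must be strictly shorter than every hop further downstream, so the client is
// always the first to abandon an idle connection.
const RPC_KEEP_ALIVE_MS = 20_000;             // hyper's default on the Solana RPC
const LOAD_BALANCER_KEEP_ALIVE_MS = 20_000;   // suggestion above: match the RPC, not 19s
const CLIENT_FREE_SOCKET_TIMEOUT_MS = 19_000; // 1 second fewer than the shortest hop

console.assert(
  CLIENT_FREE_SOCKET_TIMEOUT_MS <
    Math.min(RPC_KEEP_ALIVE_MS, LOAD_BALANCER_KEEP_ALIVE_MS),
  'The client must give up on an idle socket before any upstream hop does',
);
```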
Alright. I'm shipping this. Thanks for everyone's contributions here.
Problem
When contacting an RPC that's behind a load balancer, clients can often send an RPC request down a free socket, only to discover that the socket has since been disposed of at the other end. These requests fail.
Based on this excellent article on tuning keep-alive, this is what I think I've learned.
- The Solana RPC's HTTP server (hyper) has a default keep-alive timeout of 20s.
- Node.js servers have a default keep-alive timeout of 5s.

I believe the solutions to be as follows:

- Make the client's keep-alive timeout a bit shorter than the RPC's (and than anything in between), so that the client always gives up on an idle socket before the other end does. A sketch of this follows below.
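Here's a minimal sketch of that solution, assuming the agentkeepalive package named in the Summary of Changes below; the numbers mirror the 20s/19s values discussed above:

```ts
// Assumes `esModuleInterop`; otherwise use `import Agent = require('agentkeepalive')`.
import Agent from 'agentkeepalive';

// Keep idle sockets around, but have the *client* close them after 19s:
// 1 second fewer than hyper's default 20s keep-alive on the Solana RPC,
// so the client always abandons a free socket before the server does.
const httpsAgent = new Agent.HttpsAgent({
  keepAlive: true,
  freeSocketTimeout: 19_000,
});
```

Any HTTP client that accepts a Node agent (the built-in `https` module, node-fetch, axios, and so on) can then be pointed at `httpsAgent`.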
Summary of Changes
- Reduce the Connection's keep-alive timeout to 1 second fewer than the Solana RPC's default keep-alive timeout (19s), using the agentkeepalive module.

Fixes #27859, hopefully.
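For anyone who wants to reproduce the before/after behaviour, here's a rough sketch of the kind of soak test described above: issue a request, let the socket sit idle for longer than the RPC's 20s keep-alive window, and repeat, watching for `socket hang up` errors. The endpoint and timings are placeholders, not values taken from this PR:

```ts
import { Connection } from '@solana/web3.js';

const connection = new Connection('https://api.mainnet-beta.solana.com');

async function soak(iterations = 10, idleMs = 25_000): Promise<void> {
  for (let i = 0; i < iterations; i++) {
    try {
      const slot = await connection.getSlot();
      console.log(`request ${i}: slot ${slot}`);
    } catch (err) {
      // Before this change, a request made after a >20s idle period could fail
      // with `socket hang up` because the far end had already closed the socket.
      console.error(`request ${i} failed:`, err);
    }
    // Idle for longer than the server's keep-alive so the next request has to
    // reuse (or replace) a socket the server may have already torn down.
    await new Promise((resolve) => setTimeout(resolve, idleMs));
  }
}

soak();
```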