
Improve error message: "Too many open files in system" #6791

Closed
dmitryn opened this issue Oct 16, 2017 · 22 comments · Fixed by #8744
Labels
F3-annoyance 💩 The client behaves within expectations, however this “expected behaviour” itself is at issue. M4-core ⛓ Core client code / Rust. Q2-easy 💃 Can be fixed by copy and pasting from StackOverflow. Q3-medium A fair chunk of work, not necessarily very hard but not trivial either
Milestone

Comments

@dmitryn

dmitryn commented Oct 16, 2017

I'm running:

  • Parity version: 1.8.0
  • Operating system: MacOS
  • And installed: via homebrew

====================

stack backtrace:

Thread 'IO Worker #3' panicked at 'Can't create handshake: StdIo(Error { repr: Os { code: 23, message: "Too many open files in system" } })', src/libcore/result.rs:860

This is a bug. Please report it at:

    https://github.com/paritytech/parity/issues/new

[1]    60411 abort      parity
@5chdn
Contributor

5chdn commented Oct 16, 2017

Please open a terminal and get the output of lsof.

@5chdn 5chdn added M4-core ⛓ Core client code / Rust. Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. labels Oct 16, 2017
@5chdn 5chdn changed the title Crashing bug Can't create handshake: StdIo(Error { repr: Os { code: 23, message: "Too many open files in system" } }) Oct 16, 2017
@dmitryn
Author

dmitryn commented Oct 16, 2017

@5chdn Sorry, but I can only post this for now, hope it helps:

$ lsof | wc -l
   11614

Also, on the next start parity works fine, no errors.

@5chdn
Contributor

5chdn commented Oct 17, 2017

What do you mean by "next start", after a reboot?

@dmitryn
Author

dmitryn commented Oct 17, 2017

@5chdn I mean on the next app run (no system reboot was done).

@5chdn
Contributor

5chdn commented Oct 18, 2017

Ok, so it's not reproducible?

@5chdn 5chdn closed this as completed Oct 18, 2017
@dmitryn
Author

dmitryn commented Oct 18, 2017

Right, I cannot reproduce it anymore.

@pussinboot

Hi, I managed to run into the same problem after syncing the testnet and leaving parity running for a while.
These are the command-line options I'm using:
--no-ui --geth --chain ropsten --no-dapps --mode active --tracing off --pruning fast --db-compaction ssd --cache-size 1024 --snapshot-peers 50

Parity version Parity/v1.9.5-stable-ff821daf1-20180321/x86_64-linux-gnu/rustc1.24.1

lsof | wc -l
31434

Most of the output from lsof can be found here: https://linx.li/selif/61f03wv2.txt

And the backtrace:

 ====================
 stack backtrace:
    0:     0x55ae5247490c - <no info>
 Thread 'IO Worker #1' panicked at 'Can't create handshake: Error(Io(Error { repr: Os { code: 24, message: "Too many open files" } }), State { next_error: None, backtrace: None })', /checkout/src/libcore/result.rs:916
 This is a bug. Please report it at:
     https://github.com/paritytech/parity/issues/new

@Tbaut
Contributor

Tbaut commented Apr 5, 2018

@pussinboot did you try to restart parity?

@5chdn 5chdn added this to the 1.11 milestone Apr 5, 2018
@pussinboot

Yes, restarting parity does fix the error.
Just doing my part reporting the bug; hopefully the output from lsof helps.

@Tbaut Tbaut reopened this Apr 6, 2018
@Tbaut Tbaut added F3-annoyance 💩 The client behaves within expectations, however this “expected behaviour” itself is at issue. and removed Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. labels Apr 6, 2018
@5chdn 5chdn modified the milestones: 1.11, 1.12 Apr 24, 2018
@folsen
Contributor

folsen commented May 19, 2018

@Tbaut I believe this is an OS error: when a websocket wants to create a new connection, it creates a new "file", and there's a maximum number of open files that the OS enforces (it's a huge number, but it is typically a problem for webservers with many websocket connections). It almost sounds like there's some dapp or UI bug that spawns a ton of connections, not sure.

Pure speculation, but restarting parity will almost always fix the problem. The only thing we can do on our end is improve the error message. Will keep the issue open for that reason.

@folsen folsen changed the title Can't create handshake: StdIo(Error { repr: Os { code: 23, message: "Too many open files in system" } }) Improve error message: "Too many open files in system" May 19, 2018
@folsen folsen added Q2-easy 💃 Can be fixed by copy and pasting from StackOverflow. Q3-medium A fair chunk of work, not necessarily very hard but not trivial either labels May 19, 2018
@gnunicorn
Contributor

gnunicorn commented May 29, 2018

Now, this one is odd. Even going back to that specific version, the only place this error could occur is when trying to create the handshake for a p2p connection (the only place containing "Can't create handshake"). However, at that point the socket is already open, and internally the ::new doesn't even do anything beyond setting up some structs. The only call that could make this run into an error is when trying to generate the random keys, through ethkey, which is just a thin wrapper around rand::OsRng - WHAT!?!

Which, of course, could happen, as on a *nix system this goes to /dev/urandom and thus is a file read. But if we fail to read /dev/urandom because we have "Too many open files in system", then this implies we are leaking file handles somewhere. Or do we generally expect parity to use more than 1024 open file handles at the same time (which is the default on a *nix system)? If so, we might want to advise people to set a higher ulimit to run parity...?
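
For reference, a process can inspect its own descriptor limit at startup and warn when it looks too low for a busy node. This is only a minimal sketch assuming the libc crate is available; the function name and threshold are made up and nothing like this exists in parity:

extern crate libc;

// Unix-only sketch: read RLIMIT_NOFILE and warn when the soft limit is below
// a threshold the node is expected to need.
fn warn_if_fd_limit_low(wanted: u64) {
    let mut rl = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    let ok = unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut rl) } == 0;
    if ok && (rl.rlim_cur as u64) < wanted {
        eprintln!(
            "Soft open-file limit is {} (hard limit {}); consider raising it with ulimit -n before running parity.",
            rl.rlim_cur, rl.rlim_max
        );
    }
}

fn main() {
    warn_if_fd_limit_low(4096);
}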

@folsen
Contributor

folsen commented May 29, 2018

@gnunicorn Parity itself won't use that many handles (although I wouldn't rule out a leak entirely). But creating a websocket connection takes up a file handle afaik, so if people have a thousand websocket connections from a dapp, or something that in turn is leaking handles, or something else beyond our control, then there's not much we can do about it.

@gnunicorn
Contributor

gnunicorn commented May 29, 2018

True, @folsen, but if I take @pussinboot's word for it:

hi, i managed to run into the same problem after syncing testnet and leaving parity running for a while..
these are the command line options i'm using:
--no-ui --geth --chain ropsten --no-dapps --mode active --tracing off --pruning fast --db-compaction ssd --cache-size 1024 --snapshot-peers 50

There is no way this can be caused by dapps (--no-dapps), and taking the "leaving it running for a while" at face value, there isn't any other user interaction causing it (--no-ui) - just having parity up and running.

I will see if there is some way to reproduce it; however, it might simply have been fixed in the meantime.


Assuming it is still an issue, though, what would a better error message look like? Should we add an expect for the random call, suggesting a restart, for this ticket to be closed? Something like .expect("Looks like the system blocks calls to its random number generator. Please consider restarting parity!")?
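
For what it's worth, a minimal sketch of that expect idea, assuming the rand 0.3/0.4-era API where OsRng::new() returns an io::Result; this is not the actual ethkey code and the wording is only a placeholder:

extern crate rand;

use rand::Rng;
use rand::os::OsRng;

// Sketch only: attach a clearer message to the one call that can fail with
// "too many open files", since opening /dev/urandom consumes a descriptor.
fn random_nonce() -> [u8; 32] {
    let mut rng = OsRng::new().expect(
        "Unable to open the OS random number generator (e.g. /dev/urandom); \
         the open-file limit was probably hit. Please consider restarting parity!",
    );
    let mut nonce = [0u8; 32];
    rng.fill_bytes(&mut nonce);
    nonce
}

fn main() {
    println!("first byte: {}", random_nonce()[0]);
}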

@dvdplm
Collaborator

dvdplm commented May 30, 2018

what would a better error message look like?

Perhaps "Too many open files; can't create handshake. Check your resource limits and restart parity!", and perhaps platform-specific advice to check /etc/security/limits.conf (Linux), sysctl (Mac), something (Windows)?
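
A rough sketch of what that could look like; the function name and the exact wording are illustrative only, not anything that exists in parity:

// Illustrative only: pick a platform-specific hint at compile time.
fn too_many_files_hint() -> &'static str {
    if cfg!(target_os = "linux") {
        "Too many open files; can't create handshake. Check ulimit -n and \
         /etc/security/limits.conf, then restart parity."
    } else if cfg!(target_os = "macos") {
        "Too many open files; can't create handshake. Check ulimit -n and \
         sysctl kern.maxfiles, then restart parity."
    } else if cfg!(target_os = "windows") {
        "Too many open files; can't create handshake. Close other applications \
         and restart parity."
    } else {
        "Too many open files; can't create handshake. Check your resource \
         limits and restart parity."
    }
}

fn main() {
    eprintln!("{}", too_many_files_hint());
}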

@gnunicorn
Contributor

After running it overnight, with parity never even coming close to 300 open files even during sync and staying around 122 for most of it, I don't think there is any leak (at this point). So I attempted to clarify the error message, which turned out to be a little ugly, as we have to check for its particular OS error code and wrap things around... Not sure that is really much better to have in the code.
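
(For anyone who wants to repeat that measurement from inside the process rather than with lsof: on Linux one can count the entries in /proc/self/fd. A rough sketch, not parity code:)

use std::fs;
use std::io;

// Linux-only sketch: each entry in /proc/self/fd is one open descriptor of
// this process (including the one used to read the directory itself).
fn open_fd_count() -> io::Result<usize> {
    Ok(fs::read_dir("/proc/self/fd")?.count())
}

fn main() {
    match open_fd_count() {
        Ok(n) => println!("open file descriptors: {}", n),
        Err(e) => eprintln!("could not read /proc/self/fd: {}", e),
    }
}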

@folsen
Contributor

folsen commented May 31, 2018

I actually just got hit by this error myself, so it still definitely is a thing.

I had 479 peers when it crashed and had ethstats pinging the RPCs at the same time; on top of this, it might've been sending out snapshots.

There's definitely something going on here, but it might be related to having a really high peer count rather than to time. This node was running for well over a week before crashing, but it never crashed at all on 1.10 with the same settings. The original report here clearly dates back much further, though, so I think there are potentially a number of different reasons behind the different reports.

@dvdplm
Collaborator

dvdplm commented May 31, 2018

@folsen did you get OS error 23 or 24?

@folsen
Contributor

folsen commented May 31, 2018

@dvdplm 24, Incoming streams error, closing sever: Os { code: 24, kind: Other, message: "Too many open files" }

Also, peak peers was 512, but an hour+ earlier than the shutdown.

@dvdplm
Collaborator

dvdplm commented May 31, 2018

@folsen so it's the shell that started parity that blew the limit. And it's not while setting up the handshake – that's the nasty part of resource limit issues: they can pop up anywhere. Can you repeat it by lowering the process ulimit? :/

@gnunicorn
Contributor

When trying to implement that message, we noticed there are actually two distinct errors reported in this ticket: 23 ("too many open files in system", first post) and 24 ("too many open files" in the process). The proposed message is appropriate for the latter (and it is being handled in the PR); for 23, however, this isn't the case yet.

Do we want a better message for that, too, considering that the appropriate response is killing many processes or even restarting the system? I feel like this might be a bit out of our depth here - similar to "no space left on device", these might simply be errors that we should propagate rather than rephrase, as they are well outside the responsibility of a single process to handle... What do you think?
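
For reference, the two cases can at least be told apart by the raw OS error code (on Linux and macOS, 24 is EMFILE, the per-process limit, and 23 is ENFILE, the system-wide one). A hedged sketch, with made-up names and wording, of what attaching a short hint while propagating them might look like:

use std::io;

// Illustrative only: map the two errno values seen in this ticket to a hint
// and leave everything else untouched.
fn fd_exhaustion_hint(e: &io::Error) -> Option<&'static str> {
    match e.raw_os_error() {
        // EMFILE: this process hit its own descriptor limit.
        Some(24) => Some("This process ran out of file descriptors; raise ulimit -n or restart parity."),
        // ENFILE: the whole system is out of file descriptors.
        Some(23) => Some("The system-wide file descriptor limit was reached; close other applications or reboot."),
        _ => None,
    }
}

fn main() {
    let err = io::Error::from_raw_os_error(24);
    if let Some(hint) = fd_exhaustion_hint(&err) {
        eprintln!("{} - {}", err, hint);
    }
}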

@folsen
Contributor

folsen commented May 31, 2018

@gnunicorn Agree with you. When I originally phrased this I was thinking of a simpler situation: "Can't create handshake" isn't very informative. If we could say "Can't create handshake because you have too many websocket connections open", that'd be better. Unfortunately it seems we can't boil it down to anything so simple. Something like what @dvdplm suggested might be appropriate, but yeah, don't spend too much time on it :)

@gnunicorn
Contributor

@folsen

@dvdplm 24, Incoming streams error, closing sever: Os { code: 24, kind: Other, message: "Too many open files" } Also, peak peers was 512, but an hour+ earlier than the shutdown.

How long did you have this process running? Even with 512 connections open, we shouldn't be reaching the default ulimit of 1024 open files. This is odd. Did you run on master? Which command did you run it with?

Either way, I don't think the current approach in #8737 is particularly useful: it wouldn't even catch this case, as it came up somewhere else (in "Incoming stream" rather than at the handshake).

I'll investigate whether there is some nicer way to wrap OS errors into more user-friendly messages directly within the error-chain of network-p2p.
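
Something along these lines, perhaps - purely a sketch of that error-chain idea, with made-up names and wording, not the actual network-p2p error definitions:

#[macro_use]
extern crate error_chain;

error_chain! {
    foreign_links {
        Io(::std::io::Error);
    }
    errors {
        // Made-up variant: a friendlier wrapper for descriptor exhaustion.
        TooManyOpenFiles {
            description("too many open files")
            display("Too many open files; check your resource limits (e.g. ulimit -n) and restart parity")
        }
    }
}

// Promote raw EMFILE/ENFILE I/O errors to the friendlier variant.
fn wrap_io(e: ::std::io::Error) -> Error {
    match e.raw_os_error() {
        Some(23) | Some(24) => ErrorKind::TooManyOpenFiles.into(),
        _ => e.into(),
    }
}

fn main() {
    eprintln!("{}", wrap_io(::std::io::Error::from_raw_os_error(24)));
}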


However, I still suspect there is a leak somewhere, because the limit is just much higher than I'd expect this process to reach even under normal usage. Some more debug info would be helpful.
