
Improve error message: "Too many open files in system" #6791

Closed
dmitryn opened this issue Oct 16, 2017 · 22 comments · Fixed by #8744
Labels
F3-annoyance 💩 The client behaves within expectations, however this “expected behaviour” itself is at issue. M4-core ⛓ Core client code / Rust. Q2-easy 💃 Can be fixed by copy and pasting from StackOverflow. Q3-medium A fair chunk of work, not necessarily very hard but not trivial either
Milestone

Comments

@dmitryn

dmitryn commented Oct 16, 2017

I'm running:

  • Parity version: 1.8.0
  • Operating system: MacOS
  • And installed: via homebrew

====================

stack backtrace:

Thread 'IO Worker #3' panicked at 'Can't create handshake: StdIo(Error { repr: Os { code: 23, message: "Too many open files in system" } })', src/libcore/result.rs:860

This is a bug. Please report it at:

    https://github.com/paritytech/parity/issues/new

[1]    60411 abort      parity
@5chdn
Contributor

5chdn commented Oct 16, 2017

Please open a terminal and get the output of lsof.

@5chdn 5chdn added M4-core ⛓ Core client code / Rust. Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. labels Oct 16, 2017
@5chdn 5chdn changed the title Crashing bug Can't create handshake: StdIo(Error { repr: Os { code: 23, message: "Too many open files in system" } }) Oct 16, 2017
@dmitryn
Author

dmitryn commented Oct 16, 2017

@5chdn Sorry, but I can only post this for now, hope it helps:

$ lsof | wc -l
   11614

Also, on the next start parity works fine, no errors.

@5chdn
Contributor

5chdn commented Oct 17, 2017

What do you mean by "next start", after a reboot?

@dmitryn
Author

dmitryn commented Oct 17, 2017

@5chdn I mean on the next app run (no system reboot was done).

@5chdn
Contributor

5chdn commented Oct 18, 2017

Ok, so it's not reproducible?

@5chdn 5chdn closed this as completed Oct 18, 2017
@dmitryn
Author

dmitryn commented Oct 18, 2017

Right, I cannot reproduce it anymore.

@pussinboot

Hi, I managed to run into the same problem after syncing the testnet and leaving parity running for a while.
These are the command-line options I'm using:
--no-ui --geth --chain ropsten --no-dapps --mode active --tracing off --pruning fast --db-compaction ssd --cache-size 1024 --snapshot-peers 50

Parity version Parity/v1.9.5-stable-ff821daf1-20180321/x86_64-linux-gnu/rustc1.24.1

lsof | wc -l
31434

Most of the output from lsof can be found here: https://linx.li/selif/61f03wv2.txt

And the backtrace:

 ====================
 stack backtrace:
    0:     0x55ae5247490c - <no info>
 Thread 'IO Worker #1' panicked at 'Can't create handshake: Error(Io(Error { repr: Os { code: 24, message: "Too many open files" } }), State { next_error: None, backtrace: None })', /checkout/src/libcore/result.rs:916
 This is a bug. Please report it at:
     https://github.com/paritytech/parity/issues/new

@Tbaut
Contributor

Tbaut commented Apr 5, 2018

@pussinboot did you try to restart parity?

@5chdn 5chdn added this to the 1.11 milestone Apr 5, 2018
@pussinboot

Yes, restarting parity does fix the error.
Just doing my part reporting the bug; hopefully the output from lsof helps.

@Tbaut Tbaut reopened this Apr 6, 2018
@Tbaut Tbaut added F3-annoyance 💩 The client behaves within expectations, however this “expected behaviour” itself is at issue. and removed Z0-unconfirmed 🤔 Issue might be valid, but it’s not yet known. labels Apr 6, 2018
@5chdn 5chdn modified the milestones: 1.11, 1.12 Apr 24, 2018
@folsen
Contributor

folsen commented May 19, 2018

@Tbaut I believe this is an OS error: when a websocket wants to create a new connection, it creates a new "file", and there's a maximum number of open files that the OS enforces (it's a huge number, but it is typically a problem for webservers with many websocket connections). It almost sounds like there's some dapp or UI bug that spawns a ton of connections, not sure.

Pure speculation, but restarting parity will almost always fix the problem. The only thing we can do on our end is improve the error message. Will keep the issue open for that reason.

@folsen folsen changed the title Can't create handshake: StdIo(Error { repr: Os { code: 23, message: "Too many open files in system" } }) Improve error message: "Too many open files in system" May 19, 2018
@folsen folsen added Q2-easy 💃 Can be fixed by copy and pasting from StackOverflow. Q3-medium A fair chunk of work, not necessarily very hard but not trivial either labels May 19, 2018
@gnunicorn
Contributor

gnunicorn commented May 29, 2018

Now, this one is odd. Even going back to that specific version, the only place this error could occur is when trying to create the handshake for a p2p connection (the only place containing "Can't create handshake"). However, at that point the socket is already open, and internally the ::new doesn't even do anything beyond setting up some structs. The only call that could make this run into an error is when trying to generate the random keys, through ethkey, which is just a thin wrapper around rand::OsRng - WHAT!?!

Which, of course, could happen, as on a *nix system this goes to /dev/urandom and thus is a file read. But if we fail to read /dev/urandom because we have "Too many open files in system", then this implies we are leaking file handles somewhere. Or do we generally expect parity to use more than 1024 open file handles at the same time (which is the default on a *nix system)? If so, we might want to advise people to set a higher ulimit to run parity...?
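
For reference, a process can inspect its own descriptor limit at startup and warn when it looks too low for a busy node. This is only a minimal sketch assuming the libc crate is available; the function name and threshold are made up and nothing like this exists in parity:

extern crate libc;

// Unix-only sketch: read RLIMIT_NOFILE and warn when the soft limit is below
// a threshold the node is expected to need.
fn warn_if_fd_limit_low(wanted: u64) {
    let mut rl = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    let ok = unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut rl) } == 0;
    if ok && (rl.rlim_cur as u64) < wanted {
        eprintln!(
            "Soft open-file limit is {} (hard limit {}); consider raising it with ulimit -n before running parity.",
            rl.rlim_cur, rl.rlim_max
        );
    }
}

fn main() {
    warn_if_fd_limit_low(4096);
}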

@folsen
Contributor

folsen commented May 29, 2018

@gnunicorn Parity itself won't use that many handles (although I wouldn't rule out a leak entirely). But creating a websocket connection takes up a file handle afaik, so if people have a thousand websocket connections from a dapp, or something that in turn is leaking handles, or something else beyond our control, then there's not much we can do about it.

@gnunicorn
Contributor

gnunicorn commented May 29, 2018

True, @folsen, but if I take @pussinboot's word for it:

hi, i managed to run into the same problem after syncing testnet and leaving parity running for a while..
these are the command line options i'm using:
--no-ui --geth --chain ropsten --no-dapps --mode active --tracing off --pruning fast --db-compaction ssd --cache-size 1024 --snapshot-peers 50

There is no way this can be caused by dapps (--no-dapps), and taking the "leaving it running for a while" at face value, there isn't any other user interaction causing it (--no-ui) - just having parity up and running.

I will see if there is some way to reproduce it; however, it might simply have been fixed in the meantime.


Assuming it is still an issue, though, what would a better error message look like? Should we add an expect for the random call, suggesting a restart, for this ticket to be closed? Something like .expect("Looks like the system blocks calls to its random number generator. Please consider restarting parity!")?
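
For what it's worth, a minimal sketch of that expect idea, assuming the rand 0.3/0.4-era API where OsRng::new() returns an io::Result; this is not the actual ethkey code and the wording is only a placeholder:

extern crate rand;

use rand::Rng;
use rand::os::OsRng;

// Sketch only: attach a clearer message to the one call that can fail with
// "too many open files", since opening /dev/urandom consumes a descriptor.
fn random_nonce() -> [u8; 32] {
    let mut rng = OsRng::new().expect(
        "Unable to open the OS random number generator (e.g. /dev/urandom); \
         the open-file limit was probably hit. Please consider restarting parity!",
    );
    let mut nonce = [0u8; 32];
    rng.fill_bytes(&mut nonce);
    nonce
}

fn main() {
    println!("first byte: {}", random_nonce()[0]);
}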

@dvdplm
Collaborator

dvdplm commented May 30, 2018

what would a better error message look like?

Perhaps "Too many open files; can't create handshake. Check your resource limits and restart parity!", and perhaps platform-specific advice to check /etc/security/limits.conf (Linux), sysctl (Mac), something (Windows)?
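
A rough sketch of what that could look like; the function name and the exact wording are illustrative only, not anything that exists in parity:

// Illustrative only: pick a platform-specific hint at compile time.
fn too_many_files_hint() -> &'static str {
    if cfg!(target_os = "linux") {
        "Too many open files; can't create handshake. Check ulimit -n and \
         /etc/security/limits.conf, then restart parity."
    } else if cfg!(target_os = "macos") {
        "Too many open files; can't create handshake. Check ulimit -n and \
         sysctl kern.maxfiles, then restart parity."
    } else if cfg!(target_os = "windows") {
        "Too many open files; can't create handshake. Close other applications \
         and restart parity."
    } else {
        "Too many open files; can't create handshake. Check your resource \
         limits and restart parity."
    }
}

fn main() {
    eprintln!("{}", too_many_files_hint());
}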

@gnunicorn
Contributor

After running it overnight, with parity never even coming close to 300 open files even during sync and staying around 122 for most of it, I don't think there is any leak (at this point). So I attempted to clarify the error message, which turned out to be a little ugly, as we have to check for its particular OS error code and wrap things around... Not sure that is really much better to have in the code.
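
(For anyone who wants to repeat that measurement from inside the process rather than with lsof: on Linux one can count the entries in /proc/self/fd. A rough sketch, not parity code:)

use std::fs;
use std::io;

// Linux-only sketch: each entry in /proc/self/fd is one open descriptor of
// this process (including the one used to read the directory itself).
fn open_fd_count() -> io::Result<usize> {
    Ok(fs::read_dir("/proc/self/fd")?.count())
}

fn main() {
    match open_fd_count() {
        Ok(n) => println!("open file descriptors: {}", n),
        Err(e) => eprintln!("could not read /proc/self/fd: {}", e),
    }
}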

@folsen
Contributor

folsen commented May 31, 2018

I actually just got hit by this error myself, so it still definitely is a thing.

I had 479 peers when it crashed and had ethstats pinging the RPCs at the same time; on top of this, it might've been sending out snapshots.

There's definitely something going on here, but it might be related to having a really high peer count rather than to time. This node was running for well over a week before crashing, but it never crashed at all on 1.10 with the same settings. The original report here clearly dates back much further, though, so I think there are potentially a number of different reasons behind the different reports.

@dvdplm
Collaborator

dvdplm commented May 31, 2018

@folsen did you get OS error 23 or 24?

@folsen
Contributor

folsen commented May 31, 2018

@dvdplm 24, Incoming streams error, closing sever: Os { code: 24, kind: Other, message: "Too many open files" }

Also, peak peers was 512, but an hour+ earlier than the shutdown.

@dvdplm
Collaborator

dvdplm commented May 31, 2018

@folsen so it's the shell that started parity that blew the limit. And it's not while setting up the handshake – that's the nasty part of resource limit issues: they can pop up anywhere. Can you repeat it by lowering the process ulimit? :/

@gnunicorn
Contributor

When trying to implement that message, we noticed there are actually two distinct errors reported in this ticket: 23 ("too many open files in system", first post) and 24 ("too many open files" in the process). The proposed message is appropriate for the latter (and it is being handled in the PR); for 23, however, this isn't the case yet.

Do we want a better message for that, too, considering that the appropriate response is killing many processes or even restarting the system? I feel like this might be a bit out of our depth here - similar to "no space left on device", these might simply be errors that we should propagate rather than rephrase, as they are well outside the responsibility of a single process to handle... What do you think?
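
For reference, the two cases can at least be told apart by the raw OS error code (on Linux and macOS, 24 is EMFILE, the per-process limit, and 23 is ENFILE, the system-wide one). A hedged sketch, with made-up names and wording, of what attaching a short hint while propagating them might look like:

use std::io;

// Illustrative only: map the two errno values seen in this ticket to a hint
// and leave everything else untouched.
fn fd_exhaustion_hint(e: &io::Error) -> Option<&'static str> {
    match e.raw_os_error() {
        // EMFILE: this process hit its own descriptor limit.
        Some(24) => Some("This process ran out of file descriptors; raise ulimit -n or restart parity."),
        // ENFILE: the whole system is out of file descriptors.
        Some(23) => Some("The system-wide file descriptor limit was reached; close other applications or reboot."),
        _ => None,
    }
}

fn main() {
    let err = io::Error::from_raw_os_error(24);
    if let Some(hint) = fd_exhaustion_hint(&err) {
        eprintln!("{} - {}", err, hint);
    }
}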

@folsen
Contributor

folsen commented May 31, 2018

@gnunicorn Agree with you. When I originally phrased this I was thinking of a simpler situation: "Can't create handshake" isn't very informative. If we could say "Can't create handshake because you have too many websocket connections open", that'd be better. Unfortunately it seems we can't boil it down to anything so simple. Something like what @dvdplm suggested might be appropriate, but yeah, don't spend too much time on it :)

@gnunicorn
Contributor

@folsen

@dvdplm 24, Incoming streams error, closing sever: Os { code: 24, kind: Other, message: "Too many open files" } Also, peak peers was 512, but an hour+ earlier than the shutdown.

How long did you have this process running? Even with 512 connections open, we shouldn't be reaching the default ulimit of 1024 open files. This is odd. Did you run on master? Which command did you run it with?

Either way, I don't think the current approach in #8737 is particularly useful: it wouldn't even catch this case, as it came up somewhere else (in "Incoming stream" rather than at the handshake).

I'll investigate whether there is some nicer way to wrap OS errors into more user-friendly messages directly within the error-chain of network-p2p.
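
Something along these lines, perhaps - purely a sketch of that error-chain idea, with made-up names and wording, not the actual network-p2p error definitions:

#[macro_use]
extern crate error_chain;

error_chain! {
    foreign_links {
        Io(::std::io::Error);
    }
    errors {
        // Made-up variant: a friendlier wrapper for descriptor exhaustion.
        TooManyOpenFiles {
            description("too many open files")
            display("Too many open files; check your resource limits (e.g. ulimit -n) and restart parity")
        }
    }
}

// Promote raw EMFILE/ENFILE I/O errors to the friendlier variant.
fn wrap_io(e: ::std::io::Error) -> Error {
    match e.raw_os_error() {
        Some(23) | Some(24) => ErrorKind::TooManyOpenFiles.into(),
        _ => e.into(),
    }
}

fn main() {
    eprintln!("{}", wrap_io(::std::io::Error::from_raw_os_error(24)));
}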


However, I still suspect there is a leak somewhere, because the limit is just much higher than I'd expect this process to reach even under normal usage. Some more debug info would be helpful.
