Improve error message: "Too many open files in system" #6791
please open a terminal and get the output of …
@5chdn sorry, but I can only post this for now, hope it helps:
Also, on next start: …
What do you mean by "next start", after a reboot?
@5chdn I mean on the next app run (no system reboot was done).
Ok, so it's not reproducible?
Right, I cannot reproduce it anymore.
Hi, I managed to run into the same problem after syncing testnet and leaving Parity running for a while. Parity version: Parity/v1.9.5-stable-ff821daf1-20180321/x86_64-linux-gnu/rustc1.24.1
Most of the output from lsof can be found here: https://linx.li/selif/61f03wv2.txt, and the backtrace: …
@pussinboot did you try to restart Parity?
Yes, restarting Parity does fix the error.
@Tbaut I believe this is an OS error: when a websocket wants to create a new connection, it creates a new "file", and there is a maximum number of open files that the OS enforces (this is a huge number, but it is typically a problem for webservers with many websocket connections). It almost sounds like there's some dapp or UI bug that spawns a ton of connections, not sure. Pure speculation, but restarting Parity will almost always fix the problem. The only thing we can do on our end is improve the error message. Will keep the issue open for that reason.
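For reference, the limit in question is the per-process `RLIMIT_NOFILE`. A minimal sketch of reading it, assuming the `libc` crate (illustrative only, not Parity code):

```rust
use libc::{getrlimit, rlimit, RLIMIT_NOFILE};

fn main() {
    // Every socket, including each websocket connection, consumes one
    // file descriptor, counted against this per-process limit.
    let mut lim = rlimit { rlim_cur: 0, rlim_max: 0 };
    let rc = unsafe { getrlimit(RLIMIT_NOFILE, &mut lim) };
    if rc == 0 {
        println!("open-file limit: soft {}, hard {}", lim.rlim_cur, lim.rlim_max);
    } else {
        eprintln!("getrlimit failed: {}", std::io::Error::last_os_error());
    }
}
```

On typical Linux setups the soft limit defaults to 1024, the number that comes up again later in this thread.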
Now, this one is odd. Even going back to that specific version, the only place this error could occur is when trying to create the handshake for a p2p connection (the only place containing "Can't create handshake"); however, at that point the socket is already open, and internally the … Which, of course, could happen, as on a *nix system this goes to …
@gnunicorn Parity itself won't use that many handles (although I wouldn't rule out a leak entirely). But creating a websocket connection takes up a file handle afaik, so if people have a thousand websocket connections from a dapp, or something beyond our control that in turn is leaking handles, then there's not much we can do about it.
True, @folsen, but if I take @pussinboot's word for it:
There is no way this can be caused by dapps (…). I will see if there is some way to reproduce it; however, it might simply have been fixed in the meantime. Assuming that it is still an issue though, what would a better error message look like? Should we have an …
Perhaps …
After running it overnight, with Parity never coming close to 300 open files even during sync, staying around 122 for most of it, I don't think there is any leak (at this point). So I attempted to clarify the error message, which turned out to be a little ugly, as we have to check for the particular OS error code and wrap things around it... Not sure that is really much better to have in the code.
I actually just got hit by this error myself, so it definitely still is a thing. I had 479 peers when it crashed, and ethstats was pinging the RPCs at the same time; on top of this it might've been sending out snapshots. There's definitely something going on here, but it might be related to having a really high peer count and not necessarily time. This node was running for well over a week before crashing, but it never crashed at all on 1.10 with the same settings. Though clearly the original report here dates back to much earlier, I think there are potentially a number of different reasons behind the different reports here.
@folsen did you get OS error 23 or 24?
@dvdplm 24. Also, peak peers was 512, but that was an hour or more before the shutdown.
@folsen so it's the shell that started Parity that blew the limit. And it's not while setting up the handshake, and that's the nasty part of resource limit issues: they can pop up anywhere. Can you repeat it by lowering the process limit?
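One way to force the condition, sketched in Rust with the `libc` crate (running `ulimit -n 16` in the shell before starting the process would have the same effect); this is an illustration, not part of Parity:

```rust
use libc::{rlimit, setrlimit, RLIMIT_NOFILE};
use std::net::TcpListener;

fn main() {
    // Lower the soft limit so EMFILE (os error 24) shows up after a
    // handful of sockets instead of after the usual 1024.
    let lim = rlimit { rlim_cur: 16, rlim_max: 16 };
    assert_eq!(unsafe { setrlimit(RLIMIT_NOFILE, &lim) }, 0);

    // Open listeners until the limit bites; the Err carries errno 24.
    let mut held = Vec::new();
    loop {
        match TcpListener::bind("127.0.0.1:0") {
            Ok(listener) => held.push(listener),
            Err(e) => {
                println!("hit the limit: {} (os error {:?})", e, e.raw_os_error());
                break;
            }
        }
    }
}
```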
When trying to implement that message, we noticed there are actually two distinct errors reported in this ticket: 23 ("too many open files in system", first post) and 24 ("too many open files in process"). The proposed message is appropriate for the latter (and is being handled in the PR); however, for 23 this isn't the case yet. Do we want a better message for that, too, considering that the appropriate response is killing many processes or even restarting the system? I feel like this might be a bit out of our depth here: similar to "no space left on device", these might simply be errors that we should propagate rather than rephrase, as they are well outside the responsibility of a single process to handle... What do you think?
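For reference, the two codes correspond to distinct errnos: `ENFILE` (23) means the system-wide file table is full, while `EMFILE` (24) means this process exhausted its own limit. A sketch of the kind of mapping discussed here, using `std::io` plus the `libc` constants; the helper name is made up, not Parity's actual API:

```rust
use std::io;

// Illustrative helper: translate the two "too many open files" errnos
// into actionable messages and pass everything else through unchanged.
fn friendly_message(e: &io::Error) -> String {
    match e.raw_os_error() {
        // EMFILE (24): this process hit its own descriptor limit.
        Some(code) if code == libc::EMFILE => {
            "Too many open files in this process; raise the limit \
             (e.g. `ulimit -n`) or lower the peer count".to_string()
        }
        // ENFILE (23): the system-wide file table is full; nothing
        // this one process can fix on its own.
        Some(code) if code == libc::ENFILE => {
            "Too many open files on this system; close other \
             applications or raise the kernel-wide limit".to_string()
        }
        _ => e.to_string(),
    }
}
```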
@gnunicorn Agree with you. When I originally phrased this I was thinking of a simpler situation: "Can't create handshake" isn't very informative, and if we could say "Can't create handshake because you have too many websocket connections open", that would be better. Unfortunately it seems we can't boil it down to anything that simple. Something like @dvdplm suggested might be appropriate, but yeah, don't spend too much time on it :)
How long did you have this process running? Even with 512 connections open, we shouldn't be reaching the default ulimit of 1024 open files. This is odd. Did you run on master? Which command did you run it with? Either way, I don't think the current approach in #8737 is particularly useful: it wouldn't even catch this case, as it came up somewhere else (in "Incoming stream" rather than at handshake). I'll investigate whether there is some nicer way to wrap OS errors into more user-friendly messages directly within the error-chain of network-p2p. However, I still suspect there is a leak somewhere, because the limit is just much higher than I'd anticipate this process reaching even under usual usage. Some more debug info would be helpful.
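One possible shape for that wrapping, sketched with the `error-chain` macro; the variant name and the conversion function are assumptions for illustration, not what network-p2p actually contains:

```rust
#[macro_use]
extern crate error_chain;

error_chain! {
    foreign_links {
        Io(::std::io::Error);
    }
    errors {
        // Hypothetical variant for the per-process case (errno 24).
        ProcessFdLimit {
            description("too many open files in process")
            display("Reached the per-process open-file limit; raise it with `ulimit -n` or reduce the peer count")
        }
    }
}

// Conversion point: inspect the errno first, and only fall back to the
// generic foreign-link wrapping when it is not a limit error.
fn wrap_io(e: ::std::io::Error) -> Error {
    if e.raw_os_error() == Some(libc::EMFILE) {
        ErrorKind::ProcessFdLimit.into()
    } else {
        e.into()
    }
}
```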