S3 connection hangs; does Fog support timeout? #180
This is at best an Excon issue. In reviewing timeout conditions in the excon socket library, I'm not seeing any glaring issues. What excon version are you using?
Thanks for the quick response! I wanted to verify my understanding here before posting to excon, but I'm happy to move this to excon if that's appropriate. Using … I'm working on updating our stack to the latest fog.
Will defer to @geemus on where he would like to field it.
I think the excon stuff will only trigger if there is no activity at all, i.e. if you get one byte every 60 seconds, it would keep going. That said, actually having it happen like that seems exceedingly unlikely. I'd expect in most cases this would break out like you described.
As would I. But I don't have a deep understanding of sockets. Can you suggest what I can do with the next such process that locks up to try to debug this further? If I read the … The fact that we had this exact same behavior from two different libraries on two different endpoints makes me think this is something in my environment, but I don't know where to look next. We have never had this problem with S3 connections before.
Also, the call that is locking up here is … We do probably 500k of these queries in a day, so this is really rare, but it's blocking. If I could get a …
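As an aside on the debugging question above, one low-tech option (a sketch, assuming the handler can be added to the workers ahead of time) is to trap a signal and dump every thread's backtrace, so the next stuck process can be inspected with `kill -USR1 <pid>` without killing it:

```ruby
# Dump all thread backtraces to stderr when the process receives SIGUSR1.
# This lets you ask a wedged worker where it is actually parked.
Signal.trap('USR1') do
  Thread.list.each do |thread|
    warn "Thread #{thread.object_id} (#{thread.status.inspect}):"
    warn (thread.backtrace || ['<no backtrace available>']).join("\n")
  end
end
```

If the worker really is stuck in Excon, the dump should show a thread sitting inside the socket read/select path.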
Yeah, I haven't heard of many instances of this, so it may have to do with environment or other particulars. Since we are mostly just relying on select-related timeouts, you could try wrapping things in a …
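For illustration, a minimal sketch of that kind of wrapper using Ruby's standard-library Timeout (the bucket, key, and 120-second budget are made-up placeholders, not values from this thread):

```ruby
require 'fog'      # or 'fog/aws', depending on which gem is in use
require 'timeout'

storage = Fog::Storage.new(
  provider:              'AWS',
  aws_access_key_id:     ENV['AWS_ACCESS_KEY_ID'],
  aws_secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
)

begin
  # Hard upper bound on the whole request, independent of Excon's
  # select-based read/write timeouts.
  Timeout.timeout(120) do
    storage.get_object('example-bucket', 'path/to/key')
  end
rescue Timeout::Error
  warn 'S3 request exceeded the 120s hard limit; logging and moving on'
end
```

Timeout works by raising into the blocked thread, which is blunt, but it does break a worker out of a select that never returns.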
Also, I think you are correct that fog just falls back (at least by default) to the excon default settings for timeouts.
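If explicit values are wanted rather than the defaults, Fog services generally accept a :connection_options hash that is handed through to Excon, something like this (a sketch; the numbers are arbitrary examples, not recommendations):

```ruby
require 'fog'

storage = Fog::Storage.new(
  provider:              'AWS',
  aws_access_key_id:     ENV['AWS_ACCESS_KEY_ID'],
  aws_secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'],
  connection_options: {
    connect_timeout: 10,  # seconds to establish the TCP connection
    read_timeout:    30,  # seconds of read inactivity before Excon raises
    write_timeout:   30   # seconds of write inactivity before Excon raises
  }
)
```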
I wonder if this is happening because we are exceeding the 1024 file descriptor limit that select can handle. I just (roughly) checked the number of file descriptors for every running process on one server and came up with 1249.
But maybe some of those aren't open. This blog post about socket timeout options …
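For what it's worth, the 1024 concern for select(2) applies per process (an fd_set can only describe descriptors 0 through 1023 in the calling process), so a per-process count is the more telling number. A rough Linux-only check from Ruby might look like this (a sketch; the /proc paths assume Linux and sufficient permissions):

```ruby
# Open descriptors for the current process (Linux /proc interface).
own = Dir.glob("/proc/#{Process.pid}/fd/*").size
puts "this process: #{own} open fds"

# Rough total across all processes (entries for processes we cannot
# read are silently skipped by Dir.glob).
total = Dir.glob("/proc/[0-9]*/fd/*").size
puts "all processes: #{total} open fds"
```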
Hmm, yeah, maybe it is a file descriptor issue (might also explain why I haven't heard more widespread problems around this, as I suspect not everyone would be running into that). Those socket options sound promising; unfortunately, as I began digging I found out that those timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on.
See also: http://linux.die.net/man/7/socket So, unfortunately, it doesn't seem like that will help in this particular case.
As we've reduced our load over the last week, we haven't seen these issues. But I'm not sure how that affects FD count, because the same number of processes are running; they just have less stuff in their queue. But then maybe it's just a one-in-a-million that's more likely to hit with 500,000 reads than 150,000. I can certainly spread our workload around onto more servers. That's probably the solution for us for this. I'm surprised that we still have a 1024 FD limit given that there was talk a decade ago about servers handling 10,000 client connections (but maybe that died because computing power got so cheap). I saw Daniel Stenberg wrote about using poll. Oh, yeah, but "Ruby doesn't implement IO#poll." But Square hacked it for production.
Thanks for the updates. poll does sound promising, but the hack seems terrifying. I'll keep it in mind should this recur or become a bigger issue, but it doesn't seem worth the risk/effort presently (given the rarity of it coming up). Going to close this out, but definitely let me know if you have further input, questions, or would like to discuss further.
We have been running into stalled connection issues lately. First we had them in our Google API calls, which we solved by setting the open_timeout value in Hurley. Now we are seeing this in connections to S3 with Fog. I have processes that have hung for hours. Digging into them, they always lock up waiting on select to s3-1.amazonaws.com.

Given the success on the previous issues, I thought I'd see how to set timeouts in Fog. I see there's a Fog.timeout used in the tests, but I didn't actually see that being applied to Excon. It looks like it's used as the default value for wait_for commands for spinning up instances. I suspect that Fog depends on Excon's default 60-second timeout. Can anyone confirm that?

If Fog is not setting timeouts for Excon, and Excon has a default 60-second timeout for connection, read, and write, any thoughts on why we would get locked up on this connection for hours?
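For reference, if Fog really does just inherit Excon's settings, those defaults can also be adjusted process-wide on Excon itself before any Fog connections are built (a sketch; the values are illustrative only):

```ruby
require 'excon'

# Library-wide Excon defaults; connections created afterwards (including
# the ones Fog builds) pick these up unless overridden per connection.
Excon.defaults[:connect_timeout] = 10
Excon.defaults[:read_timeout]    = 60
Excon.defaults[:write_timeout]   = 60
```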