S3 connection hangs; does Fog support timeout? #180

Closed
jeremywadsack opened this issue Sep 4, 2015 · 12 comments

@jeremywadsack

We have been running into stalled connection issues lately. First we had them in our Google API calls, which we solved by setting the open_timeout value in Hurley.

Now we are seeing this in connections to S3 with Fog. I have processes that have hung for hours. Digging into them, they always lock up waiting on select to s3-1.amazonaws.com:

$ strace -p 30973
Process 30973 attached - interrupt to quit
select(15, [14], NULL, NULL, NULL^C <unfinished ...>
Process 30973 detached

$ ll /proc/30973/fd/14
total 0
lrwx------ 1 deploy deploy 64 Sep  4 08:05 14 -> socket:[1150450345]

$ netstat -ea | grep 1150450345
tcp        0      0 xxx-localhost-xxx:44062 s3-1.amazonaws.com:https    ESTABLISHED deploy     1150450345 

Given the success with the previous issue, I thought I'd see how to set timeouts in Fog. I see there's a Fog.timeout used in the tests, but I didn't actually see it being applied to Excon; it looks like it's used as the default value for wait_for commands when spinning up instances.

I suspect that Fog depends on Excon's default 60-second timeout. Can anyone confirm that?

If Fog is not setting timeouts for Excon, and Excon has a default 60-second timeout for connect, read, and write, any thoughts on why we would get locked up on this connection for hours?
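
For reference, this is roughly how I would expect to pass explicit timeouts down to Excon, assuming Fog forwards :connection_options through to it; the key names are Excon's and the values are just placeholders, so treat this as an unverified sketch:

# Sketch only -- assumes :connection_options is handed straight to Excon.
require 'fog'

storage = Fog::Storage.new(
  :provider              => 'AWS',
  :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'],
  :connection_options    => {
    :connect_timeout => 10, # seconds to establish the TCP/SSL connection
    :read_timeout    => 60, # seconds of read inactivity before Excon raises
    :write_timeout   => 60  # seconds of write inactivity before Excon raises
  }
)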

jeremywadsack changed the title from "S3 connection hangs; never closes" to "S3 connection hangs; does Fog support timeout?" on Sep 4, 2015
@lanej
Member

lanej commented Sep 4, 2015

Fog.wait_for and Fog.timeout are meant to wait for operations to finish, not to time out individual request/response cycles.

This is at best an Excon issue. In reviewing timeout conditions in the excon socket library, I'm not seeing any glaring issues.

What excon version are you using?

@jeremywadsack
Author

Thanks for the quick response!

I wanted to verify my understanding here before posting to excon, but I'm happy to move this over to excon if that's appropriate.

Using

excon-0.45.4
fog-1.33.0

I'm working on updating our stack to the latest fog.

@lanej
Member

lanej commented Sep 4, 2015

I'll defer to @geemus on where he would like to field it.

@geemus
Member

geemus commented Sep 4, 2015

I think the excon timeouts will only trigger if there is no activity at all; i.e., if you get one byte every 60 seconds, it would keep going. That said, actually having it happen like that seems exceedingly unlikely. I'd expect that in most cases this would break out like you described.

@jeremywadsack
Author

As would I. But I don't have a deep understanding of sockets. Can you suggest what I can do with the next such process that locks up, to try to debug this further?

If I read the select man page right, it has to do with waiting for a "read" or "write" operation, which makes me wonder how we solved the same behavior on the Google API calls by changing the "connection" timeout.

The fact that we had this exact same behavior from two different libraries on two different endpoints makes me think this is something in my environment, but I don't know where to look next. We have never had this problem with S3 connections before.

@jeremywadsack
Author

Also, the call that is locking up here is Fog::Storage#get_bucket(@s3_bucket, :prefix => prefix). In most cases the size of that list would be < 1 KB (plus the overhead of the XML response from AWS). Even at 1 byte per second, I can't see how that would take 12 hours (we've had two such cases in the last two days).

We do probably 500k of these queries in a day, so this is really rare, but it's blocking. If I could get a Timeout exception and move on, I'd be ok.

@geemus
Member

geemus commented Sep 8, 2015

Yeah, I haven't heard of many instances of this, so it may have to do with your environment or other particulars. Since we mostly just rely on select-related timeouts, you could try wrapping things in a Timeout.timeout (though that can also be iffy in terms of whether it actually triggers). The way it is now seems less than ideal (a total timeout would probably still be useful), but managing and tracking a total timeout is more complicated and messy, and the existing approach has usually seemed good enough. I'm certainly happy to keep discussing, but like you I'm a bit uncertain how exactly we would be hitting this case, or what other next steps we could take to pin it down or eliminate it.
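
As a rough sketch of the Timeout.timeout idea (with the usual caveat that Timeout can interrupt at awkward points): the 300-second cap is just an illustrative value, and storage stands in for however the Fog::Storage connection is built on your side:

require 'timeout'

begin
  # Hard cap on the whole request/response cycle; the value is illustrative.
  response = Timeout.timeout(300) do
    storage.get_bucket(@s3_bucket, :prefix => prefix)
  end
rescue Timeout::Error
  # Log and retry or skip here, instead of hanging indefinitely.
end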

@geemus
Member

geemus commented Sep 8, 2015

Also, I think you are correct that fog just falls back (at least by default) to the excon default settings for timeouts.

@jeremywadsack
Author

I wonder if this is happening because we are exceeding FD_SETSIZE (1024 file descriptors) on the server. Although I would expect that to raise an error, not lock the select call.

I just (roughly) checked the number of file descriptors for every running process on one server and came up with 1249:

for p in `ps ax | cut -c -5`; do sudo ls -l /proc/$p/fd | wc -l; done | awk '{total = total + $1}END{print total}'

But maybe some of those aren't open.

This blog post about IO.select mentions why Timeout.timeout can be a problem (which is probably what you were referring to). The commenter there suggests using SO_SNDTIMEO and SO_RCVTIMEO instead, but that's getting well beyond my understanding of this, and into the bowels of excon.
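
As far as I can tell, setting those options on a plain Ruby socket would look something like the following; this is only a sketch (excon would need to do something like it internally, and the host and 60-second value are placeholders taken from the earlier output):

require 'socket'

# struct timeval { tv_sec = 60, tv_usec = 0 }
timeout = [60, 0].pack('l_2')

sock = TCPSocket.new('s3-1.amazonaws.com', 443)
sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVTIMEO, timeout)
sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDTIMEO, timeout)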

@geemus
Member

geemus commented Sep 9, 2015

Hmm, yeah, maybe it is a file descriptor issue (which might also explain why I haven't heard of more widespread problems around this, as I suspect not everyone would be running into that).

Those socket options sound promising; unfortunately, as I began digging I found this:

SO_RCVTIMEO and SO_SNDTIMEO
Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on.

See also: http://linux.die.net/man/7/socket

So, unfortunately, it doesn't seem like that will help in this particular case.

@jeremywadsack
Author

As we've reduced our load over the last week, we haven't seen these issues. But I'm not sure how that affects the FD count, because the same number of processes are running; they just have less in their queues. Then again, maybe it's a one-in-a-million event that's simply more likely to hit with 500,000 reads than with 150,000.

I can certainly spread our workload across more servers. That's probably the solution for us here.

I'm surprised that we still have a 1024-FD limit, given that there was talk a decade ago about servers handling 10,000 client connections (but maybe that conversation died down because computing power got so cheap). I saw that Daniel Stenberg wrote about using poll instead of select to work around this, because poll doesn't use bitmaps for file descriptors. I have no idea whether it's worth overhauling Excon for that (it's a client library, not a server library), but it might make it more scalable for systems like ours that use lots of parallel client connections.

Oh, yeah, but "Ruby doesn't implement IO#poll." Square did hack it into production, though.

@geemus
Member

geemus commented Sep 10, 2015

Thanks for the updates. poll does sound promising, but the hack seems terrifying. I'll keep it in mind should this recur or become a bigger issue, but it doesn't seem worth the risk or effort at present, given how rarely it comes up. Going to close this out, but definitely let me know if you have further input or questions, or would like to discuss further.
