S3 connection hangs; does Fog support timeout? #180

Closed
jeremywadsack opened this issue Sep 4, 2015 · 12 comments

@jeremywadsack

We have been running into stalled connection issues lately. First we had them in our Google API calls, which we solved by setting the open_timeout value in Hurley.

Now we are seeing this in connections to S3 with Fog. I have processes that have hung for hours. Digging into them, they always lock up waiting on select to s3-1.amazonaws.com:

$ strace -p 30973
Process 30973 attached - interrupt to quit
select(15, [14], NULL, NULL, NULL^C <unfinished ...>
Process 30973 detached

$ ll /proc/30973/fd/14
total 0
lrwx------ 1 deploy deploy 64 Sep  4 08:05 14 -> socket:[1150450345]

$ netstat -ea | grep 1150450345
tcp        0      0 xxx-localhost-xxx:44062 s3-1.amazonaws.com:https    ESTABLISHED deploy     1150450345 

Given the success with the previous issue, I thought I'd see how to set timeouts in Fog. I see there's a Fog.timeout used in the tests, but I didn't actually see it being applied to Excon; it looks like it's used as the default value for wait_for commands when spinning up instances.

I suspect that Fog depends on Excon's default 60-second timeout. Can anyone confirm that?

If Fog is not setting timeouts for Excon, and Excon has a default 60-second timeout for connect, read, and write, any thoughts on why we would get locked up on this connection for hours?
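
For reference, this is roughly how I would expect to pass explicit timeouts down to Excon, assuming Fog forwards :connection_options through to it; the key names are Excon's and the values are just placeholders, so treat this as an unverified sketch:

# Sketch only -- assumes :connection_options is handed straight to Excon.
require 'fog'

storage = Fog::Storage.new(
  :provider              => 'AWS',
  :aws_access_key_id     => ENV['AWS_ACCESS_KEY_ID'],
  :aws_secret_access_key => ENV['AWS_SECRET_ACCESS_KEY'],
  :connection_options    => {
    :connect_timeout => 10, # seconds to establish the TCP/SSL connection
    :read_timeout    => 60, # seconds of read inactivity before Excon raises
    :write_timeout   => 60  # seconds of write inactivity before Excon raises
  }
)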

jeremywadsack changed the title from "S3 connection hangs; never closes" to "S3 connection hangs; does Fog support timeout?" on Sep 4, 2015
@lanej
Member

lanej commented Sep 4, 2015

Fog.wait_for and Fog.timeout are meant to wait for operations to finish, not to time out individual request/response cycles.

This is at best an Excon issue. In reviewing timeout conditions in the excon socket library, I'm not seeing any glaring issues.

What excon version are you using?

@jeremywadsack
Author

Thanks for the quick response!

I wanted to verify my understanding here before posting to excon, but I'm happy to move this over to excon if that's appropriate.

Using

excon-0.45.4
fog-1.33.0

I'm working on updating our stack to the latest fog.

@lanej
Member

lanej commented Sep 4, 2015

I'll defer to @geemus on where he would like to field it.

@geemus
Member

geemus commented Sep 4, 2015

I think the excon timeouts will only trigger if there is no activity at all; i.e., if you get one byte every 60 seconds, it would keep going. That said, actually having it happen like that seems exceedingly unlikely. I'd expect that in most cases this would break out like you described.

@jeremywadsack
Author

As would I. But I don't have a deep understanding of sockets. Can you suggest what I can do with the next such process that locks up, to try to debug this further?

If I read the select man page right, it has to do with waiting for a "read" or "write" operation, which makes me wonder how we solved the same behavior on the Google API calls by changing the "connection" timeout.

The fact that we had this exact same behavior from two different libraries on two different endpoints makes me think this is something in my environment, but I don't know where to look next. We have never had this problem with S3 connections before.

@jeremywadsack
Author

Also, the call that is locking up here is Fog::Storage#get_bucket(@s3_bucket, :prefix => prefix). In most cases the size of that list would be < 1 KB (plus the overhead of the XML response from AWS). Even at 1 byte per second, I can't see how that would take 12 hours (we've had two such cases in the last two days).

We do probably 500k of these queries in a day, so this is really rare, but it's blocking. If I could get a Timeout exception and move on, I'd be ok.

@geemus
Member

geemus commented Sep 8, 2015

Yeah, I haven't heard of many instances of this, so it may have to do with your environment or other particulars. Since we mostly just rely on select-related timeouts, you could try wrapping things in a Timeout.timeout (though that can also be iffy in terms of whether it actually triggers). The way it is now seems less than ideal (a total timeout would probably still be useful), but managing and tracking a total timeout is more complicated and messy, and the existing approach has usually seemed good enough. I'm certainly happy to keep discussing, but like you I'm a bit uncertain how exactly we would be hitting this case, or what other next steps we could take to pin it down or eliminate it.
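
As a rough sketch of the Timeout.timeout idea (with the usual caveat that Timeout can interrupt at awkward points): the 300-second cap is just an illustrative value, and storage stands in for however the Fog::Storage connection is built on your side:

require 'timeout'

begin
  # Hard cap on the whole request/response cycle; the value is illustrative.
  response = Timeout.timeout(300) do
    storage.get_bucket(@s3_bucket, :prefix => prefix)
  end
rescue Timeout::Error
  # Log and retry or skip here, instead of hanging indefinitely.
end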

@geemus
Member

geemus commented Sep 8, 2015

Also, I think you are correct that fog just falls back (at least by default) to the excon default settings for timeouts.

@jeremywadsack
Author

I wonder if this is happening because we are exceeding FD_SETSIZE (1024 file descriptors) on the server. Although I would expect that to raise an error, not lock the select call.

I just (roughly) checked the number of file descriptors for every running process on one server and came up with 1249:

for p in `ps ax | cut -c -5`; do sudo ls -l /proc/$p/fd | wc -l; done | awk '{total = total + $1}END{print total}'

But maybe some of those aren't open.

This blog post about IO.select mentions why Timeout.timeout can be a problem (which is probably what you were referring to). The commenter there suggests using SO_SNDTIMEO and SO_RCVTIMEO instead, but that's getting well beyond my understanding of this, and into the bowels of excon.
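
As far as I can tell, setting those options on a plain Ruby socket would look something like the following; this is only a sketch (excon would need to do something like it internally, and the host and 60-second value are placeholders taken from the earlier output):

require 'socket'

# struct timeval { tv_sec = 60, tv_usec = 0 }
timeout = [60, 0].pack('l_2')

sock = TCPSocket.new('s3-1.amazonaws.com', 443)
sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_RCVTIMEO, timeout)
sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_SNDTIMEO, timeout)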

@geemus
Member

geemus commented Sep 9, 2015

Hmm, yeah, maybe it is a file descriptor issue (which might also explain why I haven't heard of more widespread problems around this, as I suspect not everyone would be running into that).

Those socket options sound promising; unfortunately, as I began digging I found this:

SO_RCVTIMEO and SO_SNDTIMEO
Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on.

See also: http://linux.die.net/man/7/socket

So, unfortunately, it doesn't seem like that will help in this particular case.

@jeremywadsack
Author

As we've reduced our load over the last week, we haven't seen these issues. But I'm not sure how that affects the FD count, because the same number of processes are running; they just have less in their queues. Then again, maybe it's a one-in-a-million event that's simply more likely to hit with 500,000 reads than with 150,000.

I can certainly spread our workload across more servers. That's probably the solution for us here.

I'm surprised that we still have a 1024-FD limit, given that there was talk a decade ago about servers handling 10,000 client connections (but maybe that conversation died down because computing power got so cheap). I saw that Daniel Stenberg wrote about using poll instead of select to work around this, because poll doesn't use bitmaps for file descriptors. I have no idea whether it's worth overhauling Excon for that (it's a client library, not a server library), but it might make it more scalable for systems like ours that use lots of parallel client connections.

Oh, yeah, but "Ruby doesn't implement IO#poll." Square did hack it into production, though.

@geemus
Member

geemus commented Sep 10, 2015

Thanks for the updates. poll does sound promising, but the hack seems terrifying. I'll keep it in mind should this recur or become a bigger issue, but it doesn't seem worth the risk or effort at present, given how rarely it comes up. Going to close this out, but definitely let me know if you have further input or questions, or would like to discuss further.
