Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Downloads hang indefinitely if connection interrupted from lack of timeout #1366

Closed
sneakers-the-rat opened this issue Nov 28, 2023 · 7 comments · Fixed by #1369
Closed

Downloads hang indefinitely if connection interrupted from lack of timeout #1366

sneakers-the-rat opened this issue Nov 28, 2023 · 7 comments · Fixed by #1369
Assignees
Labels

Comments

@sneakers-the-rat
Copy link
Contributor

sneakers-the-rat commented Nov 28, 2023

tell me when you're getting sick of me <3

Downloads happen by iterating over some downloader object, wrapped in some number of attempts, catching HTTPErrors

The core of the downloading logic for the default downloader is here, standard requests streaming:

dandi-cli/dandi/dandiapi.py

Lines 1473 to 1480 in 0971d02

result = self.client.session.get(url, stream=True, headers=headers)
# TODO: apparently we might need retries here as well etc
# if result.status_code not in (200, 201):
result.raise_for_status()
for chunk in result.iter_content(chunk_size=chunk_size):
if chunk: # could be some "keep alive"?
yield chunk
lgr.info("Asset %s successfully downloaded", self.identifier)

The zarr downloader is I think identical in this block?

dandi-cli/dandi/dandiapi.py

Lines 1903 to 1911 in 0971d02

result = self.client.session.get(url, stream=True, headers=headers)
# TODO: apparently we might need retries here as well etc
# if result.status_code not in (200, 201):
result.raise_for_status()
for chunk in result.iter_content(chunk_size=chunk_size):
if chunk: # could be some "keep alive"?
yield chunk
lgr.info("File %s in Zarr %s successfully downloaded", self, self.zarr_id)

A timeout is not used in either case, and the requests docs warn:

Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to hang indefinitely

and elaborates further on how timeouts are handled after the initial connection here

If the connection is interrupted, the downloader hangs indefinitely even if the connection is re-established.

MWE

  • systems tested: MacOS and Debian 12
  • dandi version: 0.58.0

To replicate:

  • Start downloading any dandiset with the dandi download command
  • Toggle WiFi off and on or disconnect and reconnect ethernet cable briefly.

Expected Behavior

Since there is already retry logic around the download command, I think it would be straightforward to

  • set some timeout value in consts or read from env variable
  • pass timeout value to session.get call
  • catch requests.exceptions.Timeout in existing requests.exceptions.HTTPError block, which should already sleep for 5 seconds and retry.

Happy to PR this very minimal change :)

@jwodder
Copy link
Member

jwodder commented Nov 28, 2023

@yarikoptic What should the timeout be set to?

@yarikoptic
Copy link
Member

what would be your judgement? I think 60 sec is good enough to say that connection is dead if no response.

@jwodder
Copy link
Member

jwodder commented Nov 28, 2023

@yarikoptic That seems high. I was expecting something more like 5 or 10 seconds, but I'm having difficulty finding any resources on optimal timeout values.

EDIT: For the record, httpx's default timeout is 5 seconds.

@yarikoptic
Copy link
Member

https://www.python-httpx.org/advanced/#fine-tuning-the-configuration gives an example of using 60 sec for connection time out... Ok, what about mid-ground of 30? ;) here we are to prevent a complete "hang" - waiting for a minute or half of it is IMHO ok compromise.

@sneakers-the-rat
Copy link
Contributor Author

requests has two timeouts :) one timeout for connections, and one timeout for read timeout (the amount of time to wait between bytes sent from the server after a connection is established).

So maybe a shorter connection timeout to raise on a nonresponsive server and a longer read timeout to let a connection heal?

yarikoptic added a commit that referenced this issue Nov 29, 2023
Set 30-second connect & read timeout when downloading files
@sneakers-the-rat
Copy link
Contributor Author

Thanks y'all !

Copy link

🚀 Issue was released in 0.58.2 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants