
slow download #2927

Closed
pdudnik opened this issue Jan 9, 2017 · 8 comments
Labels

- api: storage (Issues related to the Cloud Storage API)
- performance
- priority: p2 (Moderately-important priority; fix may not be included in next release)
- status: blocked (Resolving the issue is dependent on other work)
- type: question (Request for information or clarification; not an issue)

Comments


pdudnik commented Jan 9, 2017

I am trying to download a 400 MB GCS file. I am using https://github.com/GoogleCloudPlatform/google-cloud-python/blob/ce6756fbe3633c74fd742567654565147628f4ba/storage/google/cloud/storage/blob.py. I noticed that by default my download is chunked because of this setting:

https://github.com/GoogleCloudPlatform/google-cloud-python/blob/master/core/google/cloud/streaming/transfer.py#L46

As a result, downloading the file takes roughly 400 requests to GCS, which significantly slows the download.

Is there some clean way I can override that when using blob.download_to_file?
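The arithmetic behind the complaint can be sketched as follows, assuming the 1 MB default chunk size implied by the 400-request count (the `request_count` helper here is illustrative, not library code):

```python
import math

MB = 1024 * 1024

def request_count(object_size: int, chunk_size: int) -> int:
    """Number of HTTP range requests a chunked download issues."""
    return math.ceil(object_size / chunk_size)

# With the 1 MB default, a 400 MB object needs ~400 round trips.
print(request_count(400 * MB, 1 * MB))   # 400
# Raising the chunk size to 10 MB cuts that to 40.
print(request_count(400 * MB, 10 * MB))  # 40
```

Each round trip pays request latency on top of transfer time, which is why the per-chunk overhead dominates for large objects.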

@daspecster added the "api: storage" label Jan 10, 2017

dhermes commented Jan 12, 2017

Thanks for reporting. This is actually deeper than it seems. The "correct" fix is for us to remove the (non-public) google.cloud.streaming code that this relies on and build a better chunking story that doesn't depend on httplib2.

For a fix that works right now, you can duplicate the source but pass chunksize to Download. Also, gsutil (the CLI tool) has a highly optimized strategy for fast downloads.
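In later google-cloud-storage releases, Blob exposes a settable chunk_size, so the workaround no longer requires duplicating the source. A hedged sketch of the idea follows; the stub bucket and blob classes are stand-ins so the example runs without GCS credentials, and in the real API chunk_size must be a multiple of 256 KB:

```python
import io

MB = 1024 * 1024

def download_with_chunk_size(bucket, blob_name, file_obj, chunk_size=10 * MB):
    """Download a blob using a larger chunk size than the 1 MB default.

    `bucket` is anything with a .blob(name) method whose result honors
    .chunk_size and .download_to_file(fh), e.g. a google.cloud.storage Bucket.
    """
    blob = bucket.blob(blob_name)
    blob.chunk_size = chunk_size  # real API requires a multiple of 256 KB
    blob.download_to_file(file_obj)
    return blob

# Stub stand-ins (hypothetical, for illustration only).
class _StubBlob:
    def __init__(self, name):
        self.name = name
        self.chunk_size = None
    def download_to_file(self, fh):
        fh.write(b"fake bytes")

class _StubBucket:
    def blob(self, name):
        return _StubBlob(name)

buf = io.BytesIO()
blob = download_with_chunk_size(_StubBucket(), "big-object", buf, chunk_size=10 * MB)
print(blob.chunk_size == 10 * MB)  # True
```

With a real Bucket in place of the stub, the same helper would issue ~40 requests for a 400 MB object instead of ~400.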

@danoscarmike added the "Status: Acknowledged", "priority: p2", and "type: question" labels Feb 28, 2017
@bjwatson added the "status: blocked" label Feb 28, 2017
bjwatson commented

@lukesneeringer says this is blocked on httplib2 work.


lukesneeringer commented Mar 17, 2017

The correct solution is blocked on #1998.

What would be the benefits and drawbacks of increasing the default in the meantime, though? Would it be okay to make the default chunk size 10 MB instead of 1 MB?

Alternatively, could we find out the size of the file in advance and make the chunk size into some reasonable fragment (say, 2% or 5% of file size, with a 1 MB lower limit)?
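The alternative proposed above could be sketched like this (a hypothetical helper, not library code, with the 2% fraction and 1 MB floor from the comment):

```python
MB = 1024 * 1024

def adaptive_chunk_size(file_size: int, fraction: float = 0.02,
                        floor: int = 1 * MB) -> int:
    """Chunk size as a fraction of the object size, with a 1 MB lower limit."""
    return max(floor, int(file_size * fraction))

print(adaptive_chunk_size(400 * MB) // MB)  # 8  (2% of 400 MB)
print(adaptive_chunk_size(10 * MB) // MB)   # 1  (the 1 MB floor applies)
```

At 2%, any file would be fetched in at most 50 chunks regardless of size, while small files would still use the 1 MB minimum.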


pdudnik commented Mar 17, 2017 via email


evanj commented Jun 3, 2017

I think this is basically a duplicate of #2222

lukesneeringer commented

@dhermes Is this easier now that #1998 is done?


dhermes commented Aug 10, 2017

Not easier or harder. AFAIK there is no perfect magic chunking answer. @thobrla has said before that downloading in a single request (vs. in chunks) is almost always the right answer.


tseaver commented Jan 8, 2018

Duplicate: #2222.


8 participants