Increase chunk size for speeding up file downloads #5501
base: main
Conversation
Original fix: huggingface/huggingface_hub#1267. Not sure this function is actually still called, though.
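For context on why the chunk size matters here: with `response.iter_content`, every chunk is one pass through the Python-level loop, so a larger chunk size means far fewer iterations for the same file. A rough back-of-the-envelope sketch (the 1 GiB file size is just an illustrative assumption):

```python
# Hypothetical illustration: number of iter_content() iterations needed
# to stream a 1 GiB file at different chunk sizes.
file_size = 1024**3  # 1 GiB, an arbitrary example size

for chunk_size in (1024, 1024 * 1024, 10 * 1024 * 1024):
    iterations = -(-file_size // chunk_size)  # ceiling division
    print(f"chunk_size={chunk_size:>9} bytes -> {iterations:>9,} iterations")
# 1 KiB  -> 1,048,576 iterations
# 1 MiB  ->     1,024 iterations
# 10 MiB ->       103 iterations
```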
We simply do GET requests to hf.co to download files from the Hub right now. We may switch to hfh when we update how we do caching. You can try it on any dataset hosted on the Hub.
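As a point of reference, a minimal sketch of this kind of plain streaming GET download (not the actual `datasets` implementation; the URL placeholders and progress handling are assumptions) might look like:

```python
import requests
from tqdm import tqdm

def http_get_sketch(url: str, path: str, chunk_size: int = 10 * 1024 * 1024) -> None:
    """Stream a file with a plain GET request, writing it to disk in chunks."""
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        total = int(response.headers.get("Content-Length", 0)) or None
        with open(path, "wb") as f, tqdm(total=total, unit="B", unit_scale=True, desc="Downloading") as progress:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:  # skip keep-alive chunks
                    f.write(chunk)
                    progress.update(len(chunk))

# Usage (placeholder repo/file names):
# http_get_sketch("https://huggingface.co/datasets/<repo>/resolve/main/<file>", "local_file")
```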
Thanks for the improvement, @Narsil.
Just a question below.
src/datasets/utils/file_utils.py
Outdated
@@ -377,7 +377,7 @@ def http_get(
        desc=desc or "Downloading",
        disable=not logging.is_progress_bar_enabled(),
    ) as progress:
-       for chunk in response.iter_content(chunk_size=1024):
+       for chunk in response.iter_content(chunk_size=1024 * 1024):
To be aligned with huggingface_hub, shouldn't it be 10 MB instead?
Suggested change:
-       for chunk in response.iter_content(chunk_size=1024 * 1024):
+       for chunk in response.iter_content(chunk_size=10 * 1024 * 1024):
We could. In my experiments it made only tiny improvements over 1 MB, but you're right, let's keep it consistent!
I would like to run benchmarks on this too before claiming victory, but I'm short on time this week :(
To do benchmarks, use a machine that has a good network (> 50 MB/s), otherwise you won't see anything.
On DGX or large AWS instances, I was able to get ~500 MB/s of transfer speed where it was stuck at ~50 MB/s before.
For even faster networks, it's impossible to saturate the link in Python afaik (there are still ways to achieve it with hf_transfer, but that is very much non-official, since it's rare to get such bandwidth).
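A minimal sketch of such a benchmark, assuming a large test file (the URL below is a placeholder) and a machine with enough bandwidth, could time the same download at different chunk sizes:

```python
import time
import requests

# Hypothetical benchmark: compare throughput for different chunk sizes.
# Replace URL with any large file you have access to; this is just a placeholder.
URL = "https://huggingface.co/datasets/<some-dataset>/resolve/main/<large-file>"

for chunk_size in (1024, 1024 * 1024, 10 * 1024 * 1024):
    start = time.perf_counter()
    downloaded = 0
    with requests.get(URL, stream=True) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size=chunk_size):
            downloaded += len(chunk)
    elapsed = time.perf_counter() - start
    print(f"chunk_size={chunk_size}: {downloaded / elapsed / 1e6:.1f} MB/s")
```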
Co-authored-by: Albert Villanova del Moral <[email protected]>
I haven't done benchmarks on this. Is there a dataset whose files are hosted on the Hub through CloudFront, so we can have the same setup as in hf_hub?
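One way to check this (a rough sketch; the repo and file names are placeholders, and the header names are assumptions about how CloudFront typically identifies itself) is to resolve a file URL and inspect where the redirect lands and which CDN headers come back:

```python
import requests
from huggingface_hub import hf_hub_url  # builds the resolve/ URL for a file in a repo

# Placeholder repo/file; substitute any dataset you want to inspect.
url = hf_hub_url(repo_id="<user>/<dataset>", filename="<file>", repo_type="dataset")

response = requests.head(url, allow_redirects=True)
print("final URL:", response.url)
# CloudFront-fronted responses usually expose headers such as "Via" or "X-Cache"
# mentioning cloudfront; other origins look different.
for header in ("Via", "X-Cache", "Server"):
    print(header, "=", response.headers.get(header))
```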