Increase the chunk size for faster download #1267

Merged: 2 commits merged into main from Narsil-patch-1 on Dec 15, 2022

Conversation

Narsil (Contributor) commented Dec 15, 2022

- Larger chunks help reduce overhead
- Chunks are not too large, so resuming downloads still works

In my (very small) testing I already get much better download speeds with CloudFront.

Download times for `gpt2` go from just over ~5s to just under ~2s (tested both on AWS and a DGX machine).

Testing setup:

```python
from huggingface_hub import hf_hub_download

filename = hf_hub_download("gpt2", "pytorch_model.bin", force_download=True, cache_dir="/mnt/ramdisk/")
print(filename)
```

Using a tmpfs mount avoids actually writing to disk, so disk I/O is not counted in the measurement.
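
For context, here is a minimal sketch of where the chunk size comes into play when streaming a download with `requests`. The constant name and the helper function are illustrative assumptions, not the literal diff; the 10 MB figure is the chunk size discussed below in this thread:

```python
import requests

# Assumed constant: the thread below mentions 10 MB as a chunk size that keeps
# resuming practical while reducing per-chunk overhead.
DOWNLOAD_CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB

def stream_to_file(url: str, path: str) -> None:
    # Stream the response and write it in large chunks: fewer iterations means
    # less Python-level and HTTP-read overhead than e.g. 1 KB chunks.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
                if chunk:  # skip keep-alive chunks
                    f.write(chunk)
```

The trade-off mirrors the two bullets above: too small a chunk means many iterations and more overhead, while too large a chunk means a failed download has more data to re-fetch when resuming.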
HuggingFaceDocBuilderDev commented Dec 15, 2022

The documentation is not available anymore as the PR was closed or merged.

Narsil requested a review from XciD on December 15, 2022 at 09:29
codecov bot commented Dec 15, 2022

Codecov Report

Base: 56.15% // Head: 83.90% // Increases project coverage by +27.74% 🎉

Coverage data is based on head (5959a3b) compared to base (c0e795b).
Patch coverage: 96.22% of modified lines in pull request are covered.

Additional details and impacted files
```
@@             Coverage Diff             @@
##             main    #1267       +/-   ##
===========================================
+ Coverage   56.15%   83.90%   +27.74%
===========================================
  Files          47       47
  Lines        4566     4597       +31
===========================================
+ Hits         2564     3857     +1293
+ Misses       2002      740     -1262
```

| Impacted Files | Coverage Δ |
| --- | --- |
| `src/huggingface_hub/utils/__init__.py` | 100.00% <ø> (ø) |
| `src/huggingface_hub/_commit_api.py` | 92.44% <93.10%> (+8.29%) ⬆️ |
| `src/huggingface_hub/file_download.py` | 88.57% <100.00%> (+27.01%) ⬆️ |
| `src/huggingface_hub/utils/tqdm.py` | 100.00% <100.00%> (+40.90%) ⬆️ |
| `src/huggingface_hub/repository.py` | 78.76% <0.00%> (+0.88%) ⬆️ |
| `src/huggingface_hub/__init__.py` | 75.75% <0.00%> (+3.03%) ⬆️ |
| `src/huggingface_hub/utils/_runtime.py` | 62.50% <0.00%> (+4.80%) ⬆️ |
| `src/huggingface_hub/utils/_chunk_utils.py` | 100.00% <0.00%> (+7.14%) ⬆️ |
| `src/huggingface_hub/utils/_hf_folder.py` | 100.00% <0.00%> (+13.15%) ⬆️ |
| ... and 25 more | |

Wauplin (Contributor) commented Dec 15, 2022

I can confirm the speed-up on my side (x1.7) as well, even though my local bandwidth is also a limiting factor. Thanks for finding that out @Narsil! I think 10MB is a fine chunk size with respect to resuming downloads.

I have added a few lines of code to your PR 😇 Sorry for that, but it was easier this way. It's just to display the correct filename in the progress bar when downloading from a CDN.

Comment on lines +488 to +492

```python
displayed_name = url
content_disposition = r.headers.get("Content-Disposition")
if content_disposition is not None and "filename=" in content_disposition:
    # Means file is on CDN
    displayed_name = content_disposition.split("filename=")[-1]
```

Wauplin (Contributor) commented Dec 15, 2022

Not related to download speed. I've added that part myself to fix the progress bar naming 🙄
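
As a quick illustration of the snippet above (the header value here is hypothetical, not taken from the PR):

```python
# Hypothetical Content-Disposition value a CDN might return for this file.
content_disposition = "attachment; filename=pytorch_model.bin"

# Same parsing as the snippet above: keep everything after the last
# "filename=", so the progress bar shows the file name rather than the
# long signed CDN URL.
displayed_name = content_disposition.split("filename=")[-1]
print(displayed_name)  # -> pytorch_model.bin
```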

Wauplin (Contributor) left a comment

Thanks!

Narsil (Contributor, Author) commented Dec 15, 2022

> Base: 56.15% // Head: 83.88% // Increases project coverage by +27.73% 🎉

Is this bot in its right mind? I have a hard time believing that line :D

Wauplin (Contributor) commented Dec 15, 2022

> I have a hard time believing that line :D

But you did a really good job!!

Actually, codecov is most likely badly configured 😕 The PR comments are irrelevant, but I still keep them to get the URL to the coverage report itself.

EDIT: the thing is that some tests fail sporadically (HTTP 403 rate limit exceeded, for example), and when they fail on main the report is not uploaded to codecov. So when you make a PR and the tests pass, you get +30% coverage because of that (we have several test jobs in the CI, so the codecov report is uploaded piece by piece, hence the "56.15%" instead of "0%").

XciD (Member) commented Dec 15, 2022

Nice one @Narsil

I think we can still improve by opening multiple connections to the server instead of one, with the `Range: bytes=` header.
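
Not part of this PR, but a rough sketch of that idea, assuming the server exposes `Content-Length` and honors `Range` requests (as S3/CloudFront do); the function and parameter names are made up for illustration:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

def ranged_download(url: str, path: str, n_connections: int = 4) -> None:
    """Sketch: download one file over several connections using Range requests."""
    # Total size from a HEAD request (assumes the server sends Content-Length).
    size = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
    part = -(-size // n_connections)  # ceiling division

    def fetch(i: int) -> bytes:
        # Each worker asks for its own byte range on a separate connection.
        start, end = i * part, min((i + 1) * part - 1, size - 1)
        r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
        r.raise_for_status()
        return r.content

    # Edge cases (tiny files, retries, resume) are ignored in this sketch.
    with ThreadPoolExecutor(max_workers=n_connections) as pool:
        parts = list(pool.map(fetch, range(n_connections)))

    with open(path, "wb") as f:
        for chunk in parts:
            f.write(chunk)
```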

Narsil (Contributor, Author) commented Dec 15, 2022

> Nice one @Narsil
>
> I think we can still improve by opening multiple connections to the server instead of one, with the `Range: bytes=` header.

100%, but this is the low-hanging fruit.
CloudFront is actually faster than S3, by the way (not by much).

LysandreJik (Member) left a comment

Cool!

Wauplin merged commit d147cbd into main on Dec 15, 2022
Wauplin deleted the Narsil-patch-1 branch on December 15, 2022 at 15:46
Narsil added a commit to huggingface/datasets that referenced this pull request on Feb 3, 2023:
Original fix: huggingface/huggingface_hub#1267
Not sure this function is actually still called though.