Increase the chunk size for faster download #1267

Merged: 2 commits merged into main from Narsil-patch-1 on Dec 15, 2022

Conversation

Narsil (Contributor) commented Dec 15, 2022

- Larger chunks help reduce overhead
- Chunks are not too large, so resuming downloads still works

In my (very small) testing I already get much better download speeds with CloudFront.

Download times for `gpt2` go from just over ~5s to just under ~2s (tested both on AWS and a DGX machine).

Testing setup:

```python
from huggingface_hub import hf_hub_download

filename = hf_hub_download("gpt2", "pytorch_model.bin", force_download=True, cache_dir="/mnt/ramdisk/")
print(filename)
```

Using a tmpfs mount avoids actually writing to disk, so disk I/O is not counted in the measurement.
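
For context, here is a minimal sketch of where the chunk size comes into play when streaming a download with `requests`. The constant name and the helper function are illustrative assumptions, not the literal diff; the 10 MB figure is the chunk size discussed below in this thread:

```python
import requests

# Assumed constant: the thread below mentions 10 MB as a chunk size that keeps
# resuming practical while reducing per-chunk overhead.
DOWNLOAD_CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB

def stream_to_file(url: str, path: str) -> None:
    # Stream the response and write it in large chunks: fewer iterations means
    # less Python-level and HTTP-read overhead than e.g. 1 KB chunks.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=DOWNLOAD_CHUNK_SIZE):
                if chunk:  # skip keep-alive chunks
                    f.write(chunk)
```

The trade-off mirrors the two bullets above: too small a chunk means many iterations and more overhead, while too large a chunk means a failed download has more data to re-fetch when resuming.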
HuggingFaceDocBuilderDev commented Dec 15, 2022

The documentation is not available anymore as the PR was closed or merged.

Narsil requested a review from XciD on December 15, 2022 at 09:29
codecov bot commented Dec 15, 2022

Codecov Report

Base: 56.15% // Head: 83.90% // Increases project coverage by +27.74% 🎉

Coverage data is based on head (5959a3b) compared to base (c0e795b).
Patch coverage: 96.22% of modified lines in pull request are covered.

Additional details and impacted files
```
@@             Coverage Diff             @@
##             main    #1267       +/-   ##
===========================================
+ Coverage   56.15%   83.90%   +27.74%
===========================================
  Files          47       47
  Lines        4566     4597       +31
===========================================
+ Hits         2564     3857     +1293
+ Misses       2002      740     -1262
```

| Impacted Files | Coverage Δ |
| --- | --- |
| `src/huggingface_hub/utils/__init__.py` | 100.00% <ø> (ø) |
| `src/huggingface_hub/_commit_api.py` | 92.44% <93.10%> (+8.29%) ⬆️ |
| `src/huggingface_hub/file_download.py` | 88.57% <100.00%> (+27.01%) ⬆️ |
| `src/huggingface_hub/utils/tqdm.py` | 100.00% <100.00%> (+40.90%) ⬆️ |
| `src/huggingface_hub/repository.py` | 78.76% <0.00%> (+0.88%) ⬆️ |
| `src/huggingface_hub/__init__.py` | 75.75% <0.00%> (+3.03%) ⬆️ |
| `src/huggingface_hub/utils/_runtime.py` | 62.50% <0.00%> (+4.80%) ⬆️ |
| `src/huggingface_hub/utils/_chunk_utils.py` | 100.00% <0.00%> (+7.14%) ⬆️ |
| `src/huggingface_hub/utils/_hf_folder.py` | 100.00% <0.00%> (+13.15%) ⬆️ |
| ... and 25 more | |

Wauplin (Contributor) commented Dec 15, 2022

I can confirm the speed-up on my side (x1.7) as well, even though my local bandwidth is also a limiting factor. Thanks for finding that out @Narsil! I think 10MB is a fine chunk size with respect to resuming downloads.

I have added a few lines of code to your PR 😇 Sorry for that, but it was easier this way. It's just to display the correct filename in the progress bar when downloading from a CDN.

Comment on lines +488 to +492

```python
displayed_name = url
content_disposition = r.headers.get("Content-Disposition")
if content_disposition is not None and "filename=" in content_disposition:
    # Means file is on CDN
    displayed_name = content_disposition.split("filename=")[-1]
```

Wauplin (Contributor) commented Dec 15, 2022

Not related to download speed. I've added that part myself to fix the progress bar naming 🙄
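
As a quick illustration of the snippet above (the header value here is hypothetical, not taken from the PR):

```python
# Hypothetical Content-Disposition value a CDN might return for this file.
content_disposition = "attachment; filename=pytorch_model.bin"

# Same parsing as the snippet above: keep everything after the last
# "filename=", so the progress bar shows the file name rather than the
# long signed CDN URL.
displayed_name = content_disposition.split("filename=")[-1]
print(displayed_name)  # -> pytorch_model.bin
```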

Wauplin (Contributor) left a comment

Thanks!

Narsil (Contributor, Author) commented Dec 15, 2022

> Base: 56.15% // Head: 83.88% // Increases project coverage by +27.73% 🎉

Is this bot in its right mind? I have a hard time believing that line :D

Wauplin (Contributor) commented Dec 15, 2022

> I have a hard time believing that line :D

But you did a really good job!!

Actually, codecov is most likely badly configured 😕 The PR comments are irrelevant, but I still keep them to get the URL to the coverage report itself.

EDIT: the thing is that some tests fail sporadically (HTTP 403 rate limit exceeded, for example), and when they fail on main the report is not uploaded to codecov. So when you make a PR and the tests pass, you get +30% coverage because of that (we have several test jobs in the CI, so the codecov report is uploaded piece by piece, hence the "56.15%" instead of "0%").

XciD (Member) commented Dec 15, 2022

Nice one @Narsil

I think we can still improve by opening multiple connections to the server instead of one, with the `Range: bytes=` header.
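
Not part of this PR, but a rough sketch of that idea, assuming the server exposes `Content-Length` and honors `Range` requests (as S3/CloudFront do); the function and parameter names are made up for illustration:

```python
import requests
from concurrent.futures import ThreadPoolExecutor

def ranged_download(url: str, path: str, n_connections: int = 4) -> None:
    """Sketch: download one file over several connections using Range requests."""
    # Total size from a HEAD request (assumes the server sends Content-Length).
    size = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
    part = -(-size // n_connections)  # ceiling division

    def fetch(i: int) -> bytes:
        # Each worker asks for its own byte range on a separate connection.
        start, end = i * part, min((i + 1) * part - 1, size - 1)
        r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
        r.raise_for_status()
        return r.content

    # Edge cases (tiny files, retries, resume) are ignored in this sketch.
    with ThreadPoolExecutor(max_workers=n_connections) as pool:
        parts = list(pool.map(fetch, range(n_connections)))

    with open(path, "wb") as f:
        for chunk in parts:
            f.write(chunk)
```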

Narsil (Contributor, Author) commented Dec 15, 2022

> Nice one @Narsil
>
> I think we can still improve by opening multiple connections to the server instead of one, with the `Range: bytes=` header.

100%, but this is the low-hanging fruit.
CloudFront is actually faster than S3, by the way (not by much).

LysandreJik (Member) left a comment

Cool!

Wauplin merged commit d147cbd into main on Dec 15, 2022
Wauplin deleted the Narsil-patch-1 branch on December 15, 2022 at 15:46
Narsil added a commit to huggingface/datasets that referenced this pull request on Feb 3, 2023:
Original fix: huggingface/huggingface_hub#1267
Not sure this function is actually still called though.