Increase the chunk size for faster download #1267
Conversation
- Larger chunks help reduce overhead
- Not too large, so resume still works

In my very small testing I already get much better download speeds with CloudFront. Download times for `gpt2` go from ~5s+ to ~2s- (tested both on AWS and a DGX).

Testing setup:

```python
from huggingface_hub import hf_hub_download

filename = hf_hub_download("gpt2", "pytorch_model.bin", force_download=True, cache_dir="/mnt/ramdisk/")
print(filename)
```

Using a tmpfs part of the disk avoids actually writing to disk and taking that into account.
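For context, here is a minimal sketch of the kind of change being discussed: stream the response and write it in large chunks so per-chunk overhead is amortized, while keeping chunks small enough that a resumed download does not re-fetch too much. The function name and the `CHUNK_SIZE` constant are illustrative, not the actual `hf_hub_download` internals; the 10MB value mirrors the number discussed below.

```python
import requests

CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB; hypothetical constant matching the value discussed in this PR

def stream_download(url: str, dest_path: str) -> None:
    """Sketch: stream a file to disk with a large chunk size to reduce per-chunk overhead."""
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(dest_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
                if chunk:  # skip keep-alive chunks
                    f.write(chunk)
```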
The documentation is not available anymore as the PR was closed or merged.
Codecov Report
Base: 56.15% // Head: 83.90% // Increases project coverage by +27.74%

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1267      +/-   ##
===========================================
+ Coverage   56.15%   83.90%   +27.74%
===========================================
  Files          47       47
  Lines        4566     4597      +31
===========================================
+ Hits         2564     3857    +1293
+ Misses       2002      740    -1262

☔ View full report at Codecov.
I can confirm the speed-up on my side as well (x1.7), even though my local bandwidth is also a limiting factor. Thanks for finding that out @Narsil! I think 10MB is fine in terms of chunk size for resumable downloads. I have added a few lines of code to your PR 😇 Sorry for that, but it was easier this way. It's just to display the correct filename in the progress bar when downloading from a CDN.
```python
displayed_name = url
content_disposition = r.headers.get("Content-Disposition")
if content_disposition is not None and "filename=" in content_disposition:
    # Means file is on CDN
    displayed_name = content_disposition.split("filename=")[-1]
```
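For illustration, a self-contained sketch (not the library's actual code) of how the name parsed from `Content-Disposition` above could label a `tqdm` progress bar while streaming with the larger chunk size; `download_with_progress` and its parameters are hypothetical:

```python
import requests
from tqdm import tqdm

def download_with_progress(url: str, dest_path: str, chunk_size: int = 10 * 1024 * 1024) -> None:
    """Sketch: stream a download, labeling the progress bar with the CDN filename when available."""
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        displayed_name = url
        content_disposition = r.headers.get("Content-Disposition")
        if content_disposition is not None and "filename=" in content_disposition:
            # The filename header is only set when the file is served from the CDN
            displayed_name = content_disposition.split("filename=")[-1]
        total = int(r.headers.get("Content-Length", 0))
        with open(dest_path, "wb") as f, tqdm(
            total=total, unit="B", unit_scale=True, desc=displayed_name
        ) as progress:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                progress.update(len(chunk))
```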
Not related to download speed. I've added that part myself to fix the progress bar naming 🙄
Thanks !
Is this bot in its right mind? I have a hard time believing that line :D
But you did a really good job!! Actually, codecov is most likely badly configured 😕 Its PR comments are irrelevant, but I still keep them to get the URL to the coverage report itself.

EDIT: the thing is that some tests fail sporadically (HTTP 403 rate limit exceeded, for example), and when they fail on main the report is not uploaded to codecov. So when you make a PR and the tests pass, you get +30% coverage because of that (we have several test jobs in the CI, so the codecov report is uploaded piece by piece, hence the "56.15%" instead of "0%").
Nice one @Narsil! I think we can still improve by opening multiple connections to the server instead of one.
100%, but this is the low-hanging fruit.
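For illustration, a hedged sketch of that follow-up idea (not part of this PR): open several connections by fetching byte ranges concurrently with HTTP `Range` requests and writing each range at its offset. `parallel_download` is a hypothetical helper and assumes the server honors `Range` requests:

```python
import concurrent.futures
import requests

def parallel_download(url: str, dest_path: str, n_connections: int = 4) -> None:
    """Sketch: download one file over several connections using HTTP Range requests."""
    total = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
    part = total // n_connections

    def fetch(i: int) -> tuple[int, bytes]:
        start = i * part
        end = total - 1 if i == n_connections - 1 else (i + 1) * part - 1
        r = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
        r.raise_for_status()
        return start, r.content

    with open(dest_path, "wb") as f:
        f.truncate(total)  # pre-size the file so each range can be written at its offset
        with concurrent.futures.ThreadPoolExecutor(max_workers=n_connections) as pool:
            for start, data in pool.map(fetch, range(n_connections)):
                f.seek(start)
                f.write(data)
```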
Cool!
Original fix: huggingface/huggingface_hub#1267. Not sure this function is actually still called, though.