
Compute sha by chunks when uploading #1296

Merged: 2 commits into main, Jan 19, 2023

Conversation

@Wauplin (Contributor) commented on Jan 17, 2023

Fix #1294 (thanks @wfjsw).

When computing the sha of a file before uploading it, the chunk size was defined but never used, so the entire file was loaded into memory, which is problematic for huge files. This PR fixes that.

I also refactored the code to make it more readable. I tested it quickly on my machine and found that a chunk size of 1MB seems optimal in terms of performance (very rough estimate) while still being fine in terms of memory footprint.
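A minimal sketch of the chunked approach being described, assuming a helper along the lines of the one in src/huggingface_hub/utils/sha.py; the name `sha_fileobj` and the 1MB default are illustrative, not necessarily the exact code merged here:

```python
import hashlib
from typing import BinaryIO, Optional


def sha_fileobj(fileobj: BinaryIO, chunk_size: Optional[int] = None) -> bytes:
    """Compute the sha256 digest of a file object, reading it chunk by chunk.

    Memory usage stays bounded by `chunk_size` instead of growing with the
    size of the file.
    """
    chunk_size = chunk_size if chunk_size is not None else 1024 * 1024  # 1MB
    sha = hashlib.sha256()
    while True:
        chunk = fileobj.read(chunk_size)  # the fix: actually pass chunk_size
        if not chunk:
            break
        sha.update(chunk)
    return sha.digest()


# Usage: hash a file on disk without loading it fully into memory.
# with open("path/to/large_file.bin", "rb") as f:
#     digest = sha_fileobj(f)
```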

@Wauplin (Contributor, Author) commented on Jan 17, 2023

(For the record, here is the small benchmark I ran on a 3GB file:

chunk size (bytes), time (s)
500, 2.786808490753174
1000, 2.576904535293579
1000000, 1.8990943431854248
10000000, 1.995957374572754
100000000, 2.643815755844116

I ran it multiple times and got consistent results.)
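For reference, a self-contained timing loop along these lines could look like the sketch below; the file path and output format are illustrative, not the script actually used for the numbers above:

```python
import hashlib
import time

# Hypothetical re-run of the benchmark above; the path is illustrative
# and should point at a large local file (e.g. ~3GB).
PATH = "big_file.bin"

for chunk_size in (500, 1_000, 1_000_000, 10_000_000, 100_000_000):
    sha = hashlib.sha256()
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        # Read and hash the file chunk by chunk.
        while chunk := f.read(chunk_size):
            sha.update(chunk)
    print(f"{chunk_size:>11}  {time.perf_counter() - start:.3f}s")
```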

@HuggingFaceDocBuilderDev commented on Jan 17, 2023

The documentation is not available anymore as the PR was closed or merged.

@codecov (bot) commented on Jan 17, 2023

Codecov Report

❗ No coverage uploaded for pull request base (main@f173842).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head cb1c89b differs from the pull request's most recent head 3f965cd. Consider uploading reports for commit 3f965cd to get more accurate results.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1296   +/-   ##
=======================================
  Coverage        ?   83.41%           
=======================================
  Files           ?       47           
  Lines           ?     4690           
  Branches        ?        0           
=======================================
  Hits            ?     3912           
  Misses          ?      778           
  Partials        ?        0           
Impacted Files Coverage Δ
src/huggingface_hub/utils/sha.py 100.00% <100.00%> (ø)


@julien-c (Member) left a comment


cc'ing @SBrandeis who wrote the initial version, but yeah LGTM (I think we just forgot to pass the param)

@Wauplin (Contributor, Author) commented on Jan 18, 2023

> (I think we just forgot to pass the param)

Yup, exactly.
The refactor is just to make it more concise; the logic is exactly the same.
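For illustration only, a hypothetical before/after of the kind of one-line fix being discussed here (not the exact diff merged in this PR):

```python
# Before (buggy): chunk_size was defined but never passed to read(),
# so a single call pulled the entire file into memory.
chunk = fileobj.read()

# After (fixed): read at most chunk_size bytes per iteration.
chunk = fileobj.read(chunk_size)
```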

@Wauplin Wauplin added this to the v0.12 milestone Jan 19, 2023
@Wauplin Wauplin merged commit 57fe360 into main Jan 19, 2023
@Wauplin Wauplin deleted the 1294-compute-sha-by-chunks-to-preserve-memory branch January 19, 2023 15:11
Successfully merging this pull request may close these issues:

HfApi.upload_file consumes huge amount of memory when hashing (#1294)