Harmonize warnings/errors/documentation related to file size limit #995

Closed
Wauplin opened this issue Aug 17, 2022 · 6 comments
Labels
discussion documentation Improvements or additions to documentation

Comments

@Wauplin
Contributor

Wauplin commented Aug 17, 2022

In general, the file size limit is not made very clear to the user, especially when uploading a file to the Hub via the HTTP endpoint. A discussion started as part of #847 on whether we should throw an explicit error and provide guidance when a file is too big to be uploaded (see comments #847 (comment), #847 (comment), #847 (comment), #847 (comment) and #847 (comment)).

In the documentation, we also mention a limit of 50GB in upload_file and a limit of 5GB before using LFS in general.

To be discussed:

  1. What is the actual limit for a single file to be uploaded via HTTP? Is there even a limit?
  2. What limit do we want to set for a single file to be uploaded? In particular, @LysandreJik mentioned that big files are not served through the CDN.
    a. A possibility I see here is to raise a ValueError if the file size is above 30GB (hard limit) and raise a warning if the file size is above 10GB (soft limit).
  3. How do we document that consistently?
    a. I would propose having a dedicated page/section in the documentation that each method (create_commit, upload_file, upload_folder, push_to_hub, ...) could refer to.
  4. (extra) Do we want to provide a utility helper to cut a big file into shards, either before uploading or on the fly? (Note: this is not exactly the same as uploading a big LFS file in chunks.)
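
A minimal sketch of the soft/hard limit check proposed in 2.a. The helper name `check_upload_size` and the constants are hypothetical and not part of huggingface_hub; the sketch also assumes decimal gigabytes (10**9 bytes), which the discussion does not specify.

```python
import warnings

GB = 10**9  # assumption: decimal gigabytes; the thread does not say which unit is meant

SOFT_LIMIT = 10 * GB  # proposed threshold for a warning
HARD_LIMIT = 30 * GB  # proposed threshold for an error


def check_upload_size(
    size_in_bytes: int,
    soft_limit: int = SOFT_LIMIT,
    hard_limit: int = HARD_LIMIT,
) -> None:
    """Hypothetical helper: raise above the hard limit, warn above the soft limit."""
    if size_in_bytes > hard_limit:
        raise ValueError(
            f"File is {size_in_bytes / GB:.1f}GB, above the hard limit of "
            f"{hard_limit / GB:.0f}GB. Consider splitting it into smaller files."
        )
    if size_in_bytes > soft_limit:
        warnings.warn(
            f"File is {size_in_bytes / GB:.1f}GB, above the recommended limit of "
            f"{soft_limit / GB:.0f}GB. Smaller files are encouraged.",
            UserWarning,
        )
```

A caller would typically pass `os.path.getsize(path)` before starting the upload, so that the check fails fast instead of after a long transfer.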
@Wauplin
Contributor Author

Wauplin commented Aug 17, 2022

Ping @julien-c @Pierrci @SBrandeis for moon-landing-related questions.
Ping @nateraw since you showed interest in this matter in previous comments.

@Pierrci
Member

Pierrci commented Aug 17, 2022

To clarify all the different numbers a bit:

  • A file must be uploaded through LFS if:
    • It's a binary file and its size is > 1MB
    • It's not a binary file and its size is > 10MB
  • 5GB is the threshold after which an LFS file needs to be uploaded through our multipart custom transfer agent
  • 50GB (originally stated as 30GB) is the size limit for cached files w/ CloudFront. EDIT (21/03/2024): cached file limit is now 50GB. See discussion here [internal].
  • 50GB is the size limit for served files w/ CloudFront <- this means files b/t 30 and 50GB can be served through CloudFront, though they're not cached by it (there is no doc about this upper limit, this is from our own observations) EDIT (21/03/2024): it looks like now max cached size == max served size
  1. As a result of those limits, the maximum file size that we allow to be uploaded is also 50GB.
  2. An error is already returned by the server if > 50GB (by the /preupload endpoint, and also the deprecated /upload one). Encouraging smaller files is a good idea though, so why not a warning for when > 10GB.

I would say yes for 3., for 4. I would say why not if it's gonna be useful to downstream libs, but I will let others chime in :)
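
The thresholds listed above can be summarized in a small sketch. The function name `upload_route` and the returned labels are hypothetical and do not correspond to any actual client or server code; decimal units (10**6 and 10**9 bytes) are also an assumption. The function only returns the minimum route a file of a given size requires.

```python
MB = 10**6  # assumption: decimal units; the thread does not specify
GB = 10**9

LFS_BINARY_THRESHOLD = 1 * MB   # binary files above this must go through LFS
LFS_TEXT_THRESHOLD = 10 * MB    # non-binary files above this must go through LFS
MULTIPART_THRESHOLD = 5 * GB    # LFS files above this need the multipart transfer agent
MAX_FILE_SIZE = 50 * GB         # the server rejects anything larger


def upload_route(size_in_bytes: int, is_binary: bool) -> str:
    """Hypothetical helper mapping a file size to the upload route it requires."""
    if size_in_bytes > MAX_FILE_SIZE:
        raise ValueError("File exceeds the 50GB maximum upload size.")
    lfs_threshold = LFS_BINARY_THRESHOLD if is_binary else LFS_TEXT_THRESHOLD
    if size_in_bytes <= lfs_threshold:
        return "regular"
    if size_in_bytes <= MULTIPART_THRESHOLD:
        return "lfs"
    return "lfs-multipart"
```

For example, a 2MB binary file requires LFS while a 2MB text file does not, and any LFS file above 5GB takes the multipart route.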

Also cc @Kakulukian @allendorf for information

@julien-c
Member

Also yes from me for 3. (inside hub-docs probably?)

For 4. I think it's on the downstream libraries to do it, because they have more context to do it in a better way. For instance, transformers has utilities to split very large checkpoint files into multiple files, where each file is still a valid weight file (containing certain layers, for instance).
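
To illustrate why a framework-aware split differs from cutting raw bytes, here is a rough sketch of greedy sharding at the tensor level. It is not the transformers implementation; `shard_by_size` and its inputs are hypothetical, and it only groups tensor names by size so that every shard contains whole tensors.

```python
from typing import Dict, List


def shard_by_size(tensor_sizes: Dict[str, int], max_shard_size: int) -> List[Dict[str, int]]:
    """Hypothetical helper: greedily group whole tensors into shards.

    `tensor_sizes` maps a tensor name to its size in bytes. A tensor is never
    split, so one larger than `max_shard_size` ends up alone in its own shard.
    """
    shards: List[Dict[str, int]] = []
    current: Dict[str, int] = {}
    current_size = 0
    for name, size in tensor_sizes.items():
        # Start a new shard if adding this tensor would exceed the limit.
        if current and current_size + size > max_shard_size:
            shards.append(current)
            current, current_size = {}, 0
        current[name] = size
        current_size += size
    if current:
        shards.append(current)
    return shards


sizes = {"embed": 4, "layer.0": 3, "layer.1": 3, "head": 2}
print(shard_by_size(sizes, max_shard_size=6))
# [{'embed': 4}, {'layer.0': 3, 'layer.1': 3}, {'head': 2}]
```

Because each shard holds complete tensors, it can be saved and loaded as a standalone weight file, which a generic byte-level splitter cannot guarantee.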

cc @LysandreJik @sgugger too

@osanseviero
Contributor

For 3, I think hub-docs is a better place than huggingface_hub. Maybe somewhere under the Repositories category

@LysandreJik
Member

Indeed, we had started something with @muellerzr for 4., but in the end having framework-specific approaches made much more sense. I'd focus on 3. and do 4. only if we see extensive requests.

As long as the page in 3. is heavily referenced from huggingface_hub's docs, fine for me to have it in hub-docs if you all think it makes more sense to have it there.

@Wauplin Wauplin added the documentation Improvements or additions to documentation label Jan 26, 2023
@Wauplin
Contributor Author

Wauplin commented Sep 29, 2023

Done as part of #1565.

@Wauplin Wauplin closed this as completed Sep 29, 2023