From 438d639f9236c34dc44d9e312e71d3d1df64a021 Mon Sep 17 00:00:00 2001 From: Lucain Pouget Date: Thu, 20 Jul 2023 18:02:19 +0200 Subject: [PATCH 1/8] tips about large uplaods --- docs/source/guides/repository.md | 2 +- docs/source/guides/upload.md | 97 +++++++++++++++++++++++++++++--- 2 files changed, 91 insertions(+), 8 deletions(-) diff --git a/docs/source/guides/repository.md b/docs/source/guides/repository.md index f0987a52ab..13c324e665 100644 --- a/docs/source/guides/repository.md +++ b/docs/source/guides/repository.md @@ -71,7 +71,7 @@ Specify the `repo_id` of the repository you want to delete: >>> delete_repo(repo_id="lysandre/my-corrupted-dataset", repo_type="dataset") ``` -### Clone a repository (only for Spaces) +### Duplicate a repository (only for Spaces) In some cases, you want to copy someone else's repo to adapt it to your use case. This is possible for Spaces using the [`duplicate_space`] method. It will duplicate the whole repository. diff --git a/docs/source/guides/upload.md b/docs/source/guides/upload.md index b013cb8244..a8317cbeba 100644 --- a/docs/source/guides/upload.md +++ b/docs/source/guides/upload.md @@ -106,8 +106,13 @@ but before that, all previous logs on the repo on deleted. All of this in a sing ... ) ``` -## Non-blocking upload +## Advanced features +In most cases, you won't need more than [`upload_file`] and [`upload_folder`] to upload your files to the Hub. +However, `huggingface_hub` has more advanced features to make things easier. Let's have a look at them! + + +### Non-blocking uploads In some cases, you want to push data without blocking your main thread. This is particularly useful to upload logs and artifacts while continuing a training. To do so, you can use the `run_as_future` argument in both [`upload_file`] and [`upload_folder`]. This will return a [`concurrent.futures.Future`](https://docs.python.org/3/library/concurrent.futures.html#future-objects) @@ -154,7 +159,7 @@ Future(...) Future(...) ``` -## Upload a folder by chunks +### Upload a folder by chunks [`upload_folder`] makes it easy to upload an entire folder to the Hub. However, for large folders (thousands of files or hundreds of GB), it can still be challenging. If you have a folder with a lot of files, you might want to upload @@ -193,7 +198,7 @@ notice. -## Scheduled uploads +### Scheduled uploads The Hugging Face Hub makes it easy to save and version data. However, there are some limitations when updating the same file thousands of times. For instance, you might want to save logs of a training process or user feedback on a deployed Space. In these cases, uploading the data as a dataset on the Hub makes sense, but it can be hard to do properly. The main reason is that you don't want to version every update of your data because it'll make the git repository unusable. The [`CommitScheduler`] class offers a solution to this problem. @@ -262,14 +267,14 @@ For more details about the [`CommitScheduler`], here is what you need to know: `scheduler.lock` lock to ensure thread-safety. The lock is blocked only when the scheduler scans the folder for changes, not when it uploads data. You can safely assume that it will not affect the user experience on your Space. -### Space persistence demo +#### Space persistence demo Persisting data from a Space to a Dataset on the Hub is the main use case for [`CommitScheduler`]. Depending on the use case, you might want to structure your data differently. The structure has to be robust to concurrent users and restarts which often implies generating UUIDs. 
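For illustration, here is a minimal sketch of such a structure. The repo name, folder name, and `save_feedback` helper are hypothetical: each session writes to its own UUID-named file inside the folder watched by the [`CommitScheduler`], and takes the `scheduler.lock` mentioned above so that the background scan never snapshots a half-written file.

```py
# Minimal sketch: persist user feedback from a Space into a dataset repo.
# The repo_id and folder below are placeholders - adapt them to your own setup.
import json
from pathlib import Path
from uuid import uuid4

from huggingface_hub import CommitScheduler

feedback_folder = Path("feedback_data")
feedback_folder.mkdir(parents=True, exist_ok=True)

# Push the content of `feedback_folder` to the dataset repo every 5 minutes
scheduler = CommitScheduler(
    repo_id="my-username/user-feedback",  # hypothetical dataset repo
    repo_type="dataset",
    folder_path=feedback_folder,
    path_in_repo="raw",
    every=5,  # minutes
)

# Each session writes to its own UUID-named file so concurrent users never collide
session_file = feedback_folder / f"{uuid4()}.json"

def save_feedback(feedback: str) -> None:
    # Hold the scheduler lock while writing so a half-written file is never uploaded
    with scheduler.lock:
        with session_file.open("a") as f:
            f.write(json.dumps({"feedback": feedback}) + "\n")
```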
Besides robustness, you should upload data in a format readable by the 🤗 Datasets library for later reuse. We created a [Space](https://huggingface.co/spaces/Wauplin/space_to_dataset_saver) that demonstrates how to save several different data formats (you may need to adapt it for your own specific needs). -### Custom uploads +#### Custom uploads [`CommitScheduler`] assumes your data is append-only and should be uploading "as is". However, you might want to customize the way data is uploaded. You can do that by creating a class inheriting from [`CommitScheduler`] @@ -316,7 +321,7 @@ containing different implementations depending on your use cases. -## create_commit +### create_commit The [`upload_file`] and [`upload_folder`] functions are high-level APIs that are generally convenient to use. We recommend trying these functions first if you don't need to work at a lower level. However, if you want to work at a commit-level, @@ -371,11 +376,89 @@ In addition to [`upload_file`] and [`upload_folder`], the following functions al For more detailed information, take a look at the [`HfApi`] reference. -## Push files with Git LFS +## Tips and tricks for large uploads + +The above guide shows how to technically upload files to the Hub. However there are a few things to know when you're +trying to upload a large quantity of data. Given the time it takes to stream the data, it can be very annoying to +get a commit to fail at the end of the process. We gathered a list of tips and recommendations to think about before +starting the upload. + +### Technical limitations + +What are we talking about when we say "large uploads" and what are their associated limitations? Large uploads can be +very diverse, from repositories with a few huge files (e.g. model weights) to repositories with thousands of small files +(e.g. an image dataset). Each repository type will have its own limitations so here is a list of factors to consider: + +- **Repository size**: The total size of the data you're planning to upload. There is no hard limit on the size of a repository on +Hub. However if you plan to upload hundreds of GBs or even TBs of data, we would be grateful if you could let us know +in advance to be aware of it. It's also better to get assistance if you have questions on your process. +- **Number of files**: There are two practical limits about the number of files in your repo. Under the hood, the Hub uses +Git to version the data which has its limitations: + - The total number of files in the repo. It cannot exceed 1M files. If you are planning to upload more than 1M files, + we recommend to merge the data into fewer files. For example, json files can be merged into a single jsonl file or + large datasets can be exported as parquet. + - The maximum number of files per folder. It cannot exceed 10k files per folder. A simple solution for that is to + create a repository structure that uses subdirectories. A repo with 1k folder from `000/` to `999/` each one of + them containing at most 1000 files is already enough. +- **File size**: individual files also have a limit on their maximum size. In the case of uploading large files (e.g. +model weights), we strongly recommend to split them **into chunks of around 10GB each**. There are a few reasons for this: + - Uploading and downloading smaller files is easier both for you and the other users. Connection issues can always + happen when streaming data and smaller files avoid resuming from the beginning in case of errors. 
+ - Files are served to the users using CloudFront. From our experience, files above 30GB are not cached by this service + leading to a slower download speed. + - A hard-limit of 50GB has been set on the Hub. If you try to commit larger files, an error will be returned by the server. +- **Number of commits**: The total number of commits on your repo history. There is no hard limit for this. However, from +our experience, the user experience on the Hub start to degrade after a few thousands commits. We are always working to +improve the service but one must always remember that a git repository is not meant to work as a database with a lot of +writings. In case your repo's history gets very large, it is always possible to squash all the commits to get a +fresh start. +- **Number of operations per commit**: Once again, there is no hard-limit here. When a commit is done on the Hub, each +git operation (addition or delete) is checked by the server before actually doing it. This means that when a hundred LFS +files are committed at once, the server must check if each file has been correctly uploaded -checking its existence +and size-. A timeout of 60s is set on the request, meaning that if the process takes more time an error is raised +client-side. However, it can happen (in rare cases) that even if the timeout is raised client-side, the process is still +completed server-side. This can be checked manually by browsing the repo on the Hub. To prevent this timeout, we recommend +to add around 50-100 files per commit. + +To summarize it quickly: +- Reach out to us for large repos (TBs of data) +- Max 1M files on the repo +- Max 10k files per folder +- ~10GB max per file +- ~100 max files per commit +- ~1000-3000 commits maximum + +### Practical tips + +Now that we saw the technical aspects you must consider when structuring your repository, let's see some practical +tips to make your upload process as smooth as possible. + +- **Start small**: We recommend to start with a small amount of data to test your upload script. It's easier to iterate +on a script when failing takes only a small amount of time. +- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen but it's always +best to consider that something will fail at least once -no matter if it's due to your machine, your connection or our +servers. For example if you plan to upload a large number of files, it's best to keep track locally of which files you +already uploaded before uploading the next batch. You are ensured that an LFS file that is already committed will never +be re-uploaded twice but checking it client-side can still save some time. +- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed-up +uploads on machines with very high bandwidth. To use it you must install it (`pip install hf_transfer`) and enable it +by setting `HF_HUB_ENABLE_HF_TRANSFER=1` as environment variable. You can then use `huggingface_hub` normally. +Disclaimer: this is a power user tool. It is tested and production-ready but lacks user-friendly features like progress +bars or advanced error handling. + +## (legacy) Upload files with Git LFS All the methods described above use the Hub's API to upload files. This is the recommended way to upload files to the Hub. However we also provide [`Repository`], a wrapper around the git tool to manage a local repository. 
+ + +Although [`Repository`] is not formally deprecated, we recommend using the HTTP-based methods described above instead. +For more details about this recommendation, please have a look to [this guide](../concepts/git_vs_http) explaining the +core differences between HTTP-based and Git-based approaches. + + + Git LFS automatically handles files larger than 10MB. But for very large files (>5GB), you need to install a custom transfer agent for Git LFS: ```bash From 53caabe7df05fef526f229cb9189f01af47794ea Mon Sep 17 00:00:00 2001 From: Lucain Pouget Date: Thu, 20 Jul 2023 18:11:00 +0200 Subject: [PATCH 2/8] typos --- docs/source/guides/upload.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/guides/upload.md b/docs/source/guides/upload.md index a8317cbeba..5bb9221965 100644 --- a/docs/source/guides/upload.md +++ b/docs/source/guides/upload.md @@ -387,7 +387,7 @@ starting the upload. What are we talking about when we say "large uploads" and what are their associated limitations? Large uploads can be very diverse, from repositories with a few huge files (e.g. model weights) to repositories with thousands of small files -(e.g. an image dataset). Each repository type will have its own limitations so here is a list of factors to consider: +(e.g. an image dataset). Each repository will have its own limitations so here is a list of factors to consider: - **Repository size**: The total size of the data you're planning to upload. There is no hard limit on the size of a repository on Hub. However if you plan to upload hundreds of GBs or even TBs of data, we would be grateful if you could let us know @@ -396,7 +396,7 @@ in advance to be aware of it. It's also better to get assistance if you have que Git to version the data which has its limitations: - The total number of files in the repo. It cannot exceed 1M files. If you are planning to upload more than 1M files, we recommend to merge the data into fewer files. For example, json files can be merged into a single jsonl file or - large datasets can be exported as parquet. + large datasets can be exported as parquet files. - The maximum number of files per folder. It cannot exceed 10k files per folder. A simple solution for that is to create a repository structure that uses subdirectories. A repo with 1k folder from `000/` to `999/` each one of them containing at most 1000 files is already enough. From 16dd98a2e04ae5f97071b76604b4591f00a9ef3b Mon Sep 17 00:00:00 2001 From: Lucain Date: Fri, 21 Jul 2023 11:48:14 +0200 Subject: [PATCH 3/8] Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> Co-authored-by: Omar Sanseviero --- docs/source/guides/upload.md | 46 +++++++++++++++++------------------- 1 file changed, 22 insertions(+), 24 deletions(-) diff --git a/docs/source/guides/upload.md b/docs/source/guides/upload.md index 5bb9221965..c092db87fe 100644 --- a/docs/source/guides/upload.md +++ b/docs/source/guides/upload.md @@ -113,6 +113,7 @@ However, `huggingface_hub` has more advanced features to make things easier. Let ### Non-blocking uploads + In some cases, you want to push data without blocking your main thread. This is particularly useful to upload logs and artifacts while continuing a training. To do so, you can use the `run_as_future` argument in both [`upload_file`] and [`upload_folder`]. 
This will return a [`concurrent.futures.Future`](https://docs.python.org/3/library/concurrent.futures.html#future-objects) @@ -378,44 +379,41 @@ For more detailed information, take a look at the [`HfApi`] reference. ## Tips and tricks for large uploads -The above guide shows how to technically upload files to the Hub. However there are a few things to know when you're -trying to upload a large quantity of data. Given the time it takes to stream the data, it can be very annoying to +There are some limitations to be aware of when you're trying to upload a large amount of data. Given the time it takes to stream the data, it can be very annoying to get a commit to fail at the end of the process. We gathered a list of tips and recommendations to think about before starting the upload. -### Technical limitations +### Hub repository size limitations -What are we talking about when we say "large uploads" and what are their associated limitations? Large uploads can be +What are we talking about when we say "large uploads", and what are their associated limitations? Large uploads can be very diverse, from repositories with a few huge files (e.g. model weights) to repositories with thousands of small files -(e.g. an image dataset). Each repository will have its own limitations so here is a list of factors to consider: +(e.g. an image dataset). Each repository will have its own limitations, so here is a list of factors to consider: -- **Repository size**: The total size of the data you're planning to upload. There is no hard limit on the size of a repository on -Hub. However if you plan to upload hundreds of GBs or even TBs of data, we would be grateful if you could let us know -in advance to be aware of it. It's also better to get assistance if you have questions on your process. +- **Repository size**: The total size of the data you're planning to upload. There is no hard limit on a Hub repository size. However, if you plan to upload hundreds of GBs or even TBs of data, we would appreciate it if you could let us know +in advance so we can better help you if have any questions during the process. - **Number of files**: There are two practical limits about the number of files in your repo. Under the hood, the Hub uses Git to version the data which has its limitations: - - The total number of files in the repo. It cannot exceed 1M files. If you are planning to upload more than 1M files, - we recommend to merge the data into fewer files. For example, json files can be merged into a single jsonl file or - large datasets can be exported as parquet files. - - The maximum number of files per folder. It cannot exceed 10k files per folder. A simple solution for that is to - create a repository structure that uses subdirectories. A repo with 1k folder from `000/` to `999/` each one of - them containing at most 1000 files is already enough. + - The total number of files in the repo cannot exceed 1M files. If you are planning to upload more than 1M files, + we recommend merging the data into fewer files. For example, json files can be merged into a single jsonl file or + large datasets can be exported as Parquet files. + - The maximum number of files per folder cannot exceed 10k files per folder. A simple solution is to + create a repository structure that uses subdirectories. For example, a repo with 1k folders from `000/` to `999/`, each containing at most 1000 files, is already enough. - **File size**: individual files also have a limit on their maximum size. In the case of uploading large files (e.g. 
model weights), we strongly recommend to split them **into chunks of around 10GB each**. There are a few reasons for this: - Uploading and downloading smaller files is easier both for you and the other users. Connection issues can always happen when streaming data and smaller files avoid resuming from the beginning in case of errors. - Files are served to the users using CloudFront. From our experience, files above 30GB are not cached by this service leading to a slower download speed. - - A hard-limit of 50GB has been set on the Hub. If you try to commit larger files, an error will be returned by the server. -- **Number of commits**: The total number of commits on your repo history. There is no hard limit for this. However, from + - A hard-limit of 50GB has been set on the Hub for an individual file. If you try to commit larger files, an error will be returned by the server. +- **Number of commits**: There is no hard limit for the total number of commits on your repo history. However, from our experience, the user experience on the Hub start to degrade after a few thousands commits. We are always working to improve the service but one must always remember that a git repository is not meant to work as a database with a lot of writings. In case your repo's history gets very large, it is always possible to squash all the commits to get a fresh start. - **Number of operations per commit**: Once again, there is no hard-limit here. When a commit is done on the Hub, each git operation (addition or delete) is checked by the server before actually doing it. This means that when a hundred LFS -files are committed at once, the server must check if each file has been correctly uploaded -checking its existence -and size-. A timeout of 60s is set on the request, meaning that if the process takes more time an error is raised +files are committed at once, the server must check if each file has been correctly uploaded, checking its existence +and size. A timeout of 60s is set on the request, meaning that if the process takes more time an error is raised client-side. However, it can happen (in rare cases) that even if the timeout is raised client-side, the process is still completed server-side. This can be checked manually by browsing the repo on the Hub. To prevent this timeout, we recommend to add around 50-100 files per commit. @@ -433,16 +431,16 @@ To summarize it quickly: Now that we saw the technical aspects you must consider when structuring your repository, let's see some practical tips to make your upload process as smooth as possible. -- **Start small**: We recommend to start with a small amount of data to test your upload script. It's easier to iterate +- **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate on a script when failing takes only a small amount of time. -- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen but it's always -best to consider that something will fail at least once -no matter if it's due to your machine, your connection or our -servers. For example if you plan to upload a large number of files, it's best to keep track locally of which files you +- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always +best to consider that something will fail at least once -no matter if it's due to your machine, your connection, or our +servers. 
For example, if you plan to upload a large number of files, it's best to keep track locally of which files you already uploaded before uploading the next batch. You are ensured that an LFS file that is already committed will never be re-uploaded twice but checking it client-side can still save some time. - **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed-up -uploads on machines with very high bandwidth. To use it you must install it (`pip install hf_transfer`) and enable it -by setting `HF_HUB_ENABLE_HF_TRANSFER=1` as environment variable. You can then use `huggingface_hub` normally. +uploads on machines with very high bandwidth. To use it, you must install it (`pip install hf_transfer`) and enable it +by setting `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable. You can then use `huggingface_hub` normally. Disclaimer: this is a power user tool. It is tested and production-ready but lacks user-friendly features like progress bars or advanced error handling. From 5487aea9b87f7e294b7110b602d90702e76d64d7 Mon Sep 17 00:00:00 2001 From: Lucain Pouget Date: Fri, 21 Jul 2023 14:43:40 +0200 Subject: [PATCH 4/8] add table --- docs/source/guides/upload.md | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/docs/source/guides/upload.md b/docs/source/guides/upload.md index c092db87fe..a88785f931 100644 --- a/docs/source/guides/upload.md +++ b/docs/source/guides/upload.md @@ -379,9 +379,19 @@ For more detailed information, take a look at the [`HfApi`] reference. ## Tips and tricks for large uploads -There are some limitations to be aware of when you're trying to upload a large amount of data. Given the time it takes to stream the data, it can be very annoying to -get a commit to fail at the end of the process. We gathered a list of tips and recommendations to think about before -starting the upload. +There are some limitations to be aware of when you're trying to upload a large amount of data. Given the time it takes to stream the data, it can be very annoying to get a commit to fail at the end of the process. We gathered a list of tips and recommendations to think about before starting the upload. + + +| Characteristic | Limit | Type of limit | Recommended approach | +| ---------------- | ------------------ | ------------------------------ | ---------------------------------------- | +| Repo size | unlimited | Ø | contact us for large repos (TBs of data) | +| Files per repo | 1M max | hard limit | merge data into fewer files | +| Files per folder | 10k max | hard limit | use subdirectories in repo | +| File size | ~10GB max | soft limit (can go up to 50GB) | split data into chunked files | +| Commit size | ~100 files | soft limit (risk of timeouts) | upload files in multiple commits | +| Commits per repo | ~1000-3000 commits | soft limit (risk of slower UX) | upload multiple files per commit | + +Please read the next section to get a better understanding of those limits and how to deal with them. ### Hub repository size limitations @@ -389,8 +399,7 @@ What are we talking about when we say "large uploads", and what are their associ very diverse, from repositories with a few huge files (e.g. model weights) to repositories with thousands of small files (e.g. an image dataset). Each repository will have its own limitations, so here is a list of factors to consider: -- **Repository size**: The total size of the data you're planning to upload. There is no hard limit on a Hub repository size. 
However, if you plan to upload hundreds of GBs or even TBs of data, we would appreciate it if you could let us know -in advance so we can better help you if have any questions during the process. +- **Repository size**: The total size of the data you're planning to upload. There is no hard limit on a Hub repository size. However, if you plan to upload hundreds of GBs or even TBs of data, we would appreciate it if you could let us know in advance so we can better help you if have any questions during the process. You can contact us at datasets@huggingface.co or on [our Discord](http://hf.co/join/discord). - **Number of files**: There are two practical limits about the number of files in your repo. Under the hood, the Hub uses Git to version the data which has its limitations: - The total number of files in the repo cannot exceed 1M files. If you are planning to upload more than 1M files, From e94cf1f70f62d9154046e65106e16649c7525c1a Mon Sep 17 00:00:00 2001 From: Lucain Date: Fri, 21 Jul 2023 17:29:08 +0200 Subject: [PATCH 5/8] Apply suggestions from code review Co-authored-by: Julien Chaumond --- docs/source/guides/upload.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/guides/upload.md b/docs/source/guides/upload.md index a88785f931..4ebc508a6e 100644 --- a/docs/source/guides/upload.md +++ b/docs/source/guides/upload.md @@ -461,7 +461,7 @@ However we also provide [`Repository`], a wrapper around the git tool to manage Although [`Repository`] is not formally deprecated, we recommend using the HTTP-based methods described above instead. -For more details about this recommendation, please have a look to [this guide](../concepts/git_vs_http) explaining the +For more details about this recommendation, please have a look at [this guide](../concepts/git_vs_http) explaining the core differences between HTTP-based and Git-based approaches. From 4193f9b2b3d782f1014a332e74782c0471a94c56 Mon Sep 17 00:00:00 2001 From: Lucain Pouget Date: Fri, 21 Jul 2023 17:29:33 +0200 Subject: [PATCH 6/8] fix --- docs/source/guides/upload.md | 8 -------- 1 file changed, 8 deletions(-) diff --git a/docs/source/guides/upload.md b/docs/source/guides/upload.md index a88785f931..bb4ad7c874 100644 --- a/docs/source/guides/upload.md +++ b/docs/source/guides/upload.md @@ -427,14 +427,6 @@ client-side. However, it can happen (in rare cases) that even if the timeout is completed server-side. This can be checked manually by browsing the repo on the Hub. To prevent this timeout, we recommend to add around 50-100 files per commit. -To summarize it quickly: -- Reach out to us for large repos (TBs of data) -- Max 1M files on the repo -- Max 10k files per folder -- ~10GB max per file -- ~100 max files per commit -- ~1000-3000 commits maximum - ### Practical tips Now that we saw the technical aspects you must consider when structuring your repository, let's see some practical From 81847ff13f97740e1621577feb79f268f6d4f9bc Mon Sep 17 00:00:00 2001 From: Lucain Date: Wed, 16 Aug 2023 15:09:46 +0200 Subject: [PATCH 7/8] Apply suggestions from code review Co-authored-by: Pierric Cistac --- docs/source/guides/upload.md | 47 +++++++++++++++++++----------------- 1 file changed, 25 insertions(+), 22 deletions(-) diff --git a/docs/source/guides/upload.md b/docs/source/guides/upload.md index 012b93c0cf..1c38cf5d30 100644 --- a/docs/source/guides/upload.md +++ b/docs/source/guides/upload.md @@ -379,17 +379,19 @@ For more detailed information, take a look at the [`HfApi`] reference. 
## Tips and tricks for large uploads -There are some limitations to be aware of when you're trying to upload a large amount of data. Given the time it takes to stream the data, it can be very annoying to get a commit to fail at the end of the process. We gathered a list of tips and recommendations to think about before starting the upload. +There are some limitations to be aware of when you're dealing with a large amount of data in your repo. Given the time it takes to stream the data, it can be very annoying to get an upload/push to fail at the end of the process or encounter a degraded experience, be it on hf.co or when working locally. We gathered a list of tips and recommendations to think about when structuring your repo. -| Characteristic | Limit | Type of limit | Recommended approach | -| ---------------- | ------------------ | ------------------------------ | ---------------------------------------- | -| Repo size | unlimited | Ø | contact us for large repos (TBs of data) | -| Files per repo | 1M max | hard limit | merge data into fewer files | -| Files per folder | 10k max | hard limit | use subdirectories in repo | -| File size | ~10GB max | soft limit (can go up to 50GB) | split data into chunked files | -| Commit size | ~100 files | soft limit (risk of timeouts) | upload files in multiple commits | -| Commits per repo | ~1000-3000 commits | soft limit (risk of slower UX) | upload multiple files per commit | +| Characteristic | Recommended | Tips | +| ---------------- | ------------------ | ---------------------------------------- | +| Repo size | - | contact us for large repos (TBs of data) | +| Files per repo | <100k | merge data into fewer files | +| Entries per folder | <10k | use subdirectories in repo | +| File size | <5GB | split data into chunked files | +| Commit size | <100 files* | upload files in multiple commits | +| Commits per repo | - | upload multiple files per commit | + +_* Not relevant when using `git` CLI directly_ Please read the next section to get a better understanding of those limits and how to deal with them. @@ -397,32 +399,33 @@ Please read the next section to get a better understanding of those limits and h What are we talking about when we say "large uploads", and what are their associated limitations? Large uploads can be very diverse, from repositories with a few huge files (e.g. model weights) to repositories with thousands of small files -(e.g. an image dataset). Each repository will have its own limitations, so here is a list of factors to consider: +(e.g. an image dataset). + +Under the hood, the Hub uses Git to version the data, which has structural implications on what you can do in your repo. +If your repo is crossing some of the numbers mentioned in the previous section, **we strongly encourage you to check out [`git-sizer`](https://github.com/github/git-sizer)**, +which has very detailed documentation about the different factors that will impact your experience. Here is a TL;DR of factors to consider: - **Repository size**: The total size of the data you're planning to upload. There is no hard limit on a Hub repository size. However, if you plan to upload hundreds of GBs or even TBs of data, we would appreciate it if you could let us know in advance so we can better help you if have any questions during the process. You can contact us at datasets@huggingface.co or on [our Discord](http://hf.co/join/discord). -- **Number of files**: There are two practical limits about the number of files in your repo. 
Under the hood, the Hub uses -Git to version the data which has its limitations: - - The total number of files in the repo cannot exceed 1M files. If you are planning to upload more than 1M files, - we recommend merging the data into fewer files. For example, json files can be merged into a single jsonl file or - large datasets can be exported as Parquet files. +- **Number of files**: + - For optimal experience, we recommend to keep the total number of files under 100k. Try merging the data into fewer files if you have more. + For example, json files can be merged into a single jsonl file or large datasets can be exported as Parquet files. - The maximum number of files per folder cannot exceed 10k files per folder. A simple solution is to create a repository structure that uses subdirectories. For example, a repo with 1k folders from `000/` to `999/`, each containing at most 1000 files, is already enough. - **File size**: individual files also have a limit on their maximum size. In the case of uploading large files (e.g. -model weights), we strongly recommend to split them **into chunks of around 10GB each**. There are a few reasons for this: +model weights), we strongly recommend to split them **into chunks of around 5GB each**. There are a few reasons for this: - Uploading and downloading smaller files is easier both for you and the other users. Connection issues can always happen when streaming data and smaller files avoid resuming from the beginning in case of errors. - - Files are served to the users using CloudFront. From our experience, files above 30GB are not cached by this service + - Files are served to the users using CloudFront. From our experience, huge files are not cached by this service leading to a slower download speed. - - A hard-limit of 50GB has been set on the Hub for an individual file. If you try to commit larger files, an error will be returned by the server. - **Number of commits**: There is no hard limit for the total number of commits on your repo history. However, from our experience, the user experience on the Hub start to degrade after a few thousands commits. We are always working to improve the service but one must always remember that a git repository is not meant to work as a database with a lot of writings. In case your repo's history gets very large, it is always possible to squash all the commits to get a fresh start. -- **Number of operations per commit**: Once again, there is no hard-limit here. When a commit is done on the Hub, each -git operation (addition or delete) is checked by the server before actually doing it. This means that when a hundred LFS -files are committed at once, the server must check if each file has been correctly uploaded, checking its existence -and size. A timeout of 60s is set on the request, meaning that if the process takes more time an error is raised +- **Number of operations per commit**: Once again, there is no hard-limit here. When a commit is uploaded on the Hub, each +git operation (addition or delete) is checked by the server. When a hundred LFS files are committed at once, +each file is checked individually to make sure it's been correctly uploaded. When pushing data through HTTP with `huggingface_hub`, +a timeout of 60s is set on the request, meaning that if the process takes more time an error is raised client-side. However, it can happen (in rare cases) that even if the timeout is raised client-side, the process is still completed server-side. This can be checked manually by browsing the repo on the Hub. 
To prevent this timeout, we recommend to add around 50-100 files per commit. From 4c2bb66dd09b3e145e3408b9b0681f9b90c1c7f2 Mon Sep 17 00:00:00 2001 From: Pierric Cistac Date: Wed, 16 Aug 2023 09:21:10 -0600 Subject: [PATCH 8/8] grammarly pass --- docs/source/guides/upload.md | 40 +++++++++++++++++++----------------- 1 file changed, 21 insertions(+), 19 deletions(-) diff --git a/docs/source/guides/upload.md b/docs/source/guides/upload.md index 1c38cf5d30..4d7e5a2a27 100644 --- a/docs/source/guides/upload.md +++ b/docs/source/guides/upload.md @@ -379,7 +379,9 @@ For more detailed information, take a look at the [`HfApi`] reference. ## Tips and tricks for large uploads -There are some limitations to be aware of when you're dealing with a large amount of data in your repo. Given the time it takes to stream the data, it can be very annoying to get an upload/push to fail at the end of the process or encounter a degraded experience, be it on hf.co or when working locally. We gathered a list of tips and recommendations to think about when structuring your repo. +There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data, +getting an upload/push to fail at the end of the process or encountering a degraded experience, be it on hf.co or when working locally, can be very annoying. +We gathered a list of tips and recommendations for structuring your repo. | Characteristic | Recommended | Tips | @@ -393,7 +395,7 @@ There are some limitations to be aware of when you're dealing with a large amoun _* Not relevant when using `git` CLI directly_ -Please read the next section to get a better understanding of those limits and how to deal with them. +Please read the next section to understand better those limits and how to deal with them. ### Hub repository size limitations @@ -405,44 +407,44 @@ Under the hood, the Hub uses Git to version the data, which has structural impli If your repo is crossing some of the numbers mentioned in the previous section, **we strongly encourage you to check out [`git-sizer`](https://github.com/github/git-sizer)**, which has very detailed documentation about the different factors that will impact your experience. Here is a TL;DR of factors to consider: -- **Repository size**: The total size of the data you're planning to upload. There is no hard limit on a Hub repository size. However, if you plan to upload hundreds of GBs or even TBs of data, we would appreciate it if you could let us know in advance so we can better help you if have any questions during the process. You can contact us at datasets@huggingface.co or on [our Discord](http://hf.co/join/discord). +- **Repository size**: The total size of the data you're planning to upload. There is no hard limit on a Hub repository size. However, if you plan to upload hundreds of GBs or even TBs of data, we would appreciate it if you could let us know in advance so we can better help you if you have any questions during the process. You can contact us at datasets@huggingface.co or on [our Discord](http://hf.co/join/discord). - **Number of files**: - - For optimal experience, we recommend to keep the total number of files under 100k. Try merging the data into fewer files if you have more. - For example, json files can be merged into a single jsonl file or large datasets can be exported as Parquet files. + - For optimal experience, we recommend keeping the total number of files under 100k. Try merging the data into fewer files if you have more. 
+ For example, json files can be merged into a single jsonl file, or large datasets can be exported as Parquet files. - The maximum number of files per folder cannot exceed 10k files per folder. A simple solution is to create a repository structure that uses subdirectories. For example, a repo with 1k folders from `000/` to `999/`, each containing at most 1000 files, is already enough. -- **File size**: individual files also have a limit on their maximum size. In the case of uploading large files (e.g. -model weights), we strongly recommend to split them **into chunks of around 5GB each**. There are a few reasons for this: - - Uploading and downloading smaller files is easier both for you and the other users. Connection issues can always +- **File size**: In the case of uploading large files (e.g. model weights), we strongly recommend splitting them **into chunks of around 5GB each**. +There are a few reasons for this: + - Uploading and downloading smaller files is much easier both for you and the other users. Connection issues can always happen when streaming data and smaller files avoid resuming from the beginning in case of errors. - Files are served to the users using CloudFront. From our experience, huge files are not cached by this service leading to a slower download speed. - **Number of commits**: There is no hard limit for the total number of commits on your repo history. However, from -our experience, the user experience on the Hub start to degrade after a few thousands commits. We are always working to -improve the service but one must always remember that a git repository is not meant to work as a database with a lot of -writings. In case your repo's history gets very large, it is always possible to squash all the commits to get a +our experience, the user experience on the Hub starts to degrade after a few thousand commits. We are constantly working to +improve the service, but one must always remember that a git repository is not meant to work as a database with a lot of +writes. If your repo's history gets very large, it is always possible to squash all the commits to get a fresh start. -- **Number of operations per commit**: Once again, there is no hard-limit here. When a commit is uploaded on the Hub, each +- **Number of operations per commit**: Once again, there is no hard limit here. When a commit is uploaded on the Hub, each git operation (addition or delete) is checked by the server. When a hundred LFS files are committed at once, -each file is checked individually to make sure it's been correctly uploaded. When pushing data through HTTP with `huggingface_hub`, -a timeout of 60s is set on the request, meaning that if the process takes more time an error is raised +each file is checked individually to ensure it's been correctly uploaded. When pushing data through HTTP with `huggingface_hub`, +a timeout of 60s is set on the request, meaning that if the process takes more time, an error is raised client-side. However, it can happen (in rare cases) that even if the timeout is raised client-side, the process is still completed server-side. This can be checked manually by browsing the repo on the Hub. To prevent this timeout, we recommend -to add around 50-100 files per commit. +adding around 50-100 files per commit. 
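To make the last point more concrete, here is a minimal sketch of splitting a large folder upload into several commits of at most 50 files each with [`create_commit`]. The repo name and local folder are hypothetical, and the "Upload a folder by chunks" feature described earlier addresses a similar need:

```py
# Minimal sketch: upload a large local folder in several commits of ~50 files each.
# Assumes you are logged in and the target repo already exists.
from pathlib import Path

from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()
repo_id = "my-username/my-large-dataset"  # hypothetical dataset repo
folder = Path("data")                     # hypothetical local folder

files = sorted(path for path in folder.glob("**/*") if path.is_file())
batch_size = 50  # stay below the ~100 operations per commit recommendation

for i in range(0, len(files), batch_size):
    operations = [
        CommitOperationAdd(
            path_in_repo=path.relative_to(folder).as_posix(),
            path_or_fileobj=path,
        )
        for path in files[i : i + batch_size]
    ]
    api.create_commit(
        repo_id=repo_id,
        repo_type="dataset",
        operations=operations,
        commit_message=f"Upload batch {i // batch_size + 1}",
    )
```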
### Practical tips -Now that we saw the technical aspects you must consider when structuring your repository, let's see some practical +Now that we've seen the technical aspects you must consider when structuring your repository, let's see some practical tips to make your upload process as smooth as possible. - **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate -on a script when failing takes only a small amount of time. +on a script when failing takes only a little time. - **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always best to consider that something will fail at least once -no matter if it's due to your machine, your connection, or our servers. For example, if you plan to upload a large number of files, it's best to keep track locally of which files you already uploaded before uploading the next batch. You are ensured that an LFS file that is already committed will never be re-uploaded twice but checking it client-side can still save some time. -- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed-up +- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up uploads on machines with very high bandwidth. To use it, you must install it (`pip install hf_transfer`) and enable it by setting `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable. You can then use `huggingface_hub` normally. Disclaimer: this is a power user tool. It is tested and production-ready but lacks user-friendly features like progress @@ -451,7 +453,7 @@ bars or advanced error handling. ## (legacy) Upload files with Git LFS All the methods described above use the Hub's API to upload files. This is the recommended way to upload files to the Hub. -However we also provide [`Repository`], a wrapper around the git tool to manage a local repository. +However, we also provide [`Repository`], a wrapper around the git tool to manage a local repository.
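For reference, a minimal sketch of that git-based workflow could look as follows. The repo name is hypothetical, and `git` and `git-lfs` must be installed locally:

```py
# Minimal sketch of the legacy git-based workflow with the Repository wrapper.
from huggingface_hub import Repository

# Clone a (hypothetical) repo locally, then add/commit/push through git
repo = Repository(local_dir="my-model", clone_from="my-username/my-model")
repo.git_pull()                          # make sure the local clone is up to date
repo.git_add(".", auto_lfs_track=True)   # track any file larger than 10MB with Git LFS
repo.git_commit("Add model weights")
repo.git_push()
```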