Tips to upload large models/datasets #1565

Merged (11 commits) on Aug 16, 2023
2 changes: 1 addition & 1 deletion docs/source/guides/repository.md
@@ -71,7 +71,7 @@ Specify the `repo_id` of the repository you want to delete:
>>> delete_repo(repo_id="lysandre/my-corrupted-dataset", repo_type="dataset")
```

### Clone a repository (only for Spaces)
### Duplicate a repository (only for Spaces)

In some cases, you want to copy someone else's repo to adapt it to your use case.
This is possible for Spaces using the [`duplicate_space`] method. It will duplicate the whole repository.
97 changes: 90 additions & 7 deletions docs/source/guides/upload.md
@@ -106,8 +106,13 @@ but before that, all previous logs on the repo are deleted. All of this in a single
... )
```

## Non-blocking upload
## Advanced features

In most cases, you won't need more than [`upload_file`] and [`upload_folder`] to upload your files to the Hub.
However, `huggingface_hub` has more advanced features to make things easier. Let's have a look at them!


### Non-blocking uploads
In some cases, you want to push data without blocking your main thread. This is particularly useful to upload logs and
artifacts while training continues. To do so, you can use the `run_as_future` argument in both [`upload_file`] and
[`upload_folder`]. This will return a [`concurrent.futures.Future`](https://docs.python.org/3/library/concurrent.futures.html#future-objects)
@@ -154,7 +159,7 @@ Future(...)
Future(...)
```
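
If you later need to be sure an upload has completed, keep the returned future around and wait on it. Here is a minimal
sketch; the repo id and file names are placeholders:

```python
from huggingface_hub import HfApi

api = HfApi()
future = api.upload_file(
    path_or_fileobj="path/to/training.log",  # placeholder local file
    path_in_repo="training.log",
    repo_id="username/my-model",             # placeholder repo
    run_as_future=True,
)
# ... training keeps running in the main thread ...
future.result()  # block only when you need the upload to be finished
```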

## Upload a folder by chunks
### Upload a folder by chunks

[`upload_folder`] makes it easy to upload an entire folder to the Hub. However, for large folders (thousands of files or
hundreds of GB), it can still be challenging. If you have a folder with a lot of files, you might want to upload
@@ -193,7 +198,7 @@ notice.

</Tip>
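
For reference, here is a minimal sketch of a chunked folder upload. It assumes the experimental `multi_commits` flags
are available on [`upload_folder`] in your installed version of `huggingface_hub`; the folder path and repo id are
placeholders.

```python
from huggingface_hub import upload_folder

upload_folder(
    folder_path="local/folder",         # placeholder local folder
    repo_id="username/my-dataset",      # placeholder repo
    repo_type="dataset",
    multi_commits=True,                 # split the upload into several small commits
    multi_commits_verbose=True,         # log the progress of each commit
)
```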

## Scheduled uploads
### Scheduled uploads

The Hugging Face Hub makes it easy to save and version data. However, there are some limitations when updating the same file thousands of times. For instance, you might want to save logs of a training process or user
feedback on a deployed Space. In these cases, uploading the data as a dataset on the Hub makes sense, but it can be hard to do properly. The main reason is that you don't want to version every update of your data because it'll make the git repository unusable. The [`CommitScheduler`] class offers a solution to this problem.
@@ -262,14 +267,14 @@ For more details about the [`CommitScheduler`], here is what you need to know:
`scheduler.lock` lock to ensure thread-safety. The lock is held only while the scheduler scans the folder for
changes, not while it uploads data. You can safely assume that it will not affect the user experience on your Space (see the sketch below).
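
As an illustration, here is a minimal sketch of app code writing to the watched folder under the scheduler lock; the
repo id, folder name, and `every` value are placeholders.

```python
import json
import uuid
from pathlib import Path

from huggingface_hub import CommitScheduler

feedback_folder = Path("user_feedback")    # placeholder local folder watched by the scheduler
feedback_folder.mkdir(exist_ok=True)

scheduler = CommitScheduler(
    repo_id="username/user-feedback",      # placeholder dataset repo
    repo_type="dataset",
    folder_path=feedback_folder,
    every=10,                              # upload every 10 minutes
)

def save_feedback(feedback: dict) -> None:
    # Take the lock so the write never overlaps with a folder scan.
    with scheduler.lock:
        path = feedback_folder / f"{uuid.uuid4()}.json"
        path.write_text(json.dumps(feedback))
```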

### Space persistence demo
#### Space persistence demo

Persisting data from a Space to a Dataset on the Hub is the main use case for [`CommitScheduler`]. Depending on the use
case, you might want to structure your data differently. The structure has to be robust to concurrent users and
restarts, which often implies generating UUIDs. Besides robustness, you should upload data in a format readable by the 🤗 Datasets library for later reuse. We created a [Space](https://huggingface.co/spaces/Wauplin/space_to_dataset_saver)
that demonstrates how to save several different data formats (you may need to adapt it for your own specific needs).

### Custom uploads
#### Custom uploads

[`CommitScheduler`] assumes your data is append-only and should be uploaded "as is". However, you
might want to customize the way data is uploaded. You can do that by creating a class inheriting from [`CommitScheduler`]
@@ -316,7 +321,7 @@ containing different implementations depending on your use cases.

</Tip>
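
Below is a minimal sketch of such a subclass (not the implementation from the demo Space). It assumes the method to
override is `push_to_hub` and that the constructor arguments are exposed as `self.folder_path`, `self.repo_id`, and
`self.repo_type`; adapt it to your own use case.

```python
from pathlib import Path

from huggingface_hub import CommitScheduler, HfApi


class JsonOnlyScheduler(CommitScheduler):
    """Scheduler that only uploads the .json files of the watched folder."""

    def push_to_hub(self) -> None:
        # Take the lock so the app does not write new files while we scan the folder.
        with self.lock:
            json_files = sorted(Path(self.folder_path).glob("*.json"))
        if not json_files:
            return  # nothing to upload yet
        HfApi().upload_folder(
            folder_path=self.folder_path,
            repo_id=self.repo_id,
            repo_type=self.repo_type,
            allow_patterns="*.json",
            commit_message="Scheduled upload of JSON files",
        )
```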

## create_commit
### create_commit

The [`upload_file`] and [`upload_folder`] functions are high-level APIs that are generally convenient to use. We recommend
trying these functions first if you don't need to work at a lower level. However, if you want to work at the commit level,
@@ -371,11 +376,89 @@ In addition to [`upload_file`] and [`upload_folder`], the following functions also

For more detailed information, take a look at the [`HfApi`] reference.
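
As a compact recap, here is a sketch of a commit combining an addition and a deletion; the repo id and file names are
placeholders.

```python
from huggingface_hub import CommitOperationAdd, CommitOperationDelete, HfApi

api = HfApi()
api.create_commit(
    repo_id="username/my-model",  # placeholder repo
    operations=[
        CommitOperationAdd(path_in_repo="model.safetensors", path_or_fileobj="path/to/model.safetensors"),
        CommitOperationDelete(path_in_repo="pytorch_model.bin"),
    ],
    commit_message="Replace the PyTorch weights with safetensors",
)
```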

## Push files with Git LFS
## Tips and tricks for large uploads
Contributor: Let's be careful to not break this url

Contributor Author: I would not be too worried about it to be honest. I've checked and #upload-files-with-git-lfs is referenced nowhere in our internal docs (both hfh and hf_docs). Doesn't mean that such a URL doesn't exist in the wild, but I would expect the probability to be quite low. And even if it's the case, users will be redirected to the correct page even though it's not the correct section. Since Git LFS upload is mostly deprecated, it should be fine.

Member: (btw it's quite easy to add backward compat for a particular url anchor if needed)

Member: are we sure we don't want to put that content on its own doc page? (no strong opinion)

Contributor Author: No strong opinion either. I thought the content was a bit light to get its own page.

Member: Actually, I was initially looking for this PR in hub-docs, so yeah it could make sense haha, but it seems to me that some of the advice and limits are specific to huggingface_hub (or at least to uploads through the HTTP API), so we can keep it here, at least for now.

The above guide shows how to technically upload files to the Hub. However, there are a few things to know when you are
trying to upload a large quantity of data. Given the time it takes to stream the data, having a commit fail at the end
of the process is very frustrating. We gathered a list of tips and recommendations to think about before starting the
upload.

### Technical limitations

What are we talking about when we say "large uploads", and what are their associated limitations? Large uploads can be
very diverse, from repositories with a few huge files (e.g. model weights) to repositories with thousands of small files
(e.g. an image dataset). Each repository comes with its own limitations, so here is a list of factors to consider:

- **Repository size**: The total size of the data you're planning to upload. There is no hard limit on the size of a repository
on the Hub. However, if you plan to upload hundreds of GBs or even TBs of data, we would be grateful if you could let us know
in advance so that we are aware of it and can assist you if you have questions about your process.
- **Number of files**: There are two practical limits on the number of files in your repo. Under the hood, the Hub uses
Git to version the data, which has its own limitations:
  - The total number of files in the repo cannot exceed 1M. If you are planning to upload more than 1M files,
    we recommend merging the data into fewer files. For example, JSON files can be merged into a single JSONL file, or
    large datasets can be exported as Parquet files.
  - The number of files per folder cannot exceed 10k. A simple solution is to create a repository structure that uses
    subdirectories: a repo with 1,000 folders, from `000/` to `999/`, each containing at most 1,000 files, is already
    enough (a sketch of this layout is shown after the summary below).
- **File size**: Individual files also have a maximum size. If you are uploading large files (e.g.
model weights), we strongly recommend splitting them **into chunks of around 10GB each**. There are a few reasons for this:
  - Uploading and downloading smaller files is easier both for you and for other users. Connection issues can always
    happen when streaming data, and smaller files avoid having to resume from the beginning in case of error.
  - Files are served to users through CloudFront. From our experience, files above 30GB are not cached by this service,
    leading to slower download speeds.
  - A hard limit of 50GB is set on the Hub. If you try to commit larger files, an error will be returned by the server.
- **Number of commits**: The total number of commits in your repo history. There is no hard limit for this. However, from
our experience, the user experience on the Hub starts to degrade after a few thousand commits. We are always working to
improve the service, but one must remember that a git repository is not meant to work as a database with a lot of
writes. If your repo's history gets very large, it is always possible to squash all the commits to get a
fresh start.
- **Number of operations per commit**: Once again, there is no hard limit here. When a commit is made on the Hub, each
git operation (addition or deletion) is checked by the server before it is actually performed. This means that when a hundred LFS
files are committed at once, the server must check that each file has been correctly uploaded (checking its existence
and size). A timeout of 60s is set on the request, meaning that if the process takes longer, an error is raised
client-side. However, it can happen (in rare cases) that even if the timeout is raised client-side, the process is still
completed server-side. This can be checked manually by browsing the repo on the Hub. To prevent this timeout, we recommend
adding around 50-100 files per commit.

To summarize quickly:
- Reach out to us for large repos (TBs of data)
- Max 1M files in the repo
- Max 10k files per folder
- ~10GB max per file
- ~100 files max per commit
- ~1,000-3,000 commits max
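
As an illustration of the folder-structure advice above, here is a minimal sketch that spreads a flat folder with many
small files into `000/` to `999/` subfolders of at most 1,000 files each; the folder names are placeholders.

```python
import shutil
from pathlib import Path

src = Path("my_flat_dataset")      # placeholder: flat folder with up to 1M small files
dst = Path("my_sharded_dataset")   # placeholder: output folder with 000/ ... 999/ subfolders

for i, file in enumerate(sorted(p for p in src.iterdir() if p.is_file())):
    shard = dst / f"{i // 1000:03d}"  # 1,000 files per subfolder -> at most 1,000 subfolders
    shard.mkdir(parents=True, exist_ok=True)
    shutil.copy2(file, shard / file.name)
```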

### Practical tips

Now that we have seen the technical aspects you must consider when structuring your repository, let's go through some
practical tips to make your upload process as smooth as possible.

- **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate
on a script when a failure takes only a little time.
- **Expect failures**: Streaming large amounts of data is challenging. You don't know what can happen, but it's always
best to assume that something will fail at least once, whether due to your machine, your connection, or our
servers. For example, if you plan to upload a large number of files, it's best to keep track locally of which files you
have already uploaded before uploading the next batch. An LFS file that is already committed will never
be re-uploaded, but checking it client-side can still save some time.
- **Use `hf_transfer`**: this is a Rust-based [library](https://github.com/huggingface/hf_transfer) meant to speed up
uploads on machines with very high bandwidth. To use it, you must install it (`pip install hf_transfer`) and enable it
by setting `HF_HUB_ENABLE_HF_TRANSFER=1` as an environment variable (see the sketch below). You can then use `huggingface_hub` normally.
Disclaimer: this is a power-user tool. It is tested and production-ready but lacks user-friendly features like progress
bars or advanced error handling.
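
Here is a minimal sketch of enabling `hf_transfer` from Python, assuming the package is installed
(`pip install hf_transfer`); the environment variable must be set before `huggingface_hub` is imported, and the folder
path and repo id are placeholders.

```python
import os

# Must be set before importing huggingface_hub so the setting is picked up.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import upload_folder

upload_folder(
    folder_path="path/to/local/folder",  # placeholder local folder
    repo_id="username/my-large-model",   # placeholder repo
)
```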

## (legacy) Upload files with Git LFS

All the methods described above use the Hub's API to upload files. This is the recommended way to upload files to the Hub.
However, we also provide [`Repository`], a wrapper around the git tool, to manage a local repository.

<Tip warning={true}>

Although [`Repository`] is not formally deprecated, we recommend using the HTTP-based methods described above instead.
For more details about this recommendation, please take a look at [this guide](../concepts/git_vs_http) explaining the
core differences between HTTP-based and Git-based approaches.

</Tip>

Git LFS automatically handles files larger than 10MB. But for very large files (>5GB), you need to install a custom transfer agent for Git LFS:

```bash