documentation
Wauplin committed Apr 17, 2023
1 parent e4f9a23 commit ef093a0
Showing 4 changed files with 69 additions and 1 deletion.
39 changes: 39 additions & 0 deletions docs/source/guides/upload.mdx
@@ -108,6 +108,45 @@ but before that, all previous logs on the repo on deleted. All of this in a sing
... )
```

### Upload a folder by chunks

[`upload_folder`] makes it easy to upload an entire folder to the Hub. However, for large folders (thousands of files or
hundreds of GB), it can still be challenging. If you have a folder with a lot of files, you might want to upload
it in several commits. That way, if you experience an error or a connection issue during the upload, you do not have to
restart the process from the beginning.

You can do that by passing `multi_commits=True` as an argument. Under the hood, `huggingface_hub` lists the files
to upload/delete and splits them into several commits. The strategy is based on the number and size of the files
to upload, limiting both the number of files and the total size of each commit. A PR is opened on the Hub and all
commits are pushed to it. Once the PR is merged, the commits are squashed into a single commit. If the process is
interrupted before completing, you can rerun your script to resume the upload. The existing PR is automatically
detected and the upload resumes from where it stopped. It is recommended to pass `multi_commits_verbose=True` to get a
better understanding of the upload and its progress.
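
To make the splitting idea more concrete, here is a minimal sketch of a greedy strategy that caps both the number of
files and the total size per commit. It is an illustration only, not the actual `huggingface_hub` implementation: the
`chunk_files` helper and its default limits are hypothetical.

```python
from typing import List, Tuple

FileEntry = Tuple[str, int]  # (path in repo, size in bytes)


def chunk_files(
    files: List[FileEntry],
    max_files_per_commit: int = 50,  # illustrative default
    max_bytes_per_commit: int = 2 * 1024**3,  # illustrative default (2 GiB)
) -> List[List[FileEntry]]:
    """Hypothetical helper: greedily group files into commits.

    A new commit is started whenever adding the next file would exceed
    either the file-count limit or the size limit.
    """
    commits: List[List[FileEntry]] = []
    current: List[FileEntry] = []
    current_size = 0
    for path, size in files:
        too_many = len(current) >= max_files_per_commit
        too_big = bool(current) and current_size + size > max_bytes_per_commit
        if too_many or too_big:
            commits.append(current)
            current, current_size = [], 0
        # A single file larger than the size limit still gets its own commit.
        current.append((path, size))
        current_size += size
    if current:
        commits.append(current)
    return commits
```

Because each group is pushed as its own commit, an interruption only costs the work of the commit in progress rather
than the whole upload.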

The example below uploads the checkpoints folder to a dataset in multiple commits. A PR is created on the Hub
and merged automatically once the upload is complete. If you prefer to keep the PR open and review it manually, you can
pass `create_pr=True`.

```py
>>> upload_folder(
...     folder_path="local/checkpoints",
...     repo_id="username/my-dataset",
...     repo_type="dataset",
...     multi_commits=True,
...     multi_commits_verbose=True,
... )
```
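
To picture how an interrupted upload can be resumed, the sketch below parses a progress checklist with the
`STEP_ID_REGEX` pattern defined in `huggingface_hub/_multi_commits.py`. The checklist layout in the PR description and
the `step_id` and `parse_progress` helpers are assumptions made for illustration; only the regular expression is taken
from the library source.

```python
import hashlib
import re
from typing import Dict

# Pattern as defined in `huggingface_hub/_multi_commits.py`.
STEP_ID_REGEX = re.compile(r"- \[(?P<completed>[ |x])\].*(?P<step_id>[a-fA-F0-9]{64})", flags=re.MULTILINE)


def step_id(label: str) -> str:
    """Hypothetical helper: derive a 64-character hex ID for a step."""
    return hashlib.sha256(label.encode()).hexdigest()


def parse_progress(description: str) -> Dict[str, bool]:
    """Hypothetical helper: map each step ID to True (done) or False (pending)."""
    return {m["step_id"]: m["completed"] == "x" for m in STEP_ID_REGEX.finditer(description)}


# Assumed description layout: one checklist item per planned commit.
description = "\n".join(
    [
        "## Upload progress",
        f"- [x] Upload 50 files ({step_id('commit-1')})",
        f"- [ ] Upload 12 files ({step_id('commit-2')})",
    ]
)

progress = parse_progress(description)
# Only steps that are still unchecked need to be pushed on the next run.
pending = [sid for sid, done in progress.items() if not done]
```
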

If you want finer control over the upload strategy (i.e. the commits that are created), have a look at the
low-level [`plan_multi_commits`] and [`create_commits_on_pr`] methods.

<Tip warning={true}>

`multi_commits` is still an experimental feature. Its API and behavior are subject to change in the future without
prior notice.

</Tip>


### create_commit

2 changes: 2 additions & 0 deletions docs/source/package_reference/hf_api.mdx
@@ -27,6 +27,8 @@ models = hf_api.list_models()

[[autodoc]] HfApi

[[autodoc]] plan_multi_commits

## API Dataclasses

### CommitInfo
8 changes: 8 additions & 0 deletions src/huggingface_hub/_multi_commits.py
@@ -20,6 +20,7 @@

from ._commit_api import CommitOperation, CommitOperationAdd, CommitOperationDelete
from .community import DiscussionWithDetails
from .utils import experimental
from .utils._cache_manager import _format_size


@@ -73,6 +74,7 @@ class MultiCommitException(Exception):
STEP_ID_REGEX = re.compile(r"- \[(?P<completed>[ |x])\].*(?P<step_id>[a-fA-F0-9]{64})", flags=re.MULTILINE)


@experimental
def plan_multi_commits(
operations: Iterable[CommitOperation],
max_operations_per_commit: int = 50,
@@ -104,6 +106,12 @@ def plan_multi_commits(
lists of [`CommitOperationAdd`] representing the addition commits to push. The second item is a list of lists
of [`CommitOperationDelete`] representing the deletion commits.
<Tip warning={true}>
`plan_multi_commits` is experimental. Its API and behavior are subject to change in the future without prior notice.
</Tip>
Example:
```python
>>> from huggingface_hub import HfApi, plan_multi_commits
21 changes: 20 additions & 1 deletion src/huggingface_hub/hf_api.py
@@ -27,7 +27,13 @@
import requests
from requests.exceptions import HTTPError

from huggingface_hub.utils import IGNORE_GIT_FOLDER_PATTERNS, EntryNotFoundError, RepositoryNotFoundError, get_session
from huggingface_hub.utils import (
IGNORE_GIT_FOLDER_PATTERNS,
EntryNotFoundError,
RepositoryNotFoundError,
experimental,
get_session,
)

from ._commit_api import (
CommitOperation,
@@ -2436,6 +2442,7 @@ def _payload_as_ndjson() -> Iterable[bytes]:
pr_url=commit_data["pullRequestUrl"] if create_pr else None,
)

@experimental
@validate_hf_hub_args
def create_commits_on_pr(
self,
@@ -2461,6 +2468,12 @@ def create_commits_on_pr(
guaranteed as we might implement parallel commits in the future. Make sure that you are not updating the same
file several times.
<Tip warning={true}>
`create_commits_on_pr` is experimental. Its API and behavior are subject to change in the future without prior notice.
</Tip>
Args:
repo_id (`str`):
The repository in which the commits will be pushed. Example: `"username/my-cool-model"`.
@@ -2991,6 +3004,12 @@ def upload_folder(
</Tip>
<Tip warning={true}>
`multi_commits` is experimental. Its API and behavior are subject to change in the future without prior notice.
</Tip>
Example:
```python
