Commit c578b18

Merge branch 'main' into experimental-hf-logger

Wauplin committed Jun 7, 2023
2 parents 73f868a + 8eabd33
Showing 13 changed files with 822 additions and 60 deletions.
79 changes: 77 additions & 2 deletions docs/source/guides/upload.mdx
@@ -195,6 +195,74 @@ notice.

</Tip>

### Scheduled uploads

The Hugging Face Hub makes it easy to save and version data. However, there are some limitations when updating the same file thousands of times. For instance, you might want to save logs of a training process or user
feedback on a deployed Space. In these cases, uploading the data as a dataset on the Hub makes sense, but it can be hard to do properly. The main reason is that you don't want to version every update of your data because it'll make the git repository unusable. The [`CommitScheduler`] class offers a solution to this problem.

The idea is to run a background job that regularly pushes a local folder to the Hub. Let's assume you have a
Gradio Space that takes as input some text and generates two translations of it. Then, the user can select their preferred translation. For each run, you want to save the input, output, and user preference to analyze the results. This is a
perfect use case for [`CommitScheduler`]: you want to save data to the Hub (potentially millions of user feedback entries), but
you don't _need_ to save each user's input in real time. Instead, you can save the data locally in a JSON file and
upload it every 10 minutes. For example:

```py
>>> import json
>>> import uuid
>>> from pathlib import Path
>>> import gradio as gr
>>> from huggingface_hub import CommitScheduler

# Define the file where to save the data. Use UUID to make sure not to overwrite existing data from a previous run.
>>> feedback_file = Path("user_feedback/") / f"data_{uuid.uuid4()}.json"
>>> feedback_folder = feedback_file.parent

# Schedule regular uploads. Remote repo and local folder are created if they don't already exist.
>>> scheduler = CommitScheduler(
... repo_id="report-translation-feedback",
... repo_type="dataset",
... folder_path=feedback_folder,
... path_in_repo="data",
... every=10,
... )

# Define the function that will be called when the user submits their feedback (to be called in Gradio)
>>> def save_feedback(input_text: str, output_1: str, output_2: str, user_choice: int) -> None:
... """
... Append input/outputs and user feedback to a JSON Lines file using a thread lock to avoid concurrent writes from different users.
... """
... with scheduler.lock:
... with feedback_file.open("a") as f:
... f.write(json.dumps({"input": input_text, "output_1": output_1, "output_2": output_2, "user_choice": user_choice}))
... f.write("\n")

# Start Gradio
>>> with gr.Blocks() as demo:
...     ...  # define Gradio demo + use `save_feedback`
>>> demo.launch()
```

And that's it! User inputs/outputs and feedback will be available as a dataset on the Hub. By using a unique JSON file name, you are guaranteed not to overwrite data from a previous run or from other
Spaces/replicas pushing concurrently to the same repository.
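The no-overwrite guarantee rests entirely on the `uuid4`-based file name; a minimal stdlib sketch of the idea (paths here are illustrative):

```py
import uuid
from pathlib import Path

# Each run (or Space replica) derives its own file name at startup,
# so concurrent writers never target the same file.
path_a = Path("user_feedback") / f"data_{uuid.uuid4()}.json"
path_b = Path("user_feedback") / f"data_{uuid.uuid4()}.json"

print(path_a != path_b)  # True: distinct runs get distinct file names
```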

For more details about the [`CommitScheduler`], here is what you need to know:
- **append-only:**
It is assumed that you will only add content to the folder. You must only append data to existing files or create
new files. Deleting or overwriting a file might corrupt your repository.
- **git history:**
The scheduler commits the folder every `every` minutes. To avoid polluting the git repository too much, it is
recommended to set a minimum value of 5 minutes. In addition, the scheduler is designed to avoid empty commits: if no
new content is detected in the folder, the scheduled commit is dropped.
- **errors:**
The scheduler runs as a background thread. It is started when you instantiate the class and never stops. In particular,
if an error occurs during the upload (e.g. a connection issue), the scheduler silently ignores it and retries
at the next scheduled commit.
- **thread-safety:**
In most cases it is safe to assume that you can write to a file without having to worry about a lock file. The
scheduler will not crash or be corrupted if you write content to the folder while it's uploading. In practice,
_it is possible_ that concurrency issues happen for heavily loaded apps. In this case, we advise using the
`scheduler.lock` lock to ensure thread-safety. The lock is only held while the scheduler scans the folder for
changes, not while it uploads data. You can safely assume that it will not affect the user experience on your Space.
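Because entries are only ever appended as one JSON object per line, the collected file stays valid JSON Lines and can be parsed back with the standard library alone. A minimal sketch with illustrative content:

```py
import json
from io import StringIO

# Illustrative stand-in for a feedback file produced by the `save_feedback` helper above.
raw = StringIO(
    '{"input": "Hello", "output_1": "Bonjour", "output_2": "Salut", "user_choice": 1}\n'
    '{"input": "Bye", "output_1": "Au revoir", "output_2": "Adieu", "user_choice": 2}\n'
)

# One json.loads per line: appending never invalidates previously written records.
records = [json.loads(line) for line in raw]
print(len(records))  # 2
```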

### create_commit

@@ -207,6 +275,12 @@ If you want to work at a commit-level, use the [`create_commit`] function direct

- [`CommitOperationDelete`] removes a file or a folder from a repository. This operation accepts `path_in_repo` as an argument.

- [`CommitOperationCopy`] copies a file within a repository. This operation accepts three arguments:

- `src_path_in_repo`: the repository path of the file to copy.
- `path_in_repo`: the repository path where the file should be copied.
  - `src_revision`: optional - the revision of the file to copy, if you want to copy a file from a different branch/revision.

For example, if you want to upload two files and delete a file in a Hub repository:

1. Use the appropriate `CommitOperation` to add or delete a file and to delete a folder:
@@ -219,6 +293,7 @@ For example, if you want to upload two files and delete a file in a Hub reposito
... CommitOperationAdd(path_in_repo="weights.h5", path_or_fileobj="~/repo/weights-final.h5"),
... CommitOperationDelete(path_in_repo="old-weights.h5"),
... CommitOperationDelete(path_in_repo="logs/"),
... CommitOperationCopy(src_path_in_repo="image.png", path_in_repo="duplicate_image.png"),
... ]
```

@@ -250,7 +325,7 @@ huggingface-cli lfs-enable-largefiles

You should install this for each repository that has a very large file. Once installed, you'll be able to push files larger than 5GB.

-## commit context manager
+### commit context manager

The `commit` context manager handles four of the most common Git commands: pull, add, commit, and push. `git-lfs` automatically tracks any file larger than 10MB. In the following example, the `commit` context manager:

@@ -309,7 +384,7 @@ When `blocking=False`, commands are tracked, and your script will only exit when
>>> last_command.failed
```

-## push_to_hub
+### push_to_hub

The [`Repository`] class has a [`~Repository.push_to_hub`] function to add files, make a commit, and push them to a repository. Unlike the `commit` context manager, you'll need to pull from a repository first before calling [`~Repository.push_to_hub`].

6 changes: 6 additions & 0 deletions docs/source/package_reference/hf_api.mdx
@@ -79,6 +79,12 @@ Below are the supported values for [`CommitOperation`]:

[[autodoc]] CommitOperationDelete

[[autodoc]] CommitOperationCopy

## CommitScheduler

[[autodoc]] CommitScheduler

## Token helper

`huggingface_hub` stores the authentication information locally so that it may be re-used in subsequent
8 changes: 7 additions & 1 deletion src/huggingface_hub/__init__.py
@@ -46,12 +46,15 @@
from typing import TYPE_CHECKING


-__version__ = "0.15.0.dev0"
+__version__ = "0.16.0.dev0"

# Alphabetical order of definitions is ensured in tests
# WARNING: any comment added in this dictionary definition will be lost when
# re-generating the file !
_SUBMOD_ATTRS = {
"_commit_scheduler": [
"CommitScheduler",
],
"_inference": [
"InferenceClient",
"InferenceTimeoutError",
@@ -131,6 +134,7 @@
"CommitInfo",
"CommitOperation",
"CommitOperationAdd",
"CommitOperationCopy",
"CommitOperationDelete",
"DatasetSearchArguments",
"GitCommitInfo",
@@ -349,6 +353,7 @@ def __dir__():
# make style
# ```
if TYPE_CHECKING: # pragma: no cover
from ._commit_scheduler import CommitScheduler # noqa: F401
from ._inference import (
InferenceClient, # noqa: F401
InferenceTimeoutError, # noqa: F401
@@ -424,6 +429,7 @@ def __dir__():
CommitInfo, # noqa: F401
CommitOperation, # noqa: F401
CommitOperationAdd, # noqa: F401
CommitOperationCopy, # noqa: F401
CommitOperationDelete, # noqa: F401
DatasetSearchArguments, # noqa: F401
GitCommitInfo, # noqa: F401
114 changes: 111 additions & 3 deletions src/huggingface_hub/_commit_api.py
@@ -8,8 +8,9 @@
from collections import defaultdict
from contextlib import contextmanager
from dataclasses import dataclass, field
from itertools import groupby
from pathlib import Path, PurePosixPath
from typing import Any, BinaryIO, Dict, Iterable, Iterator, List, Optional, Union
from typing import TYPE_CHECKING, Any, BinaryIO, Dict, Iterable, Iterator, List, Optional, Tuple, Union

from tqdm.contrib.concurrent import thread_map

@@ -18,6 +19,7 @@
from .constants import ENDPOINT, HF_HUB_ENABLE_HF_TRANSFER
from .lfs import UploadInfo, lfs_upload, post_lfs_batch_info
from .utils import (
EntryNotFoundError,
build_hf_headers,
chunk_iterable,
hf_raise_for_status,
@@ -29,6 +31,10 @@
from .utils._typing import Literal


if TYPE_CHECKING:
from .hf_api import RepoFile


logger = logging.get_logger(__name__)


@@ -66,6 +72,36 @@ def __post_init__(self):
)


@dataclass
class CommitOperationCopy:
"""
Data structure holding necessary info to copy a file in a repository on the Hub.
Limitations:
- Only LFS files can be copied. To copy a regular file, you need to download it locally and re-upload it
- Cross-repository copies are not supported.
Note: you can combine a [`CommitOperationCopy`] and a [`CommitOperationDelete`] to rename an LFS file on the Hub.
Args:
src_path_in_repo (`str`):
Relative filepath in the repo of the file to be copied, e.g. `"checkpoints/1fec34a/weights.bin"`.
path_in_repo (`str`):
Relative filepath in the repo where to copy the file, e.g. `"checkpoints/1fec34a/weights_copy.bin"`.
src_revision (`str`, *optional*):
The git revision of the file to be copied. Can be any valid git revision.
Defaults to the target commit revision.
"""

src_path_in_repo: str
path_in_repo: str
src_revision: Optional[str] = None

def __post_init__(self):
self.src_path_in_repo = _validate_path_in_repo(self.src_path_in_repo)
self.path_in_repo = _validate_path_in_repo(self.path_in_repo)


@dataclass
class CommitOperationAdd:
"""
@@ -206,7 +242,7 @@ def _validate_path_in_repo(path_in_repo: str) -> str:
return path_in_repo


-CommitOperation = Union[CommitOperationAdd, CommitOperationDelete]
+CommitOperation = Union[CommitOperationAdd, CommitOperationCopy, CommitOperationDelete]


def warn_on_overwriting_operations(operations: List[CommitOperation]) -> None:
@@ -449,9 +485,68 @@ def fetch_upload_modes(
return upload_modes


@validate_hf_hub_args
def fetch_lfs_files_to_copy(
copies: Iterable[CommitOperationCopy],
repo_type: str,
repo_id: str,
token: Optional[str],
revision: str,
endpoint: Optional[str] = None,
) -> Dict[Tuple[str, Optional[str]], "RepoFile"]:
"""
Requests the Hub files information of the LFS files to be copied, including their sha256.
Args:
copies (`Iterable` of :class:`CommitOperationCopy`):
Iterable of :class:`CommitOperationCopy` describing the files to
copy on the Hub.
repo_type (`str`):
Type of the repo to upload to: `"model"`, `"dataset"` or `"space"`.
repo_id (`str`):
A namespace (user or an organization) and a repo name separated
by a `/`.
token (`str`, *optional*):
An authentication token (see https://huggingface.co/settings/tokens).
revision (`str`):
The git revision to upload the files to. Can be any valid git revision.
Returns: `Dict[Tuple[str, Optional[str]], RepoFile]`
Key is the file path and revision of the file to copy, value is the repo file.
Raises:
[`~utils.HfHubHTTPError`]
If the Hub API returned an error.
[`ValueError`](https://docs.python.org/3/library/exceptions.html#ValueError)
If the Hub API response is improperly formatted.
"""
from .hf_api import HfApi

hf_api = HfApi(endpoint=endpoint, token=token)
files_to_copy = {}
for src_revision, operations in groupby(copies, key=lambda op: op.src_revision):
operations = list(operations) # type: ignore
paths = [op.src_path_in_repo for op in operations]
src_repo_files = hf_api.list_files_info(
repo_id=repo_id, paths=paths, revision=src_revision or revision, repo_type=repo_type
)
for src_repo_file in src_repo_files:
if not src_repo_file.lfs:
raise NotImplementedError("Copying a non-LFS file is not implemented")
files_to_copy[(src_repo_file.rfilename, src_revision)] = src_repo_file
for operation in operations:
if (operation.src_path_in_repo, src_revision) not in files_to_copy:
raise EntryNotFoundError(
f"Cannot copy {operation.src_path_in_repo} at revision "
f"{src_revision or revision}: file is missing on repo."
)
return files_to_copy
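
One detail worth noting in `fetch_lfs_files_to_copy`: `itertools.groupby` only groups *consecutive* items, so interleaved `src_revision` values result in one `list_files_info` call per run of equal revisions. A minimal stdlib sketch of that behavior:

```py
from itertools import groupby

# (src_revision, src_path_in_repo) pairs with interleaved revisions.
ops = [("main", "a.bin"), (None, "b.bin"), ("main", "c.bin")]

# groupby only merges adjacent equal keys, so this yields three groups, not two.
groups = [(rev, [path for _, path in items]) for rev, items in groupby(ops, key=lambda op: op[0])]
print(len(groups))  # 3
```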


def prepare_commit_payload(
operations: Iterable[CommitOperation],
upload_modes: Dict[str, UploadMode],
files_to_copy: Dict[Tuple[str, Optional[str]], "RepoFile"],
commit_message: str,
commit_description: Optional[str] = None,
parent_commit: Optional[str] = None,
@@ -503,7 +598,20 @@ def prepare_commit_payload(
"key": "deletedFolder" if operation.is_folder else "deletedFile",
"value": {"path": operation.path_in_repo},
}
-# 2.d. Never expected to happen
+# 2.d. Case copying a file or folder
elif isinstance(operation, CommitOperationCopy):
file_to_copy = files_to_copy[(operation.src_path_in_repo, operation.src_revision)]
if not file_to_copy.lfs:
raise NotImplementedError("Copying a non-LFS file is not implemented")
yield {
"key": "lfsFile",
"value": {
"path": operation.path_in_repo,
"algo": "sha256",
"oid": file_to_copy.lfs["sha256"],
},
}
# 2.e. Never expected to happen
else:
raise ValueError(
f"Unknown operation to commit. Operation: {operation}. Upload mode:"