Commit 774a0d8

Merge branch 'main' into mc/rm_lock

Wauplin authored Sep 18, 2023
2 parents 2b025e0 + 014e9d6 commit 774a0d8
Showing 11 changed files with 67 additions and 83 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/python-tests.yml
@@ -56,6 +56,7 @@ jobs:
case "${{ matrix.test_name }}" in
"Repository only" | "Everything else")
sudo apt update
sudo apt install -y libsndfile1-dev
;;
@@ -69,6 +70,7 @@
;;
tensorflow)
sudo apt update
sudo apt install -y graphviz
pip install .[tensorflow]
;;
58 changes: 2 additions & 56 deletions docs/source/en/guides/upload.md
@@ -435,62 +435,8 @@ For more detailed information, take a look at the [`HfApi`] reference.

There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data,
having an upload or push fail at the end of the process, or encountering a degraded experience on hf.co or when working locally, can be very frustrating.
We have gathered a list of tips and recommendations for structuring your repo.


| Characteristic | Recommended | Tips |
| ---------------- | ------------------ | ------------------------------------------------------ |
| Repo size | - | contact us for large repos (TBs of data) |
| Files per repo | <100k | merge data into fewer files |
| Entries per folder | <10k | use subdirectories in repo |
| File size | <5GB | split data into chunked files |
| Commit size | <100 files* | upload files in multiple commits |
| Commits per repo | - | upload multiple files per commit and/or squash history |

_* Not relevant when using `git` CLI directly_

Please read the next section to better understand those limits and how to deal with them.

### Hub repository size limitations

What are we talking about when we say "large uploads", and what are their associated limitations? Large uploads can be
very diverse, from repositories with a few huge files (e.g. model weights) to repositories with thousands of small files
(e.g. an image dataset).

Under the hood, the Hub uses Git to version the data, which has structural implications on what you can do in your repo.
If your repo is crossing some of the numbers mentioned in the previous section, **we strongly encourage you to check out [`git-sizer`](https://github.com/github/git-sizer)**,
which has very detailed documentation about the different factors that will impact your experience. Here is a TL;DR of factors to consider:

- **Repository size**: The total size of the data you're planning to upload. There is no hard limit on a Hub repository size. However, if you plan to upload hundreds of GBs or even TBs of data, we would appreciate it if you could let us know in advance so we can better help you if you have any questions during the process. You can contact us at [email protected] or on [our Discord](http://hf.co/join/discord).
- **Number of files**:
- For optimal experience, we recommend keeping the total number of files under 100k. Try merging the data into fewer files if you have more.
For example, json files can be merged into a single jsonl file, or large datasets can be exported as Parquet files.
- A folder cannot contain more than 10k files. A simple solution is to
create a repository structure that uses subdirectories. For example, a repo with 1k folders from `000/` to `999/`, each containing at most 1000 files, is already enough.
- **File size**: In the case of uploading large files (e.g. model weights), we strongly recommend splitting them **into chunks of around 5GB each**.
There are a few reasons for this:
- Uploading and downloading smaller files is much easier both for you and the other users. Connection issues can always
happen when streaming data, and smaller files avoid having to resume from the beginning in case of errors.
- Files are served to the users using CloudFront. From our experience, huge files are not cached by this service,
leading to slower download speeds.
In any case, no single LFS file can be larger than 50GB, i.e. 50GB is the hard limit for single-file size.
- **Number of commits**: There is no hard limit for the total number of commits on your repo history. However, from
our experience, the user experience on the Hub starts to degrade after a few thousand commits. We are constantly working to
improve the service, but one must always remember that a git repository is not meant to work as a database with a lot of
writes. If your repo's history gets very large, it is always possible to squash all the commits to get a
fresh start using [`super_squash_history`]. This is a non-revertible operation.
- **Number of operations per commit**: Once again, there is no hard limit here. When a commit is uploaded to the Hub, each
git operation (addition or deletion) is checked by the server. When a hundred LFS files are committed at once,
each file is checked individually to ensure it's been correctly uploaded. When pushing data through HTTP with `huggingface_hub`,
a timeout of 60s is set on the request, meaning that if the process takes more time, an error is raised
client-side. However, it can happen (in rare cases) that even if the timeout is raised client-side, the process is still
completed server-side. This can be checked manually by browsing the repo on the Hub. To prevent this timeout, we recommend
adding around 50-100 files per commit.
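The "merge data into fewer files" tip above can be sketched as a small helper. This is an illustrative snippet, not a `huggingface_hub` API; the function name and file layout are assumptions:

```python
import json
from pathlib import Path


def merge_json_to_jsonl(json_dir: str, out_path: str) -> int:
    """Merge every `*.json` file in `json_dir` into one JSON-Lines file.

    Illustrative helper (not part of `huggingface_hub`): each input file
    becomes one line of the output, which keeps the repo file count low.
    Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for json_file in sorted(Path(json_dir).glob("*.json")):
            record = json.loads(json_file.read_text(encoding="utf-8"))
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

A single `.jsonl` file produced this way can then be uploaded in place of thousands of small `.json` files.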
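Likewise, the ~5GB file-size recommendation can be followed by splitting a large file into fixed-size chunks before uploading. A minimal sketch, where the helper name and the `.partNN` naming scheme are assumptions rather than a Hub convention:

```python
from pathlib import Path
from typing import List


def split_into_chunks(path: str, chunk_size: int = 5 * 1024**3) -> List[Path]:
    """Split `path` into `<name>.partNN` files of at most `chunk_size` bytes.

    Illustrative helper: chunked files are easier to upload and download,
    and resuming after a connection error only re-transfers one chunk
    instead of the whole file.
    """
    src = Path(path)
    parts: List[Path] = []
    with src.open("rb") as f:
        index = 0
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            part = src.with_name(f"{src.name}.part{index:02d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts
```

Reassembly on the consumer side is the mirror image: concatenate the parts in order.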

### Practical tips

Now that we've seen the technical aspects you must consider when structuring your repository, let's see some practical
tips to make your upload process as smooth as possible.

Check out our [Repository limitations and recommendations](https://huggingface.co/docs/hub/repositories-recommendations) guide for best practices on how to structure your repositories on the Hub. Next, let's move on with some practical tips to make your upload process as smooth as possible.

- **Start small**: We recommend starting with a small amount of data to test your upload script. It's easier to iterate
on a script when failing takes only a little time.
2 changes: 1 addition & 1 deletion src/huggingface_hub/__init__.py
@@ -46,7 +46,7 @@
from typing import TYPE_CHECKING


__version__ = "0.17.0.dev0"
__version__ = "0.18.0.dev0"

# Alphabetical order of definitions is ensured in tests
# WARNING: any comment added in this dictionary definition will be lost when
22 changes: 13 additions & 9 deletions src/huggingface_hub/commands/upload.py
@@ -51,7 +51,7 @@
from huggingface_hub import logging
from huggingface_hub._commit_scheduler import CommitScheduler
from huggingface_hub.commands import BaseHuggingfaceCLICommand
from huggingface_hub.hf_api import create_repo, upload_file, upload_folder
from huggingface_hub.hf_api import HfApi
from huggingface_hub.utils import disable_progress_bars, enable_progress_bars


@@ -134,7 +134,7 @@ def __init__(self, args: Namespace) -> None:
self.commit_message: Optional[str] = args.commit_message
self.commit_description: Optional[str] = args.commit_description
self.create_pr: bool = args.create_pr
self.token: Optional[str] = args.token
self.api: HfApi = HfApi(token=args.token, library_name="huggingface-cli")
self.quiet: bool = args.quiet # disable warnings and progress bars

# Check `--every` is valid
@@ -222,7 +222,7 @@ def _upload(self) -> str:
path_in_repo=path_in_repo,
private=self.private,
every=self.every,
token=self.token,
hf_api=self.api,
)
print(f"Scheduling commits every {self.every} minutes to {scheduler.repo_id}.")
try: # Block main thread until KeyboardInterrupt
@@ -235,33 +235,37 @@ def _upload(self) -> str:
# Otherwise, create repo and proceed with the upload
if not os.path.isfile(self.local_path) and not os.path.isdir(self.local_path):
raise FileNotFoundError(f"No such file or directory: '{self.local_path}'.")
repo_id = create_repo(
repo_id=self.repo_id, repo_type=self.repo_type, exist_ok=True, private=self.private, token=self.token
repo_id = self.api.create_repo(
repo_id=self.repo_id,
repo_type=self.repo_type,
exist_ok=True,
private=self.private,
space_sdk="gradio" if self.repo_type == "space" else None,
# ^ We don't want it to fail when uploading to a Space => let's set Gradio by default.
# ^ I'd rather not add CLI args to set it explicitly as we already have `huggingface-cli repo create` for that.
).repo_id

# File-based upload
if os.path.isfile(self.local_path):
return upload_file(
return self.api.upload_file(
path_or_fileobj=self.local_path,
path_in_repo=self.path_in_repo,
repo_id=repo_id,
repo_type=self.repo_type,
revision=self.revision,
token=self.token,
commit_message=self.commit_message,
commit_description=self.commit_description,
create_pr=self.create_pr,
)

# Folder-based upload
else:
return upload_folder(
return self.api.upload_folder(
folder_path=self.local_path,
path_in_repo=self.path_in_repo,
repo_id=repo_id,
repo_type=self.repo_type,
revision=self.revision,
token=self.token,
commit_message=self.commit_message,
commit_description=self.commit_description,
create_pr=self.create_pr,
20 changes: 17 additions & 3 deletions src/huggingface_hub/file_download.py
@@ -40,6 +40,7 @@
)
from .utils import (
EntryNotFoundError,
FileMetadataError,
GatedRepoError,
LocalEntryNotFoundError,
RepositoryNotFoundError,
@@ -700,7 +701,7 @@ def cached_download(
# we fallback to the regular etag header.
# If we don't have any of those, raise an error.
if etag is None:
raise OSError(
raise FileMetadataError(
"Distant resource does not have an ETag, we won't be able to reliably ensure reproducibility."
)
# We get the expected size of the file, to check the download went well.
@@ -1246,15 +1247,19 @@ def hf_hub_download(
# Commit hash must exist
commit_hash = metadata.commit_hash
if commit_hash is None:
raise OSError("Distant resource does not seem to be on huggingface.co (missing commit header).")
raise FileMetadataError(
"Distant resource does not seem to be on huggingface.co. It is possible that a configuration issue"
" prevents you from downloading resources from https://huggingface.co. Please check your firewall"
" and proxy settings and make sure your SSL certificates are updated."
)

# Etag must exist
etag = metadata.etag
# We favor a custom header indicating the etag of the linked resource, and
# we fallback to the regular etag header.
# If we don't have any of those, raise an error.
if etag is None:
raise OSError(
raise FileMetadataError(
"Distant resource does not have an ETag, we won't be able to reliably ensure reproducibility."
)

@@ -1293,12 +1298,21 @@ def hf_hub_download(
# (if it's not the case, the error will be re-raised)
head_call_error = error
pass
except FileMetadataError as error:
# Multiple reasons for a FileMetadataError:
# - Wrong network configuration (proxy, firewall, SSL certificates)
# - Inconsistency on the Hub
# => let's switch to 'local_files_only=True' to check if the files are already cached.
# (if it's not the case, the error will be re-raised)
head_call_error = error
pass

# etag can be None for several reasons:
# 1. we passed local_files_only.
# 2. we don't have a connection
# 3. Hub is down (HTTP 500 or 504)
# 4. repo is not found -for example private or gated- and invalid/missing token sent
# 5. Hub is blocked by a firewall or proxy is not set correctly.
# => Try to get the last downloaded one from the specified revision.
#
# If the specified revision is a commit hash, look inside "snapshots".
14 changes: 13 additions & 1 deletion src/huggingface_hub/hf_api.py
@@ -2540,7 +2540,19 @@ def create_repo(
# See https://github.com/huggingface/huggingface_hub/pull/733/files#r820604472
json["lfsmultipartthresh"] = self._lfsmultipartthresh # type: ignore
headers = self._build_hf_headers(token=token, is_write_action=True)
r = get_session().post(path, headers=headers, json=json)

while True:
r = get_session().post(path, headers=headers, json=json)
if r.status_code == 409 and "Cannot create repo: another conflicting operation is in progress" in r.text:
# Since https://github.com/huggingface/moon-landing/pull/7272 (private repo), it is not possible to
# concurrently create repos on the Hub for the same user. This is rarely an issue, except when running
# tests. To avoid any inconvenience, we retry to create the repo for this specific error.
# NOTE: This could have been fixed directly in the tests, but adding it here should fix CIs for all
# dependent libraries.
# NOTE: If a fix is implemented server-side, we should be able to remove this retry mechanism.
logger.debug("Create repo failed due to a concurrency issue. Retrying...")
continue
break

try:
hf_raise_for_status(r)
1 change: 1 addition & 0 deletions src/huggingface_hub/utils/__init__.py
@@ -32,6 +32,7 @@
from ._errors import (
BadRequestError,
EntryNotFoundError,
FileMetadataError,
GatedRepoError,
HfHubHTTPError,
LocalEntryNotFoundError,
7 changes: 7 additions & 0 deletions src/huggingface_hub/utils/_errors.py
@@ -5,6 +5,13 @@
from ._fixes import JSONDecodeError


class FileMetadataError(OSError):
"""Error triggered when the metadata of a file on the Hub cannot be retrieved (missing ETag or commit_hash).
Inherits from `OSError` for backward compatibility.
"""


class HfHubHTTPError(HTTPError):
"""
HTTPError to inherit from for any custom HTTP Error raised in HF Hub.
20 changes: 9 additions & 11 deletions tests/test_cli.py
@@ -85,7 +85,7 @@ def test_upload_basic(self) -> None:
self.assertEqual(cmd.commit_description, None)
self.assertEqual(cmd.create_pr, False)
self.assertEqual(cmd.every, None)
self.assertEqual(cmd.token, None)
self.assertEqual(cmd.api.token, None)
self.assertEqual(cmd.quiet, False)

def test_upload_with_all_options(self) -> None:
@@ -135,7 +135,7 @@ def test_upload_with_all_options(self) -> None:
self.assertEqual(cmd.commit_description, "My commit description")
self.assertEqual(cmd.create_pr, True)
self.assertEqual(cmd.every, 5)
self.assertEqual(cmd.token, "my-token")
self.assertEqual(cmd.api.token, "my-token")
self.assertEqual(cmd.quiet, True)

def test_upload_implicit_local_path_when_folder_exists(self) -> None:
@@ -211,8 +211,8 @@ def test_every_as_float(self) -> None:
cmd = UploadCommand(self.parser.parse_args(["upload", DUMMY_MODEL_ID, ".", "--every", "0.5"]))
self.assertEqual(cmd.every, 0.5)

@patch("huggingface_hub.commands.upload.upload_folder")
@patch("huggingface_hub.commands.upload.create_repo")
@patch("huggingface_hub.commands.upload.HfApi.upload_folder")
@patch("huggingface_hub.commands.upload.HfApi.create_repo")
def test_upload_folder_mock(self, create_mock: Mock, upload_mock: Mock) -> None:
with SoftTemporaryDirectory() as cache_dir:
cmd = UploadCommand(
@@ -223,15 +223,14 @@ def test_upload_folder_mock(self, create_mock: Mock, upload_mock: Mock) -> None:
cmd.run()

create_mock.assert_called_once_with(
repo_id="my-model", repo_type="model", exist_ok=True, private=True, token=None
repo_id="my-model", repo_type="model", exist_ok=True, private=True, space_sdk=None
)
upload_mock.assert_called_once_with(
folder_path=cache_dir,
path_in_repo=".",
repo_id=create_mock.return_value.repo_id,
repo_type="model",
revision=None,
token=None,
commit_message=None,
commit_description=None,
create_pr=False,
@@ -240,8 +239,8 @@ def test_upload_folder_mock(self, create_mock: Mock, upload_mock: Mock) -> None:
delete_patterns=["*.json"],
)

@patch("huggingface_hub.commands.upload.upload_file")
@patch("huggingface_hub.commands.upload.create_repo")
@patch("huggingface_hub.commands.upload.HfApi.upload_file")
@patch("huggingface_hub.commands.upload.HfApi.create_repo")
def test_upload_file_mock(self, create_mock: Mock, upload_mock: Mock) -> None:
with SoftTemporaryDirectory() as cache_dir:
file_path = Path(cache_dir) / "file.txt"
@@ -254,21 +253,20 @@ def test_upload_file_mock(self, create_mock: Mock, upload_mock: Mock) -> None:
cmd.run()

create_mock.assert_called_once_with(
repo_id="my-dataset", repo_type="dataset", exist_ok=True, private=False, token=None
repo_id="my-dataset", repo_type="dataset", exist_ok=True, private=False, space_sdk=None
)
upload_mock.assert_called_once_with(
path_or_fileobj=str(file_path),
path_in_repo="logs/file.txt",
repo_id=create_mock.return_value.repo_id,
repo_type="dataset",
revision=None,
token=None,
commit_message=None,
commit_description=None,
create_pr=True,
)

@patch("huggingface_hub.commands.upload.create_repo")
@patch("huggingface_hub.commands.upload.HfApi.create_repo")
def test_upload_missing_path(self, create_mock: Mock) -> None:
cmd = UploadCommand(self.parser.parse_args(["upload", "my-model", "/path/to/missing_file", "logs/file.txt"]))
with self.assertRaises(FileNotFoundError):
2 changes: 1 addition & 1 deletion tests/test_inference_async_client.py
@@ -212,7 +212,7 @@ async def test_get_status_too_big_model() -> None:

@pytest.mark.asyncio
async def test_get_status_loaded_model() -> None:
model_status = await AsyncInferenceClient().get_model_status("bigcode/starcoder")
model_status = await AsyncInferenceClient().get_model_status("bigscience/bloom")
assert model_status.loaded is True
assert model_status.state == "Loaded"
assert model_status.compute_type == "gpu"
2 changes: 1 addition & 1 deletion tests/test_inference_client.py
@@ -519,7 +519,7 @@ def test_too_big_model(self) -> None:

def test_loaded_model(self) -> None:
client = InferenceClient()
model_status = client.get_model_status("bigcode/starcoder")
model_status = client.get_model_status("bigscience/bloom")
self.assertTrue(model_status.loaded)
self.assertEqual(model_status.state, "Loaded")
self.assertEqual(model_status.compute_type, "gpu")
