add CommitOperationCopy #1495

Merged
merged 17 commits into from
Jun 6, 2023

Conversation

lhoestq
Member

@lhoestq lhoestq commented Jun 2, 2023

close #1083

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Jun 2, 2023

The documentation is not available anymore as the PR was closed or merged.

Contributor

@Wauplin Wauplin left a comment

Hey @lhoestq thanks for opening a PR on this topic! I've made a few comments on your work. The 2 main things I would change in the logic:

  • fail early in case of missing file on the repo or non-LFS file
  • handle list_files_info differently (returned value is a bit misleading but basically you can't assume the output to be aligned 1-1 with the input) 😕

Also listed a few things to complete before being ready to merge. Could you:

  1. Add to the existing test a second copy from the same source to a different destination (i.e. a single commit that copies the same file to 2 different locations) => we should ensure it works (not the case currently)
  2. Update operations argument in create_commit docstring
  3. Add small section + example in the upload guide
  4. Add CommitOperationCopy to package reference

Thanks a lot in advance! 🤗

Note: I made some changes to update _multi_commits.py (no need to care about that one) accordingly + make CommitOperationCopy a first-class citizen in huggingface_hub.__init__.py. You should pull from the branch before making changes :)

src/huggingface_hub/_commit_api.py
    for src_revision, operations in groupby(copies, key=lambda op: op.src_revision):
        operations = list(operations)  # type: ignore
        paths = [op.src_path_in_repo for op in operations]
        src_repo_files = hf_api.list_files_info(
Contributor

list_files_info can be a bit misleading but basically you cannot assume that the returned files will be in the same order as the input paths + you cannot even assume that the list will be the same size. In particular:

  • if a path is duplicated (i.e. user is copying a file multiple times), the server will return only one RepoFile instance
  • if a path is missing on the repo, the server will silently ignore it

So what I would advise doing is:

  • take as input a List[CommitOperationCopy] instead of Iterable
  • do a first pass with for src_revision, operations in groupby(copies, key=lambda op: op.src_revision):
    • list_files_info
    • if a returned RepoFile has .lfs = None => raise an error
    • add to files_to_copy using RepoFile.rfilename, src_revision and RepoFile.lfs["sha256"]
  • do a second pass on copies
    • if op.src_path_in_repo/op.src_revision pair is not found in the sha256 dictionary => raise the EntryNotFound error

(disclaimer: I haven't tested the above logic)
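
The two-pass logic above could be sketched roughly as follows. This is an illustration only, not the code merged in this PR: `CopyOp`, `RepoFileStub`, `fetch_lfs_files_to_copy` and the `list_files_info` callable are hypothetical stand-ins for `CommitOperationCopy`, `RepoFile` and `HfApi.list_files_info`, and `LookupError` stands in for the EntryNotFound error.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class CopyOp:
    """Hypothetical stand-in for CommitOperationCopy."""

    src_path_in_repo: str
    path_in_repo: str
    src_revision: Optional[str] = None


@dataclass
class RepoFileStub:
    """Hypothetical stand-in for RepoFile (only the fields used here)."""

    rfilename: str
    lfs: Optional[dict] = None


# Assumed shape of the listing call: (paths, revision) -> files found.
ListFilesInfo = Callable[[List[str], Optional[str]], List[RepoFileStub]]
SourceKey = Tuple[str, Optional[str]]


def fetch_lfs_files_to_copy(
    copies: List[CopyOp], list_files_info: ListFilesInfo
) -> Dict[SourceKey, str]:
    sha256_by_source: Dict[SourceKey, str] = {}

    # First pass: one listing call per run of operations sharing a revision.
    # The result is keyed by (path, revision), so it does not matter if the
    # server deduplicates paths or returns them in a different order.
    for src_revision, ops in groupby(copies, key=lambda op: op.src_revision):
        paths = [op.src_path_in_repo for op in ops]
        for repo_file in list_files_info(paths, src_revision):
            if repo_file.lfs is None:
                # Fail early: only LFS files can be copied.
                raise ValueError(f"'{repo_file.rfilename}' is not an LFS file")
            key = (repo_file.rfilename, src_revision)
            sha256_by_source[key] = repo_file.lfs["sha256"]

    # Second pass: every requested source must have been returned.
    for op in copies:
        if (op.src_path_in_repo, op.src_revision) not in sha256_by_source:
            raise LookupError(f"'{op.src_path_in_repo}' not found on the repo")

    return sha256_by_source
```

Keying the dictionary by the (path, revision) pair is what lets the same source file be copied to several destinations in one commit.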

Member Author

Sounds good! Why switch to List instead of Iterable?

Contributor

You can iterate over a list twice. With a generic Iterable, that's not guaranteed.
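
A small illustration of the difference, using only the standard library (`count_two_passes` is a hypothetical helper written for this example):

```python
from typing import Iterable, List


def count_two_passes(items: Iterable[str]) -> List[int]:
    """Count the items seen on a first and then a second pass."""
    first = sum(1 for _ in items)
    second = sum(1 for _ in items)
    return [first, second]


paths = ["a.bin", "b.bin"]

# A list can be traversed as many times as needed.
as_list = count_two_passes(paths)  # [2, 2]

# A generator is exhausted after the first pass.
as_generator = count_two_passes(p for p in paths)  # [2, 0]
```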

Member Author

Now I only iterate on it once using groupby so we're good
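
As a side note on single-pass grouping, `itertools.groupby` does consume its input only once, but it only groups consecutive items with the same key. The sketch below uses made-up revision names to show that an unsorted input can yield the same key more than once; whether that matters for the code in this PR is not stated in this thread.

```python
from itertools import groupby

# Made-up revision names; "main" appears in two separate runs.
revisions = ["main", "main", "dev", "main"]

# iter(...) makes the input a one-shot iterator: groupby still works.
groups = [(key, len(list(group))) for key, group in groupby(iter(revisions))]
# groups == [("main", 2), ("dev", 1), ("main", 1)]
```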

Contributor

Perfect :)

@lhoestq lhoestq marked this pull request as ready for review June 6, 2023 14:12
@lhoestq
Member Author

lhoestq commented Jun 6, 2023

Took all your comments into account :)

Contributor

@Wauplin Wauplin left a comment

Looks good to me! Thanks for making the changes :)
I'll quickly update _multi_commits.py and we'll be ready to merge.

@lhoestq
Member Author

lhoestq commented Jun 6, 2023

feel free to merge once it's good for you ;)

Successfully merging this pull request may close these issues.

Add copy operation to commit API
3 participants