
Run HfApi methods in the background (run_as_future) #1458

Merged
merged 19 commits into main from 939-run-hfapi-in-background on May 17, 2023

Conversation

@Wauplin (Contributor) commented Apr 28, 2023

Related to #939 (cc @LysandreJik @osanseviero @julien-c)

EDIT: PR completely refactored (so is this description)
EDIT 2: PR completely refactored (again)


This PR makes it possible to run HfApi methods in the background. I'm using a ThreadPoolExecutor with only 1 worker, which means we preserve the order in which jobs are submitted. The returned value is a Future with methods like .done(), .result() and .exception() already implemented, so we don't have to support weird stuff in huggingface_hub.
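The single-worker executor behavior described above can be sketched with nothing but the standard library (the helper name run_in_background is hypothetical, not huggingface_hub code):

```python
from concurrent.futures import Future, ThreadPoolExecutor

# A single worker guarantees jobs run one at a time, in submission order.
_executor = ThreadPoolExecutor(max_workers=1)

def run_in_background(fn, *args, **kwargs) -> Future:
    """Submit fn to the shared single-worker pool; return a standard Future."""
    return _executor.submit(fn, *args, **kwargs)

order = []
f1 = run_in_background(order.append, "first")
f2 = run_in_background(order.append, "second")
f2.result()  # waiting on the second job implies the first already ran
assert order == ["first", "second"]
```

Because the pool has a single worker, the caller gets ordering guarantees for free, and a failing job surfaces its exception only when .result() or .exception() is called, so it cannot crash the main thread.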

The goal of this PR is:

  • to use HfApi in training scripts instead of the git-based methods. Similar to Repository.push(blocking=False) but no need to clone the repo.
  • to avoid interrupting the training process while uploading
  • to avoid crashing the training in case of a ConnectionError (or any error)
  • NOT to parallelize concurrent calls to the Hub => for more flexibility, users must implement their threading strategy themselves

UX-wise, we did several iterations. We came to the conclusion that having a run_as_future: bool = False argument in future-compatible methods was the way to go. It is not perfect, but from a developer perspective it should be the most practical.

>>> import time
>>> from huggingface_hub import HfApi

>>> api = HfApi()
>>> api.upload_file(...)  # takes Xs
# URL to upload file

>>> future = api.upload_file(..., run_as_future=True) # instant
>>> future.result() # wait until complete
# URL to upload file

Since the chosen solution is quite verbose in hf_api.py to get the correct type annotations/IDE autocompletion, only the most useful methods have a run_as_future: bool argument: upload_file, upload_folder and create_commit. For other methods, it is still possible to run them in the background using HfApi.run_as_future:

>>> from huggingface_hub import HfApi
>>> api = HfApi()
>>> api.run_as_future(api.create_repo, "username/my-model", exists_ok=True)
Future(...)
>>> api.upload_file(
...     repo_id="username/my-model",
...     path_in_repo="file.txt",
...     path_or_fileobj=b"file content",
...     run_as_future=True,
... )
Future(...)

TODO:

@sgugger (Contributor) commented Apr 28, 2023

API-wise, this makes sense to me. We'd need to adapt the internals of the Trainer once this is released if we want to migrate from Repository to this API, but we can do the same thing with this new API.

@Wauplin (Contributor, Author) commented Apr 28, 2023

Technical details:

To make the methods of HfApi "threaded", I override the __getattribute__ method of HfApi to update the behavior of the methods as they are accessed by the user at runtime. Instead of a normal method, I return a class with a __call__ method (sync) and a threaded one (async). It works fine in terms of usability and has the advantage of not touching the default behavior.

The counterpart is that static type checking is broken for the .threaded() method (meaning mypy complains and no auto-completion is possible). I don't think it's a huge problem as the main targets are training scripts that are already not typed. I didn't want to add a blocking: bool = True default parameter to each and every method as it would be annoying to maintain (+ type annotation of the returned value would be annoying as well).
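For illustration, here is a minimal, hypothetical sketch of that __getattribute__ trick (the toy Api class and _Callable wrapper are illustrative names, not the PR's actual code):

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=1)

class _Callable:
    """Wraps a bound method: __call__ runs it synchronously,
    .threaded() submits it to the shared background pool."""
    def __init__(self, fn):
        self._fn = fn

    def __call__(self, *args, **kwargs):
        return self._fn(*args, **kwargs)

    def threaded(self, *args, **kwargs) -> Future:
        return _pool.submit(self._fn, *args, **kwargs)

class Api:
    def whoami(self) -> str:
        return "Wauplin"

    def __getattribute__(self, name):
        attr = super().__getattribute__(name)
        # Wrap public methods so they gain a .threaded() variant.
        if callable(attr) and not name.startswith("_"):
            return _Callable(attr)
        return attr

api = Api()
api.whoami()                      # sync call, unchanged behavior
api.whoami.threaded().result()    # same result, via the background pool
```

This keeps the synchronous path untouched, but as noted above, static type checkers see a bound method rather than a _Callable, so .threaded() has no type information.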

Logic-wise the PR is complete if we go for this solution. I'll need to add tests/docstrings/documentation afterwards.

EDIT: PR completely refactored since this comment.

@patrickvonplaten (Contributor) commented May 3, 2023

I like the design! We pretty much only use upload_folder in the diffusers training scripts, so happy to change to upload_folder.threaded() logic

@HuggingFaceDocBuilderDev commented May 9, 2023

The documentation is not available anymore as the PR was closed or merged.

@Wauplin Wauplin changed the title [Draft] Run HfApi methods in the background Run HfApi methods in the background (upload_file_threaded) May 9, 2023
@Wauplin Wauplin marked this pull request as ready for review May 9, 2023 15:21
@Wauplin Wauplin requested review from LysandreJik and julien-c May 9, 2023 15:21
@codecov bot commented May 15, 2023

Codecov Report

Patch coverage: 60.65% and project coverage change: +29.04% 🎉

Comparison is base (10b1cb2) 53.08% compared to head (a40f4e2) 82.13%.

❗ Current head a40f4e2 differs from pull request most recent head 522fbc4. Consider uploading reports for the commit 522fbc4 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1458       +/-   ##
===========================================
+ Coverage   53.08%   82.13%   +29.04%     
===========================================
  Files          53       54        +1     
  Lines        5638     5759      +121     
===========================================
+ Hits         2993     4730     +1737     
+ Misses       2645     1029     -1616     
Impacted Files Coverage Δ
src/huggingface_hub/_multi_commits.py 36.73% <0.00%> (ø)
src/huggingface_hub/_threaded_hf_api.py 59.13% <59.13%> (ø)
src/huggingface_hub/hf_api.py 83.83% <100.00%> (+41.68%) ⬆️
src/huggingface_hub/utils/_validators.py 100.00% <100.00%> (+13.63%) ⬆️

... and 35 files with indirect coverage changes


@LysandreJik (Member)

Thank you for your PR!

I can't find the docs for HfApi on your PR. On main, the HfApi docs are populated as they should be, which is not the case here.

I wanted to check the docs specifically as I'm wondering if we need to 2x the API + docs of the HfApi to add a threaded version. I wonder if having a boolean argument threaded to each method (that could be added thanks to a @threaded decorator) wouldn't be simpler from a user's and a developer's perspective.

  • Having each thread-compatible method have its _threaded() counterpart results in an API that's a bit more bloated than it needs to be, especially as you'll need to add them to the root of HfApi if you want to mirror the methods that can be imported without an HfApi instance, for example with from huggingface_hub.hf_api import list_models. Not doing so means that the user will have to adjust their logic to handle instantiation when they didn't need to before.
  • This approach also needs a significant generation script plus a visible resulting private file directly in the repository.

WDYT?

@patrickvonplaten (Contributor)

I liked the API of:

>>> import time
>>> from huggingface_hub import whoami

>>> whoami()  # takes 1s
"Wauplin"
>>> future = whoami.detach() # instant
>>> time.sleep(1)
>>> future.result()
"Wauplin"

more than the current API, which has become pretty nested/hard to read: HfApi inherits from _ThreadedHFApi, which inherits from _HfApi. +1 to @LysandreJik that it's also not great if the API docs are fully duplicated.

Think I also like the idea of adding a boolean flag instead. BTW, why the name threaded? Would something like as_future be better? threaded to me sounds like the method is multi-threaded to be faster, but it really only creates a future, no?

Also just an idea, could a context manager make sense here maybe?

>>> import time
>>> from huggingface_hub import whoami

>>> whoami()  # takes 1s
"Wauplin"
>>> with huggingface_hub.as_future():
>>>    future = whoami() # instant
>>> time.sleep(1)
>>> future.result()
"Wauplin"

The problem here, though, is that not all methods can be run as a future, and it's also not like the future completes when the context manager exits, so maybe not super intuitive.

Overall to me the most natural design would indeed be a flag, like @LysandreJik proposed:

>>> import time
>>> from huggingface_hub import whoami

>>> whoami()  # takes 1s
"Wauplin"
>>> future = whoami(return_as_future=True) # instant
>>> time.sleep(1)
>>> future.result()
"Wauplin"

@Wauplin (Contributor, Author) commented May 16, 2023

Thanks a lot @LysandreJik @patrickvonplaten for your feedback. To be honest, I have been very indecisive about how to design the API. A large part of this indecisiveness is that I don't like having a return value that is not typed correctly. It might seem like nitpicking, but in my opinion type annotations really help UX-wise, especially to help IDEs with auto-completion. I'm not too concerned when dealing with internal stuff, but here we are talking about an exposed public API. That being said, I agree with both of you that duplicating all methods with a _threaded version is not satisfying. It had the advantage of using only classic Python features and being flexible, but yeah, let's drop the idea.

At first I suggested doing something like future = HfApi().upload_file.as_future(repo_id=..., path_in_repo=...) (or .detach(...)), but I realized it's not satisfying either. First, .as_future was an attribute of a method, which is already clunky. Second, since .as_future(...) was not typed, the user would have no autocompletion, no suggestion of args and no docs in the IDE when typing .as_future(.

I did not investigate the idea of a context manager but I think we would have the same kind of problem. When doing

>>> with huggingface_hub.as_future():
>>>    future = whoami() # instant
>>> value = future.result() # takes 1s

the IDE would not understand that "future" is a Future object (and not a dict).

In the end, I agree with having a flag as_future (btw thanks @patrickvonplaten for the naming!). I finally arrived at a satisfying solution using the @overload decorator. In the example below, if as_future=False is passed, it returns a string, and if as_future=True, it returns a Future[str]:

import threading
from concurrent.futures import Future
from typing import Literal, Union, overload

class Api:
    @overload
    def do_something(self, a: int, b: bool = ..., as_future: Literal[False] = ...) -> str:  # type: ignore
        ...

    @overload
    def do_something(self, a: int, b: bool = ..., as_future: Literal[True] = ...) -> Future[str]:
        ...

    @future_compatible  # decorator introduced in this PR; dispatches on as_future
    def do_something(self, a: int, b: bool = True, as_future: bool = False) -> Union[str, Future[str]]:
        return f"{a} {b} {threading.current_thread().name}"
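For context, a future_compatible decorator along these lines could dispatch on the flag roughly as follows (a hedged sketch under my own assumptions; the PR's actual implementation may differ):

```python
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps
from typing import Union

_pool = ThreadPoolExecutor(max_workers=1)  # shared single-worker pool

def future_compatible(fn):
    """If as_future=True is passed, submit fn to the background pool
    and return a Future; otherwise call it synchronously."""
    @wraps(fn)
    def wrapper(self, *args, as_future: bool = False, **kwargs):
        if as_future:
            return _pool.submit(fn, self, *args, **kwargs)
        return fn(self, *args, **kwargs)
    return wrapper

class Api:
    @future_compatible
    def do_something(self, a: int) -> Union[str, "Future[str]"]:
        return f"got {a}"

api = Api()
api.do_something(1)                  # "got 1" (sync)
api.do_something(2, as_future=True)  # Future; .result() -> "got 2"
```

The @overload stubs then exist purely for the type checker: at runtime only the decorated implementation is called.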

Unless you disagree, I think I'll go with this solution. It's unfortunately quite verbose to write, but UX-wise it should be nice for the user. I investigated a bit and the overload stubs could be placed in a separate .pyi file, but that has some drawbacks, so I don't think I'll do it for now. What I suggest is to write these manually for the methods we want to decorate and decide later whether we want a script to generate them. Here is the list I was thinking about:

  • create_repo, delete_repo
  • create_commit, create_commits_on_pr
  • upload_file, upload_folder, delete_file, delete_folder
  • create_branch, delete_branch, create_tag, delete_tag
  • add_space_secret, delete_space_secret, request_space_hardware, set_space_sleep_time, pause_space, restart_space, duplicate_space

(IMO, it's not so much needed for read-only methods like model_info, list_models, list_repo_files, list_repo_commits, ...)

@Wauplin (Contributor, Author) commented May 16, 2023

TL;DR:

  • I want auto-completion to work nicely, meaning type annotations where possible
  • Agree having duplicate _threaded methods is a bad idea
  • I found a solution using the @overload decorator to do the type annotations "correctly" (not perfect, but it's enough)
  • I'll see how I'll implement it but most probably manually and for a subset of methods (only the ones that "write" something)

And thanks again for the feedback!

@LysandreJik (Member)

This looks like a good solution to me!

@Wauplin Wauplin changed the title Run HfApi methods in the background (upload_file_threaded) Run HfApi methods in the background (run_as_future) May 17, 2023
@Wauplin (Contributor, Author) commented May 17, 2023

@patrickvonplaten @LysandreJik

Finally I implemented run_as_future: bool = False as a flag argument for create_commit, upload_file and upload_folder. Type annotations/IDE autocompletion should be fine now :)

I did not want to implement it for all methods, as it's quite verbose and doesn't make sense in many cases. If a power user really wants to run other methods in the background, they can use HfApi.run_as_future(...), which does exactly the same (while being less practical).

Example:

>>> from huggingface_hub import HfApi
>>> api = HfApi()
>>> api.run_as_future(api.create_repo, "username/my-model", exists_ok=True)
Future(...)
>>> api.upload_file(
...     repo_id="username/my-model",
...     path_in_repo="file.txt",
...     path_or_fileobj=b"file content",
...     run_as_future=True,
... )
Future(...)

I have updated the PR description, added some tests and added a section in the docs. I think the PR is finally ready for review (🤞) :)

@patrickvonplaten (Contributor)

I like it!

@LysandreJik (Member) left a comment:

Looks great! Love the API, and the implementation looks correct.

Thanks for iterating!

Review comment on:

@validate_hf_hub_args
@future_compatible

@LysandreJik (Member): Love how simple this becomes!

@Wauplin Wauplin merged commit 87ac63a into main May 17, 2023
@Wauplin Wauplin deleted the 939-run-hfapi-in-background branch May 17, 2023 16:33
@Wauplin (Contributor, Author) commented May 17, 2023

Woohoo! Finally merging it :) Thanks for the reviews and feedback ❤️

5 participants