
Run HfApi methods in the background (run_as_future) #1458

Merged
merged 19 commits into main from 939-run-hfapi-in-background on May 17, 2023

Conversation

@Wauplin (Contributor) commented Apr 28, 2023

Related to #939 (cc @LysandreJik @osanseviero @julien-c)

EDIT: PR completely refactored (so is this description)
EDIT 2: PR completely refactored (again)


This PR makes it possible to run HfApi methods in the background. I'm using a ThreadPoolExecutor with only 1 worker, which means we preserve the order in which jobs are submitted. The returned value is a Future with methods like .done(), .result() and .exception() already implemented, so we don't have to support weird stuff in huggingface_hub.
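The single-worker executor behavior described above can be sketched with nothing but the standard library (the helper name run_in_background is hypothetical, not huggingface_hub code):

```python
from concurrent.futures import Future, ThreadPoolExecutor

# A single worker guarantees jobs run one at a time, in submission order.
_executor = ThreadPoolExecutor(max_workers=1)

def run_in_background(fn, *args, **kwargs) -> Future:
    """Submit fn to the shared single-worker pool; return a standard Future."""
    return _executor.submit(fn, *args, **kwargs)

order = []
f1 = run_in_background(order.append, "first")
f2 = run_in_background(order.append, "second")
f2.result()  # waiting on the second job implies the first already ran
assert order == ["first", "second"]
```

Because the pool has a single worker, the caller gets ordering guarantees for free, and a failing job surfaces its exception only when .result() or .exception() is called, so it cannot crash the main thread.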

The goal of this PR is:

  • to use HfApi in training scripts instead of the git-based methods. Similar to Repository.push(blocking=False) but no need to clone the repo.
  • to avoid interrupting the training process while uploading
  • to avoid crashing the training in case of a ConnectionError (or any error)
  • NOT to parallelize concurrent calls to the Hub => for more flexibility, users must implement their threading strategy themselves

UX-wise, we did several iterations. We came to the conclusion that having a run_as_future: bool = False argument in future-compatible methods was the way to go. It is not perfect, but from a developer perspective it should be the most practical.

>>> import time
>>> from huggingface_hub import HfApi

>>> api = HfApi()
>>> api.upload_file(...)  # takes Xs
# URL to upload file

>>> future = api.upload_file(..., run_as_future=True) # instant
>>> future.result() # wait until complete
# URL to upload file

Since the chosen solution is quite verbose in hf_api.py to get the correct type annotations/IDE autocompletion, only the most useful methods have a run_as_future: bool argument: upload_file, upload_folder and create_commit. For other methods, it is still possible to run them in the background using HfApi.run_as_future:

>>> from huggingface_hub import HfApi
>>> api = HfApi()
>>> api.run_as_future(api.create_repo, "username/my-model", exists_ok=True)
Future(...)
>>> api.upload_file(
...     repo_id="username/my-model",
...     path_in_repo="file.txt",
...     path_or_fileobj=b"file content",
...     run_as_future=True,
... )
Future(...)

TODO:

@sgugger (Contributor) commented Apr 28, 2023

API-wise, this makes sense to me. We'd need to adapt the internals of the Trainer once this is released if we want to migrate from Repository to this API, but we can do the same thing with this new API.

@Wauplin (Contributor, Author) commented Apr 28, 2023

Technical details:

To make the methods of HfApi "threaded", I override the __getattribute__ method of HfApi to update the behavior of the methods as they are accessed by the user at runtime. Instead of a normal method, I return a class with a __call__ method (sync) and a threaded one (async). It works fine in terms of usability and has the advantage of not touching the default behavior.

The counterpart is that static type checking is broken for the .threaded() method (meaning mypy complains and no auto-completion is possible). I don't think it's a huge problem as the main targets are training scripts that are already not typed. I didn't want to add a blocking: bool = True default parameter to each and every method as it would be annoying to maintain (+ type annotation of the returned value would be annoying as well).
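For illustration, here is a minimal, hypothetical sketch of that __getattribute__ trick (the toy Api class and _Callable wrapper are illustrative names, not the PR's actual code):

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=1)

class _Callable:
    """Wraps a bound method: __call__ runs it synchronously,
    .threaded() submits it to the shared background pool."""
    def __init__(self, fn):
        self._fn = fn

    def __call__(self, *args, **kwargs):
        return self._fn(*args, **kwargs)

    def threaded(self, *args, **kwargs) -> Future:
        return _pool.submit(self._fn, *args, **kwargs)

class Api:
    def whoami(self) -> str:
        return "Wauplin"

    def __getattribute__(self, name):
        attr = super().__getattribute__(name)
        # Wrap public methods so they gain a .threaded() variant.
        if callable(attr) and not name.startswith("_"):
            return _Callable(attr)
        return attr

api = Api()
api.whoami()                      # sync call, unchanged behavior
api.whoami.threaded().result()    # same result, via the background pool
```

This keeps the synchronous path untouched, but as noted above, static type checkers see a bound method rather than a _Callable, so .threaded() has no type information.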

Logic-wise the PR is complete if we go for this solution. I'll need to add tests/docstrings/documentation afterwards.

EDIT: PR completely refactored since this comment.

@patrickvonplaten (Contributor) commented May 3, 2023

I like the design! We pretty much only use upload_folder in the diffusers training scripts, so happy to change to upload_folder.threaded() logic

@HuggingFaceDocBuilderDev commented May 9, 2023

The documentation is not available anymore as the PR was closed or merged.

@Wauplin Wauplin changed the title [Draft] Run HfApi methods in the background Run HfApi methods in the background (upload_file_threaded) May 9, 2023
@Wauplin Wauplin marked this pull request as ready for review May 9, 2023 15:21
@Wauplin Wauplin requested review from LysandreJik and julien-c May 9, 2023 15:21
@codecov bot commented May 15, 2023

Codecov Report

Patch coverage: 60.65% and project coverage change: +29.04% 🎉

Comparison is base (10b1cb2) 53.08% compared to head (a40f4e2) 82.13%.

❗ Current head a40f4e2 differs from pull request most recent head 522fbc4. Consider uploading reports for the commit 522fbc4 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1458       +/-   ##
===========================================
+ Coverage   53.08%   82.13%   +29.04%     
===========================================
  Files          53       54        +1     
  Lines        5638     5759      +121     
===========================================
+ Hits         2993     4730     +1737     
+ Misses       2645     1029     -1616     
Impacted Files Coverage Δ
src/huggingface_hub/_multi_commits.py 36.73% <0.00%> (ø)
src/huggingface_hub/_threaded_hf_api.py 59.13% <59.13%> (ø)
src/huggingface_hub/hf_api.py 83.83% <100.00%> (+41.68%) ⬆️
src/huggingface_hub/utils/_validators.py 100.00% <100.00%> (+13.63%) ⬆️

... and 35 files with indirect coverage changes


@LysandreJik (Member)

Thank you for your PR!

I can't find the docs for HfApi on your PR. On main, the HfApi docs are populated as they should be, which is not the case here.

I wanted to check the docs specifically as I'm wondering if we need to 2x the API + docs of the HfApi to add a threaded version. I wonder if having a boolean argument threaded to each method (that could be added thanks to a @threaded decorator) wouldn't be simpler from a user's and a developer's perspective.

  • Having each thread-compatible method have its _threaded() counterpart results in an API that's a bit more bloated than it needs to be, especially as you'll need to add them to the root of HfApi if you want to mirror the methods that can be imported without an HfApi instance, for example with from huggingface_hub.hf_api import list_models. Not doing so means that the user will have to adjust their logic to handle instantiation when they didn't need to before.
  • This approach also needs a significant generation script plus a visible resulting private file directly in the repository.

WDYT?

@patrickvonplaten (Contributor)

I liked the API of:

>>> import time
>>> from huggingface_hub import whoami

>>> whoami()  # takes 1s
"Wauplin"
>>> future = whoami.detach() # instant
>>> time.sleep(1)
>>> future.result()
"Wauplin"

more than the current API, which has become pretty nested/hard to read: HfApi inherits from _ThreadedHFApi, which inherits from _HfApi. +1 to @LysandreJik that it's also not great if the API docs are fully duplicated.

Think I also like the idea of adding a boolean flag instead. BTW, why the name threaded? Would something like as_future be better? threaded to me sounds like the method is multi-threaded to be faster, but it really only creates a future, no?

Also just an idea, could a context manager make sense here maybe?

>>> import time
>>> from huggingface_hub import whoami

>>> whoami()  # takes 1s
"Wauplin"
>>> with huggingface_hub.as_future():
>>>    future = whoami() # instant
>>> time.sleep(1)
>>> future.result()
"Wauplin"

The problem here, though, is that not all methods can be run as a future, and it's also not like the future completes when the context manager exits, so maybe not super intuitive.

Overall to me the most natural design would indeed be a flag, like @LysandreJik proposed:

>>> import time
>>> from huggingface_hub import whoami

>>> whoami()  # takes 1s
"Wauplin"
>>> future = whoami(return_as_future=True) # instant
>>> time.sleep(1)
>>> future.result()
"Wauplin"

@Wauplin (Contributor, Author) commented May 16, 2023

Thanks a lot @LysandreJik @patrickvonplaten for your feedback. To be honest, I have been very indecisive about how to design the API. A large part of this indecisiveness is that I don't like having a return value that is not typed correctly. It might seem like nitpicking, but in my opinion type annotations really help UX-wise, especially to help IDEs with auto-completion. I'm not too concerned when dealing with internal stuff, but here we are talking about an exposed public API. That being said, I agree with both of you that duplicating all methods with a _threaded version is not satisfying. It had the advantage of using only classic Python features and being flexible, but yeah, let's drop the idea.

At first I suggested doing something like future = HfApi().upload_file.as_future(repo_id=..., path_in_repo=...) (or .detach(...)), but I realized it's not satisfying either. First, .as_future was an attribute of a method, which is already clunky. Second, since .as_future(...) was not typed, the user would have no autocompletion, no suggestion of args and no docs in the IDE when typing .as_future(.

I did not investigate the idea of a context manager but I think we would have the same kind of problem. When doing

>>> with huggingface_hub.as_future():
>>>    future = whoami() # instant
>>> value = future.result() # takes 1s

the IDE would not understand that "future" is a Future object (and not a dict).

In the end, I agree with having a flag as_future (btw thanks @patrickvonplaten for the naming!). I finally arrived at a satisfying solution using the @overload decorator. In the example below, if as_future=False is passed, it returns a string, and if as_future=True, it returns a Future[str]:

import threading
from concurrent.futures import Future
from typing import Literal, Union, overload

class Api:
    @overload
    def do_something(self, a: int, b: bool = ..., as_future: Literal[False] = ...) -> str:  # type: ignore
        ...

    @overload
    def do_something(self, a: int, b: bool = ..., as_future: Literal[True] = ...) -> Future[str]:
        ...

    @future_compatible  # decorator introduced in this PR; dispatches on as_future
    def do_something(self, a: int, b: bool = True, as_future: bool = False) -> Union[str, Future[str]]:
        return f"{a} {b} {threading.current_thread().name}"
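For context, a future_compatible decorator along these lines could dispatch on the flag roughly as follows (a hedged sketch under my own assumptions; the PR's actual implementation may differ):

```python
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps
from typing import Union

_pool = ThreadPoolExecutor(max_workers=1)  # shared single-worker pool

def future_compatible(fn):
    """If as_future=True is passed, submit fn to the background pool
    and return a Future; otherwise call it synchronously."""
    @wraps(fn)
    def wrapper(self, *args, as_future: bool = False, **kwargs):
        if as_future:
            return _pool.submit(fn, self, *args, **kwargs)
        return fn(self, *args, **kwargs)
    return wrapper

class Api:
    @future_compatible
    def do_something(self, a: int) -> Union[str, "Future[str]"]:
        return f"got {a}"

api = Api()
api.do_something(1)                  # "got 1" (sync)
api.do_something(2, as_future=True)  # Future; .result() -> "got 2"
```

The @overload stubs then exist purely for the type checker: at runtime only the decorated implementation is called.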

Unless you disagree, I think I'll go with this solution. It's unfortunately quite verbose to write, but UX-wise it should be nice for the user. I investigated a bit and the overload stubs could be placed in a separate .pyi file, but that has some drawbacks, so I don't think I'll do it for now. What I suggest is to write these manually for the methods we want to decorate and decide later whether we want a script to generate them. Here is the list I was thinking about:

  • create_repo, delete_repo
  • create_commit, create_commits_on_pr
  • upload_file, upload_folder, delete_file, delete_folder
  • create_branch, delete_branch, create_tag, delete_tag
  • add_space_secret, delete_space_secret, request_space_hardware, set_space_sleep_time, pause_space, restart_space, duplicate_space

(IMO, it's not so much needed for read-only methods like model_info, list_models, list_repo_files, list_repo_commits, ...)

@Wauplin (Contributor, Author) commented May 16, 2023

TL;DR:

  • I want auto-completion to work nicely, meaning type annotations where possible
  • Agree having duplicate _threaded methods is a bad idea
  • I found a solution using the @overload decorator to do the type annotations "correctly" (not perfect, but it's enough)
  • I'll see how I'll implement it but most probably manually and for a subset of methods (only the ones that "write" something)

And thanks again for the feedback!

@LysandreJik (Member)

This looks like a good solution to me!

@Wauplin Wauplin changed the title Run HfApi methods in the background (upload_file_threaded) Run HfApi methods in the background (run_as_future) May 17, 2023
@Wauplin (Contributor, Author) commented May 17, 2023

@patrickvonplaten @LysandreJik

Finally I implemented run_as_future: bool = False as a flag argument for create_commit, upload_file and upload_folder. Type annotations/IDE autocompletion should be fine now :)

I did not want to implement it for all methods, as it's quite verbose and doesn't make sense in many cases. If a power user really wants to run other methods in the background, they can use HfApi.run_as_future(...), which does exactly the same (while being less practical).

Example:

>>> from huggingface_hub import HfApi
>>> api = HfApi()
>>> api.run_as_future(api.create_repo, "username/my-model", exists_ok=True)
Future(...)
>>> api.upload_file(
...     repo_id="username/my-model",
...     path_in_repo="file.txt",
...     path_or_fileobj=b"file content",
...     run_as_future=True,
... )
Future(...)

I have updated the PR description, added some tests and added a section in the docs. I think the PR is finally ready for review (🤞) :)

@patrickvonplaten (Contributor)

I like it!

@LysandreJik (Member) left a comment:

Looks great! Love the API, and the implementation looks correct.

Thanks for iterating!

Review comment on:

@validate_hf_hub_args
@future_compatible

@LysandreJik (Member): Love how simple this becomes!

@Wauplin Wauplin merged commit 87ac63a into main May 17, 2023
@Wauplin Wauplin deleted the 939-run-hfapi-in-background branch May 17, 2023 16:33
@Wauplin (Contributor, Author) commented May 17, 2023

Woohoo! Finally merging it :) Thanks for the reviews and feedback ❤️

5 participants