Merge branch 'main' into 1352-create-chunked-commits

huggingface · Apr 17, 2023 · e4f9a23 · e4f9a23
2 parents 167d320 + 25e20ef
commit e4f9a23
Showing 33 changed files with 2,153 additions and 58 deletions.
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
@@ -18,6 +18,8 @@
       title: Repository
     - local: guides/search
       title: Search
+    - local: guides/hf_file_system
+      title: HfFileSystem
     - local: guides/inference
       title: Inference
     - local: guides/community
@@ -30,6 +32,8 @@
       title: Manage your Space
     - local: guides/integrations
       title: Integrate a library
+    - local: guides/webhooks_server
+      title: Webhooks server
 - title: "Conceptual guides"
   sections:
     - local: concepts/git_vs_http
@@ -52,6 +56,8 @@
       title: Mixins & serialization methods
     - local: package_reference/inference_api
       title: Inference API
+    - local: package_reference/hf_file_system
+      title: HfFileSystem
     - local: package_reference/utilities
       title: Utilities
     - local: package_reference/community
@@ -62,3 +68,5 @@
       title: Repo Cards and Repo Card Data
     - local: package_reference/space_runtime
       title: Space runtime
+    - local: package_reference/webhooks_server
+      title: Webhooks server
diff --git a/docs/source/guides/hf_file_system.mdx b/docs/source/guides/hf_file_system.mdx
@@ -0,0 +1,105 @@
+# Interact with the Hub through the Filesystem API
+
+In addition to the [`HfApi`], the `huggingface_hub` library provides [`HfFileSystem`], a pythonic [fsspec-compatible](https://filesystem-spec.readthedocs.io/en/latest/) file interface to the Hugging Face Hub. The [`HfFileSystem`] builds on top of the [`HfApi`] and offers typical filesystem-style operations like `cp`, `mv`, `ls`, `du`, `glob`, `get_file`, and `put_file`.
+
+## Usage
+
+```python
+>>> from huggingface_hub import HfFileSystem
+>>> fs = HfFileSystem()
+
+>>> # List all files in a directory
+>>> fs.ls("datasets/my-username/my-dataset-repo/data", detail=False)
+['datasets/my-username/my-dataset-repo/data/train.csv', 'datasets/my-username/my-dataset-repo/data/test.csv']
+
+>>> # List all ".csv" files in a repo
+>>> fs.glob("datasets/my-username/my-dataset-repo/**.csv")
+['datasets/my-username/my-dataset-repo/data/train.csv', 'datasets/my-username/my-dataset-repo/data/test.csv']
+
+>>> # Read a remote file 
+>>> with fs.open("datasets/my-username/my-dataset-repo/data/train.csv", "r") as f:
+...     train_data = f.readlines()
+
+>>> # Read the content of a remote file as a string
+>>> train_data = fs.read_text("datasets/my-username/my-dataset-repo/data/train.csv", revision="dev")
+
+>>> # Write a remote file
+>>> with fs.open("datasets/my-username/my-dataset-repo/data/validation.csv", "w") as f:
+...     f.write("text,label\n")
+...     f.write("Fantastic movie!,good\n")
+```
+
+The optional `revision` argument can be passed to run an operation against a specific revision, such as a branch, a tag name, or a commit hash.
+
+Unlike Python's built-in `open`, `fsspec`'s `open` defaults to binary mode, `"rb"`. This means you must explicitly set the mode to `"r"` for reading and `"w"` for writing in text mode. Appending to a file (modes `"a"` and `"ab"`) is not supported yet.
+
+## Integrations
+
+The [`HfFileSystem`] can be used with any library that integrates `fsspec`, provided the URL follows the scheme:
+
+```
+hf://[<repo_type_prefix>]<repo_id>[@<revision>]/<path/in/repo>
+```
+
+The `repo_type_prefix` is `datasets/` for datasets, `spaces/` for spaces, and models don't need a prefix in the URL.
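
To make the URL scheme above concrete, here is a minimal sketch that splits an `hf://` URL into its parts. The helper `split_hf_url` and the `HfUrlParts` type are hypothetical names introduced for illustration; they are not part of `huggingface_hub`, which has its own path resolution logic. The sketch also assumes that the repo ID has the form `<namespace>/<name>` and that the revision contains no `/`.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping for illustration: URL prefix -> repo type.
# Models use no prefix, as described in the guide.
REPO_TYPE_PREFIXES = {"datasets/": "dataset", "spaces/": "space"}


@dataclass
class HfUrlParts:
    repo_type: str
    repo_id: str
    revision: Optional[str]
    path_in_repo: str


def split_hf_url(url: str) -> HfUrlParts:
    """Split hf://[<repo_type_prefix>]<repo_id>[@<revision>]/<path/in/repo>."""
    if not url.startswith("hf://"):
        raise ValueError(f"Not an hf:// URL: {url}")
    rest = url[len("hf://"):]

    # Detect the optional repo type prefix; default to a model repo.
    repo_type = "model"
    for prefix, kind in REPO_TYPE_PREFIXES.items():
        if rest.startswith(prefix):
            repo_type = kind
            rest = rest[len(prefix):]
            break

    # Assumption: the repo ID is "<namespace>/<name>", so the first two
    # segments form the repo ID and the remainder is the path in the repo.
    parts = rest.split("/", 2)
    if len(parts) < 2:
        raise ValueError(f"Missing repo ID in URL: {url}")
    namespace, name_and_revision = parts[0], parts[1]
    path_in_repo = parts[2] if len(parts) > 2 else ""

    # The optional revision follows "@" right after the repo name.
    name, sep, revision = name_and_revision.partition("@")
    return HfUrlParts(repo_type, f"{namespace}/{name}", revision if sep else None, path_in_repo)
```

For example, `split_hf_url("hf://datasets/my-username/my-dataset-repo@dev/data/train.csv")` yields a dataset repo `my-username/my-dataset-repo` at revision `dev` with path `data/train.csv`.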
+
+Some interesting integrations where [`HfFileSystem`] simplifies interacting with the Hub are listed below:
+
+* Reading/writing a [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-writing-remote-files) DataFrame from/to a Hub repository:
+
+  ```python
+  >>> import pandas as pd
+
+  >>> # Read a remote CSV file into a dataframe
+  >>> df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")
+
+  >>> # Write a dataframe to a remote CSV file
+  >>> df.to_csv("hf://datasets/my-username/my-dataset-repo/test.csv")
+  ```
+
+The same workflow can also be used for [Dask](https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html) and [Polars](https://pola-rs.github.io/polars/py-polars/html/reference/io.html) DataFrames.
+
+* Querying (remote) Hub files with [DuckDB](https://duckdb.org/docs/guides/python/filesystems):
+
+  ```python
+  >>> from huggingface_hub import HfFileSystem
+  >>> import duckdb
+
+  >>> fs = HfFileSystem()
+  >>> duckdb.register_filesystem(fs)
+  >>> # Query a remote file and get the result back as a dataframe
+  >>> fs_query_file = "hf://datasets/my-username/my-dataset-repo/data_dir/data.parquet"
+  >>> df = duckdb.query(f"SELECT * FROM '{fs_query_file}' LIMIT 10").df()
+  ```
+
+* Using the Hub as an array store with [Zarr](https://zarr.readthedocs.io/en/stable/tutorial.html#io-with-fsspec):
+
+  ```python
+  >>> import numpy as np
+  >>> import zarr
+
+  >>> embeddings = np.random.randn(50000, 1000).astype("float32")
+
+  >>> # Write an array to a repo
+  >>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="w") as root:
+  ...    foo = root.create_group("embeddings")
+  ...    foobar = foo.zeros('experiment_0', shape=(50000, 1000), chunks=(10000, 1000), dtype='f4')
+  ...    foobar[:] = embeddings
+
+  >>> # Read an array from a repo
+  >>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="r") as root:
+  ...    first_row = root["embeddings/experiment_0"][0]
+  ```
+
+## Authentication
+
+In many cases, you must be logged in with a Hugging Face account to interact with the Hub. Refer to the [Login](../quick-start#login) section of the documentation to learn more about authentication methods on the Hub. 
+
+It is also possible to log in programmatically by passing your `token` as an argument to [`HfFileSystem`]:
+
+```python
+>>> from huggingface_hub import HfFileSystem
+>>> fs = HfFileSystem(token=token)
+```
+
+If you log in this way, be careful not to accidentally leak the token when sharing your source code!
diff --git a/docs/source/guides/overview.mdx b/docs/source/guides/overview.mdx
@@ -42,6 +42,15 @@ Take a look at these guides to learn how to use huggingface_hub to solve real-wo
       </p>
     </a>
 
+    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
+       href="./hf_file_system">
+      <div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
+        HfFileSystem
+      </div><p class="text-gray-700">
+        How to interact with the Hub through a convenient interface that mimics Python's file interface?
+      </p>
+    </a>
+
     <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
        href="./inference">
       <div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
@@ -96,5 +105,14 @@ Take a look at these guides to learn how to use huggingface_hub to solve real-wo
       </p>
     </a>
 
+    <a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
+       href="./webhooks_server">
+      <div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
+        Webhooks server
+      </div><p class="text-gray-700">
+        How to create a server to receive Webhooks and deploy it as a Space?
+      </p>
+    </a>
+
   </div>
 </div>
diff --git a/docs/source/guides/webhooks_server.mdx b/docs/source/guides/webhooks_server.mdx
@@ -0,0 +1,195 @@
+# Webhooks Server
+
+Webhooks are a foundation for MLOps-related features. They allow you to listen for new changes on specific repos or to
+all repos belonging to particular users/organizations you're interested in following. This guide will explain how to
+leverage `huggingface_hub` to create a server listening to webhooks and deploy it to a Space. It assumes you are
+familiar with the concept of webhooks on the Hugging Face Hub. To learn more about webhooks themselves, you can read
+this [guide](https://huggingface.co/docs/hub/webhooks) first.
+
+The base class that we will use in this guide is [`WebhooksServer`]. It is a class for easily configuring a server that
+can receive webhooks from the Hugging Face Hub. The server is based on a [Gradio](https://gradio.app/) app. It has a UI
+to display instructions for you or your users and an API to listen to webhooks.
+
+<Tip>
+
+To see a running example of a webhook server, check out the [Spaces CI Bot](https://huggingface.co/spaces/spaces-ci-bot/webhook).
+It is a Space that launches ephemeral environments when a PR is opened on a Space.
+
+</Tip>
+
+<Tip warning={true}>
+
+This is an [experimental feature](../package_reference/environment_variables#hfhubdisableexperimentalwarning). This
+means that we are still working on improving the API. Breaking changes might be introduced in the future without prior
+notice. Make sure to pin the version of `huggingface_hub` in your requirements.
+
+</Tip>
+
+
+## Create an endpoint
+
+Implementing a webhook endpoint is as simple as decorating a function. Let's see a first example to explain the main
+concepts:
+
+```python
+# app.py
+from huggingface_hub import webhook_endpoint, WebhookPayload
+
+@webhook_endpoint
+async def trigger_training(payload: WebhookPayload) -> None:
+    if payload.repo.type == "dataset" and payload.event.action == "update":
+        # Trigger a training job if a dataset is updated
+        ...
+```
+
+Save this snippet in a file called `app.py` and run it with `python app.py`. You should see a message like this:
+
+```text
+Webhook secret is not defined. This means your webhook endpoints will be open to everyone.
+To add a secret, set `WEBHOOK_SECRET` as environment variable or pass it at initialization: 
+        `app = WebhooksServer(webhook_secret='my_secret', ...)`
+For more details about webhook secrets, please refer to https://huggingface.co/docs/hub/webhooks#webhook-secret.
+Running on local URL:  http://127.0.0.1:7860
+Running on public URL: https://1fadb0f52d8bf825fc.gradio.live
+
+This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
+
+Webhooks are correctly setup and ready to use:
+  - POST https://1fadb0f52d8bf825fc.gradio.live/webhooks/trigger_training
+Go to https://huggingface.co/settings/webhooks to setup your webhooks.
+```
+
+Good job! You just launched a webhook server! Let's break down what happened exactly:
+
+1. By decorating a function with [`webhook_endpoint`], a [`WebhooksServer`] object has been created in the background.
+As you can see, this server is a Gradio app running on http://127.0.0.1:7860. If you open this URL in your browser, you
+will see a landing page with instructions about the registered webhooks.
+2. A Gradio app is a FastAPI server under the hood. A new POST route `/webhooks/trigger_training` has been added to it.
+This is the route that will listen to webhooks and run the `trigger_training` function when triggered. FastAPI will
+automatically parse the payload and pass it to the function as a [`WebhookPayload`] object. This is a `pydantic` object
+that contains all the information about the event that triggered the webhook.
+3. The Gradio app also opened a tunnel to receive requests from the internet. This is the interesting part: you can
+configure a Webhook on https://huggingface.co/settings/webhooks pointing to your local machine. This is useful for
+debugging your webhook server and quickly iterating before deploying it to a Space.
+4. Finally, the logs also tell you that your server is currently not secured by a secret. This is not a problem for
+local debugging, but it is something to keep in mind for later.
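
The filtering condition inside `trigger_training` can be tried out without starting a server. The sketch below uses plain dataclasses as stand-ins for the payload; `FakeEvent`, `FakeRepo`, `FakePayload`, and `should_trigger_training` are hypothetical names introduced for illustration. The real [`WebhookPayload`] is a `pydantic` object with more fields, and only the two attributes used in the example above are mirrored here.

```python
from dataclasses import dataclass


# Stand-in types for illustration only. They mirror just the two attributes
# read by the endpoint in the guide: `payload.repo.type` and `payload.event.action`.
@dataclass
class FakeEvent:
    action: str


@dataclass
class FakeRepo:
    type: str


@dataclass
class FakePayload:
    event: FakeEvent
    repo: FakeRepo


def should_trigger_training(payload: FakePayload) -> bool:
    """Same condition as in `trigger_training`: react only to dataset updates."""
    return payload.repo.type == "dataset" and payload.event.action == "update"


dataset_update = FakePayload(event=FakeEvent(action="update"), repo=FakeRepo(type="dataset"))
model_update = FakePayload(event=FakeEvent(action="update"), repo=FakeRepo(type="model"))

print(should_trigger_training(dataset_update))  # a dataset update matches the condition
print(should_trigger_training(model_update))    # a model update does not
```

Keeping the condition in a small pure function like this makes it easy to unit test before wiring it into an endpoint.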
+
+<Tip warning={true}>
+
+By default, the server is started at the end of your script. If you are running it in a notebook, you can start the
+server manually by calling `decorated_function.run()`. Since a single server is shared by all endpoints, you only have
+to start it once even if you have multiple endpoints.
+
+</Tip>
+
+
+## Configure a Webhook
+
+Now that you have a webhook server running, you want to configure a Webhook to start receiving messages.
+Go to https://huggingface.co/settings/webhooks, click on "Add a new webhook" and configure your Webhook. Set the target
+repositories you want to watch and the Webhook URL, here `https://1fadb0f52d8bf825fc.gradio.live/webhooks/trigger_training`. 
+
+<div class="flex justify-center">
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/configure_webhook.png"/>
+</div>
+
+And that's it! You can now trigger that webhook by updating the target repository (e.g. push a commit). Check the
+Activity tab of your Webhook to see the events that have been triggered. Now that you have a working setup, you can
+test it and quickly iterate. If you modify your code and restart the server, your public URL might change. Make sure
+to update the webhook configuration on the Hub if needed.
+
+## Deploy to a Space
+
+Now that you have a working webhook server, the goal is to deploy it to a Space. Go to https://huggingface.co/new-space
+to create a Space. Give it a name, select the Gradio SDK and click on "Create Space". Upload your code to the Space
+in a file called `app.py`. Your Space will start automatically! For more details about Spaces, please refer to this
+[guide](https://huggingface.co/docs/hub/spaces-overview).
+
+Your webhook server is now running on a public Space. In most cases, you will want to secure it with a secret. Go to
+your Space settings > Section "Repository secrets" > "Add a secret". Set the `WEBHOOK_SECRET` environment variable to
+the value of your choice. Go back to the [Webhooks settings](https://huggingface.co/settings/webhooks) and set the
+secret in the webhook configuration. Now, only requests with the correct secret will be accepted by your server.
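
As a rough illustration of what "only requests with the correct secret will be accepted" means, the sketch below compares a secret received in a request header with the expected value. This is not the actual [`WebhooksServer`] implementation. The function `is_authorized` is a hypothetical helper, and the header name `X-Webhook-Secret` is an assumption taken from the Hub webhooks documentation rather than from this guide.

```python
import hmac
from typing import Mapping, Optional

# Assumption: the Hub sends the configured secret in this header
# (compared case-insensitively, as HTTP header names are).
SECRET_HEADER = "x-webhook-secret"


def is_authorized(headers: Mapping[str, str], expected_secret: Optional[str]) -> bool:
    """Return True if the request carries the expected webhook secret."""
    if expected_secret is None:
        # No secret configured: the endpoint is open to everyone,
        # which is what the startup warning in the logs describes.
        return True
    normalized = {key.lower(): value for key, value in headers.items()}
    received = normalized.get(SECRET_HEADER)
    if received is None:
        return False
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(received.encode(), expected_secret.encode())
```

A request with a missing or wrong secret would be rejected, while a server started without any secret accepts every request.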
+
+And that's it! Your Space is now ready to receive webhooks from the Hub. Please keep in mind that if you run the Space
+on free `cpu-basic` hardware, it will be shut down after 48 hours of inactivity. If you need a permanent Space, you
+should consider switching to [upgraded hardware](https://huggingface.co/docs/hub/spaces-gpus#hardware-specs).
+
+## Advanced usage
+
+The guide above explained the quickest way to set up a [`WebhooksServer`]. In this section, we will see how to customize
+it further.
+
+### Multiple endpoints
+
+You can register multiple endpoints on the same server. For example, you might want to have one endpoint to trigger
+a training job and another one to trigger a model evaluation. You can do this by adding multiple `@webhook_endpoint`
+decorators:
+
+```python
+# app.py
+from huggingface_hub import webhook_endpoint, WebhookPayload
+
+@webhook_endpoint
+async def trigger_training(payload: WebhookPayload) -> None:
+    if payload.repo.type == "dataset" and payload.event.action == "update":
+        # Trigger a training job if a dataset is updated
+        ...
+
+@webhook_endpoint
+async def trigger_evaluation(payload: WebhookPayload) -> None:
+    if payload.repo.type == "model" and payload.event.action == "update":
+        # Trigger an evaluation job if a model is updated
+        ...
+```
+
+This creates two endpoints:
+
+```text
+(...)
+Webhooks are correctly setup and ready to use:
+  - POST https://1fadb0f52d8bf825fc.gradio.live/webhooks/trigger_training
+  - POST https://1fadb0f52d8bf825fc.gradio.live/webhooks/trigger_evaluation
+```
+
+### Custom server
+
+To get more flexibility, you can also create a [`WebhooksServer`] object directly. This is useful if you want to
+customize the landing page of your server. You can do this by passing a [Gradio UI](https://gradio.app/docs/#blocks)
+that will overwrite the default one. For example, you can add instructions for your users or add a form to manually
+trigger the webhooks. When creating a [`WebhooksServer`], you can register new webhooks using the
+[`~WebhooksServer.add_webhook`] decorator.
+
+Here is a complete example:
+
+```python
+import gradio as gr
+from fastapi import Request
+from huggingface_hub import WebhooksServer, WebhookPayload
+
+# 1. Define UI
+with gr.Blocks() as ui:
+    ...
+
+# 2. Create WebhooksServer with custom UI and secret
+app = WebhooksServer(ui=ui, webhook_secret="my_secret_key")
+
+# 3. Register webhook with explicit name
+@app.add_webhook("/say_hello")
+async def hello(payload: WebhookPayload):
+    return {"message": "hello"}
+
+# 4. Register webhook with implicit name
+@app.add_webhook
+async def goodbye(payload: WebhookPayload):
+    return {"message": "goodbye"}
+
+# 5. Start server (optional)
+app.run()
+```
+
+1. We define a custom UI using Gradio blocks. This UI will be displayed on the landing page of the server.
+2. We create a [`WebhooksServer`] object with a custom UI and a secret. The secret is optional and can be set with
+the `WEBHOOK_SECRET` environment variable.
+3. We register a webhook with an explicit name. This will create an endpoint at `/webhooks/say_hello`.
+4. We register a webhook with an implicit name. This will create an endpoint at `/webhooks/goodbye`.
+5. We start the server. This is optional as your server will automatically be started at the end of the script.
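
The naming rule from steps 3 and 4 can be summarized in a few lines. The helper `webhook_route` below is hypothetical and only restates the behavior described above (an explicit name such as `"/say_hello"` or the function name, placed under `/webhooks/`); it is not how [`WebhooksServer`] is implemented.

```python
from typing import Callable, Union


def webhook_route(target: Union[str, Callable]) -> str:
    """Return the route for an explicit name or for a decorated function."""
    # Explicit name: use the string as given. Implicit name: use the function name.
    name = target if isinstance(target, str) else target.__name__
    # Strip surrounding slashes so "/say_hello" and "say_hello" give the same route.
    return "/webhooks/" + name.strip("/")


async def goodbye(payload):
    return {"message": "goodbye"}


print(webhook_route("/say_hello"))  # route for the explicit name
print(webhook_route(goodbye))       # route derived from the function name
```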
diff --git a/docs/source/package_reference/environment_variables.mdx b/docs/source/package_reference/environment_variables.mdx
@@ -115,6 +115,14 @@ to disable this warning.
 
 For more details, see [cache limitations](../guides/manage-cache#limitations).
 
+### HF_HUB_DISABLE_EXPERIMENTAL_WARNING
+
+Some features of `huggingface_hub` are experimental. This means you can use them but we do not guarantee they will be
+maintained in the future. In particular, we might update the API or behavior of such features without any deprecation
+cycle. A warning message is triggered when using an experimental feature to warn you about it. If you're comfortable debugging any potential issues using an experimental feature, you can set `HF_HUB_DISABLE_EXPERIMENTAL_WARNING=1` to disable the warning.
+
+If you are using an experimental feature, please let us know! Your feedback can help us design and improve it.
+
 ### HF_HUB_DISABLE_TELEMETRY
 
 By default, some data is collected by HF libraries (`transformers`, `datasets`, `gradio`,..) to monitor usage, debug issues and help prioritize features.

diff --git a/docs/source/package_reference/hf_file_system.mdx b/docs/source/package_reference/hf_file_system.mdx
@@ -0,0 +1,12 @@
+# Filesystem API
+
+The `HfFileSystem` class provides a pythonic file interface to the Hugging Face Hub based on [`fsspec`](https://filesystem-spec.readthedocs.io/en/latest/).
+
+## HfFileSystem
+
+`HfFileSystem` is based on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/), so it is compatible with most of the APIs that it offers. For more details, check out [our guide](../guides/hf_file_system) and fsspec's [API Reference](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem).
+
+[[autodoc]] HfFileSystem
+    - __init__
+    - resolve_path
+    - ls