Skip to content

Commit

Permalink
Merge branch 'main' into 1352-create-chunked-commits
Browse files Browse the repository at this point in the history
  • Loading branch information
Wauplin committed Apr 17, 2023
2 parents 167d320 + 25e20ef commit e4f9a23
Show file tree
Hide file tree
Showing 33 changed files with 2,153 additions and 58 deletions.
8 changes: 8 additions & 0 deletions docs/source/_toctree.yml
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@
title: Repository
- local: guides/search
title: Search
- local: guides/hf_file_system
title: HfFileSystem
- local: guides/inference
title: Inference
- local: guides/community
Expand All @@ -30,6 +32,8 @@
title: Manage your Space
- local: guides/integrations
title: Integrate a library
- local: guides/webhooks_server
title: Webhooks server
- title: "Conceptual guides"
sections:
- local: concepts/git_vs_http
Expand All @@ -52,6 +56,8 @@
title: Mixins & serialization methods
- local: package_reference/inference_api
title: Inference API
- local: package_reference/hf_file_system
title: HfFileSystem
- local: package_reference/utilities
title: Utilities
- local: package_reference/community
Expand All @@ -62,3 +68,5 @@
title: Repo Cards and Repo Card Data
- local: package_reference/space_runtime
title: Space runtime
- local: package_reference/webhooks_server
title: Webhooks server
105 changes: 105 additions & 0 deletions docs/source/guides/hf_file_system.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
# Interact with the Hub through the Filesystem API

In addition to the [`HfApi`], the `huggingface_hub` library provides [`HfFileSystem`], a pythonic [fsspec-compatible](https://filesystem-spec.readthedocs.io/en/latest/) file interface to the Hugging Face Hub. The [`HfFileSystem`] builds of top of the [`HfApi`] and offers typical filesystem style operations like `cp`, `mv`, `ls`, `du`, `glob`, `get_file`, and `put_file`.

## Usage

```python
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem()

>>> # List all files in a directory
>>> fs.ls("datasets/my-username/my-dataset-repo/data", detail=False)
['datasets/my-username/my-dataset-repo/data/train.csv', 'datasets/my-username/my-dataset-repo/data/test.csv']

>>> # List all ".csv" files in a repo
>>> fs.glob("datasets/my-username/my-dataset-repo/**.csv")
['datasets/my-username/my-dataset-repo/data/train.csv', 'datasets/my-username/my-dataset-repo/data/test.csv']

>>> # Read a remote file
>>> with fs.open("datasets/my-username/my-dataset-repo/data/train.csv", "r") as f:
... train_data = f.readlines()

>>> # Read the content of a remote file as a string
>>> train_data = fs.read_text("datasets/my-username/my-dataset-repo/data/train.csv", revision="dev")

>>> # Write a remote file
>>> with fs.open("datasets/my-username/my-dataset-repo/data/validation.csv", "w") as f:
... f.write("text,label")
... f.write("Fantastic movie!,good")
```

The optional `revision` argument can be passed to run an operation from a specific commit such as a branch, tag name, or a commit hash.

Unlike Python's built-in `open`, `fsspec`'s `open` defaults to binary mode, `"rb"`. This means you must explicitly set mode as `"r"` for reading and `"w"` for writing in text mode. Appending to a file (modes `"a"` and `"ab"`) is not supported yet.

## Integrations

The [`HfFileSystem`] can be used with any library that integrates `fsspec`, provided the URL follows the scheme:

```
hf://[<repo_type_prefix>]<repo_id>[@<revision>]/<path/in/repo>
```

The `repo_type_prefix` is `datasets/` for datasets, `spaces/` for spaces, and models don't need a prefix in the URL.

Some interesting integrations where [`HfFileSystem`] simplifies interacting with the Hub are listed below:

* Reading/writing a [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-writing-remote-files) DataFrame from/to a Hub repository:

```python
>>> import pandas as pd

>>> # Read a remote CSV file into a dataframe
>>> df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

>>> # Write a dataframe to a remote CSV file
>>> df.to_csv("hf://datasets/my-username/my-dataset-repo/test.csv")
```

The same workflow can also be used for [Dask](https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html) and [Polars](https://pola-rs.github.io/polars/py-polars/html/reference/io.html) DataFrames.

* Querying (remote) Hub files with [DuckDB](https://duckdb.org/docs/guides/python/filesystems):

```python
>>> from huggingface_hub import HfFileSystem
>>> import duckdb

>>> fs = HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> # Query a remote file and get the result back as a dataframe
>>> fs_query_file = "hf://datasets/my-username/my-dataset-repo/data_dir/data.parquet"
>>> df = duckdb.query(f"SELECT * FROM '{fs_query_file}' LIMIT 10").df()
```

* Using the Hub as an array store with [Zarr](https://zarr.readthedocs.io/en/stable/tutorial.html#io-with-fsspec):

```python
>>> import numpy as np
>>> import zarr

>>> embeddings = np.random.randn(50000, 1000).astype("float32")

>>> # Write an array to a repo
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="w") as root:
... foo = root.create_group("embeddings")
... foobar = foo.zeros('experiment_0', shape=(50000, 1000), chunks=(10000, 1000), dtype='f4')
... foobar[:] = embeddings

>>> # Read an array from a repo
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="r") as root:
... first_row = root["embeddings/experiment_0"][0]
```

## Authentication

In many cases, you must be logged in with a Hugging Face account to interact with the Hub. Refer to the [Login](../quick-start#login) section of the documentation to learn more about authentication methods on the Hub.

It is also possible to login programmatically by passing your `token` as an argument to [`HfFileSystem`]:

```python
>>> from huggingface_hub import HfFileSystem
>>> fs = HfFileSystem(token=token)
```

If you login this way, be careful not to accidentally leak the token when sharing your source code!
18 changes: 18 additions & 0 deletions docs/source/guides/overview.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,15 @@ Take a look at these guides to learn how to use huggingface_hub to solve real-wo
</p>
</a>

<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
href="./filesystem">
<div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
HfFileSystem
</div><p class="text-gray-700">
How to interact with the Hub through a convenient interface that mimics Python's file interface?
</p>
</a>

<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
href="./inference">
<div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
Expand Down Expand Up @@ -96,5 +105,14 @@ Take a look at these guides to learn how to use huggingface_hub to solve real-wo
</p>
</a>

<a class="!no-underline border dark:border-gray-700 p-5 rounded-lg shadow hover:shadow-lg"
href="./webhooks_server">
<div class="w-full text-center bg-gradient-to-br from-indigo-400 to-indigo-500 rounded-lg py-1.5 font-semibold mb-5 text-white text-lg leading-relaxed">
Webhooks server
</div><p class="text-gray-700">
How to create a server to receive Webhooks and deploy it as a Space?
</p>
</a>

</div>
</div>
195 changes: 195 additions & 0 deletions docs/source/guides/webhooks_server.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,195 @@
# Webhooks Server

Webhooks are a foundation for MLOps-related features. They allow you to listen for new changes on specific repos or to
all repos belonging to particular users/organizations you're interested in following. This guide will explain how to
leverage `huggingface_hub` to create a server listening to webhooks and deploy it to a Space. It assumes you are
familiar with the concept of webhooks on the Huggingface Hub. To learn more about webhooks themselves, you can read
this [guide](https://huggingface.co/docs/hub/webhooks) first.

The base class that we will use in this guide is [`WebhooksServer`]. It is a class for easily configuring a server that
can receive webhooks from the Huggingface Hub. The server is based on a [Gradio](https://gradio.app/) app. It has a UI
to display instructions for you or your users and an API to listen to webhooks.

<Tip>

To see a running example of a webhook server, check out the [Spaces CI Bot](https://huggingface.co/spaces/spaces-ci-bot/webhook)
one. It is a Space that launches ephemeral environments when a PR is opened on a Space.

</Tip>

<Tip warning={true}>

This is an [experimental feature](../package_reference/environment_variables#hfhubdisableexperimentalwarning). This
means that we are still working on improving the API. Breaking changes might be introduced in the future without prior
notice. Make sure to pin the version of `huggingface_hub` in your requirements.

</Tip>


## Create an endpoint

Implementing a webhook endpoint is as simple as decorating a function. Let's see a first example to explain the main
concepts:

```python
# app.py
from huggingface_hub import webhook_endpoint, WebhookPayload

@webhook_endpoint
async def trigger_training(payload: WebhookPayload) -> None:
if payload.repo.type == "dataset" and payload.event.action == "update":
# Trigger a training job if a dataset is updated
...
```

Save this snippet in a file called `'app.py'` and run it with `'python app.py'`. You should see a message like this:

```text
Webhook secret is not defined. This means your webhook endpoints will be open to everyone.
To add a secret, set `WEBHOOK_SECRET` as environment variable or pass it at initialization:
`app = WebhooksServer(webhook_secret='my_secret', ...)`
For more details about webhook secrets, please refer to https://huggingface.co/docs/hub/webhooks#webhook-secret.
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://1fadb0f52d8bf825fc.gradio.live
This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
Webhooks are correctly setup and ready to use:
- POST https://1fadb0f52d8bf825fc.gradio.live/webhooks/trigger_training
Go to https://huggingface.co/settings/webhooks to setup your webhooks.
```

Good job! You just launched a webhook server! Let's break down what happened exactly:

1. By decorating a function with [`webhook_endpoint`], a [`WebhooksServer`] object has been created in the background.
As you can see, this server is a Gradio app running on http://127.0.0.1:7860. If you open this URL in your browser, you
will see a landing page with instructions about the registered webhooks.
2. A Gradio app is a FastAPI server under the hood. A new POST route `/webhooks/trigger_training` has been added to it.
This is the route that will listen to webhooks and run the `trigger_training` function when triggered. FastAPI will
automatically parse the payload and pass it to the function as a [`WebhookPayload`] object. This is a `pydantic` object
that contains all the information about the event that triggered the webhook.
3. The Gradio app also opened a tunnel to receive requests from the internet. This is the interesting part: you can
configure a Webhook on https://huggingface.co/settings/webhooks pointing to your local machine. This is useful for
debugging your webhook server and quickly iterating before deploying it to a Space.
4. Finally, the logs also tell you that your server is currently not secured by a secret. This is not problematic for
local debugging but is to keep in mind for later.

<Tip warning={true}>

By default, the server is started at the end of your script. If you are running it in a notebook, you can start the
server manually by calling `decorated_function.run()`. Since a unique server is used, you only have to start the server
once even if you have multiple endpoints.

</Tip>


## Configure a Webhook

Now that you have a webhook server running, you want to configure a Webhook to start receiving messages.
Go to https://huggingface.co/settings/webhooks, click on "Add a new webhook" and configure your Webhook. Set the target
repositories you want to watch and the Webhook URL, here `https://1fadb0f52d8bf825fc.gradio.live/webhooks/trigger_training`.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/hub/configure_webhook.png"/>
</div>

And that's it! You can now trigger that webhook by updating the target repository (e.g. push a commit). Check the
Activity tab of your Webhook to see the events that have been triggered. Now that you have a working setup, you can
test it and quickly iterate. If you modify your code and restart the server, your public URL might change. Make sure
to update the webhook configuration on the Hub if needed.

## Deploy to a Space

Now that you have a working webhook server, the goal is to deploy it to a Space. Go to https://huggingface.co/new-space
to create a Space. Give it a name, select the Gradio SDK and click on "Create Space". Upload your code to the Space
in a file called `app.py`. Your Space will start automatically! For more details about Spaces, please refer to this
[guide](https://huggingface.co/docs/hub/spaces-overview).

Your webhook server is now running on a public Space. If most cases, you will want to secure it with a secret. Go to
your Space settings > Section "Repository secrets" > "Add a secret". Set the `WEBHOOK_SECRET` environment variable to
the value of your choice. Go back to the [Webhooks settings](https://huggingface.co/settings/webhooks) and set the
secret in the webhook configuration. Now, only requests with the correct secret will be accepted by your server.

And this is it! Your Space is now ready to receive webhooks from the Hub. Please keep in mind that if you run the Space
on a free 'cpu-basic' hardware, it will be shut down after 48 hours of inactivity. If you need a permanent Space, you
should consider setting to an [upgraded hardware](https://huggingface.co/docs/hub/spaces-gpus#hardware-specs).

## Advanced usage

The guide above explained the quickest way to setup a [`WebhooksServer`]. In this section, we will see how to customize
it further.

### Multiple endpoints

You can register multiple endpoints on the same server. For example, you might want to have one endpoint to trigger
a training job and another one to trigger a model evaluation. You can do this by adding multiple `@webhook_endpoint`
decorators:

```python
# app.py
from huggingface_hub import webhook_endpoint, WebhookPayload

@webhook_endpoint
async def trigger_training(payload: WebhookPayload) -> None:
if payload.repo.type == "dataset" and payload.event.action == "update":
# Trigger a training job if a dataset is updated
...

@webhook_endpoint
async def trigger_evaluation(payload: WebhookPayload) -> None:
if payload.repo.type == "model" and payload.event.action == "update":
# Trigger an evaluation job if a model is updated
...
```

Which will create two endpoints:

```text
(...)
Webhooks are correctly setup and ready to use:
- POST https://1fadb0f52d8bf825fc.gradio.live/webhooks/trigger_training
- POST https://1fadb0f52d8bf825fc.gradio.live/webhooks/trigger_evaluation
```

### Custom server

To get more flexibility, you can also create a [`WebhooksServer`] object directly. This is useful if you want to
customize the landing page of your server. You can do this by passing a [Gradio UI](https://gradio.app/docs/#blocks)
that will overwrite the default one. For example, you can add instructions for your users or add a form to manually
trigger the webhooks. When creating a [`WebhooksServer`], you can register new webhooks using the
[`~WebhooksServer.add_webhook`] decorator.

Here is a complete example:

```python
import gradio as gr
from fastapi import Request
from huggingface_hub import WebhooksServer, WebhookPayload

# 1. Define UI
with gr.Blocks() as ui:
...

# 2. Create WebhooksServer with custom UI and secret
app = WebhooksServer(ui=ui, webhook_secret="my_secret_key")

# 3. Register webhook with explicit name
@app.add_webhook("/say_hello")
async def hello(payload: WebhookPayload):
return {"message": "hello"}

# 4. Register webhook with implicit name
@app.add_webhook
async def goodbye(payload: WebhookPayload):
return {"message": "goodbye"}

# 5. Start server (optional)
app.run()
```

1. We define a custom UI using Gradio blocks. This UI will be displayed on the landing page of the server.
2. We create a [`WebhooksServer`] object with a custom UI and a secret. The secret is optional and can be set with
the `WEBHOOK_SECRET` environment variable.
3. We register a webhook with an explicit name. This will create an endpoint at `/webhooks/say_hello`.
4. We register a webhook with an implicit name. This will create an endpoint at `/webhooks/goodbye`.
5. We start the server. This is optional as your server will automatically be started at the end of the script.
8 changes: 8 additions & 0 deletions docs/source/package_reference/environment_variables.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,14 @@ to disable this warning.

For more details, see [cache limitations](../guides/manage-cache#limitations).

### HF_HUB_DISABLE_EXPERIMENTAL_WARNING

Some features of `huggingface_hub` are experimental. This means you can use them but we do not guarantee they will be
maintained in the future. In particular, we might update the API or behavior of such features without any deprecation
cycle. A warning message is triggered when using an experimental feature to warn you about it. If you're comfortable debugging any potential issues using an experimental feature, you can set `HF_HUB_DISABLE_EXPERIMENTAL_WARNING=1` to disable the warning.

If you are using an experimental feature, please let us know! Your feedback can help us design and improve it.

### HF_HUB_DISABLE_TELEMETRY

By default, some data is collected by HF libraries (`transformers`, `datasets`, `gradio`,..) to monitor usage, debug issues and help prioritize features.
Expand Down
12 changes: 12 additions & 0 deletions docs/source/package_reference/hf_file_system.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# Filesystem API

The `HfFileSystem` class provides a pythonic file interface to the Hugging Face Hub based on [`fssepc`](https://filesystem-spec.readthedocs.io/en/latest/).

## HfFileSystem

`HfFileSystem` is based on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/), so it is compatible with most of the APIs that it offers. For more details, check out [our guide](../guides/filesystem) and the fsspec's [API Reference](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem).

[[autodoc]] HfFileSystem
- __init__
- resolve_path
- ls
Loading

0 comments on commit e4f9a23

Please sign in to comment.