This repository has been archived by the owner on Apr 6, 2023. It is now read-only.

Add docs #13

Closed
wants to merge 13 commits
17 changes: 17 additions & 0 deletions .github/workflows/build_documentation.yml
@@ -0,0 +1,17 @@
name: Build documentation

on:
push:
branches:
- main
- doc-builder*
- v*-release

jobs:
build:
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
with:
commit_sha: ${{ github.sha }}
package: hffs
secrets:
token: ${{ secrets.HUGGINGFACE_PUSH }}
Comment on lines +16 to +17
Contributor Author

@LysandreJik Can you help me set up this secret?

16 changes: 16 additions & 0 deletions .github/workflows/build_pr_documentation.yml
@@ -0,0 +1,16 @@
name: Build PR Documentation

on:
pull_request:

concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true

jobs:
build:
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}
package: hffs
13 changes: 13 additions & 0 deletions .github/workflows/delete_doc_comment.yml
@@ -0,0 +1,13 @@
name: Delete dev documentation

on:
pull_request:
types: [ closed ]


jobs:
delete:
uses: huggingface/doc-builder/.github/workflows/delete_doc_comment.yml@main
with:
pr_number: ${{ github.event.number }}
package: hffs
16 changes: 16 additions & 0 deletions .github/workflows/self-assign.yml
@@ -0,0 +1,16 @@
name: Self-assign
on:
issue_comment:
types: created
jobs:
one:
runs-on: ubuntu-latest
if: >-
(github.event.comment.body == '#take' ||
github.event.comment.body == '#self-assign')
&& !github.event.issue.assignee
steps:
- run: |
echo "Assigning issue ${{ github.event.issue.number }} to ${{ github.event.comment.user.login }}"
curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" -d '{"assignees": ["${{ github.event.comment.user.login }}"]}' https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/assignees
curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" -X "DELETE" https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/labels/help%20wanted
89 changes: 5 additions & 84 deletions README.md
@@ -1,5 +1,10 @@
# `hffs`

<a href="https://github.com/huggingface/hffs/actions/workflows/ci.yml?query=branch%3Amain"><img alt="Build" src="https://github.com/huggingface/hffs/actions/workflows/ci.yml/badge.svg?branch=main"></a>
<a href="https://github.com/huggingface/hffs/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/huggingface/hffs.svg"></a>
<a href="https://github.com/huggingface/hffs"><img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/hffs.svg"></a>
<a href="https://huggingface.co/docs/hffs/index"><img alt="Documentation" src="https://img.shields.io/website/http/huggingface.co/docs/hffs/index.svg?down_color=red&down_message=offline&up_message=online&label=doc"></a>

`hffs` builds on [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) and [`fsspec`](https://github.com/fsspec/filesystem_spec) to provide a convenient Python filesystem interface to 🤗 Hub.

## Basic usage
@@ -56,87 +61,3 @@ The prefix for datasets is "datasets/", the prefix for spaces is "spaces/" and m
```bash
pip install hffs
```

## Usage examples

* [`pandas`](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-writing-remote-files)/[`dask`](https://docs.dask.org/en/stable/how-to/connect-to-remote-data.html)

```python
>>> import pandas as pd

>>> # Read a remote CSV file into a dataframe
>>> df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")

>>> # Write a dataframe to a remote CSV file
>>> df.to_csv("hf://datasets/my-username/my-dataset-repo/test.csv")
```

* [`datasets`](https://huggingface.co/docs/datasets/filesystems#load-and-save-your-datasets-using-your-cloud-storage-filesystem)

```python
>>> import datasets

>>> # Export a (large) dataset to a repo
>>> output_dir = "hf://datasets/my-username/my-dataset-repo"
>>> builder = datasets.load_dataset_builder("path/to/local/loading_script/loading_script.py")
>>> builder.download_and_prepare(output_dir, file_format="parquet")

>>> # Stream the dataset from the repo
>>> dset = datasets.load_dataset("my-username/my-dataset-repo", split="train", streaming=True)
>>> # Process the examples
>>> for ex in dset:
... ...
```

* [`zarr`](https://zarr.readthedocs.io/en/stable/tutorial.html#io-with-fsspec)

```python
>>> import numpy as np
>>> import zarr

>>> embeddings = np.random.randn(50000, 1000).astype("float32")

>>> # Write an array to a repo acting as a remote zarr store
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="w") as root:
... foo = root.create_group("embeddings")
... foobar = foo.zeros('experiment_0', shape=(50000, 1000), chunks=(10000, 1000), dtype='f4')
... foobar[:] = embeddings

>>> # Read from a remote zarr store
>>> with zarr.open_group("hf://my-username/my-model-repo/array-store", mode="r") as root:
... first_row = root["embeddings/experiment_0"][0]
```

* [`duckdb`](https://duckdb.org/docs/guides/python/filesystems)

```python
>>> import hffs
>>> import duckdb

>>> fs = hffs.HfFileSystem()
>>> duckdb.register_filesystem(fs)
>>> # Query a remote file and get the result as a dataframe
>>> df = duckdb.query("SELECT * FROM 'hf://datasets/my-username/my-dataset-repo/data.parquet' LIMIT 10").df()
```

## Authentication

To write to your repositories or access your private repositories, you can log in by running

```bash
huggingface-cli login
```

Or pass a token (from your [HF settings](https://huggingface.co/settings/tokens)) to

```python
>>> import hffs
>>> fs = hffs.HfFileSystem(token=token)
```

or as `storage_options`:

```python
>>> storage_options = {"token": token}
>>> df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv", storage_options=storage_options)
```
88 changes: 88 additions & 0 deletions docs/README.md
@@ -0,0 +1,88 @@
<!---
Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Generating the documentation

To generate the documentation, you need to install our special tool that builds it:

```bash
pip install git+https://github.com/huggingface/doc-builder
```

---
**NOTE**

You only need to generate the documentation to inspect it locally (if you're planning changes and want to
check how they look before committing for instance). You don't have to commit the built documentation.

---

## Building the documentation

Once you have set up `doc-builder` and the additional packages, you can generate the documentation by
running the following command:

```bash
doc-builder build hffs docs/source --build_dir ~/tmp/test-build
```

You can adapt the `--build_dir` to set any temporary folder that you prefer. This command will create it and generate
the MDX files that will be rendered as the documentation on the main website. You can inspect them in your favorite
Markdown editor.

## Previewing the documentation

To preview the docs, first install the `watchdog` module with:

```bash
pip install watchdog
```

Then run the following command:

```bash
doc-builder preview {package_name} {path_to_docs}
```

For example:

```bash
doc-builder preview hffs docs/source/
```

The docs will be viewable at [http://localhost:3000](http://localhost:3000). You can also preview the docs once you have opened a PR: a bot will add a comment with a link to the documentation built from your changes.

---
**NOTE**

The `preview` command only works with existing doc files. When you add a completely new file, you need to update `_toctree.yml` and restart the `preview` command (`Ctrl+C` to stop it, then call `doc-builder preview ...` again).

---

## Adding a new element to the navigation bar

Accepted files are Markdown (.md or .mdx).

Create a file with its extension and put it in the source directory. You can then link it to the toc-tree by putting
the filename without the extension in the [`_toctree.yml`](https://github.com/huggingface/hffs/blob/main/docs/source/_toctree.yml) file.

## Adding an image

Because the repository grows quickly, it is important not to add files that would significantly weigh it down. This includes images, videos, and other non-text files. We prefer to place such files in a `dataset` hosted on hf.co, like
the ones on [`hf-internal-testing`](https://huggingface.co/hf-internal-testing), and reference
them by URL. We recommend putting them in the following dataset: [huggingface/documentation-images](https://huggingface.co/datasets/huggingface/documentation-images).
If you are making an external contribution, feel free to add the images to your PR and ask a Hugging Face member to migrate them
to this dataset.
6 changes: 6 additions & 0 deletions docs/source/_toctree.yml
@@ -0,0 +1,6 @@
- title: Get Started
sections:
- local: index
title: 🤗 Filesystem
- local: integration_zoo
title: Integration Zoo
87 changes: 87 additions & 0 deletions docs/source/index.mdx
@@ -0,0 +1,87 @@
# Filesystem

🤗 Filesystem (`hffs`) is a package that provides a pythonic [fsspec-compatible](https://filesystem-spec.readthedocs.io/en/latest/) file interface to the [Hugging Face Hub](https://huggingface.co/). It builds on top of the [Hugging Face Hub client library](https://huggingface.co/docs/huggingface_hub/index) to read and write files and inspect repositories on the Hub.

## Installation

```bash
pip install hffs
```

## Usage

`HfFileSystem` is the library's main class. It holds connection information and enables typical filesystem-style operations such as `cp`, `mv`, `ls`, `du`, `glob`, `get_file`, and `put_file`.

```python
>>> from hffs import HfFileSystem
>>> fs = HfFileSystem()

>>> # List files in a directory
>>> fs.ls("datasets/my-username/my-dataset-repo/data", detail=False)
['datasets/my-username/my-dataset-repo/data/train.csv', 'datasets/my-username/my-dataset-repo/data/test.csv']

>>> # List all ".csv" files in a repo
>>> fs.glob("datasets/my-username/my-dataset-repo/**.csv")
['datasets/my-username/my-dataset-repo/data/train.csv', 'datasets/my-username/my-dataset-repo/data/test.csv']

>>> # Read the contents of a remote file
>>> with fs.open("datasets/my-username/my-dataset-repo/data/train.csv", "r") as f:
... train_data = f.readlines()

>>> # Read all the contents of a remote file at once as a string
>>> train_data = fs.read_text("datasets/my-username/my-dataset-repo/data/train.csv")

>>> # Write a remote file
>>> with fs.open("datasets/my-username/my-dataset-repo/data/validation.csv", "w") as f:
... f.write("text,label")
... f.write("Fantastic movie!,good")
```

The prefix for datasets is "datasets/", the prefix for spaces is "spaces/" and models don't need a prefix in the URL.
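
For example, here is how the same call addresses each repo type (the repository names are placeholders):

```python
>>> # Dataset repos take the "datasets/" prefix
>>> fs.ls("datasets/my-username/my-dataset-repo", detail=False)

>>> # Space repos take the "spaces/" prefix
>>> fs.ls("spaces/my-username/my-space-repo", detail=False)

>>> # Model repos need no prefix
>>> fs.ls("my-username/my-model-repo", detail=False)
```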

The optional `revision` argument can be passed to open a filesystem from a specific commit (any revision such as a branch or a tag name or a commit hash).
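
For example, a minimal sketch, assuming `revision` is accepted by the `HfFileSystem` constructor:

```python
>>> # Pin the filesystem to a branch, tag, or commit hash
>>> fs = HfFileSystem(revision="main")
>>> fs.ls("datasets/my-username/my-dataset-repo", detail=False)
```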

Unlike Python's built-in `open`, `fsspec`'s `open` defaults to binary mode, `"rb"`. This means you must explicitly set the mode to `"r"` to read and `"w"` to write in text mode.
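
For example:

```python
>>> # Default mode is "rb": the file object yields bytes
>>> with fs.open("datasets/my-username/my-dataset-repo/data/train.csv") as f:
...     header = f.readline()  # bytes

>>> # Pass "r" explicitly to read the same file as text
>>> with fs.open("datasets/my-username/my-dataset-repo/data/train.csv", "r") as f:
...     header = f.readline()  # str
```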

## Integration
Member

Maybe it'd make more sense to move the Integration section before the Usage section? It might be good for the user to check if they can use a URL with an integration before they start using the filesystem operations.

Contributor Author

This order comes from the s3fs docs, so I think I'll leave it as-is.


🤗 Filesystem can be used with any library that integrates `fsspec`, and the URL has the following structure:

```
hf://[<repo_type_prefix>]<repo_id>/<path/in/repo>
```
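
For example, with `pandas` (which accepts fsspec URLs), a CSV file in a dataset repo can be read directly — the repository name here is a placeholder:

```python
>>> import pandas as pd

>>> # "datasets/" is the <repo_type_prefix>; model repos would omit it
>>> df = pd.read_csv("hf://datasets/my-username/my-dataset-repo/train.csv")
```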

Most integrations also allow you to pass optional parameters, such as `revision`, to the filesystem's initializer as `storage_options`, a dictionary mapping parameter names to their values:

```python
>>> storage_options = {"revision": "main"}
```
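
These `storage_options` are forwarded to `HfFileSystem` by the calling library. For instance, continuing the `pandas` sketch above:

```python
>>> df = pd.read_csv(
...     "hf://datasets/my-username/my-dataset-repo/train.csv",
...     storage_options={"revision": "main"},
... )
```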

## Authentication

In many cases, you must be logged in with a Hugging Face account to interact with the Hub:

```bash
huggingface-cli login
```

Refer to the [Login](https://huggingface.co/docs/huggingface_hub/quick-start#login) section of the Hugging Face Hub client library documentation to learn more about authentication methods on the Hub.

It is also possible to log in programmatically by passing your `token` as an argument to `HfFileSystem`:

```python
>>> import hffs
>>> fs = hffs.HfFileSystem(token=token)
```

If you log in this way, be careful not to accidentally leak the token when sharing your source code!

## API Reference

As 🤗 Filesystem is based on [fsspec](https://filesystem-spec.readthedocs.io/en/latest/), it is compatible with most of the APIs that fsspec offers. For more details, check out fsspec's [API Reference](https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem).
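
For example, generic `fsspec` methods that are not shown above should also work — a quick sketch with the placeholder repo used earlier:

```python
>>> # Recursively list every file in a repo
>>> fs.find("datasets/my-username/my-dataset-repo")

>>> # Total size of the repo's files, in bytes
>>> fs.du("datasets/my-username/my-dataset-repo")
```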


Read the [Integration Zoo](integration_zoo) guide to learn more about libraries that integrate with `fsspec`, allowing convenient access to the Hub through 🤗 Filesystem.

If you have questions about 🤗 Filesystem, feel free to join and ask the community on our [forum](https://discuss.huggingface.co/).
