diff --git a/docs/hub/adding-a-model.md b/docs/hub/adding-a-model.md
index e9b126a58f..dc533294ef 100644
--- a/docs/hub/adding-a-model.md
+++ b/docs/hub/adding-a-model.md
@@ -10,7 +10,7 @@ Uploading models to the Hugging Face Hub has many [benefits](https://huggingface

 ## Accounts and organizations

-The first step is to create an account at [Hugging Face](https://huggingface.co/login). The models are shared in the form of git-based repositories. You have control over your repository, so you can have checkpoints, configs and any files you might want to upload.
+The first step is to create an account at [Hugging Face](https://huggingface.co/login). The models are shared in the form of Git-based repositories. You have control over your repository, so you can have checkpoints, configs and any files you might want to upload.

 The repository can be either linked with an individual, such as [osanseviero/fashion_brands_patterns](https://huggingface.co/osanseviero/fashion_brands_patterns) or with an organization, such as [facebook/bart-large-xsum](https://huggingface.co/facebook/bart-large-xsum).

 Organizations can be used if you want to upload models that are related to a company, community or library! If you choose an organization, the model will be featured on the organization’s page and every member of the organization will have the ability to contribute to the repository. You can create a new organization [here](https://huggingface.co/organizations/new).

@@ -55,6 +55,10 @@ There is only one key difference if you have large files (over 10MB). These file

 2. Run `git lfs install` to initialize **git-lfs**:

+Do you have files larger than 10MB? Those files are tracked with `git-lfs`. We already provide a list of common file extensions for these files in `.gitattributes`, but you might need to add new extensions if they are not already handled. You can do so with `git lfs track "*.your_extension"`.
+
+Once ready, just run:
+
 ```
 git lfs install
 ```
@@ -74,7 +78,7 @@ Now's the time 🔥. You can add any files you want to the repository.

 5. Commit and push your files

-You can do this with the usual Git workflow
+You can do this with the usual Git workflow:

 ```
 git add .
diff --git a/src/huggingface_hub/README.md b/src/huggingface_hub/README.md
index 4fdf3ea8bc..09a25d6346 100644
--- a/src/huggingface_hub/README.md
+++ b/src/huggingface_hub/README.md
@@ -1,70 +1,110 @@
-# Hugging Face Client library
+# Hugging Face Hub Client library

-## Download files from the huggingface.co hub
+## Download files from the Hub

-Integration inside a library is super simple. We expose two functions, `hf_hub_url()` and `cached_download()`.
+Three utility functions are provided to download files from the Hub. One
+advantage of using them is that files are cached locally, so you won't have to
+download the files multiple times. If there are changes in the repository, the
+files will be automatically downloaded again.

 ### `hf_hub_url`

-`hf_hub_url()` takes:
-- a repo id (e.g. a model id like `julien-c/EsperBERTo-small` i.e. a user or organization name and a repo name, separated by `/`),
-- a filename (like `pytorch_model.bin`),
-- and an optional git revision id (can be a branch name, a tag, or a commit hash)
+`hf_hub_url()` returns the URL we'll use to download the actual files:
+`https://huggingface.co/julien-c/EsperBERTo-small/resolve/main/pytorch_model.bin`

-and returns the url we'll use to download the actual files: `https://huggingface.co/julien-c/EsperBERTo-small/resolve/main/pytorch_model.bin`
-
-If you check out this URL's headers with a `HEAD` http request (which you can do from the command line with `curl -I`) for a few different files, you'll see that:
+Parameters:
+- a repo id (e.g. a model id like `julien-c/EsperBERTo-small` i.e. a user or
+  organization name and a repo name, separated by `/`)
+- a filename (like `pytorch_model.bin`)
+- an optional Git revision id (can be a branch name, a tag, or a commit hash)
+
+If you check out this URL's headers with a `HEAD` HTTP request (which you can do
+from the command line with `curl -I`) for a few different files, you'll see
+that:

 - small files are returned directly
-- large files (i.e. the ones stored through [git-lfs](https://git-lfs.github.com/)) are returned via a redirect to a Cloudfront URL. Cloudfront is a Content Delivery Network, or CDN, that ensures that downloads are as fast as possible from anywhere on the globe.
+- large files (i.e. the ones stored through
+  [git-lfs](https://git-lfs.github.com/)) are returned via a redirect to a
+  Cloudfront URL. Cloudfront is a Content Delivery Network, or CDN, that ensures
+  that downloads are as fast as possible from anywhere on the globe.

 ### `cached_download`

-`cached_download()` takes the following parameters, downloads the remote file, stores it to disk (in a versioning-aware way) and returns its local file path.
+`cached_download()` takes the following parameters, downloads the remote file,
+stores it to disk (in a versioning-aware way) and returns its local file path.

 Parameters:
 - a remote `url`
-- your library's name and version (`library_name` and `library_version`), which will be added to the HTTP requests' user-agent so that we can provide some usage stats.
-- a `cache_dir` which you can specify if you want to control where on disk the files are cached.
+- a `cache_dir` which you can specify if you want to control where on disk the
+  files are cached.

-Check out the source code for all possible params (we'll create a real doc page in the future).
+A common use case is to download a file from its download URL:

-### Bonus: `snapshot_download`
+```python
+from huggingface_hub import hf_hub_url, cached_download
+config_file_url = hf_hub_url("lysandre/arxiv-nlp", filename="config.json")
+cached_download(config_file_url)
+```

-`snapshot_download()` downloads all the files from the remote repository at the specified revision,
-stores it to disk (in a versioning-aware way) and returns its local file path.
+Check out the source code for all possible params (we'll create a real doc page
+in the future).
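+
+One advantage of the cache is that repeated calls are cheap: calling
+`cached_download()` twice on the same URL returns the same local path, and the
+file is only fetched again if it changed on the Hub. A minimal sketch (the repo
+id and filename are just illustrative):
+
+```python
+from huggingface_hub import hf_hub_url, cached_download
+
+# Build the download URL for a file hosted on the Hub.
+config_file_url = hf_hub_url("lysandre/arxiv-nlp", filename="config.json")
+
+# First call downloads the file and stores it in the local cache.
+first_path = cached_download(config_file_url)
+
+# Second call hits the cache and returns the same local path; the file is
+# re-downloaded only if the remote copy changed.
+second_path = cached_download(config_file_url)
+assert first_path == second_path
+```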
+
+### `hf_hub_download`
+
+Since the use case of combining `hf_hub_url()` and `cached_download()` is very
+common, we also provide a wrapper that calls both functions.

 Parameters:
-- a `repo_id` in the format `namespace/repository`
-- a `revision` on which the repository will be downloaded
-- a `cache_dir` which you can specify if you want to control where on disk the files are cached.
+- a repo id (e.g. a model id like `julien-c/EsperBERTo-small` i.e. a user or
+  organization name and a repo name, separated by `/`)
+- a filename (like `pytorch_model.bin`)
+- an optional Git revision id (can be a branch name, a tag, or a commit hash)
+- a `cache_dir` which you can specify if you want to control where on disk the
+  files are cached

+```python
+from huggingface_hub import hf_hub_download
+hf_hub_download("lysandre/arxiv-nlp", filename="config.json")
+```

-## Publish models to the huggingface.co hub
+### `snapshot_download`

-Uploading a model to the hub is super simple too:
-- create a model repo directly from the website, at huggingface.co/new (models can be public or private, and are namespaced under either a user or an organization)
-- clone it with git
-- [download and install git lfs](https://git-lfs.github.com/) if you don't already have it on your machine (you can check by running a simple `git lfs`)
-- add, commit and push your files, from git, as you usually do (from the CLI, or through the `Repository` wrapper class detailed below).
+Using `hf_hub_download()` works well when you have a fixed repository structure;
+for example a model file alongside a configuration file, both with static names.
+There are cases in which you will prefer to download all the files of the remote
+repository at a specified revision. That's what `snapshot_download()` does. It
+downloads a remote repository, stores it to disk (in a versioning-aware way) and
+returns the path of the local folder.

-**We are intentionally not wrapping git too much, so that you can go on with the workflow you’re used to and the tools you already know.**
+Parameters:
+- a `repo_id` in the format `namespace/repository`
+- a `revision` on which the repository will be downloaded
+- a `cache_dir` which you can specify if you want to control where on disk the
+  files are cached
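+
+A minimal sketch (the repo id is just illustrative):
+
+```python
+from huggingface_hub import snapshot_download
+
+# Download every file of the repository at its latest revision and
+# return the path of the local folder holding the snapshot.
+local_folder = snapshot_download("lysandre/arxiv-nlp")
+
+# A branch name, tag, or commit hash can be pinned explicitly:
+local_folder = snapshot_download("lysandre/arxiv-nlp", revision="main")
+```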

-> 👀 To see an example of how we document the model sharing process in `transformers`, check out https://huggingface.co/transformers/model_sharing.html

-Users add tags into their README.md model cards (e.g. your `library_name`, a domain tag like `audio`, etc.) to make sure their models are discoverable.

-**Documentation about the model hub itself is at https://huggingface.co/docs**
+## Publish files to the Hub

-### API utilities in `hf_api.py`
-You don't need them for the standard publishing workflow, however, if you need a programmatic way of creating a repo, deleting it (`⚠️ caution`), pushing a single file to a repo or listing models from the hub, you'll find helpers in `hf_api.py`.
+If you've used Git before, this will be very easy since Git is used to manage
+files in the Hub. You can find a step-by-step guide on how to upload your model
+to the Hub: https://huggingface.co/docs/hub/adding-a-model.

-We also have an API to query models by specific tags (e.g. if you want to list models compatible to your library)
+### API utilities in `hf_api.py`

-### `huggingface-cli`
+You don't need them for the standard publishing workflow. However, if you need a
+programmatic way of creating a repo, deleting it (`⚠️ caution`), pushing a
+single file to a repo or listing models from the Hub, you'll find helpers in
+`hf_api.py`. Some examples:

+* `login()`
+* `whoami()`
+* `create_repo()`
+* `delete_repo()`
+* `update_repo_visibility()`
+* `upload_file()`

-Those API utilities are also exposed through a CLI:
+Those API utilities are also exposed through the `huggingface-cli` CLI:

 ```bash
 huggingface-cli login
@@ -73,64 +113,74 @@ huggingface-cli whoami
 huggingface-cli repo create
 ```

-### Need to upload large (>5GB) files?
-
-To upload large files (>5GB 🔥), you need to install the custom transfer agent for git-lfs, bundled in this package.
-
-To install, just run:
-
-```bash
-$ huggingface-cli lfs-enable-largefiles
-```
-
-This should be executed once for each model repo that contains a model file >5GB. If you just try to push a file bigger than 5GB without running that command, you will get an error with a message reminding you to run it.
-
-Finally, there's a `huggingface-cli lfs-multipart-upload` command but that one is internal (called by lfs directly) and is not meant to be called by the user.
+We also have an API to query models and datasets by specific tags (e.g. if you
+want to list models compatible with your library). Look at `list_models()`,
+`model_info()`, `list_datasets()`, and `dataset_info()`.
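+
+For instance, a minimal sketch of the programmatic workflow (the repo id and
+file names are illustrative, and argument names may differ between versions, so
+check `hf_api.py` for the authoritative signatures):
+
+```python
+from huggingface_hub import HfApi, HfFolder
+
+api = HfApi()
+token = HfFolder.get_token()  # token stored by `huggingface-cli login`
+
+# Check which account the token belongs to.
+print(api.whoami(token))
+
+# Push a single file to an existing repo.
+api.upload_file(
+    token=token,
+    path_or_fileobj="./config.json",
+    path_in_repo="config.json",
+    repo_id="my-username/my-model",
+)
+
+# List the models currently available on the Hub.
+models = api.list_models()
+```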
+### Advanced programmatic repository management

-## Managing a repository with `Repository`
-
-The `Repository` class helps manage both offline git repositories, and huggingface hub repositories. Using the
-`Repository` class requires `git` and `git-lfs` to be installed.
-
-Instantiate a `Repository` object by calling it with a path to a local git clone/repository:
+The `Repository` class helps manage both offline Git repositories and Hugging
+Face Hub repositories. Using the `Repository` class requires `git` and `git-lfs`
+to be installed.
+
+Instantiate a `Repository` object by calling it with a path to a local Git
+clone/repository:

 ```python
 >>> from huggingface_hub import Repository
 >>> repo = Repository("<path>/<to>/<folder>")
 ```

-The `Repository` takes a `clone_from` string as parameter. This can stay as `None` for offline management, but can
-also be set to any URL pointing to a git repo to clone that repository in the specified directory:
+The `Repository` takes a `clone_from` string as parameter. This can stay as
+`None` for offline management, but can also be set to any URL pointing to a Git
+repo to clone that repository in the specified directory:

 ```python
 >>> repo = Repository("huggingface-hub", clone_from="https://github.com/huggingface/huggingface_hub")
 ```

-The `clone_from` method can also take any Hugging Face model ID as input, and will clone that repository:
+The `clone_from` parameter can also take any Hugging Face model ID as input, and
+will clone that repository:

 ```python
 >>> repo = Repository("w2v2", clone_from="facebook/wav2vec2-large-960h-lv60")
 ```

-If the repository you're cloning is one of yours or one of your organisation's, then having the ability
-to commit and push to that repository is important. In order to do that, you should make sure to be logged-in
-using `huggingface-cli login`, and to have the `use_auth_token` parameter set to `True` (the default) when
-instantiating the `Repository` object:
+If the repository you're cloning is one of yours or one of your organization's,
+then having the ability to commit and push to that repository is important. In
+order to do that, you should make sure to be logged in using `huggingface-cli
+login`, and to have the `use_auth_token` parameter set to `True` (the default)
+when instantiating the `Repository` object:

 ```python
 >>> repo = Repository("my-model", clone_from="<user>/<model_id>", use_auth_token=True)
 ```

-This works for models, datasets and spaces repositories; but you will need to explicitely specify the type for the last two options:
+This works for models, datasets and spaces repositories, but you will need to
+explicitly specify the type for the last two options:

 ```python
 >>> repo = Repository("my-dataset", clone_from="<user>/<dataset_id>", use_auth_token=True, repo_type="dataset")
 ```

-Finally, you can choose to specify the git username and email attributed to that clone directly by using
-the `git_user` and `git_email` parameters. When committing to that repository, git will therefore be aware
-of who you are and who will be the author of the commits:
+You can also switch between branches:
+
+```python
+>>> repo = Repository("huggingface-hub", clone_from="<user>/<model_id>", revision="branch1")
+>>> repo.git_checkout("branch2")
+```
+
+Finally, you can choose to specify the Git username and email attributed to that
+clone directly by using the `git_user` and `git_email` parameters. When
+committing to that repository, Git will therefore be aware of who you are and
+who will be the author of the commits:

 ```python
 >>> repo = Repository(
@@ -143,29 +193,33 @@ of who you are and who will be the author of the commits:
 ... )
 ```

-The repository can be managed through this object, through wrappers of traditional git methods:
+The repository can be managed through this object, through wrappers of
+traditional Git methods (see the sketch after the list):

 - `git_add(pattern: str, auto_lfs_track: bool)`. The `auto_lfs_track` flag
-  triggers auto tracking of large files (>10MB) with `git-lfs`.
-- `git_commit(commit_message: str)`.
-- `git_pull(rebase: bool)`.
-- `git_push()`.
+  triggers auto tracking of large files (>10MB) with `git-lfs`
+- `git_commit(commit_message: str)`
+- `git_pull(rebase: bool)`
+- `git_push()`
+- `git_checkout(branch)`
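+
+Putting the wrappers together, a typical edit-and-push loop might look like this
+(a minimal sketch; the file name and commit message are illustrative):
+
+```python
+>>> repo = Repository("my-model", clone_from="<user>/<model_id>", use_auth_token=True)
+>>> # ... add or edit files inside the local clone, e.g. my-model/config.json ...
+>>> repo.git_add(".", auto_lfs_track=True)  # stage everything, LFS-tracking files >10MB
+>>> repo.git_commit("Update config")
+>>> repo.git_push()
+```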

 LFS-tracking methods:
-- `lfs_track(pattern: Union[str, List[str]], filename: bool)`.
-  Setting `filename` to `True` will use the `--filename` parameter, which will consider the pattern(s) as
-  filenames, even if they contain special glob characters.
+- `lfs_track(pattern: Union[str, List[str]], filename: bool)`. Setting
+  `filename` to `True` will use the `--filename` parameter, which will consider
+  the pattern(s) as filenames, even if they contain special glob characters.
 - `lfs_untrack()`.
-- `auto_track_large_files()`: automatically tracks files that are larger than 10MB. Make sure to call this
-  after adding files to the index.
-
+- `auto_track_large_files()`: automatically tracks files that are larger than
+  10MB. Make sure to call this after adding files to the index.
+
 On top of these unitary methods lie some useful additional methods:

-- `push_to_hub(commit_message)`: consecutively does `git_add`, `git_commit` and `git_push`.
-- `commit(commit_message: str, track_large_files: bool)`: this is a context manager utility that handles
-  committing to a repository. This automatically tracks large files (>10Mb) with git-lfs. The `track_large_files`
-  argument can be set to `False` if you wish to ignore that behavior.
+- `push_to_hub(commit_message)`: consecutively does `git_add`, `git_commit` and
+  `git_push`.
+- `commit(commit_message: str, track_large_files: bool)`: this is a context
+  manager utility that handles committing to a repository. This automatically
+  tracks large files (>10MB) with `git-lfs`. The `track_large_files` argument can
+  be set to `False` if you wish to ignore that behavior.

 Examples using the `commit` context manager:

@@ -174,6 +228,7 @@ Examples using the `commit` context manager:
 ... with open("file.txt", "w+") as f:
 ...     f.write(json.dumps({"hey": 8}))
 ```
+
 ```python
 >>> import torch
 >>> model = torch.nn.Transformer()
@@ -181,18 +236,47 @@ Examples using the `commit` context manager:
 ...     torch.save(model.state_dict(), "model.pt")
 ```

+
+### Need to upload very large (>5GB) files?
+
+To upload large files (>5GB 🔥), you need to install the custom transfer agent
+for git-lfs, bundled in this package.
+
+To install, just run:
+
+```bash
+$ huggingface-cli lfs-enable-largefiles
+```
+
+This should be executed once for each model repo that contains a model file
+>5GB. If you just try to push a file bigger than 5GB without running that
+command, you will get an error with a message reminding you to run it.
+
+Finally, there's a `huggingface-cli lfs-multipart-upload` command but that one
+is internal (called by lfs directly) and is not meant to be called by the user.
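+
+To tie the publishing pieces together, here is a sketch of an end-to-end upload:
+create the repo with the `hf_api.py` helpers, clone it with `Repository`, then
+push with `push_to_hub()`. The names are illustrative and argument order may
+differ between versions, so check `hf_api.py` before relying on it:
+
+```python
+>>> from huggingface_hub import HfApi, HfFolder, Repository
+>>> token = HfFolder.get_token()  # stored by `huggingface-cli login`
+>>> HfApi().create_repo(token, name="my-new-model")
+>>> repo = Repository("my-new-model", clone_from="<user>/my-new-model", use_auth_token=True)
+>>> # ... save weights and config files into the my-new-model folder ...
+>>> repo.push_to_hub("Initial commit")  # git_add + git_commit + git_push
+```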

 ## Using the Inference API wrapper

-`huggingface_hub` comes with a wrapper client to make calls to the Inference API! You can find some examples below, but we encourage you to visit the Inference API [documentation](https://api-inference.huggingface.co/docs/python/html/detailed_parameters.html) to review the specific parameters for the different tasks.
+`huggingface_hub` comes with a wrapper client to make calls to the Inference
+API! You can find some examples below, but we encourage you to visit the
+Inference API
+[documentation](https://api-inference.huggingface.co/docs/python/html/detailed_parameters.html)
+to review the specific parameters for the different tasks.

-When you instantiate the wrapper to the Inference API, you specify the model repository id. The pipeline (`text-classification`, `text-to-speech`, etc) is automatically extracted from the [repository](https://huggingface.co/docs/hub/main#how-is-a-models-type-of-inference-api-and-widget-determined), but you can also override it as shown below.
+When you instantiate the wrapper to the Inference API, you specify the model
+repository id. The pipeline (`text-classification`, `text-to-speech`, etc.) is
+automatically extracted from the
+[repository](https://huggingface.co/docs/hub/main#how-is-a-models-type-of-inference-api-and-widget-determined),
+but you can also override it as shown below.

 ### Examples

-Here is a basic example of calling the Inference API for a `fill-mask` task using the `bert-base-uncased` model. The `fill-mask` task only expects a string (or list of strings) as input.
+Here is a basic example of calling the Inference API for a `fill-mask` task
+using the `bert-base-uncased` model. The `fill-mask` task only expects a string
+(or list of strings) as input.

 ```python
 from huggingface_hub.inference_api import InferenceApi
@@ -201,7 +285,8 @@ inference(inputs="The goal of life is [MASK].")
 >> [{'sequence': 'the goal of life is life.', 'score': 0.10933292657136917, 'token': 2166, 'token_str': 'life'}]
 ```

-This is an example of a task (`question-answering`) which requires a dictionary as input thas has the `question` and `context` keys.
+This is an example of a task (`question-answering`) which requires a dictionary
+as input that has the `question` and `context` keys.

 ```python
 inference = InferenceApi("deepset/roberta-base-squad2", token=API_TOKEN)
@@ -210,7 +295,8 @@ inference(inputs)
 >> {'score': 0.9326569437980652, 'start': 11, 'end': 16, 'answer': 'Clara'}
 ```

-Some tasks might also require additional params in the request. Here is an example using a `zero-shot-classification` model.
+Some tasks might also require additional params in the request. Here is an
+example using a `zero-shot-classification` model.

 ```python
 inference = InferenceApi("typeform/distilbert-base-uncased-mnli", token=API_TOKEN)
@@ -220,10 +306,11 @@ inference(inputs, params)
 >> {'sequence': 'Hi, I recently bought a device from your company but it is not working as advertised and I would like to get reimbursed!', 'labels': ['refund', 'faq', 'legal'], 'scores': [0.9378499388694763, 0.04914155602455139, 0.013008488342165947]}
 ```

-Finally, there are some models that might support multiple tasks. For example, `sentence-transformers` models can do `sentence-similarity` and `feature-extraction`. You can override the configured task when initializing the API.
+Finally, there are some models that might support multiple tasks. For example,
+`sentence-transformers` models can do `sentence-similarity` and
+`feature-extraction`. You can override the configured task when initializing
+the API.

 ```python
 inference = InferenceApi("bert-base-uncased", task="feature-extraction", token=API_TOKEN)
-```
-
-## Feedback (feature requests, bugs, etc.) is super welcome 💙💚💛💜♥️🧡
+```
\ No newline at end of file