DOC API docstring improvements (#731)
* DOC API docstring improvements

* improve snapshot_download's docstring

* improve repo_type doc

* apply suggestions

* add references section

* use julien's wording

* add types to docstrings

* make style
adrinjalali authored Mar 15, 2022
1 parent e6f374f commit df4c6d9
Showing 2 changed files with 260 additions and 68 deletions.
264 changes: 205 additions & 59 deletions src/huggingface_hub/file_download.py
@@ -112,20 +112,53 @@ def hf_hub_url(
repo_type: Optional[str] = None,
revision: Optional[str] = None,
) -> str:
"""
Resolve a model identifier, a file name, and an optional revision id, to a huggingface.co-hosted url, redirecting
to Cloudfront (a Content Delivery Network, or CDN) for large files (more than a few MBs).
Cloudfront is replicated over the globe so downloads are way faster for the end user (and it also lowers our
bandwidth costs).
"""Construct the URL of a file from the given information.
Cloudfront aggressively caches files by default (default TTL is 24 hours), however this is not an issue here
because we implement a git-based versioning system on huggingface.co, which means that we store the files on S3/Cloudfront
in a content-addressable way (i.e., the file name is its hash). Using content-addressable filenames means cache
can't ever be stale.
The resolved address can either be a huggingface.co-hosted url, or a link
to Cloudfront (a Content Delivery Network, or CDN) for large files which
are more than a few MBs.
In terms of client-side caching from this library, we base our caching on the objects' ETag. An object's ETag is:
its git-sha1 if stored in git, or its sha256 if stored in git-lfs.
Args:
repo_id (``str``):
A namespace (user or an organization) and a repo name separated by
a ``/``.
filename (``str``):
The name of the file in the repo.
subfolder (``str``, `optional`):
An optional value corresponding to a folder inside the repo.
repo_type (``str``, `optional`):
Set to :obj:`"dataset"` or :obj:`"space"` if the file is in a dataset
or space repository, :obj:`None` or :obj:`"model"` if it is in a model
repository. Default is :obj:`None`.
revision (``str``, `optional`):
An optional Git revision id which can be a branch name, a tag, or a
commit hash.
Example:
>>> from huggingface_hub import hf_hub_url
>>> hf_hub_url(
... repo_id="julien-c/EsperBERTo-small", filename="pytorch_model.bin"
... )
'https://huggingface.co/julien-c/EsperBERTo-small/resolve/main/pytorch_model.bin'
Notes:
Cloudfront is replicated over the globe so downloads are way faster for
the end user (and it also lowers our bandwidth costs).
Cloudfront aggressively caches files by default (default TTL is 24
hours), however this is not an issue here because we implement a
git-based versioning system on huggingface.co, which means that we
store the files on S3/Cloudfront in a content-addressable way (i.e.,
the file name is its hash). Using content-addressable filenames means
cache can't ever be stale.
In terms of client-side caching from this library, we base our caching
on the objects' entity tag (`ETag`), which is an identifier of a
specific version of a resource [1]_. An object's ETag is: its git-sha1
if stored in git, or its sha256 if stored in git-lfs.
References:
.. [1] https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/ETag
"""
if subfolder is not None:
filename = f"{subfolder}/{filename}"
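For illustration, a hedged usage sketch of the ``repo_type`` and ``revision`` parameters documented above; the dataset repo, file name, and tag below are placeholders, and the expected output is omitted since the exact resolved URL depends on the Hub's routing for dataset repos.

>>> from huggingface_hub import hf_hub_url
>>> # Hypothetical dataset repo, file, and tag, used only for illustration.
>>> hf_hub_url(
...     repo_id="some-user/some-dataset",
...     filename="data.csv",
...     repo_type="dataset",
...     revision="v1.0",
... )

Because the revision is pinned to a tag, the resolved address stays tied to that exact version of the file.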
@@ -144,11 +177,22 @@ def hf_hub_url(


def url_to_filename(url: str, etag: Optional[str] = None) -> str:
"""
Convert `url` into a hashed filename in a repeatable way. If `etag` is specified, append its hash to the url's,
delimited by a period. If the url ends with .h5 (Keras HDF5 weights) adds '.h5' to the name so that TF 2.0 can
identify it as a HDF5 file (see
"""Generate a local filename from a url.
Convert `url` into a hashed filename in a reproducible way. If `etag` is
specified, append its hash to the url's, delimited by a period. If the url
ends with .h5 (Keras HDF5 weights), append '.h5' to the name so that TF 2.0
can identify it as an HDF5 file (see
https://github.com/tensorflow/tensorflow/blob/00fad90125b18b80fe054de1055770cfb8fe4ba3/tensorflow/python/keras/engine/network.py#L1380)
Args:
url (``str``):
The address to the file.
etag (``str``, `optional`):
The ETag of the file.
Returns:
The generated filename.
"""
url_bytes = url.encode("utf-8")
filename = sha256(url_bytes).hexdigest()
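As a reading aid, a minimal sketch of the behavior the docstring describes; this is not the library source verbatim, and the helper name is made up for illustration.

from hashlib import sha256
from typing import Optional

def url_to_filename_sketch(url: str, etag: Optional[str] = None) -> str:
    # Hash the URL, append the hashed ETag if given, and keep a trailing
    # '.h5' so TF 2.0 can still recognize HDF5 weight files.
    name = sha256(url.encode("utf-8")).hexdigest()
    if etag:
        name += "." + sha256(etag.encode("utf-8")).hexdigest()
    if url.endswith(".h5"):
        name += ".h5"
    return name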
@@ -194,8 +238,18 @@ def http_user_agent(
library_version: Optional[str] = None,
user_agent: Union[Dict, str, None] = None,
) -> str:
"""
Formats a user-agent string with basic info about a request.
"""Formats a user-agent string with basic info about a request.
Args:
library_name (``str``, `optional`):
The name of the library to which the object corresponds.
library_version (``str``, `optional`):
The version of the library.
user_agent (``str``, ``dict``, `optional`):
The user agent info in the form of a dictionary or a single string.
Returns:
The formatted user-agent string.
"""
if library_name is not None:
ua = f"{library_name}/{library_version}"
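To make the formatting concrete, a hedged sketch of how such a user-agent string can be assembled. The function name is illustrative, and any extra fields the real helper appends (e.g. Python or requests versions) are omitted here.

from typing import Dict, Optional, Union

def format_user_agent_sketch(
    library_name: Optional[str] = None,
    library_version: Optional[str] = None,
    user_agent: Union[Dict, str, None] = None,
) -> str:
    # Start from "library/version" when a library name is given.
    ua = f"{library_name}/{library_version}" if library_name else "unknown/None"
    # Fold extra info in as "key/value" pairs or a raw string.
    if isinstance(user_agent, dict):
        ua += "; " + "; ".join(f"{k}/{v}" for k, v in user_agent.items())
    elif isinstance(user_agent, str):
        ua += "; " + user_agent
    return ua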
Expand Down Expand Up @@ -237,18 +291,30 @@ def _request_with_retry(
timeout: float = 10.0,
**params,
) -> requests.Response:
"""Wrapper around requests to retry in case it fails with a ConnectTimeout, with exponential backoff.
"""Wrapper around requests to retry in case it fails with a ``ConnectTimeout`` \
, with exponential backoff.
Note that if the environment variable HF_HUB_OFFLINE is set to 1, then a OfflineModeIsEnabled error is raised.
Note that if the environment variable HF_HUB_OFFLINE is set to 1, then an
``OfflineModeIsEnabled`` error is raised.
Args:
method (str): HTTP method, such as 'GET' or 'HEAD'
url (str): The URL of the resource to fetch
max_retries (int): Maximum number of retries, defaults to 0 (no retries)
base_wait_time (float): Duration (in seconds) to wait before retrying the first time. Wait time between
retries then grows exponentially, capped by max_wait_time.
max_wait_time (float): Maximum amount of time between two retries, in seconds
**params: Params to pass to `requests.request`
method (``str``):
HTTP method, such as 'GET' or 'HEAD'.
url (``str``):
The URL of the resource to fetch.
max_retries (``int``, `optional`, defaults to ``0``):
Maximum number of retries, defaults to 0 (no retries).
base_wait_time (``float``, `optional`, defaults to ``0.5``):
Duration (in seconds) to wait before retrying the first time. Wait
time between retries then grows exponentially, capped by
``max_wait_time``.
max_wait_time (``float``, `optional`, defaults to ``2``):
Maximum amount of time between two retries, in seconds.
timeout (``float``, `optional`, defaults to ``10``):
How many seconds to wait for the server to send data before giving
up, which is passed to ``requests.request``.
**params (``dict``, `optional`):
Params to pass to `requests.request`.
"""
_raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
tries, success = 0, False
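For readers who want the retry policy spelled out, a self-contained sketch of exponential backoff capped at ``max_wait_time``; this is an illustrative stand-in under the documented defaults, not the library function itself.

import time
import requests

def request_with_retry_sketch(method, url, max_retries=0, base_wait_time=0.5,
                              max_wait_time=2.0, timeout=10.0, **params):
    tries = 0
    while True:
        tries += 1
        try:
            return requests.request(method=method, url=url, timeout=timeout, **params)
        except requests.exceptions.ConnectTimeout:
            if tries > max_retries:
                raise
            # Exponential backoff: 0.5s, 1s, 2s, ... capped at max_wait_time.
            time.sleep(min(max_wait_time, base_wait_time * 2 ** (tries - 1)))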
@@ -321,23 +387,66 @@ def cached_download(
library_version: Optional[str] = None,
cache_dir: Union[str, Path, None] = None,
user_agent: Union[Dict, str, None] = None,
force_download=False,
force_download: Optional[bool] = False,
force_filename: Optional[str] = None,
proxies=None,
etag_timeout=10,
resume_download=False,
proxies: Optional[Dict] = None,
etag_timeout: Optional[float] = 10,
resume_download: Optional[bool] = False,
use_auth_token: Union[bool, str, None] = None,
local_files_only=False,
local_files_only: Optional[bool] = False,
) -> Optional[str]: # pragma: no cover
"""
Given a URL, look for the corresponding file in the local cache. If it's not there, download it. Then return the
path to the cached file.
"""Download from a given URL and cache it if it's not already present in \
the local cache.
Given a URL, this function looks for the corresponding file in the local
cache. If it's not there, it is downloaded. The path to the cached file is
then returned.
Args:
url (``str``):
The path to the file to be downloaded.
library_name (``str``, `optional`):
The name of the library to which the object corresponds.
library_version (``str``, `optional`):
The version of the library.
cache_dir (``str``, ``Path``, `optional`):
Path to the folder where cached files are stored.
user_agent (``dict``, ``str``, `optional`):
The user-agent info in the form of a dictionary or a string.
force_download (``bool``, `optional`, defaults to ``False``):
Whether the file should be downloaded even if it already exists in
the local cache.
force_filename (``str``, `optional`):
Use this name instead of a generated file name.
proxies (``dict``, `optional`):
Dictionary mapping protocol to the URL of the proxy passed to
``requests.request``.
etag_timeout (``float``, `optional`, defaults to ``10``):
When fetching the ETag, how many seconds to wait for the server to
send data before giving up, which is passed to ``requests.request``.
resume_download (``bool``, `optional`, defaults to ``False``):
If ``True``, resume a previously interrupted download.
use_auth_token (``bool``, ``str``, `optional`):
A token to be used for the download.
- If ``True``, the token is read from the HuggingFace config
folder.
- If a string, it's used as the authentication token.
local_files_only (``bool``, `optional`, defaults to ``False``):
If ``True``, avoid downloading the file and return the path to the
local cached file if it exists.
Return:
Local path (string) of file or if networking is off, last version of file cached on disk.
Local path (string) of file or if networking is off, last version of
file cached on disk.
Raises:
In case of non-recoverable file (non-existent or inaccessible url + no cache on disk).
- ``EnvironmentError`` if ``use_auth_token=True`` and the token cannot
be found.
- ``OSError`` if ETag cannot be determined.
- ``ValueError`` if the file cannot be downloaded and cannot be found
locally.
"""
if cache_dir is None:
cache_dir = HUGGINGFACE_HUB_CACHE
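A hedged usage example of the documented signature, reusing the repo and file from the ``hf_hub_url`` example above; network access and the file's continued availability are assumed, and the returned path is omitted.

>>> from huggingface_hub import cached_download, hf_hub_url
>>> url = hf_hub_url(
...     repo_id="julien-c/EsperBERTo-small", filename="pytorch_model.bin"
... )
>>> local_path = cached_download(url)  # downloads once, then hits the cache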
Expand Down Expand Up @@ -521,36 +630,73 @@ def hf_hub_download(
library_version: Optional[str] = None,
cache_dir: Union[str, Path, None] = None,
user_agent: Union[Dict, str, None] = None,
force_download=False,
force_download: Optional[bool] = False,
force_filename: Optional[str] = None,
proxies=None,
etag_timeout=10,
resume_download=False,
proxies: Optional[Dict] = None,
etag_timeout: Optional[float] = 10,
resume_download: Optional[bool] = False,
use_auth_token: Union[bool, str, None] = None,
local_files_only=False,
local_files_only: Optional[bool] = False,
):
"""
Resolve a model identifier, a file name, and an optional revision id, to a huggingface.co file distributed through
Cloudfront (a Content Delivery Network, or CDN) for large files (more than a few MBs).
"""Download a given file if it's not already present in the local cache.
The file is cached locally: look for the corresponding file in the local cache. If it's not there,
download it. Then return the path to the cached file.
Cloudfront is replicated over the globe so downloads are way faster for the end user.
Cloudfront aggressively caches files by default (default TTL is 24 hours), however this is not an issue here
because we implement a git-based versioning system on huggingface.co, which means that we store the files on S3/Cloudfront
in a content-addressable way (i.e., the file name is its hash). Using content-addressable filenames means cache
can't ever be stale.
In terms of client-side caching from this library, we base our caching on the objects' ETag. An object's ETag is:
its git-sha1 if stored in git, or its sha256 if stored in git-lfs.
Args:
repo_id (``str``):
A user or an organization name and a repo name separated by a
``/``.
filename (``str``):
The name of the file in the repo.
subfolder (``str``, `optional`):
An optional value corresponding to a folder inside the model repo.
repo_type (``str``, `optional`):
Set to :obj:`"dataset"` or :obj:`"space"` if downloading from a dataset
or space, :obj:`None` or :obj:`"model"` if downloading from a model.
Default is :obj:`None`.
revision (``str``, `optional`):
An optional Git revision id which can be a branch name, a tag, or a
commit hash.
library_name (``str``, `optional`):
The name of the library to which the object corresponds.
library_version (``str``, `optional`):
The version of the library.
cache_dir (``str``, ``Path``, `optional`):
Path to the folder where cached files are stored.
user_agent (``dict``, ``str``, `optional`):
The user-agent info in the form of a dictionary or a string.
force_download (``bool``, `optional`, defaults to ``False``):
Whether the file should be downloaded even if it already exists in
the local cache.
force_filename (``str``, `optional`):
Use this name instead of a generated file name.
proxies (``dict``, `optional`):
Dictionary mapping protocol to the URL of the proxy passed to
``requests.request``.
etag_timeout (``float``, `optional`, defaults to ``10``):
When fetching the ETag, how many seconds to wait for the server to
send data before giving up, which is passed to ``requests.request``.
resume_download (``bool``, `optional`, defaults to ``False``):
If ``True``, resume a previously interrupted download.
use_auth_token (``str``, ``bool``, `optional`):
A token to be used for the download.
- If ``True``, the token is read from the HuggingFace config
folder.
- If a string, it's used as the authentication token.
local_files_only (``bool``, `optional`, defaults to ``False``):
If ``True``, avoid downloading the file and return the path to the
local cached file if it exists.
Return:
Local path (string) of file or if networking is off, last version of file cached on disk.
Local path (string) of file or if networking is off, last version of
file cached on disk.
Raises:
In case of non-recoverable file (non-existent or inaccessible url + no cache on disk).
- ``EnvironmentError`` if ``use_auth_token=True`` and the token cannot
be found.
- ``OSError`` if ETag cannot be determined.
- ``ValueError`` if the file cannot be downloaded and cannot be found
locally.
"""
url = hf_hub_url(
repo_id, filename, subfolder=subfolder, repo_type=repo_type, revision=revision
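And a matching hedged example for ``hf_hub_download``, again with the repo and file from the earlier example (availability assumed, returned path omitted).

>>> from huggingface_hub import hf_hub_download
>>> local_path = hf_hub_download(
...     repo_id="julien-c/EsperBERTo-small", filename="pytorch_model.bin"
... )  # returns the local path of the cached file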