Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Progress bars #261

Merged
merged 4 commits into from
Aug 10, 2021
Merged

Progress bars #261

merged 4 commits into from
Aug 10, 2021

Conversation

LysandreJik
Copy link
Member

This PR adds progress bars to the Repository utility. The progress bars are visible for large files (>10MB) and any other file tracked with git-lfs.

It adds a context manager to be used internally as a wrapper for subprocess calls. This uses the GIT_LFS_PROGRESS environment variable to output the progress to a file, tails the file and parses each line. It makes use of temporary directories in order to be used in parallel across the system. Usage is the following:

with lfs_log_progress():
    subprocess.run(f"git lfs clone {repo_url} .".split())

It works both in notebooks and scripts. Example with a clone:

Cloning in a script

from huggingface_hub import Repository

repository = Repository("distilbert", clone_from="distilbert-base-uncased")
Cloning https://huggingface.co/distilbert-base-uncased into local empty directory.
Download file tf_model.h5: 100%|████████████████████████████████████████| 347M/347M [06:59<00:00, 866kB/s]
Download file pytorch_model.bin: 100%|██████████████████████████████████| 256M/256M [06:57<00:00, 641kB/s]
Download file rust_model.ot: 100%|██████████████████████████████████████| 345M/345M [06:57<00:00, 866kB/s]
Clean file pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████| 256M/256M [01:20<00:00, 3.34MB/s]
Clean file rust_model.ot: 100%|█████████████████████████████████████████████████████████████████| 345M/345M [00:04<00:00, 88.9MB/s]
Clean file tf_model.h5: 100%|████████████████████████████████████████████████████████████████████| 347M/347M [00:02<00:00, 179MB/s]
Clean file pytorch_model.bin: 100%|█████████████████████████████████████████████████████████████| 256M/256M [01:20<00:00, 2.88MB/s]
Clean file rust_model.ot: 100%|██████████████████████████████████████████████████████████████████| 345M/345M [00:04<00:00, 173MB/s]
Clean file tf_model.h5: 100%|████████████████████████████████████████████████████████████████████| 347M/347M [00:02<00:00, 184MB/s]
Process finished with exit code 0

Cloning in a notebook

image

It is visible for the clone, pull and push git commands.

This is the first part of a logging overhaul; the second part will be to have a logging manager similar to transformers' and datasets'

)
with lfs_log_progress():
subprocess.run(
f"git lfs clone {repo_url} .".split(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing the switch from git clone to git lfs clone. They're similar in terms of speed, but the smudge happens at different times.

Using git clone means that the process will not log any progress at first, before downloading all files - it will not log any "clean" operation. During at least half of the time, no progress will be shown to the user.

Using git lfs clone, the process will download the files before applying the clean filter. Both of these steps will be logged to the file, which can be shown to the user for feedback.

@@ -546,6 +644,8 @@ def auto_track_large_files(self, pattern="."):
# Cleanup the .gitattributes if files were deleted
self.lfs_untrack(deleted_files)

return files_to_be_staged
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning the files to be staged to the user so as to warn them that the add operation might take a bit of time.

@LysandreJik LysandreJik requested a review from osanseviero August 6, 2021 07:36
Copy link
Contributor

@osanseviero osanseviero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks for it!

I took a couple of passes but this looks good to me.

Copy link
Contributor

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking nice! This is going to be a great addition :-)

Comment on lines +658 to +663
subprocess.run(
args,
check=True,
encoding="utf-8",
cwd=self.local_dir,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really can't fit in one line (at least all the args on a new line?). Not used to 80 chars max.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm contemplating switching the default to be the same as transformers' on a daily basis :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You definitely should!

Comment on lines +115 to +116
@contextmanager
def lfs_log_progress():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice API!

current_lfs_progress_value = os.environ.get("GIT_LFS_PROGRESS", "")

with tempfile.TemporaryDirectory() as tmpdir:
os.environ["GIT_LFS_PROGRESS"] = tmpdir + "/lfs_progress"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the "/" work on any OS (Windows)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct! I'll update to os.path.join. Thank you!

@LysandreJik
Copy link
Member Author

Tested on Windows, it works! Working on a PR enabling Windows tests.

@LysandreJik LysandreJik merged commit f5a35a3 into main Aug 10, 2021
@LysandreJik LysandreJik deleted the progress-bars branch August 10, 2021 11:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Try wrapping git-lfs to improve its ergonomics for large files
3 participants