Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

huggingface-cli delete-cache --disable-tui improvements #1997

Open
stas00 opened this issue Jan 23, 2024 · 11 comments
Open

huggingface-cli delete-cache --disable-tui improvements #1997

stas00 opened this issue Jan 23, 2024 · 11 comments
Labels

Comments

@stas00
Copy link
Contributor

stas00 commented Jan 23, 2024

I tried huggingface-cli delete-cache --disable-tui for the first time. Great intention, very problematic usage when one had thousands of hub objects to cleanup. Once I understood its quirks I was able to hack around those problems.

  1. Confusing UX
huggingface-cli delete-cache --disable-tui
TUI is disabled. In order to select which revisions you want to delete, please edit
the following file using the text editor of your choice. Instructions for manual
editing are located at the beginning of the file. Edit the file, save it and confirm
to continue.
File to edit: /tmp/tmpundr7lky.txt
0 revisions selected counting for 0.0. Continue ? (y/N)

The doc says to hit y but the program exits if y is hit and all careful manual editing is lost and the user has to start from scratch (ouch!)

I suspect this is a bug or a problem in the workflow.

  1. Please don't use a true temp file, use a file that won't get deleted and a user can re-use it should they hit the wrong button - see Issue 1 above as an example.

  2. sorting to have main last consistently would help. e.g. the first few entries had a consistent-main-last listing as in:

# Model mistralai/Mistral-7B-v0.1 (29.5G, used 19 hours ago)
    5e9c98b96d071dce59368012254c55b0ec6f8658 # Refs: (detached) # modified 3 months ago
    17688039ea7b7001c702797eed2ab7716a0cc3c2 # Refs: (detached) # modified 6 weeks ago
#    26bca36bde8333b5d7f72e9ed20ccda6a618af24 # Refs: main # modified 6 weeks ago

so I started to manually uncomment lines thinking main is always last, w/o paying close attention, but luckily I caught this was inconsistent as then I run into:

# Model mistralai/Mixtral-8x7B-v0.1 (93.4G, used 10 hours ago)
#    985aa055896a8f943d4a9f2572e6ea1341823841 # Refs: main # modified 5 weeks ago
    58301445dc1378584211722b7ebf8743ec4e192b # Refs: (detached) # modified 5 weeks ago

and many other variations. For those who need to edit hundreds of these, it'd be great to have main first or last - probably actually first would be the easiest.

  1. give a user a way to delete all non-main revisions in one go

I tried to do it manually and it was super slow and I was concerned my edits will get lost again if I hit Y instead of the confusing N (see Issue 1)

at the end I resorted to this hack:

cp /tmp/tmpedbz00ox.txt cache.txt
perl -pi -e 's|^#(.*detached.*)|$1|' cache.txt
cat cache.txt >>  /tmp/tmpundr7lky.txt

and hit N, Y, Y

so I wiped out hundreds of old revisions in a second w/o manual editing.

This is usually what users want - keep the main, get rid of old revs - would it be possible to create such option?

Thank you!

@Wauplin
Copy link
Contributor

Wauplin commented Jan 23, 2024

Thanks for the great feedback @stas00! I haven't looked into this for a long time but all suggestions seems to make sense. Will keep you updated when I start working on that :)

@Wauplin Wauplin added this to the in next release? milestone Jan 23, 2024
@stas00
Copy link
Contributor Author

stas00 commented Jan 23, 2024

Super! The foundation is awesome, @Wauplin - just needs a bit of polish on top.

The other problem this tool isn't take care of is cleaning downloads/ - I checked and we had 2M of files there! I had to do:

sudo find /data/huggingface/datasets/downloads -type f -mtime +3 -exec rm {} \+
sudo find /data/huggingface/datasets/downloads -type d -empty -delete

I'm not sure if huggingface-cli could take care of datasets as well, since they come from the hub. Please let me know if I should open a separate issue, since it's related but not the same.

@Wauplin
Copy link
Contributor

Wauplin commented Jan 24, 2024

@stas00 datasets cache is another topic! At the moment datasets doesn't use the default cache shared between huggingface_hub, transformers, diffusers, etc. We will have to fix that first before providing a CLI to clean this cache too (cc @lhoestq for visibility).

@stas00
Copy link
Contributor Author

stas00 commented Jan 24, 2024

Do you want me to open an Issue here or on the datasets repo?

@Wauplin
Copy link
Contributor

Wauplin commented Jan 24, 2024

Better for the datasets repo but please tag me. Thanks!

@stas00
Copy link
Contributor Author

stas00 commented Jan 24, 2024

Done: huggingface/datasets#6614

@lappemic
Copy link
Contributor

Hey @Wauplin, thanks a lot for pointing to another issue i could work on! I could not yet fully get into it, but i think i can help out here and would love to work on this! As soon as i was able to work out something i will open a draft PR (or some clarifying questions 😄). Will probably be next week.

@lappemic
Copy link
Contributor

Hey @Wauplin, found finally some time to look into this (later than expected, sorry about this). Not sure if i grasped this all correctly, so i try to lay out the way forward to avoid confusion.

Suggestion for moving forward from here:

  1. Implement a new subcommand structure (from PR Fixes #2219 - adding a subcommand to enable deleting parts of a repo in the local cache #2221 )
> huggingface-cli cache delete-repo repo_id
> huggingface-cli cache delete-repo repo_id --repo-type=dataset
> huggingface-cli cache delete-repo repo_id --include (glob)
> huggingface-cli cache delete-repo repo_id --exclude (glob)
> huggingface-cli cache delete-repo repo_id --files=file1,file2,file3,file4
  • Mark the delete-cache command as deprecated in the documentation and in the CLI itself
  1. Improve the temporary file handling
  • Replace the current temporary file with a fixed file (e.g., ./tmp_hf_cleaner.txt) when TUI is disabled
  • Ensure that user edits are not lost if they accidentally choose the wrong option
  1. Enhance revision selection
  • Add an option to select all "detached" revisions at once, keeping only main or specific tags
  • Implement consistent sorting of revisions, preferably with the main revision always in the same position (first or last)
  1. Clarify the confirmation process
  • Revise the wording when confirming deletions to avoid confusion
  • Ensure that the confirmation process is consistent and intuitive
  1. Address the .DS_Store issue
  • Fix the problem where hidden files like .DS_Store prevent the complete deletion of directories

Is this correct? If yes i would start by implementing the new subcommand structure and the options in separate PR's.

@Wauplin
Copy link
Contributor

Wauplin commented Oct 23, 2024

Hi @lappemic, thanks for getting back to me. I don't think 1. and 5. should be tackled as part of this issue but rather in #2221 directly. I find them a bit unrelated to 2. 3. and 4. which are focused on improving the current workflow.

I think that 2. and 4. can be tackled as part of the same PR while 3. (which adds a new option) can be done in a separate one. What do you tihnk?

@lappemic
Copy link
Contributor

Thanks a lot for the feedback! Sounds good to me! Will open a draft PR next week as soon as i get to it!

@pranavktrpl
Copy link

Hi, so a weird thing, I am trying to run the above command and uncommenting the whole thing, yet my terminal shows 0 revisions even after I save the file and when I press y, it just exits and says Deletion cancelled... can anybody troubleshoot, or suggest a solution?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants