Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep lock files in a /locks folder to prevent rare concurrency issue #1659

Merged
merged 3 commits into from
Oct 16, 2023

Conversation

beeender
Copy link
Contributor

@beeender beeender commented Sep 12, 2023

The lock file could be removed when the 2nd process gets the lock.
And then the 3rd process will lock on a different lock file handle.
Although the lock path stays the same.

1st proc gets the lock
2nd proc waits for the lock
1st proc releases the lock
1st proc removes the lock file
2nd proc gets the lock
3rd proc creates a new lock file and gets the lock

Demo code to show the problem:

from filelock import FileLock
import threading
import os
import time

lock_path = "/tmp/test.lock"

def lock1():
   with FileLock(lock_path):
      print("lock acquired in thread1")
      time.sleep(1)

   print("lock released in thread1")
   time.sleep(1)
   os.remove(lock_path)
   print("lock file removed in thread1")

def lock2():
   time.sleep(1)
   with FileLock(lock_path):
      print("lock acquired in thread2")
      time.sleep(10)

   print("lock released in thread2")

def lock3():
   time.sleep(3)
   with FileLock(lock_path):
      print("lock acquired in thread3")
      time.sleep(10)


t1 = threading.Thread(target=lock1)
t2 = threading.Thread(target=lock2)
t3 = threading.Thread(target=lock3)
t1.start()
t2.start()
t3.start()
t1.join()
t2.join()
t3.join()

Add ENV HF_HUB_KEEP_LOCK_FILES to give user option to keep the lock files to prevent concurrency issues.

EDIT on 2023/1014

Windows doesn't have this problem.

This commit moves the lock files to a subdirectory of the hub cache,
and don't remove it after downloading.

The lock files are named with their etag and .lock extension, placed
in the HUGGINGFACE_HUB_CACHE/.locks/repo_folder_name directory. The
repo_folder_name is generated from repo_id and repo_type, to avoid
same etag in different repos.

@beeender beeender changed the title Removing lock file causes concurrency issues Removing download lock file causes concurrency issues Sep 12, 2023
@Wauplin
Copy link
Contributor

Wauplin commented Sep 12, 2023

Hi @beeender thanks for raising the question here. I think the reason why we want to delete the lock file is to avoid polluting the cache system. Filelock has different ways of handling the file lock depending on the platform:

What we could do to ensure that the file lock is removed in any case is to use only the WindowsFileLock and SoftFileLock, instead of UnixFileLock (taking the logic from here). Not an expert here so I'd certainly need to double-check the possible side effect.
EDIT: not a fan of doing that either. SoftFileLock does seem to be more error-prone especially if a process gets killed in the middle of a process - which is quite likely.

cc-ing @julien-c who implemented it in #801

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Sep 18, 2023

The documentation is not available anymore as the PR was closed or merged.

@Wauplin
Copy link
Contributor

Wauplin commented Sep 19, 2023

@beeender after thinking about it and some discussions with @julien-c, we ended up with the conclusion:

  • We want to use filelock with the default Unix and Windows implementations that relies not only on the file existence but also some low-level to check if a process currently owns the lock. Therefore I'd rather not use SoftFileLock by default everywhere. It's a good proxy but it'd be quite easy to end-up in a dead-lock.
  • By default, filelock doesn't remove the lock files. There have been issues with that in the past (see Random assertions failing in test.py  tox-dev/filelock#31) and there are no satisfying good practices to remove them (see Best practice removal of lock file? tox-dev/filelock#76).
  • We still prefer to remove the lock files in order to leave a "clean" cache. Leaving lock files is not wrong but looks messy to users. Also leaving lock might break some downstream tools (at least not scan/delete cache, not snapshot_download but we might be missing something?)
  • Deleting lock files as we do it now can lead to concurrency issues but very few cases have been reported in 18 months, so hopefully not very problematic for most of our users.

That being said what we can do is to add a new flag (HF_HUB_KEEP_LOCK_FILES?) for power-users like you that have multiple processes downloading a lot of files in parallel. What do you think @beeender ? That would be a bit more work on the PR (both logic-wise and docs-wise) but it seems to me a good compromise between usage and robustness.

(edit: another interesting read)

@beeender
Copy link
Contributor Author

That being said what we can do is to add a new flag (HF_HUB_KEEP_LOCK_FILES?)

This sounds good to me! Let me add that in the PR.

@Wauplin
Copy link
Contributor

Wauplin commented Sep 19, 2023

Great, thanks!

To do so, I would:

And that should be it. Please let me know if you need any help. Thanks in advance!

@beeender
Copy link
Contributor Author

beeender commented Sep 19, 2023

+    def test_keep_lock_file(self):
+        """Lock files should be kept if HF_HUB_KEEP_LOCK_FILES is True
+        """
+        with SoftTemporaryDirectory() as tmpdir:
+            with patch("huggingface_hub.constants.HF_HUB_KEEP_LOCK_FILES", False):
+                hf_hub_download(DUMMY_MODEL_ID, filename=CONFIG_NAME, cache_dir=tmpdir)
+            for subdir, dirs, files in os.walk(tmpdir):
+                for file in files:
+                    # No lock files should exist
+                    self.assertNotRegex(file, ".*\.lock")
+
+        with SoftTemporaryDirectory() as tmpdir:
+            with patch("huggingface_hub.constants.HF_HUB_KEEP_LOCK_FILES", True):
+                hf_hub_download(DUMMY_MODEL_ID, filename=CONFIG_NAME, cache_dir=tmpdir)
+            lock_file_exist = False
+            for subdir, dirs, files in os.walk(tmpdir):
+                for file in files:
+                    if file.endWith(".lock"):
+                        lock_file_exist = True
+                        break
+            self.assertTrue(lock_file_exist)

@Wauplin My constants mocking doesn't work, do you have any idea?

@beeender beeender changed the title Removing download lock file causes concurrency issues HF_HUB_KEEP_LOCK_FILES to prevent rare concurrency issue Sep 19, 2023
@Wauplin
Copy link
Contributor

Wauplin commented Sep 19, 2023

My constants mocking doesn't work, do you have any idea?

Since you are importing the constant directly in file_download, you should patch the value there:

with patch("huggingface_hub.file_download.HF_HUB_KEEP_LOCK_FILES", True):

and it should work.

@beeender
Copy link
Contributor Author

My constants mocking doesn't work, do you have any idea?

Since you are importing the constant directly in file_download, you should patch the value there:

with patch("huggingface_hub.file_download.HF_HUB_KEEP_LOCK_FILES", True):

and it should work.

Magic! And tests have been added.

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @beeender! Everything looks good to me. I'll update the branch and run the CI. We are good to merge it once it's ✔️ ! 🎉 Second thoughts: we will remove the env variable. See #1659 (comment).

docs/source/en/package_reference/environment_variables.md Outdated Show resolved Hide resolved
src/huggingface_hub/constants.py Outdated Show resolved Hide resolved
tests/test_file_download.py Outdated Show resolved Hide resolved
tests/test_file_download.py Show resolved Hide resolved
@Wauplin
Copy link
Contributor

Wauplin commented Sep 19, 2023

@beeender it seems the lock file is deleted on the CI Windows machine even when HF_HUB_KEEP_LOCK_FILES=1. This is most likely because filelock already deleted it. Let's just skip the test on windows and that's it. I'll make the update :)

from logs (see https://github.com/huggingface/huggingface_hub/actions/runs/6238074721/job/16933112571?pr=1659):

        with SoftTemporaryDirectory() as tmpdir:
            with patch("huggingface_hub.file_download.HF_HUB_KEEP_LOCK_FILES", True):
                hf_hub_download(DUMMY_MODEL_ID, filename=CONFIG_NAME, cache_dir=tmpdir)
            lock_file_exist = False
            for subdir, dirs, files in os.walk(tmpdir):
                for file in files:
                    if file.endswith(".lock"):
                        lock_file_exist = True
                        break
>           self.assertTrue(lock_file_exist)
E           AssertionError: False is not true

EDIT: problem fixed (see #1659 (comment)).

@beeender
Copy link
Contributor Author

In another thought, shall we just create the lock file in the temp folder, so it does not have to be deleted.
Like:

/tmp/huggingface/<CHECKSUM_FILE_PATH>.lock

This may not work if the cache folder is a sharing folder (nfs/samba) and downloading happens on multiple hosts. But I doubt the flock ever works with nfs/samba.
HF_HUB_KEEP_LOCK_FILES won't be needed then.

@Wauplin
Copy link
Contributor

Wauplin commented Sep 20, 2023

I like the idea!

What do you think about leaving the lock files in a separate directory but still in the $HF_HOME folder? This way we ensure that locks on shared drive still works as they do now (even if it's only based on file existence, it's better than nothing). That would give us something like this:

➜  ~ ls -la ~/.cache/huggingface/hub                                  
total 104
drwxrwxr-x 21 wauplin wauplin 16384 sept. 20 08:57 .
drwxrwxr-x  6 wauplin wauplin  4096 sept. 18 16:39 ..
drwxrwxr-x  5 wauplin wauplin  4096 avril  6 17:07 datasets--hf-internal-testing--dummy-flac-single-example
drwxrwxr-x  5 wauplin wauplin  4096 mai   26 10:22 datasets--Narsil--image_dummy
drwxrwxr-x  2 wauplin wauplin  4096 sept. 20 08:50 .locks
drwxrwxr-x  5 wauplin wauplin  4096 avril 25 14:53 models--admruul--anything-v3.0
drwxrwxr-x  5 wauplin wauplin  4096 juil. 10 12:51 models--CompVis--stable-diffusion-v1-4
drwxrwxr-x  5 wauplin wauplin  4096 sept. 11 12:50 models--gpt2
drwxrwxr-x  5 wauplin wauplin  4096 sept. 19 13:43 models--hf-internal-testing--dummy-gated-model
drwxrwxr-x  6 wauplin wauplin  4096 avril 24 10:04 models--hf-internal-testing--tiny-random-bert
(...)

➜  ~ tree -la ~/.cache/huggingface/hub/.locks                                                
/home/wauplin/.cache/huggingface/hub/.locks
└── 46a9d58622b4675b29564da2d9ba73e702241c5fa969f12c387cad4aa984276a.lock

0 directories, 1 file

with lock_path = os.path.join(locks_dir, f"{filename}.lock"). Filename is based on the etag of the file on the Hub. I'll need to double check that etags are unique across repos (I think so but better to check). In any case it should be easy to build a unique lock path per blob file. Etags are not unique (see EDIT below).

That way we still have lock files but not messing around everywhere in the cache. We could have an utility to clean the locks folder (either after snapshot_download or when scanning/deleting the cache huggingface-cli delete-cache). "Cleaning the locks folder" could be "delete all locks older than 12h" for instance. Whatever, this part is not so critical.

EDIT: etags are not unique across repos. Etag for an LFS file is its sha256 checksum. This means that if a same file is uploaded to several repos, it shares the same Etag. And if a user wants to download it from different repos, the blob files must be re-downloaded (that's the current behavior and we won't change that). Meaning that the lock_path must be based on both the Etag and repo_id/repo_type. Something like:

lock_path = os.path.join(locks_dir, repo_folder_name(repo_id=repo_id, repo_type=repo_type), f"{etag}.lock")
# e.g.: ~/.cache/huggingface/hub/.locks/models--julien-c--EsperBERTo-small/46a9d58622b4675b29564da2d9ba73e702241c5fa969f12c387cad4aa984276a.lock

@julien-c
Copy link
Member

yes, great idea!

and a CLI script could take care of deleting .locks from time to time if needed.

@Wauplin Wauplin changed the title HF_HUB_KEEP_LOCK_FILES to prevent rare concurrency issue Keep lock files in a /locks folder to prevent rare concurrency issue Oct 5, 2023
Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm making some cleaning on the repo so I renamed this PR accordingly with what we discussed in #1659 (comment). @beeender are you still interested in implementing it? :)

@beeender
Copy link
Contributor Author

I'm making some cleaning on the repo so I renamed this PR accordingly with what we discussed in #1659 (comment). @beeender are you still interested in implementing it? :)

Yes! I was on vacation for the last week. Will create a PR soon.

The lock file could be removed when the 2nd process gets the lock.
And then the 3rd process will lock on a different lock file handle.
Although the lock path stays the same.

1st proc gets the lock
2nd proc waits for the lock
1st proc releases the lock
1st proc removes the lock file
2nd proc gets the lock
3rd proc creates a new lock file and gets the lock

Windows doesn't have this problem.

This commit moves the lock files to a subdirectory of the hub cache,
and don't remove it after downloading.

The lock files are named with their 'etag' and '.lock' extension, placed
in the 'HUGGINGFACE_HUB_CACHE/.locks/repo_folder_name' directory. The
repo_folder_name is generated from 'repo_id' and 'repo_type', to avoid
same 'etag' in different repos.

Co-authored-by: Lucain <[email protected]>
@beeender
Copy link
Contributor Author

@Wauplin PR has been updated. 🆙

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean PR! Thanks for making the change @beeender. I'm glad that we won't introduce a new environment variable like HF_HUB_KEEP_LOCK_FILES and still be robust across processes! 🔥

@@ -1420,6 +1422,7 @@ def hf_hub_download(
if os.name == "nt" and len(os.path.abspath(blob_path)) > 255:
blob_path = "\\\\?\\" + os.path.abspath(blob_path)

Path(lock_path).parent.mkdir(parents=True, exist_ok=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Wauplin
Copy link
Contributor

Wauplin commented Oct 16, 2023

(fix tests in e670797. Scan-cache was failing to scan the .locks folder)

@Wauplin Wauplin merged commit 85b7753 into huggingface:main Oct 16, 2023
@beeender beeender deleted the mc/rm_lock branch October 16, 2023 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants