Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove find implementation #285

Closed
wants to merge 2 commits into from
Closed

remove find implementation #285

wants to merge 2 commits into from

Conversation

skshetry
Copy link
Member

fsspec by default provides a find implementation that is built on top of ls and info. Looking at the current implementation, it did not seem like it was recursive (maybe I am wrong here?), and looked complicated that I'd rather simplify and default to fsspec, and specialize only if needed.

@skshetry skshetry requested review from shcheklein and efiop June 19, 2023 03:21
@skshetry skshetry temporarily deployed to internal June 19, 2023 03:22 — with GitHub Actions Inactive
@skshetry skshetry changed the base branch from main to dvc-issue-9607 June 19, 2023 03:22
@shcheklein
Copy link
Member

@skshetry this is implementation most likely needed (it creates a cache of top level folders most likely and their IDs). I think btw we need to update it to handle the new cache format - it might be broken, not sure.

@skshetry
Copy link
Member Author

@skshetry this is implementation most likely needed (it creates a cache of top level folders most likely and their IDs). I think btw we need to update it to handle the new cache format - it might be broken, not sure.

Shouldn't that happen as part of ls(), which should be caching too?

self._cache_path_id(root_path, *dir_ids)

@shcheklein
Copy link
Member

Since ls was there before as well, I would be careful here. I can try to find some time to restore some context. Overall, the whole implementation is pretty bad and broken from the fsspec perspective, it was made optimal for DVC at that time.

@skshetry
Copy link
Member Author

skshetry commented Jun 19, 2023

Since ls was there before as well, I would be careful here. I can try to find some time to restore some context. Overall, the whole implementation is pretty bad and broken from the fsspec perspective, it was made optimal for DVC at that time.

yup, agreed. find only worked for the root of the remote, and this worked fine for us in dvc when remote cache was at the root level, but now as we moved to <remote>/files/md5, find no longer works. Although I'd prefer we merge this, as all the tests passes for dvc in iterative/dvc-gdrive#28, we can generalize the filesystem at a later time.

@skshetry
Copy link
Member Author

Again, all the changes to cache structure can be and should be done without implementing find. ls and info are appropriate place for change.

(We will only need to implement find if there's a cheaper way to recursively list files in gdrive).

@shcheklein
Copy link
Member

Although I'd prefer we merge this, as all the tests passes for dvc in iterative/dvc-gdrive#28, we can generalize the filesystem at a later time.

I don't understand the implications of this, If have time to research - compare the number of API requests, performance, MT safety that would be great. Or at least some explanation why did we have this. W/o that let's not merge this please.

We need to test it on the new cache format and fix it if needed.

@shcheklein
Copy link
Member

Some context that I remember. In Google drive:

  • names are not unique
  • you need an ID to access any directory (e.g. to list files inside)
  • API calls are expensive (rate-limited) - can become easily a bottleneck

Thus it was important to do:

  • create once (and only once!) all the 00, 01, ..... ff and get their IDs
  • make sure that we have them always in memory and don't hit APIs multiple times for that

find might have been related to all of that.

And, again, we need to review it for the new structure, otherwise it's indeed can be quite broken for DVC.

@skshetry
Copy link
Member Author

I see that in find, it does:

self._cache_path_id(item_path, item["id"])

which should be implemented in ls() too.

contents = []
for item in self._gdrive_list_ids(dir_ids):
item_path = posixpath.join(root_path, item["title"])
if item["mimeType"] == FOLDER_MIME_TYPE:
contents.append(
. Will look into it.

@skshetry skshetry temporarily deployed to internal June 20, 2023 10:34 — with GitHub Actions Inactive
@skshetry
Copy link
Member Author

The patch now caches the ID of folders on ls too. Previous implementation of find only used one query id per directory which was reused from _ids_cache. But I don't think we can make an assumption that all the dir ids that exists are in the cache, (and hence we have to do a list anyway?).

Regarding multithreading, it should be safe (thanks to GIL).

@skshetry
Copy link
Member Author

otherwise it's indeed can be quite broken for DVC.

Note that it is broken for DVC right now.

@skshetry skshetry temporarily deployed to internal June 20, 2023 11:28 — with GitHub Actions Inactive
@skshetry
Copy link
Member Author

The failure in Python 3.7 on macOS is unrelated, see actions/setup-python#682.

It is also fixed in fsspec, see fsspec/filesystem_spec#1295.
Anyway, we can pin Python to "3.7.16" for now (in a separate PR):

pyv: ["3.7", "3.8", "3.9", "3.10"]

@shcheklein
Copy link
Member

Just to be on the same page. Even with this patch it's broken for the new DVC. E.g. here when we initialize we cache root ids, we don't cache all of the now. Upstream we should be also creating them once, etc.

For this patch. How do we use / used find upstream, in DVC - could point me to the code please?

@skshetry
Copy link
Member Author

Even with this patch it's broken for the new DVC

I got confirmation from user that it's broken for legacy ODB from a user. I need to investigate that.

E.g. here when we initialize we cache root ids, we don't cache all of the now.

It gets cached when _ids_cache is invoked, no?

cached = base in self._ids_cache["dirs"]

For this patch. How do we use / used find upstream, in DVC - could point me to the code please?

This is where we use find.

https://github.com/iterative/dvc-objects/blob/7e07ccf42cf85f9fbea6b142a0609d796cba0539/src/dvc_objects/db.py#L216

@efiop
Copy link
Contributor

efiop commented Jun 21, 2023

Seems like as always gdrive is very involved and the state of pydrive2 and related stuff is not the best. Along with pydrive2 using legacy API version that might get dropped in the future, I just want to say, for the record, that dropping gdrive is also an option that we should consider here if we can't fix it quickly. It is great for onboarding, but probably doesn't contribute anything to the customers and we need to carefully consider if we can spend time on this. Not saying we should drop it now, but that we should keep that as a possible solution.

@shcheklein
Copy link
Member

The patch now caches the ID of folders on ls too. Previous implementation of find only used one query id per directory which was reused from _ids_cache. But I don't think we can make an assumption that all the dir ids that exists are in the cache, (and hence we have to do a list anyway?).

This find implementation is very DVC-specific and GDrive-specific. It optimizes the number of API calls we make.

  1. As soon as DVC starts it caches all 00, 01, .... ff. (on create dir we update it in thread-safety way). It's a single API calls.
  2. When we run find we take only those (we don't care about file under root, etc). And run a single query to start fetching w/o listing the root again (each time an API call) + then making potentially multiple calls per each directory inside.

Things that are potentially broken / not optimal is that we don't pre-cache file/md5/00 ... file/md5/ff roots now (unless we initialize two different FSs instances with different roots?)

Otherwise I don't see why it would be broken. Were you able to reproduce it? I can run and probably fix it quickly then.

@skshetry
Copy link
Member Author

The patch now caches the ID of folders on ls too. Previous implementation of find only used one query id per directory which was reused from _ids_cache. But I don't think we can make an assumption that all the dir ids that exists are in the cache, (and hence we have to do a list anyway?).

This find implementation is very DVC-specific and GDrive-specific. It optimizes the number of API calls we make.

1. As soon as DVC starts it caches all `00`, `01`, .... `ff`. (on create dir we update it in thread-safety way). It's a single API calls.

This does cache ids lazily now during ls which should have similar effect

2. When we run find we take only those (we don't care about file under root, etc). And run a _single_ query to start fetching w/o listing the root again (each time an API call) + then making potentially multiple calls per each directory inside.

This is the one that is broken at this time, and this PR avoids that for simplicity.

Otherwise I don't see why it would be broken. Were you able to reproduce it? I can run and probably fix it quickly then.

I have an alternative patch that caches path provided to fs.find(). Will create a PR.

@shcheklein
Copy link
Member

@skshetry was the user problem reproduced? Can we start with that please. If there is way to reproduce there should be probably a simple fix?

Number of API calls is critical for GDrive DVC- thus this optimization. Unless we really have a food reason for this, let's not try to simplify it for now please?

@skshetry
Copy link
Member Author

skshetry commented Jun 22, 2023

@skshetry was the user problem reproduced? Can we start with that please. If there is way to reproduce there should be probably a simple fix?

@shcheklein, there were two issues:

  1. fs.ls("files/md5") was returning None when the directory did not exist (instead of raising FileNotFoundError), which is fixed in GDriveFileSystem: Raise FIleNotFoundError on ls. #283.
  2. fs.find("files/md5") returns an empty list, because it is only using dir_ids cached from root, and does not populate dir_ids from files/md5.

@skshetry
Copy link
Member Author

Closing in favour of #286.

@skshetry skshetry closed this Jun 22, 2023
@skshetry skshetry deleted the remove-find branch June 22, 2023 06:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants