Storage to S3, LIST operations #460
Comments
Have you looked at consolidated metadata?
Not sure consolidated metadata would help in this case, as open_consolidated returns a read-only store. AFAIK, you cannot save an array with that method, right?
Hi @Cedric-LG, this is a place where the current (v2) protocol doesn't work well. It has come up in design work on the next protocol version (v3), and it's something we need to fix. For a short-term workaround, I'm looking at the current code base right now to see if anything could be done. @leroygr, you're right that consolidated metadata is read-only and so can't help here, unless we allowed the store to support write operations that write through to the underlying S3 store, but that could get a little confusing. Another option might be to add some options to disable various checks during array creation, which are the cause of all the listing. There are two main checks we could disable: the first checks whether the parent group exists (and creates it if not); the second checks whether an array or group already exists at the requested path for the new array. Your code very probably does not need these checks, so it could be reasonable to skip them.
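To make the two checks concrete, here is a minimal sketch in plain Python. It is not zarr's actual implementation: `CountingStore`, `create_array`, and the `check_parent` / `check_exists` flags are hypothetical names invented for illustration, and the sketch counts key lookups on an in-memory dict rather than real S3 LIST requests. The only detail taken from zarr v2 is the metadata key names `.zgroup` and `.zarray`.

```python
from collections import Counter


class CountingStore(dict):
    """In-memory stand-in for an S3-backed store that counts key lookups."""

    def __init__(self):
        super().__init__()
        self.ops = Counter()

    def __contains__(self, key):
        self.ops["contains"] += 1
        return super().__contains__(key)


def create_array(store, path, check_parent=True, check_exists=True):
    """Hypothetical array creation; both flags are illustrative only."""
    parts = path.split("/")
    if check_parent:
        # Check 1: make sure every ancestor group exists, creating it if not.
        for i in range(1, len(parts)):
            key = "/".join(parts[:i]) + "/.zgroup"
            if key not in store:
                store[key] = b"{}"
    if check_exists:
        # Check 2: refuse to write over an existing array or group.
        for meta in (".zarray", ".zgroup"):
            if path + "/" + meta in store:
                raise ValueError(f"path {path!r} already contains a node")
    # Write the array metadata (placeholder content).
    store[path + "/.zarray"] = b"{}"


checked = CountingStore()
create_array(checked, "2019/07/01/temperature")

unchecked = CountingStore()
create_array(unchecked, "2019/07/01/temperature",
             check_parent=False, check_exists=False)

print(checked.ops["contains"], unchecked.ops["contains"])  # prints: 5 0
```

With both checks skipped, the only store operation left is the single metadata write. The trade-off is that the caller becomes responsible for ensuring the parent groups exist and that nothing is overwritten by accident.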
So, e.g., a concrete solution here could be to add |
Thanks for the quick answer @alimanfoo. It looks like it would indeed be a good solution to add |
Hi @Cedric-LG, yes this should be fairly quick to implement. On reflection it might be slightly better to use the argument name
9 (or 0). Write some unit tests to verify things are behaving as expected, i.e., no calls to If you were able to get started on any of the above I'd be very grateful, I could chip in e.g. with some tests. |
Hi @alimanfoo. We've looked further into the problem and discovered that the main problem was the fact that we were using |
@Cedric-LG do you know if this has improved in the last few years at all, or is it still an issue for you?
I have a zarr file on S3 where I am storing data every ten minutes. I'm using zarr version 2.3.1 and s3fs to connect to the AWS bucket. The zarr file has the following structure:
As my zarr file grows, I'm noticing an increase in costs due to LIST operations. Digging into the log files, I noticed that creating a zarr array with zarr.create() on S3 involves listing all the groups in the zarr file. A LIST operation on S3 is expensive, and the number of requests grows with the number of groups in the zarr file, so the situation is becoming unsustainable in terms of costs related to (unnecessary?) LIST operations. See a screenshot of the logs:
Is there a workaround for this that doesn't require listing all the groups when pushing an array to a new group? Or is there another way of saving an array to zarr that doesn't require that the array exists?
Thanks
Cedric