Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Storage to S3, LIST operations #460

Open
Cedric-LG opened this issue Jul 24, 2019 · 8 comments
Open

Storage to S3, LIST operations #460

Cedric-LG opened this issue Jul 24, 2019 · 8 comments
Labels
performance Potential issues with Zarr performance (I/O, memory, etc.)

Comments

@Cedric-LG
Copy link

I have a zarr file on S3 where I am storing data on every ten minutes. I'm using zarr version 2.3.1 and s3fs to connect to the AWS bucket. The zarr file has the following structure:

/
 └── yyyy
         └── mm
                    └── dd
                             └── HHMM
                                         └── variable
                                                    ├── array (1, 1440, 1440) int32
                                         └── variable
                                                    ├── array (1, 1440, 1440) int32

As my zarr file is growing I'm noticing an increase in costs due to LIST operations. When digging into the log files I noticed that the creation of a zarr array zarr.create() on S3 involves the listing of all the groups in the zarr file. As a LIST operation on S3 is expensive and the number of requests grows with the growing number of groups we have in the zarr file. Therefore I'm having an unsustainable situation in terms costs related to (unnecessary?) LIST operations. See a screenshot of the logs:

urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2Fcth%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None

Is there a work around for this that doesn't require the listing of all the groups when pushing an array to a new group? Or is there another way of saving an array to zarr that doesn't require that the array exists?

Thanks

Cedric

@jakirkham
Copy link
Member

Have you looked at consolidated metadata?

@leroygr
Copy link

leroygr commented Jul 24, 2019

Not sure consolidated metadata would help in this case as open_consolidated returns a read only store. Afaik, you cannot save an array with that method right?

@alimanfoo
Copy link
Member

Hi @Cedric-LG, this is a place where the current (v2) protocol doesn't work well, it's come up in design work on the next protocol version (v3) and it's something we need to fix.

For a short-term workaround, I'm looking at the current code base right now and seeing if anything could be done.

@leroygr you're right that consolidated metadata is read-only and so can't help here, unless we allowed the store to support write operations which write-through to the underlying S3 store, but that could get a little confusing.

Another option might be to add some options to disable various checks during array creation, which are the cause of all the listing. There are two main checks we could disable, the first is a check to see if the parent group exists (and if not create it), the second is a check to see if an array or group exists at the requested path for the new array. Your code very probably does not need these checks, and so it could be reasonable to skip them.

@alimanfoo
Copy link
Member

Another option might be to add some options to disable various checks during array creation, which are the cause of all the listing. There are two main checks we could disable, the first is a check to see if the parent group exists (and if not create it), the second is a check to see if an array or group exists at the requested path for the new array. Your code very probably does not need these checks, and so it could be reasonable to skip them.

So, e.g., a concrete solution here could be to add check_parent and check_exists keyword arguments to the array creation functions, which would be passed through to init_array. These would default to True which gives current behaviour, but the user could make them False in which case the checks would be skipped.

@Cedric-LG
Copy link
Author

Thanks for the quick answer @alimanfoo. It looks like it would indeed be a good solution to add check_parent and check_exists as keyword arguments to the array creation functions. I guess that this would be a quick fix for the issue. I'm not familiar with the zarr code but I'm willing to help on this issue or is it something that could be quickly implemented? As this issue is quite critical for us.

@alimanfoo
Copy link
Member

Hi @Cedric-LG, yes this should be fairly quick to implement. On reflection it might be slightly better to use the argument name require_parent rather than check_parent. Steps to implement would be roughly as follows:

  1. Add keyword argument check_exists=True to function zarr.storage._init_array_metadata. Within that function, add some conditional logic so that the calls to contains_array and contains_group only happen if check_exists is True. Do something similar for function zarr.storage._init_group_metadata.

  2. Add keyword arguments require_parent=True and check_exists=True to function zarr.storage.init_array. Within that function, add conditional logic to only call _require_parent_group() if require_parent is True. Also pass check_exists through to _init_array_metadata(). Do something similar for function zarr.storage.init_group.

  3. Add the new keyword arguments to zarr.creation.create, and pass the values through to init_array().

  4. Add the new keyword arguments to function zarr.creation.open_array, and pass them through to init_array() but only if mode == 'w'.

  5. Add the new keyword arguments to function zarr.hierarchy.group, and pass them through to init_group().

  6. Add the new keyword arguments to function zarr.hierarchy.open_group. Pass them through to init_group() only if mode == 'w'.

  7. Add the new arguments to the docstring for zarr.hierarch.Group.create_dataset. Don't think any implementation change needed there though as kwargs should already get passed through.

  8. Add the new arguments to zarr.hierarchy.Group.create_group, and pass down via _create_group_nosync() to init_group().

9 (or 0). Write some unit tests to verify things are behaving as expected, i.e., no calls to __contains__ or __iter__ or keys on the store.

If you were able to get started on any of the above I'd be very grateful, I could chip in e.g. with some tests.

@Cedric-LG
Copy link
Author

Hi @alimanfoo. We've looked further into the problem and discovered that the main problem was the fact that we were using overwrite=True. This was causing a listing of all the arrays in the zarr file while saving as it was checking whether the array already existed. Setting this to False greatly reduced our costs. None the less, the changes suggested above would further reduce the cost so I've implemented them and created a pull request #464. However, I haven't created any unit tests yet as the version 2.3.3.dev4 is not working for me as it gives an error in the indexing when it iterates over the chunks.

@dstansby dstansby added the performance Potential issues with Zarr performance (I/O, memory, etc.) label Dec 12, 2024
@dstansby
Copy link
Contributor

@Cedric-LG do you know if this has improved in the last few years at all, or is it still an issue for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Potential issues with Zarr performance (I/O, memory, etc.)
Projects
None yet
Development

No branches or pull requests

5 participants