Second zarr design doc #552

Merged Oct 20, 2021 (4 commits)
doc/design/zarr-support-2.md (+94 lines)
# Abstract

Zarr files are stored in their entirety in S3.
Zarr files are represented by a single Asset through the API.
A treehash is computed as the zarr file is uploaded.
Each node in the treehash is stored in S3 to avoid bloating the DB.
When published, the zarr file is copied to a separate location in the bucket so that changes to the draft cannot affect the published file.
Member:
expensive operation; it could cause publishing to take days...

Contributor Author:
Yes, sadly. IIRC immediate publishes were not a requirement, just a nice feature.

Member:
we can instead copy to a new draft location and leave that entity published. so a user may have to explicitly initiate a draft for zarr-based dandisets. i would flip this operation, but there are a few other details to consider.


# Requirements

1. Zarr files are stored in a "directory" in S3.
1. Each zarr file corresponds to a single Asset.
Member:
"zarr file" has been used to describe the zarr "directory" (well, different backends might store it differently, including into a single file). But I wonder if we should set the terminology a little more straight and call it a "zarr folder", "zarr archive", or "zarr blob", so as not to confuse it with the individual "zarr files" that constitute the zarr file/folder/archive/blob. Otherwise the above suggests that "each zarr file [among those zarr files mentioned above] corresponds to a single Asset", which is not the case. So what about

Suggested change
1. Zarr files are stored in a "directory" in S3.
1. Each zarr file corresponds to a single Asset.
1. A Zarr blob is stored as files in a "directory" in S3.
1. Each zarr blob corresponds to a single Asset.

? (and reflect that everywhere else in the document text/API endpoint)

Contributor Author:
In this document I've been referring to the entire zarr thing as a "zarr file", and the files that comprise it as "subfiles". If this design is worth investing more time into I can rename things appropriately.

1. The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr file.
1. The API verifies the checksum _immediately_ after upload.
1. The system can theoretically handle zarr files with ~1 million subfiles, each of size 64 * 64 * 64 bytes ~= 262 kilobytes.
1. Zarr metadata must be mutable for zarr files in draft dandisets.
Member:
although metadata is likely to be the one to mutate more often, AFAIK there is no reason to discriminate between zarr metadata and data files (we aren't going to bother semantically distinguishing metadata from data files in a zarr folder, are we?). So maybe just

Suggested change
1. Zarr metadata must be mutable for zarr files in draft dandisets.
1. Zarr files must be mutable in draft dandisets.

?

Contributor Author:
The requirement as I recall it was that metadata needs to be mutable, while the rest of the files should be mutable, but don't have to be. In this list I'm trying to capture things that are needed, not things we'd like to have.

In this implementation all files are mutable, but there might be some alternative proposal where they are not. The choice of implementation should not affect the list of requirements.

Member:
ok, then I would prefer if we make it a requirement now that any of the "zarr files" should be modifiable, or we would be in the same boat quite quickly IMHO. WDYT @satra?

Member:
i agree with the more general formulation here. in draft state everything is mutable. but the most likely change will be the metadata. we don't have to optimize for this at this stage. simply allow files to be overwritten/deleted/added.


# Implementation

## API endpoints

* POST /api/zarr_files/
Member:
i know we are calling this zarr_files but i would suggest a more generic name. zarr is one of many folder based structures.

Start a new zarr file.
Returns a zarr ID
* GET /api/zarr_files/{zarr_id}/
Returns some information about the zarr file, like name, S3 path+url, and checksum
* POST /api/zarr_files/{zarr_id}/batch/start/
Ask to upload a batch of subfiles as part of the zarr file.
No more batch uploads are permitted until this one is completed or aborted.
Member:
minor (just curious if there is a good existing one to match): what would be the return code to inform about an ongoing batch upload?
will different batches from different users be allowed? how would we guarantee consistency of the entire "zarr file" instead of just a particular (partial "file") batch?

Contributor Author:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/409 seems appropriate.

Nope, only one batch at a time per zarr, so no concurrent uploads. If an upload crashes and someone else needs to take over, they can cancel the pending batch upload. If we ever need multiple simultaneous uploads, we could have a batch ID and allow multiple batches simultaneously, but that just sounds needlessly complex right now for a feature no one asked for.

My theory was that a single CLI instance would request a batch of a reasonable size, then upload all the parts simultaneously to keep the pipe to S3 as full as possible.

Member:
agree -- no simultaneous batches, but allow different users to upload (whenever there is no ongoing batch) to the same "zarr file".

The question remains -- how will we ensure the consistency / integrity of the entire "zarr file" and not just a particular batch? (This matters in particular for the aforementioned "multiple users can upload" case, but it can also manifest when the same user uploads while the local "zarr file" changes between batches.) I guess it could be up to the CLI to get a full "checksum" for the "zarr file" upon completion and compare it with the local version (which might require heavy traversal, but maybe fscache would be of help)?

Contributor Author:
Good point, in the current design the checksum is not actually checked anywhere. We could simply delegate to the CLI to independently verify that the current reported checksum for the uploaded zarr file is the same as the locally computed one.

It's worth pointing out that each file in each batch does have its ETag checked during the batch complete step.

Requires a list of file paths and ETags (md5 checksums)
The file paths may include already uploaded files; this is how updates are done.
Returns a list of presigned upload URLs
* POST /api/zarr_files/{zarr_id}/batch/complete/
Completes an upload of a batch of subfiles.
Fails if any of the checksums do not match.
Also, recursively updates the tree hash for every parent node of each uploaded subfile.
* DELETE /api/zarr_files/{zarr_id}/batch/
Cancels a batch upload.
Any files already uploaded are deleted.
Member:
what happens to tree hash now that some nodes are to be removed -- I guess the DELETE endpoint from below is used?

Contributor Author:
it would be recomputed by the API server for that directory level exactly as if a new node had been added. There's a little more complexity around deleting the last file in a directory, but for design purposes right now, the API server will ensure the treehash is correct when completing a batch upload (create/update) or a batch delete (delete).

A new batch upload can then be started.
* DELETE /api/zarr_files/{zarr_id}/batch/delete/
Member:
/batch in this endpoint confused me -- this one is not about the batch upload operation, but rather just about deleting some portion of an existing 'zarr file', right? so why not just

DELETE /api/zarr_files/{zarr_id}

with mandatory paths (how large could that parameter get?).
note: we do not allow DELETE of blobs (they are to be GCed), hence I guess we aren't going to provide DELETE for an entire zarr "folder" here.

Contributor Author:
My thought was that this was the delete-y counterpart to the batch/start/ and batch/complete/ endpoints, so it should live next to them. DELETE /batch makes more sense as the "cancel the upload" operation, and DELETE /zarr_files/{zarr_id}/ seemed confusing to me, since it would read as deleting the whole zarr file rather than just parts of it.

Also, as specified, it accepts a list of paths to delete, so it is technically deleting a batch of subfiles. It's less involved than the upload process, but it's still arguably a batch.

I don't feel super strongly about naming semantics; we can hash out API details more if the overall design is acceptable.

Deletes subfiles from S3, and updates the treehash accordingly.
Requires a list of file paths

* POST /api/dandisets/{...}/versions/{...}/assets/zarr/{zarr_id}/
Member:
why not just reuse existing

POST /dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/

which is given blob_id ATM, but I would expect blob_id and zarr_id not to collide, so the code could figure out whether it was a blob or a zarr. Moreover, such an endpoint with /{zarr_id}/ isn't consistent with the above-mentioned POST where blob_id is provided among the parameters to the call.

Contributor Author:
I don't like the idea of accepting a /{blob_id} or a /{zarr_id} in an endpoint because those two flavors of ID identify different (related) objects, and allowing them to be used interchangeably makes for a more confusing API. I have a personal preference for multiple well-scoped endpoints instead of a single "clever" endpoint that can accept anything.

Creates a new Asset that points to this zarr data.
Returns an asset ID
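
For illustration, here is a minimal sketch of how a client might drive these endpoints during an upload. The base URL, payload field names (`name`, `files`, `path`, `etag`, `upload_url`), and response shapes are assumptions made for the sake of the example, not settled API details.

```python
import hashlib

import requests

API = "https://api.example.org/api"  # placeholder base URL

# Create a new zarr file and get back its ID (assumed response shape).
zarr_id = requests.post(f"{API}/zarr_files/", json={"name": "example.zarr"}).json()["zarr_id"]

# Start a batch upload; for non-multipart uploads the S3 ETag is the plain MD5.
local_files = {"1/2/3/4.txt": b"chunk data"}
batch = requests.post(
    f"{API}/zarr_files/{zarr_id}/batch/start/",
    json={
        "files": [
            {"path": path, "etag": hashlib.md5(data).hexdigest()}
            for path, data in local_files.items()
        ]
    },
).json()

# Upload each subfile directly to S3 through its presigned URL.
for entry in batch["files"]:
    requests.put(entry["upload_url"], data=local_files[entry["path"]])

# Complete the batch; the server verifies ETags and updates the treehash.
requests.post(f"{API}/zarr_files/{zarr_id}/batch/complete/")
```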

## Upload flow

TODO the only gotcha is that published zarr files can't be uploaded to/updated/deleted from
The slowest API request in the upload flow is the `.../batch/complete/` operation, since it needs to update several `.checksum` files in S3.
I recommend that the CLI dynamically adjust batch size to keep the `.../batch/complete/` operation under 30s.
Member:
CLIs should not be burdened with timing server operations! max batch size should be set by/on the server

Contributor Author:
We can enforce some maximum batch size to ensure the finalization never takes more than 30s, but I thought allowing the CLI to choose its batch size was a neat feature.

Suggested change
I recommend that the CLI dynamically adjust batch size to keep the `.../batch/complete/` operation under 30s.
We will enforce a maximum batch size to ensure clients never upload batches that are too big to hash in a timely manner.


## Modification flow

TODO this is basically the same as the upload flow, but with deletes allowed.
Member:
note: so it will be up to the client to first get a full list of the subfile paths for a "zarr file", figure out what to delete, delete them, and upload the modified ones. Without locking -- a chance for a race condition.

Contributor Author:
I'm pretty sure if you're modifying, you can just request to upload to an existing path and S3 will just overwrite the old object, so no delete necessary.


## Hashing scheme

To avoid having to store a checksum for every directory in every zarr file, we will delegate the storage of the checksums to S3.
Update times will be substantially slower since each level of the tree needs to be updated.
Storage costs will be offloaded to S3, and the treehash will be conveniently accessible to clients.

Let's say the zarr files are kept under `/zarr/{zarr_id}/...` in S3.
Member:
most likely we would like to have, similar to /blobs, 2 extra levels in the hierarchy, i.e. have it as

Suggested change
Let's say the zarr files are kept under `/zarr/{zarr_id}/...` in S3.
Let's say the zarr files are kept under `/zarr/{zarr_id[:3]}/{zarr_id[3:6]}/{zarr_id}/...` in S3.

A subfile `1/2/3/4.txt` of zarr file `foobar` would be kept at `/zarr/foobar/1/2/3/4.txt`.
The ETag can be trivially determined by a `HEAD` request to that S3 object.
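
As a small illustration, the lookup with `boto3` would be a single `head_object` call; the bucket name here is a placeholder.

```python
import boto3

s3 = boto3.client("s3")
# For non-multipart uploads, the ETag is the object's MD5 digest (quoted).
etag = s3.head_object(
    Bucket="dandi-bucket",  # placeholder bucket name
    Key="zarr/foobar/1/2/3/4.txt",
)["ETag"].strip('"')
```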

After this subfile is uploaded, the API server creates a file `/zarr_checksums/foobar/1/2/3/.checksum` if it doesn't already exist (note the altered prefix).
Member:
note: this would be "spreading" zarr-relevant metadata across distant locations (within the /zarr and /zarr_checksums hierarchies), making it a little odd. Since foobar is really our zarr_id (thus we have control over its name), we could use /zarr/foobar.checksums/ or the like. That would keep them nearby - easier for anyone to discover/use/etc.

note: AFAIK we could have a /zarr/foobar.checksum file and a /zarr/foobar.checksum/1 file on S3, i.e. /zarr/foobar.checksum could be both a "file" and a "directory". A "nice" feature of S3, but we ideally should avoid using it, to not complicate local backups etc., since it wouldn't be representable on a regular file system.
Then it could be /zarr/foobar.checksum/.checksum and so on down the hierarchy.

Contributor Author:
I have no objection to keeping them more proximate, I just didn't want to include .checksums in the actual zarr directories themselves to avoid "polluting" real data with our checksum garbage.
I explicitly decided against storing checksums as objects with paths that are directory names for exactly the reasons you outlined. I see no reason to do that at any point.

Member:
... I just didn't want to include .checksums in the actual zarr directories themselves to avoid "polluting" real data with our checksum garbage

yes, we should not pollute it with anything ad-hoc.

This file contains a list of all the files/directories currently in the directory, their checksums, and an aggregated checksum of the entire directory contents.
If `.checksum` already exists, it is updated with the new file+ETag and the final checksum is updated.
Member:
What would be the size of such a file for a sizeable zarr file (e.g. 1 million constituent 'zarr files') -- i.e. wouldn't it be prohibitive?

ideally we should check within the zarr ecosystem/community whether something like that is already implemented or thought about. As I have mentioned elsewhere, there is already some support for consolidated metadata, and issues (e.g. this one) about slow listing of large "zarr files" on S3 have already been raised.

Contributor Author:
There's basically a .checksum file per directory. If your zarr file has a directory that contains 100,000 files/directories, yes indeed cost would be prohibitive. My understanding was that the whole value proposition of zarr is the tree-structure used to store the subfiles, so if a zarr file has a profoundly unbalanced tree, it's not very useful.

If there is data that ugly that might get uploaded to dandi, then that would be a great requirement to know about.


The API server then recursively travels up another level and creates `/zarr_checksums/foobar/1/2/.checksum`.
This `.checksum` will contain an entry for the directory `3` and the final checksum from `3/.checksum`.
This recursion continues all the way to the top, yielding a `/zarr_checksums/foobar/.checksum` file which contains an aggregated checksum for the entire zarr file.
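
As a rough sketch of what this server-side update could look like: the JSON layout of `.checksum`, the bucket name, the helper name, and the inline aggregation are all placeholders (the actual file format and treehash algorithm are still TODO below).

```python
import hashlib
import json
import posixpath

import boto3

s3 = boto3.client("s3")
BUCKET = "dandi-bucket"  # placeholder bucket name


def update_checksum_tree(zarr_id: str, path: str, etag: str) -> None:
    """Propagate a new/updated subfile ETag up to the root .checksum file."""
    name, digest = posixpath.basename(path), etag
    directory = posixpath.dirname(path)
    while True:
        key = posixpath.join("zarr_checksums", zarr_id, directory, ".checksum")
        try:
            body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            checksum_file = json.loads(body)
        except s3.exceptions.NoSuchKey:
            checksum_file = {"entries": {}}
        checksum_file["entries"][name] = digest
        # Placeholder aggregation; one concrete option is sketched under
        # "Treehash algorithm" below.
        checksum_file["checksum"] = hashlib.md5(
            json.dumps(checksum_file["entries"], sort_keys=True).encode()
        ).hexdigest()
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(checksum_file))
        if not directory:
            break  # the root /zarr_checksums/{zarr_id}/.checksum was just written
        # The directory itself becomes an entry in its parent's .checksum.
        name, digest = posixpath.basename(directory), checksum_file["checksum"]
        directory = posixpath.dirname(directory)
```

A batch delete would walk the same loop, removing entries instead of adding them (with the extra wrinkle, noted above, of dropping a `.checksum` file entirely when its directory becomes empty).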

### .checksum file format

TODO
TODO include sha256 as it is calculated? are we even calculating sha256 for zarr files?
Member (@yarikoptic, Oct 6, 2021):
good question: it might become a huge load on the backend server to send 1 mil queries for off-band checksumming of tiny files... probably we can get away with just computing md5/etag?

edit: FWIW knowing sha256 might be useful for dataladification, but it is not strictly required, thus probably not worth the effort.

Contributor Author:
Yeah, my thought was that sha256 might come in handy for a .nwb file, but taking the sha256 of a zarr file doesn't make any sense to me. If we were to sha256 each subfile we could throw it into the .checksum file somewhere, but actually cascading it upwards seems redundant with the ETag based treehash we are already computing.


### Treehash algorithm

TODO it obviously needs to include all the checksums for all the files in the directory, but assembled how?
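
One possible (strawman) answer, written out only to make the open question concrete: hash the directory's entries in sorted order, so the result is independent of upload order.

```python
import hashlib


def aggregate_checksum(entries: dict) -> str:
    """Strawman directory checksum: md5 over sorted "name:checksum" pairs.

    `entries` maps a file or subdirectory name to its ETag (for files) or
    its aggregated checksum (for subdirectories).
    """
    joined = ";".join(f"{name}:{digest}" for name, digest in sorted(entries.items()))
    return hashlib.md5(joined.encode()).hexdigest()
```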

## Publishing

When a dandiset containing a zarr file is published, normal Assets are simply added to the new published dandiset, which is a DB only operation and therefore very fast.
This doesn't work for zarr files, since we want to make sure that subsequent modifications to the draft zarr file don't impact the published zarr file.

Fortunately, fast publishing is not a requirement for dandisets, just a happy coincidence of our deduplication scheme.
On publish, simply trigger a job that copies the zarr file S3 data to a new location (TODO where is that exactly?).
Member:
would require locking that 'zarr file' from modifications in the interim. copying could take hours or days... also would unnecessarily double the size of the archive (or more, if multiple versions are published with e.g. tiny changes to the 'zarr file' content)

I really think we should look into utilizing S3 versioning + an external "manifest" file with some custom backend for zarr, or maybe somehow fsspec_reference_maker (e.g. as used here for indexing access to .nc files).

Contributor Author:
Good point about locking. I think we could add an error message if clients try to use {zarr_file}/batch operations while a publish is happening, which would effectively lock them out of any modifications.

Yep, lots of tiny publishes would be bad. I don't see any way to store the file accessibly in S3 without copying on publish. The alternative would involve a custom zarr backend. I delegate designing that to @yarikoptic

Member:
To "store" we don't need a backend; we just need to get a listing with S3 versionIds for each key.

It is to access some previous state/version that a custom zarr backend OR maybe some support at the level of fsspec would be needed. Ok -- I will inquire within the zarr/fsspec communities on that (and possibly other) aspects.

Member (@yarikoptic, Oct 6, 2021):
while composing a question on SO I ran into zarr-developers/zarr-specs#82, which I think hits the use case, but it is still just in a "mind storming" mode :-/ need to digest it.

edit: an idea to check further -- to implement it at fsspec level: zarr-developers/zarr-specs#82 (comment)

Immediately after publishing, the zarr data will be incomplete while the S3 copy is still in progress.
The UI communicates the current status of the published version and alerts the user that because their dandiset contains zarr files, publish may take some time.
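
For concreteness, the copy job could be as simple as the following sketch. The published prefix layout is an assumption (this is the "TODO where is that exactly?" above), and the copies are server-side, so no data flows through the API server.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "dandi-bucket"  # placeholder bucket name


def copy_zarr_for_publish(zarr_id: str, version: str) -> None:
    """Copy all draft zarr subfiles to an immutable published prefix."""
    src_prefix = f"zarr/{zarr_id}/"
    dst_prefix = f"zarr-published/{zarr_id}/{version}/"  # assumed layout
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            # Server-side copy; objects over 5 GB would need a multipart copy.
            s3.copy_object(
                Bucket=BUCKET,
                Key=dst_prefix + obj["Key"][len(src_prefix):],
                CopySource={"Bucket": BUCKET, "Key": obj["Key"]},
            )
```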