Second zarr design doc #552
# Abstract

Zarr files are stored in their entirety in S3.
Zarr files are represented by a single Asset through the API.
A treehash is computed as the zarr file is uploaded.
Each node of the treehash is stored in S3 to avoid bloating the DB.
When published, the zarr file is copied to a separate location in the bucket so that changes to the draft cannot affect the published file.

# Requirements

1. Zarr files are stored in a "directory" in S3.
1. Each zarr file corresponds to a single Asset.
   > **Review comment:** "Zarr file" has been used to describe the zarr "directory" (different backends might store it differently, including as a single file). Should we set the terminology straight and call it a "zarr folder", "zarr archive", or "zarr blob", to avoid confusion with the individual "zarr files" constituting it? Otherwise the above suggests that "each zarr file [among those zarr files mentioned above] corresponds to a single Asset", which is not the case. (And reflect that everywhere else in the document text and API endpoints.)
   >
   > **Reply:** In this document I've been referring to the entire zarr thing as a "zarr file", and the files that comprise it as "subfiles". If this design is worth investing more time into, I can rename things appropriately.
1. The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr file.
1. The API verifies the checksum _immediately_ after upload.
1. The system can theoretically handle zarr files with ~1 million subfiles, each of size 64 * 64 * 64 bytes ~= 262 kilobytes.
1. Zarr metadata must be mutable for zarr files in draft dandisets.

   > **Review comment:** Although metadata is likely to be the part that mutates most often, AFAIK there is no reason to discriminate between zarr metadata and data files (we aren't trying to semantically distinguish metadata from data files in the zarr folder, are we?). So maybe just require that any file be mutable?
   >
   > **Reply:** The requirement as I recall it was that metadata needs to be mutable, while the rest of the files may be mutable but don't have to be. In this list I'm trying to capture things that are needed, not things we'd like to have. In this implementation all files are mutable, but there might be some alternative proposal where they are not. The choice of implementation should not affect the list of requirements.
   >
   > **Reply:** OK, then I would prefer if we make it a requirement now that any of the "zarr files" should be modifiable, or we would be in the same boat quite quickly IMHO. WDYT @satra?
   >
   > **Reply:** I agree with the more general formulation here. In draft state everything is mutable, but the most likely change will be the metadata. We don't have to optimize for this at this stage; simply allow files to be overwritten/deleted/added.

# Implementation

## API endpoints

* POST /api/zarr_files/
  Start a new zarr file.
  Returns a zarr ID.

  > **Review comment:** I know we are calling this […]
* GET /api/zarr_files/{zarr_id}/
  Returns some information about the zarr file, like name, S3 path+URL, and checksum.
* POST /api/zarr_files/{zarr_id}/batch/start/
  Ask to upload a batch of subfiles as part of the zarr file.
  No more batch uploads are permitted until this one is completed or aborted.
  Requires a list of file paths and ETags (MD5 checksums).
  The file paths may include already uploaded files; this is how updates are done.
  Returns a list of presigned upload URLs.

  > **Review comment:** Minor (just curious if there is a good existing one to match): what would be the return code to inform about an ongoing batch upload?
  >
  > **Reply:** https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/409 seems appropriate. Only one batch at a time per zarr, so no concurrent uploads. If an upload crashes and someone else needs to take over, they can cancel the pending batch upload. If we ever need multiple simultaneous uploads, we could have a batch ID and allow multiple batches simultaneously, but that sounds needlessly complex right now for a feature no one asked for. My theory was that a single CLI instance would request a batch of a reasonable size, then upload all the parts simultaneously to keep the pipe to S3 as full as possible.
  >
  > **Reply:** Agree: no simultaneous batches, but allow different users to upload (whenever there is no ongoing batch) to the same "zarr file". The question remains: how will we ensure the consistency/integrity of the entire "zarr file" and not just a particular batch? (In particular in the aforementioned "multiple users can upload" case, but this can also manifest with the same user uploading while the local "zarr file" changes between batches.) I guess it could be up to the CLI to get a full checksum for the "zarr file" upon completion and compare with the local version (which might require heavy traversal, though fscache might help)?
  >
  > **Reply:** Good point, in the current design the checksum is not actually checked anywhere. We could simply delegate to the CLI to independently verify that the current reported checksum for the uploaded zarr file is the same as the locally computed one. It's worth pointing out that each file in each batch does have its ETag checked during the batch complete step.
* POST /api/zarr_files/{zarr_id}/batch/complete/
  Completes an upload of a batch of subfiles.
  Fails if any of the checksums do not match.
  Also recursively updates the tree hash for every parent node of each uploaded child.
* DELETE /api/zarr_files/{zarr_id}/batch/
  Cancels a batch upload.
  Any files already uploaded are deleted.
  A new batch upload can then be started.

  > **Review comment:** What happens to the tree hash now that some nodes are to be removed? I guess the […]
  >
  > **Reply:** It would be recomputed by the API server for that directory level, exactly as if a new node had been added. There's a little more complexity around deleting the last file in a directory, but for design purposes right now, the API server will ensure the treehash is correct when completing a batch upload (create/update) or a batch delete (delete).
* DELETE /api/zarr_files/{zarr_id}/batch/delete/
  Deletes subfiles from S3, and updates the treehash accordingly.
  Requires a list of file paths.

  > **Review comment:** […] with mandatory […]
  >
  > **Reply:** My thought was that this was the delete-y counterpart to […] Also, as specified, it accepts a list of paths to delete, so it is technically deleting a batch of subfiles. It's less involved than the upload process, but it's still arguably a batch. I don't feel super strongly about naming semantics; we can hash out API details more if the overall design is acceptable.

* POST /api/dandisets/{...}/versions/{...}/assets/zarr/{zarr_id}/
  Creates a new Asset that points to this zarr data.
  Returns an asset ID.

  > **Review comment:** Why not just reuse the existing […] which is provided […]?
  >
  > **Reply:** I don't like the idea of accepting a […]

## Upload flow

TODO. The only gotcha is that published zarr files can't be uploaded to, updated, or deleted from.
The slowest API request in the upload flow is the `.../batch/complete/` operation, since it needs to update several `.checksum` files in S3.
I recommend that the CLI dynamically adjust batch size to keep the `.../batch/complete/` operation under 30s.

> **Review comment:** CLIs should not be burdened with timing server operations! The max batch size should be set by/on the server.
>
> **Reply:** We can enforce some maximum batch size to ensure the finalization never takes more than 30s, but I thought allowing the CLI to choose its batch size was a neat feature.

## Modification flow

TODO. This is basically the same as the upload flow, but with deletes allowed.

> **Review comment:** Note: so it will be up to the client to first get a full list of the subfile paths of the "zarr file", figure out what to delete, delete them, and then upload the modified files. Without locking, there is a chance of a race condition.
>
> **Reply:** I'm pretty sure if you're modifying, you can just request to upload to an existing path and S3 will just overwrite the old object, so no delete is necessary.

## Hashing scheme

To avoid having to store a checksum for every directory in every zarr file, we will delegate the storage of the checksums to S3.
Update times will be substantially slower since each level of the tree needs to be updated.
Storage costs will be offloaded to S3, and the treehash will be conveniently accessible to clients.

Let's say the zarr files are kept under `/zarr/{zarr_id}/...` in S3.

> **Review comment:** Most likely we would like to have something similar to […]

A subfile `1/2/3/4.txt` of zarr file `foobar` would be kept at `/zarr/foobar/1/2/3/4.txt`.
The ETag can be trivially determined by a `HEAD` request to that S3 object.
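A sketch of both sides of that check; the function names and URL handling are illustrative, not part of this design:

```python
import hashlib
import urllib.request


def expected_etag(data: bytes) -> str:
    """For a non-multipart S3 upload, the ETag is the MD5 hex digest."""
    return hashlib.md5(data).hexdigest()


def s3_etag(object_url: str) -> str:
    """Read the ETag header via a HEAD request, without downloading the body."""
    req = urllib.request.Request(object_url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers["ETag"].strip('"')
```

Note that multipart uploads produce a different ETag shape (`<md5-of-part-md5s>-<part-count>`), so this equivalence only holds if each subfile is uploaded in a single part.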

After this subfile is uploaded, the API server creates a file `/zarr_checksums/foobar/1/2/3/.checksum` if it doesn't already exist (note the altered prefix).

> **Review comment:** Note: this would be "spreading" zarr-relevant metadata across distant locations (within […]). Note: AFAIK we could have […]
>
> **Reply:** I have no objection to keeping them more proximate; I just didn't want to include […]
>
> **Reply:** Yes, we should not pollute it with anything ad hoc.

This file contains a list of all the files/directories currently in the directory, their checksums, and an aggregated checksum of the entire directory contents.
If `.checksum` already exists, it is updated with the new file+ETag and the final checksum is updated.

> **Review comment:** What would be the size of such a file for a sizeable zarr file (e.g. 1 million subfiles)? Wouldn't it be prohibitive? Ideally we should check within the zarr ecosystem/community whether something like this is already implemented or thought about. As I have mentioned elsewhere, there is already some support for consolidated metadata, and issues (e.g. slow listing of large "zarr files" on S3) have already been raised.
>
> **Reply:** There's basically a […] If there is data that ugly that might get uploaded to DANDI, then that would be a great requirement to know about.

The API server then recursively travels up another level and creates `/zarr_checksums/foobar/1/2/.checksum`.
This `.checksum` will contain an entry for the directory `3` and the final checksum from `3/.checksum`.
This recursion continues all the way to the top, yielding a `/zarr_checksums/foobar/.checksum` file which contains an aggregated checksum for the entire zarr file.
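The recursive update can be sketched server-side. A dict stands in for the S3 bucket, and both the `.checksum` contents and the aggregation rule here are assumptions (the actual file format and treehash algorithm are TODO below):

```python
import hashlib
import json
from pathlib import PurePosixPath

store: dict[str, str] = {}  # stand-in for the S3 bucket: key -> object body


def directory_digest(entries: dict[str, str]) -> str:
    """Aggregate digest of one directory: MD5 over its sorted entries.

    Illustrative only; the real aggregation rule is TODO in this doc.
    """
    return hashlib.md5(json.dumps(sorted(entries.items())).encode()).hexdigest()


def update_treehash(zarr_id: str, subfile_path: str, etag: str) -> None:
    """Rewrite every .checksum from the subfile's directory up to the root."""
    path = PurePosixPath(subfile_path)
    child_name, child_digest = path.name, etag
    for parent in path.parents:  # deepest directory first, "." (the root) last
        rel = "" if str(parent) == "." else f"{parent}/"
        key = f"/zarr_checksums/{zarr_id}/{rel}.checksum"
        doc = json.loads(store.get(key, '{"entries": {}}'))
        doc["entries"][child_name] = child_digest
        doc["checksum"] = directory_digest(doc["entries"])
        store[key] = json.dumps(doc)
        # The directory itself becomes an entry one level up.
        child_name, child_digest = parent.name, doc["checksum"]
```

With this layout, uploading the example subfile `1/2/3/4.txt` writes four objects: one `.checksum` per directory level plus the root, which is why `.../batch/complete/` is the slow step.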

### .checksum file format

TODO.
TODO: include sha256 as it is calculated? Are we even calculating sha256 for zarr files?

> **Review comment:** Good question: it might become a huge load on the backend server to send 1 million queries for out-of-band checksumming of tiny files... probably we can get away with just computing md5/ETag? Edit: FWIW, knowing sha256 might be useful for dataladification, but it is not strictly required, so effort probably should not be wasted on it.
>
> **Reply:** Yeah, my thought was that sha256 might come in handy for a .nwb file, but taking the sha256 of a zarr file doesn't make any sense to me. If we were to sha256 each subfile we could throw it into the .checksum file somewhere, but actually cascading it upwards seems redundant with the ETag-based treehash we are already computing.
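As a starting point for the TODO above, one possible (purely illustrative) JSON layout for a `.checksum` file; all digest values below are dummies:

```json
{
  "entries": {
    "4.txt": {"digest": "5d41402abc4b2a76b9719d911017c592", "kind": "file"},
    "sub": {"digest": "6f5902ac237024bdd0c176cb93063dc4", "kind": "directory"}
  },
  "checksum": "9e107d9d372bb6826bd81d3542a419d6"
}
```

File entries would carry the subfile's ETag, directory entries the aggregated checksum from that directory's own `.checksum`, and `checksum` the aggregate over all entries.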

### Treehash algorithm

TODO: it obviously needs to include all the checksums for all the files in the directory, but assembled how?
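Whatever aggregation rule is chosen, the CLI needs to compute the same treehash over its local copy to verify an upload (per the discussion of the batch endpoints above). A sketch, under the assumption that a directory's digest is the MD5 of its sorted `name:digest` lines; this exact rule is an illustration, not a decision:

```python
import hashlib
from pathlib import Path


def local_treehash(directory: Path) -> str:
    """Bottom-up treehash over a local zarr directory (illustrative rule)."""
    lines = []
    for entry in sorted(directory.iterdir()):
        if entry.is_dir():
            digest = local_treehash(entry)  # recurse into subdirectories
        else:
            # A subfile contributes its ETag (plain MD5 of its bytes).
            digest = hashlib.md5(entry.read_bytes()).hexdigest()
        lines.append(f"{entry.name}:{digest}")
    return hashlib.md5("\n".join(lines).encode()).hexdigest()
```

After the last batch completes, the CLI would compare this value against the root checksum reported by `GET /api/zarr_files/{zarr_id}/`.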
## Publishing

When a dandiset containing a zarr file is published, normal Assets are simply added to the new published dandiset, which is a DB-only operation and therefore very fast.
This doesn't work for zarr files, since we want to make sure that subsequent modifications to the draft zarr file don't impact the published zarr file.

Fortunately, fast publishing is not a requirement for dandisets, just a happy coincidence of our deduplication scheme.
On publish, simply trigger a job that copies the zarr file S3 data to a new location (TODO: where is that, exactly?).

> **Review comment:** This would require locking the "zarr file" against modifications in the interim; copying could take hours or days. It would also needlessly double the size of the archive (or more, if multiple versions are published with e.g. tiny changes to the "zarr file" content). I really think we should look into utilizing S3 versioning plus an external "manifest" file with some custom backend for zarr, or maybe somehow […]
>
> **Reply:** Good point about locking. I think we could add an error message if clients try to use […] Yep, lots of tiny publishes would be bad. I don't see any way to store the file accessibly in S3 without copying on publish. The alternative would involve a custom zarr backend; I delegate designing that to @yarikoptic.
>
> **Reply:** To "store" we don't need a backend; we just need to get a listing with S3 versionIds for each key. It is to access some previous state/version that a custom zarr backend, or maybe some support at the level of fsspec, would be needed. OK, I will inquire within the zarr/fsspec communities on that (and possibly other) aspects.
>
> **Reply:** While composing a question on SO I ran into zarr-developers/zarr-specs#82, which I think hits the use case, but it is still just in a "mind storming" mode; I need to digest it. Edit: an idea to check further, implementing it at the fsspec level: zarr-developers/zarr-specs#82 (comment)

Immediately after publishing, the zarr data will be incomplete while the S3 copy is still in progress.
The UI communicates the current status of the published version and alerts the user that because their dandiset contains zarr files, publish may take some time.

> **Review comment:** Expensive operation; this could cause publishing to take days...
>
> **Reply:** Yes, sadly. IIRC, immediate publishes were not a requirement, just a nice feature.
>
> **Reply:** We can instead copy to a new draft location and leave that entity published, so a user may have to explicitly initiate a draft for zarr-based dandisets. I would flip this operation, but there are a few other details to consider.
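The copy-on-publish job can be sketched as a prefix copy. A dict again stands in for S3; the published destination prefix is an assumption (the doc leaves the location as a TODO), and a real implementation would be a paginated `ListObjectsV2` + `CopyObject` loop in a background worker:

```python
def publish_zarr(store: dict[str, bytes], zarr_id: str, version: str) -> int:
    """Copy every draft object of a zarr file under a published prefix.

    Returns the number of objects copied. While this runs, the draft must
    be locked against batch uploads/deletes, per the discussion above.
    """
    src = f"/zarr/{zarr_id}/"
    dst = f"/zarr/published/{version}/{zarr_id}/"  # hypothetical destination
    copied = 0
    for key in list(store):  # snapshot keys; we mutate the dict while copying
        if key.startswith(src):
            store[dst + key[len(src):]] = store[key]
            copied += 1
    return copied
```

At ~1 million subfiles per zarr file, even server-side copies make this a long-running job, which is why the UI must surface the in-progress state described above.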