Second zarr design doc #552

Merged: 4 commits merged into master from enh-zarr-support-design-2 on Oct 20, 2021
Conversation

@dchiquito (Contributor) commented Oct 5, 2021:

I said I would do some thinking about alternatives to having an alternative zarr backend, and this is what I came up with.
I think the API/upload bits are pretty uncontroversial, no need to critique too deeply right now.
The .checksum file is newly written down, but I think it was @satra's idea so also uncontroversial I hope.

The bit at the end about publishing is the downside. Thoughts appreciated.

This is a reformulation of #295, meant to subsume it.

@dchiquito added the design-doc label on Oct 5, 2021
@yarikoptic (Member) left a comment:

Did an initial pass and left comments. I think we had better reach out to the zarr/fsspec community(ies) with our use case and desires (large sizes, updates, checksumming, versioning, use of S3, though ideally we should not rely on S3-specific features) and see if they have pointers or ideas for the most efficient implementation at the moment. Otherwise we would just be creating a poorly informed ad-hoc design.

Zarr files are represented by a single Asset through the API.
A treehash is computed as the zarr file is uploaded.
Each node in the treehash is stored in S3 to avoid bloating the DB.
When published, the zarr file is copied to a separate location in the bucket so that changes to the draft cannot affect the published file.
Member:

This is an expensive operation; it could possibly make publishing take days...

@dchiquito (Contributor Author):

Yes, sadly. IIRC immediate publishes were not a requirement, just a nice feature.

Member:

We can instead copy to a new draft location and leave that entity published, so a user may have to explicitly initiate a draft for zarr-based dandisets. I would flip this operation, but there are a few other details to consider.

1. The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr file.
1. The API verifies the checksum _immediately_ after upload.
1. The system can theoretically handle zarr files with ~1 million subfiles, each of size 64 * 64 * 64 bytes ~= 262 kilobytes.
1. Zarr metadata must be mutable for zarr files in draft dandisets.
Member:

Although metadata is likely to be the part that mutates most often, AFAIK there is no reason to discriminate between zarr metadata and data files (we aren't going to bother semantically distinguishing metadata from data files in the zarr folder, are we?). So maybe just:
Suggested change
1. Zarr metadata must be mutable for zarr files in draft dandisets.
1. Zarr files must be mutable in draft dandisets.

?

@dchiquito (Contributor Author):

The requirement as I recall it was that metadata needs to be mutable, while the rest of the files may be mutable but don't have to be. In this list I'm trying to capture things that are needed, not things we'd like to have.

In this implementation all files are mutable, but there might be some alternative proposal where they are not. The choice of implementation should not affect the list of requirements.

Member:

OK, then I would prefer that we make it a requirement now that any of the "zarr files" should be modifiable, or we would be in the same boat quite quickly IMHO. WDYT @satra?

Member:

I agree with the more general formulation here. In the draft state everything is mutable, but the most likely change will be to the metadata. We don't have to optimize for this at this stage; simply allow files to be overwritten/deleted/added.

Comment on lines 11 to 12
1. Zarr files are stored in a "directory" in S3.
1. Each zarr file corresponds to a single Asset.
Member:

"Zarr file" has been used to describe the zarr "directory" (well, different backends might store it differently, including as a single file). But I wonder if we should set the terminology a little more straight and call it a "zarr folder", "zarr archive", or "zarr blob", so as not to confuse it with the individual "zarr files" constituting the zarr file/folder/archive/blob. Otherwise the above suggests that "each zarr file [among those zarr files mentioned above] corresponds to a single Asset", which is not the case. So what about:

Suggested change
1. Zarr files are stored in a "directory" in S3.
1. Each zarr file corresponds to a single Asset.
1. A Zarr blob is stored as files in a "directory" in S3.
1. Each zarr blob corresponds to a single Asset.

? (and reflect that everywhere else in the document text/API endpoint)

@dchiquito (Contributor Author):

In this document I've been referring to the entire zarr thing as a "zarr file", and the files that comprise it as "subfiles". If this design is worth investing more time into, I can rename things appropriately.

Returns some information about the zarr file, like name, S3 path+url, and checksum
* POST /api/zarr_files/{zarr_id}/batch/start/
Ask to upload a batch of subfiles as part of the zarr file.
No more batch uploads are permitted until this one is completed or aborted.
Member:

Minor (just curious whether there is a good existing one to match): what would be the return code to inform about an ongoing batch upload?
Will different batches from different users be allowed? How would we guarantee consistency of the entire "zarr file", rather than just of a particular (partial "file") batch?

@dchiquito (Contributor Author):

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/409 seems appropriate.

Nope, only one batch at a time per zarr, so no concurrent uploads. If an upload crashes and someone else needs to take over, they can cancel the pending batch upload. If we ever need multiple simultaneous uploads, we could introduce a batch ID and allow multiple batches at once, but that just sounds needlessly complex right now for a feature no one asked for.

My theory was that a single CLI instance would request a batch of a reasonable size, then upload all the parts simultaneously to keep the pipe to S3 as full as possible.
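To make that flow concrete, here is a minimal sketch of what the CLI side might look like. The `/batch/start/` and `/batch/complete/` endpoints come from this document, but the payload shapes (`paths`, `upload_urls`, `etags`) and the base URL are hypothetical:

```python
import concurrent.futures

import requests

API = "https://example.org/api"  # hypothetical base URL


def upload_batch(zarr_id: str, files: dict) -> None:
    """Upload one batch of subfiles (path -> bytes); one batch in flight per zarr file."""
    # Start a batch; per the discussion above, a 409 signals an ongoing batch.
    resp = requests.post(f"{API}/zarr_files/{zarr_id}/batch/start/",
                         json={"paths": sorted(files)})
    if resp.status_code == 409:
        raise RuntimeError("another batch upload is already in progress")
    resp.raise_for_status()
    # Hypothetical response field mapping each subfile path to a presigned S3 URL.
    upload_urls = resp.json()["upload_urls"]

    # Upload all parts simultaneously to keep the pipe to S3 as full as possible.
    def put(path: str) -> str:
        r = requests.put(upload_urls[path], data=files[path])
        r.raise_for_status()
        return r.headers["ETag"]  # S3 returns the ETag of the stored object

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        etags = dict(zip(files, pool.map(put, files)))

    # Complete the batch; the server re-verifies each ETag against S3.
    requests.post(f"{API}/zarr_files/{zarr_id}/batch/complete/",
                  json={"etags": etags}).raise_for_status()
```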

Member:

Agree: no simultaneous batches, but allow different users to upload (whenever there is no ongoing batch) to the same "zarr file".

The question remains: how will we ensure the consistency/integrity of the entire "zarr file" and not just of a particular batch? (This matters particularly in the aforementioned "multiple users can upload" case, but it can also manifest when the same user uploads while the local "zarr file" changes between batches.) I guess it could be up to the CLI to get a full "checksum" for the "zarr file" upon completion and compare it with the local version (which might require a heavy traversal, but maybe fscache would help)?

@dchiquito (Contributor Author):

Good point; in the current design the checksum is not actually checked anywhere. We could simply delegate to the CLI to independently verify that the currently reported checksum for the uploaded zarr file is the same as the locally computed one.

It's worth pointing out that each file in each batch does have its ETag checked during the batch complete step.
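For reference, that check is metadata-only on the S3 side; with boto3 (bucket name hypothetical), the ETag of an already-uploaded object is one HEAD request away:

```python
import boto3

s3 = boto3.client("s3")
head = s3.head_object(Bucket="dandi-bucket", Key="zarr/foobar/1/2/3/4.txt")
etag = head["ETag"].strip('"')  # S3 wraps ETag values in double quotes
```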

A subfile `1/2/3/4.txt` of zarr file `foobar` would be kept at `/zarr/foobar/1/2/3/4.txt`.
The ETag can be trivially determined by a `HEAD` request to that S3 object.

After this subfile is uploaded, the API server creates a file `/zarr_checksums/foobar/1/2/3/.checksum` if it doesn't already exist (note the altered prefix).
Member:

Note: this would be "spreading" zarr-relevant metadata across distant locations (within the /zarr and /zarr_checksums hierarchies), making it a little odd. Since foobar is really our zarr_id (thus we have control over its name), we could use /zarr/foobar.checksums/ or the like. That would keep them nearby, making them easier for anyone to discover/use/etc.

Note: AFAIK we could have a /zarr/foobar.checksum file and a /zarr/foobar.checksum/1 file on S3, i.e. /zarr/foobar.checksum could be both a "file" and a "directory". A "nice" feature of S3, but we ideally should avoid using it, so as not to complicate local backups etc., since it wouldn't be representable on a regular file system.
Then it could be /zarr/foobar.checksum/.checksum and so on down the hierarchy.

@dchiquito (Contributor Author):

I have no objection to keeping them more proximate; I just didn't want to include .checksums in the actual zarr directories themselves, to avoid "polluting" real data with our checksum garbage.
I explicitly decided against storing checksums as objects with paths that are directory names, for exactly the reasons you outlined. I see no reason to do that at any point.

Member:

> ... I just didn't want to include .checksums in the actual zarr directories themselves to avoid "polluting" real data with our checksum garbage

Yes, we should not pollute it with anything ad-hoc.


After this subfile is uploaded, the API server creates a file `/zarr_checksums/foobar/1/2/3/.checksum` if it doesn't already exist (note the altered prefix).
This file contains a list of all the files/directories currently in the directory, their checksums, and an aggregated checksum of the entire directory contents.
If `.checksum` already exists, it is updated with the new file+ETag and the final checksum is updated.
Member:

What would be the size of such a file for a sizeable zarr file (e.g. one with 1 million 'zarr files')? I.e., wouldn't it be prohibitive?

Ideally we should check within the zarr ecosystem/community whether something like that is already implemented or has been thought about. As I have mentioned elsewhere, there is already some support for consolidated metadata, and issues of slow listing of large "zarr files" on S3 (e.g. this one) have already been raised.

@dchiquito (Contributor Author):

There's basically a .checksum file per directory. If your zarr file has a directory that contains 100,000 files/directories, then yes indeed, the cost would be prohibitive. My understanding was that the whole value proposition of zarr is the tree structure used to store the subfiles, so if a zarr file has a profoundly unbalanced tree, it's not very useful.

If there is data that ugly that might get uploaded to dandi, that would be a great requirement to know about.
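As a sketch of the roll-up being discussed, assuming the aggregate is an md5 over sorted name/checksum pairs (the actual digest scheme is still a TODO below): each subfile upload only touches the `.checksum` files of its ancestors, so the per-upload work is proportional to tree depth, not tree size, as long as directories stay reasonably small.

```python
import hashlib


def aggregate_checksum(entries: dict) -> str:
    """Combine child checksums (file ETags or subdirectory aggregates) into one digest.

    `entries` maps a child name to its checksum; sorting keeps the result
    independent of upload order.
    """
    digest = hashlib.md5()
    for name in sorted(entries):
        digest.update(f"{name}:{entries[name]}\n".encode())
    return digest.hexdigest()


# Roll up from the leaves: a directory's aggregate becomes an entry in its parent.
leaf_dir = aggregate_checksum({"4.txt": "79b2b2ad656b1b6b09b9a67a2ff34af9"})
parent = aggregate_checksum({"3": leaf_dir})
```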

### .checksum file format

TODO
TODO include sha256 as it is calculated? are we even calculating sha256 for zarr files?
@yarikoptic (Member) commented Oct 6, 2021:

Good question: it might become a huge load on the backend server to send 1 million queries for out-of-band checksumming of tiny files... probably we can get away with just computing the md5/ETag?

Edit: FWIW, knowing the sha256 might be useful for dataladification, but it is not strictly required, so effort probably should not be wasted on it.

@dchiquito (Contributor Author):

Yeah, my thought was that sha256 might come in handy for a .nwb file, but taking the sha256 of a zarr file doesn't make any sense to me. If we were to sha256 each subfile, we could throw it into the .checksum file somewhere, but actually cascading it upwards seems redundant with the ETag-based treehash we are already computing.
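Purely as a strawman for that format TODO, one possible shape for a `.checksum` file; every field name and value here is hypothetical:

```python
# Hypothetical contents of /zarr_checksums/foobar/1/2/3/.checksum
checksum_doc = {
    "checksums": {
        "4.txt": "79b2b2ad656b1b6b09b9a67a2ff34af9",   # subfile -> S3 ETag
        "5": "2f4a3b6a0e0a8c1d5b9e7c3d1a2b4c5d",       # subdirectory -> its aggregate
    },
    # Aggregated checksum of this directory's entire contents.
    "checksum": "5e4fd5a1a0e4a3c3a9d6a5cbb77e2d11",
}
```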

This doesn't work for zarr files, since we want to make sure that subsequent modifications to the draft zarr file don't impact the published zarr file.

Fortunately, fast publishing is not a requirement for dandisets, just a happy coincidence of our deduplication scheme.
On publish, simply trigger a job that copies the zarr file S3 data to a new location (TODO where is that exactly?).
Member:

This would require locking that 'zarr file' against modifications in the interim; the copying could take hours or days... It would also unnecessarily double the size of the archive (or worse, if multiple versions are published with e.g. tiny changes to the 'zarr file' content).

I really think we should look into utilizing S3 versioning plus an external "manifest" file, with some custom backend for zarr, or maybe somehow fsspec_reference_maker (e.g. as used here for indexing access to .nc files).

@dchiquito (Contributor Author):

Good point about locking. I think we could return an error if clients try to use {zarr_file}/batch operations while a publish is happening, which would effectively lock them out of any modifications.

Yep, lots of tiny publishes would be bad. I don't see any way to store the file accessibly in S3 without copying on publish. The alternative would involve a custom zarr backend; I delegate designing that to @yarikoptic.
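For scale, the publish copy would at least be server-side (no data flows through the API). A sketch with boto3; the published prefix is hypothetical, per the doc's own "TODO where is that exactly?":

```python
import boto3

s3 = boto3.client("s3")


def copy_zarr_on_publish(bucket: str, zarr_id: str, version: str) -> None:
    """Server-side copy of every subfile from the draft prefix to a published one."""
    src_prefix = f"zarr/{zarr_id}/"
    dst_prefix = f"zarr-published/{zarr_id}/{version}/"  # hypothetical destination
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            s3.copy_object(
                Bucket=bucket,
                Key=dst_prefix + obj["Key"][len(src_prefix):],
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
            )
```

Even server-side, this is still one copy request per subfile, which is where the hours-or-days concern above comes from.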

Member:

To "store" it we don't need a backend; we just need to get a listing with an S3 versionId per key.

It is accessing some previous state/version that would need a custom zarr backend, or maybe some support at the level of fsspec. OK, I will inquire within the zarr/fsspec communities on that (and possibly other) aspects.
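For comparison with the copy approach, building such a versionId manifest is a single listing pass over the bucket (no data is copied); a sketch with boto3:

```python
import boto3

s3 = boto3.client("s3")


def version_manifest(bucket: str, prefix: str) -> dict:
    """Map each current key under `prefix` to its S3 VersionId."""
    manifest = {}
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for version in page.get("Versions", []):
            if version["IsLatest"]:
                manifest[version["Key"]] = version["VersionId"]
    return manifest
```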

@yarikoptic (Member) commented Oct 6, 2021:

While composing a question on SO, I ran into zarr-developers/zarr-specs#82, which I think hits the use case, but it is still just in a "mind storming" mode :-/ I need to digest it.

Edit: an idea to check further would be to implement it at the fsspec level: zarr-developers/zarr-specs#82 (comment)


## API endpoints

* POST /api/zarr_files/
Member:

I know we are calling this zarr_files, but I would suggest a more generic name; zarr is one of many folder-based structures.

@dchiquito (Contributor Author):

Merging this into the archived design docs directory in favor of #574.

@dchiquito merged commit 315620c into master on Oct 20, 2021
@dchiquito deleted the enh-zarr-support-design-2 branch on October 20, 2021 at 22:45
waxlamp pushed a commit referencing this pull request on Jan 7, 2022: "Properly handle Django user initials"