Second zarr design doc #552

Merged: 4 commits merged into master from enh-zarr-support-design-2 on Oct 20, 2021
Conversation

@dchiquito (Contributor) commented Oct 5, 2021:

I said I would do some thinking about alternatives to having an alternative zarr backend, and this is what I came up with.
I think the API/upload bits are pretty uncontroversial, no need to critique too deeply right now.
The .checksum file is newly written down, but I think it was @satra's idea so also uncontroversial I hope.

The bit at the end about publishing is the downside. Thoughts appreciated.

This is a reformulation of #295, meant to subsume it.

@dchiquito added the design-doc label on Oct 5, 2021
@yarikoptic (Member) left a comment:

Did an initial pass and left comments. I think we had better reach out to the zarr/fsspec community(ies) with our use case and desires (large sizes, updates, checksumming, versioning, use of S3, though ideally we should not rely on S3-specific features) and see if they have pointers or ideas for the most efficient implementation at the moment. Otherwise we would just be creating a poorly informed ad-hoc design.

Zarr files are represented by a single Asset through the API.
A treehash is computed as the zarr file is uploaded.
Each node in the treehash is stored in S3 to avoid bloating the DB.
When published, the zarr file is copied to a separate location in the bucket so that changes to the draft cannot affect the published file.
Member:

This is an expensive operation; it could possibly make publishing take days...

@dchiquito (Contributor Author):

Yes, sadly. IIRC immediate publishes were not a requirement, just a nice feature.

Member:

We can instead copy to a new draft location and leave that entity published, so a user may have to explicitly initiate a draft for zarr-based dandisets. I would flip this operation, but there are a few other details to consider.

1. The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr file.
1. The API verifies the checksum _immediately_ after upload.
1. The system can theoretically handle zarr files with ~1 million subfiles, each of size 64 * 64 * 64 bytes ~= 262 kilobytes.
1. Zarr metadata must be mutable for zarr files in draft dandisets.
Member:

Although metadata is likely to be the part that mutates most often, AFAIK there is no reason to discriminate between zarr metadata and data files (we aren't going to bother semantically distinguishing metadata from data files in the zarr folder, are we?). So maybe just:
Suggested change
1. Zarr metadata must be mutable for zarr files in draft dandisets.
1. Zarr files must be mutable in draft dandisets.

?

@dchiquito (Contributor Author):

The requirement as I recall it was that metadata needs to be mutable, while the rest of the files may be mutable but don't have to be. In this list I'm trying to capture things that are needed, not things we'd like to have.

In this implementation all files are mutable, but there might be some alternative proposal where they are not. The choice of implementation should not affect the list of requirements.

Member:

OK, then I would prefer that we make it a requirement now that any of the "zarr files" should be modifiable, or we would be in the same boat quite quickly IMHO. WDYT @satra?

Member:

I agree with the more general formulation here. In the draft state everything is mutable, but the most likely change will be to the metadata. We don't have to optimize for this at this stage; simply allow files to be overwritten/deleted/added.

Comment on lines 11 to 12
1. Zarr files are stored in a "directory" in S3.
1. Each zarr file corresponds to a single Asset.
Member:

"Zarr file" has been used to describe the zarr "directory" (well, different backends might store it differently, including as a single file). But I wonder if we should set the terminology a little more straight and call it a "zarr folder", "zarr archive", or "zarr blob", so as not to confuse it with the individual "zarr files" constituting the zarr file/folder/archive/blob. Otherwise the above suggests that "each zarr file [among those zarr files mentioned above] corresponds to a single Asset", which is not the case. So what about:

Suggested change
1. Zarr files are stored in a "directory" in S3.
1. Each zarr file corresponds to a single Asset.
1. A Zarr blob is stored as files in a "directory" in S3.
1. Each zarr blob corresponds to a single Asset.

? (and reflect that everywhere else in the document text/API endpoint)

@dchiquito (Contributor Author):

In this document I've been referring to the entire zarr thing as a "zarr file", and the files that comprise it as "subfiles". If this design is worth investing more time into, I can rename things appropriately.

Returns some information about the zarr file, like name, S3 path+url, and checksum
* POST /api/zarr_files/{zarr_id}/batch/start/
Ask to upload a batch of subfiles as part of the zarr file.
No more batch uploads are permitted until this one is completed or aborted.
Member:

Minor (just curious whether there is a good existing one to match): what would be the return code to inform about an ongoing batch upload?
Will different batches from different users be allowed? How would we guarantee consistency of the entire "zarr file", rather than just of a particular (partial "file") batch?

@dchiquito (Contributor Author):

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/409 seems appropriate.

Nope, only one batch at a time per zarr, so no concurrent uploads. If an upload crashes and someone else needs to take over, they can cancel the pending batch upload. If we ever need multiple simultaneous uploads, we could introduce a batch ID and allow multiple batches at once, but that just sounds needlessly complex right now for a feature no one asked for.

My theory was that a single CLI instance would request a batch of a reasonable size, then upload all the parts simultaneously to keep the pipe to S3 as full as possible.
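To make that flow concrete, here is a minimal sketch of what the CLI side might look like. The `/batch/start/` and `/batch/complete/` endpoints come from this document, but the payload shapes (`paths`, `upload_urls`, `etags`) and the base URL are hypothetical:

```python
import concurrent.futures

import requests

API = "https://example.org/api"  # hypothetical base URL


def upload_batch(zarr_id: str, files: dict) -> None:
    """Upload one batch of subfiles (path -> bytes); one batch in flight per zarr file."""
    # Start a batch; per the discussion above, a 409 signals an ongoing batch.
    resp = requests.post(f"{API}/zarr_files/{zarr_id}/batch/start/",
                         json={"paths": sorted(files)})
    if resp.status_code == 409:
        raise RuntimeError("another batch upload is already in progress")
    resp.raise_for_status()
    # Hypothetical response field mapping each subfile path to a presigned S3 URL.
    upload_urls = resp.json()["upload_urls"]

    # Upload all parts simultaneously to keep the pipe to S3 as full as possible.
    def put(path: str) -> str:
        r = requests.put(upload_urls[path], data=files[path])
        r.raise_for_status()
        return r.headers["ETag"]  # S3 returns the ETag of the stored object

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        etags = dict(zip(files, pool.map(put, files)))

    # Complete the batch; the server re-verifies each ETag against S3.
    requests.post(f"{API}/zarr_files/{zarr_id}/batch/complete/",
                  json={"etags": etags}).raise_for_status()
```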

Member:

Agree: no simultaneous batches, but allow different users to upload (whenever there is no ongoing batch) to the same "zarr file".

The question remains: how will we ensure the consistency/integrity of the entire "zarr file" and not just of a particular batch? (This matters particularly in the aforementioned "multiple users can upload" case, but it can also manifest when the same user uploads while the local "zarr file" changes between batches.) I guess it could be up to the CLI to get a full "checksum" for the "zarr file" upon completion and compare it with the local version (which might require a heavy traversal, but maybe fscache would help)?

@dchiquito (Contributor Author):

Good point; in the current design the checksum is not actually checked anywhere. We could simply delegate to the CLI to independently verify that the currently reported checksum for the uploaded zarr file is the same as the locally computed one.

It's worth pointing out that each file in each batch does have its ETag checked during the batch complete step.
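For reference, that check is metadata-only on the S3 side; with boto3 (bucket name hypothetical), the ETag of an already-uploaded object is one HEAD request away:

```python
import boto3

s3 = boto3.client("s3")
head = s3.head_object(Bucket="dandi-bucket", Key="zarr/foobar/1/2/3/4.txt")
etag = head["ETag"].strip('"')  # S3 wraps ETag values in double quotes
```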

A subfile `1/2/3/4.txt` of zarr file `foobar` would be kept at `/zarr/foobar/1/2/3/4.txt`.
The ETag can be trivially determined by a `HEAD` request to that S3 object.

After this subfile is uploaded, the API server creates a file `/zarr_checksums/foobar/1/2/3/.checksum` if it doesn't already exist (note the altered prefix).
Member:

Note: this would be "spreading" zarr-relevant metadata across distant locations (within the /zarr and /zarr_checksums hierarchies), making it a little odd. Since foobar is really our zarr_id (thus we have control over its name), we could use /zarr/foobar.checksums/ or the like. That would keep them nearby, making them easier for anyone to discover/use/etc.

Note: AFAIK we could have a /zarr/foobar.checksum file and a /zarr/foobar.checksum/1 file on S3, i.e. /zarr/foobar.checksum could be both a "file" and a "directory". A "nice" feature of S3, but we ideally should avoid using it, so as not to complicate local backups etc., since it wouldn't be representable on a regular file system.
Then it could be /zarr/foobar.checksum/.checksum and so on down the hierarchy.

@dchiquito (Contributor Author):

I have no objection to keeping them more proximate; I just didn't want to include .checksums in the actual zarr directories themselves, to avoid "polluting" real data with our checksum garbage.
I explicitly decided against storing checksums as objects with paths that are directory names, for exactly the reasons you outlined. I see no reason to do that at any point.

Member:

> ... I just didn't want to include .checksums in the actual zarr directories themselves to avoid "polluting" real data with our checksum garbage

Yes, we should not pollute it with anything ad-hoc.


After this subfile is uploaded, the API server creates a file `/zarr_checksums/foobar/1/2/3/.checksum` if it doesn't already exist (note the altered prefix).
This file contains a list of all the files/directories currently in the directory, their checksums, and an aggregated checksum of the entire directory contents.
If `.checksum` already exists, it is updated with the new file+ETag and the final checksum is updated.
Member:

What would be the size of such a file for a sizeable zarr file (e.g. one with 1 million 'zarr files')? I.e., wouldn't it be prohibitive?

Ideally we should check within the zarr ecosystem/community whether something like that is already implemented or has been thought about. As I have mentioned elsewhere, there is already some support for consolidated metadata, and issues of slow listing of large "zarr files" on S3 (e.g. this one) have already been raised.

@dchiquito (Contributor Author):

There's basically a .checksum file per directory. If your zarr file has a directory that contains 100,000 files/directories, then yes indeed, the cost would be prohibitive. My understanding was that the whole value proposition of zarr is the tree structure used to store the subfiles, so if a zarr file has a profoundly unbalanced tree, it's not very useful.

If there is data that ugly that might get uploaded to dandi, that would be a great requirement to know about.
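As a sketch of the roll-up being discussed, assuming the aggregate is an md5 over sorted name/checksum pairs (the actual digest scheme is still a TODO below): each subfile upload only touches the `.checksum` files of its ancestors, so the per-upload work is proportional to tree depth, not tree size, as long as directories stay reasonably small.

```python
import hashlib


def aggregate_checksum(entries: dict) -> str:
    """Combine child checksums (file ETags or subdirectory aggregates) into one digest.

    `entries` maps a child name to its checksum; sorting keeps the result
    independent of upload order.
    """
    digest = hashlib.md5()
    for name in sorted(entries):
        digest.update(f"{name}:{entries[name]}\n".encode())
    return digest.hexdigest()


# Roll up from the leaves: a directory's aggregate becomes an entry in its parent.
leaf_dir = aggregate_checksum({"4.txt": "79b2b2ad656b1b6b09b9a67a2ff34af9"})
parent = aggregate_checksum({"3": leaf_dir})
```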

### .checksum file format

TODO
TODO include sha256 as it is calculated? are we even calculating sha256 for zarr files?
@yarikoptic (Member) commented Oct 6, 2021:

Good question: it might become a huge load on the backend server to send 1 million queries for out-of-band checksumming of tiny files... probably we can get away with just computing the md5/ETag?

Edit: FWIW, knowing the sha256 might be useful for dataladification, but it is not strictly required, so effort probably should not be wasted on it.

@dchiquito (Contributor Author):

Yeah, my thought was that sha256 might come in handy for a .nwb file, but taking the sha256 of a zarr file doesn't make any sense to me. If we were to sha256 each subfile, we could throw it into the .checksum file somewhere, but actually cascading it upwards seems redundant with the ETag-based treehash we are already computing.
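Purely as a strawman for that format TODO, one possible shape for a `.checksum` file; every field name and value here is hypothetical:

```python
# Hypothetical contents of /zarr_checksums/foobar/1/2/3/.checksum
checksum_doc = {
    "checksums": {
        "4.txt": "79b2b2ad656b1b6b09b9a67a2ff34af9",   # subfile -> S3 ETag
        "5": "2f4a3b6a0e0a8c1d5b9e7c3d1a2b4c5d",       # subdirectory -> its aggregate
    },
    # Aggregated checksum of this directory's entire contents.
    "checksum": "5e4fd5a1a0e4a3c3a9d6a5cbb77e2d11",
}
```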

This doesn't work for zarr files, since we want to make sure that subsequent modifications to the draft zarr file don't impact the published zarr file.

Fortunately, fast publishing is not a requirement for dandisets, just a happy coincidence of our deduplication scheme.
On publish, simply trigger a job that copies the zarr file S3 data to a new location (TODO where is that exactly?).
Member:

This would require locking that 'zarr file' against modifications in the interim; the copying could take hours or days... It would also unnecessarily double the size of the archive (or worse, if multiple versions are published with e.g. tiny changes to the 'zarr file' content).

I really think we should look into utilizing S3 versioning plus an external "manifest" file, with some custom backend for zarr, or maybe somehow fsspec_reference_maker (e.g. as used here for indexing access to .nc files).

@dchiquito (Contributor Author):

Good point about locking. I think we could return an error if clients try to use {zarr_file}/batch operations while a publish is happening, which would effectively lock them out of any modifications.

Yep, lots of tiny publishes would be bad. I don't see any way to store the file accessibly in S3 without copying on publish. The alternative would involve a custom zarr backend; I delegate designing that to @yarikoptic.
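For scale, the publish copy would at least be server-side (no data flows through the API). A sketch with boto3; the published prefix is hypothetical, per the doc's own "TODO where is that exactly?":

```python
import boto3

s3 = boto3.client("s3")


def copy_zarr_on_publish(bucket: str, zarr_id: str, version: str) -> None:
    """Server-side copy of every subfile from the draft prefix to a published one."""
    src_prefix = f"zarr/{zarr_id}/"
    dst_prefix = f"zarr-published/{zarr_id}/{version}/"  # hypothetical destination
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            s3.copy_object(
                Bucket=bucket,
                Key=dst_prefix + obj["Key"][len(src_prefix):],
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
            )
```

Even server-side, this is still one copy request per subfile, which is where the hours-or-days concern above comes from.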

Member:

To "store" it we don't need a backend; we just need to get a listing with an S3 versionId per key.

It is accessing some previous state/version that would need a custom zarr backend, or maybe some support at the level of fsspec. OK, I will inquire within the zarr/fsspec communities on that (and possibly other) aspects.
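For comparison with the copy approach, building such a versionId manifest is a single listing pass over the bucket (no data is copied); a sketch with boto3:

```python
import boto3

s3 = boto3.client("s3")


def version_manifest(bucket: str, prefix: str) -> dict:
    """Map each current key under `prefix` to its S3 VersionId."""
    manifest = {}
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for version in page.get("Versions", []):
            if version["IsLatest"]:
                manifest[version["Key"]] = version["VersionId"]
    return manifest
```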

@yarikoptic (Member) commented Oct 6, 2021:

While composing a question on SO, I ran into zarr-developers/zarr-specs#82, which I think hits the use case, but it is still just in a "mind storming" mode :-/ I need to digest it.

Edit: an idea to check further would be to implement it at the fsspec level: zarr-developers/zarr-specs#82 (comment)


## API endpoints

* POST /api/zarr_files/
Member:

I know we are calling this zarr_files, but I would suggest a more generic name; zarr is one of many folder-based structures.

@dchiquito (Contributor Author):

Merging this into the archived design docs directory in favor of #574.

@dchiquito merged commit 315620c into master on Oct 20, 2021
@dchiquito deleted the enh-zarr-support-design-2 branch on October 20, 2021 at 22:45
waxlamp pushed a commit referencing this pull request on Jan 7, 2022: "Properly handle Django user initials"