Second zarr design doc #552
Conversation
Did an initial pass and left comments. I think we had better reach out to the zarr/fsspec community(ies) with our use case/desires (large sizes, updates, checksumming, versioning, use of S3, though ideally we should not rely on S3-specific features) and see if they have pointers/ideas for the most efficient implementation ATM. Otherwise we would just be creating a poorly informed ad-hoc design.
doc/design/zarr-support-2.md (outdated):
> Zarr files are represented by a single Asset through the API.
> A treehash is computed as the zarr file is uploaded.
> Each node in the treehash is stored in S3 to avoid bloating the DB.
> When published, the zarr file is copied to a separate location in the bucket so that changes to the draft cannot affect the published file.
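To make the quoted treehash idea concrete, here is a minimal sketch of how a per-directory aggregated checksum could be derived from subfile ETags; the hashing scheme (md5 over a sorted name:checksum listing) and function name are illustrative assumptions, not the design's settled algorithm:

```python
import hashlib
from typing import Dict

def aggregate_checksum(entries: Dict[str, str]) -> str:
    """Combine a directory's entries (name -> subfile ETag or child-directory
    checksum) into one digest for that directory.

    Hypothetical scheme: hash the sorted "name:checksum" lines so the result
    is independent of upload order.
    """
    h = hashlib.md5()
    for name in sorted(entries):
        h.update(f"{name}:{entries[name]}\n".encode())
    return h.hexdigest()

# Example: a leaf directory with two subfiles, then its parent directory.
leaf = aggregate_checksum({"4.txt": "etag-of-4.txt", "5.txt": "etag-of-5.txt"})
root = aggregate_checksum({"3": leaf})  # directories contribute their aggregate
```

The useful property of such a tree is that changing one subfile only dirties the checksums along its path to the root, not the whole zarr file.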
Expensive operation; it could cause publishing to take days...
Yes, sadly. IIRC, immediate publishes were not a requirement, just a nice feature.
We can instead copy to a new draft location and leave that entity published, so a user may have to explicitly initiate a draft for a zarr-based dandiset. I would flip this operation, but there are a few other details to consider.
doc/design/zarr-support-2.md (outdated):
> 1. The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr file.
> 1. The API verifies the checksum _immediately_ after upload.
> 1. The system can theoretically handle zarr files with ~1 million subfiles, each of size 64 * 64 * 64 bytes ~= 262 kilobytes.
> 1. Zarr metadata must be mutable for zarr files in draft dandisets.
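(For scale, the numbers in item 3 work out to about 10^6 * 262 KB ~= 262 GB for a single zarr file.)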
Although metadata is likely to be the one to mutate more often, AFAIK there is no reason to discriminate between zarr metadata and data files (we aren't going to bother semantically distinguishing metadata from data files in the zarr folder, are we?). So maybe just:
```diff
- 1. Zarr metadata must be mutable for zarr files in draft dandisets.
+ 1. Zarr files must be mutable in draft dandisets.
```
?
The requirement as I recall it was that metadata needs to be mutable, while the rest of the files should be mutable, but don't have to be. In this list I'm trying to capture things that are needed, not things we'd like to have.
In this implementation all files are mutable, but there might be some alternative proposal where they are not. The choice of implementation should not affect the list of requirements.
OK, then I would prefer that we make it a requirement now that any of the "zarr files" should be modifiable, or we would be in the same boat quite quickly IMHO. WDYT @satra?
I agree with the more general formulation here. In the draft state everything is mutable, but the most likely change will be to the metadata. We don't have to optimize for this at this stage; simply allow files to be overwritten/deleted/added.
doc/design/zarr-support-2.md (outdated):
> 1. Zarr files are stored in a "directory" in S3.
> 1. Each zarr file corresponds to a single Asset.
"zarr file" has been used to describe the zarr "directory" (well, different backends might store differently, including into a file). But I wonder if we should set terminology a little more straight, and call it "zarr folder" or "zarr archive" or "zarr blob" to not confuse with individual "zarr files" constituting zarr file/folder/archive/blobg? Otherwise above suggests that "each zarr file [among those zarr files mentioned above] corresponds to a single Asset" which is not the case. So what about
```diff
- 1. Zarr files are stored in a "directory" in S3.
- 1. Each zarr file corresponds to a single Asset.
+ 1. A Zarr blob is stored as files in a "directory" in S3.
+ 1. Each zarr blob corresponds to a single Asset.
```
? (and reflect that everywhere else in the document text/API endpoints)
In this document I've been referring to the entire zarr thing as a "zarr file", and the files that comprise it as "subfiles". If this design is worth investing more time into I can rename things appropriately.
doc/design/zarr-support-2.md (outdated):
> Returns some information about the zarr file, like name, S3 path+url, and checksum
> * POST /api/zarr_files/{zarr_id}/batch/start/
>   Ask to upload a batch of subfiles as part of the zarr file.
>   No more batch uploads are permitted until this one is completed or aborted.
Minor (just curious if there is a good existing one to match): what would the return code be to inform about an ongoing batch upload?
Will different batches from different users be allowed? How would we guarantee consistency of the entire "zarr file" instead of just a particular (partial "file") batch?
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/409 seems appropriate.
Nope, only one batch at a time per zarr, so no concurrent uploads. If an upload crashes and someone else needs to take over, they can cancel the pending batch upload. If we ever need multiple simultaneous uploads, we could have a batch ID and allow multiple batches at once, but that just sounds needlessly complex right now for a feature no one asked for.
My theory was that a single CLI instance would request a batch of a reasonable size, then upload all the parts simultaneously to keep the pipe to S3 as full as possible.
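A minimal sketch of how the batch/start endpoint could return the 409 mentioned above, assuming a Django REST Framework view; `ZarrFile`, `active_batch`, and `start_batch` are illustrative names, not part of the actual design:

```python
from rest_framework import status
from rest_framework.decorators import api_view
from rest_framework.response import Response

from .models import ZarrFile  # hypothetical model

@api_view(["POST"])
def batch_start(request, zarr_id):
    zarr = ZarrFile.objects.get(id=zarr_id)
    if zarr.active_batch is not None:
        # A batch upload is already in flight: reject with 409 Conflict.
        return Response(
            {"detail": "A batch upload is already in progress."},
            status=status.HTTP_409_CONFLICT,
        )
    batch = zarr.start_batch(request.data["files"])  # hypothetical helper
    return Response({"upload_urls": batch.upload_urls})
```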
Agreed -- no simultaneous batches, but allow different users to upload (whenever there is no ongoing batch) to the same "zarr file".
The question remains: how will we ensure the consistency/integrity of the entire "zarr file" and not just a particular batch? (This arises in the aforementioned "multiple users can upload" case, but can also manifest when the same user uploads while the local "zarr file" gets changed between batches.) I guess it could be up to the CLI to get a full "checksum" for the "zarr file" upon completion and compare it with the local version (which might require heavy traversal, but maybe fscache would be of help)?
Good point; in the current design the checksum is not actually checked anywhere. We could simply delegate to the CLI to independently verify that the currently reported checksum for the uploaded zarr file is the same as the locally computed one.
It's worth pointing out that each file in each batch does have its ETag checked during the batch complete step.
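A sketch of that CLI-side verification, assuming the `GET /api/zarr_files/{zarr_id}/` response includes a `checksum` field and that a local treehash helper exists (both are assumptions):

```python
import requests

def verify_zarr_upload(api_url: str, zarr_id: str, local_path: str) -> bool:
    """Compare the server-reported zarr checksum against a local one."""
    resp = requests.get(f"{api_url}/api/zarr_files/{zarr_id}/")
    resp.raise_for_status()
    remote_checksum = resp.json()["checksum"]  # assumed response field
    # compute_local_treehash would walk local_path and aggregate ETag-style
    # digests the same way the server does (hypothetical helper).
    return compute_local_treehash(local_path) == remote_checksum
```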
doc/design/zarr-support-2.md (outdated):
> A subfile `1/2/3/4.txt` of zarr file `foobar` would be kept at `/zarr/foobar/1/2/3/4.txt`.
> The ETag can be trivially determined by a `HEAD` request to that S3 object.
>
> After this subfile is uploaded, the API server creates a file `/zarr_checksums/foobar/1/2/3/.checksum` if it doesn't already exist (note the altered prefix).
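For reference, the ETag lookup mentioned in the excerpt above is a single S3 `HEAD` call; a minimal boto3 sketch (the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")
# HEAD the object to read its ETag without downloading the body.
head = s3.head_object(
    Bucket="dandi-example-bucket",  # placeholder bucket name
    Key="zarr/foobar/1/2/3/4.txt",
)
etag = head["ETag"].strip('"')  # S3 returns the ETag wrapped in quotes
```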
Note: this would be "spreading" zarr-relevant metadata across distant locations (within the `/zarr` and `/zarr_checksums` hierarchies), making it a little odd. Since `foobar` is really our `zarr_id` (thus we have control over its name), we could use `/zarr/foobar.checksums/` or alike. That would keep them nearby -- easier for anyone to discover/use/etc.
Note: AFAIK we could have a `/zarr/foobar.checksum` file and a `/zarr/foobar.checksum/1` file on S3, i.e. `/zarr/foobar.checksum` could be both a "file" and a "directory". A "nice" feature of S3, but we should ideally avoid relying on it, so as not to complicate local backups etc., since it wouldn't be representable on a regular file system. Then it could be `/zarr/foobar.checksum/.checksum` and so on down the hierarchy.
I have no objection to keeping them more proximate; I just didn't want to include `.checksum`s in the actual zarr directories themselves, to avoid "polluting" real data with our checksum garbage.
I explicitly decided against storing checksums as objects with paths that are directory names, for exactly the reasons you outlined. I see no reason to do that at any point.
> ... I just didn't want to include `.checksum`s in the actual zarr directories themselves to avoid "polluting" real data with our checksum garbage

Yes, we should not pollute it with anything ad-hoc.
doc/design/zarr-support-2.md (outdated):
> After this subfile is uploaded, the API server creates a file `/zarr_checksums/foobar/1/2/3/.checksum` if it doesn't already exist (note the altered prefix).
> This file contains a list of all the files/directories currently in the directory, their checksums, and an aggregated checksum of the entire directory contents.
> If `.checksum` already exists, it is updated with the new file+ETag and the final checksum is updated.
What would be the size of such a file for a sizeable zarr file (e.g. 1 million subfiles)? I.e., wouldn't it be prohibitive?
Ideally we should check within the zarr ecosystem/community whether something like that is already implemented or thought about. As I have mentioned elsewhere, there is already some support for consolidated metadata, and issues (e.g. this one) about slow listing of large "zarr files" on S3 have already been raised.
There's basically a `.checksum` file per directory. If your zarr file has a directory that contains 100,000 files/directories, then yes indeed the cost would be prohibitive. My understanding was that the whole value proposition of zarr is the tree structure used to store the subfiles, so if a zarr file has a profoundly unbalanced tree, it's not very useful.
If there is data that ugly that might get uploaded to dandi, then that would be a great requirement to know about.
doc/design/zarr-support-2.md (outdated):
> ### .checksum file format
>
> TODO
> TODO include sha256 as it is calculated? are we even calculating sha256 for zarr files?
Good question: it might become a huge load on the backend server to send 1 million queries for out-of-band checksumming of tiny files... probably we can get away with just computing md5/ETag?
Edit: FWIW, knowing the sha256 might be useful for dataladification, but it is not strictly required, so effort probably should not be wasted on it.
Yeah, my thought was that sha256 might come in handy for a .nwb file, but taking the sha256 of a zarr file doesn't make any sense to me. If we were to sha256 each subfile we could throw it into the `.checksum` file somewhere, but actually cascading it upwards seems redundant with the ETag-based treehash we are already computing.
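Since the `.checksum` file format in the excerpt above is still a TODO, here is one hypothetical shape it could take; JSON and the field names are illustrative assumptions, following the per-directory listing plus aggregated checksum described earlier:

```json
{
  "checksums": {
    "4.txt": "etag-of-4.txt",
    "5.txt": "etag-of-5.txt",
    "4": "aggregated-checksum-of-subdirectory-4"
  },
  "checksum": "aggregated-checksum-of-this-directory"
}
```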
doc/design/zarr-support-2.md (outdated):
> This doesn't work for zarr files, since we want to make sure that subsequent modifications to the draft zarr file don't impact the published zarr file.
>
> Fortunately, fast publishing is not a requirement for dandisets, just a happy coincidence of our deduplication scheme.
> On publish, simply trigger a job that copies the zarr file S3 data to a new location (TODO where is that exactly?).
This would require locking that 'zarr file' against modifications in the interim, and copying could take hours or days... It would also unnecessarily double the size of the archive (or more, if multiple versions are published with e.g. tiny changes to the 'zarr file' content).
I really think we should look into utilizing S3 versioning + an external "manifest" file with some custom backend for zarr, or maybe somehow `fsspec_reference_maker` (e.g. as used here for indexing access to .nc files).
Good point about locking. I think we could add an error message if clients try to use `{zarr_file}/batch` operations while a publish is happening, which would effectively lock them out of any modifications.
Yep, lots of tiny publishes would be bad. I don't see any way to store the file accessibly in S3 without copying on publish. The alternative would involve a custom zarr backend. I delegate designing that to @yarikoptic
To "store" we don't need backend, we just need to get a listing with S3 versionIds per each key.
It is to access some previous state/version where a custom zarr backend OR may be some support at the level of fsspec which would be needed. Ok -- I will inquire within zarr/fsspec communities on that (and possibly other) aspects.
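A sketch of the listing step described above, using boto3's `list_object_versions` paginator to build a key -> versionId manifest (bucket/prefix values are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def build_version_manifest(bucket: str, prefix: str) -> dict:
    """Map each key under the prefix to its latest S3 versionId."""
    manifest = {}
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for version in page.get("Versions", []):
            if version["IsLatest"]:
                manifest[version["Key"]] = version["VersionId"]
    return manifest

# e.g. manifest = build_version_manifest("dandi-example-bucket", "zarr/foobar/")
```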
While composing a question on SO I ran into zarr-developers/zarr-specs#82, which I think hits the use case, but it is still just in a "mind storming" mode :-/ Need to digest it.
Edit: an idea to check further -- implementing it at the fsspec level: zarr-developers/zarr-specs#82 (comment)
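For the access side, a minimal sketch of the kind of thing a custom store could do: a read-only mapping that resolves each zarr key to a pinned S3 versionId via s3fs's `version_aware` mode. The class name and manifest shape are assumptions, not a settled design:

```python
from collections.abc import Mapping

import s3fs

class VersionedS3Store(Mapping):
    """Read-only zarr store that pins each key to a recorded S3 versionId."""

    def __init__(self, bucket: str, prefix: str, manifest: dict):
        self.fs = s3fs.S3FileSystem(version_aware=True)
        self.bucket, self.prefix, self.manifest = bucket, prefix, manifest

    def __getitem__(self, key):
        path = f"{self.bucket}/{self.prefix}{key}"
        # Open the exact object version recorded at publish time.
        with self.fs.open(path, version_id=self.manifest[key]) as f:
            return f.read()

    def __iter__(self):
        return iter(self.manifest)

    def __len__(self):
        return len(self.manifest)

# e.g. zarr.open(VersionedS3Store("dandi-example-bucket", "zarr/foobar/",
#                                 manifest), mode="r")  # zarr v2 accepts a mapping
```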
Co-authored-by: Yaroslav Halchenko <[email protected]>
doc/design/zarr-support-2.md (outdated):
> ## API endpoints
>
> * POST /api/zarr_files/
I know we are calling this `zarr_files`, but I would suggest a more generic name; zarr is one of many folder-based structures.
Merging this into the archived design docs directory in favor of #574 |
I said I would do some thinking about alternatives to a custom zarr backend, and this is what I came up with.
I think the API/upload bits are pretty uncontroversial, no need to critique too deeply right now.
The `.checksum` file is newly written down, but I think it was @satra's idea, so also uncontroversial, I hope. The bit at the end about publishing is the downside. Thoughts appreciated.
This is a reformulation of #295, meant to subsume it.