
Design for Zarr support #295

Merged: 8 commits from enh-zarr-support-design into master on Oct 20, 2021

Conversation

@satra (Member) commented May 19, 2021

This design doc is intended to help us move towards zarr support on the server. It is increasingly clear that we will need this soon. We are forcing people to use HDF5 or tiff in the short term, but will need to move this to NGFF, which uses Zarr. This may also come into play for NWB at some point in time.

@satra added the design-doc label on May 19, 2021

1. An asset is associated with a single zarr folder. From a user perspective this is still a single asset and the UI
should not try to delve into the structure of the folder. The CLI should be able to download the entire tree. Matt
at kitware is looking into IPFS + NGFF, so we should at least keep that in mind.
Member:

note: on the DataLad end I might make each .zarr into a dedicated dataset. There are cons to that (e.g. no shared keystore between different .zarr files sharing some data), but it is the only way I see to do it in a scalable fashion

Given these considerations, here are questions for implementation:
1. Is there a way to upload a folder to a given prefix using an API key without having to create 100k signed URLs?
1. Should the tree structure be stored somewhere so that diffs can be ascertained?
1. Given that each zarr file may contain 100k+ files, how will dandi-cli handle alterations?
Member:

I think we should be able to make it an "ok" experience. It would likely be slow. We might need to optimize "directory" support in fscacher - it might come in handy for the composite etag computation etc.

3. Blob store allows for a folder, which contains the zarr named "locations" and data. That is, given a root prefix, a zarr-compliant software can navigate the zarr metadata/structure using relative path rules.
Member:

not yet sure if it wouldn't be wiser to keep that folderblobs/ separate from blobs/

Member Author:

i like folderblobs !

Co-authored-by: Yaroslav Halchenko <[email protected]>
1. zarr files are stored in a "directory" in S3.
1. Each zarr file corresponds to a single Asset.
1. The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr file. The API verifies this checksum _immediately_ after upload; it's not good enough to download the entire zarr file to calculate it after upload.
1. The system can theoretically handle zarr files with ~1 million subfiles, each of size 64 * 64 * 64 bytes ~= 262 kilobytes.
Member Author:

now i remember why each file is smaller: the chunks are compressed, and the uncompressed file size calculation is 64*64*64*datatype_bytes
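For concreteness, a quick worked calculation of the uncompressed chunk size implied above (a sketch; the 64^3 chunk shape comes from the discussion, the dtype widths are just examples):

```python
# Uncompressed size of a single 64x64x64 zarr chunk for a few example dtypes;
# the subfiles actually stored are smaller because chunks are compressed.
for dtype_bytes in (1, 2, 4):
    size_kib = 64 * 64 * 64 * dtype_bytes / 1024
    print(f"{dtype_bytes}-byte dtype: {size_kib:.0f} KiB uncompressed")
# 1-byte dtype: 256 KiB uncompressed
# 2-byte dtype: 512 KiB uncompressed
# 4-byte dtype: 1024 KiB uncompressed
```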

Contributor:

Suggested change:
- 1. The system can theoretically handle zarr files with ~1 million subfiles, each of size 64 * 64 * 64 bytes ~= 262 kilobytes.
+ 1. The system can theoretically handle zarr files with ~1 million subfiles, each of size `zip(64 * 64 * 64 * {datatype}) bytes ~<~ 262 kilobytes`.

Do you have an estimate for an upper bound for the file size?

3. API responds with a corresponding list of presigned upload URLs (**TODO** where to upload?) in the S3 bucket.
The size limit for each upload is 5GB.
Member Author:

since we are doing a single upload (as opposed to multipart) and the etag is being computed, we could build the md5 into the presigning process.

Contributor:

dandi-etag is the S3 etag, which for this case is just the MD5 of the file, so it already is, basically. Do you mean that we should use the etag to generate the presigned URL so that only a file with that etag can be uploaded?

@satra (Member Author) Sep 3, 2021:

> so that only a file with that etag can be uploaded

yes - we couldn't do it for multipart, but should work for single part.
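A minimal sketch of how the MD5 could be baked into a single-part presigned PUT with boto3 (the bucket name, key layout, and file name below are illustrative, not the actual dandi-api values). Because Content-MD5 becomes part of the signature, the uploader must send the same header, and S3 rejects any body whose MD5 does not match:

```python
import base64
import hashlib

import boto3

s3 = boto3.client("s3")


def presign_put_with_md5(bucket: str, key: str, md5_digest: bytes, expires: int = 3600) -> str:
    """Presign a single-part PUT that only accepts a body with this exact MD5."""
    content_md5 = base64.b64encode(md5_digest).decode()
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key, "ContentMD5": content_md5},
        ExpiresIn=expires,
    )


# Illustrative usage: presign the upload of one local zarr chunk file.
with open("0.0.0", "rb") as f:
    digest = hashlib.md5(f.read()).digest()
url = presign_put_with_md5("example-dandi-bucket", "zarr/<zarr-id>/0.0.0", digest)
```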

3. Blob store allows for a folder, which contains the zarr named "locations" and data. That is, given a root prefix,
a zarr-compliant software can navigate the zarr metadata/structure using relative path rules.
@yarikoptic (Member) Sep 3, 2021:

later we might end up with non-zarr folders. Should we include an indication of the underlying "folder format" within the folder or in the subfolder name?
e.g. it could be d65b541b-885a-4bb4-badd-2a57b1bebab0.zarr or, maybe better, d65b541b-885a-4bb4-badd-2a57b1bebab0/zarr/

this way we can actually support storing multiple representations of the same data within the blob store (e.g. KEY - original nwb or whatnot, KEY/zarr/ - zarr, KEY/ipfs/ - ipfs blocks if that is a thing ;)) without causing conflicts/ambiguity when looking at a specific PREFIX and its immediate "sub-folders"

Contributor:

I think it would be better to store things at zarr/KEY/ and ipfs/KEY/, much like we already store things in blobs/KEY/. We can still have the same KEY in multiple stores.

@yarikoptic (Member) Sep 3, 2021:

Works for me. Then the wording above should avoid "Blob store", since that is blobs/ to me, and instead mention a zarr/ store whose internal layout matches the one of blobs/

Member:

On the db/api side it then wouldn't be the blobs table/endpoint, right?

A simple scheme that I think would work:

* initial value is `sha256("{etag}:{path}")` for the first subfile
* the next value is `sha256("{prev_sha256}:{etag}:{path}")`, ad infinitum
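A minimal sketch of this chaining scheme as described in the two bullets above (the `etag:path` string format and carrying the state as a hex digest are assumptions; only the latest digest needs to be stored between upload batches):

```python
import hashlib
from typing import Iterable, Optional, Tuple


def update_running_checksum(prev: Optional[str], etag: str, path: str) -> str:
    """One step of the chain: hash the previous digest (if any) with the next subfile."""
    payload = f"{etag}:{path}" if prev is None else f"{prev}:{etag}:{path}"
    return hashlib.sha256(payload.encode()).hexdigest()


def running_checksum(subfiles: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Fold (etag, path) pairs, already in the agreed order, into a single digest."""
    state = None
    for etag, path in subfiles:
        state = update_running_checksum(state, etag, path)
    return state
```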
@yarikoptic (Member) Sep 3, 2021:

if we are not to go for a proper "tree hash" of some kind, which would provide a more efficient way of computing than "serial", I think we should just use the same dandi-etag approach: `{md5(sorted((files_etags: dict).items()))}-{len(files_etags)}`, and call it dandi-treetag or the like
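A minimal sketch of this alternative, dandi-etag-style directory checksum (the exact serialization of the sorted `(path, etag)` items is an assumption; any canonical encoding agreed on by the CLI and server would do):

```python
import hashlib
from typing import Dict


def dandi_treetag(files_etags: Dict[str, str]) -> str:
    """md5 over the sorted (path, etag) pairs, suffixed with the file count."""
    h = hashlib.md5()
    for path, etag in sorted(files_etags.items()):
        h.update(f"{path}:{etag}\n".encode())
    return f"{h.hexdigest()}-{len(files_etags)}"


# e.g. dandi_treetag({".zarray": "9e1f...", "0/0/0": "a3b2..."})
```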

Member:

actually we might just keep it named dandi-etag so the DB and API stay consistent across many existing endpoints, regardless of whether it is a file or a directory

Contributor:

One of the benefits of this particular upload algorithm is that it doesn't need to store the hashes of all the uploaded subfiles, just a single checksum that is updated with each subfile uploaded. Your algorithm involves hashing all the etags at the end, so it does not have that benefit.

This is a viable alternative if we decide on a different scheme that does store the hashes, though.

Member:

Are you expecting each client to finalize each file upload in the specified order even when uploading async/in parallel, where some larger files might clog reporting on smaller files' uploads?

An md5 hash can be updated with new data as more comes in (that is how it is computed ATM on a stream, isn't it?), so there is no need to store them all - just an implementation detail IMHO. Or am I wrong?

Member Author:

for a single part upload the presigned url can ensure that the correct content is uploaded.


### Before upload (technically optional)
1. CLI calculates the checksum of the zarr file.
1. CLI queries the API (**TODO** URL?) if the checksum has already been uploaded.
Member:

exactly the same as for an upload of an individual file - why should there be a difference?

1. If so, it proceeds with the already uploaded zarr file and skips upload entirely.

### Upload
1. CLI queries the API (**TODO** URL?) with the checksum of the zarr file to initiate an upload. The API creates an upload UUID, records the checksum, and initializes a "running checksum" to `null`.
@yarikoptic (Member) Sep 3, 2021:

the same /uploads/initiate, with contentSize being the sum of all file sizes (although not used for the checksum computation, it is needed anyway for DB/web UI display), but we either have a different name for the checksum or provide an explicit extra option to specify format='zarr' (defaulting to 'file', as that is what we support now).

Contributor:

I left the actual API definition for later so we can discuss it after we agree on the requirements and that the overall implementation will meet those requirements.

1. CLI requests a batch of presigned URLs (**TODO** URL?).
The files must be ordered in the same way used to calculate the checksum.
Member:

I thought the idea was to operate by providing IAM credentials with write access to the target upload prefix.
If that is the case, there is no need for presigned URLs; instead of parts, the response would provide IAM credentials (access key, secret key, access token, expiration). What we might need, though, is a dedicated API endpoint to provide renewed credentials if the prior ones are (about to) expire(d).

Member:

If we were to operate on a per-file upload basis (thus a ridiculous amount of presigning) and provide this list -- I would not rely on "ordered in the same way". Ordering should be done explicitly by the checksumming algorithm (we need to assume that the CLI and dandi-api both use the same algorithm), which would sort (or not) internally and consistently. We should not rely on sorting external to the algorithm to provide that sorted order to both the checksum computation and this upload endpoint.

Contributor:

This design doesn't involve IAM credentials. If it doesn't solve the requirements, we can throw out this plan and come up with something else, possibly involving IAM.

If possible, I would rather avoid having to do all that IAM management. It's a technically viable option, but I don't think we have to do it that way.

Member:

I will stop further analysis below since it is rooted in the idea of "presigning", and I probably incorrectly thought that we would like to avoid that for the 100000s of files a zarr will have. Please clarify, @dchiquito and @satra

Contributor:

After consideration my opinion is that presigning is the least bad option. It involves (number of files / files per batch) requests, but ultimately that's much less data than fully proxying the entire upload. Any direct upload scheme like the IAM idea still has to figure out a way to calculate the checksum.

I think this scheme I laid out meets the requirements. For now it would be helpful to identify any shortcomings it has (I listed a few in the Pros/Cons section) so we can modify it or discard it in favor of something else.


## Benchmarks
I mocked up API endpoints that would behave more or less like the ones described above.
I recommend throwing the code away, but it should give a good estimate for performance.
Member:

I strongly suggest not throwing the code away, but instead transferring it to a gist so we don't have to keep a weird branch around in the codebase.

1. A well-defined ordering of the subfiles in the zarr file must exist.
The checksum must be computed on the subfiles in this order.
1. It must be applicable on one subfile at a time.
1. It must be able to save its state between subfiles.
Member:

Save state where? If it's to a file, the only way to do that with Python's hashlib classes is via pickle, which opens up security issues.

Contributor:

My idea here was that the hash must be computable incrementally across an arbitrary number of requests, which means the state must be saved somehow in between requests. My hope is for a hashing scheme that uses intermediate hash values rather than saving the state of a hashlib hasher, as that doesn't have a clean solution.

The scheme I outlined below has this property, as would using Merkle trees to recursively concatenate hashes.
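For comparison, a minimal sketch of the Merkle-style alternative mentioned above, where directory hashes are built by recursively concatenating child hashes (sorting by name and the `name:hash` line format are assumptions):

```python
import hashlib
from pathlib import Path


def merkle_hash(path: Path) -> str:
    """Hash a file by content; hash a directory by its children's sorted name:hash lines.

    Changing one subfile only invalidates the hashes along its path to the root,
    which is what makes partial updates cheap to re-verify.
    """
    if path.is_file():
        return hashlib.sha256(path.read_bytes()).hexdigest()
    lines = sorted(f"{child.name}:{merkle_hash(child)}" for child in path.iterdir())
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()
```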

@satra (Member Author) commented Sep 8, 2021

this is an optimization step, but one that i think we should consider in the design.

i would use a treehash or something similar for the simple reason that updating a zarr file should not require uploading all subfiles. it should only upload the changes and remove any missing pieces (so more like a sync operation).

on the api side, the id of the file can stay the same and the hashes would change. the api would only care about the aggregate hash that it can compute just like sha256 computation now using an out of band process. for the actual upload it just needs to know that all files have been uploaded. so an upload init would require number of files to be uploaded and each batch just needs confirmation that those files have been uploaded. the md5 check is done by AWS.

generally this would work, except for partial updates:

  1. initialize upload with the total number of files and size
  2. repeat (API maintains continuous hash):
    1. request presigned urls with dandi-etag for a batch in ordered tree mode
    2. upload to AWS
    3. update batch uploads on API
  3. complete upload. API stores this hash and should match etag computation on the client side for a folder.

we need to think of a way to do partial updates and still be able to update the overall etag at the point of finishing the update.
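A minimal sketch of the sync-style diff this implies for partial updates (both mappings are relative subfile path -> checksum; how the remote listing would be obtained from the API is left open here):

```python
from typing import Dict, List, Tuple


def plan_zarr_sync(local: Dict[str, str], remote: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (paths to upload, paths to delete) to make the remote zarr match the local one.

    Only new or changed subfiles are uploaded; subfiles that no longer exist
    locally are deleted, after which the aggregate tree checksum is recomputed.
    """
    to_upload = [path for path, checksum in local.items() if remote.get(path) != checksum]
    to_delete = [path for path in remote if path not in local]
    return to_upload, to_delete
```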

@waxlamp (Member) commented Sep 21, 2021

> this is an optimization step, but one that i think we should consider in the design. […]

My understanding is that any approach that supports updateable draft Zarr files will require a much heavier storage footprint than one that doesn't (because file hashes on the order of the number of files in the Zarr archive will need to be recorded until the dandiset is published). As such, it would make sense to include a non-updateable method now because it's simpler to implement and stresses the resources less, while keeping something heavier in our back pocket for later if and when the need arises concretely. I don't object to keeping this design (or one like it) in the design document, but I'm thinking it would be a separate mode of upload added later. So you'd be presented with an option at upload time along the lines of "NWB", "Zarr (static)", "Zarr (updateable)" (of course we can use better names, and offer help text, but this is just for demonstration purposes).

In order to move ahead with this design, are you ok with committing to the simpler, more static mode now, while sketching out a more complex design to possibly be added later?

@satra (Member Author) commented Sep 21, 2021

for this PR, i just wanted to list the current status of one of the dandisets (000108). i have converted about 929 zarr files to hdf5 (about 360 of these are on the archive at the moment) for the current upload (while we implement zarr support). this is about 47TB of data. and there are another 280 files being converted (another 12-15TB of data).

each zarr file in this dataset can have around 700000 files. and it’s a multiresolution stack, so metadata affects about 16 files; the rest are binary chunks. there are multiple acquisitions planned, and the standard for this kind of file continues to evolve, which means metadata changes are inevitable. while we are considering overlay methods in the short term for our h5 files (because some metadata changes simply involve filename adjustments), with zarr this kind of overlay is not useful. zarr is designed for lightweight updates and not designed for provenance tracking.

a whole folder upload should absolutely be the first priority, but given the scales of these datasets, it may become inefficient almost immediately, given that reuploads because of metadata changes will take more than a week of uploading when they would have taken minutes.

so let's implement whole folder upload, but immediately tackle the partial upload/update.

@yarikoptic (Member) commented Sep 21, 2021

I am afraid that as long as we want to retain "zarr hierarchy" within that "zarr folder", we will never be able to come up with a reasonable design for an "updateable" zarr folder on S3. If we are to make them updatable only within draft (until published):

  • how to guarantee that we do not modify an already published zarr? Possible solution: we can refuse to mint upload URLs for anything under that folder if it was already published, and refuse to publish if there is a known (not yet expired?) upload possibly going on.
    • side-effect: if zarr is published and needs to be updated for draft/next publication -- would need a full upload to a new zarr "folderblob" id :-/
  • how to guarantee "integrity" of an existing zarr folder to which we allow to "update"? E.g. after an initial upload, we "finalized" and ensured that it is all consistent (it was not yet published). Then the request came to update it but never got finished/finalized -- that folder might be in a dirty state. Possible solution: besides tree hash, we store versionIds (actually we might be good with just a datestamp whenever upload was initiated!) in the bucket of all keys in that folder upon "finalize". Whenever "update" upload comes, we mark that folder "dirty". Whenever upload is finalized - new treehash/collection of versionIds, unmark "dirty". If upload is not finalized and expired: we remove all keys (and DeleteMarkers) in that folder after the upload initiation date. I believe then the folder would look as it was before upload was initiated

I started to wonder if in the long run we aren't really doomed to provide some layer/service on top which would (similarly to what we have for assets) provide a "zarr view" over a collection of blobs? But I am not even sure if zarr libraries would work this way - whenever a url for a specific file redirects to S3, most likely they would then continue constructing the next URL based off that S3 URL? @satra -- did you check? If it could be done, then we actually could provide quite an efficient solution! (although I am not sure whether it would violate S3 T&C, since those blobs would not be immediately usable, and creating a "zarr manifest" also would be somewhat useless I guess)
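A minimal sketch of the "remove everything written after the upload initiation date" rollback on a versioned bucket (the bucket and prefix names are illustrative; this assumes bucket versioning is enabled and uses only standard boto3 calls):

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def rollback_incomplete_upload(bucket: str, prefix: str, started: datetime) -> None:
    """Delete every object version and delete marker under `prefix` newer than `started`.

    On a versioned bucket this should leave the zarr "folder" looking as it did
    before the abandoned upload was initiated.
    """
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for entry in page.get("Versions", []) + page.get("DeleteMarkers", []):
            if entry["LastModified"] > started:
                s3.delete_object(Bucket=bucket, Key=entry["Key"], VersionId=entry["VersionId"])


# e.g. rollback_incomplete_upload("example-dandi-bucket", "zarr/<zarr-id>/",
#                                 datetime(2021, 9, 21, tzinfo=timezone.utc))
```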

@satra (Member Author) commented Sep 21, 2021

for published + updateable zarr, the current notion of zarr and s3 will not work, since it requires presenting the tree to a reader. this is where something like an ipfs keystore would have to provide an additional layer that supports updating while providing a different key. in the short term, i would say updating published files is a no-go.

in terms of zarr support there are examples for reading files using zarr (https://zarr.readthedocs.io/en/stable/tutorial.html#io-with-fsspec) but they all point to an s3:// url, which is feasible for our public dandiset, but i don't know how it would work with embargo, if embargo goes private. we could even test by uploading one of the ngff files to our bucket somewhere. here is the bucket for zarr's tiny example: http://zarr-demo.s3-eu-west-2.amazonaws.com/

@yarikoptic (Member):

note:

  • there is also "consolidated metadata", which was introduced to overcome the shortcoming of zarr needing to list the entire hierarchy to (IIRC) introduce an update etc.; see e.g. the discussion here. Cons: it requires a dedicated zarr.open_consolidated call to open such a hierarchy, and I am not yet sure if it could be (ab)used to provide views of multiple "versions" - most likely not, but it may be something to recommend to zarr DANDI users?

an idea:

  • similarly to how openneuro uses datalad on their public s3 bucket, we use S3 versioning
  • upload to the zarrfolder goes normally. S3 "view" of that zarrfolder always corresponds to the latest/"draft" version
    • users can read that draft version (or maybe a published one, if there are no changes in draft) directly from S3
    • we can provide the "cleanup of incomplete upload" as I described above to 'fsck' the zarr on S3
  • zarr supports custom stores
    • upon "finalize" of the upload we establish the listing of versions for the keys in that upload. Populate some extra file with information within S3 (thus "self-containing" all necessary information on S3, so in principle could be used directly, e.g. .zarrdandi.{version})
    • we provide a custom store which would then allow accessing any specific published version of an archive based on those .zarrdandi.{version} files (see the sketch at the end of this comment)
  • we advocate using this custom store if reproducibility is desired/required
    • nothing unprecedented really. draft on S3 is just draft and not guaranteed to be "immutable"
  • we do not have to "immediately" implement it. Just implement upload to a zarrfolder as being worked out in this PR to start with (possibly even without worrying about removing some stale components in the zarr hierarchy? although ideally it should clean up), and then if we decide to proceed with the above - we can develop it. We could even "mint" those .zarrdandi files later based on key history metadata/timestamps (pretty much similarly to what the S3 datalad-crawler does).
  • cons: S3 specific since would rely on versioning

relevant discussions: zarr-developers/zarr-specs#76 (Versioned Arrays)
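A minimal sketch of such a custom read-only store (zarr v2 treats a store as a dict-like mapping of keys such as ".zarray" or "0/0/0" to bytes); the .zarrdandi.{version} manifest format, bucket name, and key layout are assumptions carried over from the bullets above:

```python
from collections.abc import Mapping
from typing import Dict

import boto3


class VersionPinnedS3Store(Mapping):
    """Read-only zarr store where every key is pinned to a recorded S3 version ID."""

    def __init__(self, bucket: str, prefix: str, manifest: Dict[str, str]):
        self._s3 = boto3.client("s3")
        self._bucket = bucket
        self._prefix = prefix.rstrip("/") + "/"
        self._manifest = manifest  # zarr key -> S3 VersionId, e.g. loaded from .zarrdandi.{version}

    def __getitem__(self, key: str) -> bytes:
        # Missing keys raise KeyError via the manifest lookup, as zarr expects.
        resp = self._s3.get_object(
            Bucket=self._bucket,
            Key=self._prefix + key,
            VersionId=self._manifest[key],
        )
        return resp["Body"].read()

    def __iter__(self):
        return iter(self._manifest)

    def __len__(self) -> int:
        return len(self._manifest)


# e.g. zarr.open(VersionPinnedS3Store("example-dandi-bucket", "zarr/<zarr-id>", manifest), mode="r")
```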

@satra (Member Author) commented Sep 21, 2021

s3 versioning would get around some of the issues in the short term. whatever we end up doing, zarrfolder upload needs to be implemented. so i would suggest we move ahead with that.

@dchiquito marked this pull request as draft on October 5, 2021 at 20:57
@waxlamp mentioned this pull request on Oct 18, 2021
@dchiquito marked this pull request as ready for review on October 20, 2021 at 21:18
@dchiquito (Contributor):
Merging this into the archived design docs directory in favor of #574

@dchiquito merged commit a064d2a into master on Oct 20, 2021
@dchiquito deleted the enh-zarr-support-design branch on October 20, 2021 at 22:44
kabilar added a commit to kabilar/dandi-archive referencing this pull request on Jan 30, 2025: "Add code of conduct and Netlify link to footer"