
Design for Zarr support #295

Merged: 8 commits from enh-zarr-support-design into master on Oct 20, 2021

Conversation

@satra (Member) commented May 19, 2021

This design doc is intended to help us move towards zarr support on the server. It is increasingly clear that we will need this soon. We are forcing people to use HDF5 or tiff in the short term, but will need to move this to NGFF, which uses Zarr. This may also come into play for NWB at some point in time.

@satra added the design-doc label on May 19, 2021

1. An asset is associated with a single zarr folder. From a user perspective this is still a single asset and the UI
should not try to delve into the structure of the folder. The CLI should be able to download the entire tree. Matt
at kitware is looking into IPFS + NGFF, so we should at least keep that in mind.
Member:

note: on the DataLad end I might make each .zarr into a dedicated dataset. There are cons to that (e.g. no shared keystore between different .zarr files sharing some data), but it is the only way I see to do it in a scalable fashion

Given these considerations, here are questions for implementation:
1. Is there a way to upload a folder to a given prefix using an API key without having to create 100k signed URLs?
1. Should the tree structure be stored somewhere so that diffs can be ascertained?
1. Given that each zarr file may contain 100k+ files, how will dandi-cli handle alterations?
Member:

I think we should be able to make it an "ok" experience. It would likely be slow. We might need to optimize "directory" support in fscacher - it might come in handy for the composite etag computation etc.

3. Blob store allows for a folder, which contains the zarr named "locations" and data. That is, given a root prefix, a zarr-compliant software can navigate the zarr metadata/structure using relative path rules.
Member:

not yet sure if it wouldn't be wiser to keep that folderblobs/ separate from blobs/

Member Author:

i like folderblobs !

Co-authored-by: Yaroslav Halchenko <[email protected]>
1. zarr files are stored in a "directory" in S3.
1. Each zarr file corresponds to a single Asset.
1. The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr file. The API verifies this checksum _immediately_ after upload; it's not good enough to download the entire zarr file to calculate it after upload.
1. The system can theoretically handle zarr files with ~1 million subfiles, each of size 64 * 64 * 64 bytes ~= 262 kilobytes.
Member Author:

now i remember why each file is smaller: the chunks are compressed, and the uncompressed file size calculation is 64*64*64*datatype_bytes
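For concreteness, a quick worked calculation of the uncompressed chunk size implied above (a sketch; the 64^3 chunk shape comes from the discussion, the dtype widths are just examples):

```python
# Uncompressed size of a single 64x64x64 zarr chunk for a few example dtypes;
# the subfiles actually stored are smaller because chunks are compressed.
for dtype_bytes in (1, 2, 4):
    size_kib = 64 * 64 * 64 * dtype_bytes / 1024
    print(f"{dtype_bytes}-byte dtype: {size_kib:.0f} KiB uncompressed")
# 1-byte dtype: 256 KiB uncompressed
# 2-byte dtype: 512 KiB uncompressed
# 4-byte dtype: 1024 KiB uncompressed
```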

Contributor:

Suggested change:
- 1. The system can theoretically handle zarr files with ~1 million subfiles, each of size 64 * 64 * 64 bytes ~= 262 kilobytes.
+ 1. The system can theoretically handle zarr files with ~1 million subfiles, each of size `zip(64 * 64 * 64 * {datatype}) bytes ~<~ 262 kilobytes`.

Do you have an estimate for an upper bound for the file size?

3. API responds with a corresponding list of presigned upload URLs (**TODO** where to upload?) in the S3 bucket.
The size limit for each upload is 5GB.
Member Author:

since we are doing a single upload (as opposed to multipart) and the etag is being computed, we could build the md5 into the presigning process.

Contributor:

dandi-etag is the S3 etag, which for this case is just the MD5 of the file, so it already is, basically. Do you mean that we should use the etag to generate the presigned URL so that only a file with that etag can be uploaded?

@satra (Member Author) Sep 3, 2021:

> so that only a file with that etag can be uploaded

yes - we couldn't do it for multipart, but should work for single part.
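A minimal sketch of how the MD5 could be baked into a single-part presigned PUT with boto3 (the bucket name, key layout, and file name below are illustrative, not the actual dandi-api values). Because Content-MD5 becomes part of the signature, the uploader must send the same header, and S3 rejects any body whose MD5 does not match:

```python
import base64
import hashlib

import boto3

s3 = boto3.client("s3")


def presign_put_with_md5(bucket: str, key: str, md5_digest: bytes, expires: int = 3600) -> str:
    """Presign a single-part PUT that only accepts a body with this exact MD5."""
    content_md5 = base64.b64encode(md5_digest).decode()
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key, "ContentMD5": content_md5},
        ExpiresIn=expires,
    )


# Illustrative usage: presign the upload of one local zarr chunk file.
with open("0.0.0", "rb") as f:
    digest = hashlib.md5(f.read()).digest()
url = presign_put_with_md5("example-dandi-bucket", "zarr/<zarr-id>/0.0.0", digest)
```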

3. Blob store allows for a folder, which contains the zarr named "locations" and data. That is, given a root prefix,
a zarr-compliant software can navigate the zarr metadata/structure using relative path rules.
@yarikoptic (Member) Sep 3, 2021:

later we might end up with non-zarr folders. Should we include an indication of the underlying "folder format" within the folder or in the subfolder name?
e.g. it could be d65b541b-885a-4bb4-badd-2a57b1bebab0.zarr or, maybe better, d65b541b-885a-4bb4-badd-2a57b1bebab0/zarr/

this way we can actually support storing multiple representations of the same data within the blob store (e.g. KEY - original nwb or whatnot, KEY/zarr/ - zarr, KEY/ipfs/ - ipfs blocks if that is a thing ;)) without causing conflicts/ambiguity when looking at a specific PREFIX and its immediate "sub-folders"

Contributor:

I think it would be better to store things at zarr/KEY/ and ipfs/KEY/, much like we already store things in blobs/KEY/. We can still have the same KEY in multiple stores.

@yarikoptic (Member) Sep 3, 2021:

Works for me. Then the wording above should avoid "Blob store", since that is blobs/ to me, and instead mention a zarr/ store whose internal layout matches the one of blobs/

Member:

On the db/api side it then wouldn't be the blobs table/endpoint, right?

A simple scheme that I think would work:

* initial value is `sha256("{etag}:{path}")` for the first subfile
* the next value is `sha256("{prev_sha256}:{etag}:{path}")`, ad infinitum
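A minimal sketch of this chaining scheme as described in the two bullets above (the `etag:path` string format and carrying the state as a hex digest are assumptions; only the latest digest needs to be stored between upload batches):

```python
import hashlib
from typing import Iterable, Optional, Tuple


def update_running_checksum(prev: Optional[str], etag: str, path: str) -> str:
    """One step of the chain: hash the previous digest (if any) with the next subfile."""
    payload = f"{etag}:{path}" if prev is None else f"{prev}:{etag}:{path}"
    return hashlib.sha256(payload.encode()).hexdigest()


def running_checksum(subfiles: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Fold (etag, path) pairs, already in the agreed order, into a single digest."""
    state = None
    for etag, path in subfiles:
        state = update_running_checksum(state, etag, path)
    return state
```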
@yarikoptic (Member) Sep 3, 2021:

if we are not to go for a proper "tree hash" of some kind, which would provide a more efficient way of computing than "serial", I think we should just use the same dandi-etag approach: `{md5(sorted((files_etags: dict).items()))}-{len(files_etags)}`, and call it dandi-treetag or the like
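A minimal sketch of this alternative, dandi-etag-style directory checksum (the exact serialization of the sorted `(path, etag)` items is an assumption; any canonical encoding agreed on by the CLI and server would do):

```python
import hashlib
from typing import Dict


def dandi_treetag(files_etags: Dict[str, str]) -> str:
    """md5 over the sorted (path, etag) pairs, suffixed with the file count."""
    h = hashlib.md5()
    for path, etag in sorted(files_etags.items()):
        h.update(f"{path}:{etag}\n".encode())
    return f"{h.hexdigest()}-{len(files_etags)}"


# e.g. dandi_treetag({".zarray": "9e1f...", "0/0/0": "a3b2..."})
```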

Member:

actually we might just keep it named dandi-etag so the DB and API stay consistent across many existing endpoints, regardless of whether it is a file or a directory

Contributor:

One of the benefits of this particular upload algorithm is that it doesn't need to store the hashes of all the uploaded subfiles, just a single checksum that is updated with each subfile uploaded. Your algorithm involves hashing all the etags at the end, so it does not have that benefit.

This is a viable alternative if we decide on a different scheme that does store the hashes, though.

Member:

Are you expecting each client to finalize each file upload in the specified order even when uploading async/in parallel, where some larger files might clog reporting on smaller files' uploads?

An md5 hash can be updated with new data as more comes in (that is how it is computed ATM on a stream, isn't it?), so there is no need to store them all - just an implementation detail IMHO. Or am I wrong?

Member Author:

for a single part upload the presigned url can ensure that the correct content is uploaded.


### Before upload (technically optional)
1. CLI calculates the checksum of the zarr file.
1. CLI queries the API (**TODO** URL?) if the checksum has already been uploaded.
Member:

exactly the same as for an upload of an individual file - why should there be a difference?

1. If so, it proceeds with the already uploaded zarr file and skips upload entirely.

### Upload
1. CLI queries the API (**TODO** URL?) with the checksum of the zarr file to initiate an upload. The API creates an upload UUID, records the checksum, and initializes a "running checksum" to `null`.
@yarikoptic (Member) Sep 3, 2021:

the same /uploads/initiate, with contentSize being the sum of all file sizes (although not used for the checksum computation, it is needed anyway for DB/web UI display), but we either have a different name for the checksum or provide an explicit extra option to specify format='zarr' (defaulting to 'file', as that is what we support now).

Contributor:

I left the actual API definition for later so we can discuss it after we agree on the requirements and that the overall implementation will meet those requirements.

1. CLI requests a batch of presigned URLs (**TODO** URL?).
The files must be ordered in the same way used to calculate the checksum.
Member:

I thought the idea was to operate by providing IAM credentials with write access to the target upload prefix.
If that is the case, there is no need for presigned URLs; instead of parts, the response would provide IAM credentials (access key, secret key, access token, expiration). What we might need, though, is a dedicated API endpoint to provide renewed credentials if the prior ones are (about to) expire(d).

Member:

If we were to operate on a per-file upload basis (thus a ridiculous amount of presigning) and provide this list -- I would not rely on "ordered in the same way". Ordering should be done explicitly by the checksumming algorithm (we need to assume that the CLI and dandi-api both use the same algorithm), which would sort (or not) internally and consistently. We should not rely on sorting external to the algorithm to provide that sorted order to both the checksum computation and this upload endpoint.

Contributor:

This design doesn't involve IAM credentials. If it doesn't solve the requirements, we can throw out this plan and come up with something else, possibly involving IAM.

If possible, I would rather avoid having to do all that IAM management. It's a technically viable option, but I don't think we have to do it that way.

Member:

I will stop further analysis below since it is rooted in the idea of "presigning", and I probably incorrectly thought that we would like to avoid that for the 100000s of files a zarr will have. Please clarify, @dchiquito and @satra

Contributor:

After consideration my opinion is that presigning is the least bad option. It involves (number of files / files per batch) requests, but ultimately that's much less data than fully proxying the entire upload. Any direct upload scheme like the IAM idea still has to figure out a way to calculate the checksum.

I think this scheme I laid out meets the requirements. For now it would be helpful to identify any shortcomings it has (I listed a few in the Pros/Cons section) so we can modify it or discard it in favor of something else.


## Benchmarks
I mocked up API endpoints that would behave more or less like the ones described above.
I recommend throwing the code away, but it should give a good estimate for performance.
Member:

I strongly suggest not throwing the code away, but instead transferring it to a gist so we don't have to keep a weird branch around in the codebase.

1. A well-defined ordering of the subfiles in the zarr file must exist.
The checksum must be computed on the subfiles in this order.
1. It must be applicable on one subfile at a time.
1. It must be able to save its state between subfiles.
Member:

Save state where? If it's to a file, the only way to do that with Python's hashlib classes is via pickle, which opens up security issues.

Contributor:

My idea here was that the hash must be computable incrementally across an arbitrary number of requests, which means the state must be saved somehow in between requests. My hope is for a hashing scheme that uses intermediate hash values rather than saving the state of a hashlib hasher, as that doesn't have a clean solution.

The scheme I outlined below has this property, as would using Merkle trees to recursively concatenate hashes.
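For comparison, a minimal sketch of the Merkle-style alternative mentioned above, where directory hashes are built by recursively concatenating child hashes (sorting by name and the `name:hash` line format are assumptions):

```python
import hashlib
from pathlib import Path


def merkle_hash(path: Path) -> str:
    """Hash a file by content; hash a directory by its children's sorted name:hash lines.

    Changing one subfile only invalidates the hashes along its path to the root,
    which is what makes partial updates cheap to re-verify.
    """
    if path.is_file():
        return hashlib.sha256(path.read_bytes()).hexdigest()
    lines = sorted(f"{child.name}:{merkle_hash(child)}" for child in path.iterdir())
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()
```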

@satra (Member Author) commented Sep 8, 2021

this is an optimization step, but one that i think we should consider in the design.

i would use a treehash or something similar for the simple reason that updating a zarr file should not require uploading all subfiles. it should only upload the changes and remove any missing pieces (so more like a sync operation).

on the api side, the id of the file can stay the same and the hashes would change. the api would only care about the aggregate hash that it can compute just like sha256 computation now using an out of band process. for the actual upload it just needs to know that all files have been uploaded. so an upload init would require number of files to be uploaded and each batch just needs confirmation that those files have been uploaded. the md5 check is done by AWS.

generally this would work, except for partial updates:

  1. initialize upload with the total number of files and size
  2. repeat (API maintains continuous hash):
    1. request presigned urls with dandi-etag for a batch in ordered tree mode
    2. upload to AWS
    3. update batch uploads on API
  3. complete upload. API stores this hash and should match etag computation on the client side for a folder.

we need to think of a way to do partial updates and still be able to update the overall etag at the point of finishing the update.
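A minimal sketch of the sync-style diff this implies for partial updates (both mappings are relative subfile path -> checksum; how the remote listing would be obtained from the API is left open here):

```python
from typing import Dict, List, Tuple


def plan_zarr_sync(local: Dict[str, str], remote: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (paths to upload, paths to delete) to make the remote zarr match the local one.

    Only new or changed subfiles are uploaded; subfiles that no longer exist
    locally are deleted, after which the aggregate tree checksum is recomputed.
    """
    to_upload = [path for path, checksum in local.items() if remote.get(path) != checksum]
    to_delete = [path for path in remote if path not in local]
    return to_upload, to_delete
```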

@waxlamp (Member) commented Sep 21, 2021

> this is an optimization step, but one that i think we should consider in the design. […]

My understanding is that any approach that supports updateable draft Zarr files will require a much heavier storage footprint than one that doesn't (because file hashes on the order of the number of files in the Zarr archive will need to be recorded until the dandiset is published). As such, it would make sense to include a non-updateable method now because it's simpler to implement and stresses the resources less, while keeping something heavier in our back pocket for later if and when the need arises concretely. I don't object to keeping this design (or one like it) in the design document, but I'm thinking it would be a separate mode of upload added later. So you'd be presented with an option at upload time along the lines of "NWB", "Zarr (static)", "Zarr (updateable)" (of course we can use better names, and offer help text, but this is just for demonstration purposes).

In order to move ahead with this design, are you ok with committing to the simpler, more static mode now, while sketching out a more complex design to possibly be added later?

@satra (Member Author) commented Sep 21, 2021

for this PR, i just wanted to list the current status of one of the dandisets (000108). i have converted about 929 zarr files to hdf5 (about 360 of these are on the archive at the moment) for the current upload (while we implement zarr support). this is about 47TB of data. and there are another 280 files being converted (another 12-15TB of data).

each zarr file in this dataset can have around 700000 files. and it’s a multiresolution stack, so metadata affects about 16 files; the rest are binary chunks. there are multiple acquisitions planned, and the standard for this kind of file continues to evolve, which means metadata changes are inevitable. while we are considering overlay methods in the short term for our h5 files (because some metadata changes simply involve filename adjustments), with zarr this kind of overlay is not useful. zarr is designed for lightweight updates and not designed for provenance tracking.

a whole folder upload should absolutely be the first priority, but given the scales of these datasets, it may become inefficient almost immediately, given that reuploads because of metadata changes will take more than a week of uploading when they would have taken minutes.

so let's implement whole folder upload, but immediately tackle the partial upload/update.

@yarikoptic (Member) commented Sep 21, 2021

I am afraid that as long as we want to retain "zarr hierarchy" within that "zarr folder", we will never be able to come up with a reasonable design for an "updateable" zarr folder on S3. If we are to make them updatable only within draft (until published):

  • how to guarantee that we do not modify an already published zarr? Possible solution: we can refuse to mint upload URLs for anything under that folder if it was already published, and refuse to publish if there is a known (not yet expired?) upload possibly going on.
    • side-effect: if zarr is published and needs to be updated for draft/next publication -- would need a full upload to a new zarr "folderblob" id :-/
  • how to guarantee "integrity" of an existing zarr folder to which we allow to "update"? E.g. after an initial upload, we "finalized" and ensured that it is all consistent (it was not yet published). Then the request came to update it but never got finished/finalized -- that folder might be in a dirty state. Possible solution: besides tree hash, we store versionIds (actually we might be good with just a datestamp whenever upload was initiated!) in the bucket of all keys in that folder upon "finalize". Whenever "update" upload comes, we mark that folder "dirty". Whenever upload is finalized - new treehash/collection of versionIds, unmark "dirty". If upload is not finalized and expired: we remove all keys (and DeleteMarkers) in that folder after the upload initiation date. I believe then the folder would look as it was before upload was initiated

I started to wonder if in the long run we aren't really doomed to provide some layer/service on top which would (similarly to what we have for assets) provide a "zarr view" over a collection of blobs? But I am not even sure if zarr libraries would work this way - whenever a url for a specific file redirects to S3, most likely they would then continue constructing the next URL based off that S3 URL? @satra -- did you check? If it could be done, then we actually could provide quite an efficient solution! (although I am not sure whether it would violate S3 T&C, since those blobs would not be immediately usable, and creating a "zarr manifest" also would be somewhat useless I guess)
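A minimal sketch of the "remove everything written after the upload initiation date" rollback on a versioned bucket (the bucket and prefix names are illustrative; this assumes bucket versioning is enabled and uses only standard boto3 calls):

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def rollback_incomplete_upload(bucket: str, prefix: str, started: datetime) -> None:
    """Delete every object version and delete marker under `prefix` newer than `started`.

    On a versioned bucket this should leave the zarr "folder" looking as it did
    before the abandoned upload was initiated.
    """
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for entry in page.get("Versions", []) + page.get("DeleteMarkers", []):
            if entry["LastModified"] > started:
                s3.delete_object(Bucket=bucket, Key=entry["Key"], VersionId=entry["VersionId"])


# e.g. rollback_incomplete_upload("example-dandi-bucket", "zarr/<zarr-id>/",
#                                 datetime(2021, 9, 21, tzinfo=timezone.utc))
```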

@satra (Member Author) commented Sep 21, 2021

for published + updateable zarr, the current notion of zarr and s3 will not work, since it requires presenting the tree to a reader. this is where something like an ipfs keystore would have to provide an additional layer that supports updating while providing a different key. in the short term, i would say updating published files is a no-go.

in terms of zarr support there are examples for reading files using zarr (https://zarr.readthedocs.io/en/stable/tutorial.html#io-with-fsspec) but they all point to an s3:// url, which is feasible for our public dandiset, but i don't know how it would work with embargo, if embargo goes private. we could even test by uploading one of the ngff files to our bucket somewhere. here is the bucket for zarr's tiny example: http://zarr-demo.s3-eu-west-2.amazonaws.com/

@yarikoptic (Member):

note:

  • there is also "consolidated metadata", which was introduced to overcome the shortcoming of zarr needing to list the entire hierarchy to (IIRC) introduce an update etc.; see e.g. the discussion here. Cons: it requires a dedicated zarr.open_consolidated call to open such a hierarchy, and I am not yet sure if it could be (ab)used to provide views of multiple "versions" - most likely not, but it may be something to recommend to zarr DANDI users?

an idea:

  • similarly to how openneuro uses datalad on their public s3 bucket, we use S3 versioning
  • upload to the zarrfolder goes normally. S3 "view" of that zarrfolder always corresponds to the latest/"draft" version
    • users can read that draft version (or maybe a published one, if there are no changes in draft) directly from S3
    • we can provide the "cleanup of incomplete upload" as I described above to 'fsck' the zarr on S3
  • zarr supports custom stores
    • upon "finalize" of the upload we establish the listing of versions for the keys in that upload. Populate some extra file with information within S3 (thus "self-containing" all necessary information on S3, so in principle could be used directly, e.g. .zarrdandi.{version})
    • we provide a custom store which would then allow accessing any specific published version of an archive based on those .zarrdandi.{version} files (see the sketch at the end of this comment)
  • we advocate using this custom store if reproducibility is desired/required
    • nothing unprecedented really. draft on S3 is just draft and not guaranteed to be "immutable"
  • we do not have to "immediately" implement it. Just implement upload to a zarrfolder as being worked out in this PR to start with (possibly even without worrying about removing some stale components in the zarr hierarchy? although ideally it should clean up), and then if we decide to proceed with the above - we can develop it. We could even "mint" those .zarrdandi files later based on key history metadata/timestamps (pretty much similarly to what the S3 datalad-crawler does).
  • cons: S3 specific since would rely on versioning

relevant discussions: zarr-developers/zarr-specs#76 (Versioned Arrays)
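A minimal sketch of such a custom read-only store (zarr v2 treats a store as a dict-like mapping of keys such as ".zarray" or "0/0/0" to bytes); the .zarrdandi.{version} manifest format, bucket name, and key layout are assumptions carried over from the bullets above:

```python
from collections.abc import Mapping
from typing import Dict

import boto3


class VersionPinnedS3Store(Mapping):
    """Read-only zarr store where every key is pinned to a recorded S3 version ID."""

    def __init__(self, bucket: str, prefix: str, manifest: Dict[str, str]):
        self._s3 = boto3.client("s3")
        self._bucket = bucket
        self._prefix = prefix.rstrip("/") + "/"
        self._manifest = manifest  # zarr key -> S3 VersionId, e.g. loaded from .zarrdandi.{version}

    def __getitem__(self, key: str) -> bytes:
        # Missing keys raise KeyError via the manifest lookup, as zarr expects.
        resp = self._s3.get_object(
            Bucket=self._bucket,
            Key=self._prefix + key,
            VersionId=self._manifest[key],
        )
        return resp["Body"].read()

    def __iter__(self):
        return iter(self._manifest)

    def __len__(self) -> int:
        return len(self._manifest)


# e.g. zarr.open(VersionPinnedS3Store("example-dandi-bucket", "zarr/<zarr-id>", manifest), mode="r")
```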

@satra (Member Author) commented Sep 21, 2021

s3 versioning would get around some of the issues in the short term. whatever we end up doing, zarrfolder upload needs to be implemented. so i would suggest we move ahead with that.

@dchiquito marked this pull request as draft on October 5, 2021 at 20:57
@waxlamp mentioned this pull request on Oct 18, 2021
@dchiquito marked this pull request as ready for review on October 20, 2021 at 21:18
@dchiquito (Contributor):
Merging this into the archived design docs directory in favor of #574

@dchiquito merged commit a064d2a into master on Oct 20, 2021
@dchiquito deleted the enh-zarr-support-design branch on October 20, 2021 at 22:44
kabilar added a commit to kabilar/dandi-archive referencing this pull request on Jan 30, 2025: "Add code of conduct and Netlify link to footer"