Design for Zarr support #295
Conversation

This design doc is intended to help us move towards Zarr support on the server. It is increasingly clear that we will need this soon. We are forcing people to use HDF5 or TIFF in the short term, but will need to move to NGFF, which uses Zarr. This may also come into play for NWB at some point.
doc/design/zarr-support.md (outdated):
1. An asset is associated with a single zarr folder. From a user perspective this is still a single asset and the UI
   should not try to delve into the structure of the folder. The CLI should be able to download the entire tree. Matt
   at kitware is looking into IPFS + NGFF, so we should at least keep that in mind.
Note: on the DataLad end I might make each .zarr into a dedicated dataset. There are cons to that (e.g. no shared keystore between different .zarr files sharing some data), but it is the only way I see to do it in a scalable fashion.
doc/design/zarr-support.md (outdated):
Given these considerations here are questions for implementation
1. Is there a way to upload a folder to a given prefix using an API key without having to create 100k signed URLs?
1. Should the tree structure be stored somewhere so that diffs can be ascertained?
1. Given that each zarr file may contain 100k+ files, how will dandi-cli handle alterations?
I think we should be able to make it an "ok" experience. It would likely be slow. We might need to optimize "directory" support in fscacher, which could come in handy for the composite etag computation etc.
doc/design/zarr-support.md (outdated):
1. An asset is associated with a single zarr folder. From a user perspective this is still a single asset and the UI
   should not try to delve into the structure of the folder. The CLI should be able to download the entire tree. Matt
   at kitware is looking into IPFS + NGFF, so we should at least keep that in mind.
3. Blob store allows for a folder, which contains the zarr named "locations" and data. That is given a root prefix,
Not yet sure if it wouldn't be wiser to keep that folder, e.g. folderblobs/, separate from blobs/.
I like folderblobs!
Co-authored-by: Yaroslav Halchenko <[email protected]>
doc/design/zarr-support.md (outdated):
1. zarr files are stored in a "directory" in S3.
1. Each zarr file corresponds to a single Asset.
1. The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr file. The API verifies this checksum _immediately_ after upload; it's not good enough to download the entire zarr file to calculate it after upload.
1. The system can theoretically handle zarr files with ~1 million subfiles, each of size 64 * 64 * 64 bytes ~= 262 kilobytes.
Now I remember why each file is smaller: the chunks are compressed. And the file-size calculation is 64 * 64 * 64 * datatype_bytes.
Suggested change:
- 1. The system can theoretically handle zarr files with ~1 million subfiles, each of size 64 * 64 * 64 bytes ~= 262 kilobytes.
+ 1. The system can theoretically handle zarr files with ~1 million subfiles, each of size `zip(64 * 64 * 64 * {datatype}) bytes ~<~ 262 kilobytes`.
Do you have an estimate for an upper bound for the file size?
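For illustration, a rough sketch of the chunk-size arithmetic implied above; the 64 x 64 x 64 chunk shape comes from the numbers quoted in the doc, and actual on-disk sizes will be smaller because chunks are compressed:

```python
import numpy as np

# Uncompressed upper bound for a single zarr chunk, assuming 64x64x64 chunks;
# real chunks on S3 are smaller after compression (e.g. blosc).
chunk_shape = (64, 64, 64)
for dtype in ("uint8", "uint16", "float32"):
    nbytes = int(np.prod(chunk_shape)) * np.dtype(dtype).itemsize
    print(f"{dtype}: {nbytes // 1024} KiB uncompressed")
# uint8: 256 KiB, uint16: 512 KiB, float32: 1024 KiB
```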
doc/design/zarr-support.md (outdated):
]
```
3. API responds with a corresponding list of presigned upload URLs (**TODO** where to upload?) in the S3 bucket.
   The size limit for each upload is 5GB.
since we are doing a single upload (as opposed to multipart) and the etag is being computed, we could build the md5 into the presigning process.
dandi-etag is the S3 ETag, which for this case is just the MD5 of the file, so it already is, basically. Do you mean that we should use the etag to generate the presigned URL so that only a file with that etag can be uploaded?
> so that only a file with that etag can be uploaded

Yes - we couldn't do it for multipart, but it should work for single part.
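As a sketch of what building the MD5 into the presigning process could look like with boto3 (bucket and key names here are placeholders, not the actual dandi-api layout):

```python
import base64
import boto3

s3 = boto3.client("s3")

def presign_put_with_md5(bucket: str, key: str, md5_digest: bytes, expires: int = 3600) -> str:
    # Content-MD5 becomes part of the signature, so S3 rejects any single-part
    # PUT whose body does not hash to md5_digest; the uploader must send the
    # same Content-MD5 header with the request.
    return s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": bucket,
            "Key": key,
            "ContentMD5": base64.b64encode(md5_digest).decode(),
        },
        ExpiresIn=expires,
    )
```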
doc/design/zarr-support.md (outdated):
   should not try to delve into the structure of the folder. The CLI should be able to download the entire tree. Matt
   at kitware is looking into IPFS + NGFF, so we should at least keep that in mind.
3. Blob store allows for a folder, which contains the zarr named "locations" and data. That is given a root prefix,
   a zarr-compliant software can navigate the zarr metadata/structure using relative path rules.
Later we might end up with non-zarr folders. Should we include an indication of the underlying "folder format" within the folder or the subfolder name? E.g. it could be d65b541b-885a-4bb4-badd-2a57b1bebab0.zarr, or maybe better d65b541b-885a-4bb4-badd-2a57b1bebab0/zarr/. This way we can actually support storing multiple representations of the same data within the blob store (e.g. KEY - original NWB or whatnot, KEY/zarr/ - zarr, KEY/ipfs/ - IPFS blocks if that is a thing ;)) without causing conflicts/ambiguity when looking at a specific PREFIX and its immediate "sub-folders".
I think it would be better to store things at zarr/KEY/ and ipfs/KEY/, much like we already store things in blobs/KEY/. We can still have the same KEY in multiple stores.
Works for me. Then the wording above should avoid "Blob store", since that is blobs/ to me, and instead mention the zarr/ store and a layout within it that matches the one of blobs/.
On the DB/API side it then wouldn't be the blobs table/endpoint, right?
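For concreteness, a hypothetical key layout along the lines discussed; the UUID is just the example from the comment above, and the blobs/ sharding shown is illustrative rather than the exact existing scheme:

```
blobs/d65/b54/d65b541b-885a-4bb4-badd-2a57b1bebab0    # existing single-file blobs
zarr/d65b541b-885a-4bb4-badd-2a57b1bebab0/.zgroup     # zarr store rooted at its own prefix
zarr/d65b541b-885a-4bb4-badd-2a57b1bebab0/0/0/0/0     # chunk files under the same prefix
ipfs/d65b541b-885a-4bb4-badd-2a57b1bebab0/...         # other representations reuse the same KEY in a different store
```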
doc/design/zarr-support.md (outdated):
A simple scheme that I think would work:

* initial value is `sha256("{etag}:{path}")` for the first subfile
* the next value is `sha256("{prev_sha256}:{etag}:{path}")`, ad infinitum
If we are not going to go for a proper "tree hash" of some kind, which would provide a more efficient way of computing besides "serial", I think we should just use the same dandi-etag-ing approach: `{md5(sorted((files_etags: dict).items()))}-{len(files_etags)}`, and call it dandi-treetag or alike.
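A minimal sketch of this proposal, assuming a plain mapping of relative path to per-file etag (the function name is illustrative, not an existing dandi helper):

```python
import hashlib
from typing import Dict

def dandi_treetag(files_etags: Dict[str, str]) -> str:
    # md5 over the sorted (path, etag) pairs, suffixed with the file count,
    # mirroring the "{md5}-{parts}" shape of dandi-etag.
    h = hashlib.md5()
    for path, etag in sorted(files_etags.items()):
        h.update(f"{path}:{etag}".encode())
    return f"{h.hexdigest()}-{len(files_etags)}"
```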
Actually, we might just keep it named dandi-etag so the DB and API stay consistent across the many existing endpoints regardless of whether it is a file or a directory.
One of the benefits of this particular upload algorithm is that it doesn't need to store the hashes of all the uploaded subfiles, just a single checksum that is updated with each subfile uploaded. Your algorithm involves hashing all the etags at the end, so it does not have that benefit.
This is a viable alternative if we decide on a different scheme that does store the hashes, though.
Are you expecting each client to finalize each file upload in the specified order even when uploading async/in parallel, where some larger files might clog reporting on smaller files' uploads?
An MD5 hash can be updated with new data as more comes in (that is how it is computed ATM on a stream, isn't it?), so there is no need to store them all - just an implementation detail IMHO. Or am I wrong?
For a single-part upload the presigned URL can ensure that the correct content is uploaded.
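For reference, a sketch of the rolling-checksum scheme quoted at the top of this thread; only the latest hex digest needs to be persisted between batches (the function name is illustrative):

```python
import hashlib
from typing import Optional

def update_running_checksum(prev: Optional[str], etag: str, path: str) -> str:
    # First subfile: sha256("{etag}:{path}"); afterwards the previous digest is
    # prepended, so per-subfile hashes never need to be stored server-side.
    payload = f"{etag}:{path}" if prev is None else f"{prev}:{etag}:{path}"
    return hashlib.sha256(payload.encode()).hexdigest()
```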
doc/design/zarr-support.md (outdated):
### Before upload (technically optional)
1. CLI calculates the checksum of the zarr file.
1. CLI queries the API (**TODO** URL?) if the checksum has already been uploaded.
Exactly the same as for an upload of an individual file - why should there be a difference?
doc/design/zarr-support.md (outdated):
1. If so, it proceeds with the already uploaded zarr file and skips upload entirely.

### Upload
1. CLI queries the API (**TODO** URL?) with the checksum of the zarr file to initiate an upload. The API creates an upload UUID, records the checksum, and initializes a "running checksum" to `null`.
The same /uploads/initiate, with contentSize being the sum of all file sizes (although not used for the checksum computation, it is needed anyway for DB/web UI display), but either we have a different name for the checksum or we provide an explicit extra option to specify format='zarr' (defaulting to 'file', as that is what we support now).
I left the actual API definition for later so we can discuss it after we agree on the requirements and that the overall implementation will meet those requirements.
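A hypothetical request body for such an extended initiate call, purely to illustrate the shape being discussed; the endpoint path, field names, and digest label are assumptions, not a settled API:

```python
import requests

# Hypothetical zarr variant of an /uploads/initiate/ call; values are placeholders.
resp = requests.post(
    "https://api.example.org/api/uploads/initiate/",
    json={
        "contentSize": 123456789,  # sum of all subfile sizes
        "digest": {"algorithm": "dandi:dandi-etag", "value": "<zarr checksum>"},
        "format": "zarr",          # proposed new option, defaulting to "file"
    },
    headers={"Authorization": "token <api-key>"},
)
resp.raise_for_status()
```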
doc/design/zarr-support.md (outdated):
### Upload
1. CLI queries the API (**TODO** URL?) with the checksum of the zarr file to initiate an upload. The API creates an upload UUID, records the checksum, and initializes a "running checksum" to `null`.
1. CLI requests a batch of presigned URLs (**TODO** URL?).
   The files must be ordered in the same way used to calculate the checksum.
I thought the idea was to operate by providing IAM credentials with write access to the target upload prefix. If that is the case, there is no need for presigned URLs; rather, instead of parts, the response would provide IAM credentials (access key, secret key, access token, expiration). What we might need, though, is a dedicated API endpoint to provide renewed credentials if the prior ones are (about to) expire(d).
If we were to operate on a per-file upload basis (thus a ridiculous amount of presigning) and provide this list, I would not rely on "ordered in the same way". The ordering should be defined explicitly by the checksumming algorithm (we need to assume that the CLI and dandi-api both use the same algorithm), which would sort (or not) internally and consistently. We should not rely on sorting external to the algorithm to provide that sorted order to both the checksum computation and this upload endpoint.
This design doesn't involve IAM credentials. If it doesn't solve the requirements, we can throw out this plan and come up with something else, possibly involving IAM.
If possible, I would rather avoid having to do all that IAM management. It's a technically viable option, but I don't think we have to do it that way.
I will stop further analysis below since it is rooted in the idea of "presigning", and I probably incorrectly thought that we would like to avoid that for the 100,000s of files a zarr will have. Please clarify, @dchiquito and @satra.
After consideration, my opinion is that presigning is the least bad option. It involves (number of files / files per batch) requests, but ultimately that's much less data than fully proxying the entire upload. Any direct upload scheme like the IAM idea has to figure out a way to calculate the checksum.
I think the scheme I laid out meets the requirements. For now it would be helpful to identify any shortcomings it has (I listed a few in the Pros/Cons section) so we can modify it or discard it in favor of something else.
doc/design/zarr-support.md (outdated):
## Benchmarks
I mocked up API endpoints that would behave more or less like the ones described above.
I recommend throwing the code away, but it should give a good estimate for performance.
I strongly suggest not throwing the code away, but instead transferring it to a gist so we don't have to keep a weird branch around in the codebase.
doc/design/zarr-support.md (outdated):
1. A well-defined ordering of the subfiles in the zarr file must exist.
   The checksum must be computed on the subfiles in this order.
1. It must be applicable on one subfile at a time.
1. It must be able to save its state between subfiles.
Save state where? If it's to a file, the only way to do that with Python's hashlib classes is via pickle, which opens up security issues.
My idea here was that the hash must be computable incrementally across an arbitrary number of requests, which means the state must be saved somehow in between requests. My hope is for a hashing scheme that uses intermediate hash values rather than saving the state of a hashlib hasher, as that doesn't have a clean solution.
The scheme I outlined below has this property, as would using Merkle trees to recursively concatenate hashes.
This is an optimization step, but one that I think we should consider in the design. I would use a treehash or something similar for the simple reason that updating a zarr file should not require uploading all subfiles; it should only upload the changes and remove any missing pieces (so more like a sync operation). On the API side, the id of the file can stay the same and the hashes would change. The API would only care about the aggregate hash, which it can compute, just like the sha256 computation now, using an out-of-band process. For the actual upload it just needs to know that all files have been uploaded, so an upload init would require the number of files to be uploaded, and each batch just needs confirmation that those files have been uploaded. The MD5 check is done by AWS. Generally this would work, except for partial updates:
we need to think of a way to do partial updates and still be able to update the overall etag at the point of finishing the update.
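A sketch of the sync-style partial update described above, assuming the API could return the current {relative_path: md5} mapping for a zarr (a hypothetical listing, not an existing endpoint):

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple

def plan_zarr_sync(local_root: Path, remote_etags: Dict[str, str]) -> Tuple[List[str], List[str]]:
    # Compare the local zarr tree against the remote per-file MD5s and return
    # (paths to upload or replace, remote paths to delete).
    to_upload: List[str] = []
    local_paths = set()
    for f in local_root.rglob("*"):
        if not f.is_file():
            continue
        rel = f.relative_to(local_root).as_posix()
        local_paths.add(rel)
        md5 = hashlib.md5(f.read_bytes()).hexdigest()  # fine for ~256 KiB chunks
        if remote_etags.get(rel) != md5:
            to_upload.append(rel)
    to_delete = [p for p in remote_etags if p not in local_paths]
    return to_upload, to_delete
```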
My understanding is that any approach that supports updateable draft Zarr files will require a much heavier storage footprint than one that doesn't (because file hashes on the order of the number of files in the Zarr archive will need to be recorded until the dandiset is published). As such, it would make sense to include a non-updateable method now because it's simpler to implement and stresses the resources less, while keeping something heavier in our back pocket for later if and when the need arises concretely. I don't object to keeping this design (or one like it) in the design document, but I'm thinking it would be a separate mode of upload added later. So you'd be presented with an option at upload time along the lines of "NWB", "Zarr (static)", "Zarr (updateable)" (of course we can use better names, and offer help text, but this is just for demonstration purposes). In order to move ahead with this design, are you ok with committing to the simpler, more static mode now, while sketching out a more complex design to possibly be added later?
For this PR, I just wanted to list the current status of one of the dandisets (000108). I have converted about 929 zarr files to HDF5 (about 360 of these are on the archive at the moment) for the current upload (while we implement zarr support). This is about 47TB of data, and there is another 280 files being converted (another 12 - 15TB of data). Each zarr file in this dataset can have around 700000 files, and it's a multiresolution stack, so metadata affects about 16 files; the rest are binary chunks. There are multiple acquisitions planned, and the standard for this kind of file continues to evolve, which means metadata changes are inevitable. While we are considering overlay methods in the short term for our h5 files, because some metadata changes simply involve filename adjustments, with zarr this kind of overlay is not useful. Zarr is designed for lightweight updates and not for provenance tracking. A whole-folder upload should absolutely be the first priority, but given the scales of these datasets it may be inefficient almost immediately, since reuploads because of metadata changes will take more than a week of uploading when it would have taken minutes. So let's implement whole-folder upload, but immediately tackle the partial upload/update.
I am afraid that as long as we want to retain "zarr hierarchy" within that "zarr folder", we will never be able to come up with a reasonable design for an "updateable" zarr folder on S3. If we are to make them updatable only within
I started to wonder if in the long run we aren't really doomed to provide some layer/service on top which would (similarly to what we have for assets) provide a "zarr view" over a collection of blobs? But I'm not even sure whether zarr libraries would work this way - whenever a URL for a specific file redirects to S3, most likely they would then continue constructing the next URL based off that S3 URL? @satra - did you check? If it could be done, then we actually could provide quite an efficient solution! (Although not sure it would not violate S3 T&C, since those blobs would not be immediately usable, and creating a "zarr manifest" also would be somewhat useless, I guess.)
For published + updateable zarr, the current notion of zarr and S3 will not work, since it requires presenting the tree to a reader. This is where something like an IPFS keystore would have to provide an additional layer that supports updating while providing a different key. In the short term, I would say updating published files is a no go. In terms of zarr support, there are examples for reading files using zarr (https://zarr.readthedocs.io/en/stable/tutorial.html#io-with-fsspec), but they all point to an
Relevant discussions: zarr-developers/zarr-specs#76 (Versioned Arrays)
S3 versioning would get around some of the issues in the short term. Whatever we end up doing, zarr folder upload needs to be implemented, so I would suggest we move ahead with that.
Merging this into the archived design docs directory in favor of #574