3rd draft of zarr design doc #574
Conversation
Clarify wording in zarr design doc
Co-authored-by: Satrajit Ghosh <[email protected]>
Some comments/questions on the initial part (up to the storage implementation section).
doc/design/zarr-support-3.md (outdated diff)
1. Zarr archives are stored in a "directory" in S3.
1. Each zarr archive corresponds to a single Asset.
1. The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr archive.
1. The API verifies the checksum _immediately_ after upload.
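For context, a tree hashing scheme over a directory could look something like this minimal sketch (the bottom-up MD5-of-children approach and the `checksum_directory` helper are illustrative assumptions, not the CLI's actual scheme):

```python
import hashlib
import os

def checksum_directory(root):
    # Walk bottom-up: a file's digest is the MD5 of its contents; a
    # directory's digest is the MD5 of its sorted children's (name, digest)
    # pairs.  Changing one file therefore changes every ancestor's digest.
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        h = hashlib.md5()
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), "rb") as f:
                h.update(f"{name}:{hashlib.md5(f.read()).hexdigest()}".encode())
        for name in sorted(dirnames):
            h.update(f"{name}/:{digests[os.path.join(dirpath, name)]}".encode())
        digests[dirpath] = h.hexdigest()
    return digests[root]
```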
`immediately` worries me a bit as used here, since we don't yet know what it entails; if tree hash parts are to be stored in S3, it might take a while even to fetch them...
Suggested change:
- 1. The API verifies the checksum _immediately_ after upload.
+ 1. The API verifies the checksum immediately after upload, so the validation status is available in the response to the upload REST request.
The alternative would be validating asynchronously, which I was told was not an option. The scheme described here requires a request to S3 for each directory in the paths being updated (a/b/c/d/e/f = 5 requests), which should be nowhere near the 30s Heroku request timeout.
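A minimal sketch of the request counting above (the `directories_to_update` helper is hypothetical):

```python
from pathlib import PurePosixPath

def directories_to_update(path):
    # Every ancestor "directory" of the changed file needs its stored
    # checksum fetched from S3 and recomputed.
    return [str(p) for p in PurePosixPath(path).parents if str(p) != "."]

print(directories_to_update("a/b/c/d/e/f"))
# ['a/b/c/d/e', 'a/b/c/d', 'a/b/c', 'a/b', 'a']  -> 5 S3 requests
```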
Is that implemented somewhere, so I could check the code? I am still a bit lost, since an upload is a batch of files which could come from various directories, so it could in principle (e.g. one file per directory) require traversal of even more "directories" (up the hierarchy) than there are files.
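For what it's worth, if the implementation deduplicates ancestors across the batch, the per-batch work is bounded by the number of *distinct* directories touched rather than files times depth; a minimal sketch (helper name hypothetical):

```python
from pathlib import PurePosixPath

def directories_for_batch(paths):
    # Union of every file's ancestors; shared parents collapse into one
    # checksum update each.
    dirs = set()
    for path in paths:
        dirs.update(str(p) for p in PurePosixPath(path).parents if str(p) != ".")
    return dirs

# a single deeply nested file: more directories than files, as noted above
print(sorted(directories_for_batch(["a/b/c/d/e/f"])))              # 5 dirs, 1 file
# siblings share ancestors, so duplicates collapse
print(sorted(directories_for_batch(["a/b/x", "a/b/y", "a/c/z"])))  # 3 dirs, 3 files
```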
* **DELETE /api/zarr/{zarr_id}/upload/**

  Cancels a batch upload.
  Any files already uploaded are deleted.
Might take a while, I guess.
I think it would be pretty quick, and if worst comes to worst we can do it asynchronously and just freeze the zarr file from new uploads until everything is deleted.
This is one of the factors that needs to be considered when choosing the batch size limit.
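For a sense of why the delete should be quick: S3's `DeleteObjects` call removes up to 1,000 keys per request, so even a large cancelled batch needs few round-trips. A minimal sketch with boto3 (the bucket name and the source of the key list are assumptions):

```python
import boto3

def delete_uploaded_files(bucket, keys):
    # DeleteObjects accepts at most 1000 keys per request, so chunk the list.
    s3 = boto3.client("s3")
    for i in range(0, len(keys), 1000):
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in keys[i : i + 1000]]},
        )
```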
  Cancels a batch upload.
  Any files already uploaded are deleted.
  A new batch upload can then be started.
How would they know that the already-uploaded files have been deleted? Would the DELETE not return until everything is deleted?
That's what I was picturing, yes. If that's not feasible they can simply retry starting the next batch until the asynchronous delete is completed.
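Client-side, that retry could look roughly like this (a sketch; the POST verb on the upload endpoint and a 409 "delete still in progress" response are assumptions, not the specified API):

```python
import time
import requests

def start_batch_with_retry(api_url, zarr_id, files, retries=30, delay=2.0):
    # Keep retrying the batch-start request until the asynchronous delete
    # from the cancelled batch has finished.
    for _ in range(retries):
        resp = requests.post(f"{api_url}/api/zarr/{zarr_id}/upload/", json=files)
        if resp.status_code != 409:  # assumed "still deleting" signal
            resp.raise_for_status()
            return resp.json()
        time.sleep(delay)
    raise RuntimeError("zarr archive still locked by a pending delete")
```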
  Requires a `zarr_id` in the request body instead of a `blob_id`.
  Returns an asset ID.

When added to a dandiset, zarr archives will appear as a normal `Asset` in all the asset API endpoints.
What would happen on the `/dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/{asset_id}/download/` endpoint for an asset which is a zarr file and not a blob?
A 400 error, or something to that effect.
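In Django REST Framework terms, that could be as simple as the following sketch (the `asset.blob` nullability check and `asset.blob.url` are assumptions about the schema, not the actual implementation):

```python
from rest_framework import status
from rest_framework.response import Response

def download(request, asset):
    # Zarr-backed assets have no single blob to redirect to.
    if asset.blob is None:
        return Response(
            {"detail": "Asset is a zarr archive; it cannot be downloaded as a single blob."},
            status=status.HTTP_400_BAD_REQUEST,
        )
    # Normal blob assets redirect to the S3 object.
    return Response(status=status.HTTP_302_FOUND, headers={"Location": asset.blob.url})
```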
Co-authored-by: Yaroslav Halchenko <[email protected]>
We need to get moving on implementing this stuff, so I'm going to go ahead and merge this. If there's any redesigning that needs to happen, that can go in a new PR.
Everything required for zarr, except publishing.
I've already started implementing this as written, with no issues.
Outstanding requests from the last PR:
* Call the `dandiarchive/zarr/...` directory in S3 something else, to reflect that we can store any kind of folder structure in it. I think that we are building this for zarr files, and we are only planning on putting zarr files in it. If there are other formats that we want to explicitly include, we can always create new directories for those formats.
* Store the `.checksum` files in `dandiarchive/zarr/ZARR_ID.checksum/...` instead of `dandiarchive/zarr_checksums/ZARR_ID/`. I like having two directories for two kinds of files, rather than having `dandiarchive/zarr` contain pairs of directories, but I don't feel that strongly.
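For reference, the two S3 layouts under discussion, as I read the bullets above (a sketch of the structure, not the final naming):

```
# current draft: two top-level directories, one per kind of file
dandiarchive/zarr/ZARR_ID/...             # zarr archive contents
dandiarchive/zarr_checksums/ZARR_ID/...   # .checksum files

# requested alternative: checksums live beside the archive they describe
dandiarchive/zarr/ZARR_ID/...
dandiarchive/zarr/ZARR_ID.checksum/...
```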