3rd draft of zarr design doc #574

Merged: 8 commits from enh-zarr-support-design-3 into master, Dec 15, 2021
Conversation

@dchiquito (Contributor) commented Oct 20, 2021:

Everything required for zarr, except publishing.

I've already started implementing this as written, with no issues.

Outstanding requests from the last PR:

  • calling the dandiarchive/zarr/... directory in S3 something else, to reflect that we can store any kind of folder structure in it. I think that we are building this for zarr files, and we are only planning on putting zarr files in it. If there are other formats that we want to explicitly include, we can always create new directories for those formats.
  • Keeping .checksum files in dandiarchive/zarr/ZARR_ID.checksum/... instead of dandiarchive/zarr_checksums/ZARR_ID/. I like having two directories for two kinds of files, rather than having dandiarchive/zarr contain pairs of directories, but I don't feel that strongly.

dchiquito and others added 2 commits November 22, 2021 11:30
Clarify wording in zarr design doc

Co-authored-by: Satrajit Ghosh <[email protected]>
@yarikoptic (Member) left a comment:

some comments/questions on the initial part (until storage implementation)

1. Zarr archives are stored in a "directory" in S3.
1. Each zarr archive corresponds to a single Asset.
1. The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr archive.
1. The API verifies the checksum _immediately_ after upload.

@yarikoptic (Member):

immediately worries me a bit as used here, since we don't yet know what it entails: if the tree hash parts are to be stored in S3, they might take a while to even fetch...

@dchiquito (Contributor, Author):

Suggested change:
- 1. The API verifies the checksum _immediately_ after upload.
+ 1. The API verifies the checksum immediately after upload, so the validation status is available in the response to the upload REST request.

The alternative would be validating asynchronously, which I was told was not an option. The scheme described here requires a request to S3 for each directory in the paths being updated (a/b/c/d/e/f = 5 requests), which should be nowhere near the 30s Heroku request timeout.
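
A minimal sketch of the update walk being described, assuming the checksums live in per-directory listing objects; the `fetch_listing`/`store_listing` helpers, the in-memory store, and the listing format are assumptions standing in for the S3 reads and writes, not the actual implementation:

```python
import hashlib
from pathlib import PurePosixPath

_STORE: dict = {}  # in-memory stand-in for the per-directory .checksum objects in S3

def fetch_listing(zarr_id, directory):  # hypothetical helper: one S3 GET
    return dict(_STORE.get((zarr_id, str(directory)), {}))

def store_listing(zarr_id, directory, listing):  # hypothetical helper: one S3 PUT
    _STORE[(zarr_id, str(directory))] = listing

def update_tree_checksum(zarr_id: str, path: str, file_md5: str) -> None:
    """Propagate a file's new digest up the directory tree.

    Each loop iteration fetches one directory listing, so a path like
    a/b/c/d/e/f touches a/b/c/d/e, a/b/c/d, a/b/c, a/b, a (and the root).
    """
    entry = PurePosixPath(path)
    digest = file_md5
    for directory in entry.parents:  # walks upward toward the zarr root
        listing = fetch_listing(zarr_id, directory)  # the per-directory S3 request
        listing[entry.name] = digest
        # A directory's checksum is a digest over its sorted entries.
        digest = hashlib.md5(
            ";".join(f"{n}:{d}" for n, d in sorted(listing.items())).encode()
        ).hexdigest()
        store_listing(zarr_id, directory, listing)
        entry = directory

update_tree_checksum("abc123", "a/b/c/d/e/f", "d41d8cd98f00b204e9800998ecf8427e")
```

The per-file cost is one fetch per ancestor directory, which is what keeps the work bounded relative to the 30s timeout.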

@yarikoptic (Member):

Is that implemented yet, so I could check the code? I am still a bit lost, since an upload is a batch of files which could come from various directories, so it could in principle (e.g., one file per directory) require traversing even more "directories" (up the hierarchy) than there are files.
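
To make the concern concrete: the number of listing updates for a batch is the size of the union of all ancestor directories, which can exceed the file count when files are spread one per directory. A small illustration with hypothetical batch paths:

```python
from pathlib import PurePosixPath

def directories_to_update(batch_paths):
    """Union of every ancestor directory (including the zarr root)
    touched by a batch; each one needs its listing refetched."""
    dirs = set()
    for path in batch_paths:
        dirs.update(PurePosixPath(path).parents)
    return dirs

# One file per directory: 3 uploaded files, but 6 directory listings to update.
batch = ["a/x/1.bin", "a/y/2.bin", "b/z/3.bin"]
print(len(directories_to_update(batch)))  # -> 6
```

Whether that union stays small enough for the request timeout depends on how deeply the batch's paths nest, not just on the batch size.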

* **DELETE /api/zarr/{zarr_id}/upload/**

Cancels a batch upload.
Any files already uploaded are deleted.

@yarikoptic (Member):

might take a while, I guess

@dchiquito (Contributor, Author):

I think it would be pretty quick, and if worst comes to worst we can do it asynchronously and just freeze the zarr file against new uploads until everything is deleted.
This is one of the factors that needs to be considered when choosing the batch size limit.


Cancels a batch upload.
Any files already uploaded are deleted.
A new batch upload can then be started.

@yarikoptic (Member):

how would the client know that the already-uploaded files have been deleted? Would DELETE not return until all are deleted?

@dchiquito (Contributor, Author):

That's what I was picturing, yes. If that's not feasible they can simply retry starting the next batch until the asynchronous delete is completed.
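
A sketch of that client-side fallback; the endpoint shape and the 409 "delete still in progress" status are assumptions, not part of the design doc:

```python
import time
import requests

def start_next_batch(api_url: str, zarr_id: str, files: list) -> dict:
    """Keep retrying the next batch upload until the asynchronous
    delete has finished and the zarr is unfrozen (hypothetical protocol)."""
    while True:
        resp = requests.post(f"{api_url}/zarr/{zarr_id}/upload/", json=files)
        if resp.status_code != 409:
            resp.raise_for_status()
            return resp.json()
        time.sleep(1)  # deletion still running; try again shortly
```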

Requires a `zarr_id` in the request body instead of a `blob_id`.
Returns an asset ID.

When added to a dandiset, zarr archives will appear as a normal `Asset` in all the asset API endpoints.

@yarikoptic (Member):

what would happen at the /dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/{asset_id}/download/ endpoint for an asset which is a zarr archive and not a blob?

@dchiquito (Contributor, Author):

A 400 error, or something to that effect.
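
For illustration, a guard along these lines in the download view would produce that behavior; this is a sketch assuming Django REST Framework, with the `asset.blob` field and `presigned_url` helper as illustrative names:

```python
from rest_framework import status
from rest_framework.response import Response

def asset_download(request, asset):
    """Hypothetical download handler: blob-backed assets redirect to S3,
    zarr-backed assets get a 400 since there is no single file to serve."""
    if asset.blob is None:  # zarr asset rather than a blob
        return Response(
            {"detail": "Zarr assets do not support direct download."},
            status=status.HTTP_400_BAD_REQUEST,
        )
    return Response(
        status=status.HTTP_302_FOUND,
        headers={"Location": asset.blob.presigned_url()},
    )
```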

@dchiquito (Contributor, Author) commented:

We need to get moving on implementing this stuff, so I'm going to go ahead and merge this. If there's any redesigning that needs to happen, that can go in a new PR.

@dchiquito dchiquito merged commit 3c21f27 into master Dec 15, 2021
@dchiquito dchiquito deleted the enh-zarr-support-design-3 branch December 15, 2021 22:49