3rd draft of zarr design doc #574

Merged: 8 commits from enh-zarr-support-design-3 into master, Dec 15, 2021
Conversation

@dchiquito (Contributor) commented Oct 20, 2021:

Everything required for zarr, except publishing.

I've already started implementing this as written, with no issues.

Outstanding requests from the last PR:

  • calling the dandiarchive/zarr/... directory in S3 something else, to reflect that we can store any kind of folder structure in it. I think that we are building this for zarr files, and we are only planning on putting zarr files in it. If there are other formats that we want to explicitly include, we can always create new directories for those formats.
  • Keeping .checksum files in dandiarchive/zarr/ZARR_ID.checksum/... instead of dandiarchive/zarr_checksums/ZARR_ID/. I like having two directories for two kinds of files, rather than having dandiarchive/zarr contain pairs of directories, but I don't feel that strongly.

dchiquito and others added 2 commits November 22, 2021 11:30
Clarify wording in zarr design doc

Co-authored-by: Satrajit Ghosh <[email protected]>
@yarikoptic (Member) left a comment:

some comments/questions on the initial part (until storage implementation)

1. Zarr archives are stored in a "directory" in S3.
1. Each zarr archive corresponds to a single Asset.
1. The CLI uses some kind of tree hashing scheme to compute a checksum for the entire zarr archive.
1. The API verifies the checksum _immediately_ after upload.

@yarikoptic (Member):

immediately worries me a bit as used here, since we don't yet know what it entails: if the tree hash parts are to be stored in S3, they might take a while to even fetch...

@dchiquito (Contributor, Author):

Suggested change:
- 1. The API verifies the checksum _immediately_ after upload.
+ 1. The API verifies the checksum immediately after upload, so the validation status is available in the response to the upload REST request.

The alternative would be validating asynchronously, which I was told was not an option. The scheme described here requires a request to S3 for each directory in the paths being updated (a/b/c/d/e/f = 5 requests), which should be nowhere near the 30s Heroku request timeout.
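
A minimal sketch of the update walk being described, assuming the checksums live in per-directory listing objects; the `fetch_listing`/`store_listing` helpers, the in-memory store, and the listing format are assumptions standing in for the S3 reads and writes, not the actual implementation:

```python
import hashlib
from pathlib import PurePosixPath

_STORE: dict = {}  # in-memory stand-in for the per-directory .checksum objects in S3

def fetch_listing(zarr_id, directory):  # hypothetical helper: one S3 GET
    return dict(_STORE.get((zarr_id, str(directory)), {}))

def store_listing(zarr_id, directory, listing):  # hypothetical helper: one S3 PUT
    _STORE[(zarr_id, str(directory))] = listing

def update_tree_checksum(zarr_id: str, path: str, file_md5: str) -> None:
    """Propagate a file's new digest up the directory tree.

    Each loop iteration fetches one directory listing, so a path like
    a/b/c/d/e/f touches a/b/c/d/e, a/b/c/d, a/b/c, a/b, a (and the root).
    """
    entry = PurePosixPath(path)
    digest = file_md5
    for directory in entry.parents:  # walks upward toward the zarr root
        listing = fetch_listing(zarr_id, directory)  # the per-directory S3 request
        listing[entry.name] = digest
        # A directory's checksum is a digest over its sorted entries.
        digest = hashlib.md5(
            ";".join(f"{n}:{d}" for n, d in sorted(listing.items())).encode()
        ).hexdigest()
        store_listing(zarr_id, directory, listing)
        entry = directory

update_tree_checksum("abc123", "a/b/c/d/e/f", "d41d8cd98f00b204e9800998ecf8427e")
```

The per-file cost is one fetch per ancestor directory, which is what keeps the work bounded relative to the 30s timeout.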

@yarikoptic (Member):

Is that implemented yet, so I could check the code? I am still a bit lost, since an upload is a batch of files which could come from various directories, so it could in principle (e.g., one file per directory) require traversing even more "directories" (up the hierarchy) than there are files.
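
To make the concern concrete: the number of listing updates for a batch is the size of the union of all ancestor directories, which can exceed the file count when files are spread one per directory. A small illustration with hypothetical batch paths:

```python
from pathlib import PurePosixPath

def directories_to_update(batch_paths):
    """Union of every ancestor directory (including the zarr root)
    touched by a batch; each one needs its listing refetched."""
    dirs = set()
    for path in batch_paths:
        dirs.update(PurePosixPath(path).parents)
    return dirs

# One file per directory: 3 uploaded files, but 6 directory listings to update.
batch = ["a/x/1.bin", "a/y/2.bin", "b/z/3.bin"]
print(len(directories_to_update(batch)))  # -> 6
```

Whether that union stays small enough for the request timeout depends on how deeply the batch's paths nest, not just on the batch size.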

* **DELETE /api/zarr/{zarr_id}/upload/**

Cancels a batch upload.
Any files already uploaded are deleted.

@yarikoptic (Member):

might take a while, I guess

@dchiquito (Contributor, Author):

I think it would be pretty quick, and if worst comes to worst we can do it asynchronously and just freeze the zarr file against new uploads until everything is deleted.
This is one of the factors that needs to be considered when choosing the batch size limit.


Cancels a batch upload.
Any files already uploaded are deleted.
A new batch upload can then be started.

@yarikoptic (Member):

how would the client know that the already-uploaded files have been deleted? Would DELETE not return until all are deleted?

@dchiquito (Contributor, Author):

That's what I was picturing, yes. If that's not feasible they can simply retry starting the next batch until the asynchronous delete is completed.
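
A sketch of that client-side fallback; the endpoint shape and the 409 "delete still in progress" status are assumptions, not part of the design doc:

```python
import time
import requests

def start_next_batch(api_url: str, zarr_id: str, files: list) -> dict:
    """Keep retrying the next batch upload until the asynchronous
    delete has finished and the zarr is unfrozen (hypothetical protocol)."""
    while True:
        resp = requests.post(f"{api_url}/zarr/{zarr_id}/upload/", json=files)
        if resp.status_code != 409:
            resp.raise_for_status()
            return resp.json()
        time.sleep(1)  # deletion still running; try again shortly
```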

Requires a `zarr_id` in the request body instead of a `blob_id`.
Returns an asset ID.

When added to a dandiset, zarr archives will appear as a normal `Asset` in all the asset API endpoints.

@yarikoptic (Member):

what would happen at the /dandisets/{versions__dandiset__pk}/versions/{versions__version}/assets/{asset_id}/download/ endpoint for an asset which is a zarr archive and not a blob?

@dchiquito (Contributor, Author):

A 400 error, or something to that effect.
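
For illustration, a guard along these lines in the download view would produce that behavior; this is a sketch assuming Django REST Framework, with the `asset.blob` field and `presigned_url` helper as illustrative names:

```python
from rest_framework import status
from rest_framework.response import Response

def asset_download(request, asset):
    """Hypothetical download handler: blob-backed assets redirect to S3,
    zarr-backed assets get a 400 since there is no single file to serve."""
    if asset.blob is None:  # zarr asset rather than a blob
        return Response(
            {"detail": "Zarr assets do not support direct download."},
            status=status.HTTP_400_BAD_REQUEST,
        )
    return Response(
        status=status.HTTP_302_FOUND,
        headers={"Location": asset.blob.presigned_url()},
    )
```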

@dchiquito (Contributor, Author) commented:

We need to get moving on implementing this stuff, so I'm going to go ahead and merge this. If there's any redesigning that needs to happen, that can go in a new PR.

@dchiquito dchiquito merged commit 3c21f27 into master Dec 15, 2021
@dchiquito dchiquito deleted the enh-zarr-support-design-3 branch December 15, 2021 22:49