Provide support for zarr (.zarr/.ngff) #127
I can't tell what you're trying to say here. |
yes, 1-to-1 to prefix (AKA folder) on S3:

```
dandi@drogon:~$ s3cmd -c ~/.s3cfg-dandi-backup ls s3://dandiarchive/zarr/ | head
   DIR  s3://dandiarchive/zarr/020f7130-3a59-4140-b01d-ac2180917b05/
   DIR  s3://dandiarchive/zarr/02499e55-c945-4af9-a9d8-d9072d94959c/
   DIR  s3://dandiarchive/zarr/0316a531-decb-4401-99b7-5d15e8c3dcec/
   DIR  s3://dandiarchive/zarr/031bf698-6917-4294-a086-61a2454e0a07/
   ...
```
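For reference, the same per-Zarr prefixes can be enumerated programmatically. A minimal sketch assuming a boto3 S3 client; the function name is mine, and the bucket/prefix match the listing above:

```python
def list_zarr_ids(s3_client, bucket="dandiarchive", prefix="zarr/"):
    """Yield the zarr_id of each top-level prefix under zarr/ (one per Zarr).

    ``s3_client`` is a boto3 S3 client, e.g.
    ``boto3.client("s3", config=Config(signature_version=UNSIGNED))``.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    # Delimiter="/" collapses each Zarr's keys into a single CommonPrefix
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        for cp in page.get("CommonPrefixes", []):
            # "zarr/020f7130-3a59-4140-b01d-ac2180917b05/" -> the zarr_id
            yield cp["Prefix"][len(prefix):].rstrip("/")
```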
pretty much what you asked (and I answered) about above ;-) |
should be reenabled whenever #127 is addressed
|
for the commit which would create the subdataset for the zarr (if committing separately), it would be worthwhile to use the creation time. If you would not be committing in the dandiset's git repo at the moment of creation, then let's use the datetime of the commit to be committed in that
I think this all is still to be decided upon, so AFAIK publishing of dandisets with zarrs is disabled and we should error out if we run into such a situation. BUT meanwhile, in the case of datalad dandisets, while the bucket is still versioned, we just need to make sure to use versioned URLs to S3. |
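A versioned S3 URL is just the object URL with a `versionId` query parameter pinning a specific version; a small helper to build one (the helper name is mine):

```python
from urllib.parse import quote

def versioned_s3_url(bucket: str, key: str, version_id: str) -> str:
    """Return a URL pinned to one object version, so it keeps resolving
    even after the key is overwritten or deleted in the versioned bucket."""
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}?versionId={version_id}"
```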
Do you mean the initial commit(s) created when
I would not.
I don't know what you mean by this.
Zarr entry timestamps can only be retrieved via S3. |
yes, since that one would do some commits (e.g. to commit
in the dandiset's datalad dataset which is to commit the changes to
oh, that sucks... then I guess we should take the modified time for that entire zarr (not the asset it belongs to) -- do we get that timestamp somewhere? |
@yarikoptic Are you expecting the backup script to only create one repository per Zarr, that repository being a submodule of the respective Dandiset's dataset? I assumed that the Zarr repositories would be created under
We still need to query S3 to get files' sizes and their versioned AWS URLs, and all S3 queries can be done in a single request per entry. |
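The "single request per entry" can be a HEAD request, which returns the size, the version ID, and the last-modified time together. A sketch assuming a boto3 S3 client (the function name is mine):

```python
def s3_entry_info(s3_client, bucket: str, key: str):
    """One HEAD request per Zarr entry: (size, version_id, last_modified).

    VersionId is present on versioned buckets and is what a versioned
    download URL would be pinned to.
    """
    resp = s3_client.head_object(Bucket=bucket, Key=key)
    return resp["ContentLength"], resp.get("VersionId"), resp["LastModified"]
```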
right, I forgot that aspect of the design -- we do have all of them under
oh, because of dandi/dandi-archive#925 (to be addressed as a part of the larger dandi/dandi-archive#937)? then maybe we should also ask to have mtime included too while at it? |
@yarikoptic Could you write out a pseudocode sketch or something of how you envision adding a Zarr repo to a Dandiset dataset working? Right now, pre-Zarr, it roughly works like this:
In particular, once the syncing of the Zarr to its repository under
We would still need to query S3 to get the versioned AWS URL to register for the file in git-annex. |
|
|
I am a bit lost, but
;-) tricky you! you dug it up -- even for non-empty ones we must consider DeleteMarkers' (deleted files/keys) datetimes, so we have the datetime of modification of a
Let's stick with |
|
@yarikoptic Also:
|
Good questions!
I don't think this is possible (any longer). See datalad/datalad#5155 for a TODO context manager and actually an existing one used in the tests
Yes, I think we are doomed to do that.
Not sure about always, but it is ATM. A similar one is in staging bucket (we have tests using staging, right?)
indeed, neither files nor DeleteMarkers have to exist... so then we could take that zarr creation datetime as a commit datetime
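Per the discussion above, the modification time of an entire Zarr can be taken as the newest `LastModified` across all object versions and DeleteMarkers under its prefix; when neither exists, the function below returns `None` and the Zarr creation datetime would be used instead. A sketch assuming a boto3 client and the `zarr/` prefix layout shown earlier:

```python
def zarr_last_modified(s3_client, bucket: str, zarr_id: str):
    """Latest modification time across all object versions *and*
    DeleteMarkers under the Zarr's prefix; None if there is no history."""
    paginator = s3_client.get_paginator("list_object_versions")
    latest = None
    for page in paginator.paginate(Bucket=bucket, Prefix=f"zarr/{zarr_id}/"):
        # Deleted keys appear only as DeleteMarkers, so both lists matter.
        for rec in page.get("Versions", []) + page.get("DeleteMarkers", []):
            if latest is None or rec["LastModified"] > latest:
                latest = rec["LastModified"]
    return latest
```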
Let's call it
Something like
should do it (unless some rule added later overrides it) |
@yarikoptic For the |
@yarikoptic Reminder about question in previous comment. |
yes, as |
@yarikoptic Do |
|
@yarikoptic So they do create commits in the outer dataset, and that can't be avoided? Should all non-Zarrs assets be committed before cloning/updating Zarr subdatasets?
Take it for what, exactly? |
I don't think we want to avoid commits - they are what would give us an idea about the state of a dandiset at that point in time
Not necessarily, as we don't commit per each non-zarr asset change. Ideally there should be nothing special about a Zarr asset/subdataset in that sense
For the commit in the dandiset. Since there could be other assets to be saved, I guess we shouldn't do `commit -d .`, but leave it to the eventual call to save to save it? |
@yarikoptic I'm confused about exactly what should be committed at what point.
Git commit timestamps have one-second resolution, so adding a millisecond is not an option. |
Echoing my thinking above -- Let's not commit right at that point.
same answer -- treat the zarr subdataset as any other asset, collapsing multiple updates across assets where our logic says to do that already (we moved away from 1-commit-per-asset a while back)
then 1 second? or just remove any increment? your choice -- anything which wouldn't throw off logic for minting release commits |
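Since Git stores commit timestamps with one-second resolution, one way to commit with a chosen whole-second datetime (a sketch; not necessarily how the backup script does it) is via the `GIT_AUTHOR_DATE`/`GIT_COMMITTER_DATE` environment variables:

```python
import os
import subprocess
from datetime import datetime

def commit_with_datetime(repo_path: str, message: str, when: datetime) -> None:
    """Commit staged changes with ``when`` as both author and committer date,
    truncated to whole seconds (Git timestamps have one-second resolution)."""
    stamp = when.replace(microsecond=0).isoformat()  # ISO 8601, which git accepts
    env = dict(os.environ, GIT_AUTHOR_DATE=stamp, GIT_COMMITTER_DATE=stamp)
    subprocess.run(
        ["git", "-C", repo_path, "commit", "-m", message],
        env=env, check=True,
    )
```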
How? When using the Datalad Python API, how exactly do I invoke |
|
@yarikoptic I'm trying to write a test in which a Zarr is uploaded & backed up, then a file in the Zarr becomes a directory, a directory becomes a file, and the Zarr is uploaded & backed up again. However, when calling |
Cool! Just push that test and I will try it out as well to troubleshoot datalad from there? |
@yarikoptic Pushed. You can run just that test with |
filed datalad/datalad#6558 . Could you for now disable such a tricky unit test? ;) |
@yarikoptic Test disabled. |
@yarikoptic The
|
@yarikoptic Ping. |
sorry for the delay... Although I do not like breeding commands, I think for now or forever we would be ok if there is a dedicated
Rationale:
|
What have you decided about this? |
Let's just do /mnt/backup/dandi/dandizarrs folder for now:
|
Having established the workflow for out-of-band backup of regular assets (based on #103) we will approach backup of zarrs. Some notes:

- `zarr/` refers to a `.zarr` or `.ngff` folder asset; such assets are already entering the staging server and will soon emerge in the main one. We should be ready (#126 is a stop-gap measure so we do not "pollute" our datalad dandisets)
- each `zarr/` should be a DataLad subdataset
- with `zarr_id`s as the ones used on S3 as prefixes under the `zarr/` prefix (look at `s3://dandi-api-staging-dandisets/zarr/`)
- under `/mnt/backup/dandi/dandizarrs` (folder? super dataset?) or can be a `/mnt/backup/dandi/dandisets/zarrs` subdataset
- `dandizars` into a repo/superdataset - since it would not reflect the entire state on the bucket anyways (@yarikoptic is yet to review/adopt "Populate assets via a separate subcommand" #103 for out-of-band backup of regular assets)
- `remove`d if zarr is removed in a given dandiset/path
- `.zarr/`: `.zarr-checksum` (or whatever that file to contain the overall checksum is) should reside under `git`, not `git-annex`
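The layout proposed here is one repository per `zarr_id` under `/mnt/backup/dandi/dandizarrs`, registered as a subdataset at the asset's path in the dandiset dataset. A minimal sketch of that registration step via the `datalad clone -d` CLI (the helper name and argument layout are my assumptions, not the backup script's actual code):

```python
import subprocess

def register_zarr_subdataset(dandiset_ds: str, asset_path: str, zarr_id: str,
                             zarrs_root: str = "/mnt/backup/dandi/dandizarrs") -> None:
    """Register the per-Zarr repo as a subdataset of the dandiset dataset.

    ``-d <dandiset_ds>`` makes datalad record the clone in .gitmodules, so
    the dandiset commit captures which Zarr state the asset points at.
    """
    subprocess.run(
        ["datalad", "clone", "-d", dandiset_ds,
         f"{zarrs_root}/{zarr_id}", f"{dandiset_ds}/{asset_path}"],
        check=True,
    )
```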
where I changed prefix and incremented last number in uuid (is it still legit? ;))