Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP zarr backend #98

Closed
wants to merge 126 commits into from
Closed

WIP zarr backend #98

wants to merge 126 commits into from

Conversation

donghekang
Copy link

@donghekang donghekang commented Jul 15, 2019

Motivation

Add support for Zarr DirectoryStore I/O

How to test the behavior?

See the included tests

ToDo

Dependencies

  • Update sources to latest HDMF dev
  • Remove dependency on six in Zarr backend
  • Update requirements to the current Zarr and numcodecs version
  • Make the import of Zarr optional to make sure HDMF work even without Zarr installed, i.e., only the explicit import of the zarr backend classes should fail if Zarr is not installed. Update: In src/hdmf/utils.py ,src/hdmf/io/__init__.py , tests/unit/test_io_zarr.py etc.
  • Add get_docval_macro method #446 adds a get_docval_macro function. We should remove the __get_docval_macro that has been added in this PR and update the code to use the new one from HDMF.

Features

  • Implement core Zarr I/O backend and utilities
  • Builder.written flag has been removed and built status should be tracked in the I/O backend
  • Remove logic to support export to Zarr from HDF5 and use new export functions instead
  • Add support to allow the use of Zarr-supported compressors and filters for datasets
  • Raise NotImplemented error for RegionReferences and mention in Docs
  • Add support for AbstractDataChunkIterator (also exhaust_dci flag has been moved to the HDF5 backend and removed from HDMFIO)
  • Add support for for RegionReferences (Optional)

Testing and Documentation

  • Add unit tests for iterative data write
  • Update CHANGELOG.md
  • Check that we can convert an NWB file from HDF5 --> Zarr --> HDF5
  • Add tests for export between HDF5 and Zarr and vice versa
  • Tests for export from ZarrToZarr are in TestExportZarrToZarr. Export of links seems to fail. Re-enable and debug the tests.
  • Check resolve references properly #179 to see if we need to update reference handling
  • Resolve TODO items listed in the code
  • Add example for HDF5 to Zarr conversion (and vice versa) to the export docs.

Checklist

  • Have you checked our Contributing document?
  • Have you ensured the PR description clearly describes problem and the solution?
  • Is your contribution compliant with our coding style ? This can be checked running flake8 from the source directory.
  • Have you checked to ensure that there aren't other open Pull Requests for the same change?
  • Have you included the relevant issue number using #XXX notation where XXX is the issue number ? By including "Fix #XXX" you allow GitHub to close the corresponding issue.

@oruebel
Copy link
Contributor

oruebel commented Jul 15, 2019

@bendichter
Copy link
Contributor

@kangDH thanks for getting this going! A bunch of us are really excited about incorporating alternative backends!

@bendichter
Copy link
Contributor

@kangDH I think for requirements it might make more sense to have Zarr and other optional backends be included as extras

@donghekang
Copy link
Author

@kangDH I think for requirements it might make more sense to have Zarr and other optional backends be included as extras

Because we import zarr in the utils.py, I think zarr should be installed by default.

@oruebel oruebel requested a review from ajtritt July 15, 2019 22:41
@bendichter
Copy link
Contributor

@kangDH it looks like you developed this in python 2. Could you change to python 2/3 syntax? e.g. changing print a to print(a). Also we use flake8 for managing code style, which is pretty strict. Check the results there to see where it has issues.

@oruebel
Copy link
Contributor

oruebel commented Jul 16, 2019

I'm working on the fixes for Python 3

@oruebel
Copy link
Contributor

oruebel commented Jul 17, 2019

@kangDH the following PR donghekang#1 to your branch should fix the test failures for Python 2/3 and flake8. Can you please review and merge the PR with your branch.

@bendichter
Copy link
Contributor

@oruebel Since Zarr only supports chunked datasets, how does this backend handles writing unchunked datasets? Does it write the dataset as one big chunk?

@oruebel
Copy link
Contributor

oruebel commented Sep 4, 2020

In that case the chunk parameter will be set to False when calling the require_dataset function of zarr.hiearchy.Group I believe this means that Zarr will store the array in a single block, but I'd need to double check what Zarr actually does in this case when the file is written to disk.

@bendichter
Copy link
Contributor

That could be a problem for large datasets because a user would not be able to read a section of it without reading the entire dataset into memory

@oruebel
Copy link
Contributor

oruebel commented Sep 4, 2020

The default option for chunking is set to True currently in ZarrIO. We can run some test with some NWB files. The convert should be fairly simple:

from pynwb import NWBHDF5IO, NWBZarrIO
import os
infile = "H19.28.012.11.05-2.nwb"
outfile = "test_zarr_" + os.path.basename(infile)
h5r = NWBHDF5IO(infile , 'r', load_namespaces=False)
f = h5r.read()
zw = NWBZarrIO(outfile, 
               mode='w', 
               manager=h5r.manager, 
               chunking=True)
zw.write(f, cache_spec=True)
zw.close()
h5r.close()

@oruebel oruebel mentioned this pull request Oct 29, 2020
6 tasks
@oruebel oruebel mentioned this pull request Feb 16, 2022
25 tasks
@oruebel
Copy link
Contributor

oruebel commented Feb 16, 2022

Closing in favor of the new PR #696

@oruebel oruebel closed this Feb 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants