Support for JSON metadata workflow #124

Closed
b-pos465 opened this issue Jun 27, 2022 · 4 comments


@b-pos465

Use Case

I am trying to access NetCDF4 data via JSON metadata with intake-xarray. This approach is based on this blog post by lsterzinger. I am trying to make the data access as convenient as possible. The ideal solution for me with the existing API would look like this:

```python
# using open_netcdf
json_source = intake.open_netcdf(
    '/home/jovyan/work/output/s3/combine.json',
    xarray_kwargs={'engine': 'zarr', 'consolidated': False},
)

# using open_zarr
json_source = intake.open_zarr(
    '/home/jovyan/work/output/s3/combine.json',
    consolidated=False,
)
```

When testing this approach I get the following error:

```
...

File /opt/conda/lib/python3.9/site-packages/zarr/hierarchy.py:1057, in _normalize_store_arg(store, storage_options, mode)
   1055 if store is None:
   1056     return MemoryStore()
-> 1057 return normalize_store_arg(store,
   1058                            storage_options=storage_options, mode=mode)

File /opt/conda/lib/python3.9/site-packages/zarr/storage.py:123, in normalize_store_arg(store, storage_options, mode)
    121         return N5Store(store)
    122     else:
--> 123         return DirectoryStore(store)
    124 else:
    125     if not isinstance(store, BaseStore) and isinstance(store, MutableMapping):

File /opt/conda/lib/python3.9/site-packages/zarr/storage.py:844, in DirectoryStore.__init__(self, path, normalize_keys, dimension_separator)
    842 path = os.path.abspath(path)
    843 if os.path.exists(path) and not os.path.isdir(path):
--> 844     raise FSPathExistNotDir(path)
    846 self.path = path
    847 self.normalize_keys = normalize_keys

FSPathExistNotDir: path exists but is not a directory: %r
```

The approach from the blog post uses an FSMap. So I tried the following:

```python
fs = fsspec.filesystem(
    "reference",
    fo="/home/jovyan/work/output/s3/combine.json",
    remote_protocol="file",
    skip_instance_cache=True,
)
m = fs.get_mapper("")

json_source = intake.open_zarr(m, engine='zarr')
json_source.discover()
```

This works, but it somewhat defeats the point of Intake, since the user has to know the fsspec API to create a working FSMap.

Suggestion

Version 1

I would like to implement an extra case for the open_zarr method to support the JSON workflow introduced in the blog post mentioned above.

Version 2

I could also imagine an extra method for the JSON workflow, something like intake.open_zarr_metadata('combine.json').
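A minimal sketch of what such a helper might look like (the name `open_zarr_metadata` is the hypothetical one from this suggestion; the body simply wraps the fsspec workaround shown above):

```python
def open_zarr_metadata(json_path, remote_protocol="file", **kwargs):
    """Hypothetical helper sketching the Version 2 suggestion: wrap a
    kerchunk-style JSON reference file in an fsspec mapper and hand it
    to intake.open_zarr, hiding the fsspec details from the user."""
    import fsspec
    import intake

    fs = fsspec.filesystem(
        "reference",
        fo=json_path,
        remote_protocol=remote_protocol,
        skip_instance_cache=True,
    )
    return intake.open_zarr(fs.get_mapper(""), **kwargs)
```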

Questions

  1. Which approach would you prefer?

  2. While looking through existing issues I found "xarray.open_zarr to be deprecated" #70. If I understand it correctly, the fsspec mapper was removed in 2020 because it was no longer needed. Is there another solution for bringing the JSON workflow to intake-xarray that I overlooked?

  3. Unfortunately, my Python knowledge is limited, so I don't know how to test a modified version of intake-xarray. I found https://intake-xarray.readthedocs.io/en/latest/contributing.html#id9 for running the test suite, but how can I test a modified version of intake-xarray with Intake locally? It would be great to have this in the docs!

@martindurant
Member

This does already work, but the invocation via intake-xarray (or xarray open_dataset directly) is complex. Actually, intake-xarray is great exactly because it hides this complexity from the user once you've figured it out. Your call should look something like

```python
source = intake.open_zarr(
    "reference://",
    storage_options={
        "fo": '/home/jovyan/work/output/s3/combine.json',
        "remote_protocol": "...",  # e.g., "s3", "http", ...
        "remote_options": {...},   # anything needed to configure that remote filesystem
    },
    consolidated=False,
)
```

And yes, open_netcdf essentially does the same thing, except that you specify the engine, and all those arguments get nested inside a "backend_kwargs".
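As a hedged sketch of that nesting (the protocol and remote options here are assumptions for an S3 case, and the call itself is shown commented out because it needs a real reference file):

```python
# The reference-filesystem arguments move under "backend_kwargs" when
# going through open_netcdf with the zarr engine, as described above.
backend_kwargs = {
    "consolidated": False,
    "storage_options": {
        "fo": "/home/jovyan/work/output/s3/combine.json",
        "remote_protocol": "s3",           # assumption: data lives on S3
        "remote_options": {"anon": True},  # assumption: public bucket
    },
}

# Needs intake and a real reference file, so shown but not executed here:
# source = intake.open_netcdf(
#     "reference://",
#     xarray_kwargs={"engine": "zarr", "backend_kwargs": backend_kwargs},
# )
```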

@martindurant
Member

If you succeed in generating an interesting dataset and would like to share in public, the kerchunk project would like to know about it!

@b-pos465
Author

b-pos465 commented Jul 7, 2022

Thank you for your help! Your approach works perfectly fine.

I was able to generate a YAML file from the source above and load it back in.
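For reference, a catalog entry for such a source might look roughly like this (a hypothetical sketch, assuming intake-xarray's zarr driver and the field names of Intake's YAML catalog format; the entry name "combine" is made up):

```yaml
sources:
  combine:
    driver: zarr
    args:
      urlpath: "reference://"
      storage_options:
        fo: "/home/jovyan/work/output/s3/combine.json"
        remote_protocol: "s3"
      consolidated: false
```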

Actually, I am not working on a dataset but on a web-based tool for migrating NetCDF4 data to Zarr. It supports both an actual conversion and the JSON metadata workflow mentioned above. Right now I am working on the Intake integration for the JSON metadata. Here is a link to the repository: https://github.com/climate-v/nc2zarr-webapp

@b-pos465 b-pos465 closed this as completed Jul 7, 2022
@martindurant
Member

Are you aware of https://pangeo-forge.org/
