Latest version drops vital coordinates of CMIP6 output #305
@jbusecke, I'm having trouble reproducing the issue when I use the code snippet you provided. I don't see any coordinates being dropped:

# I extracted the store path via cat.df["zstore"][0]
In [17]: store = 'gs://cmip6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Omon/thetao/gn/'
In [18]: mm = fsspec.get_mapper(store)
In [19]: ds_raw = xr.open_zarr(mm, **zarr_kwargs)
In [20]: ds_raw
Out[20]:
<xarray.Dataset>
Dimensions: (bnds: 2, i: 360, j: 291, lev: 45, time: 1980, vertices: 4)
Coordinates:
* i (i) int32 0 1 2 3 4 5 6 ... 353 354 355 356 357 358 359
* j (j) int32 0 1 2 3 4 5 6 ... 284 285 286 287 288 289 290
latitude (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* lev (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
lev_bnds (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
longitude (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:0...
time_bnds (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
Dimensions without coordinates: bnds, vertices
Data variables:
thetao (time, lev, j, i) float32 dask.array<chunksize=(6, 45, 291, 360), meta=np.ndarray>
vertices_latitude (j, i, vertices) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
vertices_longitude (j, i, vertices) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>

In [21]: ds_manual = combined_preprocessing(ds_raw)
/Users/abanihi/opt/miniconda3/envs/intake-esm-dev/lib/python3.8/site-packages/xarray/core/indexing.py:1361: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
>>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
... array[indexer]
To avoid creating the large chunks, set the option
>>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
... array[indexer]
return self.array[key]
In [22]: ds_manual
Out[22]:
<xarray.Dataset>
Dimensions: (bnds: 2, lev: 45, time: 1980, vertex: 4, x: 360, y: 291)
Coordinates:
* x (x) float32 0.5 1.5 2.5 3.5 ... 356.5 357.5 358.5 359.5
* y (y) float32 -78.3935 -78.19057 ... 71.618454 71.65057
lat (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* lev (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
lev_bounds (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
lon (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:0...
time_bounds (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
vertices_latitude (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
vertices_longitude (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
* bnds (bnds) int64 0 1
* vertex (vertex) int64 0 1 2 3
Data variables:
thetao (time, lev, y, x) float32 dask.array<chunksize=(6, 45, 291, 360), meta=np.ndarray>

Am I missing anything? Is there any other information you can provide?
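For a quick check of what, if anything, gets lost between the raw and preprocessed datasets, a small helper along these lines can diff the variable names. This is a sketch with tiny synthetic stand-ins for ds_raw/ds_manual (the data values are hypothetical); note it would also flag variables that combined_preprocessing merely renames, so rename pairs like latitude/lat need to be accounted for separately:

```python
import numpy as np
import xarray as xr

def dropped_vars(before: xr.Dataset, after: xr.Dataset) -> list:
    """Names present in `before` (coords or data vars) but absent from `after`."""
    names = lambda ds: set(ds.coords) | set(ds.data_vars)
    return sorted(names(before) - names(after))

# Tiny synthetic stand-in for a raw CMIP6 dataset (hypothetical values)
raw = xr.Dataset(
    {"thetao": (("time", "j", "i"), np.zeros((2, 3, 4)))},
    coords={
        "latitude": (("j", "i"), np.zeros((3, 4))),
        "time_bnds": (("time", "bnds"), np.zeros((2, 2))),
    },
)
processed = raw.drop_vars("time_bnds")  # simulate a lossy pipeline

print(dropped_vars(raw, processed))  # → ['time_bnds']
```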
This looks like a problem with the combined_preprocessing function. What's the version of cmip6_preprocessing that you are using?
I'm running:

In [26]: cmip6_preprocessing.__version__
Out[26]: 'v0.1.4'

Should I upgrade?
I think I was using the upstream master; could you try with that? If that fixes this issue, I'll release soon.
I will have a more detailed overview of this issue once cmip6_preprocessing/62 is up and running. This is where I discovered the issue; for now I have pinned the older intake-esm versions.
cmip6_preprocessing/67 shows the issue. Just by updating intake-esm I get a bunch of failing tests due to missing coordinates. Any idea what could cause this?
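For context, the failing checks are of roughly this shape (a hedged sketch, not the actual cmip6_preprocessing test code; REQUIRED_COORDS and the dataset below are hypothetical): after preprocessing, every expected coordinate must be present on the dataset.

```python
import numpy as np
import xarray as xr

# Hypothetical subset of the coordinates expected after preprocessing
REQUIRED_COORDS = {"lon", "lat", "time", "lon_bounds", "lat_bounds"}

def check_dim_coord_values(ds: xr.Dataset) -> None:
    """Assert that all required coordinates survived the pipeline."""
    missing = REQUIRED_COORDS - set(ds.coords)
    assert not missing, f"missing coordinates: {sorted(missing)}"

ds = xr.Dataset(
    {"thetao": (("time", "y", "x"), np.zeros((1, 2, 2)))},
    coords={
        "time": [0],
        "lon": (("y", "x"), np.zeros((2, 2))),
        "lat": (("y", "x"), np.zeros((2, 2))),
        "lon_bounds": (("bnds", "y", "x"), np.zeros((2, 2, 2))),
        "lat_bounds": (("bnds", "y", "x"), np.zeros((2, 2, 2))),
    },
)

check_dim_coord_values(ds)  # passes
try:
    # Dropping a bounds variable simulates the regression
    check_dim_coord_values(ds.drop_vars("lat_bounds"))
except AssertionError as e:
    print(e)  # → missing coordinates: ['lat_bounds']
```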
Your assessment is dead on. When I turn the aggregations off, I can see that there is coordinate information that gets thrown away when the aggregations are enabled. With aggregations disabled:
Dimensions: (bnds: 2, i: 360, j: 291, lev: 45, time: 1980, vertices: 4)
Coordinates:
* i (i) int32 0 1 2 3 4 5 6 ... 353 354 355 356 357 358 359
* j (j) int32 0 1 2 3 4 5 6 ... 284 285 286 287 288 289 290
latitude (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* lev (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
lev_bnds (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
longitude (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:0...
time_bnds (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
Dimensions without coordinates: bnds, vertices
Data variables:
thetao (time, lev, j, i) float32 dask.array<chunksize=(6, 45, 291, 360), meta=np.ndarray>
vertices_latitude (j, i, vertices) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
vertices_longitude (j, i, vertices) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>

With aggregations enabled:

Dimensions: (i: 360, j: 291, lev: 45, member_id: 1, time: 1980)
Coordinates:
* i (i) int32 0 1 2 3 4 5 6 7 8 ... 352 353 354 355 356 357 358 359
* j (j) int32 0 1 2 3 4 5 6 7 8 ... 283 284 285 286 287 288 289 290
latitude (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* lev (lev) float64 3.047 9.454 16.36 ... 5.126e+03 5.375e+03 5.625e+03
longitude (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
* member_id (member_id) <U8 'r1i1p1f1'
Data variables:
thetao (member_id, time, lev, j, i) float32 dask.array<chunksize=(1, 6, 45, 291, 360), meta=np.ndarray>

I'm going to look into this and will ping you when I have a patch for this bug...
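One common way xarray pipelines lose such variables (a sketch of the general mechanism, not necessarily the exact intake-esm code path): extracting a single variable from a Dataset keeps attached coordinates but silently discards sibling data variables, so fields like the bounds arrays vanish during aggregation unless they were first promoted with set_coords.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "thetao": (("time", "j", "i"), np.zeros((2, 3, 4))),
        # CMIP6 zarr stores keep cell geometry as plain data variables
        "latitude": (("j", "i"), np.zeros((3, 4))),
    },
    coords={"time": [0, 1]},
)

# Pulling out the variable of interest drops sibling data variables
assert "latitude" not in ds["thetao"].coords

# Promoting them to coordinates first keeps them attached
assert "latitude" in ds.set_coords("latitude")["thetao"].coords
```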
@jbusecke, thank you for the minimal example too! I can confirm that #306 seems to have fixed this issue:

In [1]: import intake
In [2]: import xarray as xr
In [3]: import fsspec
In [4]: from cmip6_preprocessing.preprocessing import combined_preprocessing
In [5]: col = intake.open_esm_datastore(
   ...:     "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
   ...: )
In [6]: cat = col.search(
...: source_id='CanESM5',
...: experiment_id='historical',
...: variable_id='thetao',
...: member_id="r1i1p1f1",
...: table_id="Omon",
...: grid_label='gn',
...: )
In [7]: cat
Out[7]: <pangeo-cmip6 catalog with 1 dataset(s) from 1 asset(s)>
In [8]: zarr_kwargs = {'consolidated':True}
In [9]: ddict = cat.to_dataset_dict(
...: zarr_kwargs=zarr_kwargs,
...: preprocess=combined_preprocessing,
...: storage_options={"token": "anon"},
...: )
...: _, ds = ddict.popitem()
...: ds
--> The keys in the returned dictionary of datasets are constructed as follows:
'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
Out[9]: ███████████████████████████████████████████| 100.00% [1/1 00:00<00:00]
<xarray.Dataset>
Dimensions: (bnds: 2, lev: 45, member_id: 1, time: 1980, vertex: 4, x: 360, y: 291)
Coordinates:
* x (x) float32 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
* y (y) float32 -78.3935 -78.19058 ... 89.36653 89.74177
lat (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* lev (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
lev_bounds (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
lon (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
time_bounds (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
lat_verticies (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
lon_verticies (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
* bnds (bnds) int64 0 1
* vertex (vertex) int64 0 1 2 3
lon_bounds (bnds, y, x) float32 dask.array<chunksize=(1, 291, 360), meta=np.ndarray>
lat_bounds (bnds, y, x) float32 dask.array<chunksize=(1, 291, 360), meta=np.ndarray>
* member_id (member_id) <U8 'r1i1p1f1'
Data variables:
thetao (member_id, time, lev, y, x) float32 dask.array<chunksize=(1, 6, 45, 291, 360), meta=np.ndarray>
In [10]: mm = fsspec.get_mapper(
...: cat.df["zstore"][0]
...: )
...: ds_raw = xr.open_zarr(mm, **zarr_kwargs)
...: ds_manual = combined_preprocessing(ds_raw)
...: ds_manual
...:
Out[10]:
<xarray.Dataset>
Dimensions: (bnds: 2, lev: 45, time: 1980, vertex: 4, x: 360, y: 291)
Coordinates:
* x (x) float32 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
* y (y) float32 -78.3935 -78.19058 ... 89.36653 89.74177
lat (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* lev (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
lev_bounds (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
lon (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
* time (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
time_bounds (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
lat_verticies (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
lon_verticies (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
* bnds (bnds) int64 0 1
* vertex (vertex) int64 0 1 2 3
lon_bounds (bnds, y, x) float32 dask.array<chunksize=(1, 291, 360), meta=np.ndarray>
lat_bounds (bnds, y, x) float32 dask.array<chunksize=(1, 291, 360), meta=np.ndarray>
Data variables:
thetao (time, lev, y, x) float32 dask.array<chunksize=(6, 45, 291, 360), meta=np.ndarray>
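As an aside, the dictionary keys reported above ('activity_id.institution_id.source_id.experiment_id.table_id.grid_label') can be reproduced from the catalog's groupby columns. Roughly (a sketch with a hypothetical catalog row matching the CanESM5 dataset; not intake-esm's actual implementation):

```python
# Attribute columns intake-esm groups assets by for pangeo-cmip6
groupby_attrs = [
    "activity_id", "institution_id", "source_id",
    "experiment_id", "table_id", "grid_label",
]

# Hypothetical catalog row for the dataset above
row = {
    "activity_id": "CMIP", "institution_id": "CCCma", "source_id": "CanESM5",
    "experiment_id": "historical", "table_id": "Omon", "grid_label": "gn",
}

# Join the attribute values in order to form the dataset key
key = ".".join(row[attr] for attr in groupby_attrs)
print(key)  # → CMIP.CCCma.CanESM5.historical.Omon.gn
```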
Also, it appears that with #306 everything is back to working (including jbusecke/xMIP#67), with the exception of one test:

-- Docs: https://docs.pytest.org/en/stable/warnings.html
========================== short test summary info ===========================
FAILED cmip6_preprocessing/tests/test_preprocessing_cloud.py::test_check_dim_coord_values[CESM2-thetao-ssp585-gn]
= 1 failed, 644 passed, 2101 skipped, 38 xfailed, 9030 warnings in 8754.16s (2:25:54) =
P.S. That's plenty of long-running tests... I wasn't sure how long they were going to take to finish running 😅
Yeah, it's not great that it runs for over an hour... I will try to improve this in the next few days, but I wanted to get it running first. Thanks for the fix! I am rerunning the tests and will investigate what is happening with that one model...
Also, thanks for the quick fix!
Of course! Just merged #306 into master. When you get a chance, can you test jbusecke/xMIP#67 against intake-esm's master branch and let me know how it goes? I plan to release a new version once I have your confirmation that everything is working.
Just switched the CI to the upstream master. Now we just need to wait an hour... 🤪
It seems that everything is back to normal with the newest master (jbusecke/xMIP#67). Could your single model failure have been caused by some hiccup with the cloud data?
Perhaps... Thank you for the update! I am going to cut a new release by the end of the day today.
Closing this, as it has been addressed in #306. Please reopen if you still encounter this issue with the latest stable version (v2020.12.18).
Description

I have run into trouble with the latest intake-esm version and the Google CMIP6 data. When loading datasets via intake-esm, the result is missing a bunch of the coordinates, which breaks many of my workflows (and testing) using CMIP6 data.

I assume this is an intake-esm issue, since when I load and preprocess the file manually the coordinates (e.g. lon_bounds/lat_bounds) are present. This only happens with the current version. I tested with v2020.08.15 and it does not happen.

Version information: output of intake_esm.show_versions()