
Latest version drops vital coordinates of CMIP6 output #305

Closed · jbusecke opened this issue Dec 9, 2020 · 18 comments

Labels: bug (Issues that present a reasonable conviction there is a reproducible bug.)

@jbusecke (Contributor) commented Dec 9, 2020

Description

I have run into trouble with the latest intake-esm version and the Google Cloud CMIP6 data. When loading datasets via intake-esm:

import intake
import fsspec
import xarray as xr
from cmip6_preprocessing.preprocessing import combined_preprocessing

col = intake.open_esm_datastore(
        "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
    )
cat = col.search(
        source_id='CanESM5',
        experiment_id='historical',
        variable_id='thetao',
        member_id="r1i1p1f1",
        table_id="Omon",
        grid_label='gn',
    )
cat

zarr_kwargs = {'consolidated':True}
ddict = cat.to_dataset_dict(
            zarr_kwargs=zarr_kwargs,
            preprocess=combined_preprocessing,
            storage_options={"token": "anon"},
        )
_, ds = ddict.popitem()
ds

[Screenshot: dataset repr returned by intake-esm, with several coordinates missing]

This dataset is missing a bunch of coordinates, which breaks many of my workflows (and tests) that use CMIP6 data.
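
For reference, a minimal check of the kind my workflows and tests perform (hypothetical; the exact set of required coordinates varies by workflow) fails on the dataset returned above:

# Hypothetical sanity check for the coordinates that go missing here
expected = {"lon_bounds", "lat_bounds", "lon_verticies", "lat_verticies"}
missing = expected - set(ds.coords)
assert not missing, f"dropped coordinates: {sorted(missing)}"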

I assume this is an intake-esm issue, since I can load and preprocess the same store manually:

# now load the 'raw' zarr file and preprocess in the same way
mm = fsspec.get_mapper(
            cat.df["zstore"][0]
        )
ds_raw = xr.open_zarr(mm, **zarr_kwargs)
ds_manual = combined_preprocessing(ds_raw)
ds_manual

[Screenshot: dataset repr from the manual load]

Here the coordinates are present (e.g. lon_bounds/lat_bounds).

This only happens with the current version; I tested with v2020.08.15 and the coordinates are preserved there.

Version information: output of intake_esm.show_versions()


INSTALLED VERSIONS
------------------

cftime: 1.2.1
dask: 2.30.0
fastprogress: 0.2.7
fsspec: 0.8.4
gcsfs: 0.7.1
intake: 0.6.0
intake_esm: 2020.11.4
netCDF4: 1.5.4
pandas: 1.1.4
requests: 2.24.0
s3fs: 0.4.2
xarray: 0.16.1
zarr: 2.5.0
@andersy005 (Member)

@jbusecke, thank you for opening this issue! The only major change that may have introduced this issue is #287. I'm going to look into this and will get back to you ASAP.
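
One quick way to narrow this down, reusing the names from the snippet in the issue description, is to load the same query with aggregations disabled and compare the coordinates. This is only a sketch: it assumes to_dataset_dict accepts aggregate=False in the installed intake-esm, which I have not verified against this exact version.

# Sketch: compare coordinates with and without intake-esm's aggregations
# (assumes to_dataset_dict accepts aggregate=False in this version)
ddict_noagg = cat.to_dataset_dict(
    zarr_kwargs=zarr_kwargs,
    preprocess=combined_preprocessing,
    storage_options={"token": "anon"},
    aggregate=False,
)
_, ds_noagg = ddict_noagg.popitem()
print(set(ds_noagg.coords) - set(ds.coords))  # coordinates lost during aggregation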

@andersy005 (Member)

@jbusecke, I'm having trouble reproducing the issue when I use the code snippet you provided. I don't see lon_bounds and lat_bounds in the raw zarr store:

# I extracted the store path via cat.df["zstore"][0]
In [17]: store = 'gs://cmip6/CMIP/CCCma/CanESM5/historical/r1i1p1f1/Omon/thetao/gn/'

In [18]: mm = fsspec.get_mapper(store)

In [19]: ds_raw = xr.open_zarr(mm, **zarr_kwargs)

In [20]: ds_raw
Out[20]: 
<xarray.Dataset>
Dimensions:             (bnds: 2, i: 360, j: 291, lev: 45, time: 1980, vertices: 4)
Coordinates:
  * i                   (i) int32 0 1 2 3 4 5 6 ... 353 354 355 356 357 358 359
  * j                   (j) int32 0 1 2 3 4 5 6 ... 284 285 286 287 288 289 290
    latitude            (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * lev                 (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
    lev_bnds            (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
    longitude           (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * time                (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:0...
    time_bnds           (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
Dimensions without coordinates: bnds, vertices
Data variables:
    thetao              (time, lev, j, i) float32 dask.array<chunksize=(6, 45, 291, 360), meta=np.ndarray>
    vertices_latitude   (j, i, vertices) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
    vertices_longitude  (j, i, vertices) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
In [21]: ds_manual = combined_preprocessing(ds_raw)
/Users/abanihi/opt/miniconda3/envs/intake-esm-dev/lib/python3.8/site-packages/xarray/core/indexing.py:1361: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  return self.array[key]

In [22]: ds_manual
Out[22]: 
<xarray.Dataset>
Dimensions:             (bnds: 2, lev: 45, time: 1980, vertex: 4, x: 360, y: 291)
Coordinates:
  * x                   (x) float32 0.5 1.5 2.5 3.5 ... 356.5 357.5 358.5 359.5
  * y                   (y) float32 -78.3935 -78.19057 ... 71.618454 71.65057
    lat                 (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * lev                 (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
    lev_bounds          (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
    lon                 (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * time                (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:0...
    time_bounds         (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    vertices_latitude   (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
    vertices_longitude  (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
  * bnds                (bnds) int64 0 1
  * vertex              (vertex) int64 0 1 2 3
Data variables:
    thetao              (time, lev, y, x) float32 dask.array<chunksize=(6, 45, 291, 360), meta=np.ndarray>

Am I missing anything? Is there any other information you can provide?

@jbusecke (Contributor, Author)

This looks like a problem with the combined_preprocessing function. What's the version of cmip6_preprocessing that you are using?

@andersy005 (Member)

I'm running:

In [26]: cmip6_preprocessing.__version__
Out[26]: 'v0.1.4'

Should I upgrade?

@andersy005 (Member)

> Should I upgrade?

Looks like v0.1.4 is the latest release.

@jbusecke (Contributor, Author)

I think I was using the upstream master; could you try with that? If that fixes the issue, I'll cut a release soon.

@jbusecke (Contributor, Author) commented Dec 11, 2020

I will have a more detailed overview of this issue once cmip6_preprocessing/62 is up and running; that is where I discovered it. For now I have pinned the older intake-esm version.
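
(A rough sketch of such a pin as a runtime guard; hypothetical, since in practice the pin just lives in the environment file:)

# Hypothetical guard mirroring the pin: fail fast on intake-esm versions
# newer than the last one known to keep all coordinates.
import intake_esm
from packaging.version import Version

if Version(intake_esm.__version__) > Version("2020.8.15"):
    raise RuntimeError(
        "intake-esm versions after 2020.8.15 currently drop vital CMIP6 "
        "coordinates (see this issue); pin intake-esm==2020.8.15 for now"
    )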

@jbusecke (Contributor, Author)

cmip6_preprocessing/67 shows the issue: just by updating intake-esm, I get a bunch of failing tests due to missing coordinates. Any idea what could cause this?

@andersy005 (Member) commented Dec 18, 2020

@jbusecke,

Your assessment is dead on. When I turn the aggregations off, I can see that coordinate information gets thrown away when aggregations are enabled:

  • Aggregations OFF:
 Dimensions:             (bnds: 2, i: 360, j: 291, lev: 45, time: 1980, vertices: 4)
 Coordinates:
   * i                   (i) int32 0 1 2 3 4 5 6 ... 353 354 355 356 357 358 359
   * j                   (j) int32 0 1 2 3 4 5 6 ... 284 285 286 287 288 289 290
     latitude            (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
   * lev                 (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
     lev_bnds            (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
     longitude           (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
   * time                (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:0...
     time_bnds           (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
 Dimensions without coordinates: bnds, vertices
 Data variables:
     thetao              (time, lev, j, i) float32 dask.array<chunksize=(6, 45, 291, 360), meta=np.ndarray>
     vertices_latitude   (j, i, vertices) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
     vertices_longitude  (j, i, vertices) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
  • Aggregations ON:
 Dimensions:    (i: 360, j: 291, lev: 45, member_id: 1, time: 1980)
 Coordinates:
   * i          (i) int32 0 1 2 3 4 5 6 7 8 ... 352 353 354 355 356 357 358 359
   * j          (j) int32 0 1 2 3 4 5 6 7 8 ... 283 284 285 286 287 288 289 290
     latitude   (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
   * lev        (lev) float64 3.047 9.454 16.36 ... 5.126e+03 5.375e+03 5.625e+03
     longitude  (j, i) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
   * time       (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
   * member_id  (member_id) <U8 'r1i1p1f1'
 Data variables:
     thetao     (member_id, time, lev, j, i) float32 dask.array<chunksize=(1, 6, 45, 291, 360), meta=np.ndarray>

I'm going to look into this and will ping you when I have a patch for this bug...
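
For intuition, here is a self-contained sketch of one way coordinates can get lost in a step like this (an illustration of the underlying xarray behavior, not necessarily the exact code path inside intake-esm): when only the requested variable is extracted, any coordinate whose dimensions are not a subset of that variable's dimensions is silently dropped.

import numpy as np
import xarray as xr

# Toy dataset mimicking a CMIP6 ocean field with auxiliary coordinates,
# including one (cell vertices) on a dimension the data variable lacks.
ds = xr.Dataset(
    {"thetao": (("time", "j", "i"), np.zeros((2, 3, 4)))},
    coords={
        "latitude": (("j", "i"), np.zeros((3, 4))),
        "vertices_latitude": (("j", "i", "vertices"), np.zeros((3, 4, 4))),
    },
)

# Pulling out just the requested variable keeps "latitude" (its dims are a
# subset of thetao's) but drops "vertices_latitude" ("vertices" is not a dim of thetao).
da = ds["thetao"]
print("latitude" in da.coords)           # True
print("vertices_latitude" in da.coords)  # False

# Re-attaching the parent dataset's coordinates restores everything.
restored = da.to_dataset().assign_coords(ds.coords)
print("vertices_latitude" in restored.coords)  # True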

andersy005 added the bug label on Dec 18, 2020
@andersy005 (Member)

@jbusecke, thank you for the minimal example too! I can confirm that #306 seems to have fixed this issue:

In [1]: import intake

In [2]: import xarray as xr

In [3]: import fsspec

In [4]: from cmip6_preprocessing.preprocessing import combined_preprocessing

In [5]: col = intake.open_esm_datastore(
   ...:         "https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json"
   ...:     )

In [6]: cat = col.search(
   ...:         source_id='CanESM5',
   ...:         experiment_id='historical',
   ...:         variable_id='thetao',
   ...:         member_id="r1i1p1f1",
   ...:         table_id="Omon",
   ...:         grid_label='gn',
   ...:     )

In [7]: cat
Out[7]: <pangeo-cmip6 catalog with 1 dataset(s) from 1 asset(s)>

In [8]: zarr_kwargs = {'consolidated':True}

  • intake-esm

In [9]: ddict = cat.to_dataset_dict(
   ...:             zarr_kwargs=zarr_kwargs,
   ...:             preprocess=combined_preprocessing,
   ...:             storage_options={"token": "anon"},
   ...:         )
   ...: _, ds = ddict.popitem()
   ...: ds

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

Out[9]: ███████████████████████████████████████████| 100.00% [1/1 00:00<00:00]
<xarray.Dataset>
Dimensions:        (bnds: 2, lev: 45, member_id: 1, time: 1980, vertex: 4, x: 360, y: 291)
Coordinates:
  * x              (x) float32 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
  * y              (y) float32 -78.3935 -78.19058 ... 89.36653 89.74177
    lat            (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * lev            (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
    lev_bounds     (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
    lon            (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * time           (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bounds    (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    lat_verticies  (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
    lon_verticies  (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
  * bnds           (bnds) int64 0 1
  * vertex         (vertex) int64 0 1 2 3
    lon_bounds     (bnds, y, x) float32 dask.array<chunksize=(1, 291, 360), meta=np.ndarray>
    lat_bounds     (bnds, y, x) float32 dask.array<chunksize=(1, 291, 360), meta=np.ndarray>
  * member_id      (member_id) <U8 'r1i1p1f1'
Data variables:
    thetao         (member_id, time, lev, y, x) float32 dask.array<chunksize=(1, 6, 45, 291, 360), meta=np.ndarray>
  • xarray + fsspec
In [10]: mm = fsspec.get_mapper(
    ...:             cat.df["zstore"][0]
    ...:         )
    ...: ds_raw = xr.open_zarr(mm, **zarr_kwargs)
    ...: ds_manual = combined_preprocessing(ds_raw)
    ...: ds_manual
    ...: 

Out[10]: 
<xarray.Dataset>
Dimensions:        (bnds: 2, lev: 45, time: 1980, vertex: 4, x: 360, y: 291)
Coordinates:
  * x              (x) float32 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
  * y              (y) float32 -78.3935 -78.19058 ... 89.36653 89.74177
    lat            (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * lev            (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
    lev_bounds     (lev, bnds) float64 dask.array<chunksize=(45, 2), meta=np.ndarray>
    lon            (y, x) float32 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * time           (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bounds    (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    lat_verticies  (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
    lon_verticies  (y, x, vertex) float32 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
  * bnds           (bnds) int64 0 1
  * vertex         (vertex) int64 0 1 2 3
    lon_bounds     (bnds, y, x) float32 dask.array<chunksize=(1, 291, 360), meta=np.ndarray>
    lat_bounds     (bnds, y, x) float32 dask.array<chunksize=(1, 291, 360), meta=np.ndarray>
Data variables:
    thetao         (time, lev, y, x) float32 dask.array<chunksize=(6, 45, 291, 360), meta=np.ndarray>

@andersy005 (Member)

Also, it appears that with #306 everything is back to working (including jbusecke/xMIP#67), with the exception of one test:

-- Docs: https://docs.pytest.org/en/stable/warnings.html
========================== short test summary info ===========================
FAILED cmip6_preprocessing/tests/test_preprocessing_cloud.py::test_check_dim_coord_values[CESM2-thetao-ssp585-gn]
= 1 failed, 644 passed, 2101 skipped, 38 xfailed, 9030 warnings in 8754.16s (2:25:54) =

P.S. That's a lot of long-running tests... I wasn't sure how long they were going to take to finish 😅

@jbusecke (Contributor, Author)

Yeah, it's not great that it runs for an hour... I will try to improve this in the next few days, but I wanted to get it running first.

Thanks for the fix! I am rerunning the tests, and will investigate what is happening with that one model...

@jbusecke (Contributor, Author)

Also thanks for the quick fix!

@andersy005 (Member)

Of course! Just merged #306 into master. When you get a chance, can you test jbusecke/xMIP#67 against intake-esm's master branch and let me know how it goes? I plan to release a new version once I have your confirmation that everything is working.

@jbusecke (Contributor, Author)

Just switched the CI to the upstream master. Now we just need to wait an hour...🤪

@jbusecke (Contributor, Author)

It seems that everything is back to normal with the newest master (jbusecke/xMIP#67). Could your single-model failure have been caused by some hiccup with the cloud data?

@andersy005 (Member)

> Could your single-model failure have been caused by some hiccup with the cloud data?

Perhaps... Thank you for the update! I am going to cut a new release by the end of the day.

@andersy005 (Member)

Closing this, as it has been addressed in #306.

Please reopen if you still encounter this issue with the latest stable version (v2020.12.18).
