
Add CESM POP 1-degree #56

Open
wants to merge 27 commits into master from cesm-pop-lowres-1deg

Conversation

@paigem
Contributor

paigem commented Jun 28, 2021

This is the start of a pipeline for CESM run v5_rel04_BC5_ne30_g16. I have followed steps 1-5 laid out by @cisaacstern in Issue #46.

@cisaacstern Could I get a bit more guidance on how to write the meta.yaml and pipeline.py files? It looks like the instructions at https://github.com/paigem/staged-recipes.git aren't quite finished yet.

@cisaacstern
Member

@paigem, thanks for kicking off this PR!

The staged-recipes documentation is actually quite out of date. In response to your question, I've begun the process of updating the README in commit 08cff96 of #57. The big picture is that pipeline.py files are actually no longer needed, and have been replaced by the recipe.py file you've already created in this PR.

As for meta.yaml, that will be needed for automating recipe execution by a Bakery, but is not required to get started on your recipe. In fact, it is also possible to manually execute your recipe, thereby creating your Zarr store without a meta.yaml. We can revisit this once the recipe itself is where we want it. 😄
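(For reference, manual execution at that stage looks roughly like the sketch below. This is a hedged outline using the method names from the pangeo-forge-recipes version current at the time of this thread; storage target setup is omitted, and exact names may differ between releases.)

from recipe import recipe  # the recipe object defined in recipes/cesm-pop-lowres-1deg/recipe.py

for input_key in recipe.iter_inputs():
    recipe.cache_input(input_key)    # download each source file to the cache
recipe.prepare_target()              # initialize the target Zarr store
for chunk_key in recipe.iter_chunks():
    recipe.store_chunk(chunk_key)    # write each chunk into the target
recipe.finalize_target()             # consolidate metadata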

For a next step, I've pushed an outline to your recipe.py for you to fill in. In order to see the outline locally, you'll need to fetch it. From within your local PR branch, run:

git fetch origin
git merge origin/cesm-pop-lowres-1deg

Now, when you open recipes/cesm-pop-lowres-1deg/recipe.py, it should look like this.

You will want to change the f-string returned from the make_full_path function here so that it reflects a real path for downloading the source files. Place {variable} in the part of that path where the variable names occur. I do recall we'll need credentials for these downloads, but you can ignore that for now, as those will be added in at the execution phase (either manual, or in a Bakery).

Some more detail on make_full_path functions is provided in the docs here. We do not have a time dimension in the make_full_path function for this recipe, because based on #46 (comment), it appears the data provider wraps all of the time steps for a given variable in a single netCDF file.

The merging of the variables is achieved by the MergeDim instance. To complete this, you'll just need to fill out the vars list with the names of your 14 variables as they appear in the source file paths.

Finally, we'll need a chunk size here.
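To make the outline concrete, here is a hedged sketch of roughly what the completed recipe.py could look like. The URL template, variable names, and chunk size are placeholders for the real values discussed above, and the class names follow the pangeo-forge-recipes API current at the time:

from pangeo_forge_recipes.patterns import FilePattern, MergeDim
from pangeo_forge_recipes.recipes import XarrayZarrRecipe


def make_full_path(variable):
    # Placeholder URL template; swap in the real source path, keeping {variable}
    # where the variable name appears. No time dimension is needed, since each
    # variable's full time series lives in a single netCDF file.
    return (
        "https://data-provider.example/path/to/"
        f"v5_rel04_BC5_ne30_g16.pop.h.nday1.{variable}.00010101-01661231.nc"
    )


variable_merge_dim = MergeDim("variable", keys=["VAR1", "VAR2"])  # ...fill in all 14 variable names
pattern = FilePattern(make_full_path, variable_merge_dim)

recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 200})  # chunk size to be confirmed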

Once you've made these changes to recipe.py (or as many of them as make sense at this time), you can commit them back to GitHub with:

git add -A
git commit -m "<some descriptive message here>"
git push origin cesm-pop-lowres-1deg

Feel free to ask any questions if any of the above doesn't make sense. It's still early days for us onboarding new recipe contributors, so if something doesn't make sense, it's likely my fault for not explaining it well!

@paigem
Contributor Author

paigem commented Jun 30, 2021

Thanks @cisaacstern for the great instructions! I think I have mostly completed all of the above steps:

  • added the recipe.py file
  • added the download path for each variable (note: this path takes you to a sign-in page, so authentication is needed for this step)
  • added a list of strings with all 14 variables
  • put in an estimated chunk size, based only on the size of the full dataset and the number of time steps
    • currently, I put {time=200}, and estimate that 200 time steps should be roughly 60 MB
    • this should be double-checked before the final conversion to Zarr (see the quick size check below)
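A quick, hedged back-of-the-envelope check of that estimate, assuming the 1-degree POP grid (nlat=384, nlon=320, as in the dataset printout later in this thread) and 4-byte floats; actual on-disk chunk sizes will differ with dtype and compression:

# Rough uncompressed size of one {time: 200} chunk of a single (time, nlat, nlon) variable.
nlat, nlon, time_steps = 384, 320, 200
bytes_per_value = 4  # float32; double this for float64
chunk_bytes = nlat * nlon * time_steps * bytes_per_value
print(f"~{chunk_bytes / 1e6:.0f} MB per chunk before compression")  # ~98 MB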

@paigem
Contributor Author

paigem commented Jul 9, 2021

@cisaacstern just checking on the progress here! Let me know if there is something more I need to do in order to move this forward.

@cisaacstern
Member

@paigem, thanks for following up! I made a few commits to the recipe to get it ready to run.

I then made an OpenID account, but when I click "Download Options" here, I get the Authorization Required response copied below.

Rather than requesting and waiting for authorization on my account, perhaps it's easiest if we use your credentials to cache the netCDFs? If you're comfortable with that approach, you can securely message them to me at https://keybase.io/cstern. If not, I'll make a request for authorization on my account.

"""

Authorization Required

You do not have enough privileges to access the requested resource.

Please request membership in the following group to gain access to the requested resource:

Group Name: CCSM
Description: CCSM (Community Climate System Model) users

"""

@cisaacstern
Member

This recipe has been blocked by three issues, the first two of which I have (albeit provisional) solutions for:

  1. We don't want @paigem's secret API token to end up as part of the names of a lot of public files on the GCS cache. Installing from pangeo-forge-recipes#167 (Make fsspec_open_kwargs, query_string_secrets, & is_opendap attributes of FilePattern) solves this.
  2. How do we append the API token to the recipe URLs without pushing it to the public PR? 0548f8b, plus the usage pattern described in pangeo-forge-recipes#167 (comment), resolves this for manual execution (see the sketch below).
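For manual execution, that usage pattern looks roughly like the sketch below (hedged; the URL template and variable keys are abbreviated, and the environment variable name is illustrative):

import os

from pangeo_forge_recipes.patterns import FilePattern, MergeDim


def make_full_path(variable):
    # Abbreviated placeholder for the real THREDDS URL template in recipe.py.
    return f"https://tds.ucar.edu/thredds/fileServer/.../{variable}.00010101-01661231.nc"


pattern = FilePattern(
    make_full_path,
    MergeDim("variable", keys=["HMXL_2", "SSH_2"]),  # ...all 14 variables
)

# Attach the token at execution time, so it is appended to request URLs when
# downloading but never appears in the recipe file or in the cached file names.
# The environment variable name is illustrative.
pattern.query_string_secrets = {"api-token": os.environ["CESM_API_TOKEN"]}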

The third issue is as yet unresolved, and I'll make it its own comment because it's (I think) unrelated to these.

@paigem
Contributor Author

paigem commented Jul 16, 2021

Thank you for these updates @cisaacstern!

@cisaacstern
Member

The actively blocking issue for this recipe is what appear to be incorrect HTTP Content-Length headers.

Unfortunately, the authentication issues referenced in #56 (comment) mean I can't provide a fully reproducible example, but I will describe the situation in detail below.

For the first input url, curl reads the headers as:

curl -I https://tds.ucar.edu/thredds/fileServer/datazone/campaign/cesm/collections/ASD/v5_rel04_BC5_ne30_g16/ocn/proc/tseries/daily/v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc?api-token=$TOKEN
HTTP/1.1 200 
Date: Fri, 16 Jul 2021 01:44:31 GMT
Server: Apache
Last-Modified: Thu, 30 Oct 2014 01:38:26 GMT
Accept-Ranges: bytes
Content-Type: application/x-netcdf
Content-Length: 1598457682
Set-Cookie: JSESSIONID=EA92736C6194748B575F70E199020F8E;path=/;Secure;HttpOnly
Via: 1.1 tds.ucar.edu

fsspec agrees on Content-Length, because bytes=1590000000-1598457681 is the last range it fetches when called from within _copy_btw_filesystems:

fsspec.http - DEBUG - Fetch range for <File-like object HTTPFileSystem, https://tds.ucar.edu/thredds/fileServer/datazone/campaign/cesm/collections/ASD/v5_rel04_BC5_ne30_g16/ocn/proc/tseries/daily/v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc?api-token=<TOKEN>>: 1590000000-1598457682
fsspec.http - DEBUG - https://tds.ucar.edu/thredds/fileServer/datazone/campaign/cesm/collections/ASD/v5_rel04_BC5_ne30_g16/ocn/proc/tseries/daily/v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc?api-token=<TOKEN> : bytes=1590000000-1598457681
pangeo_forge_recipes.storage - DEBUG - _copy_btw_filesystems copying block of 8457682 bytes
pangeo_forge_recipes.storage - DEBUG - _copy_btw_filesystems reading data
pangeo_forge_recipes.storage - DEBUG - FSSpecTarget.open yielded
pangeo_forge_recipes.storage - DEBUG - _copy_btw_filesystems done
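For reference, the size fsspec infers from that header can be checked directly (a minimal sketch; the URL is truncated and the token is a placeholder):

import fsspec

# fsspec derives .size from the Content-Length header of its initial request.
url = "https://tds.ucar.edu/thredds/fileServer/.../HMXL_2.00010101-01661231.nc?api-token=<TOKEN>"
with fsspec.open(url) as f:
    print(f.size)  # matches the 1598457682 reported by curl above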

The problem is that this file is not 1598457682 bytes ≈ 1.6 GB, but rather 18778326866 bytes ≈ 18.8 GB, so fsspec gives us an incomplete file. We know the actual file size because the wget script provided by NCAR (copied to a gist here, with tokens removed) somehow determines the actual length (and downloads without issue):

Running wget_cmems.sh version: 
wget command: wget -c --user-agent=wget/1.21.1/esg/3.0.34-20210630-162451/created/2021-07-12T18:29:21-06:00
v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc ...Downloading
--2021-07-15 18:56:46--  https://tds.ucar.edu/thredds/fileServer/datazone/campaign/cesm/collections/ASD/v5_rel04_BC5_ne30_g16/ocn/proc/tseries/daily/v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc?api-token=$TOKEN
Resolving tds.ucar.edu (tds.ucar.edu)... 128.117.225.68
Connecting to tds.ucar.edu (tds.ucar.edu)|128.117.225.68|:443... connected.
HTTP request sent, awaiting response... 206 
Length: 18778326866 (17G), 17616172837 (16G) remaining [application/x-netcdf]
Saving to: ‘v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc’

@martindurant, any suggestions for how to move forward? I've tried setting fsspec_open_kwargs={"block_size": 0} to stream the file instead, which does start working but then seems to reliably stall after a minute or two, before the transfer is complete. I've also tried (perhaps naively) to explicitly pass the known length as fsspec_open_kwargs={"size": 18778326866}, which turns out not to be a recognized kwarg.

The only thing I can think of is that NCAR's wget script is doing some further authentication using the JSESSIONID in the Set-Cookie header, but I'm not certain (a) whether that's even technically plausible, or (b) how to test and/or replicate it with fsspec if so.

cc @rsignell-usgs b/c we discussed this yesterday on the Pangeo call, and perhaps your THREDDS expertise will yield some key insights

@martindurant

I've tried setting fsspec_open_kwargs={"block_size": 0} to stream the file instead, which does start working, and then seems to reliably stall after a minute or two,

Can you figure out why? Is this with dask running? Is h5py or another library involved in this download? You can try interrupting the coroutines or thread to find out at least where it's waiting.

@cisaacstern
Member

cisaacstern commented Jul 22, 2021

Can you figure out why?

I'm executing this recipe incrementally in Jupyter (no Dask).

Turns out this is not an fsspec issue. I believe that with {"block_size": 0}, these two logger.debug calls (a and b) in pangeo_forge_recipes/storage.py write such an incredibly large number of lines so quickly to stdout that Jupyter can't keep up with rendering them all. At first, a bunch of debug lines appear, then the cell appears to stall. But in reality, only the output render has stalled; fsspec is actually cranking away in the background. Eventually, Jupyter catches up and provides a warning:

IOPub message rate exceeded
IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)

And a little while later, the output render resumes and the download is completed. I am about to submit a PR with a proposed solution to this issue.

@cisaacstern
Member

cisaacstern commented Jul 23, 2021

@rabernat, now that I have this recipe's files cached to GCS, I'm encountering an issue where they take an unexpectedly long time to open with xr.open_dataset. This means prepare_target hangs for many minutes here.

As far as I can tell, these are valid netCDF files which have been transferred to the cache uncorrupted, because I did get a bunch of intelligible debug information back from xarray when I re-installed it with all these added print statements. The logging just goes on for minutes upon minutes without returning the dataset, however, which is quite different from the "instantaneous" lazy loading I've come to expect.

This repo contains a notebook (binder link included) which demonstrates this behavior with an attempt to open the ssh_2 netCDF input for this recipe:

https://github.com/cisaacstern/cesm-open-dataset

NCAR provides a "netCDF header" for each of the input files, and I've also provided the ssh_2 header alongside the notebook as a text file. The only item there that seems particularly noteworthy to me is that it contains 60,590 timesteps, which is considerably more than for any other netCDF file I've worked with thus far.

Is there anything else in that header that points to why this might be taking so long to open? Are there any strategies we might employ to make xr.open_dataset run faster in this case? I assume if I just left my laptop open for a very long time, this recipe might work as-is, but I figured this was a good opportunity to make sure I wasn't overlooking something obvious.

@rabernat
Contributor

rabernat commented Jul 27, 2021

As far as I can tell, these are valid netCDF files which have been transferred to the cache uncorrupted,

If the netCDF files are very large (many GB), it can indeed take a long time to open them with xarray (many minutes). This is why we don't like to use netCDF on cloud storage! Since this recipe has many variables, each one of the variable files needs to be opened up by prepare_target. Currently this happens in serial.

To get some intuition for this, I'd recommend trying to directly open one of the files from the cache, outside of pangeo_forge_recipes. Could you just do the following?

%time ds = xr.open_dataset(cache_path)
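Since cache_path points at an object in GCS rather than a local file, that might look roughly like the following (a hedged sketch with a placeholder path; engine="h5netcdf" is an assumption about how the cached netCDFs are read):

import time

import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem()  # credentials as appropriate
cache_path = "gs://<cache-bucket>/<cached-input>.nc"  # placeholder

tic = time.time()
with fs.open(cache_path) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)
print(f"opened in {time.time() - tic:.1f} s")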

That will give us some valuable quantitative information. This is also a place where it matters a lot where you are running the recipe. Are you sure you are in the same cloud region as your cache bucket?

One way we might be able to make this go a bit faster would be to use the approach suggested in pangeo-forge/pangeo-forge-recipes#82, where we parallelize the opening of the files required by prepare_target. But that would introduce a lot of complexity.

In general, the most important property of Pangeo Forge recipes is that they work, not that they are extremely fast. We only have to run the recipe once. It's acceptable if prepare_target takes an hour for certain recipes.

@cisaacstern
Member

Could you just do?
%time ds = xr.open_dataset(cache_path)

Yes, I did try that in https://github.com/cisaacstern/cesm-open-dataset/blob/main/cesm_open_dataset.ipynb. But at the time I didn't know that ...

It's acceptable if prepare_target takes an hour for certain recipes.

... so now, I can move forward with this by just "letting it run".

This is also a place where it matters a lot where you are running the recipe. Are you sure you are in the same cloud region as your cache bucket?

In this case, I believe yes: opening from the pangeo-forge GCS cache (which I'm pretty sure is in us-central1), from a Pangeo Cloud notebook at us-central1-b.gcp.pangeo.io. But good to know to keep an eye out for this in the future.

@cisaacstern
Member

cisaacstern commented Aug 16, 2021

@paigem, this data is now available as a zarr store on OSN:

import fsspec
import xarray as xr

path = "s3://Pangeo/pangeo-forge/cesm_pop_lowres_1deg/v5_rel04_BC5_ne30_g16.zarr"
mapper = fsspec.get_mapper(
    path, anon=True, client_kwargs={'endpoint_url': 'https://ncsa.osn.xsede.org'},
)
ds = xr.open_zarr(mapper, consolidated=True)
print(ds)
<xarray.Dataset>
Dimensions:             (nlat: 384, nlon: 320, time: 60590, z_t: 60, z_w: 60, d2: 2, z_t_150m: 15, z_w_bot: 60, z_w_top: 60)
Coordinates:
    TLAT                (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    TLONG               (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    ULAT                (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    ULONG               (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    day                 (time) int64 dask.array<chunksize=(200,), meta=np.ndarray>
  * time                (time) object 0001-01-02 00:00:00 ... 0167-01-01 00:0...
    time_counter        (time) int64 dask.array<chunksize=(200,), meta=np.ndarray>
  * z_t                 (z_t) float32 500.0 1.5e+03 ... 5.125e+05 5.375e+05
  * z_t_150m            (z_t_150m) float32 500.0 1.5e+03 ... 1.35e+04 1.45e+04
  * z_w                 (z_w) float32 0.0 1e+03 2e+03 ... 5e+05 5.25e+05
  * z_w_bot             (z_w_bot) float32 1e+03 2e+03 3e+03 ... 5.25e+05 5.5e+05
  * z_w_top             (z_w_top) float32 0.0 1e+03 2e+03 ... 5e+05 5.25e+05
Dimensions without coordinates: nlat, nlon, d2
Data variables: (12/63)
    ANGLE               (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    ANGLET              (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    DXT                 (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    DXU                 (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    DYT                 (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    DYU                 (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    ...                  ...
    sea_ice_salinity    float64 ...
    sflux_factor        float64 ...
    sound               float64 ...
    stefan_boltzmann    float64 ...
    time_bound          (time, d2) object dask.array<chunksize=(200, 2), meta=np.ndarray>
    vonkar              float64 ...
Attributes:
    Conventions:   CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netcdf/CF-curren...
    calendar:      All years have exactly  365 days.
    cell_methods:  cell_methods = time: mean ==> the variable values are aver...
    contents:      Diagnostic and Prognostic Variables
    history:       none
    nsteps_total:  24
    revision:      $Id: tavg.F90 41939 2012-11-14 16:37:23Z [email protected] $
    source:        CCSM POP2, the CCSM Ocean Component
    start_time:    This dataset was created on 2014-03-18 at 17:05:44.6
    tavg_sum:      64800.0
    title:         v5_rel04_BC5_ne30_g16

Do let me know if that works for you, and if that data looks as you expect it to.

Oh, and please feel free to ignore the added coordinates "day" and "time_counter". I added them to help save out binned subset files using the netcdf_subsets module. I'll follow up with an explanation of that module in a separate comment.

@cisaacstern
Member

cisaacstern commented Aug 16, 2021

@rabernat, the usage pattern for the subsetting module I mentioned in #56 (comment) is:

import gcsfs

from subset_recipe import variables
from netcdf_subsets import NetCDFSubsets  # the subsetting module described above

fs_gcs = gcsfs.GCSFileSystem()  # ...creds, etc.
target_bins = 60  # number of bins required to bring each subset < 500 MB
cache_path = "<path where original source files are cached>"

for var in variables:
    subset_var = NetCDFSubsets(
        cache_fs=fs_gcs,
        cache_dir=cache_path,
        var_name=var,
        target_bins=target_bins,
        concat_dim_name="time",
        concat_dim_length=60590,
    )
    subset_var.subset_netcdf()

I ran this code for each of this recipe's 14 source files, saving out netCDF subsets (rather than using recipe.subset_inputs), because the source files from NCAR average about 30 GB each and take between 5 and 8 minutes for xarray to open from the GCS cache (even when running same-region/same-provider compute). That means the recipe.subset_inputs approach would take 14 files x 60 subsets x 5 mins = 70 hours in file-opening time alone.

Once I'd saved out the netCDF subsets (in 4 parallel batches, ~4 hrs total runtime), I ran the recipe from a new subset_recipe.py, and the execution took about 9 hours start to finish, even with copy_input_to_local_file=True (to avoid any risk of interference from the h5py hanging issues we've been seeing, e.g., in #68 (comment) and pangeo-forge/pangeo-forge-recipes#177).

Noting this approach for my own future reference, and in case it's useful for pushing through other recipes that face similar constraints, even though I imagine it's not an ideal long-term solution.

@rabernat
Contributor

the source files from NCAR average about ~30 GB each, and take between 5-8 minutes for xarray to open from the GCS cache (even when running same-region/same-provider compute)

Just out of curiosity, what if you open the files directly with netCDF4, rather than xarray? How long does it take to open them and list the variables? I'm also curious how long it would take to create a reference fs for these files (cc @martindurant).
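What that might look like, roughly (a hedged sketch; the file path is a placeholder for a locally downloaded copy of one source file):

import time

import netCDF4

local_path = "v5_rel04_BC5_ne30_g16.pop.h.nday1.SSH_2.00010101-01661231.nc"  # placeholder

tic = time.time()
nc = netCDF4.Dataset(local_path)
print(list(nc.variables))                      # just list the variable names
print(f"opened in {time.time() - tic:.2f} s")
nc.close()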

@paigem
Contributor Author

paigem commented Aug 17, 2021

Hooray, thanks @cisaacstern!! It looks great so far, apart from two comments/questions:

  1. It looks like the SST variable did not get transferred here. I count 14 variables from the original source files and only 13 on the cloud, and it looks like SST is the missing variable. This is one of the key output variables that users will be interested in, so if it's not too much trouble, it should probably be added to the cloud.
  2. Is this data sitting on AWS S3 storage? Perhaps it doesn't matter, but the other NCAR CESM models are (I believe) on Google Cloud Storage, and I have been using Pangeo's Google Cloud deployment. I just want to verify that I won't incur excess costs when accessing data stored on S3 from Google Cloud.

@cisaacstern
Member

It looks like the SST variable did not get transferred here.

Thanks for catching this, Paige. This was indeed an oversight on my part. When I ran the NetCDF file subsetting described in #56 (comment), I overlooked the fact that "SST" is a substring of "SST2", which meant that the method _fn_from_var grabbed the wrong file to subset from the cache. Referring back to my logs confirms that the subsets for SST were incorrectly created from the SST2 source file:

Filename for SST is pangeo-forge-us-central1/pangeo-forge-cache/cesm_pop_lowres_1deg/v5_rel04_BC5_ne30_g16/80554384345dbfeb88651f95e0ef7065-https_tds.ucar.edu_thredds_fileserver_datazone_campaign_cesm_collections_asd_v5_rel04_bc5_ne30_g16_ocn_proc_tseries_daily_v5_rel04_bc5_ne30_g16.pop.h.nday1.sst2.00010101-01661231.nc
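To illustrate the collision (a minimal sketch; the real _fn_from_var implementation differs, and the filenames are abbreviated to the source-file names):

var = "SST"
filenames = [
    "v5_rel04_BC5_ne30_g16.pop.h.nday1.SST.00010101-01661231.nc",
    "v5_rel04_BC5_ne30_g16.pop.h.nday1.SST2.00010101-01661231.nc",
]

# Naive substring matching also matches the SST2 file, because "SST" is a
# substring of "SST2":
print([fn for fn in filenames if var in fn])          # both files

# Matching the dot-delimited field instead is unambiguous:
print([fn for fn in filenames if f".{var}." in fn])   # SST file only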

I'm now re-subsetting the SST source file with the correction I just made in f6062f3, and will fix the zarr store once that's complete. I'll notify you on this thread when the corrected zarr is available.

Is this data sitting on AWS S3 storage? I just want to verify that I won't incur excess costs when accessing data stored on S3 from Google Cloud.

Yes, this data is on the Open Storage Network (OSN), which is accessed through an S3-compatible API. OSN is not a "requester pays" bucket, however, so accessing it from Pangeo's Google Cloud deployment will not result in any egress costs for the requester. (I do all of my OSN bucket work from Google Cloud as well.)

@cisaacstern
Member

I'll notify you on this thread when the corrected zarr is available.

The rebuild is complete. The full dataset should now be accessible via the code block provided above in #56 (comment).

@andersy005
Member

pre-commit.ci autofix
