Add CESM POP 1-degree #56
Conversation
@paigem, thanks for kicking off this PR! The `meta.yaml` and `pipeline.py` instructions you linked are indeed still a work in progress. For a next step, I've pushed an outline for you to fill in to your `recipe.py`.
Now, when you open `recipe.py`, you will want to change the f-string returned from the URL-format function so that it builds the correct source URL for each input file. The merging of the variables is achieved by the file pattern's merge dimension. Finally, we'll need a chunk size here. Once you've made these changes to `recipe.py`, we can move ahead with a test run.
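For orientation, here is a minimal sketch of how those pieces could fit together in pangeo-forge-recipes; the variable list, time key, and chunk size are illustrative placeholders rather than final values for this dataset:

```python
# Sketch only: variable names, time key, and chunk size are placeholders.
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

variables = ["HMXL_2", "SST"]  # ...plus the recipe's other variables


def make_url(variable, time):
    # The f-string that builds one source URL per (variable, time) pair
    return (
        "https://tds.ucar.edu/thredds/fileServer/datazone/campaign/cesm/"
        "collections/ASD/v5_rel04_BC5_ne30_g16/ocn/proc/tseries/daily/"
        f"v5_rel04_BC5_ne30_g16.pop.h.nday1.{variable}.{time}.nc"
    )


pattern = FilePattern(
    make_url,
    ConcatDim("time", keys=["00010101-01661231"]),  # each file spans the full run
    MergeDim("variable", keys=variables),  # merging of variables happens here
)

recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 180})  # the chunk size
```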
Feel free to ask questions if any of the above is unclear. It's still early days for us onboarding new recipe contributors, so if something doesn't make sense, it's likely my fault for not explaining it well!
Thanks @cisaacstern for the great instructions! I think I have mostly completed all of the above steps.
@cisaacstern just checking on the progress here! Let me know if there is something more I need to do in order to move this forward.
@paigem, thanks for following up! I made a few commits to the recipe to get it ready to run. I then made an OpenID account, but when I click "Download Options" here, I get the Authorization Required response copied below. Rather than requesting and waiting for authorization on my account, perhaps it's easiest if we use your credentials to cache the netCDFs? If you're comfortable with that approach, you can securely message them to me at https://keybase.io/cstern. If not, I'll make a request for authorization on my account.

"""
Authorization Required

You do not have enough privileges to access the requested resource. Please request membership in the following group to gain access to the requested resource:
"""
This recipe has been blocked by three issues, the first two of which I have (albeit provisional) solutions for. The third issue is as yet unresolved, and I'll make it its own comment because it's (I think) unrelated to the first two.
Thank you for these updates, @cisaacstern!
The actively blocking issue for this recipe is what appear to be incorrect HTTP Content-Length headers.
For the first input URL, the relevant debug log is:

```
fsspec.http - DEBUG - Fetch range for <File-like object HTTPFileSystem, https://tds.ucar.edu/thredds/fileServer/datazone/campaign/cesm/collections/ASD/v5_rel04_BC5_ne30_g16/ocn/proc/tseries/daily/v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc?api-token=<TOKEN>>: 1590000000-1598457682
fsspec.http - DEBUG - https://tds.ucar.edu/thredds/fileServer/datazone/campaign/cesm/collections/ASD/v5_rel04_BC5_ne30_g16/ocn/proc/tseries/daily/v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc?api-token=<TOKEN> : bytes=1590000000-1598457681
pangeo_forge_recipes.storage - DEBUG - _copy_btw_filesystems copying block of 8457682 bytes
pangeo_forge_recipes.storage - DEBUG - _copy_btw_filesystems reading data
pangeo_forge_recipes.storage - DEBUG - FSSpecTarget.open yielded
pangeo_forge_recipes.storage - DEBUG - _copy_btw_filesystems done
```

The problem is that this file is not 1598457682/1e9 ≈ 1.6 GB, but rather 18778326866/1e9 ≈ 18.8 GB. So the copy, which trusts the server's reported size, stops well short of the full file.
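The mismatch can be confirmed independently of fsspec by asking the server directly for its reported size (a sketch; the api-token is a placeholder):

```python
import requests

url = (
    "https://tds.ucar.edu/thredds/fileServer/datazone/campaign/cesm/"
    "collections/ASD/v5_rel04_BC5_ne30_g16/ocn/proc/tseries/daily/"
    "v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc"
    "?api-token=<TOKEN>"  # placeholder token
)

resp = requests.head(url, allow_redirects=True)
# Reported here: ~1.6 GB; actual file size: ~18.8 GB
print(resp.headers.get("Content-Length"))
```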
@martindurant, any suggestions for how to move forward? I've tried setting various fsspec HTTP options without success. The only thing I can think is that NCAR's wget script is doing some further authentication that we are not replicating here.

cc @rsignell-usgs b/c we discussed this yesterday on the Pangeo call, and perhaps your THREDDS expertise will yield some key insights
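One conceivable workaround (which I haven't fully tested) would be to stream the download sequentially rather than issuing range requests bounded by the bogus Content-Length; the URL here is a placeholder for the THREDDS URL above:

```python
import fsspec

url = "<the THREDDS fileServer URL above>"  # placeholder

# block_size=0 makes fsspec's HTTP filesystem stream the response
# instead of performing ranged reads based on the reported size.
with fsspec.open(url, block_size=0) as f, open("local_copy.nc", "wb") as out:
    while True:
        chunk = f.read(10_000_000)  # 10 MB at a time
        if not chunk:
            break
        out.write(chunk)
```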
Can you figure out why? Is this with dask running? Is h5py or another library in on this download? You can try interrupting the coroutines or thread to find out at least where it's waiting.
I'm executing this recipe incrementally in Jupyter (no Dask). Turns out this is not a hang after all: Jupyter was simply throttling the log output with an `IOPub message rate exceeded` warning. A little while later, the output rendering resumes and the download is completed. I am about to submit a PR with a proposed solution to this issue.
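For reference, the throttling itself can be relaxed via Jupyter's config if needed (values here are illustrative):

```python
# In ~/.jupyter/jupyter_notebook_config.py, where `c` is the config object:
c.NotebookApp.iopub_msg_rate_limit = 1.0e10  # messages/sec
c.NotebookApp.iopub_data_rate_limit = 1.0e10  # bytes/sec
```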
@rabernat, now that I have this recipe's files cached to GCS, I'm encountering an issue where they take an unexpectedly long time to open with xarray. As far as I can tell, these are valid netCDF files which have been transferred to the cache uncorrupted, because I did get a bunch of intelligible debug information back from xarray when I re-installed it with added print statements. The logging just goes on for minutes upon minutes without returning the dataset, however, which is quite different from the "instantaneous" lazy loading I've come to expect.

This repo contains a notebook (binder link included) which demonstrates the behavior with an attempt to open one of the cached files: https://github.com/cisaacstern/cesm-open-dataset

NCAR provides a "netCDF header" for each of the input files, and I've also included it in the repo above. Is there anything else in that header that points to why this might be taking so long to open? Are there any strategies we might employ to make the open faster?
If the netCDF files are very large (many GB), it can indeed take a long time to open them with xarray (many minutes). This is why we don't like to use netCDF on cloud storage! Since this recipe has many variables, each one of the variable files needs to be opened up by xarray before writing can begin.

To get some intuition for this, I'd recommend trying to directly open one of the files from the cache, outside of pangeo_forge_recipes. Could you just try something like the following, and time it (the bucket path here is a placeholder)?
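```python
import time

import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem()  # assumes default credentials
path = "pangeo-forge-cache/<one-cached-input>.nc"  # placeholder path

tic = time.perf_counter()
ds = xr.open_dataset(fs.open(path))  # may need h5netcdf for file-like objects
print(f"opened in {time.perf_counter() - tic:.1f} s")
print(ds)
```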
That will give us some valuable quantitative information. This is also a place where it matters a lot where you are running the recipe. Are you sure you are in the same cloud region as your cache bucket?

One way we might be able to make this go a bit faster would be to use the approach suggested in pangeo-forge/pangeo-forge-recipes#82, where we parallelize the opening of the files required to prepare the target dataset.

In general, the most important property of Pangeo Forge recipes is that they work, not that they are extremely fast. We only have to run the recipe once. It's acceptable if this step is slow.
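A minimal sketch of that parallelized-open idea (the paths and worker count here are just illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import gcsfs
import xarray as xr

fs = gcsfs.GCSFileSystem()
paths = [
    "pangeo-forge-cache/var1.nc",  # placeholders: the cached variable files
    "pangeo-forge-cache/var2.nc",
]


def open_one(path):
    return xr.open_dataset(fs.open(path))  # may need h5netcdf for file-like objects


# Open all variable files concurrently instead of one after another
with ThreadPoolExecutor(max_workers=8) as pool:
    datasets = list(pool.map(open_one, paths))

ds = xr.merge(datasets)
```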
Yes, I did try that in https://github.com/cisaacstern/cesm-open-dataset/blob/main/cesm_open_dataset.ipynb. But at the time I didn't know that ...
... so now, I can move forward with this by just "letting it run".
In this case, I believe yes: I'm opening from the pangeo-forge GCS cache, which I'm pretty sure is in the same region I'm running in.
@paigem, this data is now available as a zarr store on OSN:

```python
import fsspec
import xarray as xr

path = "s3://Pangeo/pangeo-forge/cesm_pop_lowres_1deg/v5_rel04_BC5_ne30_g16.zarr"
mapper = fsspec.get_mapper(
    path, anon=True, client_kwargs={'endpoint_url': 'https://ncsa.osn.xsede.org'},
)
ds = xr.open_zarr(mapper, consolidated=True)
print(ds)
```
Do let me know if that works for you, and if the data looks as you expect it to. Oh, and please feel free to ignore the added coordinates.
@rabernat, the usage pattern for the subsetting module I mentioned in #56 (comment) is:

```python
import gcsfs

# assuming NetCDFSubsets lives alongside `variables` in the subsetting module
from subset_recipe import NetCDFSubsets, variables

fs_gcs = gcsfs.GCSFileSystem()  # ...creds, etc.
target_bins = 60  # number of bins required to bring each subset < 500 MB
cache_path = "..."  # path where original source files are cached

for var in variables:
    subset_var = NetCDFSubsets(
        cache_fs=fs_gcs,
        cache_dir=cache_path,
        var_name=var,
        target_bins=target_bins,
        concat_dim_name="time",
        concat_dim_length=60590,
    )
    subset_var.subset_netcdf()
```

I ran this code for each of this recipe's 14 source files, saving out netCDF subsets along the time dimension (rather than using the recipe's built-in input subsetting). Once I'd saved out the netCDF subsets (in 4 parallel batches, ~4 hrs total runtime), I ran the recipe from a new file pattern pointing at the subset files.

Noting this approach for my own future reference, and in case it's useful for pushing through other recipes that face similar constraints, even though I imagine it's not an ideal long-term solution.
Just out of curiosity, what if you open the files directly with netCDF4, rather than xarray? How long does it take to open them and list the variables? I'm also curious how long it would take to create a reference fs for these files (cc @martindurant).
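Something along these lines would answer that, assuming one file has been downloaded locally (the path is a placeholder):

```python
import time

import netCDF4

path = "/tmp/v5_rel04_BC5_ne30_g16.pop.h.nday1.HMXL_2.00010101-01661231.nc"  # placeholder

tic = time.perf_counter()
nc = netCDF4.Dataset(path)
print(list(nc.variables))
print(f"opened and listed variables in {time.perf_counter() - tic:.1f} s")
```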
Hooray, thanks @cisaacstern!! It looks great so far, but I have two comments/questions.
Thanks for catching this, Paige. This was indeed an oversight on my part. When I ran the NetCDF file subsetting described in #56 (comment), I overlooked the fact that the SST source file is structured differently from the others.
I'm now re-subsetting the SST source file with the correction I just made in f6062f3, and will fix the zarr store once that's complete. I'll notify you on this thread when the corrected zarr is available.
Yes, this data is on Open Storage Network (OSN), which exposes an AWS S3-compatible interface. OSN is not a "requester pays" bucket, however, so accessing it from Pangeo's Google Cloud deployment will not result in any egress costs for the requester. (I do all of my OSN bucket work from Google Cloud as well.)
The rebuild is complete. The full dataset should now be accessible via the code block provided above in #56 (comment).
This is the start of a pipeline for CESM run v5_rel04_BC5_ne30_g16. I have followed steps 1-5 laid out by @cisaacstern in Issue #46.
@cisaacstern Could I get a bit more guidance on how to write the `meta.yaml` and `pipeline.py` files? It looks like the instructions at https://github.com/paigem/staged-recipes.git aren't quite finished yet.