
AWS NOAA WHOI #221

Merged · 7 commits · Dec 13, 2022
Conversation

kathrynberger (Contributor)

A recipe for AWS NOAA WHOI Sea Surface Temperature, one of the three resources made available as part of NOAA's Oceanic Climate Data Records (see https://registry.opendata.aws/noaa-cdr-oceanic/).

The file pattern was identified and tested both with the pruned-recipe feature and by running on three months' worth of data. The output looks correct and as expected.

Closes out issue: https://github.com/developmentseed/aws-asdi/issues/21

@rabernat (Contributor) left a comment

Thanks so much for submitting this recipe @kathrynberger!

In past recipes (example), the way I resolved this "creation date in the URL" issue was by first crawling the server to build a list of URLs.

Your wildcard solution is much simpler! I didn't know that could even work.

My one concern would be if there are multiple versions for the same date. That would lead to duplicate / incorrect data. I feel like we should strive for deterministic file patterns wherever possible.

day = base + pd.Timedelta(days=time)
input_url_pattern = (
    's3://noaa-cdr-sea-surface-temp-whoi-pds/data/{day:%Y}'
    '/SEAFLUX-OSB-CDR_V02R00_SST_D{day:%Y%m%d}_C*.nc'
)
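
(For context, a sketch of how this fragment would sit inside a complete URL-builder for a pangeo-forge FilePattern; the function name make_url and the value of base are assumptions, not taken from the recipe itself:)

import pandas as pd

base = pd.Timestamp('1988-01-01')  # assumed start date of the record

def make_url(time):
    # time is assumed to be an integer day offset along the concat dimension
    day = base + pd.Timedelta(days=time)
    input_url_pattern = (
        's3://noaa-cdr-sea-surface-temp-whoi-pds/data/{day:%Y}'
        '/SEAFLUX-OSB-CDR_V02R00_SST_D{day:%Y%m%d}_C*.nc'
    )
    return input_url_pattern.format(day=day)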
rabernat (Contributor)

Does this actually work? I did not think glob-style wildcards were supported by s3fs! 🤯
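
(For reference, fsspec filesystems, s3fs included, do implement a glob method, which is presumably what makes the wildcard work. A minimal sketch that also checks the multiple-versions concern above, using an example day taken from the traceback later in this thread:)

import s3fs

fs = s3fs.S3FileSystem(anon=True)

# Expand the creation-date wildcard for a single day; more than one match
# would mean multiple versions of the same date, i.e. duplicate data.
matches = fs.glob(
    'noaa-cdr-sea-surface-temp-whoi-pds/data/1988/'
    'SEAFLUX-OSB-CDR_V02R00_SST_D19880102_C*.nc'
)
assert len(matches) == 1, f'expected one file for this day, got: {matches}'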

kathrynberger (Author)

@rabernat thanks so much for your feedback. It did appear to work correctly when I tried it on three months' worth of data, but that may not have been a wide enough test to pick up duplicated/incorrect data. I agree with your recommendation that a deterministic file pattern is better practice. I'll revise following the example you've provided. 👍

kathrynberger (Author)

Implemented the deterministic file pattern as recommended above and successfully tested it on a three-year dataset. All output looks good.

@cisaacstern (Member)

/run aws-noaa-sea-surface-temp-whoi

pangeo-forge bot commented Nov 17, 2022

The test failed, but I'm sure we can find out why!

Pangeo Forge maintainers are working diligently to provide public logs for contributors. That feature is not quite ready yet, however, so please reach out on this thread to a maintainer, and they'll help you diagnose the problem.

rbavery (Contributor) commented Nov 17, 2022

@kathrynberger Could this be failing because there needs to be a requirements.txt (for s3fs) in the recipe folder? I needed to install s3fs locally to get the recipe to run. I've seen some other recipes include a requirements.txt, like this one: https://github.com/pangeo-forge/staged-recipes/pull/220/files

If this is the issue, maybe it'd be good to document adding an optional requirements.txt as a step in https://pangeo-forge.readthedocs.io/en/latest/pangeo_forge_cloud/recipe_contribution.html?
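
(For reference, such a file would just be a pip-style requirements list alongside the recipe; a hypothetical example, path and contents assumed:)

# recipes/aws-noaa-sea-surface-temp-whoi/requirements.txt (hypothetical)
s3fs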

@cisaacstern (Member)

Thanks for jumping in @rbavery!

Gosh, I've really got to resolve pangeo-forge/pangeo-forge-orchestrator#150, without which it's basically impossible for community members to debug these errors. 🙃

So the cloud workers definitely have s3fs by default. A requirements.txt would be for more exotic requirements. (Unless something really major has changed in the last month that I missed!)

Looking at the backend logs, I'm seeing:

File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/reference_hdf_zarr.py", line 30, in scan_file
    with file_opener(fname, **config.netcdf_storage_options) as fp:
  File "/srv/conda/envs/notebook/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/storage.py", line 283, in file_opener
    with opener as fp:
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/spec.py", line 1094, in open
    f = self._open(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/implementations/local.py", line 175, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/implementations/local.py", line 273, in __init__
    self._open()
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/implementations/local.py", line 278, in _open
    self.f = open(self.path, mode=self.mode)
RuntimeError: FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/noaa-cdr-sea-surface-temp-whoi-pds/data/1988/SEAFLUX-OSB-CDR_V02R00_SST_D19880102_C20160820.nc' [while running 'Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/scan_file/Execute-ptransform-56']
"

file_list += sorted(
    filter(lambda x: x.endswith('.nc'), fs.ls(url_base + str(year), detail=False))
)

pattern = pattern_from_file_sequence(file_list, 'time', nitems_per_file=1)
Contributor

My guess is that the files in file_list do not start with s3://, so that fsspec is looking for local files instead.
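
(That diagnosis matches fsspec's behavior: a path without a URL scheme is resolved against the local filesystem, which is why the traceback above shows LocalFileOpener and a path under /home/jovyan. A minimal sketch, using the file name from the traceback:)

import fsspec

# Without a scheme, fsspec falls back to the local filesystem:
fs, _ = fsspec.core.url_to_fs(
    'noaa-cdr-sea-surface-temp-whoi-pds/data/1988/'
    'SEAFLUX-OSB-CDR_V02R00_SST_D19880102_C20160820.nc'
)
print(type(fs).__name__)  # LocalFileSystem

# With the s3:// scheme, the same path resolves to S3:
fs, _ = fsspec.core.url_to_fs(
    's3://noaa-cdr-sea-surface-temp-whoi-pds/data/1988/'
    'SEAFLUX-OSB-CDR_V02R00_SST_D19880102_C20160820.nc',
    anon=True,
)
print(type(fs).__name__)  # S3FileSystem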

rbavery (Contributor) commented Nov 18, 2022

I'm working on a separate recipe and found this is the issue. Adding a map to prepend the missing s3:// solved it.

from os.path import join

import s3fs

fs = s3fs.S3FileSystem(anon=True)
is_nc = lambda x: x.endswith('.nc')
add_s3 = lambda x: 's3://' + x  # fs.ls returns paths without a scheme
file_list = []
for year in years:
    file_list += sorted(
        filter(is_nc, map(add_s3, fs.ls(join(url_base, str(year)), detail=False)))
    )
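
(Worth noting: the sorted call matters here, since pattern_from_file_sequence presumably concatenates files along 'time' in list order, so each per-year listing needs to be chronological.)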

kathrynberger (Author)

Thanks @cisaacstern, @rabernat, and @rbavery for catching this. I see the error here and will revise with the suggestion above. 👍


url_base = 's3://noaa-cdr-sea-surface-temp-whoi-pds/data/'

years = range(1988, 2022)
@rbavery (Contributor)

We might want to handle years differently: when data becomes available in 2023, this won't find the 2023 data.

years could instead be defined like so:

Suggested change:
- years = range(1988, 2022)
+ years_folders = fs.ls(join(url_base))
+ years = list(map(lambda x: os.path.basename(x), years_folders))
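
(A hypothetical hardening of that suggestion, in case fs.ls ever returns stray non-year entries; the isdigit guard and sorting are assumptions, not part of the suggestion. url_base is taken from the diff below:)

import os
import s3fs

fs = s3fs.S3FileSystem(anon=True)
url_base = 's3://noaa-cdr-sea-surface-temp-whoi-pds/data/'

years_folders = fs.ls(url_base, detail=False)
# keep only entries whose final path component parses as a year, e.g. '.../1988' -> '1988'
years = sorted(
    os.path.basename(p.rstrip('/')) for p in years_folders
    if os.path.basename(p.rstrip('/')).isdigit()
)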

kathrynberger (Author)

Good suggestion @rbavery, I just tested this to verify. It's a great way to handle ingesting future years. I'll add this to the revisions.

@sharkinsspatial (Contributor)

/run aws-noaa-sea-surface-temp-whoi

pangeo-forge bot commented Dec 13, 2022

🎉 The test run of aws-noaa-sea-surface-temp-whoi at 166954e succeeded!

import xarray as xr

store = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1389/aws-noaa-sea-surface-temp-whoi.zarr"
ds = xr.open_dataset(store, engine='zarr', chunks={})
ds

@cisaacstern (Member)

@sharkinsspatial reports that the test data looks good, so I'll merge this.
