
AWS NOAA WHOI #221

Merged · 7 commits · Dec 13, 2022
Conversation

kathrynberger (Contributor)

A recipe for AWS NOAA WHOI Sea Surface Temperature, one of the three resources made available as part of NOAA's Oceanic Climate Data Records (see https://registry.opendata.aws/noaa-cdr-oceanic/).

The file pattern was identified and tested both with the pruned-recipe feature and by running on three months' worth of data. The output looks correct and as expected.

Closes out issue: https://github.com/developmentseed/aws-asdi/issues/21

@rabernat (Contributor) left a comment

Thanks so much for submitting this recipe @kathrynberger!

In past recipes (example), the way I resolved this "creation date in the URL" issue was by first crawling the server to build a list of URLs.

Your wildcard solution is much simpler! I didn't know that could even work.

My one concern would be if there are multiple versions for the same date. That would lead to duplicate / incorrect data. I feel like we should strive for deterministic file patterns wherever possible.

day = base + pd.Timedelta(days=time)
input_url_pattern = (
    's3://noaa-cdr-sea-surface-temp-whoi-pds/data/{day:%Y}'
    '/SEAFLUX-OSB-CDR_V02R00_SST_D{day:%Y%m%d}_C*.nc'
)
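
(For context, a sketch of how this fragment would sit inside a complete URL-builder for a pangeo-forge FilePattern; the function name make_url and the value of base are assumptions, not taken from the recipe itself:)

import pandas as pd

base = pd.Timestamp('1988-01-01')  # assumed start date of the record

def make_url(time):
    # time is assumed to be an integer day offset along the concat dimension
    day = base + pd.Timedelta(days=time)
    input_url_pattern = (
        's3://noaa-cdr-sea-surface-temp-whoi-pds/data/{day:%Y}'
        '/SEAFLUX-OSB-CDR_V02R00_SST_D{day:%Y%m%d}_C*.nc'
    )
    return input_url_pattern.format(day=day)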
rabernat (Contributor)

Does this actually work? I did not think glob-style wildcards were supported by s3fs! 🤯
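
(For reference, fsspec filesystems, s3fs included, do implement a glob method, which is presumably what makes the wildcard work. A minimal sketch that also checks the multiple-versions concern above, using an example day taken from the traceback later in this thread:)

import s3fs

fs = s3fs.S3FileSystem(anon=True)

# Expand the creation-date wildcard for a single day; more than one match
# would mean multiple versions of the same date, i.e. duplicate data.
matches = fs.glob(
    'noaa-cdr-sea-surface-temp-whoi-pds/data/1988/'
    'SEAFLUX-OSB-CDR_V02R00_SST_D19880102_C*.nc'
)
assert len(matches) == 1, f'expected one file for this day, got: {matches}'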

kathrynberger (Author)

@rabernat thanks so much for your feedback. It did appear to work correctly when I tried it on three months' worth of data, but that may not have been a wide enough test to pick up duplicated/incorrect data. I agree with your recommendation that a deterministic file pattern is better practice. I'll revise following the example you've provided. 👍

kathrynberger (Author)

Implemented the deterministic file pattern as recommended above and successfully tested it on a three-year dataset. All output looks good.

@cisaacstern (Member)

/run aws-noaa-sea-surface-temp-whoi

pangeo-forge bot commented Nov 17, 2022

The test failed, but I'm sure we can find out why!

Pangeo Forge maintainers are working diligently to provide public logs for contributors. That feature is not quite ready yet, however, so please reach out on this thread to a maintainer, and they'll help you diagnose the problem.

rbavery (Contributor) commented Nov 17, 2022

@kathrynberger Could this be failing because there needs to be a requirements.txt (for s3fs) in the recipe folder? I needed to install s3fs locally to get the recipe to run. I've seen some other recipes include a requirements.txt, like this one: https://github.com/pangeo-forge/staged-recipes/pull/220/files

If this is the issue, maybe it'd be good to document adding an optional requirements.txt as a step in https://pangeo-forge.readthedocs.io/en/latest/pangeo_forge_cloud/recipe_contribution.html?
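
(For reference, such a file would just be a pip-style requirements list alongside the recipe; a hypothetical example, path and contents assumed:)

# recipes/aws-noaa-sea-surface-temp-whoi/requirements.txt (hypothetical)
s3fs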

@cisaacstern (Member)

Thanks for jumping in @rbavery!

Gosh, I've really got to resolve pangeo-forge/pangeo-forge-orchestrator#150, without which it's basically impossible for community members to debug these errors. 🙃

So the cloud workers definitely have s3fs by default. A requirements.txt would be for more exotic requirements. (Unless something really major has changed in the last month that I missed!)

Looking at the backend logs, I'm seeing:

File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/reference_hdf_zarr.py", line 30, in scan_file
    with file_opener(fname, **config.netcdf_storage_options) as fp:
  File "/srv/conda/envs/notebook/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/storage.py", line 283, in file_opener
    with opener as fp:
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/core.py", line 103, in __enter__
    f = self.fs.open(self.path, mode=mode)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/spec.py", line 1094, in open
    f = self._open(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/implementations/local.py", line 175, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/implementations/local.py", line 273, in __init__
    self._open()
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/fsspec/implementations/local.py", line 278, in _open
    self.f = open(self.path, mode=self.mode)
RuntimeError: FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/noaa-cdr-sea-surface-temp-whoi-pds/data/1988/SEAFLUX-OSB-CDR_V02R00_SST_D19880102_C20160820.nc' [while running 'Start|scan_file|Reshuffle_000|finalize|Reshuffle_001/scan_file/Execute-ptransform-56']
"

file_list += sorted(
    filter(lambda x: x.endswith('.nc'), fs.ls(url_base + str(year), detail=False))
)

pattern = pattern_from_file_sequence(file_list, 'time', nitems_per_file=1)
Contributor

My guess is that the files in file_list do not start with s3://, so that fsspec is looking for local files instead.
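
(That diagnosis matches fsspec's behavior: a path without a URL scheme is resolved against the local filesystem, which is why the traceback above shows LocalFileOpener and a path under /home/jovyan. A minimal sketch, using the file name from the traceback:)

import fsspec

# Without a scheme, fsspec falls back to the local filesystem:
fs, _ = fsspec.core.url_to_fs(
    'noaa-cdr-sea-surface-temp-whoi-pds/data/1988/'
    'SEAFLUX-OSB-CDR_V02R00_SST_D19880102_C20160820.nc'
)
print(type(fs).__name__)  # LocalFileSystem

# With the s3:// scheme, the same path resolves to S3:
fs, _ = fsspec.core.url_to_fs(
    's3://noaa-cdr-sea-surface-temp-whoi-pds/data/1988/'
    'SEAFLUX-OSB-CDR_V02R00_SST_D19880102_C20160820.nc',
    anon=True,
)
print(type(fs).__name__)  # S3FileSystem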

rbavery (Contributor) commented Nov 18, 2022

I'm working on a separate recipe and found this is the issue. Adding a map to prepend the missing s3:// solved it.

from os.path import join

import s3fs

fs = s3fs.S3FileSystem(anon=True)
is_nc = lambda x: x.endswith('.nc')
add_s3 = lambda x: 's3://' + x  # fs.ls returns paths without a scheme
file_list = []
for year in years:
    file_list += sorted(
        filter(is_nc, map(add_s3, fs.ls(join(url_base, str(year)), detail=False)))
    )
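
(Worth noting: the sorted call matters here, since pattern_from_file_sequence presumably concatenates files along 'time' in list order, so each per-year listing needs to be chronological.)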

kathrynberger (Author)

Thanks @cisaacstern, @rabernat, and @rbavery for catching this. I see the error here and will revise with the suggestion above. 👍


url_base = 's3://noaa-cdr-sea-surface-temp-whoi-pds/data/'

years = range(1988, 2022)
@rbavery (Contributor)

We might want to handle years differently: when data becomes available in 2023, this won't find the 2023 data.

years could instead be defined like so:

Suggested change:
- years = range(1988, 2022)
+ years_folders = fs.ls(join(url_base))
+ years = list(map(lambda x: os.path.basename(x), years_folders))
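
(A hypothetical hardening of that suggestion, in case fs.ls ever returns stray non-year entries; the isdigit guard and sorting are assumptions, not part of the suggestion. url_base is taken from the diff below:)

import os
import s3fs

fs = s3fs.S3FileSystem(anon=True)
url_base = 's3://noaa-cdr-sea-surface-temp-whoi-pds/data/'

years_folders = fs.ls(url_base, detail=False)
# keep only entries whose final path component parses as a year, e.g. '.../1988' -> '1988'
years = sorted(
    os.path.basename(p.rstrip('/')) for p in years_folders
    if os.path.basename(p.rstrip('/')).isdigit()
)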

kathrynberger (Author)

Good suggestion @rbavery, I just tested this to verify. It's a great way to handle ingesting future years. I'll add this to the revisions.

@sharkinsspatial (Contributor)

/run aws-noaa-sea-surface-temp-whoi

pangeo-forge bot commented Dec 13, 2022

🎉 The test run of aws-noaa-sea-surface-temp-whoi at 166954e succeeded!

import xarray as xr

store = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1389/aws-noaa-sea-surface-temp-whoi.zarr"
ds = xr.open_dataset(store, engine='zarr', chunks={})
ds

@cisaacstern (Member)

@sharkinsspatial reports that the test data looks good, so I'll merge this.
