AWS NOAA WHOI #221

Merged: 7 commits merged on Dec 13, 2022. (Diff below shows changes from 4 commits.)
26 changes: 26 additions & 0 deletions recipes/aws-noaa-whoi/meta.yaml
@@ -0,0 +1,26 @@
title: 'AWS NOAA WHOI SST'
description: 'Analysis-ready datasets derived from AWS NOAA WHOI NetCDF'
pangeo_forge_version: '0.9.2'
pangeo_notebook_version: '2021.07.17'
recipes:
  - id: aws-noaa-sea-surface-temp-whoi
    object: 'recipe:recipe'
provenance:
  providers:
    - name: 'AWS NOAA Oceanic CDR'
      description: 'Registry of Open Data on AWS National Oceanographic & Atmospheric Administration National Centers for Environmental Information'
      roles:
        - producer
        - licensor
      url: s3://noaa-cdr-sea-surface-temp-whoi-pds/
  license: 'Open Data'
maintainers:
  - name: 'Kathryn Berger'
    orcid: '0000-0001-9731-6519'
    github: kathrynberger
bakery:
  id: 'devseed.bakery.development.aws.us-west-2'  # must come from a valid list of bakeries
  target: pangeo-forge-aws-bakery-flowcachebucketdasktest4-10neo67y7a924
  resources:
    memory: 4096
    cpu: 1024
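
As an aside on how this configuration ties to the code below: `object: 'recipe:recipe'` is a module:attribute reference, i.e. the `recipe` object defined at the bottom of recipe.py in the same directory. The snippet below is a minimal sketch of that lookup convention for illustration only; it is not pangeo-forge's actual loader, and everything beyond the 'recipe:recipe' string taken from this meta.yaml is assumed.

import importlib

# Sketch only: resolve a '<module>:<attribute>' reference the way the
# meta.yaml entry implies -- module name before the colon, attribute after it.
ref = 'recipe:recipe'
module_name, _, attr_name = ref.partition(':')

module = importlib.import_module(module_name)  # imports recipe.py (must be on sys.path)
recipe_obj = getattr(module, attr_name)        # the `recipe` object defined in recipe.py
print(type(recipe_obj))                        # expected: an HDFReferenceRecipe instance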
21 changes: 21 additions & 0 deletions recipes/aws-noaa-whoi/recipe.py
@@ -0,0 +1,21 @@
import pandas as pd

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes.reference_hdf_zarr import HDFReferenceRecipe

start_date = '1988-01-01'


def format_function(time):
    base = pd.Timestamp(start_date)
    day = base + pd.Timedelta(days=time)
    input_url_pattern = (
        's3://noaa-cdr-sea-surface-temp-whoi-pds/data/{day:%Y}'
        '/SEAFLUX-OSB-CDR_V02R00_SST_D{day:%Y%m%d}_C*.nc'

Contributor (@rabernat):

Does this actually work? I did not think glob-style wildcards were supported by s3fs! 🤯

Contributor Author (@kathrynberger):

@rabernat, thanks so much for your feedback. It did appear to work correctly when I tried it on 3 months' worth of data, but that may not have been a wide enough test to pick up duplicated or incorrect data. I agree with your recommendation to follow a deterministic file pattern as better practice. I'll revise following the example you've provided. 👍

Contributor Author (@kathrynberger):

Implemented the deterministic file pattern as recommended above and successfully tested it on a 3-year dataset. All output looks good.


    )
    return input_url_pattern.format(day=day)


dates = pd.date_range(start_date, '2022-11-08', freq='D')
pattern = FilePattern(format_function, ConcatDim('time', range(len(dates)), 1))
recipe = HDFReferenceRecipe(pattern, netcdf_storage_options={'anon': True})
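
Following up on the wildcard discussion in the review thread above: the `_C*` segment encodes a per-file creation timestamp, so the date alone does not fully determine the URL. One way to make each input deterministic is to resolve the wildcard once, at pattern-build time, via an anonymous s3fs listing. The sketch below only illustrates that idea, assuming the bucket permits anonymous listing and exactly one file exists per day; it is not necessarily the revision adopted in the later commits of this PR (not shown in this diff), and `deterministic_format_function` is a hypothetical name.

import pandas as pd
import s3fs

start_date = '1988-01-01'
fs = s3fs.S3FileSystem(anon=True)  # the NOAA CDR bucket is public

def deterministic_format_function(time):
    # Resolve the one concrete key for this day instead of handing a glob
    # pattern to the recipe; fail loudly if the match is not unique.
    day = pd.Timestamp(start_date) + pd.Timedelta(days=time)
    matches = fs.glob(
        f'noaa-cdr-sea-surface-temp-whoi-pds/data/{day:%Y}'
        f'/SEAFLUX-OSB-CDR_V02R00_SST_D{day:%Y%m%d}_C*.nc'
    )
    if len(matches) != 1:
        raise ValueError(f'expected exactly one file for {day:%Y-%m-%d}, got {matches}')
    return f's3://{matches[0]}'

print(deterministic_format_function(0))  # concrete URL for 1988-01-01

Because the full date range covers more than 12,000 daily files, one listing per file is slow at pattern-build time; caching a single listing per year directory would be the more practical variant of the same idea.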