-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AWS NOAA WHOI #221
AWS NOAA WHOI #221
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for submitting this recipe @kathrynberger!
In past recipes (example), the way I resolved this "creation date in the URL" issue was by first crawling the server to build a list of URLs.
Your wildcard solution is much simpler! I didn't know that could even work.
My one concern would be if there are multiple versions for the same date. That would lead to duplicate / incorrect data. I feel like we should strive for deterministic file patterns wherever possible.
recipes/aws-noaa-whoi/recipe.py
Outdated
day = base + pd.Timedelta(days=time) | ||
input_url_pattern = ( | ||
's3://noaa-cdr-sea-surface-temp-whoi-pds/data/{day:%Y}' | ||
'/SEAFLUX-OSB-CDR_V02R00_SST_D{day:%Y%m%d}_C*.nc' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this actually work? I did not think glob-style wildcards were supported by s3fs! 🤯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rabernat thanks so much for your feedback. It did appear to work correctly when I tried it on 3 months worth of data, but that may not have been enough of a wider test to pick up duplicated/incorrect data. I agree with your recommendation and follow a deterministic file pattern as better practice. I'll revise following the example you've provided. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented deterministic file pattern as recommended above and successfully tested on a 3 year dataset. All output looks good.
revise with active bakery Co-authored-by: Charles Stern <[email protected]>
/run aws-noaa-sea-surface-temp-whoi |
The test failed, but I'm sure we can find out why! Pangeo Forge maintainers are working diligently to provide public logs for contributors. |
@kathrynberger Could this be failing because there needs to be a requirements.txt in the recipe folder? for if this is the issue, maybe it'd be good to include adding an optional requirements.txt as a step in https://pangeo-forge.readthedocs.io/en/latest/pangeo_forge_cloud/recipe_contribution.html ? |
Thanks for jumping in @rbavery! Gosh, I've really got to resolve pangeo-forge/pangeo-forge-orchestrator#150, without which it's basically impossible for community members to debug these errors. 🙃 So the cloud workers definitely have Looking at the backed logs, I'm seeing:
|
recipes/aws-noaa-whoi/recipe.py
Outdated
filter(lambda x: x.endswith('.nc'), fs.ls(url_base + str(year), detail=False)) | ||
) | ||
|
||
pattern = pattern_from_file_sequence(file_list, 'time', nitems_per_file=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My guess is that the files in file_list
do not start with s3://
, so that fsspec is looking for local files instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on a separate recipe and found this is the issue. adding a map to add the missing s3://
solved it.
from os.path import join
fs = s3fs.S3FileSystem(anon=True)
is_nc = lambda x: x.endswith('.nc')
add_s3 = lambda x: "s3://" + x
for year in years:
file_list += sorted(
filter(is_nc, map(add_s3, fs.ls(join(url_base, str(year)), detail=False)))
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @cisaacstern @rabernat and @rbavery for catching this, I see the error here and will revise with the suggestion above. 👍
recipes/aws-noaa-whoi/recipe.py
Outdated
|
||
url_base = 's3://noaa-cdr-sea-surface-temp-whoi-pds/data/' | ||
|
||
years = range(1988, 2022) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might want to handle years differently, given that when data is available in 2023, this won't find the 2023 data.
years could instead be defined like so?
years = range(1988, 2022) | |
years_folders = fs.ls(join(url_base)) | |
years = list(map(lambda x: os.path.basename(x), years_folders)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good suggestion @rbavery just tested this to verify, great way to consider ingesting future years. I'll add this to the revisions
/run aws-noaa-sea-surface-temp-whoi |
🎉 The test run of import xarray as xr
store = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/test/pangeo-forge/staged-recipes/recipe-run-1389/aws-noaa-sea-surface-temp-whoi.zarr"
ds = xr.open_dataset(store, engine='zarr', chunks={})
ds |
@sharkinsspatial reports that the test data looks good, so I'll merge this. |
A recipe for AWS NOAA WHOI Sea Surface Temperature, one of the three resources made available as part of NOAA's Oceanic Climate Data Records (see: https://registry.opendata.aws/noaa-cdr-oceanic/)
File pattern identified and tested both using pruned recipe feature, as well as running on three months worth of data. Output looks correct and as expected.
Closes out issue: https://github.com/developmentseed/aws-asdi/issues/21