
Example pipeline for SWOT-Xover #14

Open
roxyboy opened this issue Dec 1, 2020 · 41 comments
Labels
proposed recipe swot-adac SWOT Adopt-a-Crossover Dataset

Comments

@roxyboy

roxyboy commented Dec 1, 2020

Source Dataset

SWOT-Xover is a subset of several basin-scale model outputs at ~1/50° resolution, with hourly surface and daily interior data. The subsets will cover the crossover regions of the SWOT fast-sampling phase.

  • Project description is given here
  • File format: zarr
  • Organization of files: one file each for six months of surface data and six months of interior data (i.e., two files per model per region).
  • File access: automating the zarrification of datasets pulled from FTP servers.

Transformation / Alignment / Merging

Files should be concatenated along the time dimension.
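The concatenation step above is the standard xarray pattern; a minimal self-contained sketch (with hypothetical synthetic monthly datasets standing in for the real model files):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-ins for two consecutive monthly files.
times_a = pd.date_range("2010-02-01", periods=28, freq="D")
times_b = pd.date_range("2010-03-01", periods=31, freq="D")
ds_a = xr.Dataset({"sst": ("time", np.random.rand(28))}, coords={"time": times_a})
ds_b = xr.Dataset({"sst": ("time", np.random.rand(31))}, coords={"time": times_b})

# Concatenate along the time dimension, as the recipe would.
combined = xr.concat([ds_a, ds_b], dim="time")
```

In a real recipe, Pangeo Forge performs this concatenation incrementally rather than in memory, but the resulting time axis is the same.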

Output Dataset

The zarrification of data should be automated via the pangeo-forge pipeline following the pangeo-forge recipe. To facilitate the automation, we would ask each modelling group to produce their outputs in netCDF4 format and make them available via an FTP server.
A single monthly file of daily-averaged 3D data of u, v, w, T & S in one region is ~30 GB. With the four regions, six months and five models, this would sum to ~3.6 TB in total on cloud storage. The chunks of the zarr dataset will be on the order of {'time': 30, 'z': 5, 'y': 100, 'x': 100}.
For the surface, a single daily file of hourly-averaged data of SST, SSS, SSH, wind stress & buoyancy fluxes in one region is ~380 MB. With the regions, months and models, this sums to ~45 GB. The chunks of the zarr dataset will be on the order of {'time': 100, 'y': 100, 'x': 100}.
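As a quick sanity check on those chunk specs, the per-chunk size can be computed directly (assuming float32 variables, i.e. 4 bytes per element):

```python
from functools import reduce
from operator import mul

def chunk_bytes(chunks, itemsize=4):
    """Bytes in one chunk given per-dimension chunk lengths (float32 default)."""
    return reduce(mul, chunks.values()) * itemsize

interior = {"time": 30, "z": 5, "y": 100, "x": 100}
surface = {"time": 100, "y": 100, "x": 100}

print(chunk_bytes(interior) / 1e6)  # 6.0 MB per chunk
print(chunk_bytes(surface) / 1e6)   # 4.0 MB per chunk
```

Both land in the few-megabyte range typically recommended for cloud-optimized Zarr chunks.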

@rabernat
Contributor

Can you provide more details about the input files? How big are they? What URLs will we use to download them?

@roxyboy
Author

roxyboy commented Jan 24, 2021

I was hoping that each modelling group could upload their zarrified data to the Wasabi cloud storage...

@rabernat
Contributor

I was hoping that each modelling group could upload their zarrified data to the Wasabi cloud storage...

Then this is not a Pangeo Forge pipeline. The point of Pangeo Forge is to automatically put together the Zarr in the cloud.

What you propose is fine--it's just not part of Pangeo Forge. Let's leave this open for now as we figure out the best path forward.

@roxyboy
Author

roxyboy commented Feb 3, 2021

Then this is not a Pangeo Forge pipeline. The point of Pangeo Forge is to automatically put together the Zarr in the cloud.

What you propose is fine--it's just not part of Pangeo Forge. Let's leave this open for now as we figure out the best path forward.

I think we're going to try the Pangeo Forge pipeline for the eNATL60 data. Depending on how this goes, we may recommend that other modelling centers follow the same pipeline.

@rabernat
Contributor

rabernat commented Feb 3, 2021

Great! To move forward, we need some more details about exactly where to find the data and how it is formatted. Please edit your original issue to conform to the template (https://github.com/pangeo-forge/staged-recipes/issues).

@roxyboy
Author

roxyboy commented Feb 3, 2021

Yes, I'm still working on extracting the cross-over regions (which, surprisingly, takes time when dealing with massive netCDF files), but I will update the details as soon as I get this hashed out.

@rabernat
Contributor

rabernat commented Feb 3, 2021

(which surprisingly takes time dealing with massive netCDF files)

If only there were a better format! 🤣 😉

@roxyboy
Author

roxyboy commented Feb 5, 2021

This is getting a bit ahead of ourselves, but in case we ask the modelling groups to provide their data via FTP or OPeNDAP links for the pangeo-forge pipeline, would the "computation" costs of uploading them to the cloud come out of the payments we'll be making to 2i2c? I'm only asking because I think it would be best if we could reduce the amount of hassle each modelling group goes through. The idea I had in mind was to develop the pipeline on the SWOT-AdAC JupyterHub.

@roxyboy
Author

roxyboy commented Feb 10, 2021

@rabernat I added a bit more detail in the output data section. Is this sufficient?

@rabernat
Contributor

Is this sufficient?

Can you provide an actual working FTP link to one of the datasets?

would the "computation" costs to upload them to the cloud come out from the payments we'll be making to 2i2c?

No, they will be supported by Pangeo Forge and our NSF grant. 2i2c is for the JupyterHub.

A single monthly file of daily-averaged 3D data of u, v, w, T & S in one region is ~30Gb.

This will require pangeo-forge/pangeo-forge-recipes#49, a feature that is not yet implemented. We are working on it.

@roxyboy
Author

roxyboy commented Feb 12, 2021

Can you provide an actual working FTP link to one of the datasets?

Sorry for the lagged response. Here is a working link:
https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/catalog/meomopendap/extract/SWOT-Adac/Interior/eNATL60/catalog.html

@roxyboy
Author

roxyboy commented Feb 18, 2021

Here is another working ftp link for lNALT60: https://data.geomar.de/downloads/20.500.12085/0e95d316-f1ba-47e3-b667-fc800afafe22/data/

@rabernat
Contributor

Ok thanks for these. Will have a look soon.

I talked with @lesommer, and we decided to try putting this data in OSN for now.

@roxyboy
Author

roxyboy commented Feb 22, 2021

The eNATL60 regional outputs for regions 1-3 are now all available here: https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/catalog/meomopendap/extract/SWOT-Adac/catalog.html

@rabernat
Contributor

FYI, that server is giving HTTP certificate errors.

$ curl -I https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/fileServer/meomopendap/extract/SWOT-Adac/Surface/eNATL60/Region03-surface-hourly_2010-04.nc
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

Would it be possible to get this fixed?

@roxyboy
Author

roxyboy commented Feb 22, 2021

FYI, that server is giving HTTP certificate errors.

$ curl -I https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/fileServer/meomopendap/extract/SWOT-Adac/Surface/eNATL60/Region03-surface-hourly_2010-04.nc
curl: (60) SSL certificate problem: certificate has expired
More details here: https://curl.haxx.se/docs/sslcerts.html

Would it be possible to get this fixed?

@auraoupa Do you know why this is happening...?

@AurelieAlbert

Yes, it is a known issue with our OPeNDAP server (expired certificate). We get around it by adding --no-check-certificate to our wget commands; the equivalent for curl would be --insecure (I haven't tried it). But it may be more efficient (and cleaner) to have it fixed... I'll try to make that happen!

@rabernat
Contributor

We need to get the files via fsspec and unfortunately I don't (yet) know how to work around the certificate error...but there must be a way! I'll try to dig deeper on my end too.
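One possible workaround, untested against this server: fsspec's HTTP filesystem is backed by aiohttp, and extra keyword arguments are forwarded to aiohttp's request methods, which accept ssl=False to skip certificate verification (the analogue of curl --insecure). Something along these lines might work:

```python
import fsspec

url = (
    "https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/fileServer/"
    "meomopendap/extract/SWOT-Adac/Surface/eNATL60/Region03-surface-hourly_2010-04.nc"
)

# fsspec.open is lazy: no connection is made until the file is entered,
# at which point ssl=False should (assumption) disable cert verification.
of = fsspec.open(url, ssl=False)
# with of as fp:
#     data = fp.read(1024)  # requires network access
```

This is a sketch of the general aiohttp escape hatch, not a confirmed fix for this particular certificate error.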

@rabernat
Contributor

rabernat commented Feb 27, 2021

Now it looks like the server https://ige-meom-opendap.univ-grenoble-alpes.fr/ is down completely? This is making it hard to develop the recipe.

@auraoupa
Contributor

auraoupa commented Mar 1, 2021

Sorry about that, it should be OK now. About the certificate, the university says they should be fixing it soon! I'll keep you posted.

@auraoupa
Contributor

auraoupa commented Mar 1, 2021

The certificate is now valid, I hope it helps for the development of the recipe !

@rabernat
Contributor

rabernat commented Mar 1, 2021

Success!

import fsspec
import xarray as xr

url = 'https://ige-meom-opendap.univ-grenoble-alpes.fr/thredds/fileServer/meomopendap/extract/SWOT-Adac/Surface/eNATL60/Region03-surface-hourly_2010-04.nc'
with fsspec.open(url) as fp:
    ds = xr.open_dataset(fp)
    display(ds)
<xarray.Dataset>
Dimensions:        (time_counter: 720, x: 574, y: 675)
Coordinates:
    nav_lon        (y, x) float32 ...
    nav_lat        (y, x) float32 ...
    time_centered  (time_counter) datetime64[ns] 2010-04-01T00:30:00 ... 2010...
  * time_counter   (time_counter) datetime64[ns] 2010-04-01T00:30:00 ... 2010...
    depth          (y, x) float32 ...
    lat            (y, x) float32 ...
    lon            (y, x) float32 ...
    e1t            (y, x) float64 ...
    e2t            (y, x) float64 ...
    e1f            (y, x) float64 ...
    e2f            (y, x) float64 ...
    e1u            (y, x) float64 ...
    e2u            (y, x) float64 ...
    e1v            (y, x) float64 ...
    e2v            (y, x) float64 ...
Dimensions without coordinates: x, y
Data variables:
    sossheig       (time_counter, y, x) float32 ...
    sozocrtx       (time_counter, y, x) float32 ...
    somecrty       (time_counter, y, x) float32 ...
    sosstsst       (time_counter, y, x) float32 ...
    sosaline       (time_counter, y, x) float32 ...
    sozotaux       (time_counter, y, x) float32 ...
    sometauy       (time_counter, y, x) float32 ...
    qt_oce         (time_counter, y, x) float32 ...
    sowaflup       (time_counter, y, x) float32 ...
    tmask          (y, x) int8 ...
    umask          (y, x) int8 ...
    vmask          (y, x) int8 ...
    fmask          (y, x) int8 ...

@roxyboy
Author

roxyboy commented Mar 19, 2021

Success!

Sorry, I missed this. This is great news! Could you let us know what the status is regarding the data storage on OSN, @rabernat?

@rabernat
Contributor

The status is that I'm still working on it. I hope to be able to start ingesting data soon (next week). I'm deeply sorry for the delays and I thank you for your patience.

@lesommer

thanks for all your work with this @rabernat !

@roxyboy
Author

roxyboy commented Mar 22, 2021

I started a PR #24 for the recipe.

@roxyboy
Author

roxyboy commented May 7, 2021

@rabernat Could we prioritize pushing the surface data to the cloud for all available models (in #26, #27, #29) before the interior 3D data? Since we have a few different models ready to push, I think there are already a few inter-model analyses that could be done with just the surface data :)

@rabernat rabernat added the swot-adac SWOT Adopt-a-Crossover Dataset label May 12, 2021
@roxyboy
Author

roxyboy commented Jun 1, 2021

@rabernat @cisaacstern I've started analyzing the SWOT-AdAC data (#24 #26 #29 ) on a Google Cloud based Jupyterhub but does the OSN storage also support storing of analysis data?

@rabernat
Contributor

rabernat commented Jun 1, 2021

does the OSN storage also support storing of analysis data?

No, we cannot provide write access to OSN.

Can you explain more about the use case you have in mind? How much data do you imagine needing to write? Does it need to be shared across users?

For writing data, you have a few options:

  • Store data in your jupyter home directory (suitable for smallish data; not accessible from dask workers)
  • Ask 2i2c to set up a shared NFS storage volume that is accessible from the dask workers (suitable for medium data)
  • Ask 2i2c to set up a pangeo-style scratch bucket in Google Cloud Storage (suitable for big data)

@cisaacstern
Member

cisaacstern commented Jun 5, 2021

I believe all of the surface datasets are now on OSN. Returning to this main thread to provide a high-level "flyover" of how it's organized. Note that below, fs_osn and swot are always defined as:

import s3fs
endpoint_url = 'https://ncsa.osn.xsede.org'
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': endpoint_url},)
swot = "Pangeo/pangeo-forge/swot_adac"

💺 Fasten your seatbelt, this will be a long one!

INALT60 #26

fs_osn.ls(f"{swot}/INALT60")
['Pangeo/pangeo-forge/swot_adac/INALT60/grid.zarr',
 'Pangeo/pangeo-forge/swot_adac/INALT60/surf_flux_1d.zarr',
 'Pangeo/pangeo-forge/swot_adac/INALT60/surf_ocean_4h.zarr',
 'Pangeo/pangeo-forge/swot_adac/INALT60/surf_ocean_5d.zarr']

We currently have a single zarr store for each surface dataset. The time dimension of these data is non-contiguous, as seen in the recipe here. If it's useful, I can separate each of these surface datasets into separate seasonal stores, as demonstrated in the other recipes below.

GIGATL #27

fs_osn.ls(f"{swot}/GIGATL")
['Pangeo/pangeo-forge/swot_adac/GIGATL/Region01',
 'Pangeo/pangeo-forge/swot_adac/GIGATL/Region02',
 'Pangeo/pangeo-forge/swot_adac/GIGATL/surf_reg_01.zarr']

@roxyboy, unless you need it for something, I will delete surf_reg_01.zarr which is missing the input for Jan 28 as you identified in #27 (comment).

For each region's surface data, there are both aso (Aug, Sep, Oct) and fma (Feb, Mar, Apr) stores:


fs_osn.ls(f"{swot}/GIGATL/Region01/surf")
['Pangeo/pangeo-forge/swot_adac/GIGATL/Region01/surf/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/GIGATL/Region01/surf/fma.zarr']
fs_osn.ls(f"{swot}/GIGATL/Region02/surf")
['Pangeo/pangeo-forge/swot_adac/GIGATL/Region02/surf/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/GIGATL/Region02/surf/fma.zarr']

The fma stores should both contain the previously missing Jan 28 data. (h/t @rabernat for showing me how to amend and reuse the existing cache.)

HYCOM50 #29

fs_osn.ls(f"{swot}/HYCOM50")
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region01_GS',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region02_GE',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region03_MD',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/grid_01.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/grid_02.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/grid_03.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/surf_01.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/surf_02.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/surf_03.zarr']

For each region defined in the recipe, there are both aso and fma stores:

fs_osn.ls(f"{swot}/HYCOM50/Region01_GS/surf")
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region01_GS/surf/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region01_GS/surf/fma.zarr']
fs_osn.ls(f"{swot}/HYCOM50/Region02_GE/surf")
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region02_GE/surf/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region02_GE/surf/fma.zarr']
fs_osn.ls(f"{swot}/HYCOM50/Region03_MD/surf")
['Pangeo/pangeo-forge/swot_adac/HYCOM50/Region03_MD/surf/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/HYCOM50/Region03_MD/surf/fma.zarr']

@roxyboy, `surf_01.zarr`, `surf_02.zarr`, and `surf_03.zarr` are the earlier drafts where non-contiguous data is concatenated together. Do you have any use for them now that the seasonal stores are up? If not, I'll delete.

eNATL60 #24

fs_osn.ls(f"{swot}/eNATL60")
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region01',
 'Pangeo/pangeo-forge/swot_adac/eNATL60/Region02',
 'Pangeo/pangeo-forge/swot_adac/eNATL60/Region03']

For each of the regions, aso and fma stores are provided for the surface_hourly data:

fs_osn.ls(f"{swot}/eNATL60/Region01/surface_hourly")
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region01/surface_hourly/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/eNATL60/Region01/surface_hourly/fma.zarr']
fs_osn.ls(f"{swot}/eNATL60/Region02/surface_hourly")
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region02/surface_hourly/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/eNATL60/Region02/surface_hourly/fma.zarr']
fs_osn.ls(f"{swot}/eNATL60/Region03/surface_hourly")
['Pangeo/pangeo-forge/swot_adac/eNATL60/Region03/surface_hourly/aso.zarr',
 'Pangeo/pangeo-forge/swot_adac/eNATL60/Region03/surface_hourly/fma.zarr']

Next steps

@roxyboy, please let me know if you run into any issues with any of the above. Also, what should we work on next? Adding the interior data?

@roxyboy
Author

roxyboy commented Jun 7, 2021

This is great! Thanks @cisaacstern .

INALT60 #26

We currently have a single zarr store for each surface dataset. The time dimension for these data is non-contiguous as seen in the recipe here. If it's useful, I can separate each of these surface datasets into separate seasonal stores, as demonstrated in the other recipes below.

The time metadata for INALT60 is in the Gregorian calendar, so I think it's fine to keep it as it currently is, because it's much easier to parse out the seasons.

GIGATL #27

@roxyboy, unless you need it for something, I will delete surf_reg_01.zarr which is missing the input for Jan 28 as you identified in #27 (comment).

Yes, please delete surf_reg_01.

HYCOM50 #29

@roxyboy, surf_01.zarr, surf_02.zarr, and surf_03.zarr are the earlier drafts where non-contiguous data is concatenated together. Do you have any use for them now that the seasonal stores are up? If not, I'll delete.

Please feel free to delete surf_01.zarr, surf_02.zarr, and surf_03.zarr.

@roxyboy, please let me know if you run into any issues with any of the above. Also, what should we work on next? Adding the interior data?

Yes, fluxing the interior data to the cloud would be greatly appreciated :)

@roxyboy
Author

roxyboy commented Jun 16, 2021

@cisaacstern Model outputs from FESOM are in the making, but can the recipes handle netCDF4 files compressed with tar?

@rabernat
Contributor

but can the recipes handle netcdf4 files compressed with tar?

It would be ideal if we could avoid tarring inputs. But if this is unavoidable, we will find a way to deal with it.
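Untarring upstream is what was eventually agreed below, but if the files did arrive tarred, a pre-processing step with the standard library would suffice. A self-contained sketch (the archive and member names here are hypothetical):

```python
import tarfile
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
archive = workdir / "outputs.tar"

# Create a stand-in archive containing one fake netCDF member.
member = workdir / "Region01-surface-hourly_2010-04.nc"
member.write_bytes(b"placeholder bytes")
with tarfile.open(archive, "w") as tar:
    tar.add(member, arcname=member.name)

# Pre-processing step: extract only the .nc members so a recipe can
# reference plain netCDF files.
extracted = workdir / "extracted"
with tarfile.open(archive) as tar:
    nc_members = [m for m in tar.getmembers() if m.name.endswith(".nc")]
    tar.extractall(extracted, members=nc_members)
```

Doing this inside Pangeo Forge itself is the part that would add complexity, since each input would become a one-to-many mapping from archive to files.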

@rabernat
Contributor

@roxyboy - Today @cisaacstern and I met to discuss this. It will introduce significant complexity in Pangeo Forge to handle the tarred files. We are not sure this effort is worth it since there is an easy workaround--can we just ask the data provider to un-tar the files before putting them online? That is a reasonable request, no?

@roxyboy
Author

roxyboy commented Jun 17, 2021

@roxyboy - Today @cisaacstern and I met to discuss this. It will introduce significant complexity in Pangeo Forge to handle the tarred files. We are not sure this effort is worth it since there is an easy workaround--can we just ask the data provider to un-tar the files before putting them online? That is a reasonable request, no?

Yes, I've asked them to untar it. Will be making a PR for FESOM soon.

@roxyboy
Author

roxyboy commented Jun 30, 2021

We've (@lesommer and I) decided that hosting regional extracts from LLC4320 on OSN is probably better than pulling the data from the ECCO portal for analysis. Dimitris asked if he could directly push the data himself from where LLC4320 sits after the extraction; is this possible? Otherwise, I can ask him to (temporarily) put it on an FTP server.

@cisaacstern
Copy link
Member

@roxyboy, as far as I'm aware, we're not able to provide write access to OSN. If you point me to the files on a temporary ftp server, however, I can write them to the swot_adac bucket for you. Will these files be netCDFs? How many of them are there and what is their total size?

@cisaacstern
Member

cisaacstern commented Jul 1, 2021

Brief progress report below. Simulations with no emojis mean that we haven't started a recipe yet.

[Status table with columns Name / Recipe / Surface / Interior, covering: eNATL60, MEDWEST60, Mediterranean, GIGATL, HYCOM50, llc4320, lNALT60, FESOM, SM-telescope]

Edit (July 2): As mentioned in #29 (comment), updated table to reflect that HYCOM50 int data is online.

Edit (July 3): Updated table to reflect that GIGATL int data is online; xref #27 (comment).

Edit (July 19): FESOM surface data added to project catalog: pangeo-data/swot_adac_ogcms#2

Edit (July 20): eNATL60 surface data added to catalog: pangeo-data/swot_adac_ogcms#3

@roxyboy
Author

roxyboy commented Sep 10, 2021

@rabernat @cisaacstern What do the Pangeo folks think about hosting on OSN the global 1/25° HYCOM surface data of u, v, and SSH developed by Brian Arbic's group? The storage will likely be on the order of 8 TB. The idea is that it'll benefit the SWOT-AdAC community by having global access to both LLC4320 and HYCOM25. As an example, we/I can work on hosting a Jupyter notebook showing the transition scales on the Pangeo gallery.
@lesommer can fill in the details of discussion he had with Brian if necessary.

@rabernat
Contributor

YES to global HYCOM on OSN.

@cisaacstern
Member

Standing by to assist with the recipe once @roxyboy and/or @lesommer points us to the source files.

This will be a good test for our new Google Cloud Bakery once it comes online.
