Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Earth System Data Cube (ESDC) cmorizer #2799

Merged
merged 59 commits into from
Feb 2, 2023
Merged

Conversation

bsolino
Copy link
Contributor

@bsolino bsolino commented Sep 16, 2022

Description


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.

New or updated data reformatting script


To help with the number of pull requests:

@bsolino bsolino added this to the v2.7.0 milestone Sep 16, 2022
@bsolino bsolino self-assigned this Sep 16, 2022
@bsolino
Copy link
Contributor Author

bsolino commented Sep 27, 2022

The implementation is ready and can be reviewed already. There are a few details missing that don't affect the code:
- I classified the dataset as Tier 2, as it is publicly available without login. However, it includes data from Tier 3 datasets, like ERA5.
- I am unsure about the appropriate name of the dataset, I will consult with the providers in case they prefer a different name

Copy link
Contributor

@remi-kazeroni remi-kazeroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting this work @bsolino! I did some testing of the CMORizer without checking the code in depth as you wanted some early feedback on this PR. The CMORizer seems to work fine, including reading data from the cloud store. I think the handling of metadata is not fully compete, see suggestion. Also, you may want to check with the provider which original metadata to keep.

The main issue I see is that this cmorizer only starts looking for cloud data if a ~/RAWOBS/Tier2/ESDC directory already exists (even an empty directory is enough). Otherwise, the Tool crashes because no data are found. I'm not sure this is something to address in this cmorizer script or more broadly in cmorizer.py.

@bsolino
Copy link
Contributor Author

bsolino commented Oct 4, 2022

The main issue I see is that this cmorizer only starts looking for cloud data if a ~/RAWOBS/Tier2/ESDC directory already exists (even an empty directory is enough). Otherwise, the Tool crashes because no data are found. I'm not sure this is something to address in this cmorizer script or more broadly in cmorizer.py.

Good catch! I've done some testing, but there is nothing I can do from within the cmorizer. The detection happens in the function def _get_dataset_tier(self, dataset) from cmorizer.py. The only solution I can think to that issue is to have an entirely different detection system, perhaps from the documentation data (datasets.yml or cmor_config/DATASET.yml). However, it seems that will require a bigger discussion, as I don't know why the tool takes the tier from the folder and not the folder from the tier.

Copy link
Contributor

@remi-kazeroni remi-kazeroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes @bsolino! Everything looks good to me and works as expected 👍

@remi-kazeroni
Copy link
Contributor

@valeriupredoi, could you please take a look at this PR whenever you have time for that? And maybe advise @bsolino on how the best way to add dependencies to the environment*.yml and setup.py files? Cheers 🍻

Copy link
Contributor

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a couple optional wiggles from me, looking good, good work, guys! I'll have a look at the environment files now 👍

@valeriupredoi
Copy link
Contributor

setup.py looks fine apart from that aiohttp should be at the top of the list, but for the two environment yml files I'd put zarr and aiohttp in the main list of dependencies (since they are actively used by the main code), so zarr would be at the bottom of the dependencies list and aiohttp at the top after pip and python. Where is it used (aiohttp) anyway, I've not seen it imported anywhere? If it's not, please add it only when it will be, last thing we want is yet another dependency 😁

Copy link
Contributor

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just temp holding this for the unused aiohttp dependency

Co-authored-by: Valeriu Predoi <[email protected]>
@bsolino
Copy link
Contributor Author

bsolino commented Feb 2, 2023

Where is it used (aiohttp) anyway, I've not seen it imported anywhere? If it's not, please add it only when it will be, last thing we want is yet another dependency 😁

It is not used directly by us (neither is zarr, for that matter 😉). It is a necessary module to open http filesystems, without it the program terminates when trying to open the zarr datacube in the cloud.

ImportError: HTTPFileSystem requires "requests" and "aiohttp" to be installed

@valeriupredoi
Copy link
Contributor

Where is it used (aiohttp) anyway, I've not seen it imported anywhere? If it's not, please add it only when it will be, last thing we want is yet another dependency grin

It is not used directly by us (neither is zarr, for that matter wink). It is a necessary module to open http filesystems, without it the program terminates when trying to open the zarr datacube in the cloud.

ImportError: HTTPFileSystem requires "requests" and "aiohttp" to be installed

aha gotcha! zarr is used via the xarray engine, and that too should be a dependency of xarray only it isn't (had this convo with the kerchunk devs that want to do deps like Xarray's - optional depending what engine/module one needs - never agreed with this but it's a thing). Cheers for explaining it, mate 👍

Copy link
Contributor

@valeriupredoi valeriupredoi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all nice and green and am happy too, awesome work @bsolino 🍺 Cheers to @remi-kazeroni for testing as well!

@valeriupredoi valeriupredoi merged commit 9eb3f67 into main Feb 2, 2023
@valeriupredoi valeriupredoi deleted the esdc-cmorizer-pilot branch February 2, 2023 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Earth System Data Cube (ESDC) cmorizer
3 participants