Add tutorial on intake catalog #298

Merged: 11 commits from intake-example into COSIMA:main on Nov 30, 2023

Conversation

anton-seaice
Collaborator

Closes #295

Add a tutorial on moving from cosima_cookbook to the intake catalog.


@navidcy
Collaborator

navidcy commented Oct 14, 2023

@anton-seaice can you resolve the conflicts?

@anton-seaice
Collaborator Author

@anton-seaice can you resolve the conflicts?

Ok done

@dougiesquire
Collaborator

dougiesquire commented Oct 17, 2023

Thanks @anton-seaice! I haven't looked at this in detail yet (sorry), but my first thought is that this could possibly be more focussed on how to do the same things with the cookbook/catalogue and less focussed on plotting ice data. For example, sections on:

  • Loading the cookbook/catalogue
  • Finding model runs
  • Loading variables from a COSIMA run
    • include tips on passing kwargs and chunking
  • Loading data directly as a datatree (possibly?)

These sections could emphasise the differences (+pros/cons) between the two approaches. Some of this info is already in your notebook, but adapting the notebook to make this info the focus might make it easier to find/ingest/remember.

@anton-seaice
Collaborator Author

Thanks @anton-seaice! I haven't looked at this in detail yet (sorry), but my first thought is that this could possibly be more focussed on how to do the same things with the cookbook/catalogue and less focussed on plotting ice data. For example, sections on:

* Loading the cookbook/catalogue

* Finding model runs

* Loading variables from a COSIMA run
  
  * include tips on passing kwargs and chunking

* Loading data directly as a datatree (possibly?)

These sections could emphasise the differences (+pros/cons) between the two approaches. Some of this info is already in your notebook, but adapting the notebook to make this info the focus might make it easier to find/ingest/remember.

I was thinking we needed the sections on loading some real data to show why some of the steps are needed. For example, you only discover the need for extra filtering of search results when you try to load a field without the time dimension; similarly, you can't subset by a date range correctly before you have subtracted the 'timedelta'.

I could end the notebook at the line 'There are no more changes to the notebook', and use your 4 headings to break up the notebook into clearer sections?

@dougiesquire
Collaborator

'There are no more changes to the notebook'

So is this an adaptation of another notebook? Is that made clear somewhere? We possibly want to avoid having two notebooks that are related, as they then have to be maintained together to remain useful.

similarly, you can't subset by a date range correctly before you have subtracted the 'timedelta'

This is a quirk of the ice data, right? It's not related to the use of the catalog.

Yes, I agree that we want to demonstrate loading data etc in the notebook, but we don't have to do anything with those data. Instead, you could demonstrate loading some data using both methods and show that they give the same thing. I guess the key message is that:

da = cosima_cookbook.querying.getvar(expt="expt0", variable="var0", session=session, frequency="1 monthly")

translates to

ds = catalog["expt0"].search(variable="var0", frequency="1mon").to_dask()

and then differences (e.g. fixed frequency data, time filtering, queries that return files spanning multiple datasets) could be demonstrated from there.
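
For the time-filtering difference, a sketch along these lines might be enough (same placeholder experiment/variable names as above; the getvar start_time/end_time arguments and the date range are just for illustration):

# cookbook: the time range is part of the query itself
da = cosima_cookbook.querying.getvar(
    expt="expt0", variable="var0", session=session,
    frequency="1 monthly", start_time="1950-01-01", end_time="1959-12-31",
)

# catalog: open lazily, then subset the time axis with xarray
ds = catalog["expt0"].search(variable="var0", frequency="1mon").to_dask()
da = ds["var0"].sel(time=slice("1950-01-01", "1959-12-31"))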

@anton-seaice
Collaborator Author

I'll update to focus on those points :)

@anton-seaice
Collaborator Author

anton-seaice commented Oct 19, 2023

@dougiesquire Back to you. I'm not very good with dask, so I'm not sure the chunking section is very good; your input would be useful :)

I realised the dask graph size and chunk memory size don't show in the GitHub view, so maybe we need a different way of demonstrating the chunk size?
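
One option (just a sketch, with a placeholder variable name, and assuming the variable is dask-backed) would be to print the chunk information explicitly rather than relying on the HTML repr:

var = ds["var0"]                                   # placeholder variable name
print("chunk shape:", var.data.chunksize)          # shape of a single chunk
print("number of chunks:", var.data.npartitions)
print("MB per chunk (average):", var.data.nbytes / var.data.npartitions / 1e6)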

@dougiesquire
Collaborator

@anton-seaice, do you mind if I make some changes directly to your notebook (if I have permission)? There are a few things that could be changed/added, and doing this via ReviewNB is always a bit painful.

@anton-seaice
Collaborator Author

@anton-seaice, do you mind if I make some changes directly to your notebook (if I have permission)? There are a few things that could be changed/added, and doing this via ReviewNB is always a bit painful.

No problem. Please do that. You should be able to pull straight from the PR (gh pr checkout 298).

@dougiesquire
Collaborator

@anton-seaice sorry it took me so long to look at this properly. I've changed/added a bunch of things. Let me know what you think.

@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:12Z
----------------------------------------------------------------

These warnings take up the whole page, so maybe run this command with

%%capture --no-stderr

or similar, and reference COSIMA/cosima-cookbook#333
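
i.e. something like this at the top of whichever cell produces the warnings (assuming here it's the session-creation cell; which flags are right depends on whether the messages go to stdout or stderr):

%%capture --no-stderr
# %%capture must be the first line of the cell; with --no-stderr it swallows
# stdout/display output from this cell while still letting stderr through
session = cosima_cookbook.database.create_session()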


@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:13Z
----------------------------------------------------------------

I think the catalog should go first and the cookbook second under the heading "opening data"


dougiesquire commented on 2023-10-24T22:49:53Z
----------------------------------------------------------------

I thought it could be best to first show people what they're familiar with, then demonstrate that you do exactly the same thing with the catalog. But I'll leave it up to you. Happy with whatever, so long as it's clear

@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:14Z
----------------------------------------------------------------

"The catalog loads multiple frequencies of the same variable"


dougiesquire commented on 2023-10-24T22:51:58Z
----------------------------------------------------------------

But it's not just different frequencies. You might have data on different grids, for example, that the catalog could load into distinct datasets with a single call, but the cookbook couldn't.

@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:15Z
----------------------------------------------------------------

I think this bit adds confusion:

"Additionally, there are cases where the cookbook will silently load and concatenate data on different grids, because it does not know which files in the database should/shouldn't be concatenated together."

Maybe put it at the end of this sub-section?

Then the rest could be:

"The catalog knows which files are at different frequencies and provides methods to open multiple datasets from one query. For example, trying to open a variable available at multiple frequencies without specifying a frequency gives an error, but we can open them using to_dataset_dict() rather than to_dask()."


dougiesquire commented on 2023-10-24T23:03:36Z
----------------------------------------------------------------

The point is that the cookbook will effectively just find all the files in the database that match the query and try to concatenate them. The catalog, however, knows which files should be concatenated together and which shouldn't. As a hypothetical example, imagine your query returns 6 files: 2 monthly on grid1, 2 daily on grid1 and 2 monthly on grid2. The cookbook will try to concatenate these into a single dataset (actually it will probably throw an error first). The catalog will return 3 datasets.

So the difference is more general than just frequencies. Feel free to remove/edit the paragraph you found confusing, but I don't think we should make this section just about frequencies.
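
Something like this could illustrate it (placeholder names again; the exact grouping keys depend on how the catalog is built):

# a query matching more than one dataset raises an error with to_dask(), but
# to_dataset_dict() returns one xarray Dataset per compatible group of files
dset_dict = catalog["expt0"].search(variable="var0").to_dataset_dict()
for key, ds in dset_dict.items():
    print(key, dict(ds.sizes))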

@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:16Z
----------------------------------------------------------------

We'll need to add a cell tag to ignore this cell in Jenkins.


dougiesquire commented on 2023-10-24T23:03:52Z
----------------------------------------------------------------

I think I've done that already

@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:17Z
----------------------------------------------------------------

"Alternatively, multiple datasets can be opened directly into an xarray-datatree by ..."

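e.g. something along these lines (a sketch; whether to_datatree() is available depends on the intake-esm version installed):

# open every dataset matching the query as a single xarray-datatree
dt = catalog["expt0"].search(variable="var0").to_datatree()

# or, if to_datatree() isn't available, build one from the dataset dict
from datatree import DataTree
dt = DataTree.from_dict(catalog["expt0"].search(variable="var0").to_dataset_dict())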

@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:18Z
----------------------------------------------------------------

mask the warnings or make this a raw cell


@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:19Z
----------------------------------------------------------------

Move this part into a new comment below the code cell?

This is mostly because the opening of datasets is a trivially parallelized task that is done lazily anyway so opening all files and reducing the times using xarray's sel methods shouldn't add too much overhead. In most cases where the overhead of opening the files seems large, this can be reduced through sensible choices of keyword arguments provided to open_dataset and combine_by_coords - see the xarray documentation on Reading multi-file datasets for details.

This note can be made clearer - I'll work on it.
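
For reference, the kind of keyword arguments meant here, assuming a recent intake-esm where to_dask() accepts xarray_open_kwargs / xarray_combine_by_coords_kwargs (the values below are purely illustrative):

ds = catalog["expt0"].search(variable="var0", frequency="1mon").to_dask(
    xarray_open_kwargs={"use_cftime": True},                # passed to xr.open_dataset
    xarray_combine_by_coords_kwargs={"coords": "minimal",   # passed to xr.combine_by_coords
                                     "compat": "override"},
)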


@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:20Z
----------------------------------------------------------------

Possibly this section should go before "Some important differences", because these are things everyone needs to know? The differences section is more intuitive?


dougiesquire commented on 2023-10-24T23:08:46Z
----------------------------------------------------------------

I'm not sure I agree. Maybe everything could be merged into a single section? Or they each become their own sections?

@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:21Z
----------------------------------------------------------------

I think this cell should be expanded, similar to the "Passing keyword arguments is different" section.


@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:22Z
----------------------------------------------------------------

Correctly choosing chunk sizes when you open datasets will greatly improve the speed of your analysis. Check out the ...
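
Maybe with a short example like this (dimension names and chunk sizes are placeholders, not a recommendation):

# choose chunks when opening, rather than rechunking afterwards
ds = catalog["expt0"].search(variable="var0", frequency="1mon").to_dask(
    xarray_open_kwargs={"chunks": {"time": 12, "yt_ocean": 300, "xt_ocean": 360}},
)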


@review-notebook-app

review-notebook-app bot commented Oct 24, 2023

anton-seaice commented on 2023-10-24T22:40:23Z
----------------------------------------------------------------

I'm not sure this heading is clear?

"Loading time-invariant variables"

"Loading model information"

Are there variables other than grid information?

"Loading grid information"


dougiesquire commented on 2023-10-24T23:09:40Z
----------------------------------------------------------------

"Loading time-invariant variables" is good

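A small example could go under that heading too, e.g. (the variable name and the "fx" frequency label are assumptions about how the catalog is populated):

# time-invariant (fixed-frequency) fields such as grid information
area_t = catalog["expt0"].search(variable="area_t", frequency="fx").to_dask()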

@anton-seaice
Collaborator Author

Dougie and I have finished looking at this; if someone else is able to review, that would be great. @navidcy?

@anton-seaice
Collaborator Author

@micaeljtoliveira Do you mind giving this a quick review? Dougie and I are happy with it.

@max-anu self-requested a review November 30, 2023 01:07
Contributor

@max-anu left a comment


I've read through the notebook - it's good and I don't think anything else needs to be added. I've also checked that it runs correctly.

@max-anu merged commit f7a4ee5 into COSIMA:main Nov 30, 2023
@anton-seaice deleted the intake-example branch January 8, 2024 03:35