-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tutorial on intake catalog #298
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
@anton-seaice can you resolve the conflicts? |
Ok done |
Thanks @anton-seaice! I haven't looked at this in detail yet (sorry), but my first thought is that this could possibly be more focussed on how to do the same things with the cookbook/catalogue and less focussed on plotting ice data. For example, sections on:
These sections could emphasise the differences (+pros/cons) between the two approaches. Some of this info is already in your notebook, but adapting the notebook to make this info the focus might make it easier to find/ingest/remember. |
I was thinking we needed the sections on trying to load some real data to show why we needed to do some steps. i.e. you only discover the need for extra filtering of search results if you are trying to load a field without the time dimension, simlarly you can't subset by a date range correctly before you have subtracted the 'timedelta'. I could end the notebook at the line 'There are no more changes to the notebook', and use your 4 headings to breakup the notebook into clearer sections? |
So is this an adaption of another notebook? Is that made clear somewhere? We possibly want to avoid having two notebooks that are related, as they then have to be maintained together to remain useful.
This is a quirk of the ice data right? Not related to the use of the catalog. Yes, I agree that we want to demonstrate loading data etc in the notebook, but we don't have to do anything with those data. Instead, you could demonstrate loading some data using both methods and show that they give the same thing. I guess the key message is that: da = cosima_cookbook.querying.getvar(expt="expt0", variable="var0", session=session, frequency="1 monthly") translates to ds = catalog["expt0"].search(variable="var0", frequency="1mon").to_dask() and then differences (e.g. fixed frequency data, time filtering, queries that return files spanning multiple datasets) could be demonstrated from there. |
Ill update to focus on those points :) |
@dougiesquire Back to you. I'm not very good with dask - so I am not sure the chunking section is very good, your input would be useful :) I realised it doesn't show the dask graph size or chunk memory size in the github view, so maybe we need a different way of demonstrating the chunk size? |
@anton-seaice, do you mind if I make some changes directly to your notebook (if I have permission)? There are a few things that could be changed/added and doing this via reviewnb is always a bit painful |
No problem. Please do that. You should be able to pull straight from the pr ( |
@anton-seaice sorry it took me so long to look at this properly. I've changed/added a bunch of things. Let me know what you think. |
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:12Z These warnings take up the whole page, so maybe run this command with
%%capture --no-stderr
or similar, and reference COSIMA/cosima-cookbook#333 |
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:13Z I think the catalog should go first and the cookbook second under the heading "opening data" dougiesquire commented on 2023-10-24T22:49:53Z I thought it could be best to first show people what they're familiar with, then demonstrate that you do exactly the same thing with the catalog. But I'll leave it up to you. Happy with whatever, so long as it's clear |
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:14Z "The catalog loads multiple frequencies of the same variable" dougiesquire commented on 2023-10-24T22:51:58Z But it's not just different frequencies. You might have data on different grids for example that the catalog could load into distinct datasets with a single call, but the cookbook couldn't |
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:15Z I think this bit adds confusion:
" Additionally, there are cases where the cookbook will silently load and concatenate data on different grids, because it does not know which files in the database should/shouldn't be concatenated together. "
Maybe put at the end of this sub section?
Then the rest could be:
"The catalog knows which files are at different frequencies and provide methods to open multiple datasets from one query. For example, trying to open a variable available at multiple frequencies without specifying a frequency, gives an error, but we can open them using dougiesquire commented on 2023-10-24T23:03:36Z The point is that the cookbook will effectively just find all the files in the database that match the query and try to concatenate them. The catalog however knows which files should be concatenated together and which shouldn't. As a hypothetical example, imagine your query returns 6 files: 2 monthly on grid1, 2 daily on grid1 and 2 monthly on grid2. The cookbook will try to concatenate these into a single dataset (actually it will probably thrown an error first). The catalog will return 3 datasets.
So the difference is more general than just frequencies. Feel free to remove/edit the paragraph you found confusing, but I don't think we should make this section just about frequencies. |
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:16Z Well need to add a cell tag to ignore this cell in jenkins dougiesquire commented on 2023-10-24T23:03:52Z I think I've done that already |
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:17Z "Alternatively, multiple datasets can be opened directly into an xarray-datatree by ..."
|
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:18Z mask the warnings or make this a raw cell |
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:19Z Move this part into a new comment below the code cell?
This is mostly because the opening of datasets is a trivially parallelized task that is done lazily anyway so opening all files and reducing the times using xarray's
This note can be made clearer - ill work on it |
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:20Z Possibly this section should go before "Some important differences" because these are things everyone needs to know? The differences section are more intuitive? dougiesquire commented on 2023-10-24T23:08:46Z I'm not sure I agree. Maybe everything could be merged into a single section? Or they each become their own sections? |
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:21Z I think expand this cell similar to
"Passing keyword arguments is different" section
|
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:22Z Correctly choosing chunk sizes when you open datasets will greatly improve the speed of your analysis. Check out the ... |
View / edit / reply to this conversation on ReviewNB anton-seaice commented on 2023-10-24T22:40:23Z I'm not sure this heading is clear?
"Loading time-invariant variables"
"Loading model information"
Are there variables other than grid information?
"Loading grid information" dougiesquire commented on 2023-10-24T23:09:40Z "Loading time-invariant variables" is good |
I thought it could be best to first show people what they're familiar with, then demonstrate that you do exactly the same thing with the catalog. But I'll leave it up to you. Happy with whatever, so long as it's clear View entire conversation on ReviewNB |
But it's not just different frequencies. You might have data on different grids for example that the catalog could load into distinct datasets with a single call, but the cookbook couldn't View entire conversation on ReviewNB |
The point is that the cookbook will effectively just find all the files in the database that match the query and try to concatenate them. The catalog however knows which files should be concatenated together and which shouldn't. As a hypothetical example, imagine your query returns 6 files: 2 monthly on grid1, 2 daily on grid1 and 2 monthly on grid2. The cookbook will try to concatenate these into a single dataset (actually it will probably thrown an error first). The catalog will return 3 datasets.
So the difference is more general than just frequencies. Feel free to remove/edit the paragraph you found confusing, but I don't think we should make this section just about frequencies. View entire conversation on ReviewNB |
I think I've done that already View entire conversation on ReviewNB |
I'm not sure I agree. Maybe everything could be merged into a single section? Or they each become their own sections? View entire conversation on ReviewNB |
"Loading time-invariant variables" is good View entire conversation on ReviewNB |
Dougie and I have finished looking at this, if someone else is able to review that would be great. @navidcy ? |
This reverts commit 6771596.
@micaeljtoliveira Do you mind giving this a quick review? Dougie and I are happy with it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've read through the notebook - It's good and I don't think anything else needs to be added. I've also checked that it runs correctly.
Closes #295
Add a tutorial on moving from cosima_cookbook to the intake catalog.