
History fields: Fix inconsistent gridcell averages and provide metadata for correct regional / global averages #1297

Open
billsacks opened this issue Mar 10, 2021 · 8 comments
Labels: bug (something is working incorrectly), science (enhancement to or bug impacting science), size: large (large project that will take a few weeks or more)

Comments

billsacks (Member) commented Mar 10, 2021

Brief summary of bug

I am folding two closely related issues into this issue: (1) We are inconsistent in how we do gridcell averaging of history fields, and this is likely done incorrectly for at least some history fields; (2) We don't provide metadata on history files that would be needed for correct averaging up to regional or global averages.

@danicalombardozzi and I plan to lead a discussion on this at an upcoming CLM-science meeting (so no need to comment in this issue, unless you especially want to: we can discuss it by zoom, which may be more efficient).

General bug information

CTSM version you are using: N/A

Does this bug cause significantly incorrect results in the model's science? Probably, for some diagnostic fields

Configurations affected: Not yet investigated, but probably all, to some extent

Details of bug

Inconsistent (and probably in some cases incorrect) gridcell averages

Consider a grid cell that is 1/2 natural veg and 1/2 lake, and a diagnostic field that is only calculated over vegetated columns. If the value of this variable is 10 in the vegetated column, what should its grid cell average be: 10? 5?

The answer to this depends on the variable. The first consideration is often: is this (a) a variable expressed on a per-m^2 basis, or (b) a variable not expressed per m^2 (e.g., temperature, a stress coefficient, etc.)? For (b), we generally want to exclude lakes from the average, so the grid cell average in this example would be 10. For (a), it is sometimes or often appropriate to assume a value of 0 over some landunits, so in this example we may end up with a grid cell average of 5 (because the value now applies per m^2 of grid cell area), but this is not always the case.
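To make the two conventions concrete, here is a minimal sketch of this hypothetical example (Python purely for illustration; the numbers are made up):

```python
# Hypothetical grid cell: 50% natural veg, 50% lake; field defined only over
# the vegetated column, with value 10 there.
frac_veg, frac_lake = 0.5, 0.5
value_veg = 10.0

# (b) Not per-m^2 (e.g., temperature): exclude lake, average only over the
#     area where the field is defined.
avg_b = (value_veg * frac_veg) / frac_veg          # -> 10.0

# (a) Per-m^2 quantity where a value of 0 over lake is appropriate: the
#     vegetated value is diluted over the full grid cell area.
avg_a = value_veg * frac_veg + 0.0 * frac_lake     # -> 5.0

print(avg_b, avg_a)
```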

This is implemented in different ways for different variables. I can think of four methods in the code:

  1. l2g_scale_type: this is an optional argument specified for some history fields in the hist_addfld call. For example, setting l2g_scale_type='veg' means to only take averages over vegetated landunits (natural veg & crop); see the conceptual sketch after this list.
  2. set_*=spval and set_*=0 in the hist_addfld call: this method predates the implementation of l2g_scale_type. set_*=spval accomplishes something like l2g_scale_type, whereas set_*=0 accomplishes something like initializing the value to 0 for certain landunits in initCold. (See also #431: "History fields incorrect when set_xxx=0 but the xxx landunit is also set in initCold".)
  3. All (or almost all) history fields are initialized to spval everywhere, overwriting the original NaN initialization. This means that, if no other mechanism is used and the field isn't set over a given landunit, that landunit will end up being excluded from any averaging.
  4. Some fields are explicitly set to 0 in cold start initialization for some landunits, or maybe even every time through the run loop. It's possible that some fields are set to spval in other places as well, though I can't remember ever seeing this.
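As a rough illustration of what an l2g_scale_type-style restriction does when averaging from landunits up to the grid cell, here is a conceptual sketch (Python with made-up names; the real implementation is the Fortran subgrid-averaging code, and NaN stands in for spval):

```python
import numpy as np

def landunit_to_gridcell(values, weights, include):
    """Average per-landunit values up to one grid cell, restricted to the
    landunits selected by a scale type (e.g. 'veg'); NaN plays the role of
    spval for landunits where the field is not set."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    use = np.asarray(include, dtype=bool) & ~np.isnan(values)
    wsum = weights[use].sum()
    if wsum == 0.0:
        return np.nan  # field undefined over this grid cell
    return (values[use] * weights[use]).sum() / wsum

# 50% veg / 50% lake grid cell; field defined only over the veg landunit
print(landunit_to_gridcell([10.0, np.nan], [0.5, 0.5], [True, False]))  # 10.0
```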

This inconsistency is itself a problem: it makes it hard to understand the treatment of any given history variable, and hard to tell whether two variables are treated the same way or differently. In addition, it is almost certainly the case that some history variables are treated differently from how they should be.

Lack of metadata needed for correct regional and global averages

There is another, related problem with our diagnostics: We do not have the metadata on the history files that would be needed to calculate correct regional and global averages. This issue is especially pronounced for fields that only apply over special landunits, which typically have a small and very variable area per grid cell. Consider two grid cells, one with 10% lake and one with 50% lake, and a history field that only applies over lakes. When you take the average of those two grid cells, you would like the second grid cell to contribute 5x as much to the weighted average as the first grid cell, but the history files do not provide the metadata that would be needed for you to know how to do this.
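A minimal worked example of that two-gridcell case (hypothetical values, equal gridcell areas assumed):

```python
# Two grid cells of equal area: 10% lake and 50% lake, with hypothetical
# lake-only field values of 2.0 and 4.0.
lake_frac = [0.10, 0.50]
values    = [2.0, 4.0]

# Naive average (no landunit-area metadata available):
naive = sum(values) / len(values)                                    # 3.0

# Lake-area-weighted average: the second cell counts 5x as much.
weighted = sum(v * f for v, f in zip(values, lake_frac)) / sum(lake_frac)
# (2.0*0.10 + 4.0*0.50) / 0.60 = 3.67
```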

Part of the problem with adding the correct metadata is the inconsistency described above for how we set missing values for particular landunits. If we can migrate to a consistent solution for that, then we can add an extra bit of metadata for each history field pointing to the weighting map that should be used for that field.

My preferred solution

My preference would be to migrate all history fields to using the l2g_scale_type mechanism. This would mean dropping support for the set_* arguments to hist_addfld, removing the initial setting of all history fields to spval, and removing any ad hoc settings of history fields to spval, if any currently exist. Removing the initial setting of history fields to spval would be nice for a few other reasons: It has caused some subtle issues in the past, and it stops us from being able to check for NaNs in the code very effectively when running in debug mode.

If we can completely remove this use of spval in the code, we may be able to remove the checks for whether the field is spval in the subgridAve code.

Then we can add fields to the history file that give weighting maps corresponding to each possible l2g_scale_type variable. Note that these are potentially dynamic in time, due to dynamic landunits. Alternatively, we could have some standard way of denoting the applicable landunits in a metadata field and let the user construct their own weighting maps using the existing PCT_LANDUNIT metadata. (The latter would be more work for the user, but would avoid adding bloat to the history files.)
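As a sketch of the second option, a user-side script could build a weighting map from PCT_LANDUNIT along these lines (the variable names, dimension names, and landunit indices here are assumptions for illustration, not the actual history file conventions):

```python
import xarray as xr

# Illustrative only: "clm_history.nc", the 'ltype' dimension, the landunit
# indices, and the field/area variable names are all assumptions.
ds = xr.open_dataset("clm_history.nc")

veg_ltypes = [0, 7]  # hypothetical indices for natural veg and crop landunits
pct_veg = ds["PCT_LANDUNIT"].isel(ltype=veg_ltypes).sum("ltype") / 100.0
weights = (pct_veg * ds["area"]).fillna(0.0)   # vegetated area per grid cell

field = ds["SOME_VEG_ONLY_FIELD"]              # hypothetical field name
regional_avg = field.weighted(weights).mean(dim=("lat", "lon"))
```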

Digging back through some notes, I see that I discussed this pretty extensively with @dlawrenncar @olyson @ekluzek and @slevisconsulting in Fall 2011. Apparently I even started working on this at that time, with a very work-in-progress branch in svn: https://svn-ccsm-models.cgd.ucar.edu/clm2/branches/percent_na_metadata . But then that work was abandoned due to higher-priority issues.

  • (2021-04-19) I looked at what I did on that branch. None of that work is relevant anymore, so there's no need to look back at that branch.
billsacks (Member, Author) commented Apr 15, 2021

From discussion at last week's CLM-science meeting (see https://github.com/ESCOMP/CTSM/wiki/Meeting-Notes-2021-Science#april-8 and the presentation linked from there):

  • People generally support moving to l2g_scale_type for fields where we only take averages over certain landunits
  • No clear sense of the best solution for fields that are always 0 over certain landunits
  • No clear sense at this point of how we'll move to this consistency, though we could probably start by focusing on the default biogeophysical variables (non-default variables less important, and my sense is that the biogeochemical variables may already be more consistent)
  • For the correct way of forming regional / global averages: Idea of prototyping a few fields

billsacks (Member, Author) commented Apr 15, 2021

I have opened a project to track individual tasks within this larger project: https://github.com/ESCOMP/CTSM/projects/35 . General discussion about this project can stay in this issue.

billsacks (Member, Author) commented:

It might make sense to make l2g_scale_type (intended to be renamed to landunit_mask) a required argument when defining a history field. For fields that have valid values for all landunits, we could have an 'all_landunits' value. Even though it will be a bit tedious to add this argument to the 1000+ calls, this might help with consistency moving forward, since it will force anyone adding a new field to consider what the averaging should be.

(However, this may be unnecessary if we find that most fields do have valid values over all landunits.)

billsacks (Member, Author) commented:

I have added individual issues for the first few steps of this project. Those should get us to the point where we can start playing with how we want to write scripts for taking global and regional averages.

ekluzek (Collaborator) commented Apr 15, 2021

I like making landunit_mask a required argument, since you should think about it each time you add a hist_addfld call. The default could be all, but unless you make it a required argument it doesn't force the developer to think about it. I think we likely will need to go through a few iterations of this to get to the final state that we want it in. So the first step might be to make "all" the default, but in a later step we require it of all calls, which then requires changing the thousand calls.

billsacks (Member, Author) commented:

I have now added issues for all of the steps I can think of associated with this project.

klindsay28 commented:

Late in the game comment here ...

Have you considered using the CF convention attribute cell_measures?

In the hypothetical example of

Consider a grid cell that is 1/2 natural veg and 1/2 lake, and a diagnostic field that is only calculated over vegetated columns. If the value of this variable is 10 in the vegetated column, what should its grid cell average be: 10? 5?

could you average only over the vegetated area, and provide a cell_measures attribute that points to a variable that contains vegetated area?

There is admittedly a chicken-and-egg problem of using this attribute in output and updating analysis workflows to take it into account.
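For illustration, an analysis script that honors such a cell_measures attribute might look roughly like this (the file, field, and area variable names are hypothetical):

```python
import xarray as xr

# Hypothetical names; the point is just the pattern of following a
# cell_measures attribute like "area: vegetated_area".
ds = xr.open_dataset("clm_history.nc")
var = ds["SOME_VEG_ONLY_FIELD"]

measure = var.attrs.get("cell_measures", "")   # e.g. "area: vegetated_area"
area_name = measure.split(":", 1)[1].strip() if ":" in measure else "area"
weights = ds[area_name].fillna(0.0)

global_mean = var.weighted(weights).mean(dim=("lat", "lon"))
```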

billsacks (Member, Author) commented:

Thanks a lot for your thoughts @klindsay28 . I'm going to copy your comment to #1345 because I think the best thing to do is for the people writing post-processing scripts to decide what will work best for them, and then we can implement that in the model.
