Default units #3585

Closed
bjlittle opened this issue Dec 6, 2019 · 15 comments

Comments

@bjlittle
Member

bjlittle commented Dec 6, 2019

Currently iris implements the following default units for its containers (ignoring AuxCoordFactory here, as the units of the resultant derived coordinate are determined from the units of the AuxCoordFactory coordinate dependencies):

Class                 Default Unit  Abstract
_DimensionalMetadata  no-unit
AncillaryVariable     no-unit
AuxCoord              1
CellMeasure           1
Coord                 1
Cube                  unknown
DimCoord              1

Are these sensible defaults?

I think that we should answer this question as part of the MVP of 3.0.0.


Check out these related issues/PRs:

@bjlittle
Member Author

bjlittle commented Dec 6, 2019

Ping @SciTools/iris-devs

@lbdreyer
Member

lbdreyer commented Dec 6, 2019

My initial reaction is that they should all be "unknown", as is currently the case for cubes, for the same reason that the name is "unknown" unless otherwise specified.

Are there any cases where it would be reasonable for the user to expect a default unit?

@bjlittle
Member Author

bjlittle commented Dec 6, 2019

> Are there any cases where it would be reasonable for the user to expect a default unit?

@lbdreyer I'm not 100% confident, but I don't think so... To me it seems sensible to assume a default of unknown, which tells the user that there is no metadata available to specify what the units should be - and this would align with the default philosophy applied to the Cube.

I can't recall the history behind this decision, and I'm sure that it must have been debated at some point. It probably goes way, way back to the origin of iris... perhaps @rhattersley might have some thoughts, recollections or an opinion to help clarify?

I can only assume that the default of 1 comes from the canonical units reference in CF Metadata Section 3.3 Standard Names... dunno 😕

Otherwise, at the moment I'm leaning towards a blanket default of unknown for units everywhere...

@rhattersley
Member

> perhaps @rhattersley might have some thoughts, recollections or an opinion to help clarify?

IIRC it comes from section "3.1. Units" that states:

> A variable with no units attribute is assumed to be dimensionless.

That said, that statement in the CF conventions strikes me as decidedly questionable! It would have been safer to define a missing units attribute as "the units have not been defined" / "the file is not CF-compliant"... but as you know, that kind of sloppy pattern is all too common in the CF conventions. 😑 Sorry... no more whinging about the conventions, I promise!

My 2p ... the defaults for the in-memory objects should be unknown across the board, and possibly the CF-netCDF loader should impose the 1 interpretation.

@rcomer
Member

rcomer commented Dec 10, 2019

Just looking at some AuxCoords I have where the points are strings. In this case it doesn't really make sense that the units are 1. They aren't a measure, just a label.

@cpelley

cpelley commented Dec 11, 2019

> My 2p ... the defaults for the in-memory objects should be unknown across the board, and possibly the CF-netCDF loader should impose the 1 interpretation.

Just to put this on the table.
I would propose that we just raise a warning to the user when there isn't a unit on loading a NetCDF and do away with the CF NetCDF caveat case (i.e. set 'unknown' in all cases of no units).
I think this is more likely to help point people to the fact that they never intended to have dimensionless units in their files in the first place...
(I guess I'm saying that I think the CF statement could be more trouble than it's worth to maintain)
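
To make the proposal above concrete, here is a minimal sketch of "warn and fall back to unknown" at load time. It is not iris code: `units_from_attributes` is a hypothetical helper, and the plain string "unknown" stands in for the real unit object.

```python
import warnings

UNKNOWN = "unknown"  # assumption: plain string standing in for the 'unknown' unit


def units_from_attributes(var_name, attributes):
    """Return the units for a loaded variable (hypothetical helper, not iris API)."""
    units = attributes.get("units")
    if units is None:
        # No CF special case: warn the user and record the units as unknown.
        warnings.warn(
            f"Variable {var_name!r} has no units attribute; using {UNKNOWN!r}.",
            UserWarning,
        )
        return UNKNOWN
    return units
```

The warning is what points users at files where a dimensionless interpretation was never intended.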

@stephenworsley
Contributor

stephenworsley commented Feb 10, 2020

> A variable with no units attribute is assumed to be dimensionless.

From further on in "3.1. Units"...

> The Udunits package defines a few dimensionless units...

Two examples of dimensionless units are then given: "1" and "1e-6". I think it is incorrect to equate "dimensionless" with "having '1' as a unit". By my understanding of the CF conventions, I believe it is acceptable (and probably correct) to apply a unit of unknown (or possibly no-unit) when a NetCDF file fails to supply one.
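
A small sketch of why "dimensionless" and "units of 1" differ: a dimensionless unit can still carry a scale. The table and the `convert_dimensionless` helper below are hypothetical; "1" and "1e-6" are the two examples quoted from CF, and "percent" is added only for illustration.

```python
# Scale of each dimensionless unit relative to "1".
# "percent" is an illustrative addition, not one of the two CF examples.
DIMENSIONLESS_SCALE = {"1": 1.0, "1e-6": 1e-6, "percent": 1e-2}


def convert_dimensionless(value, from_units, to_units):
    """Convert a value between two dimensionless units (hypothetical helper)."""
    return value * DIMENSIONLESS_SCALE[from_units] / DIMENSIONLESS_SCALE[to_units]


# 250 parts per million expressed as a plain ratio.
ratio = convert_dimensionless(250.0, "1e-6", "1")
```

Both values are dimensionless, yet treating a "1e-6" value as if it had units of "1" would be wrong by six orders of magnitude.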

@bjlittle
Member Author

bjlittle commented May 13, 2020

After an offline conversation with @stephenworsley, we tended towards the following 1, no-unit and unknown behaviour for units:

  • unknown is the default units, exceptions are:
    • string dtype objects default to no-unit
    • flags objects default to no-unit
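
As a rough sketch, the default rule above might look like this. `default_units` and its `is_flags` argument are hypothetical names, and units are shown as plain strings rather than real unit objects.

```python
def default_units(points, is_flags=False):
    """Pick units for an object created without explicit units (hypothetical)."""
    # Treat the object as "string dtype" when every point is a str.
    is_string = len(points) > 0 and all(isinstance(p, str) for p in points)
    if is_string or is_flags:
        return "no-unit"
    return "unknown"
```

For example, `default_units(["a", "b"])` gives "no-unit", while `default_units([1.0, 2.0])` gives "unknown".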

for saving:

  • no-unit is not saved to NetCDF
  • unknown is not saved to NetCDF
  • 1 is saved as is to NetCDF, be opinionated about these dimensionless units if you care
  • the units for string dtype objects are omitted from NetCDF only when they are unknown or no-unit
  • the units for flags objects are omitted from NetCDF only when they are unknown or no-unit
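
The saving rules above reduce to one decision: write the units attribute or omit it. A minimal sketch, assuming units are plain strings and using a hypothetical helper name:

```python
def units_attribute_for_save(units):
    """Return the units string to write, or None to omit the attribute."""
    if units in ("unknown", "no-unit"):
        # Neither placeholder is written to the file.
        return None
    # Everything else, including "1", is written as is.
    return units
```

The same rule covers string and flags objects, since their units are only omitted when they are unknown or no-unit.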

for loading:

  • units of 1 are loaded as is
  • a lack of units metadata is interpreted as unknown, exceptions are:
    • string dtype objects default to no-unit
    • flags objects default to no-unit
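
A matching sketch for the loading rules. The helper name is hypothetical, and detecting a flags object by the presence of the CF flag attributes is an assumption made for this example, not necessarily how the loader decides.

```python
# CF attribute names that mark a flag variable (assumed detection rule).
FLAG_ATTRIBUTES = ("flag_values", "flag_masks", "flag_meanings")


def units_on_load(attributes, is_string=False):
    """Interpret the units of a loaded variable (hypothetical helper)."""
    units = attributes.get("units")
    if units is not None:
        # Explicit units, including "1", are loaded as is.
        return units
    is_flags = any(name in attributes for name in FLAG_ATTRIBUTES)
    if is_string or is_flags:
        return "no-unit"
    return "unknown"
```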

In general:

  • we want to preserve metadata units save and load round-tripping
  • the user is king / queen i.e., they get what they ask for, units metadata-wise
  • users are not surprised
  • * and / cube arithmetic with units of no-unit is not possible
    • this supports the argument against no-unit being the default units
  • don't save out units of unknown, as it's not CF-Conventions compliant
  • don't save out units of no-unit, as it's not CF-Conventions compliant (see [PI] Aligning save behaviour of cube with units "unknown" and "no-unit" #3394)
  • the CF-Conventions are questionable/unclear in certain units related areas, therefore iris can be opinionated through implementation
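
The arithmetic point can be sketched on plain strings. `multiply_units` is a hypothetical stand-in for real unit arithmetic; the rule that unknown propagates through a product is an assumption about how the unit library behaves, and the product is left symbolic rather than simplified.

```python
def multiply_units(left, right):
    """Combine two unit strings under multiplication (hypothetical sketch)."""
    if "no-unit" in (left, right):
        # Arithmetic with no-unit is not possible.
        raise ValueError("cannot multiply units of 'no-unit'")
    if "unknown" in (left, right):
        # Assumption: an unknown operand makes the result unknown.
        return "unknown"
    return f"({left}) * ({right})"
```

With no-unit as the blanket default, every product of two default-constructed objects would raise, which is why unknown is the friendlier default.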

@pp-mo
Member

pp-mo commented May 13, 2020

Nice summary!

( I just deleted previous comments on "units is None". Re-reading, I think you ruled that out. )

@pp-mo
Member

pp-mo commented May 14, 2020

In view of the now proposed #3705, are we happy that the Coord constructors still give a default units='1' ?
If we were to change that, it's definitely going to affect a lot of people.
For instance... #3647 (comment)
... but it does begin to seem a bit illogical.

@abooton abooton changed the title Default units Default units (Improve them when loading and creating cubes) May 14, 2020
@abooton abooton changed the title Default units (Improve them when loading and creating cubes) Default units May 14, 2020
@pp-mo
Member

pp-mo commented May 15, 2020

Just to add some fuel. I think all of these problems stem from that questionable CF statement :

> A variable with no units attribute is assumed to be dimensionless.

This is quoted by both @rhattersley and @stephenworsley above, and I must agree here with @rhattersley that "that statement in the CF conventions strikes me as decidedly questionable" !!

The problem as I see it, is that when in a scientific context you say that a numeric value is "dimensionless", this normally only means that it has no physical unit (e.g. length or time), and is equivalent in dimensional analysis to a "pure ratio" : That doesn't mean it has no scale reference -- it could be a fraction, a percentage, an angle, an angle cosine ...
Whereas, a string value simply "has no units".

So, in addition to what @stephenworsley has said -- that dimensionless is not simply the same as '1', I would add that "unitless" is not at all the same as "dimensionless" !

CF has managed to confuse these things by stating that no stated unit is equivalent to "dimensionless", which is practically the same as saying it defaults to a unit of '1'.
But of course that doesn't really apply to strings.
Worse still, this does not logically apply to measures which are unitless but "happen to be" numeric values, like bitwise flags, station identifiers, or land-categories -- all of which are very commonly used in CF encoded datafiles.
A useful way of seeing this is, I think, that such things do not have a linear scale and can not be combined in arithmetic, e.g. you cannot average, scale or interpolate station numbers, landuse categories etc.
( In maths, things represented by "numbers" are just not all the same type of measure, with the same valid rules . )
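
A tiny illustration of the point about categories, using made-up land-use codes (the code table is hypothetical):

```python
from statistics import mean

# Hypothetical land-use codes; the numbers are labels, not quantities.
LAND_CATEGORIES = {1: "forest", 2: "grassland", 4: "urban"}

cell_codes = [1, 4, 4]
average_code = mean(cell_codes)  # arithmetic runs without complaint
average_is_category = average_code in LAND_CATEGORIES
```

The mean is computed without error, yet the result is not a member of the category table, so the "average" has no meaning.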

So, if we try to rationalise the "unreasonable" CF rule, our interpretation will break for any file data that doesn't specify a unit but "expects" to be interpreted as a dimensionless numeric value, like a factor or proportion, e.g. fractional measures, or even percentages.

Whereas, the CF approach is pragmatically safer, because for calculations it will "do less harm" to assign a unit to something that doesn't have one, than to treat something as unitless when that might not have been intended :
e.g. "I can't do arithmetic on it, I get a unit error when multiplying by another value" -- this may not be the intended meaning of the file creator.

I think we have a conflict here, between what is sensible for our internal logic and operations, and the desire to preserve logical distinctions for "roundtrip" of data, i.e. saving what was in the original file.
By trying to be "more logical than CF", we can be consistent for data read-to-write passthrough, but we will introduce potential problems in calculation, for particular (?rare?) cases of datafiles and user code.

We need to be able to argue that the problem cases will be rare and acceptable.

This is a lot like the problems we have had with fill-value, and masking : More logical interpretations often cause practical problems, because the rest of the world is not sufficiently logical 😉

@rhattersley
Member

> More logical interpretations often cause practical problems, because the rest of the world is not sufficiently logical

Quote of the year 👍 😁

@bjlittle
Member Author

I personally think that no matter what we do, something will definitely break for someone, which is why this needs to be resolved for iris v3.0.0. It's now or never.

In light of the logical pea soup that is CF, we just need to give clarity through being opinionated. Otherwise there is a very real danger that this is all going to be really quite confusing, and personally I'd rather err on the side of clarity through keeping things really simple (just like myself). Remember we need to explain all this nonsense to the user. Occam's Razor, and all that.

So, one person's sensible default is not necessarily another's, and yet we need to square that circle... and that's not really possible, unless I'm missing something totally obvious (which wouldn't be the first time).

For what it's worth, I'd naively opt for a blanket unknown (as per above, with exceptions).

You could argue that it's lazy to rely on implementation defaults particularly if you care about your metadata. So if you really care i.e., really, really care, then be explicit. The user knows their data, we don't, so the simple message is, "if you care about the units of your data, set them accordingly, otherwise (with some exceptions) they'll be unknown".

@stephenworsley
Contributor

Just to add another case to consider. iris.coord_categorisation.add_categorised_coord() also has a default unit (currently "1") which will have to be decided on.

@abooton
Contributor

abooton commented Jun 10, 2020

See #3708 for the initial progress implementing these changes. A summary, based on the above discussion, of how we are interpreting the differing "unit types" is also included (see the "Context" section in the issue description).

@abooton abooton closed this as completed Jun 10, 2020

8 participants