Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent rules for handling merges between variables with different attributes #25

Closed
shoyer opened this issue Feb 26, 2014 · 13 comments
Labels
API design topic-metadata Relating to the handling of metadata (i.e. attrs and encoding)

Comments

@shoyer
Copy link
Member

shoyer commented Feb 26, 2014

Currently, variable attributes are checked for equality before allowing for a merge via a call to xarray_equal. It should be possible to merge datasets even if some of the variable metadata disagrees (conflicting attributes should be dropped). This is already the behavior for global attributes.

The right design of this feature should probably include some optional argument to Dataset.merge indicating how strict we want the merge to be. I can see at least three versions that could be useful:

  1. Drop conflicting metadata silently.
  2. Don't allow for conflicting values, but drop non-matching keys.
  3. Require all keys and values to match.

We can argue about which of these should be the default option. My inclination is to be as flexible as possible by using 1 or 2 in most cases.

@ebrevdo
Copy link
Contributor

ebrevdo commented Feb 26, 2014

I would default to 3, and in the exception suggest using a different merge option. Imagine merging two datasets with different _FillValue, unit, or compression attributes.

@shoyer
Copy link
Member Author

shoyer commented Feb 26, 2014

Dataset.merge is also triggered by assigning a DatasetArray to a dataset or by doing a mathematical operation on two DatasetArrays (e.g., x + y). The later is how I encountered this issue today.

For merge itself, I would agree that we may want to default to stricter behavior, but for these other versions of merge we should default to something more flexible.

@ebrevdo
Copy link
Contributor

ebrevdo commented Feb 26, 2014

It depends on whether x+y does attribute checking before performing the
merge. Again, if units don't match then maybe you shouldn't add. I always
favor the strictest approach so you don't get strange surprises.

On Wed, Feb 26, 2014 at 2:47 PM, Stephan Hoyer [email protected]:

Dataset.merge is also triggered by assigning a DatasetArray to a dataset
or by doing a mathematical operation on two DatasetArrays (e.g., x + y).
The later is how I encountered this issue today.

For merge itself, I would agree that we may want to default to stricter
behavior, but for these other versions of merge we should default to
something more flexible.

Reply to this email directly or view it on GitHubhttps://github.com/akleeman/xray/issues/25#issuecomment-36187723
.

@shoyer
Copy link
Member Author

shoyer commented Feb 26, 2014

x + y could indeed check variable attributes before trying to do the merge. I don't know if it does in the current implementation.

My concern is more that metadata like "title" or "source" should not be required to match, because that metadata will almost always be conflicting. Perhaps "units", "_FIllValue", "scale_factor" and "add_offset" (if values were not automatically masked/scaled) should be specifically blacklisted to prohibit conflicts.

@ebrevdo
Copy link
Contributor

ebrevdo commented Feb 26, 2014

This is an option, but these lists will break if we try to express other
data formats using these conventions. For example, grib likely has other
conventions. We would have to overload attribute or variable depending on
what the underlying datastore is.

On Wed, Feb 26, 2014 at 3:03 PM, Stephan Hoyer [email protected]:

x + y could indeed check variable attributes before trying to do the
merge. I don't know if it does in the current implementation.

My concern is more that metadata like "title" or "source" should not be
required to match, because that metadata will almost always be conflicting.
Perhaps "units", "_FIllValue", "scale_factor" and "add_offset" (if values
were not automatically masked/scaled) should be specifically blacklisted to
prohibit conflicts.

Reply to this email directly or view it on GitHubhttps://github.com/akleeman/xray/issues/25#issuecomment-36189171
.

@ebrevdo
Copy link
Contributor

ebrevdo commented Feb 26, 2014

Also, there are plenty of other bits where you don't want conflicts.
Imagine that you have variables indexed on different basemap projections.
Creating exceptions to the rule seems like a bit of a rabbit hole.

On Wed, Feb 26, 2014 at 3:13 PM, Eugene Brevdo [email protected] wrote:

This is an option, but these lists will break if we try to express other
data formats using these conventions. For example, grib likely has other
conventions. We would have to overload attribute or variable depending on
what the underlying datastore is.

On Wed, Feb 26, 2014 at 3:03 PM, Stephan Hoyer [email protected]:

x + y could indeed check variable attributes before trying to do the
merge. I don't know if it does in the current implementation.

My concern is more that metadata like "title" or "source" should not be
required to match, because that metadata will almost always be conflicting.
Perhaps "units", "_FIllValue", "scale_factor" and "add_offset" (if
values were not automatically masked/scaled) should be specifically
blacklisted to prohibit conflicts.

Reply to this email directly or view it on GitHubhttps://github.com/akleeman/xray/issues/25#issuecomment-36189171
.

@shoyer
Copy link
Member Author

shoyer commented Feb 26, 2014

I see your point, but I favor a more pragmatic approach by default. See my
fourth bullet under "Design Goals" in the README and bullet ii under Iris
in "Prior Art".

My vision here is a more powerful ndarray enhanced rather than limited by
metadata. This is closer to what pandas does, which even allows for
conflicting indices resulting in NaN values (a feature I would love to
copy).

I think that both use cases can be covered as long as the merge/conflict
logic is clearly documented and it is possible to write stricter logic for
library code (which by necessity will be more verbose). If it is essential
for units to agree before doing x + y, you can add assert x.attribubes.get('units') == y.attributes.get('units'). Otherwise, we will
end up prohibiting operations like that when x has units of Celsius and y
has units of Kelvin.

On Wed, Feb 26, 2014 at 3:23 PM, ebrevdo [email protected] wrote:

Also, there are plenty of other bits where you don't want conflicts.
Imagine that you have variables indexed on different basemap projections.
Creating exceptions to the rule seems like a bit of a rabbit hole.

On Wed, Feb 26, 2014 at 3:13 PM, Eugene Brevdo [email protected] wrote:

This is an option, but these lists will break if we try to express other
data formats using these conventions. For example, grib likely has other
conventions. We would have to overload attribute or variable depending on
what the underlying datastore is.

On Wed, Feb 26, 2014 at 3:03 PM, Stephan Hoyer <[email protected]
wrote:

x + y could indeed check variable attributes before trying to do the
merge. I don't know if it does in the current implementation.

My concern is more that metadata like "title" or "source" should not be
required to match, because that metadata will almost always be
conflicting.
Perhaps "units", "_FIllValue", "scale_factor" and "add_offset" (if
values were not automatically masked/scaled) should be specifically
blacklisted to prohibit conflicts.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36189171>
.

Reply to this email directly or view it on GitHubhttps://github.com/akleeman/xray/issues/25#issuecomment-36190935
.

@ebrevdo
Copy link
Contributor

ebrevdo commented Feb 26, 2014

I don't think that example has your intended affect. I don't know why
anyone would add something of units kelvin with those of celsius. I
understand what you're saying, so maybe we should just throw a stern
warning listing which units conflict and how, every single time.

On Wed, Feb 26, 2014 at 3:42 PM, Stephan Hoyer [email protected]:

I see your point, but I favor a more pragmatic approach by default. See my
fourth bullet under "Design Goals" in the README and bullet ii under Iris
in "Prior Art".

My vision here is a more powerful ndarray enhanced rather than limited by
metadata. This is closer to what pandas does, which even allows for
conflicting indices resulting in NaN values (a feature I would love to
copy).

I think that both use cases can be covered as long as the merge/conflict
logic is clearly documented and it is possible to write stricter logic for
library code (which by necessity will be more verbose). If it is essential
for units to agree before doing x + y, you can add assert x.attribubes.get('units') == y.attributes.get('units'). Otherwise, we will
end up prohibiting operations like that when x has units of Celsius and y
has units of Kelvin.

On Wed, Feb 26, 2014 at 3:23 PM, ebrevdo [email protected] wrote:

Also, there are plenty of other bits where you don't want conflicts.
Imagine that you have variables indexed on different basemap projections.
Creating exceptions to the rule seems like a bit of a rabbit hole.

On Wed, Feb 26, 2014 at 3:13 PM, Eugene Brevdo [email protected]
wrote:

This is an option, but these lists will break if we try to express
other
data formats using these conventions. For example, grib likely has
other
conventions. We would have to overload attribute or variable depending
on
what the underlying datastore is.

On Wed, Feb 26, 2014 at 3:03 PM, Stephan Hoyer <
[email protected]
wrote:

x + y could indeed check variable attributes before trying to do the
merge. I don't know if it does in the current implementation.

My concern is more that metadata like "title" or "source" should not
be
required to match, because that metadata will almost always be
conflicting.
Perhaps "units", "_FIllValue", "scale_factor" and "add_offset" (if

values were not automatically masked/scaled) should be specifically
blacklisted to prohibit conflicts.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36189171>
.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36190935>

.

Reply to this email directly or view it on GitHubhttps://github.com/akleeman/xray/issues/25#issuecomment-36192859
.

@ebrevdo
Copy link
Contributor

ebrevdo commented Feb 26, 2014

err, which attributes conflict.

On Wed, Feb 26, 2014 at 3:45 PM, Eugene Brevdo [email protected] wrote:

I don't think that example has your intended affect. I don't know why
anyone would add something of units kelvin with those of celsius. I
understand what you're saying, so maybe we should just throw a stern
warning listing which units conflict and how, every single time.

On Wed, Feb 26, 2014 at 3:42 PM, Stephan Hoyer [email protected]:

I see your point, but I favor a more pragmatic approach by default. See my
fourth bullet under "Design Goals" in the README and bullet ii under Iris
in "Prior Art".

My vision here is a more powerful ndarray enhanced rather than limited by
metadata. This is closer to what pandas does, which even allows for
conflicting indices resulting in NaN values (a feature I would love to
copy).

I think that both use cases can be covered as long as the merge/conflict
logic is clearly documented and it is possible to write stricter logic for
library code (which by necessity will be more verbose). If it is essential
for units to agree before doing x + y, you can add assert x.attribubes.get('units') == y.attributes.get('units'). Otherwise, we
will
end up prohibiting operations like that when x has units of Celsius and y
has units of Kelvin.

On Wed, Feb 26, 2014 at 3:23 PM, ebrevdo [email protected]
wrote:

Also, there are plenty of other bits where you don't want conflicts.
Imagine that you have variables indexed on different basemap
projections.
Creating exceptions to the rule seems like a bit of a rabbit hole.

On Wed, Feb 26, 2014 at 3:13 PM, Eugene Brevdo [email protected]
wrote:

This is an option, but these lists will break if we try to express
other
data formats using these conventions. For example, grib likely has
other
conventions. We would have to overload attribute or variable
depending on
what the underlying datastore is.

On Wed, Feb 26, 2014 at 3:03 PM, Stephan Hoyer <
[email protected]
wrote:

x + y could indeed check variable attributes before trying to do the
merge. I don't know if it does in the current implementation.

My concern is more that metadata like "title" or "source" should not
be
required to match, because that metadata will almost always be
conflicting.
Perhaps "units", "_FIllValue", "scale_factor" and "add_offset" (if

values were not automatically masked/scaled) should be specifically
blacklisted to prohibit conflicts.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36189171>
.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36190935>

.

Reply to this email directly or view it on GitHubhttps://github.com/akleeman/xray/issues/25#issuecomment-36192859
.

@shoyer
Copy link
Member Author

shoyer commented Feb 27, 2014

Stern warnings about conflicting attributes like units may be an
appropriate compromise. But if we go that way, I would advocate for making
units drop upon doing any mathematical operation. We could try to update
units automatically (e.g., kg * kg = kg^2), but that is tricky to always
right.

Celsius, for example, is a pretty weird physical unit because of how it can
take on negative values, so it actually makes a lot of sense to use mostly
use Kelvin instead (for which I can sensibly apply any math operation like
times or minus). That doesn't mean that I want to store all my raw data in
degrees K, though...

On Wed, Feb 26, 2014 at 3:45 PM, ebrevdo [email protected] wrote:

err, which attributes conflict.

On Wed, Feb 26, 2014 at 3:45 PM, Eugene Brevdo [email protected] wrote:

I don't think that example has your intended affect. I don't know why
anyone would add something of units kelvin with those of celsius. I
understand what you're saying, so maybe we should just throw a stern
warning listing which units conflict and how, every single time.

On Wed, Feb 26, 2014 at 3:42 PM, Stephan Hoyer <[email protected]
wrote:

I see your point, but I favor a more pragmatic approach by default. See
my
fourth bullet under "Design Goals" in the README and bullet ii under
Iris
in "Prior Art".

My vision here is a more powerful ndarray enhanced rather than limited
by
metadata. This is closer to what pandas does, which even allows for
conflicting indices resulting in NaN values (a feature I would love to
copy).

I think that both use cases can be covered as long as the merge/conflict
logic is clearly documented and it is possible to write stricter logic
for
library code (which by necessity will be more verbose). If it is
essential
for units to agree before doing x + y, you can add assert x.attribubes.get('units') == y.attributes.get('units'). Otherwise, we
will
end up prohibiting operations like that when x has units of Celsius and
y
has units of Kelvin.

On Wed, Feb 26, 2014 at 3:23 PM, ebrevdo [email protected]
wrote:

Also, there are plenty of other bits where you don't want conflicts.
Imagine that you have variables indexed on different basemap
projections.
Creating exceptions to the rule seems like a bit of a rabbit hole.

On Wed, Feb 26, 2014 at 3:13 PM, Eugene Brevdo [email protected]
wrote:

This is an option, but these lists will break if we try to express
other
data formats using these conventions. For example, grib likely has
other
conventions. We would have to overload attribute or variable
depending on
what the underlying datastore is.

On Wed, Feb 26, 2014 at 3:03 PM, Stephan Hoyer <
[email protected]
wrote:

x + y could indeed check variable attributes before trying to do
the
merge. I don't know if it does in the current implementation.

My concern is more that metadata like "title" or "source" should
not
be
required to match, because that metadata will almost always be
conflicting.
Perhaps "units", "_FIllValue", "scale_factor" and "add_offset" (if

values were not automatically masked/scaled) should be specifically
blacklisted to prohibit conflicts.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36189171>
.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36190935>

.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36192859>
.

Reply to this email directly or view it on GitHubhttps://github.com/akleeman/xray/issues/25#issuecomment-36193148
.

@ebrevdo
Copy link
Contributor

ebrevdo commented Feb 27, 2014

Agreed. I would avoid that kind of thing too. Maybe a stern warning for
all conflicting attributes, and saying that they will be dropped from the
new variable.

For units specifically, Python has a variety of unit libraries that wrap
numpy arrays and can probably do some magic. Not sure if we really want to
do that, though.

On Wed, Feb 26, 2014 at 4:07 PM, Stephan Hoyer [email protected]:

Stern warnings about conflicting attributes like units may be an
appropriate compromise. But if we go that way, I would advocate for making
units drop upon doing any mathematical operation. We could try to update
units automatically (e.g., kg * kg = kg^2), but that is tricky to always
right.

Celsius, for example, is a pretty weird physical unit because of how it can
take on negative values, so it actually makes a lot of sense to use mostly
use Kelvin instead (for which I can sensibly apply any math operation like
times or minus). That doesn't mean that I want to store all my raw data in
degrees K, though...

On Wed, Feb 26, 2014 at 3:45 PM, ebrevdo [email protected] wrote:

err, which attributes conflict.

On Wed, Feb 26, 2014 at 3:45 PM, Eugene Brevdo [email protected]
wrote:

I don't think that example has your intended affect. I don't know why
anyone would add something of units kelvin with those of celsius. I
understand what you're saying, so maybe we should just throw a stern
warning listing which units conflict and how, every single time.

On Wed, Feb 26, 2014 at 3:42 PM, Stephan Hoyer <
[email protected]
wrote:

I see your point, but I favor a more pragmatic approach by default.
See
my
fourth bullet under "Design Goals" in the README and bullet ii under
Iris
in "Prior Art".

My vision here is a more powerful ndarray enhanced rather than limited
by
metadata. This is closer to what pandas does, which even allows for
conflicting indices resulting in NaN values (a feature I would love to
copy).

I think that both use cases can be covered as long as the
merge/conflict
logic is clearly documented and it is possible to write stricter logic
for
library code (which by necessity will be more verbose). If it is
essential
for units to agree before doing x + y, you can add assert x.attribubes.get('units') == y.attributes.get('units'). Otherwise, we
will
end up prohibiting operations like that when x has units of Celsius
and
y
has units of Kelvin.

On Wed, Feb 26, 2014 at 3:23 PM, ebrevdo [email protected]
wrote:

Also, there are plenty of other bits where you don't want
conflicts.
Imagine that you have variables indexed on different basemap
projections.
Creating exceptions to the rule seems like a bit of a rabbit hole.

On Wed, Feb 26, 2014 at 3:13 PM, Eugene Brevdo [email protected]
wrote:

This is an option, but these lists will break if we try to express
other
data formats using these conventions. For example, grib likely has
other
conventions. We would have to overload attribute or variable
depending on
what the underlying datastore is.

On Wed, Feb 26, 2014 at 3:03 PM, Stephan Hoyer <
[email protected]
wrote:

x + y could indeed check variable attributes before trying to do
the
merge. I don't know if it does in the current implementation.

My concern is more that metadata like "title" or "source" should
not
be
required to match, because that metadata will almost always be
conflicting.
Perhaps "units", "_FIllValue", "scale_factor" and "add_offset"
(if

values were not automatically masked/scaled) should be
specifically
blacklisted to prohibit conflicts.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36189171>
.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36190935>

.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36192859>
.

Reply to this email directly or view it on GitHub<
https://github.com/akleeman/xray/issues/25#issuecomment-36193148>

.

Reply to this email directly or view it on GitHubhttps://github.com/akleeman/xray/issues/25#issuecomment-36194729
.

@shoyer shoyer added this to the 0.1 milestone Mar 3, 2014
shoyer added a commit that referenced this issue Apr 27, 2014
Also:
1. Don't try to preserve attributes under mathematical operations.
2. Finish up some cleanup related to "equals" and "identical" for testing.
3. Options for how strictly to compare varaibles when merging or concatenating
   (see #25).

Fixes #103 and #104.
shoyer added a commit that referenced this issue Apr 27, 2014
Also:
1. Don't try to preserve attributes under mathematical operations.
2. Finish up some cleanup related to "equals" and "identical" for testing.
3. Options for how strictly to compare varaibles when merging or concatenating
   (see #25).

Fixes #103 and #104.
@shoyer
Copy link
Member Author

shoyer commented May 6, 2014

I think this has been mostly resolved by the identical and equals methods and the corresponding compat option for Dataset.merge and Dataset.concat.

@shoyer shoyer removed this from the 0.1 milestone May 6, 2014
@shoyer
Copy link
Member Author

shoyer commented Sep 4, 2014

I'm going to close this issue as fixed, but feel free to complain if you feel otherwise (particularly if you have ideas for how we should improve this).

The rule that we seem to have settled on is that xray will either drop all attributes if the result could be ambiguous, or, if there is a clear priority, it will only keep around attributes from the first object. The one firm rule is that xray does not do any checking of attributes for conflicts.

Unless compat == 'identical', there's no checking for conflicts: operations are either keep them all (mostly just subsetting/indexing) or drop them all. For some unary operations like mean, an option keep_attrs allows for switching the default from "drop" to "keep". Binary mathematical operations like * are always "drop".

In cases where there are two objects to combine but where the priority is clearer (e.g., in concat and merge), we'll preserve attributes from the first object and ignore the second. We use the same rule for in-place binary operations.

@shoyer shoyer closed this as completed Sep 4, 2014
@TomNicholas TomNicholas added the topic-metadata Relating to the handling of metadata (i.e. attrs and encoding) label Apr 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design topic-metadata Relating to the handling of metadata (i.e. attrs and encoding)
Projects
None yet
Development

No branches or pull requests

3 participants