Taking the mean of a long-spanning np.datetime64 array produces the wrong value #10019

Open
spencerkclark opened this issue Feb 2, 2025 · 1 comment

@spencerkclark (Member)

What happened?

As noted in #9977 (comment), taking the mean of an np.datetime64 array whose values span more than half of the resolution-dependent range produces an incorrect result, e.g.:

>>> import numpy as np; import xarray as xr
>>> array = np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]")
>>> times = xr.DataArray(array, dims=["time"])
>>> times.mean()
<xarray.DataArray ()> Size: 8B
array('2261-04-11T23:47:16.854775808', dtype='datetime64[ns]')

This is due to overflow when computing the timedeltas relative to the minimum datetime value of the array (code).
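For illustration, the overflow can be reproduced directly: the span between the two datetimes is roughly 582 years, about 1.84e19 nanoseconds, which exceeds the int64 range of roughly 9.22e18 nanoseconds, so the intermediate timedelta wraps around (a minimal sketch, not taken from xarray's code):

import numpy as np

array = np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]")
offset = array.min()

# The true span is ~582 years (~1.84e19 ns), which exceeds the int64 maximum
# (~9.22e18 ns), so the subtraction silently wraps to a large negative
# timedelta instead of the true difference:
deltas = array - offset
print(deltas)  # the second element is expected to be negative, not ~582 years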

What did you expect to happen?

This is the outcome we would expect to see in this example:

>>> times.mean()
<xarray.DataArray ()> Size: 8B
array('1969-01-01T00:00:00.000000000', dtype='datetime64[ns]')
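For reference, this midpoint can be cross-checked with the standard library's datetime arithmetic, which is not subject to nanosecond int64 overflow (an added sanity check, not part of the report itself):

from datetime import datetime

# Midpoint of the two dates, computed without nanosecond integer arithmetic:
lo, hi = datetime(1678, 1, 1), datetime(2260, 1, 1)
print(lo + (hi - lo) / 2)  # 1969-01-01 00:00:00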

Minimal Complete Verifiable Example

>>> import numpy as np; import xarray as xr
>>> array = np.array(["1678-01-01", "2260-01-01"], dtype="datetime64[ns]")
>>> times = xr.DataArray(array, dims=["time"])
>>> times.mean()
<xarray.DataArray ()> Size: 8B
array('2261-04-11T23:47:16.854775808', dtype='datetime64[ns]')

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.8 | packaged by conda-forge | (main, Dec  5 2024, 14:25:12) [Clang 18.1.8 ]
python-bits: 64
OS: Darwin
OS-release: 24.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.4
libnetcdf: 4.9.2

xarray: 2024.6.1.dev516+gf6716dc2.d20250112
pandas: 2.2.3
numpy: 2.0.2
scipy: 1.14.1
netCDF4: 1.7.2
pydap: None
h5netcdf: 1.4.1
h5py: 3.12.1
zarr: 2.18.4
cftime: 1.6.4
nc_time_axis: None
iris: 3.11.0
bottleneck: 1.4.2
dask: 2024.12.1
distributed: 2024.12.1
matplotlib: 3.10.0
cartopy: 0.24.0
seaborn: 0.13.2
numbagg: None
fsspec: 2024.12.0
cupy: None
pint: None
sparse: 0.15.4
flox: None
numpy_groupies: None
setuptools: 75.6.0
pip: 24.3.1
conda: None
pytest: 8.3.4
mypy: 1.13.0
IPython: 8.31.0
sphinx: 6.2.1

@spencerkclark added the bug label and removed the needs triage label on Feb 2, 2025
@spencerkclark (Member, Author) commented:

This is a bit of a tricky problem. Pandas handles it essentially by converting the np.datetime64 values to np.float64 values, computing the mean, and converting back to np.datetime64 (code). This happens to work for this example:

>>> import pandas as pd
>>> pd.Series(array).mean()
Timestamp('1969-01-01 00:00:00')

but it also fails on some other examples due to precision loss. For instance, taking the mean of the minimum representable Timestamp erroneously results in NaT:

>>> pd.Series(pd.Timestamp.min).mean()
NaT

or taking the mean of a time distant from the Unix epoch can change its value:

>>> time = pd.Timestamp.min + pd.to_timedelta(1, "D")
>>> pd.Series(time) == pd.Series(time).mean()
0    False
dtype: bool
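For what it's worth, the root cause of both failures appears to be the 53-bit significand of np.float64: pd.Timestamp.min is stored as int64 min + 1 nanoseconds, and rounding that value through float64 lands exactly on int64 min, which pandas interprets as the NaT sentinel (a small illustration, not pandas' actual code path):

import numpy as np

# pd.Timestamp.min is stored as int64 min + 1 nanoseconds.  float64 carries
# only a 53-bit significand, so the round trip through float rounds this to
# int64 min, the value pandas treats as NaT:
i = np.int64(-9223372036854775807)   # nanosecond value of pd.Timestamp.min
roundtrip = np.int64(np.float64(i))
print(i, roundtrip)                  # the round-tripped value differs by 1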

Somewhat inspired by numpy/numpy#12901 (comment), a way around the precision-loss issue would be to convert to np.longdouble instead of np.float64, but this would only be effective on some platforms, since np.longdouble has a platform-dependent precision (NumPy docs). For instance, replacing:

offset = _datetime_nanmin(array)
# From version 2025.01.2 xarray uses np.datetime64[unit], where unit
# is one of "s", "ms", "us", "ns".
# To not have to worry about the resolution, we just convert the output
# to "timedelta64" (without unit) and let the dtype of offset take precedence.
# This is fully backwards compatible with datetime64[ns].
return (
    _mean(
        datetime_to_numeric(array, offset), axis=axis, skipna=skipna, **kwargs
    ).astype("timedelta64")
    + offset
)

with

array_as_longdouble = where(isnat(array), np.nan, array.astype(np.longdouble))
return _mean(array_as_longdouble, axis=axis, skipna=skipna, **kwargs).astype(
    array.dtype
)

produces correct results on systems where np.longdouble is an 80-bit extended-precision float, but does no better than pandas on systems where np.longdouble is merely a 64-bit float.
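As an aside, one way to check which behavior a given platform would show is to inspect the significand size of np.longdouble (a quick diagnostic, not a proposed fix):

import numpy as np

# Significand bits of np.longdouble on this platform:
#   63  -> x86 80-bit extended precision (enough to hold int64 nanoseconds exactly)
#   52  -> np.longdouble is just a 64-bit double (no better than float64)
#   112 -> true quad precision (some non-x86 platforms)
print(np.finfo(np.longdouble).nmant)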
