Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix tz-aware DatetimeIndex + TimedeltaIndex (#17558) #17583

Closed
wants to merge 2 commits into from

Conversation

azjps
Copy link

@azjps azjps commented Sep 18, 2017

Fix a bug that was causing a tz-aware DatetimeIndex + TimedeltaIndex to raise an error. (An internal function was documented to return an integer view, but instead was trying to use the DatetimeIndex's dtype.) Fix another bug causing the sum of a tz-aware DatetimeIndex and a numpy array of timedeltas to incorrectly have timezone applied twice, by adding another case in DatetimeIndex.__add__.

Fix minor bug causing DatetimeIndex + TimedeltaIndex to raise an error,
and fix another bug causing the sum of a tz-aware DatetimeIndex and a
numpy array of timedeltas to incorrectly have timezone applied twice.
@gfyoung gfyoung added Bug Timezones Timezone data dtype labels Sep 19, 2017
# and DatetimeIndex + TimedeltaIndex work as expected
dti = pd.DatetimeIndex([pd.Timestamp("2017/01/01")])
dti = dti.tz_localize('US/Eastern')
expected = pd.DatetimeIndex([pd.Timestamp("2017/01/01 01:00")])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : add newline above this one.

td_np = np.array([np.timedelta64(1, 'h')], dtype="timedelta64[ns]")
tm.assert_index_equal(dti + td_np, expected)
tm.assert_index_equal(dti + td_np[0], expected)
tm.assert_index_equal(dti + TimedeltaIndex(td_np), expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just for-loop these checks instead? Also, let's be a little more explicit and write it this way:

actual = dti + ...
tm.assert_index_equal(actual, expected)

if is_timedelta64_dtype(other):
return self._add_delta(TimedeltaIndex(other))
elif is_integer_dtype(other):
return NotImplemented
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test that hits this branch?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but indirectly, without this branch it was breaking a couple of other tests that added a PeriodIndex and a np.array(dtype=int64). I didn't check what the behavior is there, but could add a more direct test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direct test is certainly appreciated. Go for it!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I looked into the cases getting triggered with PeriodIndex and unfortunately they're buried pretty deeply in internals (e.g. not obvious this would get called from any public API, at least any more so than the tests that were already reaching this). Also @jreback commented that this should only be internal facing, so I figure its better not to "advertise" this API in a test.

elif is_integer_dtype(other):
return NotImplemented
else:
raise TypeError("cannot add {typ1} and np.ndarray[{typ2}]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as with the NotImplemented. Let's make sure we have a test to hit this branch.

@@ -506,6 +506,7 @@ Indexing
- Bug in ``CategoricalIndex`` reindexing in which specified indices containing duplicates were not being respected (:issue:`17323`)
- Bug in intersection of ``RangeIndex`` with negative step (:issue:`17296`)
- Bug in ``IntervalIndex`` where performing a scalar lookup fails for included right endpoints of non-overlapping monotonic decreasing indexes (:issue:`16417`, :issue:`17271`)
- Bug in adding ``DatetimeIndex`` with a ``TimedeltaIndex`` or a numpy array with ``dtype="timedelta64"`` (:issue:`17558`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not correct; the bug is only with a numpy array of dtype=timedelta64[ns] (timedelta64 is not a valid dtype for pandas)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "or" was a bit confusing, but there are two separate bugs:

  • adding tz-aware DatetimeIndex with TimedeltaIndex
  • adding tz-aware DatetimeIndex with numpy array which is timedelta64-dtype-like

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how so? where does DTI + TDI tz-aware not work? your issues is w.r.t. np.array of timedeltas

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned it briefly in the issue ticket, although I neglected to post the full stack trace:

In [1]: import numpy as np, pandas as pd

In [2]: dti = pd.DatetimeIndex([pd.Timestamp("2017/01/01")]).tz_localize('US/Eastern')

In [3]: td_np = np.array([np.timedelta64(1, 'ns')]) 

In [4]: (dti + pd.TimedeltaIndex(td_np))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-20f28143c2bb> in <module>()
----> 1 (dti + pd.TimedeltaIndex(td_np))

/spare/local/azhu/venv-el6-20170811-stonelib/lib/python2.7/site-packages/pandas/tseries/base.pyc in __add__(self, other)
    617             from pandas.tseries.offsets import DateOffset
    618             if isinstance(other, TimedeltaIndex):
--> 619                 return self._add_delta(other)
    620             elif isinstance(self, TimedeltaIndex) and isinstance(other, Index):
    621                 if hasattr(other, '_add_delta'):

/spare/local/azhu/venv-el6-20170811-stonelib/lib/python2.7/site-packages/pandas/tseries/index.pyc in _add_delta(self, delta)
    787             new_values = self._add_delta_td(delta)
    788         elif isinstance(delta, TimedeltaIndex):
--> 789             new_values = self._add_delta_tdi(delta)
    790             # update name when delta is Index
    791             name = com._maybe_match_name(self, delta)

/spare/local/azhu/venv-el6-20170811-stonelib/lib/python2.7/site-packages/pandas/tseries/base.pyc in _add_delta_tdi(self, other)
    704             mask = (self._isnan) | (other._isnan)
    705             new_values[mask] = tslib.iNaT
--> 706         return new_values.view(self.dtype)
    707 
    708     def isin(self, values):

TypeError: data type not understood

@@ -651,6 +652,15 @@ def __add__(self, other):
raise TypeError("cannot add {typ1} and {typ2}"
.format(typ1=type(self).__name__,
typ2=type(other).__name__))
elif isinstance(other, np.ndarray):
if is_timedelta64_dtype(other):
return self._add_delta(TimedeltaIndex(other))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can only accept timedelta64 here, integer is not valid either fro datetimelikes in add

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in another comment above, the integer case is being used by internals for PeriodIndex. Example: pandas/tseries/offsets.py:_beg_apply_index#L468 adds base_period (PeriodIndex) + roll (ndarray[dtype=int64]).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no we don't allow this user facting; this is ONLY an internal operation

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but I don't quite understand what you're getting at -- are you saying there's still something to be done here? I just added the integer branch to prevent aforementioned internally used operation from breaking. (since it returns NotImplemented as it currently does, it will fall through to ufunc implementation.) I can add a code comment to explain the intent?

@@ -731,7 +741,7 @@ def _add_delta_tdi(self, other):
if self.hasnans or other.hasnans:
mask = (self._isnan) | (other._isnan)
new_values[mask] = iNaT
return new_values.view(self.dtype)
return new_values.view('i8')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes adding tz-aware DatetimeIndex and TimedeltaIndex, since self.dtype here is a Datetime64TZDtype which causes new_values.view(..) to raise an exception. Would you prefer another solution, say checking if is_datetime64tz_dtype(self.dtype) first?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes use is_datetime64tz_dtype(self.dtype) is the correct idiom. but you'll have to demonstrate this bug specifically

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posted stack trace that hits this line above. So I can change this to

if is_datetime64tz_dtype(self.dtype):
  return new_values.view('i8')
return new_values.view(self.dtype)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the not the right fix.

tz-aware need to be turned into local i8, added, then turned back into the original timezone via localization. normally we do add i8 directly, BUT you cannot do this with tz-aware, e.g. the days are shifted otherwise (imagine adding days), or don't work when you cross DST.

pandas.tseries.offsets already does all of this. I haven't stepped thru this code lately, but need to hit a similar path to that.

@@ -460,6 +460,20 @@ def test_add_dti_dti(self):
with pytest.raises(TypeError):
dti + dti_tz

def test_add_dti_ndarray(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test which raises for numpy arrays of incorrect type (e.g. not m8[ns]);

# GH 17558
# Check that tz-aware DatetimeIndex + np.array(dtype="timedelta64")
# and DatetimeIndex + TimedeltaIndex work as expected
dti = pd.DatetimeIndex([pd.Timestamp("2017/01/01")])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can move this entire test to datetimelike.py; it will hit datetime, period, timedelta that way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should prob move more of these add tests there as well, but can do that in another PR

@pep8speaks
Copy link

pep8speaks commented Sep 20, 2017

Hello @azjps! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on September 20, 2017 at 00:50 Hours UTC

idx = idx.tz_localize("US/Eastern")

expected = idx + np.timedelta64(1, 'D')
tm.assert_index_equal(idx, expected - np.timedelta64(1, 'D'))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to explain the awkwardness here: was trying to figure out how to demonstrate that expected actually has the properties we expect, e.g. has name = "x" and is a day ahead of idx

@@ -76,3 +76,27 @@ def test_union(self):
for case in cases:
result = first.union(case)
assert tm.equalContents(result, everything)

def test_add_dti_td(self):
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is almost the same as special case of the above test in datetimelike.py for DatetimeIndex, if that test looks good I will remove this one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments above you can simply re-define create_index to make this work (in another test class)

The internal implementation of _add_delta_td already correctly
adds a datetime index and an array of timedeltas, so use that
implementation instead of wrapping in a TimedeltaIndex (which
requires a bit more metadata like name to be passed around).

Move test to datetimelike to check addition of TimedeltaIndex
and np.array(timedelta64) with each of {DatetimeIndex,
TimedeltaIndex, PeriodIndex}. Fix the latter two to explicitly
support addition with a numpy array.

Clarify some comments such as the whatsnew and move tests to
test_datetimelike.
@codecov
Copy link

codecov bot commented Sep 20, 2017

Codecov Report

Merging #17583 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17583      +/-   ##
==========================================
+ Coverage   91.19%    91.2%   +<.01%     
==========================================
  Files         163      163              
  Lines       49626    49638      +12     
==========================================
+ Hits        45258    45270      +12     
  Misses       4368     4368
Flag Coverage Δ
#multiple 88.98% <100%> (+0.02%) ⬆️
#single 40.18% <0%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/datetimelike.py 96.94% <100%> (+0.05%) ⬆️
pandas/core/indexes/period.py 94.22% <100%> (+1.44%) ⬆️
pandas/core/indexes/datetimes.py 95.54% <100%> (ø) ⬆️
pandas/core/indexes/timedeltas.py 91.22% <100%> (+0.03%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 21a3800...92bf771. Read the comment docs.

elif isinstance(other, np.ndarray):
if is_timedelta64_dtype(other):
return self._add_delta(other)
elif is_integer_dtype(other):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of this, we want to specifically define add in PeriodIndex which handles this case and calls super otherwise.

@@ -731,6 +744,8 @@ def _add_delta_tdi(self, other):
if self.hasnans or other.hasnans:
mask = (self._isnan) | (other._isnan)
new_values[mask] = iNaT
if is_datetime64tz_dtype(self.dtype):
return new_values.view('i8')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, see my comments

@@ -801,6 +802,8 @@ def _add_delta(self, delta):

if isinstance(delta, (Tick, timedelta, np.timedelta64)):
new_values = self._add_delta_td(delta)
elif isinstance(delta, np.ndarray) and is_timedelta64_dtype(delta):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should simply be incorporated into the case right below

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by below, do I understand you correctly that we can consolidate as:

if isinstance(delta, (Tick, timedelta, np.timedelta64, np.ndarray)):
    new_values = self._add_delta_td(delta)
elif ..:

or are you suggesting to wrap delta into a TimedeltaIndex?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes
but i think we need to check the dtype as well if it's a numpy array (can be separate from this)

if isinstance(delta, (Tick, timedelta, np.timedelta64)):
new_values = self._add_delta_td(delta)
name = self.name
elif isinstance(delta, np.ndarray) and is_timedelta64_dtype(delta):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment

# and DatetimeIndex + TimedeltaIndex work as expected
idx = self.create_index()
idx.name = "x"
if isinstance(idx, pd.DatetimeIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use an if like this.

let's create a sub-class that specifically tests tz-aware (so .create_index()) works correctly. (add in datetimes/test_datetimelikes)

Copy link
Author

@azjps azjps Sep 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally agree with this, but unfortunately a tz-aware-based subclass fails a whole bunch of other common tests, so the scope of this PR would have to be expanded further (for which I'd probably need help). Here's one example to illustrate, basically DatetimeIndex constructor will read its input list into a (tz-naive) numpy array and "re-tz_localize" based on dtype's timezone:

    def test_repr_roundtrip(self):
    
        idx = self.create_index()
>       tm.assert_index_equal(eval(repr(idx)), idx)

>       raise AssertionError(msg)
E       AssertionError: Index are different
E       
E       Index values are different (100.0 %)
E       [left]:  DatetimeIndex(['2013-01-01 05:00:00-05:00', '2013-01-02 05:00:00-05:00',
E                      '2013-01-03 05:00:00-05:00', '2013-01-04 05:00:00-05:00',
E                      '2013-01-05 05:00:00-05:00'],
E                     dtype='datetime64[ns, US/Eastern]', freq='D')
E       [right]: DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
E                      '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
E                      '2013-01-05 00:00:00-05:00'],
E                     dtype='datetime64[ns, US/Eastern]', freq='D')

(Pdb) repr(idx)  # broke up output for readability
"DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',\n"
"               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',\n"
"               '2013-01-05 00:00:00-05:00'],\n"
"              dtype='datetime64[ns, US/Eastern]', freq='D')"
(Pdb) eval(repr(idx))
DatetimeIndex(['2013-01-01 05:00:00-05:00', '2013-01-02 05:00:00-05:00',
               '2013-01-03 05:00:00-05:00', '2013-01-04 05:00:00-05:00',
               '2013-01-05 05:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

Most of the other test failures can also be treated as variants of this, as these tests also construct a separate DatetimeIndex instance without tz information, although its possible to (say) modify those tests directly to tz_localize to UTC and then tz_convert to appropriate timezone if create_index().tz is available.

Also the tests related to union, intersection, symmetric_difference, etc try to do these operations against the underlying numpy .values which are tz-naive, leading to issues. We maybe could get around these by just disabling that part of the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think the right way to do this is to do another PR which separates out the DTI tests to 2 classes for tz-naive and tz-aware and modifies create_index as appropriate. If there are failing tests we can look closer. Want to start on that?

@@ -76,3 +76,27 @@ def test_union(self):
for case in cases:
result = first.union(case)
assert tm.equalContents(result, everything)

def test_add_dti_td(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments above you can simply re-define create_index to make this work (in another test class)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put on hold until testing is unwound.

@jreback
Copy link
Contributor

jreback commented Oct 28, 2017

can you rebase and move the note to 0.21.1

@jreback
Copy link
Contributor

jreback commented Nov 12, 2017

can you rebase / update. we have slightly better testing in place now FYI.

@jreback
Copy link
Contributor

jreback commented Dec 1, 2017

closing as stale

@jreback jreback closed this Dec 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: tz-aware DatetimeIndex + array(timedelta) gives incorrect result
4 participants