Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect resampling due to DST #5694

Closed
dalbani opened this issue Dec 13, 2013 · 8 comments
Closed

Incorrect resampling due to DST #5694

dalbani opened this issue Dec 13, 2013 · 8 comments
Labels
API Design Bug Duplicate Report Duplicate issue or pull request Resample resample method Timezones Timezone data dtype
Milestone

Comments

@dalbani
Copy link

dalbani commented Dec 13, 2013

related #5172

Given this DataFrame, with an index containing the moment when DST changes (October 27th in the case of the "Europe/Paris" timezone):

index = pandas.date_range('2013-09-30', '2013-11-02', freq = '30Min', tz = 'UTC').tz_convert('Europe/Paris')
column_a = pandas.np.random.random(index.size)
column_b = pandas.np.random.random(index.size)
df = pandas.DataFrame({ "a": column_a, "b": column_b }, index = index)

Let's say I want to find the "min" and "max" values for each month:

df.resample("MS", how = { "a": "min", "b": "max" })

Here's the incorrect result:

                                  a         b
2013-09-01 00:00:00+02:00  0.015856  0.979541
2013-10-01 00:00:00+02:00  0.002039  0.999960
2013-10-31 23:00:00+01:00       NaN       NaN

Same problem with a "W-MON" frequency:

                                  a         b
2013-09-30 00:00:00+02:00  0.015856  0.979541
2013-10-07 00:00:00+02:00  0.007961  0.999734
2013-10-14 00:00:00+02:00  0.002614  0.993354
2013-10-21 00:00:00+02:00  0.005655  0.999960
2013-10-27 23:00:00+01:00       NaN       NaN
2013-11-03 23:00:00+01:00       NaN       NaN

Whereas it works fine with a "D" frequency.

                                  a         b
...
2013-10-26 00:00:00+02:00  0.004645  0.983281
2013-10-27 00:00:00+02:00  0.030151  0.986827
2013-10-28 00:00:00+01:00  0.015891  0.981455
2013-10-29 00:00:00+01:00  0.024176  0.999306
...

Should I resample only the "a" column, it also works fine:

df["a"].resample("MS", how = "min")
2013-09-01 00:00:00+02:00    0.015856
2013-10-01 00:00:00+02:00    0.002039
2013-11-01 00:00:00+01:00    0.000747
Freq: MS, dtype: float64

Tested with latest pandas from GIT master.

@jreback
Copy link
Contributor

jreback commented Dec 13, 2013

looks like a dupe of #5172, yes?

@dalbani
Copy link
Author

dalbani commented Dec 13, 2013

Compared to #5172, in my case resampling by "D" works and I don't get any AmbiguousTimeError.
Don't know if that matters.

@jreback
Copy link
Contributor

jreback commented Dec 13, 2013

ok....you have nice examples anyhow...will leave them both open/linked then

@dalbani
Copy link
Author

dalbani commented Dec 14, 2013

Further differences between various resampling:

Whereas "W-MON" and "MS" doesn't work, it does with "2W-MON" and "2MS":

df.resample("2W-MON", how = { "a": "min", "b": "max" })
                                  a         b
2013-09-30 00:00:00+02:00  0.015856  0.979541
2013-10-14 00:00:00+02:00  0.002614  0.999734
2013-10-28 00:00:00+01:00  0.003269  0.999960
2013-11-11 00:00:00+01:00  0.000747  0.999306
df.resample("2MS", how = { "a": "min", "b": "max" })
                                  a         b
2013-09-01 00:00:00+02:00  0.002039  0.999960
2013-11-01 00:00:00+01:00  0.000747  0.986369

And in fact, "W-MON" and "MS" resampling can work if, strangely enough, you use the "count" operator:

df.resample("W-MON", how = { "a": "min", "b": "count" })
                                  a    b
2013-09-30 00:00:00+02:00  0.015856   44
2013-10-07 00:00:00+02:00  0.007961  336
2013-10-14 00:00:00+02:00  0.002614  336
2013-10-21 00:00:00+02:00  0.005655  336
2013-10-28 00:00:00+01:00  0.003269  338
2013-11-04 00:00:00+01:00  0.000747  195
df.resample("MS", how = { "a": "min", "b": "count" })
                                  a     b
2013-09-01 00:00:00+02:00  0.015856    44
2013-10-01 00:00:00+02:00  0.002039  1490
2013-11-01 00:00:00+01:00  0.000747    51

So I'm beginning to doubt that it has something to do with DST.
In fact, I've have already encountered this issue recently. See this thread on the mailing-list: https://groups.google.com/d/msg/pydata/4Xoas_m4pHA/OE-DrXBVg-QJ

@rockg
Copy link
Contributor

rockg commented Dec 19, 2013

So, I'm confident that some of this will be fixed by or is at least closely related to #5175. Basically, the data is right until it creates a unioned index via generate_range. At this point start and end have the correct time, but in making the MS range between start and end it ignores DST which is what is being fixed. Also, with just one group by column, no unioning is done and so the resulting index is fine. I actually think that a lot of work can be shortcut here in core.index._union_indexes by perhaps checking if the indices are the same. Perhaps that is too expensive, but if done smartly I think it would save some unnecessary work.

@rockg
Copy link
Contributor

rockg commented Dec 19, 2013

And D works fine because it uses the native timedelta which accounts for DST. Count also goes through a different code path in tseries.index.union_many (offset is missing from the count DatetimeIndex and so the indices cannot be fast unioned and so go through regular Index.union).

@rockg
Copy link
Contributor

rockg commented Dec 28, 2013

This characterize what happens. When start/end have a timezone the DatetimeIndex tries to generate the range rather than using the cached_range. _generate_regular_range should work properly with #5175 integrated. Also, I came across another strange error with timezone comparison. The _utcoffset of the Timestamps tzinfo and one created with pytz.timezone are different hence the __repr__ is different.

            if _use_cached_range(offset, _normalized, start, end):
                index = cls._cached_range(start, end, periods=periods,
                                          offset=offset, name=name)
            else:
                index = _generate_regular_range(start, end, periods, offset)

Some test cases:

# Wrong
import pandas as pd
pd.date_range(pd.Timestamp('2013-09-30', tz='Europe/Paris'), pd.Timestamp('2013-11-05', tz='Europe/Paris'), freq='MS')
Out[1]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-10-01 00:00:00, 2013-10-31 23:00:00]
Length: 2, Freq: MS, Timezone: Europe/Paris

#TZ issue
pd.date_range(pd.Timestamp('2013-09-30', tz='Europe/Paris'), pd.Timestamp('2013-11-05', tz='Europe/Paris'), freq='MS', tz='Europe/Paris')
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2821, in run_code
    exec code_obj in self.user_global_ns, self.user_ns
  File "<ipython-input-1-22f04ffd6a85>", line 1, in <module>
    pd.date_range(pd.Timestamp('2013-09-30', tz='Europe/Paris'), pd.Timestamp('2013-11-05', tz='Europe/Paris'), freq='MS', tz='Europe/Paris')
  File "/Users/rockg/Projects/pandas_temp/pandas/tseries/index.py", line 1789, in date_range
    closed=closed)
  File "/Users/rockg/Projects/pandas_temp/pandas/tseries/index.py", line 193, in __new__
    infer_dst=infer_dst)
  File "/Users/rockg/Projects/pandas_temp/pandas/tseries/index.py", line 339, in _generate
    raise AssertionError("Inferred time zone not equal to passed "
AssertionError: Inferred time zone not equal to passed time zone

#Right
pd.date_range(pd.Timestamp('2013-09-30'), pd.Timestamp('2013-11-05'), freq='MS', tz='Europe/Paris')
Out[1]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-10-01 00:00:00, 2013-11-01 00:00:00]
Length: 2, Freq: MS, Timezone: Europe/Paris

Here is the timezone issue:

ts = pd.Timestamp('2013-09-30', tz='Europe/Paris')
ts.tzinfo._utcoffset
Out[1]: datetime.timedelta(0, 7200)
tz = pytz.timezone('Europe/Paris')
tz._utcoffset
Out[1]: datetime.timedelta(0, 540)
tz == ts.tzinfo
Out[1]: False

@jreback
Copy link
Contributor

jreback commented Nov 6, 2014

closing as dupe of #5172 for now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Bug Duplicate Report Duplicate issue or pull request Resample resample method Timezones Timezone data dtype
Projects
None yet
Development

No branches or pull requests

3 participants