Incorrect resampling due to DST #5694
Comments
looks like a dupe of #5172, yes?
Compared to #5172, in my case resampling by "D" works and I don't get any AmbiguousTimeError.
ok....you have nice examples anyhow...will leave them both open/linked then
Further differences between the various resampling frequencies: whereas "W-MON" and "MS" don't work, "2W-MON" and "2MS" do:
And in fact, "W-MON" and "MS" resampling can work if, strangely enough, you use the "count" operator:
So I'm beginning to doubt that it has something to do with DST.
So, I'm confident that some of this will be fixed by, or is at least closely related to, #5175. Basically, the data is right until a unioned index is created via generate_range. At that point start and end have the correct time, but building the MS range between start and end ignores DST, which is what is being fixed. Also, with just one group-by column, no unioning is done, so the resulting index is fine. I actually think a lot of work could be short-circuited here in core.index._union_indexes, perhaps by checking whether the indices are the same. That check may be too expensive, but done smartly I think it would save some unnecessary work.
And D works fine because it uses the native timedelta, which accounts for DST. Count also goes through a different code path in tseries.index.union_many: the offset is missing from the count DatetimeIndex, so the indices cannot be fast-unioned and go through the regular Index.union instead.
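To make the "ignores DST" point concrete, here is a minimal standard-library sketch (not pandas internals) of what happens when a month-start label is built with the UTC offset of the range start instead of being localized on its own. The year 2013 and the variable names are assumptions chosen for illustration; only the "Europe/Paris" zone and the October 27th transition come from the report.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PARIS = ZoneInfo("Europe/Paris")

# Range start: midnight on 1 October 2013, Paris local time (CEST, UTC+2).
start = datetime(2013, 10, 1, tzinfo=PARIS)

# DST-unaware label: reuse the October offset for the November boundary.
october_offset = timezone(start.utcoffset())
unaware_nov = datetime(2013, 11, 1, tzinfo=october_offset)

# DST-aware label: localize the November wall-clock time in the zone itself.
aware_nov = datetime(2013, 11, 1, tzinfo=PARIS)

# Subtracting datetimes with different tzinfo objects compares instants in UTC.
shift = aware_nov - unaware_nov

print(unaware_nov.astimezone(PARIS))
print(aware_nov)
print(shift)
```

The unaware label lands one hour before local midnight on 1 November, so an index built this way no longer lines up with one built from correctly localized month starts.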
This characterizes what happens. When start/end have a timezone, the DatetimeIndex tries to generate the range rather than using the cached_range.
Some test cases:
Here is the timezone issue:
closing as dupe of #5172 for now
related #5172
Given this DataFrame, with an index containing the moment when DST changes (October 27th in the case of the "Europe/Paris" timezone):
Let's say I want to find the "min" and "max" values for each month:
Here's the incorrect result:
Same problem with a "W-MON" frequency:
Whereas it works fine with a "D" frequency.
Should I resample only the "a" column, it also works fine:
Tested with the latest pandas from Git master.
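For readers without the original snippets, the following is a hypothetical reconstruction of the kind of setup described above, not the reporter's actual code or data. The hourly index, the year 2013, the second column name "b", and the integer values are all assumptions; it also uses the current `resample(...).agg(...)` spelling rather than whatever call the report used.

```python
import pandas as pd

# Hourly index in Europe/Paris spanning the DST change on 27 October
# (the year 2013 is an assumption).
idx = pd.date_range(
    "2013-10-01",
    "2013-11-30 23:00",
    freq=pd.Timedelta(hours=1),
    tz="Europe/Paris",
)

# Made-up data; column "b" is hypothetical, the report only names "a".
df = pd.DataFrame({"a": range(len(idx)), "b": range(len(idx))}, index=idx)

# "min" and "max" per month for every column.
monthly = df.resample("MS").agg(["min", "max"])

# Same aggregation restricted to column "a".
monthly_a = df["a"].resample("MS").agg(["min", "max"])

print(monthly)
print(monthly_a)
```

On a pandas version without the bug, both month labels should sit at local midnight and the two results should agree for column "a"; a label shifted by an hour, or missing values in the aggregated columns, would point to the behaviour reported here.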