Taking first row from each group in groupby sometimes strips tzinfo #10668

Closed
louispotok opened this issue Jul 24, 2015 · 8 comments
Labels: Bug, Duplicate Report, Groupby, Testing, Timezones

Comments

@louispotok
Contributor

xref #12898 (same fix)

(cf. http://stackoverflow.com/questions/31617084/how-to-have-groupby-first-not-remove-timezone-info-from-datetime-columns)
Take a dataframe with a column of tz-aware datetime.datetime objects, group it by a different column, and return the first row from each group. Some ways of doing this leave the datetimes as they are, but at least two convert them to tz-naive pandas Timestamp objects.

In [1]: import pandas as pd

In [2]: import datetime

In [3]: import pytz

In [4]: dates = [datetime.datetime(2015,1,i,tzinfo=pytz.timezone('US/Pacific')) for i in range(1,5)]

In [5]: df = pd.DataFrame({'A': ['a','b']*2,'B': dates})

In [6]: df
Out[6]: 
   A                          B
0  a  2015-01-01 00:00:00-08:00
1  b  2015-01-02 00:00:00-08:00
2  a  2015-01-03 00:00:00-08:00
3  b  2015-01-04 00:00:00-08:00

In [7]: grouped = df.groupby('A') 

In [8]: grouped.nth(0) #B stays a datetime.datetime with timezone info
Out[8]: 
                           B
A                           
a  2015-01-01 00:00:00-08:00
b  2015-01-02 00:00:00-08:00

In [9]: grouped.head(1) #B stays a datetime.datetime with timezone info
Out[9]: 
                           B
0  2015-01-01 00:00:00-08:00
1  2015-01-02 00:00:00-08:00

In [10]: grouped.first() #B becomes a naive pd.Timestamp in UTC
Out[10]: 
                    B
A                    
a 2015-01-01 08:00:00
b 2015-01-02 08:00:00

And apparently grouped.apply(lambda x: x.iloc[0]) does the same as .first().

@louispotok
Contributor Author

And according to this comment the same thing happens if you replace cell [4] with the more pandonic line:

dates = pd.date_range('2015-01-01',periods=4,tz='US/Pacific') 

@jreback
Contributor

jreback commented Jul 24, 2015

It's a bug. I thought we had an issue for this already, but I can't seem to find it.

@jreback jreback added Bug Groupby Timezones Timezone data dtype labels Jul 24, 2015
@jreback jreback added this to the Next Major Release milestone Jul 24, 2015
@cfperez
Contributor

cfperez commented Oct 26, 2015

I can also confirm the bug for grouped.last() and grouped.apply(lambda x: x.iloc[-1]).

But it does work correctly for grouped.agg(lambda x: x.iloc[-1]).

@jreback
Contributor

jreback commented Oct 26, 2015

This is all ok on master, so all this issue needs is probably a few confirming tests.

@cfperez, @louispotok interested in a pull-request?

In [20]: grouped.nth(0)
Out[20]: 
                          B
A                          
a 2014-12-31 23:53:00-08:00
b 2015-01-01 23:53:00-08:00

In [21]: grouped.head(1)
Out[21]: 
                          B
0 2014-12-31 23:53:00-08:00
1 2015-01-01 23:53:00-08:00

In [22]: grouped.first()
Out[22]: 
                          B
A                          
a 2015-01-01 07:53:00-08:00
b 2015-01-02 07:53:00-08:00

In [23]: grouped.apply(lambda x: x.iloc[0]) 
Out[23]: 
A
a   2014-12-31 23:53:00-08:00
b   2015-01-01 23:53:00-08:00
dtype: datetime64[ns, US/Pacific]

In [24]: grouped.first().dtypes
Out[24]: 
B    datetime64[ns, US/Pacific]
dtype: object

@jreback jreback added Testing pandas testing functions or related to the test suite and removed Bug labels Oct 26, 2015
@jreback jreback modified the milestones: 0.17.1, Next Major Release Oct 26, 2015
@cfperez
Contributor

cfperez commented Oct 26, 2015

@jreback I'm working off the latest commit, and the problem now is that the timestamp is wrong (exactly 8 hours off, reflecting the timezone difference) even though the timezone is preserved. Note that nth(0) and first() return different times for the same date and timezone.

Also, why don't these two methods return the same indices? In your example, nth(0) and head(1) agree, but first() does not.

I can add tests but still think this is a bug (and unsure how deep the rabbit hole goes.)
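
For reference, the confirming tests discussed above could look roughly like the sketch below. It uses the `pd.date_range` variant of the reproduction from earlier in the thread; `check_tz_preserved` is a hypothetical helper name, not part of pandas or its test suite. On a pandas version where this bug is fixed the assertions are expected to pass; on an affected version they should fail, either because the timezone is dropped or because the values are shifted.

```python
import pandas as pd

# Tz-aware column built with date_range, as suggested earlier in the thread.
dates = pd.date_range("2015-01-01", periods=4, tz="US/Pacific")
df = pd.DataFrame({"A": ["a", "b"] * 2, "B": dates})
grouped = df.groupby("A")


def check_tz_preserved(result_column, expected):
    """Hypothetical helper: the result must stay tz-aware and keep the same instants."""
    assert result_column.dt.tz is not None
    assert list(result_column) == list(expected)


# Group 'a' holds rows 0 and 2, group 'b' holds rows 1 and 3.
first = grouped.first()["B"]
last = grouped.last()["B"]

check_tz_preserved(first, dates[:2])  # first row of each group
check_tz_preserved(last, dates[2:])   # last row of each group
```

Comparing against the original `dates` values, rather than only checking the dtype, also catches the case where the timezone survives but the time itself is shifted.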

@jreback
Contributor

jreback commented Oct 26, 2015

Ahh, I wasn't paying enough attention.
Yeah, this got localized twice, I think.

OK, will mark it as a bug again then.
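
To illustrate what "localized twice" could mean here, the sketch below uses only the standard library and a fixed UTC-8 offset as a stand-in for US/Pacific in January (an assumption; real zones also handle DST). It is a guess at the mechanism, not the actual pandas code path: if a value is converted to naive UTC wall time and that wall time is then stamped as Pacific instead of being converted from UTC, the result keeps the timezone but lands exactly 8 hours late.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-in for US/Pacific in winter (assumption: no DST handling).
PST = timezone(timedelta(hours=-8), "PST")

original = datetime(2014, 12, 31, 23, 53, tzinfo=PST)

# Internally, tz-aware values are often stored as naive UTC wall time.
naive_utc = original.astimezone(timezone.utc).replace(tzinfo=None)

# Correct round trip: interpret the stored value as UTC, then convert to Pacific.
correct = naive_utc.replace(tzinfo=timezone.utc).astimezone(PST)

# Faulty round trip: treat the UTC wall time as if it were already Pacific wall time.
wrong = naive_utc.replace(tzinfo=PST)

print(correct)          # 2014-12-31 23:53:00-08:00
print(wrong)            # 2015-01-01 07:53:00-08:00
print(wrong - correct)  # 8:00:00
```

The faulty value matches the shape of the `grouped.first()` output shown earlier: the offset is still -08:00, but the time is 8 hours off.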

@sinhrks
Member

sinhrks commented Apr 6, 2016

Dupe of #12716.

@jreback jreback removed this from the 0.18.1 milestone Apr 26, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 Aug 21, 2016
@jreback jreback added the Duplicate Report Duplicate issue or pull request label Feb 16, 2017
@jreback
Contributor

jreback commented Feb 16, 2017

There's a better example, I think, in #15426.
