Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updating row data within apply from other rows doesn't work #8662

Closed
lamakaha opened this issue Oct 28, 2014 · 7 comments
Closed

updating row data within apply from other rows doesn't work #8662

lamakaha opened this issue Oct 28, 2014 · 7 comments

Comments

@lamakaha
Copy link

I'm grouping rows by column investor id. within apply, when a group has more than 1 row, i'm copying a value in the second row in the group into the a column of the first row in that group. I'm using groupby with apply but the assembled result is incorrect - all rows except row 1 show the same investor B - seems like a bug?

In [1]: import pandas as pd
In [2]: import numpy as np

In [3]: df=pd.DataFrame({'InvestorID':[6,6,17,17,19,19,30,40],
   ...:                  'Investor':['AAA','AAA','BBB','BBB','CCC','CCC','EEE','FFF'],
   ...:                  'CurrentPeriodEnd':['7/31/2014','5/31/2014','7/31/2014','5/31/2014','7/31/2014','5/31/2014','7/31/2014','7/31/2014'],
   ...:                  'PriorPeriodEnd':np.nan*8})

In [4]: df
Out[4]: 
  CurrentPeriodEnd Investor  InvestorID  PriorPeriodEnd
0        7/31/2014      AAA           6             NaN
1        5/31/2014      AAA           6             NaN
2        7/31/2014      BBB          17             NaN
3        5/31/2014      BBB          17             NaN
4        7/31/2014      CCC          19             NaN
5        5/31/2014      CCC          19             NaN
6        7/31/2014      EEE          30             NaN
7        7/31/2014      FFF          40             NaN

this function updates the first positional row in a group with values from the second row
where a group has 2 rows or more

In [5]: def f(x):
   ...:     if len(x) >1:
   ...:         x.iloc[0,x.columns.get_loc('PriorPeriodEnd')]=x.iloc[1,x.columns.get_loc('CurrentPeriodEnd')] 
   ...:     return x.iloc[0]

In [6]: df.groupby(['InvestorID']).apply(f).reset_index(drop=True)

this is not an expected result - the investor column should not have repeating BBB values

Out[6]: 
  CurrentPeriodEnd Investor  InvestorID PriorPeriodEnd
0        7/31/2014      AAA           6      5/31/2014
1        7/31/2014      BBB          17      5/31/2014
2        7/31/2014      BBB          19      5/31/2014
3        7/31/2014      BBB          30      5/31/2014
4        7/31/2014      BBB          40      5/31/2014

the function below copies the first row first before making updates and it works as expected

In [7]: def f(x):
   ...:     top_row=pd.Series(x.iloc[0])
   ...:     if len(x) >1:
   ...:         top_row.iloc[x.columns.get_loc('PriorPeriodEnd')]=x.iloc[1,x.columns.get_loc('CurrentPeriodEnd')] 
   ...:     return top_row

this is what the result should look like

In [8]: df.groupby(['InvestorID']).apply(f).reset_index(drop=True)
Out[8]: 
  CurrentPeriodEnd Investor  InvestorID PriorPeriodEnd
0        7/31/2014      AAA           6      5/31/2014
1        7/31/2014      BBB          17      5/31/2014
2        7/31/2014      CCC          19      5/31/2014
3        7/31/2014      EEE          30            NaN
4        7/31/2014      FFF          40            NaN
@jreback
Copy link
Contributor

jreback commented Oct 28, 2014

Not really sure what you are trying to do. This will get you started.

In [52]: dfs = df.sort(['InvestorID','CurrentPeriodEnd'],ascending=[1, 0])

In [53]: dfs
Out[53]: 
  CurrentPeriodEnd Investor  InvestorID  PriorPeriodEnd
0        7/31/2014      AAA           6             NaN
1        5/31/2014      AAA           6             NaN
2        7/31/2014      BBB          17             NaN
3        5/31/2014      BBB          17             NaN
4        7/31/2014      CCC          19             NaN
5        5/31/2014      CCC          19             NaN
6        7/31/2014      EEE          30             NaN
7        7/31/2014      FFF          40             NaN

In [54]: indexer = dfs.groupby(['Investor']).head(1).index

In [55]: dfs.groupby(['Investor']).head(1)      
Out[55]: 
  CurrentPeriodEnd Investor  InvestorID  PriorPeriodEnd
0        7/31/2014      AAA           6             NaN
2        7/31/2014      BBB          17             NaN
4        7/31/2014      CCC          19             NaN
6        7/31/2014      EEE          30             NaN
7        7/31/2014      FFF          40             NaN

In [56]: indexer = dfs.groupby(['Investor']).head(1).index

In [57]: dfs.loc[indexer,'PriorPeriodEnd'] = dfs.CurrentPeriodEnd

In [58]: dfs
Out[58]: 
  CurrentPeriodEnd Investor  InvestorID PriorPeriodEnd
0        7/31/2014      AAA           6      7/31/2014
1        5/31/2014      AAA           6            NaN
2        7/31/2014      BBB          17      7/31/2014
3        5/31/2014      BBB          17            NaN
4        7/31/2014      CCC          19      7/31/2014
5        5/31/2014      CCC          19            NaN
6        7/31/2014      EEE          30      7/31/2014
7        7/31/2014      FFF          40      7/31/2014

Your function doesn't work because you have duplicates in x so the indexing of the column returns returns multiple values.

@lamakaha
Copy link
Author

when i change my function not to update back the first row but rather a copy of the first row it works as expected but shouldn't it work even when i update the first row directly in apply?

def f(x):
top_row=pd.Series(x.iloc[0])
if len(x) >1:
top_row.iloc[x.columns.get_loc('PriorPeriodEnd')]=x.iloc[1,x.columns.get_loc('CurrentPeriodEnd')]
return top_row

CurrentPeriodEnd Investor InvestorID PriorPeriodEnd
0 7/31/2014 AAA 6 5/31/2014
1 7/31/2014 BBB 17 5/31/2014
2 7/31/2014 CCC 19 5/31/2014
3 7/31/2014 EEE 30 NaN
4 7/31/2014 FFF 40 NaN

@jreback
Copy link
Contributor

jreback commented Oct 28, 2014

their is some detection logic to see if you have mutated the return from groupby-apply. in general this is not a good (and is completely a non-performant soln). This might be an edge case, but since you have duplicates and you are not handling them this is not a bug.

You don't want to mutate things, its bad practice and can lead to very subtle issues.

@jreback
Copy link
Contributor

jreback commented Oct 28, 2014

tell you what. I'll mark it as a bug. can you update the top of the issue with 2 examples both copy-pastable that should yield the same results?

and if you want to look at it would be great! (those 2 examples are basically the tests). I won't have time for a while.

@jreback jreback added this to the 0.16.0 milestone Oct 28, 2014
@lamakaha
Copy link
Author

thanks for looking into this. I have updated the top comment with both versions of the function - one which mutates and doesn't work and the other one which copies and works. I still don't understand where the duplicates come into play? Once the dataframe is split into groups by investor id, I don't have any duplicates in x - neither indexes nor columns. I'm using positional reference iloc to set values of the first row with values of the second row. Also I debugged this code and it picks up singular values (in my case dates) both from the assignor and assignee (once the assignment of the value is made). I also used the print statements within the apply function and everything is printing as expected. If you could clarify or demonstrate where I have duplicates in x, it would be great

@jreback
Copy link
Contributor

jreback commented Oct 29, 2014

on 2nd thought, I think this is almost the same bug as #8467 (but in a different case). So should be straightforward to fix (I think).

want to give a shot?

basically step thru the code with the working one, and note what is happening (in the reassembly portion).
Then do the same for the non-working. It is probably something simple. If its not or you get stuck, lmk.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@rhshadrach
Copy link
Member

Closing in favor of #12653

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants