updating row data within apply from other rows doesn't work #8662

lamakaha · 2014-10-28T19:06:14Z

I'm grouping rows by column investor id. within apply, when a group has more than 1 row, i'm copying a value in the second row in the group into the a column of the first row in that group. I'm using groupby with apply but the assembled result is incorrect - all rows except row 1 show the same investor B - seems like a bug?

In [1]: import pandas as pd
In [2]: import numpy as np

In [3]: df=pd.DataFrame({'InvestorID':[6,6,17,17,19,19,30,40],
   ...:                  'Investor':['AAA','AAA','BBB','BBB','CCC','CCC','EEE','FFF'],
   ...:                  'CurrentPeriodEnd':['7/31/2014','5/31/2014','7/31/2014','5/31/2014','7/31/2014','5/31/2014','7/31/2014','7/31/2014'],
   ...:                  'PriorPeriodEnd':np.nan*8})

In [4]: df
Out[4]: 
  CurrentPeriodEnd Investor  InvestorID  PriorPeriodEnd
0        7/31/2014      AAA           6             NaN
1        5/31/2014      AAA           6             NaN
2        7/31/2014      BBB          17             NaN
3        5/31/2014      BBB          17             NaN
4        7/31/2014      CCC          19             NaN
5        5/31/2014      CCC          19             NaN
6        7/31/2014      EEE          30             NaN
7        7/31/2014      FFF          40             NaN

this function updates the first positional row in a group with values from the second row
where a group has 2 rows or more

In [5]: def f(x):
   ...:     if len(x) >1:
   ...:         x.iloc[0,x.columns.get_loc('PriorPeriodEnd')]=x.iloc[1,x.columns.get_loc('CurrentPeriodEnd')] 
   ...:     return x.iloc[0]

In [6]: df.groupby(['InvestorID']).apply(f).reset_index(drop=True)

this is not an expected result - the investor column should not have repeating BBB values

Out[6]: 
  CurrentPeriodEnd Investor  InvestorID PriorPeriodEnd
0        7/31/2014      AAA           6      5/31/2014
1        7/31/2014      BBB          17      5/31/2014
2        7/31/2014      BBB          19      5/31/2014
3        7/31/2014      BBB          30      5/31/2014
4        7/31/2014      BBB          40      5/31/2014

the function below copies the first row first before making updates and it works as expected

In [7]: def f(x):
   ...:     top_row=pd.Series(x.iloc[0])
   ...:     if len(x) >1:
   ...:         top_row.iloc[x.columns.get_loc('PriorPeriodEnd')]=x.iloc[1,x.columns.get_loc('CurrentPeriodEnd')] 
   ...:     return top_row

this is what the result should look like

In [8]: df.groupby(['InvestorID']).apply(f).reset_index(drop=True)
Out[8]: 
  CurrentPeriodEnd Investor  InvestorID PriorPeriodEnd
0        7/31/2014      AAA           6      5/31/2014
1        7/31/2014      BBB          17      5/31/2014
2        7/31/2014      CCC          19      5/31/2014
3        7/31/2014      EEE          30            NaN
4        7/31/2014      FFF          40            NaN

The text was updated successfully, but these errors were encountered:

jreback · 2014-10-28T19:38:53Z

Not really sure what you are trying to do. This will get you started.

In [52]: dfs = df.sort(['InvestorID','CurrentPeriodEnd'],ascending=[1, 0])

In [53]: dfs
Out[53]: 
  CurrentPeriodEnd Investor  InvestorID  PriorPeriodEnd
0        7/31/2014      AAA           6             NaN
1        5/31/2014      AAA           6             NaN
2        7/31/2014      BBB          17             NaN
3        5/31/2014      BBB          17             NaN
4        7/31/2014      CCC          19             NaN
5        5/31/2014      CCC          19             NaN
6        7/31/2014      EEE          30             NaN
7        7/31/2014      FFF          40             NaN

In [54]: indexer = dfs.groupby(['Investor']).head(1).index

In [55]: dfs.groupby(['Investor']).head(1)      
Out[55]: 
  CurrentPeriodEnd Investor  InvestorID  PriorPeriodEnd
0        7/31/2014      AAA           6             NaN
2        7/31/2014      BBB          17             NaN
4        7/31/2014      CCC          19             NaN
6        7/31/2014      EEE          30             NaN
7        7/31/2014      FFF          40             NaN

In [56]: indexer = dfs.groupby(['Investor']).head(1).index

In [57]: dfs.loc[indexer,'PriorPeriodEnd'] = dfs.CurrentPeriodEnd

In [58]: dfs
Out[58]: 
  CurrentPeriodEnd Investor  InvestorID PriorPeriodEnd
0        7/31/2014      AAA           6      7/31/2014
1        5/31/2014      AAA           6            NaN
2        7/31/2014      BBB          17      7/31/2014
3        5/31/2014      BBB          17            NaN
4        7/31/2014      CCC          19      7/31/2014
5        5/31/2014      CCC          19            NaN
6        7/31/2014      EEE          30      7/31/2014
7        7/31/2014      FFF          40      7/31/2014

Your function doesn't work because you have duplicates in x so the indexing of the column returns returns multiple values.

lamakaha · 2014-10-28T19:41:03Z

when i change my function not to update back the first row but rather a copy of the first row it works as expected but shouldn't it work even when i update the first row directly in apply?

def f(x):
top_row=pd.Series(x.iloc[0])
if len(x) >1:
top_row.iloc[x.columns.get_loc('PriorPeriodEnd')]=x.iloc[1,x.columns.get_loc('CurrentPeriodEnd')]
return top_row

CurrentPeriodEnd Investor InvestorID PriorPeriodEnd
0 7/31/2014 AAA 6 5/31/2014
1 7/31/2014 BBB 17 5/31/2014
2 7/31/2014 CCC 19 5/31/2014
3 7/31/2014 EEE 30 NaN
4 7/31/2014 FFF 40 NaN

jreback · 2014-10-28T19:45:43Z

their is some detection logic to see if you have mutated the return from groupby-apply. in general this is not a good (and is completely a non-performant soln). This might be an edge case, but since you have duplicates and you are not handling them this is not a bug.

You don't want to mutate things, its bad practice and can lead to very subtle issues.

jreback · 2014-10-28T19:56:53Z

tell you what. I'll mark it as a bug. can you update the top of the issue with 2 examples both copy-pastable that should yield the same results?

and if you want to look at it would be great! (those 2 examples are basically the tests). I won't have time for a while.

lamakaha · 2014-10-29T01:26:55Z

thanks for looking into this. I have updated the top comment with both versions of the function - one which mutates and doesn't work and the other one which copies and works. I still don't understand where the duplicates come into play? Once the dataframe is split into groups by investor id, I don't have any duplicates in x - neither indexes nor columns. I'm using positional reference iloc to set values of the first row with values of the second row. Also I debugged this code and it picks up singular values (in my case dates) both from the assignor and assignee (once the assignment of the value is made). I also used the print statements within the apply function and everything is printing as expected. If you could clarify or demonstrate where I have duplicates in x, it would be great

jreback · 2014-10-29T01:35:28Z

on 2nd thought, I think this is almost the same bug as #8467 (but in a different case). So should be straightforward to fix (I think).

want to give a shot?

basically step thru the code with the working one, and note what is happening (in the reassembly portion).
Then do the same for the non-working. It is probably something simple. If its not or you get stuck, lmk.

rhshadrach · 2020-07-12T12:31:23Z

Closing in favor of #12653

jreback added the Usage Question label Oct 28, 2014

jreback added Bug Groupby labels Oct 28, 2014

jreback added this to the 0.16.0 milestone Oct 28, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback mentioned this issue Mar 17, 2016

API: ban mutation within groupby.apply #12653

Closed

jorisvandenbossche removed the Usage Question label Jun 23, 2016

rhshadrach closed this as completed Jul 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

updating row data within apply from other rows doesn't work #8662

updating row data within apply from other rows doesn't work #8662

lamakaha commented Oct 28, 2014

jreback commented Oct 28, 2014

lamakaha commented Oct 28, 2014

jreback commented Oct 28, 2014

jreback commented Oct 28, 2014

lamakaha commented Oct 29, 2014

jreback commented Oct 29, 2014

rhshadrach commented Jul 12, 2020

updating row data within apply from other rows doesn't work #8662

updating row data within apply from other rows doesn't work #8662

Comments

lamakaha commented Oct 28, 2014

jreback commented Oct 28, 2014

lamakaha commented Oct 28, 2014

jreback commented Oct 28, 2014

jreback commented Oct 28, 2014

lamakaha commented Oct 29, 2014

jreback commented Oct 29, 2014

rhshadrach commented Jul 12, 2020