Groupby built by columns : cannot use .head() or .apply() #9772

JonasAbernot · 2015-04-01T13:50:32Z

import numpy as np
import pandas as pd

df = pd.DataFrame({i:pd.Series(np.random.normal(size=10),
                                index=range(10)) for i in range(11)})

df_g = df.groupby(['a']*6+['b']*5, axis=1)

If I understand correctly, this should build a groupby object that groups columns, making it possible to aggregate them later. And indeed:

df_g.sum()

works well. But

df_g.head()

throws an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jonas/Code/pandas/pandas/core/groupby.py", line 986, in head
    in_head = self._cumcount_array() < n
  File "/home/jonas/Code/pandas/pandas/core/groupby.py", line 1044, in _cumcount_array
    cumcounts[indices] = values
IndexError: index 10 is out of bounds for axis 1 with size 10

and

df_g.apply(lambda x : x.sum())

from which I expected the same result as the first example, gives this table :

           a         b
0  -0.381070       NaN
1  -1.214075       NaN
2  -1.496252       NaN
3   3.392565       NaN
4  -0.782376       NaN
5   1.306043       NaN
6        NaN -1.772334
7        NaN  4.125280
8        NaN  1.992329
9        NaN  4.283854
10       NaN -4.791092

I don't really understand what's happening, and I can't rule out a misunderstanding or a mistake on my part.

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8

pandas: 0.16.0-28-gcb8c130
nose: 1.3.4
Cython: 0.20.2
numpy: 1.9.2
scipy: 0.14.0
statsmodels: None
IPython: 3.0.0-dev
sphinx: 1.2.2
patsy: None
dateutil: 2.4.1
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.3
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: 2.5.3 (dt dec mx pq3 ext)


shoyer · 2015-04-01T21:52:10Z

These do look like bugs to me -- thanks for the report! If you're interested in digging in to figure out what's going on, such efforts would be appreciated :).

jreback · 2015-04-02T21:56:55Z

this is an error in cumcount_array.

JonasAbernot · 2015-04-03T12:41:33Z

It turns out there seem to be two different bugs.

The problem with head() may not be that relevant: if you group columns, showing the head of each group just shows the head of the whole df. Still, the error is there and should perhaps be addressed with an exception or a warning (at the moment it is silent when you have more rows than columns).

The problem with apply seems to go a bit deeper. The given function is applied without being passed the 'axis' argument given in the original groupby call. I couldn't find where this 'axis' argument is stored, if it is stored at all. Would you agree to adding a new attribute to the BaseGrouper class to record the orientation the groupby was built with? If so, I will propose a fix.

jreback · 2015-04-03T13:56:00Z

no it just needs to be passed thru

evanpw · 2015-05-14T11:21:46Z

I don't think the issue with apply is actually a bug. In the original example, it's actually not possible to pass in the axis argument even if we knew to try:

>>> df_g.apply(lambda x : x.sum(), axis=1)
...
TypeError: <lambda>() got an unexpected keyword argument 'axis'

If you want to pass an extra argument, you can add an argument to the lambda and pass it to apply, or you could just pass it directly:

>>> df_g.apply(lambda x, axis : x.sum(axis=axis), axis=1)
>>> df_g.apply(lambda x: x.sum(axis=1))

We could actually inspect the arguments of the passed-in function using reflection, and pass through the axis parameter whenever it takes an argument named 'axis', but that seems like it might be overkill:

>>> import inspect
>>> inspect.getargspec(lambda x, axis : x.sum(axis=axis))
ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=None)
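A minimal sketch of that reflection idea, using `inspect.signature` (the modern replacement for `inspect.getargspec`, which was removed in Python 3.11). The helper name `call_with_optional_axis` is hypothetical and not part of pandas; it only illustrates forwarding `axis` when the function declares a parameter with that name.

```python
import inspect


def call_with_optional_axis(func, group, axis):
    """Hypothetical helper: forward `axis` only if `func` declares it."""
    params = inspect.signature(func).parameters
    if "axis" in params:
        # The function has a parameter literally named 'axis'.
        return func(group, axis=axis)
    # Otherwise call it with the group alone, as apply does today.
    return func(group)


# A function without an 'axis' parameter is called unchanged.
print(call_with_optional_axis(lambda x: len(x), [1, 2, 3], axis=1))
# A function with an 'axis' parameter receives the value.
print(call_with_optional_axis(lambda x, axis: (len(x), axis), [1, 2, 3], axis=1))
```

Matching on the parameter name alone is fragile (a function could use `axis` for something unrelated), which is part of why this may be overkill.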

JonasAbernot · 2015-05-18T09:42:25Z

Yep, I agree it's not a bug. I was thinking about adding a small warning in the doc, so that people like me don't forget the 'axis' argument in the applied function, but I haven't had time to do so yet.

Still, it's surprising that

df_g.sum()

and

df_g.apply(sum)

do not give the same result.

rhshadrach · 2020-11-07T21:42:56Z

_cumcount_array is okay here; the issue is the use of mask within groupby(...).head when axis=1:

mask = self._cumcount_array() < n
return self._selected_obj[mask]

When axis=1, the mask is computed along the columns, but then applied to the index. I think it should instead be applied to the columns.
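The intended behaviour can be sketched without `groupby(..., axis=1)` at all (that argument has since been deprecated in newer pandas releases). This is an illustration under assumptions, not the actual patch from #37778: `labels` and `n` are names chosen here, and the position of each column within its group is computed with `cumcount()` on a Series of column labels.

```python
import numpy as np
import pandas as pd

# 10 rows, 11 columns, mirroring the shape of the original example.
df = pd.DataFrame(np.arange(110, dtype=float).reshape(10, 11))
labels = pd.Series(["a"] * 6 + ["b"] * 5)  # one group label per column
n = 2                                      # columns to keep per group

# Position of each column within its group: 0..5 for 'a', 0..4 for 'b'.
cumcount = labels.groupby(labels).cumcount()
mask = (cumcount < n).to_numpy()

# The mask has one entry per column, so it must select columns, not rows.
head_by_columns = df.loc[:, mask]
print(head_by_columns.columns.tolist())  # [0, 1, 6, 7]
```

Indexing with `df[mask]` instead would try to use the 11-entry mask against the 10 rows, which is the mismatch described above.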

The issue with .apply(lambda x: x.sum()) with axis=1 is trickier. The main issue is that when pandas feeds a group of values into the UDF, they are not transposed. It seems reasonable to me to argue that they should be, but one technical hurdle here is what happens with a frame where the columns are different dtypes. Upon transposing, you now have columns of mixed dtypes, which are coerced to object type. So upon transposing the result back you lose type information. Since the UDF can return anything, there is no way to reliably determine what the resulting dtypes should be.

Of course, an argument against transposing the group when passing it to the UDF is that this would be a rather large change for what seems to me to be of little value. After all, any UDF can be rewritten under the presumption that the values passed in haven't been transposed. In this case, the UDF would be:

lambda x: x.sum(axis=1)

Using this in the OP example then produces the same result as df_g.sum().
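To make the difference concrete, the sketch below emulates by hand what `apply` does with column groups: each group reaches the UDF as an untransposed sub-frame. The helper `apply_to_column_groups` is hypothetical and written for this illustration only; the reference result is obtained by transposing, grouping rows, and transposing back, so no `axis=1` groupby is needed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 11)))
labels = ["a"] * 6 + ["b"] * 5  # one group label per column


def apply_to_column_groups(frame, labels, udf):
    """Hypothetical stand-in for a column-wise groupby apply (no transposition)."""
    label_arr = np.asarray(labels)
    pieces = {}
    for label in dict.fromkeys(labels):  # unique labels, order preserved
        pieces[label] = udf(frame.loc[:, label_arr == label])
    return pd.concat(pieces, axis=1)


# Reference: per-group sums across columns, one row per original row.
expected = df.T.groupby(labels).sum().T

# x.sum() reduces over rows, so each group yields one value per column.
col_sums = apply_to_column_groups(df, labels, lambda x: x.sum())
# x.sum(axis=1) reduces over the group's columns, one value per row.
row_sums = apply_to_column_groups(df, labels, lambda x: x.sum(axis=1))

print(col_sums.shape)                   # (11, 2), with NaN off the "diagonal"
print(row_sums.shape)                   # (10, 2)
print(np.allclose(row_sums, expected))  # True
```

The first result has the same staggered NaN layout as the table in the original report, while the second matches the grouped sum.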

shoyer added Bug Groupby labels Apr 1, 2015

jreback added this to the Next Major Release milestone Apr 2, 2015

datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018

This was referenced Nov 8, 2020

CLN: Simplify groupby head/tail tests #37702

Merged

BUG: Groupby head/tail with axis=1 fails #37777

Closed

BUG: Groupby head/tail with axis=1 fails #37778

Merged

jreback modified the milestones: Someday, 1.2 Nov 13, 2020

jreback closed this as completed in #37778 Nov 13, 2020

rhshadrach mentioned this issue Nov 24, 2020

ENH: groupby.apply axis=1 behavior #38042

Closed
