Groupby built by columns : cannot use .head() or .apply() #9772

Closed
JonasAbernot opened this issue Apr 1, 2015 · 7 comments · Fixed by #37778

@JonasAbernot
Contributor

import numpy as np
import pandas as pd

df = pd.DataFrame({i:pd.Series(np.random.normal(size=10),
                                index=range(10)) for i in range(11)})

df_g = df.groupby(['a']*6+['b']*5, axis=1)

This, if I understood correctly, should build a groupby object that groups the columns, making it possible to aggregate them later. And indeed:

df_g.sum()

works well. But

df_g.head()

Throws an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jonas/Code/pandas/pandas/core/groupby.py", line 986, in head
    in_head = self._cumcount_array() < n
  File "/home/jonas/Code/pandas/pandas/core/groupby.py", line 1044, in _cumcount_array
    cumcounts[indices] = values
IndexError: index 10 is out of bounds for axis 1 with size 10

and

df_g.apply(lambda x : x.sum())

from which I expected the same result as the first example, gives this table instead:

           a         b
0  -0.381070       NaN
1  -1.214075       NaN
2  -1.496252       NaN
3   3.392565       NaN
4  -0.782376       NaN
5   1.306043       NaN
6        NaN -1.772334
7        NaN  4.125280
8        NaN  1.992329
9        NaN  4.283854
10       NaN -4.791092

I don't really understand what is happening, and I can't rule out a misunderstanding or a mistake on my part.

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8

pandas: 0.16.0-28-gcb8c130
nose: 1.3.4
Cython: 0.20.2
numpy: 1.9.2
scipy: 0.14.0
statsmodels: None
IPython: 3.0.0-dev
sphinx: 1.2.2
patsy: None
dateutil: 2.4.1
pytz: 2015.2
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.3
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: 2.5.3 (dt dec mx pq3 ext)
@shoyer
Member

shoyer commented Apr 1, 2015

These do look like bugs to me -- thanks for the report! If you're interested in digging in to figure out what's going on, such efforts would be appreciated :).

@jreback
Contributor

jreback commented Apr 2, 2015

this is an error in cumcount_array.

@jreback jreback added this to the Next Major Release milestone Apr 2, 2015
@JonasAbernot
Contributor Author

There seem to be two different bugs here.

The problem with head() may not be very relevant: if you group columns, showing the head of each group just shows the head of the whole df. Still, the error is there and should perhaps be addressed with an exception or a warning (at the moment it is silent when you have more rows than columns).

The problem with apply seems to go a bit deeper. The given function is applied without passing along the 'axis' argument given in the original groupby call. I couldn't find where this 'axis' argument is stored, if it is stored at all. Would you agree to adding a new attribute to the BaseGrouper class to record the orientation the grouper was built with? If so, I will propose a fix.

@jreback
Contributor

jreback commented Apr 3, 2015

no it just needs to be passed thru

@evanpw
Contributor

evanpw commented May 14, 2015

I don't think the issue with apply is actually a bug. In the original example, it's actually not possible to pass in the axis argument even if we knew to try:

>>> df_g.apply(lambda x : x.sum(), axis=1)
...
TypeError: <lambda>() got an unexpected keyword argument 'axis'

If you want to pass an extra argument, you can add an argument to the lambda and pass it to apply, or you could just pass it directly:

>>> df_g.apply(lambda x, axis : x.sum(axis=axis), axis=1)
>>> df_g.apply(lambda x: x.sum(axis=1))

We could actually inspect the arguments of the passed-in function using reflection, and pass through the axis parameter whenever it takes an argument named 'axis', but that seems like it might be overkill:

>>> import inspect
>>> inspect.getargspec(lambda x, axis : x.sum(axis=axis))
ArgSpec(args=['x', 'axis'], varargs=None, keywords=None, defaults=None)
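For readers on a recent Python: inspect.getargspec was removed in Python 3.11, and inspect.signature covers the same need. Below is a minimal sketch of the "pass axis through only if the function accepts it" idea described above. The helper name call_with_optional_axis and the small frame are hypothetical, made up for illustration; this is not how pandas dispatches UDFs.

```python
import inspect

import pandas as pd


def call_with_optional_axis(func, group, axis):
    """Hypothetical helper: forward `axis` only if `func` declares a parameter with that name."""
    params = inspect.signature(func).parameters
    if "axis" in params:
        return func(group, axis=axis)
    return func(group)


# A tiny block of two columns, standing in for one group of columns.
block = pd.DataFrame({"c0": [1, 2], "c1": [10, 20]})

# The UDF declares `axis`, so it is forwarded and the sum runs across columns.
with_axis = call_with_optional_axis(lambda x, axis: x.sum(axis=axis), block, axis=1)

# The UDF does not declare `axis`, so it is called as-is and sums down each column.
without_axis = call_with_optional_axis(lambda x: x.sum(), block, axis=1)
```

As noted above, this kind of reflection may well be overkill; writing the axis explicitly inside the UDF is simpler.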

@JonasAbernot
Contributor Author

Yep, I agree it's not a bug. I was thinking about adding a small warning to the docs, so that people like me don't forget the 'axis' argument in the applied function, but I haven't had time to do so yet.

Still, it's surprising that

df_g.sum()

and

df_g.apply(sum)

do not give the same result.

@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@rhshadrach
Member

_cumcount_array is okay here; the issue is the use of mask within groupby(...).head when axis=1:

mask = self._cumcount_array() < n
return self._selected_obj[mask]

When axis=1, the mask is computed along the columns, but then applied to the index. I think it should instead be applied to the columns.
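A rough sketch of that idea, outside of pandas internals: the per-column cumulative count is rebuilt by hand here (an assumption for illustration, not the actual _cumcount_array code), and the resulting mask is applied with .loc along the column axis. The seeded data and variable names are also illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 11)))  # 10 rows, 11 columns, as in the OP
labels = np.array(["a"] * 6 + ["b"] * 5)      # one group label per column
n = 2                                         # how many columns to keep per group

# Position of each column within its group: 0..5 for "a", then 0..4 for "b".
cumcount = pd.Series(labels).groupby(labels).cumcount().to_numpy()
mask = cumcount < n  # length 11: one entry per column, not per row

# The mask describes columns, so it has to select along the column axis.
head_by_columns = df.loc[:, mask]
```

With n = 2 this keeps the first two columns of each group and all ten rows.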

The issue with .apply(lambda x: x.sum()) with axis=1 is trickier. The main issue is that when pandas feeds a group of values into the UDF, they are not transposed. It seems reasonable to me to argue that they should be, but one technical hurdle here is what happens with a frame where the columns have different dtypes. Upon transposing, you now have columns of mixed dtypes, which are coerced to object dtype. So upon transposing the result back you lose type information. Since the UDF can return anything, there is no way to reliably determine what the resulting dtypes should be.

Of course, an argument against transposing the group when passing it to the UDF is that this would be a rather large change for what seems to me to be of little value. After all, any UDF can be rewritten under the presumption that the values passed in haven't been transposed. In this case, the UDF would be:

lambda x: x.sum(axis=1)

Using this in the OP example then produces the same result as df_g.sum().
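To make the difference concrete, here is a small sketch that mimics what the UDF receives without relying on groupby(..., axis=1) itself. Slicing the column blocks by hand is an assumption made for illustration, as are the seeded data and the variable names; the reference result is computed by grouping the transposed frame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 11)))  # 10 rows, 11 columns
labels = np.array(["a"] * 6 + ["b"] * 5)      # one group label per column

# Column blocks as the UDF would see them: sliced by column, not transposed.
groups = {key: df.loc[:, labels == key] for key in ("a", "b")}

# x.sum() reduces down each column, giving one value per column of the block.
# Concatenating these yields 11 rows with NaN where a column is not in the group.
per_column = pd.concat({k: g.sum() for k, g in groups.items()}, axis=1)

# x.sum(axis=1) reduces across the block's columns, giving one value per row.
per_row = pd.concat({k: g.sum(axis=1) for k, g in groups.items()}, axis=1)

# Reference: group the transposed frame by the column labels, then transpose back.
reference = df.T.groupby(labels).sum().T
```

Here per_column has the same NaN-filled layout as the table in the original report, while per_row matches the reference.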
