Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: accessing .dtypes in a subclass constructor with large frames causes infinite loop #50708

Open
2 of 3 tasks
ryandvmartin opened this issue Jan 12, 2023 · 11 comments
Open
2 of 3 tasks
Assignees
Labels
Subclassing Subclassing pandas objects

Comments

@ryandvmartin
Copy link

ryandvmartin commented Jan 12, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

class MyFrame(pd.DataFrame): 
    def __init__(self, *args, **kwargs): 
        super().__init__(*args, **kwargs)
        for col in self.columns:
            if self.dtypes[col] == "O":
                self[col] = pd.to_numeric(self[col], errors='ignore')
    @property
    def _constructor(self): 
        return type(self)

def get_frame(N): 
    return MyFrame(
        data=np.vstack(
            [np.where(np.random.rand(N) > 0.36, np.random.rand(N), np.nan) for _ in range(10)]
        ).T, 
        columns=[f"col{i}" for i in range(10)]
    )

# When N is smallish, no issue
frame = get_frame(5000)
frame.dropna(subset=["col0", "col1"])
print("5000 passed")

# When N is largeish, `dropna` recurses in the `__init__` through `self.dtypes[col]` access
frame = get_frame(5000000)
frame.dropna(subset=["col0", "col1"])
print("5000000 passed")

Modifying the class __init__ to (remove self.dtypes[col]):

class MyFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        for col, dt in zip(self.columns, self.dtypes):
            if dt == "O":
                self[col] = pd.to_numeric(self[col], errors='ignore')
    @property
    def _constructor(self):
        return type(self)

Issue Description

I think there has been a regression with access to .dtypes property in inherited DataFrame constructors, as noted in the MRE.

We noticed this on pandas 1.5.2 when upgrading our production environment , but reproduced with pandas 1.4.4, 1.4.0. The code works as expected going back to 1.3.5.

As far as what should be done, perhaps more notes about what can/can't/should not be called/done in subclass __init__ routines when inheriting from pd.DataFrame?

Expected Behavior

No infinite loop?

Installed Versions

In [2]: pd.show_versions()
C:\Users\user\Python\lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.10.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22621
machine          : AMD64
processor        : AMD64 Family 25 Model 33 Stepping 2, AuthenticAMD
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United States.1252

pandas           : 1.5.2
numpy            : 1.21.6
pytz             : 2022.7
dateutil         : 2.8.2
setuptools       : 65.6.3
pip              : 22.3.1
Cython           : 0.29.33
pytest           : 6.2.5
hypothesis       : 6.62.0
sphinx           : 5.3.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.2
html5lib         : None
pymysql          : None
psycopg2         : 2.9.3
jinja2           : 3.1.2
IPython          : 8.8.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           :
fastparquet      : None
fsspec           : 2022.11.0
gcsfs            : None
matplotlib       : 3.5.3
numba            : 0.56.4
numexpr          : 2.8.3
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : 10.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.10.0
snappy           : None
sqlalchemy       : 1.4.46
tables           : 3.7.0
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : None
zstandard        : 0.19.0
tzdata           : None
@ryandvmartin ryandvmartin added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 12, 2023
@phofl
Copy link
Member

phofl commented Jan 12, 2023

This works as expected for me on main, can reproduce on 1.5.2. Could you double check on our nightly builds?

@phofl phofl added good first issue Needs Tests Unit test(s) needed to prevent regressions Subclassing Subclassing pandas objects and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 12, 2023
@natmokval
Copy link
Contributor

take

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jan 29, 2023

From https://www.kaggle.com/code/marcogorelli/pandas-regression-example?scriptVersionId=117662725 I'm seeing #49551 as the commit that fixed this, cc @rhshadrach

doesn't really seem plausible though?

EDIT: it totally is plausible, because the numeric_only determined whether a transpose was called

@MarcoGorelli
Copy link
Member

Looks like this has crept back in

Am doing a bisect to see what brought it back, maybe that'll shed some light into what's going on

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Mar 31, 2023

#51335 brought it back:

git checkout b836a88f81c575e86a67b47208b1b5a1067b6b40
. compile-c-extensions.sh 
python myt.py  # hangs indefinitely
git checkout b836a88f81c575e86a67b47208b1b5a1067b6b40~1
. compile-c-extensions.sh 
python myt.py  # runs nearly instantly

myt.py contains:

import pandas as pd
import numpy as np
from pandas import *


class MyFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        for col in self.columns:
            if self.dtypes[col] == "O":
                self[col] = pd.to_numeric(self[col], errors='ignore')
    @property
    def _constructor(self):
        return type(self)

def get_frame(N):
    return MyFrame(
        data=np.vstack(
            [np.where(np.random.rand(N) > 0.36, np.random.rand(N), np.nan) for _ in range(10)]
        ).T,
        columns=[f"col{i}" for i in range(10)]
    )

def long_running_function(n):
    get_frame(n).dropna()


long_running_function(1000000)

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Mar 31, 2023

Investigations:

dropna goes here:

elif how == "all":

which then goes to

df = df.T

The transpose calls the constructor, which here is overwritten and involves a Python for-loop which goes through millions of elements. So, that's why it's become a lot slower

Before:

  • numeric_only and axis=1 would end up calling df.T

Now:

  • axis=1 goes to df.T regardless of numeric_only

@rhshadrach reckon this is a cause for concern or that anything should be done here? Not sure pandas can support arbitrary subclasses anyway, might be OK to just close as out-of-scope?

@rhshadrach
Copy link
Member

#52250 would fix again.

@MarcoGorelli
Copy link
Member

ok thanks

if we do want to make a PR for the sake of this, then I'd really suggest getting a test like the one in #50751 merged, else this'll happen again

@MarcoGorelli
Copy link
Member

#52250 would fix again.

Looks like that one's been closed

Is there still scope to fix this one? Or should we just close as not-supported?

@rhshadrach
Copy link
Member

@MarcoGorelli - I plan to look into this.

@rhshadrach rhshadrach removed good first issue Needs Tests Unit test(s) needed to prevent regressions labels Apr 20, 2023
@msm416
Copy link

msm416 commented Feb 10, 2025

RE: #50708 (comment)

I'm seeing #49551 as the commit that fixed this
doesn't really seem plausible though?
EDIT: it totally is plausible, because the numeric_only determined whether a transpose was called

I can confirm that applying #49551 on top of the _reduce in pandas 1.5.3 fixes the performance problem. I didn't need any of the other files to be changed for this particular version.

For anyone else stumbling upon this and not wanting to upgrade pandas to unaffected version: apply the MR change line by line on top of their installed source file's _reduce method (pandas/core/frame.py) or alternatively patch at runtime:

from pandas.core.frame import DataFrame
def _fixed_reduce(self,
    # copy-paste method signature
    ):
# copy-paste method definition and additionally import any internal objects as needed
# apply the MR changes on top of the definition of  _reduce in your pandas runtime - you'll need to look it up from `pandas/core/frame.py` and potentially decide how to resolve conflicts
DataFrame._reduce = _fixed_reduce

Note: to be safe, you might want to patch the remaining files/methods in #49551

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Subclassing Subclassing pandas objects
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants