BUG: accessing `.dtypes` in a subclass constructor with large frames causes infinite loop #50708

ryandvmartin · 2023-01-12T15:40:44Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

class MyFrame(pd.DataFrame): 
    def __init__(self, *args, **kwargs): 
        super().__init__(*args, **kwargs)
        for col in self.columns:
            if self.dtypes[col] == "O":
                self[col] = pd.to_numeric(self[col], errors='ignore')
    @property
    def _constructor(self): 
        return type(self)

def get_frame(N): 
    return MyFrame(
        data=np.vstack(
            [np.where(np.random.rand(N) > 0.36, np.random.rand(N), np.nan) for _ in range(10)]
        ).T, 
        columns=[f"col{i}" for i in range(10)]
    )

# When N is smallish, no issue
frame = get_frame(5000)
frame.dropna(subset=["col0", "col1"])
print("5000 passed")

# When N is largeish, `dropna` recurses in the `__init__` through `self.dtypes[col]` access
frame = get_frame(5000000)
frame.dropna(subset=["col0", "col1"])
print("5000000 passed")

Modifying the class __init__ to (remove self.dtypes[col]):

class MyFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        for col, dt in zip(self.columns, self.dtypes):
            if dt == "O":
                self[col] = pd.to_numeric(self[col], errors='ignore')
    @property
    def _constructor(self):
        return type(self)

Issue Description

I think there has been a regression with access to .dtypes property in inherited DataFrame constructors, as noted in the MRE.

We noticed this on pandas 1.5.2 when upgrading our production environment , but reproduced with pandas 1.4.4, 1.4.0. The code works as expected going back to 1.3.5.

As far as what should be done, perhaps more notes about what can/can't/should not be called/done in subclass __init__ routines when inheriting from pd.DataFrame?

Expected Behavior

No infinite loop?

Installed Versions

In [2]: pd.show_versions()
C:\Users\user\Python\lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.10.8.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22621
machine          : AMD64
processor        : AMD64 Family 25 Model 33 Stepping 2, AuthenticAMD
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United States.1252

pandas           : 1.5.2
numpy            : 1.21.6
pytz             : 2022.7
dateutil         : 2.8.2
setuptools       : 65.6.3
pip              : 22.3.1
Cython           : 0.29.33
pytest           : 6.2.5
hypothesis       : 6.62.0
sphinx           : 5.3.0
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.2
html5lib         : None
pymysql          : None
psycopg2         : 2.9.3
jinja2           : 3.1.2
IPython          : 8.8.0
pandas_datareader: None
bs4              : 4.11.1
bottleneck       : None
brotli           :
fastparquet      : None
fsspec           : 2022.11.0
gcsfs            : None
matplotlib       : 3.5.3
numba            : 0.56.4
numexpr          : 2.8.3
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : 10.0.1
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.10.0
snappy           : None
sqlalchemy       : 1.4.46
tables           : 3.7.0
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : None
zstandard        : 0.19.0
tzdata           : None

The text was updated successfully, but these errors were encountered:

phofl · 2023-01-12T22:33:34Z

This works as expected for me on main, can reproduce on 1.5.2. Could you double check on our nightly builds?

natmokval · 2023-01-13T21:11:13Z

take

MarcoGorelli · 2023-01-29T22:50:59Z

From https://www.kaggle.com/code/marcogorelli/pandas-regression-example?scriptVersionId=117662725 I'm seeing #49551 as the commit that fixed this, cc @rhshadrach

doesn't really seem plausible though?

EDIT: it totally is plausible, because the numeric_only determined whether a transpose was called

MarcoGorelli · 2023-03-31T10:06:00Z

Looks like this has crept back in

Am doing a bisect to see what brought it back, maybe that'll shed some light into what's going on

MarcoGorelli · 2023-03-31T10:21:15Z

#51335 brought it back:

git checkout b836a88f81c575e86a67b47208b1b5a1067b6b40
. compile-c-extensions.sh 
python myt.py  # hangs indefinitely
git checkout b836a88f81c575e86a67b47208b1b5a1067b6b40~1
. compile-c-extensions.sh 
python myt.py  # runs nearly instantly

myt.py contains:

import pandas as pd
import numpy as np
from pandas import *


class MyFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        for col in self.columns:
            if self.dtypes[col] == "O":
                self[col] = pd.to_numeric(self[col], errors='ignore')
    @property
    def _constructor(self):
        return type(self)

def get_frame(N):
    return MyFrame(
        data=np.vstack(
            [np.where(np.random.rand(N) > 0.36, np.random.rand(N), np.nan) for _ in range(10)]
        ).T,
        columns=[f"col{i}" for i in range(10)]
    )

def long_running_function(n):
    get_frame(n).dropna()


long_running_function(1000000)

MarcoGorelli · 2023-03-31T11:04:35Z

Investigations:

dropna goes here:

pandas/pandas/core/frame.py

Line 6398 in b836a88

elif how == "all":

which then goes to

pandas/pandas/core/frame.py

Line 10482 in b836a88

df = df.T

The transpose calls the constructor, which here is overwritten and involves a Python for-loop which goes through millions of elements. So, that's why it's become a lot slower

Before:

numeric_only and axis=1 would end up calling df.T

Now:

axis=1 goes to df.T regardless of numeric_only

@rhshadrach reckon this is a cause for concern or that anything should be done here? Not sure pandas can support arbitrary subclasses anyway, might be OK to just close as out-of-scope?

rhshadrach · 2023-03-31T19:58:30Z

#52250 would fix again.

MarcoGorelli · 2023-03-31T20:20:20Z

ok thanks

if we do want to make a PR for the sake of this, then I'd really suggest getting a test like the one in #50751 merged, else this'll happen again

MarcoGorelli · 2023-04-11T15:07:44Z

#52250 would fix again.

Looks like that one's been closed

Is there still scope to fix this one? Or should we just close as not-supported?

rhshadrach · 2023-04-12T00:45:21Z

@MarcoGorelli - I plan to look into this.

msm416 · 2025-02-10T13:51:52Z

RE: #50708 (comment)

I'm seeing #49551 as the commit that fixed this
doesn't really seem plausible though?
EDIT: it totally is plausible, because the numeric_only determined whether a transpose was called

I can confirm that applying #49551 on top of the _reduce in pandas 1.5.3 fixes the performance problem. I didn't need any of the other files to be changed for this particular version.

For anyone else stumbling upon this and not wanting to upgrade pandas to unaffected version: apply the MR change line by line on top of their installed source file's _reduce method (pandas/core/frame.py) or alternatively patch at runtime:

from pandas.core.frame import DataFrame
def _fixed_reduce(self,
    # copy-paste method signature
    ):
# copy-paste method definition and additionally import any internal objects as needed
# apply the MR changes on top of the definition of  _reduce in your pandas runtime - you'll need to look it up from `pandas/core/frame.py` and potentially decide how to resolve conflicts
DataFrame._reduce = _fixed_reduce

Note: to be safe, you might want to patch the remaining files/methods in #49551

ryandvmartin added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 12, 2023

phofl added good first issue Needs Tests Unit test(s) needed to prevent regressions Subclassing Subclassing pandas objects and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 12, 2023

github-actions bot assigned natmokval Jan 13, 2023

natmokval mentioned this issue Jan 14, 2023

TEST: dropna() doesn't fall into infinite loop #50751

Closed

3 tasks

ryandvmartin mentioned this issue Mar 31, 2023

BUG: Performance regression in .dropna on DataFrame subclasses #50780

Closed

3 tasks

rhshadrach removed good first issue Needs Tests Unit test(s) needed to prevent regressions labels Apr 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: accessing `.dtypes` in a subclass constructor with large frames causes infinite loop #50708

BUG: accessing `.dtypes` in a subclass constructor with large frames causes infinite loop #50708

ryandvmartin commented Jan 12, 2023 •

edited

Loading

phofl commented Jan 12, 2023

natmokval commented Jan 13, 2023

MarcoGorelli commented Jan 29, 2023 •

edited

Loading

MarcoGorelli commented Mar 31, 2023

MarcoGorelli commented Mar 31, 2023 •

edited

Loading

MarcoGorelli commented Mar 31, 2023 •

edited

Loading

rhshadrach commented Mar 31, 2023

MarcoGorelli commented Mar 31, 2023

MarcoGorelli commented Apr 11, 2023

rhshadrach commented Apr 12, 2023

msm416 commented Feb 10, 2025 •

edited

Loading

BUG: accessing .dtypes in a subclass constructor with large frames causes infinite loop #50708

BUG: accessing .dtypes in a subclass constructor with large frames causes infinite loop #50708

Comments

ryandvmartin commented Jan 12, 2023 • edited Loading

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

phofl commented Jan 12, 2023

natmokval commented Jan 13, 2023

MarcoGorelli commented Jan 29, 2023 • edited Loading

MarcoGorelli commented Mar 31, 2023

MarcoGorelli commented Mar 31, 2023 • edited Loading

MarcoGorelli commented Mar 31, 2023 • edited Loading

rhshadrach commented Mar 31, 2023

MarcoGorelli commented Mar 31, 2023

MarcoGorelli commented Apr 11, 2023

rhshadrach commented Apr 12, 2023

msm416 commented Feb 10, 2025 • edited Loading

BUG: accessing `.dtypes` in a subclass constructor with large frames causes infinite loop #50708

BUG: accessing `.dtypes` in a subclass constructor with large frames causes infinite loop #50708

ryandvmartin commented Jan 12, 2023 •

edited

Loading

MarcoGorelli commented Jan 29, 2023 •

edited

Loading

MarcoGorelli commented Mar 31, 2023 •

edited

Loading

MarcoGorelli commented Mar 31, 2023 •

edited

Loading

msm416 commented Feb 10, 2025 •

edited

Loading