BUG: accessing .dtypes in a subclass constructor with large frames causes infinite loop #50708
Comments
This works as expected for me on main, but I can reproduce it on 1.5.2. Could you double-check on our nightly builds?
take
From https://www.kaggle.com/code/marcogorelli/pandas-regression-example?scriptVersionId=117662725 I'm seeing #49551 as the commit that fixed this. cc @rhshadrach, that doesn't really seem plausible though? EDIT: it totally is plausible, because numeric_only determined whether a transpose was called.
Looks like this has crept back in. I'm doing a bisect to see what brought it back; maybe that'll shed some light on what's going on.
#51335 brought it back:
    git checkout b836a88f81c575e86a67b47208b1b5a1067b6b40
    . compile-c-extensions.sh
    python myt.py  # hangs indefinitely
    git checkout b836a88f81c575e86a67b47208b1b5a1067b6b40~1
    . compile-c-extensions.sh
    python myt.py  # runs nearly instantly
with myt.py being:
    import pandas as pd
    import numpy as np
    from pandas import *

    class MyFrame(pd.DataFrame):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Runs on every construction, including ones pandas does internally
            for col in self.columns:
                if self.dtypes[col] == "O":
                    self[col] = pd.to_numeric(self[col], errors='ignore')

        @property
        def _constructor(self):
            return type(self)

    def get_frame(N):
        return MyFrame(
            data=np.vstack(
                [np.where(np.random.rand(N) > 0.36, np.random.rand(N), np.nan) for _ in range(10)]
            ).T,
            columns=[f"col{i}" for i in range(10)]
        )

    def long_running_function(n):
        get_frame(n).dropna()

    long_running_function(1000000)
Investigations:
Line 6398 in b836a88
which then goes to Line 10482 in b836a88
The transpose calls the constructor, which here is overridden and involves a Python for-loop. The frame is N x 10, so its transpose has a million columns, and the loop looks up self.dtypes once for each of them. So, that's why it's become a lot slower.
Before:
Now:
@rhshadrach do you reckon this is a cause for concern, or that anything should be done here? I'm not sure pandas can support arbitrary subclasses anyway, so it might be OK to just close as out of scope?
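The constructor call during a transpose can be observed with a small counting subclass. This is only a sketch: CountingFrame and init_calls are hypothetical names introduced for illustration, and the exact number of calls may differ between pandas versions.

```python
import numpy as np
import pandas as pd


class CountingFrame(pd.DataFrame):
    """Hypothetical subclass that counts how often __init__ runs."""

    # Class-level counter, so it is never mistaken for a column or
    # instance attribute of the frame.
    init_calls = 0

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        type(self).init_calls += 1

    @property
    def _constructor(self):
        # Tell pandas to build results with this subclass, which means
        # __init__ above runs for internally created frames too.
        return type(self)


frame = CountingFrame(np.random.rand(5, 3), columns=["a", "b", "c"])

calls_before = CountingFrame.init_calls
transposed = frame.T
calls_during_transpose = CountingFrame.init_calls - calls_before

print(f"__init__ ran {calls_during_transpose} time(s) during .T")
print(f"transposed shape: {transposed.shape}")
```

Any per-column work placed in __init__ is therefore repeated on the transposed frame, whose column count equals the original row count.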
#52250 would fix this again.
OK, thanks. If we do want to make a PR for the sake of this, then I'd really suggest getting a test like the one in #50751 merged, else this'll happen again.
Looks like that one's been closed. Is there still scope to fix this one? Or should we just close as not supported?
@MarcoGorelli - I plan to look into this.
RE: #50708 (comment)
I can confirm that applying #49551 on top of the [...]
For anyone else stumbling upon this and not wanting to upgrade pandas to an unaffected version: apply the MR change line by line on top of their installed source file's [...]
    from pandas.core.frame import DataFrame

    def _fixed_reduce(self,
        # copy-paste method signature
    ):
        # copy-paste method definition and additionally import any internal objects as needed
        # apply the MR changes on top of the definition of _reduce in your pandas runtime - you'll need to look it up from `pandas/core/frame.py` and potentially decide how to resolve conflicts

    DataFrame._reduce = _fixed_reduce
Note: to be safe, you might want to patch the remaining files/methods in #49551
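For anyone who would rather not patch pandas internals, another option might be to keep the per-column conversion out of __init__ altogether, so that frames pandas builds internally (for example during a transpose) skip it. The following is a minimal sketch under that assumption, not something proposed in this thread: LeanFrame and from_raw are hypothetical names, and a try/except stands in for errors='ignore'.

```python
import pandas as pd


class LeanFrame(pd.DataFrame):
    """Hypothetical subclass with no extra work in __init__."""

    @property
    def _constructor(self):
        return type(self)

    @classmethod
    def from_raw(cls, *args, **kwargs):
        """Build a frame, then convert object columns to numeric once."""
        frame = cls(*args, **kwargs)
        # select_dtypes finds the object columns in a single pass instead
        # of looking up self.dtypes[col] inside a Python loop.
        object_cols = frame.select_dtypes(include="object").columns
        for col in object_cols:
            try:
                frame[col] = pd.to_numeric(frame[col])
            except (ValueError, TypeError):
                # Leave columns that cannot be parsed as numbers untouched.
                pass
        return frame


raw = LeanFrame.from_raw({"a": ["1", "2"], "b": ["x", "y"]}, dtype=object)
print(raw.dtypes)
print(type(raw.T).__name__)
```

The trade-off is that callers have to use from_raw explicitly; frames created through plain LeanFrame(...) or by pandas itself are not converted.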
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Modifying the class __init__ to (remove self.dtypes[col]):
Issue Description
I think there has been a regression with access to .dtypes property in inherited DataFrame constructors, as noted in the MRE.
We noticed this on pandas 1.5.2 when upgrading our production environment, but reproduced with pandas 1.4.4, 1.4.0. The code works as expected going back to 1.3.5.
As far as what should be done, perhaps more notes about what can/can't/should not be called/done in subclass __init__ routines when inheriting from pd.DataFrame?
Expected Behavior
No infinite loop?
Installed Versions