PERF: pd.BooleanDtype in row operations is 2000000 times slower #52016
Got the same result (essentially) on pandas 1.5.3.
Details (Installed versions): commit : 2e218d1, pandas : 1.5.3 |
take |
take |
cc @jbrockmendel; a viable PR to fix this is #51955 |
If you want to revive it, OK. Like I've said, MaskedArray reductions with axis=1 are just always going to be slow. |
What about a DtypeWarning or PerformanceWarning to notify the user of the slow op?
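As an aside, a minimal sketch of the common workaround discussed in this thread: cast the nullable-boolean columns to plain np.bool_ before the row-wise reduction, which bypasses the slow MaskedArray path. This assumes the frame contains no pd.NA values, since `astype(bool)` raises on missing data.

```python
import pandas as pd

# Small frame with the nullable boolean dtype (no pd.NA values here;
# astype(bool) would raise if any were present).
df = pd.DataFrame(
    {"a": [True, False, False], "b": [False, False, True]},
    dtype="boolean",
)

# Slow path: reduces through the MaskedArray machinery.
slow = df.any(axis=1)

# Workaround: cast to plain np.bool_ first, then reduce.
fast = df.astype(bool).any(axis=1)

assert slow.tolist() == fast.tolist() == [True, False, True]
```

The results agree whenever no values are missing; with missing values the masked path's skipna semantics differ, which is exactly why the dtype cannot be dropped silently.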
To fix the performance issue, we can optimize the implementation of row-wise operations with pd.BooleanDtype by using a more efficient method. Here is a proposed change to the pandas/core/arrays/boolean.py file to improve the performance of row-wise operations with pd.BooleanDtype:

```python
# pandas/core/arrays/boolean.py
# ...

# Add this import at the top of the file
import numexpr as ne

# ...

class BooleanArray(BaseMaskedArray):
    # ...

    # Add this optimized implementation of the `any` function
    def any(self, axis=None, out=None, keepdims=False, skipna=True):
        if axis == 1:
            # Use the efficient numexpr library for the elementwise masking;
            # numexpr has no axis reductions, so reduce with NumPy afterwards.
            values = self._data.astype(bool)
            mask = self._mask.astype(bool)
            valid = ne.evaluate("values & ~mask")
            return pd.Series(valid.any(axis=1))
        else:
            return super().any(axis=axis, out=out, keepdims=keepdims, skipna=skipna)

# ...
```

This change uses the numexpr library to perform row-wise operations more efficiently. |
I'd like to get others thoughts on whether we should have #51955 go into 2.0. cc @mroeschke @phofl @jorisvandenbossche. Summary: The perf regression was first noticed in #51335 (comment). I was in favor of partially reverting that PR if it got into the RC, but #51955 didn't make it. With this, I was originally fine with letting the perf regression stand and working on fixing for 2.1, but now I'm second guessing that due to its severity. |
I'd be ok with the partial revert, but @jbrockmendel is correct that these operations won't ever be really performant |
Maybe a decorator for the axis=1 reductions:

```python
from typing import Callable, ParamSpec, Concatenate, TypeAlias
import functools

P = ParamSpec("P")
BooleanReductionMethod: TypeAlias = "Callable[Concatenate[pd.DataFrame, P], pd.Series[bool]]"

def convert_axis1_bool_dtypes(method: BooleanReductionMethod) -> BooleanReductionMethod:
    @functools.wraps(method)
    def inner(self: pd.DataFrame, *args: P.args, **kwargs: P.kwargs) -> pd.Series[bool]:
        # Cast to plain NumPy bool before a row-wise reduction; otherwise
        # delegate unchanged.
        return method(
            self.astype(bool) if kwargs.get("axis") in (1, "columns") else self,
            *args,
            **kwargs,
        )
    return inner
``` |
I think short term the partial revert (#51955) might be the best option, but personally I think the more complete solution (also preserving the correct dtypes), like #51923, is doable as well. Row operations on DataFrames with all-1D columnar storage are indeed always going to be slower than the same operation on a 2D block, no question. But currently we are also leaving a lot on the table, and I think there are easy ways to improve the performance. |
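To illustrate that point with a sketch in plain NumPy (not pandas internals): even when the data lives in separate 1D column arrays, a row-wise `any` can be computed by reducing *across* the columns, which stays fully vectorized instead of looping over rows in Python.

```python
import numpy as np

rng = np.random.default_rng(0)
block = rng.integers(0, 2, size=(1000, 4)).astype(bool)  # 2D block storage

# Columnar storage: one 1D array per column.
cols = [block[:, j].copy() for j in range(block.shape[1])]

# 2D block: a single vectorized reduction.
from_block = block.any(axis=1)

# Columnar: reduce across the list of columns -- one vectorized
# np.logical_or per column instead of a Python-level loop over rows.
from_cols = np.logical_or.reduce(cols)

assert np.array_equal(from_block, from_cols)
```

The columnar path does a handful of vectorized passes (one per column) rather than a million tiny per-row operations, which is the kind of easy win being described.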
Sorry for the inconvenience. However, I've noticed that the issue still persists in version 2.1.4. I had thought that the problem had been resolved by #54341. Could you tell me why it still exists? Thanks. |
I am not seeing this on 2.1.4. Can you post your timings from the code in the OP? |
@rhshadrach Sorry for the delayed response; I wasn't subscribed to this issue and missed the notification.
However, in the original case described, there seems to be a delay in executing %timeit pd_mask.any(axis=1). I'm not sure what's causing this. Could someone help clarify this behavior? Thanks!
Also, I initially encountered this performance issue in pandas version 2.0.3. So I'm curious if there are plans to backport the fix to versions prior to 2.1.x, as many users might still be using 2.0.x or earlier versions. Thanks for your assistance! |
Do you experience the performance issue on 2.1.0 and later?
No - we only support the most recent version. Once 2.1.0 is released, 2.0.x does not see any new releases (barring very exceptional circumstances). |
@rhshadrach Thank you for your response.
No, but I am still curious why it seems to exist in the test case.
I understand that you only support the most recent version except under very exceptional circumstances. With this in mind, if a significant performance issue were identified, would backporting a fix to an older version be considered? |
The version you're encountering this on is 2.0.x, and it's only fixed in 2.1.0. Am I misunderstanding?
No, there will not be any consideration of backporting a fix for the performance regression in axis=1 reductions here. |
@rhshadrach Sorry for the late response.
My work does not encounter this in 2.1.4, but the test case below shows a delay when performing reduction operations with axis=1.
And I checked: the version is 2.1.4.
|
I cannot reproduce; your timings are about 10x slower, relatively, for the 3rd line. Here are my timings on 2.1.4:
Can you open up a new issue and include the details of |
Sure, will do. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Edit: additional context and unexpected behavior
The pd.BooleanDtype is faster than the pd.ArrowDtype, which makes sense, but if the backend is going to change,
it would make sense to use the np.bool_ dtype.
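For reference, a sketch of how the three boolean representations compared above are constructed (the pyarrow line is commented out since it requires pyarrow to be installed):

```python
import numpy as np
import pandas as pd

data = np.ones((4, 3), dtype=bool)

df_numpy = pd.DataFrame(data)                  # plain np.bool_ columns
df_masked = df_numpy.astype("boolean")         # pd.BooleanDtype (masked)
# df_arrow = df_numpy.astype("bool[pyarrow]")  # pd.ArrowDtype, needs pyarrow

assert df_numpy.dtypes.iloc[0] == np.dtype(bool)
assert df_masked.dtypes.iloc[0] == pd.BooleanDtype()
```

Timing `.any(axis=1)` on each of these frames reproduces the comparison in the report.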
Installed Versions
INSTALLED VERSIONS
commit : 1a2e300
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.90.1-microsoft-standard-WSL2
Version : #1 SMP Fri Jan 27 02:56:13 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.0rc0
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : 2.0.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None
Prior Performance
No response