
PERF: pd.BooleanDtype in row operations is 2000000 times slower #52016

Closed
3 tasks done
leaver2000 opened this issue Mar 16, 2023 · 20 comments · Fixed by #54341
Labels
NA - MaskedArrays (related to pd.NA and nullable extension arrays), Performance (memory or execution speed), Reduction Operations (sum, mean, min, max, etc.), Regression (functionality that used to work in a prior pandas version)
Milestone

Comments

@leaver2000

leaver2000 commented Mar 16, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

shape = 250_000, 100
mask = pd.DataFrame(np.random.randint(0, 1, size=shape))


np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

assert all(isinstance(dtype, pd.BooleanDtype) for dtype in pd_mask.dtypes)
assert all(isinstance(dtype, np.dtype) for dtype in np_mask.dtypes)
# column operations are not that much slower
%timeit pd_mask.any(axis=0) # 16.3 ms
%timeit np_mask.any(axis=0) # 5.86 ms
# using pandas.BooleanDtype back end for ROW operations is MUCH SLOWER
%timeit pd_mask.any(axis=1) # 14.1 s 
%timeit np_mask.any(axis=1) # 6.73 ms
16.3 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.86 ms ± 467 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.1 s ± 329 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.73 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
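Until this is fixed, a practical workaround is to drop to a NumPy-backed bool frame before the row-wise reduction. A scaled-down sketch of the idea (safe only when the frame contains no pd.NA, since astype(bool) raises on missing values):

```python
import numpy as np
import pandas as pd

# Scaled-down version of the frame above; no pd.NA present, so the
# astype(bool) round-trip below is safe.
shape = 1_000, 10
mask = pd.DataFrame(np.random.randint(0, 2, size=shape))
pd_mask = mask.astype(pd.BooleanDtype())

# Cast to NumPy bool first, then reduce row-wise.
fast_any = pd_mask.astype(bool).any(axis=1)
slow_any = pd_mask.any(axis=1).astype(bool)
assert fast_any.equals(slow_any)
```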

Edit: additional context and unexpected behavior

import pandas as pd
import numpy as np
import pyarrow as pa

# Columns WIND_SPEED and WIND_GUST might be any number between 0 and 100, while the QC codes are 0-3
df = pd.DataFrame(
    {
        "WIND_SPEED": np.random.randint(0, 100, size=(100,)),
        "WIND_SPEED_QC": np.random.randint(0, 3, size=(100,)),
        "WIND_GUST": np.random.randint(0, 100, size=(100,)),
        "WIND_GUST_QC": np.random.randint(0, 3, size=(100,)),
    }
# I've been looking into the pyarrow dtypes and encountered an unexpected behavior
).astype(pd.ArrowDtype(pa.uint8()))
# the equality comparison returns a pd.BooleanDtype rather than pd.ArrowDtype
mask = df[["WIND_SPEED_QC", "WIND_GUST_QC"]].__ge__(1)
# this is not expected
assert all(isinstance(x, pd.BooleanDtype) for x in mask.dtypes)
# pd.BooleanDtype
%timeit mask.any(axis=1) # 5.35 ms
# pd.ArrowDtype
%timeit mask.astype(pd.ArrowDtype(pa.bool_())).any(axis=1) # 7.24 ms 
# np.bool_
%timeit mask.astype(bool).any(axis=1) # 197 µs

The pd.BooleanDtype being faster than the pd.ArrowDtype makes sense, but if the backend is going to change under the comparison anyway, it would make more sense for it to change to np.bool_.

5.35 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.24 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
197 µs ± 7.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
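On the comparison spelling: `__ge__(1)` works but the public `ge` method (or the `>=` operator) is the idiomatic form. A quick check with plain NumPy-backed data (so it runs on any pandas build) confirms the spellings agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"WIND_SPEED_QC": np.random.randint(0, 3, size=100)})
# All three spellings produce the same boolean frame.
assert df.ge(1).equals(df.__ge__(1))
assert df.ge(1).equals(df >= 1)
```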

Installed Versions

INSTALLED VERSIONS

commit : 1a2e300
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.90.1-microsoft-standard-WSL2
Version : #1 SMP Fri Jan 27 02:56:13 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0rc0
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : 2.0.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

Prior Performance

No response

@leaver2000 added the Needs Triage and Performance labels Mar 16, 2023
@molsonkiko

molsonkiko commented Mar 16, 2023

Got the same result (essentially) on pandas 1.5.3.

20.3 ms ± 920 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.9 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.01 s ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Details

Installed versions

commit : 2e218d1
python : 3.11.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 126 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.5.3
numpy : 1.25.0.dev0+896.g071388f95
pytz : 2021.3
dateutil : 2.8.2
setuptools : 59.2.0
pip : 23.0.1
Cython : 0.29.33
pytest : 6.2.5
hypothesis : 6.24.1
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.9.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : 2.0.2
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : 2

@steliospetrakis02
Contributor

take

@GrammatikakisDimitris
Contributor

take

@rhshadrach
Member

cc @jbrockmendel; a viable PR to fix this is #51955

@rhshadrach added the NA - MaskedArrays, Reduction Operations, and Regression labels and removed the Needs Triage label Mar 18, 2023
@jbrockmendel
Member

If you want to revive it, OK. Like I've said, MaskedArray reductions with axis=1 are just always going to be slow.

@leaver2000
Author

leaver2000 commented Mar 18, 2023

If you want to revive it, OK. Like I've said, MaskedArray reductions with axis=1 are just always going to be slow.

What about a DtypeWarning or PerformanceWarning to notify the user of the slow op?
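A sketch of what that could look like, using the existing pandas.errors.PerformanceWarning; the wrapper function here is hypothetical, not a pandas API:

```python
import warnings

import numpy as np
import pandas as pd
from pandas.errors import PerformanceWarning

def warn_slow_axis1_any(df, axis=0, **kwargs):
    """Hypothetical helper: warn before a row-wise reduction on masked dtypes."""
    if axis in (1, "columns") and any(
        isinstance(dtype, pd.BooleanDtype) for dtype in df.dtypes
    ):
        warnings.warn(
            "axis=1 reductions on nullable (masked) dtypes are slow",
            PerformanceWarning,
            stacklevel=2,
        )
    return df.any(axis=axis, **kwargs)

df = pd.DataFrame(np.random.randint(0, 2, size=(10, 3))).astype(pd.BooleanDtype())
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = warn_slow_axis1_any(df, axis=1)
assert any(issubclass(w.category, PerformanceWarning) for w in caught)
```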

@JonathanGrant

To fix the performance issue, we can optimize the implementation of row-wise operations with pd.BooleanDtype. Here is a proposed change to the pandas/core/arrays/boolean.py file (note: numexpr has no `any` reduction, so it is emulated below with a `sum` over the valid True entries; the sketch also assumes 2-D data, while BooleanArray is 1-D in pandas):

# pandas/core/arrays/boolean.py

# ...
# Add this import at the top of the file
import numexpr as ne

# ...

class BooleanArray(BaseMaskedArray):
    # ...

    # Add this optimized implementation of the `any` function
    def any(self, axis=None, out=None, keepdims=False, skipna=True):
        if axis == 1:
            values = self._data
            mask = self._mask
            # With skipna=True, NA (masked) entries count as False, so a
            # row has "any" iff it contains at least one valid True.
            counts = ne.evaluate("sum(where(values & ~mask, 1, 0), axis=1)")
            return pd.Series(counts > 0)
        else:
            return super().any(axis=axis, out=out, keepdims=keepdims, skipna=skipna)

    # ...

This change uses the numexpr library to perform row-wise operations more efficiently.

@rhshadrach
Member

rhshadrach commented Mar 19, 2023

I'd like to get others thoughts on whether we should have #51955 go into 2.0. cc @mroeschke @phofl @jorisvandenbossche.

Summary: The perf regression was first noticed in #51335 (comment). I was in favor of partially reverting that PR if it got into the RC, but #51955 didn't make it. With this, I was originally fine with letting the perf regression stand and working on fixing for 2.1, but now I'm second guessing that due to its severity.

@phofl
Member

phofl commented Mar 19, 2023

I'd be ok with the partial revert, but @jbrockmendel is correct that these operations won't ever be really performant

@leaver2000
Author

Maybe a decorator for any & all

from __future__ import annotations

from typing import Callable, Concatenate, ParamSpec, TypeAlias
import functools

P = ParamSpec("P")
BooleanReductionMethod: TypeAlias = "Callable[Concatenate[pd.DataFrame, P], pd.Series[bool]]"

def convert_axis1_bool_dtypes(method: BooleanReductionMethod) -> BooleanReductionMethod:
    @functools.wraps(method)
    def inner(self: pd.DataFrame, *args: P.args, **kwargs: P.kwargs) -> pd.Series[bool]:
        if kwargs.get("axis") in (1, "columns"):
            self = self.astype(bool)
        return method(self, *args, **kwargs)
    return inner
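A self-contained, runnable version of this idea (simplified and untyped, and safe only for frames without pd.NA, since astype(bool) raises on missing values):

```python
import functools

import numpy as np
import pandas as pd

def convert_axis1_bool_dtypes(method):
    @functools.wraps(method)
    def inner(self, *args, **kwargs):
        # Only row-wise reductions pay the masked-array penalty.
        if kwargs.get("axis") in (1, "columns"):
            self = self.astype(bool)
        return method(self, *args, **kwargs)
    return inner

patched_any = convert_axis1_bool_dtypes(pd.DataFrame.any)
df = pd.DataFrame(np.random.randint(0, 2, size=(100, 5))).astype(pd.BooleanDtype())
assert patched_any(df, axis=1).equals(df.astype(bool).any(axis=1))
```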

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 20, 2023

I think short term the partial revert (#51955) might be the best option, but personally I think the more complete solution (also preserving the correct dtypes) like #51923 is doable as well.

When you have DataFrames with all 1D-columnar storage and do row operations, that is indeed always going to be slower than the same operation on a 2D block, no question. But currently we are also leaving a lot on the table, and I think there are easy ways to improve the performance.
For example, for the specific case here (pd_mask.any(axis=1)), a majority of the time is spent transposing the dataframe, and most of that is spent in our generic EA._from_sequence constructor for the masked arrays, even though we know perfectly well that in this case we don't have a generic sequence. While for axis=1 operations we might be able to avoid the transpose call entirely (@rhshadrach's PRs), we could also quite easily improve the performance of transpose itself by avoiding the call to _from_sequence; see some exploration in #52083.
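The transpose cost is easy to observe in isolation (a sketch; absolute numbers and ratios vary by pandas version and machine, so none are asserted here):

```python
import timeit

import numpy as np
import pandas as pd

mask = pd.DataFrame(np.random.randint(0, 2, size=(10_000, 20)))
np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

# axis=1 reductions historically went through a transpose internally;
# time just that step for the two backends.
t_np = timeit.timeit(np_mask.transpose, number=5)
t_pd = timeit.timeit(pd_mask.transpose, number=5)
print(f"numpy-backed transpose: {t_np:.4f}s, masked transpose: {t_pd:.4f}s")
```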

@Alexia-I

Alexia-I commented Jan 4, 2024

Sorry for the inconvenience. However, I've noticed that the issue still persists in version 2.1.4. I had thought that the problem had been resolved by #54341. Could you tell me why it still exists? Thanks.

@rhshadrach
Member

I am not seeing this on 2.1.4. Can you post your timings from the code in the OP?

@Alexia-I

@rhshadrach Sorry for the delayed response; I wasn't subscribed to this issue and missed the notification.
I've tested the case you mentioned and found no issues.

df = pd.DataFrame(np.random.randn(10000, 4), dtype="Float64")
df = df.astype(pd.BooleanDtype())
%timeit df.any(axis=1)

1.34 ms ± 4.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

However, in the original case described, there seems to be a delay in executing %timeit pd_mask.any(axis=1). I'm not sure what's causing this. Could someone help clarify this behavior? Thanks!

import pandas as pd
import numpy as np

shape = 250_000, 100
mask = pd.DataFrame(np.random.randint(0, 1, size=shape))


np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

assert all(isinstance(dtype, pd.BooleanDtype) for dtype in pd_mask.dtypes)
assert all(isinstance(dtype, np.dtype) for dtype in np_mask.dtypes)
# column operations are not that much slower
%timeit pd_mask.any(axis=0) # 16.3 ms
%timeit np_mask.any(axis=0) # 5.86 ms
# using pandas.BooleanDtype back end for ROW operations is MUCH SLOWER
%timeit pd_mask.any(axis=1) # 14.1 s 
%timeit np_mask.any(axis=1) # 6.73 ms

12.5 ms ± 968 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.27 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8 s ± 677 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.72 ms ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
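For reference, the %timeit lines can be reproduced outside IPython with the standard timeit module (scaled down here so it finishes quickly):

```python
import timeit

import numpy as np
import pandas as pd

# Same construction as above, at a smaller scale.
mask = pd.DataFrame(np.random.randint(0, 1, size=(10_000, 20)))
np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

# timeit-based equivalent of the %timeit lines.
for label, frame in [("masked", pd_mask), ("numpy", np_mask)]:
    t = timeit.timeit(lambda: frame.any(axis=1), number=3)
    print(f"{label:>6} any(axis=1): {t / 3:.4f}s per loop")
```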

Also, I initially encountered this performance issue in pandas version 2.0.3. So I'm curious if there are plans to backport the fix to versions prior to 2.1.x, as many users might still be using 2.0.x or earlier versions. Thanks for your assistance!

@rhshadrach
Member

Also, I initially encountered this performance issue in pandas version 2.0.3.

Do you experience the performance issue on 2.1.0 and later?

So I'm curious if there are plans to backport the fix to versions prior to 2.1.x, as many users might still be using 2.0.x or earlier versions.

No - we only support the most recent version. Once 2.1.0 is released, 2.0.x does not see any new releases (barring very exceptional circumstances).

@Alexia-I

Alexia-I commented Jan 13, 2024

@rhshadrach Thank you for your response.

Do you experience the performance issue on 2.1.0 and later?

No, but I am still curious why it seems to exist in the test case.

No - we only support the most recent version. Once 2.1.0 is released, 2.0.x does not see any new releases (barring very exceptional circumstances).

I understand that you only support the most recent version except under very exceptional circumstances. With this in mind, I'm wondering: if a significant performance issue were identified, would backporting a fix to an older version be considered?
If yes, could you tell me how the significance of a performance issue is determined for such considerations? For instance, are there specific criteria, like a considerable slowdown of, say, 1000x or more, that guide this decision?

@rhshadrach
Member

No, but I am still curious why it seems to exist in the test case.

The version you're encountering this on is 2.0.x, and it's only fixed in 2.1.0. Am I misunderstanding?

I'm wondering if a significant performance issue were to be identified, would there be a consideration for backporting a fix to an older version?

No, there will not be any consideration in backporting for the performance regression in axis=1 reductions here.

@Alexia-I

@rhshadrach Sorry for the late response.

The version you're encountering this on is 2.0.x, and it's only fixed in 2.1.0. Am I misunderstanding?

My work does not encounter this in 2.1.4, but the test case below shows a delay when performing reduction operations along axis=1.

import pandas as pd
import numpy as np

shape = 250_000, 100
mask = pd.DataFrame(np.random.randint(0, 1, size=shape))


np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

assert all(isinstance(dtype, pd.BooleanDtype) for dtype in pd_mask.dtypes)
assert all(isinstance(dtype, np.dtype) for dtype in np_mask.dtypes)
# column operations are not that much slower
%timeit pd_mask.any(axis=0) # 16.3 ms
%timeit np_mask.any(axis=0) # 5.86 ms
# using pandas.BooleanDtype back end for ROW operations is MUCH SLOWER
%timeit pd_mask.any(axis=1) # 14.1 s 
%timeit np_mask.any(axis=1) # 6.73 ms
13.7 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.66 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.68 s ± 2.33 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.74 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and I checked the version is 2.1.4.

pd.__version__
'2.1.4'

@rhshadrach
Member

rhshadrach commented Jan 15, 2024

I cannot reproduce; relative to the other lines, your timing for the 3rd line is about 10x slower than mine. Here are my timings on 2.1.4:

8.76 ms ± 49.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.73 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
344 ms ± 8.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.42 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Can you open up a new issue and include the details of pd.show_versions()?

@Alexia-I

Sure, will do.
