
PERF: pd.BooleanDtype in row operations is 2000000 times slower #52016

Closed
3 tasks done
leaver2000 opened this issue Mar 16, 2023 · 20 comments · Fixed by #54341
Labels
NA - MaskedArrays (related to pd.NA and nullable extension arrays), Performance (memory or execution speed), Reduction Operations (sum, mean, min, max, etc.), Regression (functionality that used to work in a prior pandas version)
Milestone

Comments

@leaver2000

leaver2000 commented Mar 16, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

shape = 250_000, 100
mask = pd.DataFrame(np.random.randint(0, 1, size=shape))


np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

assert all(isinstance(dtype, pd.BooleanDtype) for dtype in pd_mask.dtypes)
assert all(isinstance(dtype, np.dtype) for dtype in np_mask.dtypes)
# column operations are not that much slower
%timeit pd_mask.any(axis=0) # 16.3 ms
%timeit np_mask.any(axis=0) # 5.86 ms
# using pandas.BooleanDtype back end for ROW operations is MUCH SLOWER
%timeit pd_mask.any(axis=1) # 14.1 s 
%timeit np_mask.any(axis=1) # 6.73 ms
16.3 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.86 ms ± 467 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
14.1 s ± 329 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
6.73 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
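Until this is fixed, a practical workaround is to drop to a NumPy-backed bool frame before the row-wise reduction. A scaled-down sketch of the idea (safe only when the frame contains no pd.NA, since astype(bool) raises on missing values):

```python
import numpy as np
import pandas as pd

# Scaled-down version of the frame above; no pd.NA present, so the
# astype(bool) round-trip below is safe.
shape = 1_000, 10
mask = pd.DataFrame(np.random.randint(0, 2, size=shape))
pd_mask = mask.astype(pd.BooleanDtype())

# Cast to NumPy bool first, then reduce row-wise.
fast_any = pd_mask.astype(bool).any(axis=1)
slow_any = pd_mask.any(axis=1).astype(bool)
assert fast_any.equals(slow_any)
```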

Edit: additional context and unexpected behavior

import pandas as pd
import numpy as np
import pyarrow as pa

# Columns WIND_SPEED and WIND_GUST might be any number between 0 and 100, while the QC codes are 0-3
df = pd.DataFrame(
    {
        "WIND_SPEED": np.random.randint(0, 100, size=(100,)),
        "WIND_SPEED_QC": np.random.randint(0, 3, size=(100,)),
        "WIND_GUST": np.random.randint(0, 100, size=(100,)),
        "WIND_GUST_QC": np.random.randint(0, 3, size=(100,)),
    }
# I've been looking into the pyarrow dtypes and encountered an unexpected behavior
).astype(pd.ArrowDtype(pa.uint8()))
# the equality comparison returns a pd.BooleanDtype rather than pd.ArrowDtype
mask = df[["WIND_SPEED_QC", "WIND_GUST_QC"]].__ge__(1)
# this is not expected
assert all(isinstance(x, pd.BooleanDtype) for x in mask.dtypes)
# pd.BooleanDtype
%timeit mask.any(axis=1) # 5.35 ms
# pd.ArrowDtype
%timeit mask.astype(pd.ArrowDtype(pa.bool_())).any(axis=1) # 7.24 ms 
# np.bool_
%timeit mask.astype(bool).any(axis=1) # 197 µs

The pd.BooleanDtype being faster than the pd.ArrowDtype makes sense, but if the backend is going to change under the comparison anyway, it would make more sense for it to change to np.bool_.

5.35 ms ± 377 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.24 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
197 µs ± 7.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
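On the comparison spelling: `__ge__(1)` works but the public `ge` method (or the `>=` operator) is the idiomatic form. A quick check with plain NumPy-backed data (so it runs on any pandas build) confirms the spellings agree:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"WIND_SPEED_QC": np.random.randint(0, 3, size=100)})
# All three spellings produce the same boolean frame.
assert df.ge(1).equals(df.__ge__(1))
assert df.ge(1).equals(df >= 1)
```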

Installed Versions

INSTALLED VERSIONS

commit : 1a2e300
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.90.1-microsoft-standard-WSL2
Version : #1 SMP Fri Jan 27 02:56:13 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.0rc0
numpy : 1.24.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.11.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.3.0
scipy : None
snappy : None
sqlalchemy : 2.0.4
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : None
qtpy : None
pyqt5 : None

Prior Performance

No response

@leaver2000 added the Needs Triage and Performance labels Mar 16, 2023
@molsonkiko

molsonkiko commented Mar 16, 2023

Got the same result (essentially) on pandas 1.5.3.

20.3 ms ± 920 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
13.9 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.01 s ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Details

Installed versions

commit : 2e218d1
python : 3.11.1.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 126 Stepping 5, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.5.3
numpy : 1.25.0.dev0+896.g071388f95
pytz : 2021.3
dateutil : 2.8.2
setuptools : 59.2.0
pip : 23.0.1
Cython : 0.29.33
pytest : 6.2.5
hypothesis : 6.24.1
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.1.2
IPython : 8.9.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : 2.0.2
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : 2

@steliospetrakis02
Contributor

take

@GrammatikakisDimitris
Contributor

take

@rhshadrach
Member

cc @jbrockmendel; a viable PR to fix this is #51955

@rhshadrach added the NA - MaskedArrays, Reduction Operations, and Regression labels and removed the Needs Triage label Mar 18, 2023
@jbrockmendel
Member

If you want to revive it, OK. Like I've said, MaskedArray reductions with axis=1 are just always going to be slow.

@leaver2000
Author

leaver2000 commented Mar 18, 2023

If you want to revive it, OK. Like I've said, MaskedArray reductions with axis=1 are just always going to be slow.

What about a DtypeWarning or PerformanceWarning to notify the user of the slow op?
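A sketch of what that could look like, using the existing pandas.errors.PerformanceWarning; the wrapper function here is hypothetical, not a pandas API:

```python
import warnings

import numpy as np
import pandas as pd
from pandas.errors import PerformanceWarning

def warn_slow_axis1_any(df, axis=0, **kwargs):
    """Hypothetical helper: warn before a row-wise reduction on masked dtypes."""
    if axis in (1, "columns") and any(
        isinstance(dtype, pd.BooleanDtype) for dtype in df.dtypes
    ):
        warnings.warn(
            "axis=1 reductions on nullable (masked) dtypes are slow",
            PerformanceWarning,
            stacklevel=2,
        )
    return df.any(axis=axis, **kwargs)

df = pd.DataFrame(np.random.randint(0, 2, size=(10, 3))).astype(pd.BooleanDtype())
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = warn_slow_axis1_any(df, axis=1)
assert any(issubclass(w.category, PerformanceWarning) for w in caught)
```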

@JonathanGrant

To fix the performance issue, we can optimize the implementation of row-wise operations with pd.BooleanDtype. Here is a proposed change to the pandas/core/arrays/boolean.py file (note: numexpr has no `any` reduction, so it is emulated below with a `sum` over the valid True entries; the sketch also assumes 2-D data, while BooleanArray is 1-D in pandas):

# pandas/core/arrays/boolean.py

# ...
# Add this import at the top of the file
import numexpr as ne

# ...

class BooleanArray(BaseMaskedArray):
    # ...

    # Add this optimized implementation of the `any` function
    def any(self, axis=None, out=None, keepdims=False, skipna=True):
        if axis == 1:
            values = self._data
            mask = self._mask
            # With skipna=True, NA (masked) entries count as False, so a
            # row has "any" iff it contains at least one valid True.
            counts = ne.evaluate("sum(where(values & ~mask, 1, 0), axis=1)")
            return pd.Series(counts > 0)
        else:
            return super().any(axis=axis, out=out, keepdims=keepdims, skipna=skipna)

    # ...

This change uses the numexpr library to perform row-wise operations more efficiently.

@rhshadrach
Member

rhshadrach commented Mar 19, 2023

I'd like to get others thoughts on whether we should have #51955 go into 2.0. cc @mroeschke @phofl @jorisvandenbossche.

Summary: The perf regression was first noticed in #51335 (comment). I was in favor of partially reverting that PR if it got into the RC, but #51955 didn't make it. With this, I was originally fine with letting the perf regression stand and working on fixing for 2.1, but now I'm second guessing that due to its severity.

@phofl
Member

phofl commented Mar 19, 2023

I'd be ok with the partial revert, but @jbrockmendel is correct that these operations won't ever be really performant

@leaver2000
Author

Maybe a decorator for any & all

from __future__ import annotations

from typing import Callable, Concatenate, ParamSpec, TypeAlias
import functools

P = ParamSpec("P")
BooleanReductionMethod: TypeAlias = "Callable[Concatenate[pd.DataFrame, P], pd.Series[bool]]"

def convert_axis1_bool_dtypes(method: BooleanReductionMethod) -> BooleanReductionMethod:
    @functools.wraps(method)
    def inner(self: pd.DataFrame, *args: P.args, **kwargs: P.kwargs) -> pd.Series[bool]:
        if kwargs.get("axis") in (1, "columns"):
            self = self.astype(bool)
        return method(self, *args, **kwargs)
    return inner
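A self-contained, runnable version of this idea (simplified and untyped, and safe only for frames without pd.NA, since astype(bool) raises on missing values):

```python
import functools

import numpy as np
import pandas as pd

def convert_axis1_bool_dtypes(method):
    @functools.wraps(method)
    def inner(self, *args, **kwargs):
        # Only row-wise reductions pay the masked-array penalty.
        if kwargs.get("axis") in (1, "columns"):
            self = self.astype(bool)
        return method(self, *args, **kwargs)
    return inner

patched_any = convert_axis1_bool_dtypes(pd.DataFrame.any)
df = pd.DataFrame(np.random.randint(0, 2, size=(100, 5))).astype(pd.BooleanDtype())
assert patched_any(df, axis=1).equals(df.astype(bool).any(axis=1))
```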

@jorisvandenbossche
Member

jorisvandenbossche commented Mar 20, 2023

I think short term the partial revert (#51955) might be the best option, but personally I think the more complete solution (also preserving the correct dtypes) like #51923 is doable as well.

When you have DataFrames with all 1D-columnar storage and do row operations, that is indeed always going to be slower than the same operation on a 2D block, no question. But currently we are also leaving a lot on the table, and I think there are easy ways to improve the performance.
For example, for the specific case here (pd_mask.any(axis=1)), a majority of the time is spent transposing the dataframe, and most of that is spent in our generic EA._from_sequence constructor for the masked arrays, even though we know perfectly well that in this case we don't have a generic sequence. While for axis=1 operations we might be able to avoid the transpose call entirely (@rhshadrach's PRs), we could also quite easily improve the performance of transpose itself by avoiding the call to _from_sequence; see some exploration in #52083.
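The transpose cost is easy to observe in isolation (a sketch; absolute numbers and ratios vary by pandas version and machine, so none are asserted here):

```python
import timeit

import numpy as np
import pandas as pd

mask = pd.DataFrame(np.random.randint(0, 2, size=(10_000, 20)))
np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

# axis=1 reductions historically went through a transpose internally;
# time just that step for the two backends.
t_np = timeit.timeit(np_mask.transpose, number=5)
t_pd = timeit.timeit(pd_mask.transpose, number=5)
print(f"numpy-backed transpose: {t_np:.4f}s, masked transpose: {t_pd:.4f}s")
```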

@Alexia-I

Alexia-I commented Jan 4, 2024

Sorry for the inconvenience. However, I've noticed that the issue still persists in version 2.1.4. I had thought that the problem had been resolved by #54341. Could you tell me why it still exists? Thanks.

@rhshadrach
Member

I am not seeing this on 2.1.4. Can you post your timings from the code in the OP?

@Alexia-I

@rhshadrach Sorry for the delayed response; I wasn't subscribed to this issue and missed the notification.
I've tested the case you mentioned and found no issues.

df = pd.DataFrame(np.random.randn(10000, 4), dtype="Float64")
df = df.astype(pd.BooleanDtype())
%timeit df.any(axis=1)

1.34 ms ± 4.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

However, in the original case described, there seems to be a delay in executing %timeit pd_mask.any(axis=1). I'm not sure what's causing this. Could someone help clarify this behavior? Thanks!

import pandas as pd
import numpy as np

shape = 250_000, 100
mask = pd.DataFrame(np.random.randint(0, 1, size=shape))


np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

assert all(isinstance(dtype, pd.BooleanDtype) for dtype in pd_mask.dtypes)
assert all(isinstance(dtype, np.dtype) for dtype in np_mask.dtypes)
# column operations are not that much slower
%timeit pd_mask.any(axis=0) # 16.3 ms
%timeit np_mask.any(axis=0) # 5.86 ms
# using pandas.BooleanDtype back end for ROW operations is MUCH SLOWER
%timeit pd_mask.any(axis=1) # 14.1 s 
%timeit np_mask.any(axis=1) # 6.73 ms

12.5 ms ± 968 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.27 ms ± 134 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8 s ± 677 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.72 ms ± 22.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
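For reference, the %timeit lines can be reproduced outside IPython with the standard timeit module (scaled down here so it finishes quickly):

```python
import timeit

import numpy as np
import pandas as pd

# Same construction as above, at a smaller scale.
mask = pd.DataFrame(np.random.randint(0, 1, size=(10_000, 20)))
np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

# timeit-based equivalent of the %timeit lines.
for label, frame in [("masked", pd_mask), ("numpy", np_mask)]:
    t = timeit.timeit(lambda: frame.any(axis=1), number=3)
    print(f"{label:>6} any(axis=1): {t / 3:.4f}s per loop")
```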

Also, I initially encountered this performance issue in pandas version 2.0.3. So I'm curious if there are plans to backport the fix to versions prior to 2.1.x, as many users might still be using 2.0.x or earlier versions. Thanks for your assistance!

@rhshadrach
Member

Also, I initially encountered this performance issue in pandas version 2.0.3.

Do you experience the performance issue on 2.1.0 and later?

So I'm curious if there are plans to backport the fix to versions prior to 2.1.x, as many users might still be using 2.0.x or earlier versions.

No - we only support the most recent version. Once 2.1.0 is released, 2.0.x does not see any new releases (barring very exceptional circumstances).

@Alexia-I

Alexia-I commented Jan 13, 2024

@rhshadrach Thank you for your response.

Do you experience the performance issue on 2.1.0 and later?

No, but I am still curious why it seems to exist in the test case.

No - we only support the most recent version. Once 2.1.0 is released, 2.0.x does not see any new releases (barring very exceptional circumstances).

I understand that you only support the most recent version except under very exceptional circumstances. With this in mind, I'm wondering: if a significant performance issue were identified, would backporting a fix to an older version be considered?
If yes, could you tell me how the significance of a performance issue is determined for such considerations? For instance, are there specific criteria, like a considerable slowdown of, say, 1000x or more, that guide this decision?

@rhshadrach
Member

No, but I am still curious why it seems to exist in the test case.

The version you're encountering this on is 2.0.x, and it's only fixed in 2.1.0. Am I misunderstanding?

I'm wondering if a significant performance issue were to be identified, would there be a consideration for backporting a fix to an older version?

No, there will not be any consideration in backporting for the performance regression in axis=1 reductions here.

@Alexia-I

@rhshadrach Sorry for the late response.

The version you're encountering this on is 2.0.x, and it's only fixed in 2.1.0. Am I misunderstanding?

My work does not encounter this in 2.1.4, but the test case below shows a delay when performing reduction operations along axis=1.

import pandas as pd
import numpy as np

shape = 250_000, 100
mask = pd.DataFrame(np.random.randint(0, 1, size=shape))


np_mask = mask.astype(bool)
pd_mask = mask.astype(pd.BooleanDtype())

assert all(isinstance(dtype, pd.BooleanDtype) for dtype in pd_mask.dtypes)
assert all(isinstance(dtype, np.dtype) for dtype in np_mask.dtypes)
# column operations are not that much slower
%timeit pd_mask.any(axis=0) # 16.3 ms
%timeit np_mask.any(axis=0) # 5.86 ms
# using pandas.BooleanDtype back end for ROW operations is MUCH SLOWER
%timeit pd_mask.any(axis=1) # 14.1 s 
%timeit np_mask.any(axis=1) # 6.73 ms
13.7 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.66 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.68 s ± 2.33 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
7.74 ms ± 62.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and I checked the version is 2.1.4.

pd.__version__
'2.1.4'

@rhshadrach
Member

rhshadrach commented Jan 15, 2024

I cannot reproduce; relative to the other lines, your timing for the 3rd line is about 10x slower than mine. Here are my timings on 2.1.4:

8.76 ms ± 49.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.73 ms ± 15 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
344 ms ± 8.44 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2.42 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Can you open up a new issue and include the details of pd.show_versions()?

@Alexia-I

Sure, will do.
