PERF: improve 2D array access / transpose() for masked dtypes #52083
Conversation
```python
if isinstance(dtype, BaseMaskedDtype):
    data, mask = self._mgr.as_array_masked()
    new_values = [arr_type(data[i], mask[i]) for i in range(self.shape[0])]
else:
    values = self.values
    new_values = [
        arr_type._from_sequence(row, dtype=dtype) for row in values
    ]
```
So this is the main change (it brings the bulk of the speed-up for transpose).
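As a rough illustration of the direct-constructor path (a sketch with made-up data, not the PR's internals): the masked array classes accept pre-built data and mask ndarrays, which skips the per-element inference machinery of `_from_sequence`.

```python
import numpy as np
from pandas.arrays import IntegerArray

# Direct construction from separate data + mask arrays, as the fast
# path above does; no per-element inference takes place.
data = np.array([1, 2, 3], dtype="int64")
mask = np.array([False, True, False])  # True marks a missing value
arr = IntegerArray(data, mask)
```

The same pattern applies to `BooleanArray` and `FloatingArray`, which is why the fast path can dispatch on `arr_type` generically.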
```python
# if len(df._mgr) > 0:
#     common_dtype = find_common_type(list(df._mgr.get_dtypes()))
#     is_masked_ea = isinstance(common_dtype, BaseMaskedDtype)
#     is_np = isinstance(common_dtype, np.dtype)
# else:
#     common_dtype = None
```
```python
# if axis == 1 and common_dtype and is_masked_ea:
#     data, mask = self._mgr.as_array_masked()
#     ea2d = common_dtype.construct_array_type()(data, mask)
#     result = ea2d._reduce(name, axis=axis, skipna=skipna, **kwds)
#     labels = self._get_agg_axis(axis)
#     result = self._constructor_sliced(result, index=labels, copy=False)
#     return result
```
@rhshadrach this relates to #51923, and making use of the `as_array_masked` added in this PR could be a potential different way of tackling it.
Manually running one of the benchmarks you mentioned in #51955 (the bottom row, `stat_ops.FrameOps.time_op('mean', 'Int64', 1)`: `8.94±0s` vs `24.3±0.3ms`, ratio `0.00`), with the above uncommented, I get:

```python
In [1]: values = np.random.randn(100000, 4)
   ...: df = pd.DataFrame(values.astype(int)).astype("Int64")

In [2]: %timeit df.mean(axis=1)
5.97 s ± 434 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)      # <-- main
1.74 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- PR
```
Now, this is probably a bit too optimistic, because `as_array_masked` currently assumes all dtypes are the same; with the use of `find_common_type` here, it would need to handle some extra cases as well. But in general it should also give a way to speed up this specific case (while using the actual masked reduction implementations, which already support 2D with axis=1).
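For illustration, `find_common_type` (a pandas-internal helper, so the import path may change between versions) already resolves a mix of masked dtypes to a single masked dtype:

```python
import pandas as pd
# NOTE: internal API, not part of the public pandas surface
from pandas.core.dtypes.cast import find_common_type

# mixing masked integer and float dtypes resolves to a masked float dtype
common = find_common_type([pd.Int64Dtype(), pd.Float64Dtype()])
```

So the extra work is mainly casting each column to `common` before handing it to `as_array_masked`.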
> Now, this is probably a bit too optimistic, because the `as_array_masked` is currently assuming all the same dtypes, but with the use of `find_common_type` here, it would need to handle some extra cases as well.
As long as we're not doing inference, the transpose is guaranteed to be a single dtype. This is because two different dtypes in the transpose could only arise from two different dtypes in the rows of the original. So we can determine the common dtype, cast to a single dtype, and then transpose.
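A minimal sketch of that cast-then-transpose idea (example data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "a": pd.array([1, 2], dtype="Int64"),
    "b": pd.array([1.5, None], dtype="Float64"),
})
# Cast everything to the common dtype first, so the transpose only
# ever has to deal with a single masked dtype.
out = df.astype("Float64").T
```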
Doing a cast beforehand might indeed be the easiest; then `as_array_masked` doesn't need to handle the case of multiple dtypes itself.
Is the concern about transpose specifically, or more about the axis=1 reductions? If the latter, would a `reduce_axis1` approach that avoids the transpose altogether be more robust? It would be nice to avoid special-casing our internal EAs. Can any of this eventually be extended to arrow-backed cases?
For those ops where there is a simple online algorithm (e.g. mean, sum, stddev, variance, sem, prod, kurtosis), I do think this would be a good option to look into. However, there will be ops for which this won't work (median, quantile, rank, mode, nunique).
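To sketch what a row-wise reduction that skips the transpose could look like for mean (a hand-rolled illustration operating on raw data/mask arrays, not pandas code):

```python
import numpy as np

def masked_mean_axis1(data: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise mean of a 2D data array with a boolean mask (True = missing)."""
    valid = ~mask
    sums = np.where(valid, data, 0.0).sum(axis=1)
    counts = valid.sum(axis=1)
    # rows with no valid values become NaN instead of dividing by zero
    return np.divide(sums, counts, out=np.full(len(sums), np.nan), where=counts > 0)
```

Ops like median or rank need all row values at once, which is why they can't be expressed this way.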
Yes, I also mentioned that in #52016 (comment). I think we can certainly try to avoid the transpose altogether, but improving transpose itself might also be worth it (plus, for some methods the transpose cannot easily be avoided, as Richard just commented).
Yes, in general the arrow dtypes will suffer from the same problems for row-based operations (as long as arrow doesn't add specific functionality for that). I think it could be quite easy to extend the approach here to work for arrow dtypes as well (converting their bitmask to a bytemask).
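The bitmask-to-bytemask conversion mentioned here can be done with NumPy alone. A sketch, assuming Arrow's validity-bitmap layout (LSB-first bits marking *valid* entries, whereas pandas masks mark *missing* ones):

```python
import numpy as np

def bitmask_to_bytemask(bitmap: bytes, length: int) -> np.ndarray:
    """Expand an Arrow validity bitmap into a pandas-style missing-value mask."""
    bits = np.unpackbits(np.frombuffer(bitmap, dtype=np.uint8), bitorder="little")
    valid = bits[:length].astype(bool)
    return ~valid  # pandas masks use True for missing
```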
```python
# # TODO(CoW) handle case where resulting array is a view
# if len(self.blocks) == 0:
#     arr = np.empty(self.shape, dtype=float)
#     return arr.transpose()
```
@jorisvandenbossche - I don't follow this todo - can you elaborate?
```diff
@@ -2544,6 +2548,7 @@ def _from_arrays(
     dtype=dtype,
     verify_integrity=verify_integrity,
     typ=manager,
+    is_1d_ea_only=is_1d_ea_only,
```
`arrays_to_mgr` already has a `consolidate` keyword; could we use that instead?
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Closing to clear the queue, but feel free to reopen when you have time to revisit.
xref #52016

For `transpose()` or operations that do a transpose under the hood (typically operations with axis=1), this is generally expensive if you have extension dtypes. This PR explores adding a fast path for the specific case of a DataFrame with all-masked dtypes. Using the example dataframe from #52016, this gives a speed-up.

Note that several different optimizations are combined here (they can be split up, and the branch still needs to be cleaned up in general).

The main speed-up comes from calling `BooleanArray(..)` instead of `BooleanArray._from_sequence(..)` (or likewise for any other masked array class), because the generic `_from_sequence` has a lot of overhead (which adds up when calling it 250_000 times, as in this example). To be able to use the main `__init__`, I added a version of `mgr.as_array()` that returns separate data and mask arrays.

Further, after the arrays are constructed, the BlockManager construction can be optimized by passing `verify_integrity=False` and by knowing that we have all 1D EAs (so there is no need to do block dtype grouping etc.).

Another tiny optimization in `BooleanArray(..)` avoids creating a new dtype instance every time (this can easily be split off; we might want to do that generally for all primitive dtypes, and perhaps rather in the `__init__` of the dtype).