DEPR: make_block #56815
Thinking about this again, I wonder if we need to explicitly deprecate this function, or if this would be auto-deprecated when we deprecate the entirety of core for 3.0.
I have been working on a branch for pyarrow testing an alternative implementation. I ran some benchmarks using that implementation, for varying sizes (1 million and 10 million elements) and numbers of columns (10, 100, 1000). I am using mixed dtypes so the result will have multiple blocks, where the creation of the block array involves a copy (so it's not benchmarking a zero-copy conversion, where the python overhead would be relatively bigger). At the same time, I am only using int and float, i.e. data types where this conversion is relatively cheap (compared to converting strings to object dtype).

Summary figure ("blocks" is the current implementation, "concat" the new one, for varying data sizes; the dark part of each bar is the time of the actual conversion of Arrow memory to numpy arrays, the lighter transparent part of each bar is the additional python overhead to create the final dataframe):

(figure not included in this extract)

For the two left-most cases (1 million elements, 10 or 100 columns) the slowdown is around 2x and 1.5x, respectively. Of course, this is for small absolute numbers (around 1 to 2 ms). For the larger dataframe (10 million elements), the relative slowdown is much smaller. That is for the default (consolidated) conversion; obviously there are alternative ways to create the DataFrame.
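For reference, here is a minimal sketch of the kind of benchmark setup described above. The column layout and timing harness are assumptions for illustration; the "concat" branch itself is not reproduced here, so this only times the current block-based conversion:

```python
import timeit

import numpy as np
import pyarrow as pa

# Mixed int/float columns so the resulting DataFrame has multiple blocks
# and the block array creation involves a copy, as in the description above.
for n_elements in (1_000_000, 10_000_000):
    for n_cols in (10, 100, 1000):
        n_rows = n_elements // n_cols
        table = pa.table({
            f"col{i}": (np.arange(n_rows) if i % 2 else np.random.randn(n_rows))
            for i in range(n_cols)
        })
        t = timeit.timeit(table.to_pandas, number=5) / 5
        print(f"{n_elements} elements, {n_cols} cols: {t:.4f}s per conversion")
```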
Above is just a description of the testing and benchmarking I did. Now for how to move forward.

For the default consolidated DataFrame creation, there are two obvious options: construct the result through public APIs (the subframes+concat+reindex approach benchmarked above), or have pandas provide a small helper function that keeps doing the block-based construction for us.

For the non-consolidated case (one block per column), the options are essentially the same.
For this case as well, I think my personal preference would be to provide those two helper functions for both cases. They are easy to implement on our side (and should be relatively easy to maintain), provide the fastest option (even though the performance benefit is only minor in the consolidated case, it still cheaply avoids some overhead), and also give convenience to the few users who need this (no need to reimplement the subframes+concat+reindex logic, which is not that complex, in multiple places; a rough sketch of that logic follows below).
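To make that logic concrete, here is a rough, hypothetical sketch of such a subframes+concat+reindex construction from per-column numpy arrays (the function name and signature are made up for illustration):

```python
import numpy as np
import pandas as pd

def frame_via_concat(arrays: dict[str, np.ndarray]) -> pd.DataFrame:
    # Group the columns by dtype so each group becomes one consolidated
    # subframe (a single 2D block) instead of one block per column.
    groups: dict[np.dtype, dict[str, np.ndarray]] = {}
    for name, arr in arrays.items():
        groups.setdefault(arr.dtype, {})[name] = arr
    # Build one DataFrame per dtype group, concatenate them side by side,
    # and reindex to restore the original column order.
    subframes = [pd.DataFrame(group) for group in groups.values()]
    return pd.concat(subframes, axis=1).reindex(columns=list(arrays))

df = frame_via_concat(
    {"a": np.arange(3), "b": np.ones(3), "c": np.arange(3, 6)}
)
```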
What this could look like is something like this:

```python
import numpy as np

from pandas import DataFrame
from pandas._libs.internals import BlockPlacement
from pandas.core.arrays import ExtensionArray
from pandas.core.dtypes.dtypes import ExtensionDtype
from pandas.core.internals.blocks import (
    Block,
    ensure_block_shape,
    get_block_type,
    maybe_coerce_values,
)
from pandas.core.internals.managers import BlockManager


def _make_block(values: ExtensionArray | np.ndarray, placement: np.ndarray) -> Block:
    """
    This is an analogue to blocks.new_block(_2d) that ensures:

    1) correct dimension for EAs that support 2D (`ensure_block_shape`), and
    2) correct EA class for datetime64/timedelta64 (`maybe_coerce_values`).

    The input `values` is assumed to be either a numpy array or ExtensionArray:

    - In case of a numpy array, it is assumed to already be in the expected
      shape for Blocks (2D, (cols, rows)).
    - In case of an ExtensionArray the input can be 1D, also for EAs that are
      internally stored as 2D.

    For the rest no preprocessing or validation is done, except for those dtypes
    that are internally stored as EAs but have an exact numpy equivalent (and at
    the moment use that numpy dtype), i.e. datetime64/timedelta64.
    """
    dtype = values.dtype
    klass = get_block_type(dtype)
    placement = BlockPlacement(placement)
    if isinstance(dtype, ExtensionDtype) and dtype._supports_2d:
        values = ensure_block_shape(values, ndim=2)
    values = maybe_coerce_values(values)
    return klass(values, ndim=2, placement=placement)


def dataframe_from_blocks(blocks, index, columns):
    blocks = [_make_block(*block) for block in blocks]
    axes = [columns, index]
    mgr = BlockManager(blocks, axes)
    return DataFrame._from_mgr(mgr, mgr.axes)
```

I think from my analysis above it is clear that performance is not a very strong argument for needing this (given it is mostly a small, fixed overhead that becomes insignificant for very large data, unless you need to construct a lot of small dataframes). But given that it is very easy to do on our side, I would say: why not just expose something like the above, providing a constructor with the most minimal overhead (without exposing any actual internal objects), and make life easier for the few downstream projects that could use this.
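For illustration, here is a hypothetical call of the `dataframe_from_blocks` helper above with made-up data (two 2D numpy blocks whose placements map block rows to DataFrame column positions):

```python
import numpy as np
import pandas as pd

blocks = [
    # One float64 block holding columns "a" and "c" (placements 0 and 2).
    (np.ones((2, 4)), np.array([0, 2])),
    # One int64 block holding column "b" (placement 1).
    (np.arange(4).reshape(1, 4), np.array([1])),
]
df = dataframe_from_blocks(
    blocks, index=pd.RangeIndex(4), columns=pd.Index(["a", "b", "c"])
)
```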
make_block was deprecated and then reverted before the 2.2 release: #56481. On the dev call today (2024-01-10), it was agreed that a deprecation would be issued during 3.0.