DEPR: make_block #56815

Closed
mroeschke opened this issue Jan 10, 2024 · 4 comments · Fixed by #57754
Labels: Blocker for rc · Deprecate · Internals
Milestone: 3.0

Comments

mroeschke (Member) commented Jan 10, 2024

make_block was deprecated and then reverted before the 2.2 release: #56481

On the dev call today (2024-01-10), it was agreed that a deprecation would be issued during 3.0

mroeschke added the Internals and Deprecate labels Jan 10, 2024
mroeschke added this to the 3.0 milestone Jan 10, 2024
lithomas1 (Member) commented

Thinking about this again, I wonder if we need to explicitly deprecate this function, or if this would be auto-deprecated when we deprecate the entirety of core for 3.0.

lithomas1 added the Blocker label Jan 11, 2024
lithomas1 added the Blocker for rc label and removed the Blocker label Feb 10, 2024
@jbrockmendel jbrockmendel mentioned this issue Mar 7, 2024
jorisvandenbossche (Member) commented Mar 29, 2024

I have been working on a branch for pyarrow testing an alternative to_pandas implementation. For the draft code see apache/arrow#40897 (currently it adds a use_blocks keyword to toggle the new implementation; that's of course not meant to stay, but just makes the different implementations easier to test alongside each other).

I ran some benchmarks using that implementation, for varying sizes (1 million and 10 million elements) and numbers of columns (10, 100, 1000). I am using mixed dtypes so the result will have multiple blocks, where the creation of the block array involves a copy (so it's not benchmarking a zero-copy conversion, where the python overhead would be relatively bigger). At the same time I am just using int and float, so data types for which this conversion is relatively cheap (compared to converting strings to object dtype).
Code: https://gist.github.com/jorisvandenbossche/858017871ad7850aaf2cae142d4d4ad0
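
For reference, a minimal sketch of the kind of setup being benchmarked (the real code is in the gist above; the sizes and dtypes here are just illustrative):

import numpy as np
import pyarrow as pa

# 1 million elements spread over 10 columns, alternating int64/float64
# so the resulting DataFrame has multiple blocks
n_cols = 10
n_rows = 1_000_000 // n_cols
table = pa.table(
    {
        f"col{i}": np.arange(n_rows, dtype="int64" if i % 2 else "float64")
        for i in range(n_cols)
    }
)
df = table.to_pandas()  # the call being timed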

Summary figure ("blocks" is the current implementation, "concat" the new one, for varying data sizes. The dark part of each bar is the time of the actual conversion of Arrow memory to numpy arrays; the lighter transparent part is the additional python overhead to create the final dataframe):

[figure: benchmark summary, "blocks" vs "concat" implementations]

For the two left-most cases (1 million elements, 10 or 100 columns) the slowdown is around 2x and 1.5x, respectively. Of course, this is for small numbers (around 1 to 2 ms). For the larger dataframe (10 million elements), the relative slowdown is much smaller.

That is for the default behaviour of pyarrow.Table.to_pandas, which creates a default consolidated pandas.DataFrame.
There is however a split_blocks option that can be enabled to avoid this consolidation and get potentially zero-copy conversions (although that still depends on the dtype and presence of nulls) and a smaller memory increase.
Right now both modes (with and without split_blocks) use the same block+dataframe creation. If we also just use the new implementation for split_blocks, we end up creating a DataFrame for each column (using a Series might give a small improvement), and that obviously gives quite some overhead:

[figure: benchmark results for the split_blocks=True case]

Obviously there are alternative ways to create the DataFrame. Just using DataFrame(dict(zip(names, arrays)), copy=False) is already faster, but still gives some avoidable overhead.
We do have pd.DataFrame._from_arrays (as was also mentioned in the discussion in #56422), but the problem right now is that this is 1) also private API, and 2) it consolidates (though it could easily be made not to by just passing down a keyword).
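
For reference, that simpler alternative could look like this (the column names and arrays are just illustrative):

import numpy as np
import pandas as pd

names = ["ints", "floats"]
arrays = [np.arange(5), np.linspace(0, 1, 5)]
# one 1D array per column; copy=False avoids copying the input arrays
df = pd.DataFrame(dict(zip(names, arrays)), copy=False)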

jorisvandenbossche (Member) commented

Above is just a description of the testing and benchmarking I did. Now for how to move forward:

For the default consolidated DataFrame creation, two obvious options:

  • We are fine with the reported overhead of the subframes+concat+reindex implementation (a rough sketch is shown right after this list), and this is the recommended way to create a DataFrame from a set of 2D arrays (and pyarrow can do this conditional on the pandas version, i.e. only when running with pandas 3.0)
  • We provide a helper function for advanced users that essentially does what pyarrow was doing before (create blocks from the arrays, and combine blocks in a manager). But by implementing it on our side, we can happily use our internals, and external projects that need this don't have to touch blocks or managers. For example, we could expose a pandas.api.internals.dataframe_from_blocks(..) that accepts a list of (array, placement) tuples plus index and columns object.
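
A rough sketch of what that subframes+concat+reindex approach could look like on the pyarrow side (the function and variable names are hypothetical; it assumes each 2D array holds the columns of one dtype group, with a placement array giving their positions in the final column order, and that `columns` is a pandas Index):

import numpy as np
import pandas as pd

def dataframe_from_2d_arrays(arrays, placements, index, columns):
    # one small DataFrame per (n_cols_in_group, n_rows) array
    subframes = [
        pd.DataFrame(arr.T, index=index, columns=columns[placement])
        for arr, placement in zip(arrays, placements)
    ]
    # concatenate along the columns and restore the original column order
    return pd.concat(subframes, axis=1).reindex(columns=columns)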

For the non-consolidated (split_blocks=True) case, I am also thinking of two options:

  • We expose _from_arrays publicly as DataFrame.from_arrays and ensure that it has a keyword to disable the consolidation (e.g. a copy=False). I don't know the details of the current _from_arrays implementation well enough to judge how "ready" it is to expose publicly, though (e.g. how robust it is for varied user input; PyArrow is fine with a method that requires nicely prepared arrays)
  • We expose a form of _from_arrays (again without the consolidation or an option to disable it) in a helper function for advanced users. For example (to mimic the above), we could expose a pandas.api.internals.dataframe_from_arrays(..) that accepts a list of 1D arrays plus index and columns object.

For this case DataFrame(dict) is also an option, although that has some overhead compared to _from_arrays (compared to the current situation it's good enough, but _from_arrays gives the option to actually improve performance here).
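
For context, this is roughly what using the existing private constructor looks like today (private API, so purely illustrative, and the exact signature may differ between versions; note that it currently still consolidates the result):

import numpy as np
import pandas as pd

names = ["a", "b", "c"]
arrays = [np.arange(3), np.arange(3.0), np.ones(3)]
df = pd.DataFrame._from_arrays(arrays, columns=names, index=pd.RangeIndex(3))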


I think my personal preference would be to provide those two helper functions for both cases. They are easy to implement on our side (and should be relatively easy to maintain), provide the fastest option (even though the performance benefit is only minor in the consolidated case, it still cheaply avoids some overhead), and also give convenience to the few users who need this (no need to reimplement the subframes+concat+reindex logic, which, although not that complex, would otherwise be duplicated in multiple places).

jorisvandenbossche (Member) commented

For example, we could expose a pandas.api.internals.dataframe_from_blocks(..) that accepts a list of (array, placement) tuples plus index and columns object.

What this could look like is something like this:

import numpy as np

from pandas import DataFrame
from pandas._libs.internals import BlockPlacement
from pandas.api.extensions import ExtensionArray
from pandas.core.dtypes.dtypes import ExtensionDtype
from pandas.core.internals.blocks import (
    Block,
    ensure_block_shape,
    get_block_type,
    maybe_coerce_values,
)
from pandas.core.internals.managers import BlockManager


def _make_block(values: ExtensionArray | np.ndarray, placement: np.ndarray) -> Block:
    """
    This is an analogue to blocks.new_block(_2d) that ensures:
    1) correct dimension for EAs that support 2D (`ensure_block_shape`), and
    2) correct EA class for datetime64/timedelta64 (`maybe_coerce_values`).

    The input `values` is assumed to be either numpy array or ExtensionArray:
    - In case of a numpy array, it is assumed to already be in the expected
      shape for Blocks (2D, (cols, rows)).
    - In case of an ExtensionArray the input can be 1D, also for EAs that are
      internally stored as 2D.

    For the rest no preprocessing or validation is done, except for those dtypes
    that are internally stored as EAs but have an exact numpy equivalent (and at
    the moment use that numpy dtype), i.e. datetime64/timedelta64.
    """
    dtype = values.dtype
    klass = get_block_type(dtype)
    placement = BlockPlacement(placement)

    if isinstance(dtype, ExtensionDtype) and dtype._supports_2d:
        values = ensure_block_shape(values, ndim=2)

    values = maybe_coerce_values(values)
    return klass(values, ndim=2, placement=placement)


def dataframe_from_blocks(blocks, index, columns):
    blocks = [_make_block(*block) for block in blocks]
    axes = [columns, index]
    mgr = BlockManager(blocks, axes)
    return DataFrame._from_mgr(mgr, mgr.axes)
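
For illustration, using this sketch could look something like the following (the block shapes and placements are just made up for the example):

import numpy as np
import pandas as pd

# two blocks covering three columns: a float block placed at columns 0 and 2,
# and an int block placed at column 1; block values have shape (n_cols, n_rows)
blocks = [
    (np.ones((2, 3)), np.array([0, 2])),
    (np.arange(3).reshape(1, 3), np.array([1])),
]
df = dataframe_from_blocks(blocks, index=pd.RangeIndex(3), columns=pd.Index(["a", "b", "c"]))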

I think from my analysis above it is clear that performance is not a very strong argument for needing this (given it is mostly a small and fixed overhead that becomes insignificant for very large data, unless you need to construct a lot of small dataframes). But given that it is very easy to do on our side, I would say: why not just expose something like the above, provide a constructor with the most minimal overhead (without exposing any actual internal objects), and make life easier for the few downstream projects that could use this.
