[Python] Allow disabling more components #33126

asfimport · 2022-10-03T14:21:31Z

Some users would like to build lightweight versions of PyArrow, for example for use in AWS Lambda or similar systems which constrain the total size of usable libraries.

However, PyArrow currently mandates some Arrow C++ components which can lead to a very sizable Arrow binary install: Compute, CSV, Dataset, Filesystem, HDFS and JSON.

Reporter: Antoine Pitrou / @pitrou

Related issues:

[C++] Split non-cast compute kernels into a separate shared library (is related to)

Note: This issue was originally created as ARROW-17916. Please see the migration documentation for further details.

asfimport · 2022-10-03T14:23:56Z

Antoine Pitrou / @pitrou:
@AlenkaF @jorisvandenbossche

asfimport · 2022-10-04T08:12:52Z

Joris Van den Bossche / @jorisvandenbossche:
Dataset can already be disabled (we have a "failure_permitted" for that in setup.py, and at least in the past we did have some nightly build that covered this). And I suppose HDFS is also already optional?

But it would indeed be good to make more of the others optional as well. Compute would probably give the biggest benefit, although it is probably also the most difficult one. In the Cython code this is already handled through the _pc object, which lets us call compute functions in lib.pyx without importing the module directly. However, the PyArrow C++ code also depends on Compute for casting, and we rely on that in the NumPy/pandas <-> Arrow conversion, which is currently a part that is not meant to be optional.
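For readers unfamiliar with the pattern, here is a minimal Python sketch of a lazily imported module accessor in the spirit of the _pc object mentioned above. It is not the actual code in lib.pyx: `make_lazy_module` is a hypothetical helper, and raising `NotImplementedError` when the module is missing is an assumption about how a build without Compute could report the problem.

```python
import importlib


def make_lazy_module(module_name):
    """Return a zero-argument accessor that imports module_name on first call."""
    cache = {}

    def accessor():
        if "module" not in cache:
            try:
                # The import happens here, not when the accessor is created.
                cache["module"] = importlib.import_module(module_name)
            except ImportError as exc:
                # Assumed behaviour for a build where the component is absent.
                raise NotImplementedError(
                    f"{module_name} is not available in this build"
                ) from exc
        return cache["module"]

    return accessor


# Hypothetical stand-in for the real _pc helper; nothing is imported yet.
_pc = make_lazy_module("pyarrow.compute")
```

With this shape, code that never touches a compute function never triggers the import, so the rest of the package can load even when the Compute component was not built.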

asfimport · 2022-10-04T08:18:12Z

Antoine Pitrou / @pitrou:
Hmm, this is how ARROW_PYTHON is currently defined in cpp/cmake_modules/DefineOptions.cmake:

  define_option(ARROW_PYTHON
                "Build some components needed by PyArrow.;\
(This is a deprecated option. Use CMake presets instead.)"
                OFF
                DEPENDS
                ARROW_COMPUTE
                ARROW_CSV
                ARROW_DATASET
                ARROW_FILESYSTEM
                ARROW_HDFS
                ARROW_JSON)

asfimport · 2022-10-04T08:22:23Z

Antoine Pitrou / @pitrou:
As for casts, that's a good point. There's an issue open on the C++ side for this: ARROW-8891

asfimport · 2022-10-04T08:23:30Z

Joris Van den Bossche / @jorisvandenbossche:
Ah, yes, for Dataset and HDFS I was thinking about the Cython level, not about our C++ code. But checking pyarrow/src, I don't think we actually require Dataset or HDFS in the PyArrow C++ code.
I think the scope of ARROW_PYTHON was also partly convenience to build the "common" things, and not strictly the required parts.
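As a rough illustration of what "optional at the Cython level" means for a user, the sketch below probes which submodules can be imported in a given install. `available_components` is a hypothetical helper, not a PyArrow API, and the default list of submodule names is an assumption based on the components named in this issue.

```python
import importlib


def available_components(
    package="pyarrow",
    names=("compute", "csv", "dataset", "fs", "json"),
):
    """Map each submodule name to True if it can be imported, else False."""
    found = {}
    for name in names:
        try:
            importlib.import_module(f"{package}.{name}")
            found[name] = True
        except ImportError:
            # Also covers the case where the package itself is not installed.
            found[name] = False
    return found
```

Calling `available_components()` in a trimmed-down build would show which of these pieces were left out, without raising at import time.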

asfimport · 2022-10-04T08:28:37Z

Antoine Pitrou / @pitrou:
Ah, ok, that's a good point about ARROW_PYTHON. Since it's deprecated we may just ignore it then.

As for the Compute dependency: perhaps we can factor out the casting code in PyArrow C++ (there's not much of it) and use compilation directives to simply return NotImplemented if Compute was not enabled.

asfimport added this to the 11.0.0 milestone Jan 11, 2023

asfimport mentioned this issue Jan 11, 2023: [C++] Split non-cast compute kernels into a separate shared library #25025 (Open)

raulcd removed this from the 11.0.0 milestone Jan 11, 2023

jorisvandenbossche mentioned this issue May 4, 2023: PDEP-10: Add pyarrow as a required dependency pandas-dev/pandas#52711 (Merged)

vincentsarago mentioned this issue May 10, 2023: Is latest 3.6 compiled with parquet / arrow enabled? lambgeo/docker-lambda#57 (Open)
