[Python] Allow disabling more components #33126

Open
asfimport opened this issue Oct 3, 2022 · 6 comments

asfimport commented Oct 3, 2022

Some users would like to build lightweight versions of PyArrow, for example for use in AWS Lambda or similar systems which constrain the total size of usable libraries.

However, PyArrow currently requires several Arrow C++ components, which can lead to a very sizable Arrow binary install: Compute, CSV, Dataset, Filesystem, HDFS and JSON.

Reporter: Antoine Pitrou / @pitrou

Note: This issue was originally created as ARROW-17916. Please see the migration documentation for further details.

Joris Van den Bossche / @jorisvandenbossche:
Dataset can already be disabled (we have a "failure_permitted" for that in setup.py, and at least in the past we did have some nightly build that covered this). And I suppose HDFS is also already optional?

But it would indeed be good to make more of the others optional as well. Compute would probably give the biggest benefit, although it is probably also the most difficult one. In the cython code this is actually already handled using the _pc object (so that we can call compute functions in lib.pyx without importing the module directly). But the PyArrow C++ code also depends on Compute for casting (and we depend on that in the numpy/pandas <-> arrow conversion, which is currently a part that is not meant to be optional).
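To illustrate the deferred-import idea behind the _pc object, here is a minimal sketch in plain Python. It is not PyArrow's actual code: `lazy_module` is a hypothetical helper, and the only thing assumed about PyArrow is that the compute functions live in the `pyarrow.compute` module.

```python
import importlib


def lazy_module(module_name):
    """Return a zero-argument accessor that imports `module_name` on first use.

    Hypothetical helper for illustration; the name and the error message
    are assumptions, not part of PyArrow.
    """
    cached = None

    def accessor():
        nonlocal cached
        if cached is None:
            try:
                cached = importlib.import_module(module_name)
            except ImportError as exc:
                # Fail only when the optional feature is actually used.
                raise ImportError(
                    f"this operation needs the optional module {module_name!r}"
                ) from exc
        return cached

    return accessor


# Stand-in for the `_pc` helper: creating the accessor imports nothing,
# so the core library still loads when the compute module is absent.
_pc = lazy_module("pyarrow.compute")
```

With this shape, code that needs a compute function calls `_pc()` at the point of use, so a build without Compute only fails in the code paths that actually depend on it.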

Antoine Pitrou / @pitrou:
Hmm, this is how ARROW_PYTHON is currently defined in cpp/cmake_modules/DefineOptions.cmake:

  define_option(ARROW_PYTHON
                "Build some components needed by PyArrow.;\
(This is a deprecated option. Use CMake presets instead.)"
                OFF
                DEPENDS
                ARROW_COMPUTE
                ARROW_CSV
                ARROW_DATASET
                ARROW_FILESYSTEM
                ARROW_HDFS
                ARROW_JSON)

Antoine Pitrou / @pitrou:
As for casts, that's a good point. There's an issue open on the C++ side for this: ARROW-8891

asfimport commented Oct 4, 2022

Joris Van den Bossche / @jorisvandenbossche:
Ah, yes, for Dataset and HDFS I was thinking about the cython level, not for our C++. But checking pyarrow/src, I don't think we actually require dataset or hdfs in pyarrow C++.
I think the scope of ARROW_PYTHON was also partly convenience to build the "common" things, and not strictly the required parts.
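For the Python-level side of this, a rough way to see which optional pieces a given install actually provides is to try importing them. This is only a sketch: `available_components` is a hypothetical helper, and the module names in the usage part are assumed to correspond to the components discussed in this issue.

```python
import importlib


def available_components(module_names):
    """Map each module name to True if it imports cleanly, else False."""
    status = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Covers both a missing package and a submodule whose
            # compiled extension was not built.
            status[name] = False
        else:
            status[name] = True
    return status


if __name__ == "__main__":
    # Assumed module names for the optional components mentioned above.
    optional = [
        "pyarrow.compute",
        "pyarrow.csv",
        "pyarrow.dataset",
        "pyarrow.fs",
        "pyarrow.json",
    ]
    for name, ok in available_components(optional).items():
        print(f"{name}: {'available' if ok else 'missing'}")
```

An actual import is used rather than a lookup of the module spec, because a pure-Python wrapper module can be present on disk even when the extension it relies on is missing.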

Antoine Pitrou / @pitrou:
Ah, ok, that's a good point about ARROW_PYTHON. Since it's deprecated we may just ignore it then.

As for the Compute dependency: perhaps we can factor out the casting code in PyArrow C++ (there's not much of it) and use compilation directives to simply return NotImplemented if Compute was not enabled.
