GH-38325: [Python] Implement PyCapsule interface for Device data in PyArrow #40717

jorisvandenbossche · 2024-03-21T16:26:23Z

Rationale for this change

PyArrow implementation for the specification additions being proposed in #40708

What changes are included in this PR?

Adds a new __arrow_c_device_array__ method to pyarrow.Array and pyarrow.RecordBatch, and adds support in the pyarrow.array(..), pyarrow.record_batch(..) and pyarrow.table(..) functions for consuming objects that implement that method.
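For readers unfamiliar with the protocol, here is a minimal sketch of how a consumer might pick between the two export methods. It is not the PyArrow implementation: the producer classes and `export_capsules` are hypothetical, the returned strings stand in for real PyCapsule objects, and trying the device-aware method first is an assumption of this sketch.

```python
class CpuOnlyProducer:
    """Hypothetical producer exposing only the CPU-only protocol method."""

    def __arrow_c_array__(self, requested_schema=None):
        # Strings stand in for the real schema/array PyCapsules.
        return ("schema-capsule", "array-capsule")


class DeviceProducer(CpuOnlyProducer):
    """Hypothetical producer that also exposes the device-aware method."""

    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        return ("schema-capsule", "device-array-capsule")


def export_capsules(obj, requested_schema=None):
    """Return (protocol_name, schema_capsule, array_capsule) for obj.

    Assumption: the device-aware method is tried first. A consumer could
    equally prefer __arrow_c_array__ when both are present.
    """
    if hasattr(obj, "__arrow_c_device_array__"):
        schema, array = obj.__arrow_c_device_array__(requested_schema)
        return ("device", schema, array)
    if hasattr(obj, "__arrow_c_array__"):
        schema, array = obj.__arrow_c_array__(requested_schema)
        return ("cpu", schema, array)
    raise TypeError(
        f"{type(obj).__name__} does not implement the Arrow PyCapsule protocol"
    )
```

With this ordering, an object that implements both methods is consumed through the device-aware path, and a CPU-only object still works through the fallback.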

Are these changes tested?

Yes (for CPU only for now, #40385 is a prerequisite to test this for CUDA)

GitHub Issue: [Python] Expose the device interface through the Arrow PyCapsule protocol #38325


jorisvandenbossche · 2024-03-21T16:27:07Z

python/pyarrow/array.pxi

+        if requested_schema is not None:
+            target_type = DataType._import_from_c_capsule(requested_schema)
+
+            if target_type != self.type:
+                # TODO should protect from trying to cast non-CPU data
+                try:
+                    casted_array = _pc().cast(self, target_type, safe=True)
+                    inner_array = pyarrow_unwrap_array(casted_array)
+                except ArrowInvalid as e:
+                    raise ValueError(
+                        f"Could not cast {self.type} to requested type {target_type}: {e}"
+                    )
+            else:
+                inner_array = self.sp_array
+        else:
+            inner_array = self.sp_array


This part is a bit repetitive with the non-device version. I could factor it out into a shared helper function.

paleolimbot

I had a look through here and it is looking great! I don't see anything out of place (but I am also only a little bit familiar with the existing code). I don't personally mind the repeated cast bit (as long as the repetition is tested, which it looks like it is).

Just for this PR I prototyped something similar in nanoarrow ( apache/arrow-nanoarrow#409 ), and the only difference I see is that the device_id for the CPU is -1. I can't find in the spec exactly what it should be...is -1 the accepted value?

import pyarrow as pa
# ! pip install "https://github.com/paleolimbot/arrow-nanoarrow/archive/414bbc44d3e84ecac2807713438d6988ff4d5245.zip#egg=nanoarrow&subdirectory=python"
import nanoarrow as na
from nanoarrow import device

# Wrapper to prevent c_device_array() from falling back on __arrow_c_array__()
class DeviceArrayWrapper:
    def __init__(self, obj):
        self.obj = obj

    def __arrow_c_device_array__(self, requested_schema=None):
        return self.obj.__arrow_c_device_array__(requested_schema=requested_schema)
    

pa_array = pa.array([1, 2, 3])

device.c_device_array(DeviceArrayWrapper(pa_array))
#> <nanoarrow.device.CDeviceArray>
#> - device_type: 1
#> - device_id: -1
#> - array: <nanoarrow.c_lib.CArray int64>
#>   - length: 3
#>   - offset: 0
#>   - null_count: 0
#>   - buffers: (0, 2199023452480)
#>   - dictionary: NULL
#>   - children[0]:

jorisvandenbossche · 2024-03-26T16:21:47Z

Thanks for testing!

> the only difference I see is that the device_id for the CPU is -1. I can't find in the spec exactly what it should be...is -1 the accepted value?

That's a good point. In practice this comes from our implementation in C++:

arrow/cpp/src/arrow/device.h, lines 82 to 86 in 434f872:

    /// \brief A device ID to identify this device if there are multiple of this type.
    ///
    /// If there is no "device_id" equivalent (such as for the main CPU device on
    /// non-numa systems) returns -1.
    virtual int64_t device_id() const { return -1; }

But we should probably clarify in the spec whether that's allowed / expected.

I see that DLPack actually specifies it to be 0 for CPU: https://dmlc.github.io/dlpack/latest/c_api.html#c.DLDevice.device_id
Maybe we should follow that; will open a separate issue.
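Until the expected value is settled, a consumer can avoid depending on the CPU device_id at all. A minimal sketch, assuming the consumer only needs to know whether the data is on the CPU; `is_cpu_device` is a hypothetical helper, and the constant mirrors `ARROW_DEVICE_CPU` from the Arrow C device interface (the `device_type: 1` in the output above):

```python
ARROW_DEVICE_CPU = 1  # ARROW_DEVICE_CPU in the Arrow C device interface


def is_cpu_device(device_type, device_id):
    """Return True for CPU data, whatever device_id the producer reports.

    device_id is accepted but deliberately ignored, so both -1 (what the
    C++ default returns) and 0 (the DLPack convention) are treated alike.
    """
    return device_type == ARROW_DEVICE_CPU
```

This keeps a consumer working with producers that report either value for CPU data.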


@jorisvandenbossche

…yView, and the CArray (#409)

When device support was first added, the `CArrayView` was device-aware but the `CArray` was not. This worked well until it was clear that `__arrow_c_array__` needed to error if it did not represent a CPU array (and the `CArray` had no way to check). Now, the `CArray` has a `device_type` and `device_id`.

A nice side-effect of this is that we get back the `view()` method (whose removal @jorisvandenbossche had lamented!).

This also implements the device array protocol to help test apache/arrow#40717. This protocol isn't finalized yet and I could remove that part until it is (although it doesn't seem likely to change).

The non-cpu case is still hard to test without real-world CUDA support...this PR is just trying to get the right information in the right place as early as possible.

```python
import nanoarrow as na

array = na.c_array([1, 2, 3], na.int32())
array.device_type, array.device_id
#> (1, 0)
```

---------

Co-authored-by: Dane Pitkin <[email protected]>


jorisvandenbossche · 2024-06-20T16:23:48Z

I updated this PR to include the **kwargs handling.

I think in the meantime, we also have stream support for the device interface, so that could be added as well (although given this is already quite big, I can also do that in a separate follow-up PR). I should also still explicitly test this on CUDA.
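As an illustration of what such **kwargs handling can look like on the producer side, here is a minimal sketch. It assumes the convention that unknown keywords are tolerated only when their value is None, so new optional keywords can be added to the protocol later; the `Producer` class and the placeholder return values are hypothetical, and this is not the code added in this PR.

```python
class Producer:
    """Hypothetical producer showing one way to handle **kwargs."""

    def __arrow_c_device_array__(self, requested_schema=None, **kwargs):
        # Keywords left at None are treated as "not passed"; anything else
        # is a request this producer does not know how to honor.
        unsupported = [name for name, value in kwargs.items() if value is not None]
        if unsupported:
            raise NotImplementedError(
                f"Received unsupported keyword argument(s): {unsupported}"
            )
        # Strings stand in for the real schema/device-array PyCapsules.
        return ("schema-capsule", "device-array-capsule")
```

Calling the method with `some_future_option=None` succeeds, while any non-None value for an unrecognised keyword raises instead of being silently ignored.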

paleolimbot · 2024-06-21T02:21:38Z

python/pyarrow/array.pxi

+            target_type = DataType._import_from_c_capsule(requested_schema)
+
+            if target_type != self.type:
+                # TODO should protect from trying to cast non-CPU data


Is this check easy to do? (If the failure mode is a crash maybe this would be good to do?)

Yes, we actually expose a device_type on Array and RecordBatch, so we can easily validate this and raise an informative error when trying to cast to requested_schema for non-CPU data.
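The guard described above can be sketched in plain Python. This is only an illustration under stated assumptions: `FakeArray`, `CastError` and `array_for_export` are hypothetical stand-ins (for a pyarrow array, ArrowInvalid and the export logic respectively), device types are modelled as strings, and the exception type raised for non-CPU data is a choice made for this sketch.

```python
from dataclasses import dataclass

CPU = "cpu"  # stand-in for the CPU device type


class CastError(Exception):
    """Stand-in for the error a real cast raises (ArrowInvalid in the diff)."""


@dataclass
class FakeArray:
    """Hypothetical array carrying only what the guard needs."""

    type: str
    device_type: str = CPU


def array_for_export(array, target_type, cast):
    """Return the array to export, casting only when needed and allowed."""
    if target_type is None or target_type == array.type:
        return array  # nothing to cast, any device is fine
    if array.device_type != CPU:
        # Fail with an informative error instead of attempting the cast.
        raise NotImplementedError(
            "Casting to a requested schema is only supported for CPU data"
        )
    try:
        return cast(array, target_type)
    except CastError as exc:
        raise ValueError(
            f"Could not cast {array.type} to requested type {target_type}: {exc}"
        ) from exc
```

Because the same function could serve both `__arrow_c_array__` and `__arrow_c_device_array__`, a helper of this shape would also remove the repetition noted earlier in the review.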

python/pyarrow/table.pxi


paleolimbot

Thanks!

jorisvandenbossche · 2024-06-26T15:40:51Z

Thanks for the review! Going to merge this then, and open follow-up issues for expanding to the stream interface and CUDA tests

conbench-apache-arrow · 2024-06-27T02:38:33Z

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 1815a67.

There was 1 benchmark result indicating a performance regression:

Commit Run on ec2-m5-4xlarge-us-east-2 at 2024-06-27 00:42:01Z
- partitioned-dataset-filter (R) with dataset=dataset-taxi-parquet, language=R, query=vignette

The full Conbench report has more details. It also includes information about 42 possible false positives for unstable benchmarks that are known to sometimes produce them.


apacheGH-38325: [Python] Implement PyCapsule interface for Device dat…

9c6ff7d

…a in PyArrow

paleolimbot reviewed Mar 25, 2024

paleolimbot mentioned this pull request Mar 25, 2024

feat(python): Clarify interaction between the CDeviceArray, the CArrayView, and the CArray apache/arrow-nanoarrow#409

Merged

This was referenced Mar 26, 2024

[Format][C++] Recommended/required value for ArrowDeviceArray.device_id int in case of CPU data #40801

Closed

[Python] Expose the device interface through the Arrow PyCapsule protocol #38325

Closed

jorisvandenbossche added 2 commits April 9, 2024 09:38

Merge remote-tracking branch 'upstream/main' into 38325-capsule-devic…

f123925

…e-impl

switch order of consuming __arrow_c_array__ and __arrow_c_device_array__

e81c5eb

jorisvandenbossche added 2 commits June 20, 2024 18:04

Merge remote-tracking branch 'upstream/main' into 38325-capsule-devic…

c922f44

…e-impl

add kwarg handling

6946d19

paleolimbot reviewed Jun 21, 2024

jorisvandenbossche added 2 commits June 26, 2024 11:48

Merge remote-tracking branch 'upstream/main' into 38325-capsule-devic…

b2ad739

…e-impl

document kwargs + raise error when trying to cast non-CPU data

671efda

paleolimbot approved these changes Jun 26, 2024

jorisvandenbossche merged commit 1815a67 into apache:main Jun 26, 2024
13 of 15 checks passed

jorisvandenbossche deleted the 38325-capsule-device-impl branch June 26, 2024 15:41
