GH-34737: [C#] C Data interface for schemas and types #34133

Merged (20 commits) on Apr 4, 2023

Member

@wjones127 wjones127 commented Feb 10, 2023

Rationale for this change

This starts the C Data Interface implementation for C# with integration for ArrowSchema. ArrowArray will come in a follow-up PR.

What changes are included in this PR?

  • Adds classes CArrowSchema and ImportedArrowSchema which allow interacting with the CArrowSchema.
  • Adds integration tests with PyArrow, inspired by the similar integration tests in arrow-rs
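For readers who have not seen the C struct these classes wrap, here is a rough sketch of its layout as a C# struct. The field order and the flag values follow the `ArrowSchema` definition in the Arrow C Data Interface specification. The names `ArrowSchemaLayout` and `ArrowSchemaFlags`, and the use of `IntPtr` for every pointer field, are assumptions made for illustration; this is not the `CArrowSchema` definition from this PR.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical mirror of the C `struct ArrowSchema`; every pointer is kept as
// an opaque IntPtr so the struct needs no unsafe context.
[StructLayout(LayoutKind.Sequential)]
public struct ArrowSchemaLayout
{
    public IntPtr format;        // const char*: type format string, e.g. "i" for int32
    public IntPtr name;          // const char*: field name, may be null
    public IntPtr metadata;      // const char*: binary-encoded key/value metadata, may be null
    public long flags;           // int64_t: bitfield, see ArrowSchemaFlags
    public long n_children;      // int64_t: number of child schemas
    public IntPtr children;      // struct ArrowSchema**: array of child pointers
    public IntPtr dictionary;    // struct ArrowSchema*: dictionary value type, may be null
    public IntPtr release;       // void (*release)(struct ArrowSchema*)
    public IntPtr private_data;  // void*: owned by the producer
}

// Flag values as defined by the C Data Interface specification.
public static class ArrowSchemaFlags
{
    public const long DictionaryOrdered = 1;
    public const long Nullable = 2;
    public const long MapKeysSorted = 4;
}
```

Keeping every field blittable means `Marshal.SizeOf<ArrowSchemaLayout>()` and `Marshal.PtrToStructure` can be used on it directly, at the cost of manual pointer handling for `children`.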

Are these changes tested?

Yes, the PyArrow integration tests validate the functionality.

Are there any user-facing changes?

This only adds new APIs, and doesn't change any existing ones.

@github-actions

⚠️ GitHub issue #33856 has been automatically assigned in GitHub to PR creator.

@wjones127 wjones127 force-pushed the GH-33856-csharp-c-data branch 2 times, most recently from fce7987 to 59eb7ea Compare February 11, 2023 00:10
@wjones127 wjones127 force-pushed the GH-33856-csharp-c-data branch 2 times, most recently from 59f34ac to 5198a0f Compare February 18, 2023 00:06
@wjones127 wjones127 force-pushed the GH-33856-csharp-c-data branch 6 times, most recently from 132b21c to 87e734f Compare February 25, 2023 00:05
@wjones127 wjones127 marked this pull request as ready for review February 25, 2023 00:20
@kou kou changed the title GH-33856: [C#] c data interface for schemas and types GH-33856: [C#] C Data interface for schemas and types Feb 25, 2023
@kou kou requested a review from eerhardt February 25, 2023 21:41
@wjones127
Member Author

@adamreeve would you be interested in reviewing?

public string metadata;
public long flags;
public long n_children;
public IntPtr* children;
Contributor

Should children just be IntPtr rather than IntPtr*? That would allow the struct itself to not need to be unsafe, although might make dealing with children internally a bit more fiddly.

Member Author

Yeah looking at this, I don't know how I would do this without unsafe:

Marshal.AllocHGlobal(numFields * sizeof(IntPtr))

And that's called in the constructor. (Compiler says that you can't call sizeof(IntPtr) outside of an unsafe context; not sure why.)

Member

What about IntPtr.Size?

Marshal.AllocHGlobal(numFields * IntPtr.Size);

Contributor

Yeah, I think IntPtr.Size works here. For other struct types you'd generally want to use Marshal.SizeOf in this scenario.
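A minimal sketch of the suggestion in this thread: sizing and indexing an unmanaged pointer array with `IntPtr.Size` and the `Marshal` helpers, so that no `unsafe` context is needed. The class and method names (`ChildPointerArray`, `Allocate`, `Set`, `Get`, `AllocateStructs`) are hypothetical and not taken from the PR.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ChildPointerArray
{
    // Allocates room for `count` pointers and zeroes every slot.
    public static IntPtr Allocate(int count)
    {
        IntPtr block = Marshal.AllocHGlobal(count * IntPtr.Size);
        for (int i = 0; i < count; i++)
        {
            Marshal.WriteIntPtr(block, i * IntPtr.Size, IntPtr.Zero);
        }
        return block;
    }

    // Writes one pointer-sized slot at the given index.
    public static void Set(IntPtr block, int index, IntPtr value) =>
        Marshal.WriteIntPtr(block, index * IntPtr.Size, value);

    // Reads one pointer-sized slot at the given index.
    public static IntPtr Get(IntPtr block, int index) =>
        Marshal.ReadIntPtr(block, index * IntPtr.Size);

    // For an array of structs rather than pointers, Marshal.SizeOf<T>()
    // gives the element size, as noted in the comment above.
    public static IntPtr AllocateStructs<T>(int count) where T : struct =>
        Marshal.AllocHGlobal(count * Marshal.SizeOf<T>());

    public static void Free(IntPtr block) => Marshal.FreeHGlobal(block);
}
```

The trade-off is the one raised earlier in the thread: the struct can stay safe, but every child access goes through an explicit offset calculation.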

Member Author

Okay I've narrowed it so unsafe only appears when constructing and accessing children. Does that seem like the improvement we are looking for?

Contributor

👍 yeah I think that is preferable

@github-actions github-actions bot added the awaiting committer review label Feb 28, 2023
@wjones127 wjones127 force-pushed the GH-33856-csharp-c-data branch 2 times, most recently from d4aff85 to 293db1d Compare February 28, 2023 22:08
@wjones127
Member Author

Thanks for the feedback @adamreeve. I've realized I could simplify the API by changing the Export* methods into constructors for CArrowSchema. And I've moved the ImportedArrowSchema into an internal class within CArrowSchema so that it can handle the common import and disposal logic. Let me know what you think.

Member

@westonpace westonpace left a comment

Sorry, just barely started looking at this but will have to come back to it tomorrow.

using System.Runtime.InteropServices;
using Apache.Arrow.Types;

[UnmanagedFunctionPointer(CallingConvention.StdCall)]
Member

Why StdCall?

Member Author

To be honest, I'm not sure on this. It was the default and seemed to work with the integration tests fine.

Member

Ah, I read up some more on this. I don't think StdCall is the default (that would be Winapi, which defaults to StdCall on Windows and Cdecl on Linux).

That being said, I don't think we need to worry about calling convention anywhere as it is only relevant when working with 32-bit binaries: https://stackoverflow.com/questions/34832679/is-the-callingconvention-ignored-in-64-bit-net-applications

I'm pretty sure Arrow-C++ doesn't even build on 32-bit (#32111) so this is probably a non-issue. With that in mind, perhaps skip the UnmanagedFunctionPointer entirely? It seems its only purpose is to specify the calling convention.

Member

@westonpace Arrow C++ does build on 32-bit Linux (we have nightly crossbow tests for that).
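Since 32-bit builds are in play, the calling convention on the callback delegate can matter. Below is a small sketch of how a release callback could be declared with an explicit convention and turned into a native function pointer. It uses `Cdecl` on the assumption that the other side is plain C code built with the platform's default C convention; that choice is for illustration only and is not what this PR settled on. The names `ReleaseCallbackSketch`, `ReleaseCallback`, and `ReleaseCount` are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ReleaseCallbackSketch
{
    // The attribute only affects how native code calls into this delegate;
    // the convention is ignored on 64-bit platforms.
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    public delegate void ReleaseCallback(IntPtr schema);

    // Counts invocations so the effect of the callback is observable.
    public static int ReleaseCount;

    private static void Release(IntPtr schema) => ReleaseCount++;

    // Held in a static field so the delegate is not garbage collected
    // while native code still holds the function pointer.
    private static readonly ReleaseCallback s_release = Release;

    // Function pointer suitable for storing in the struct's `release` slot.
    public static IntPtr GetReleasePointer() =>
        Marshal.GetFunctionPointerForDelegate(s_release);

    // Consumer side: turn a function pointer back into a delegate and call it.
    public static void Invoke(IntPtr functionPointer, IntPtr schema)
    {
        var callback = Marshal.GetDelegateForFunctionPointer<ReleaseCallback>(functionPointer);
        callback(schema);
    }
}
```

Dropping the attribute, as suggested above, leaves the delegate on the runtime's default convention, which is only observably different on 32-bit x86.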

@github-actions github-actions bot added awaiting changes and removed awaiting change review labels Mar 16, 2023
Contributor

@eerhardt eerhardt left a comment

I just had a few more comments that can be addressed post merge.

Thanks for the contribution here, @wjones127!

I'll merge this early this week, unless I see more feedback.

}
}

public class CDataSchemaPythonTest
Contributor

Can we split this into a new file? Having a single class per file makes it easier to navigate the code and find what you are looking for.

Member Author

Done.

public long n_children;
public CArrowSchema** children;
public CArrowSchema* dictionary;
public delegate* unmanaged[Stdcall]<CArrowSchema*, void> release;
Contributor

In the future (when we start targeting net6.0+), I think we will want to drop the [Stdcall] here and just use the default:

        public delegate* unmanaged<CArrowSchema*, void> release;

But that doesn't work in the older TFMs, so what we have now is fine.

{
ExportType(field.DataType, schema);
schema->name = StringUtil.ToCStringUtf8(field.Name);
// TODO: field metadata
Contributor

Can you log issues for these TODOs?

Member Author

I've created an umbrella issue to track the remaining C data interface tasks (#33856) and another one to track remaining data types to implement (#34736).
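For context on the `schema->name = StringUtil.ToCStringUtf8(field.Name)` line in the hunk above: the C Data Interface expects `name` to be a null-terminated UTF-8 string. The sketch below shows one way such a helper could work. It is an assumption about the general approach, not the body of the PR's `StringUtil`; the names `Utf8CString`, `Allocate`, `Read`, and `Free` are hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class Utf8CString
{
    // Copies a managed string into unmanaged memory as null-terminated UTF-8.
    // The caller owns the buffer and must release it with Free.
    public static IntPtr Allocate(string value)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(value);
        IntPtr buffer = Marshal.AllocHGlobal(bytes.Length + 1);
        Marshal.Copy(bytes, 0, buffer, bytes.Length);
        Marshal.WriteByte(buffer, bytes.Length, 0); // terminating NUL
        return buffer;
    }

    // Reads a null-terminated UTF-8 string back into a managed string.
    public static string Read(IntPtr buffer)
    {
        int length = 0;
        while (Marshal.ReadByte(buffer, length) != 0)
        {
            length++;
        }
        byte[] bytes = new byte[length];
        Marshal.Copy(buffer, bytes, 0, length);
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Free(IntPtr buffer) => Marshal.FreeHGlobal(buffer);
}
```

In an exported schema, a buffer like this would typically be freed from the release callback, since the producer owns the memory.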

@github-actions github-actions bot added awaiting merge, awaiting changes and removed awaiting changes, awaiting merge labels Mar 25, 2023
@wjones127 wjones127 changed the title GH-33856: [C#] C Data interface for schemas and types GH-34737: [C#] C Data interface for schemas and types Mar 27, 2023
@github-actions

⚠️ GitHub issue #34737 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added awaiting change review, awaiting changes and removed awaiting changes, awaiting change review labels Mar 27, 2023
@eerhardt
Contributor

Re-running the macOS 12 Python 3 leg which failed. Once it passes, I'll merge this PR.

@eerhardt
Contributor

@westonpace - I see you still have changes requested. Is that stale, or are you still requesting changes on the current proposal?

The macOS 12 Python 3 leg is still failing. Does anyone know if that is related to these changes? @wjones127, can you merge with main to see if the errors go away?

@wjones127
Member Author

@eerhardt I think we can ignore that error for now. The test is flaky and we are still looking into the root cause.

#34743

Member

@westonpace westonpace left a comment

Sorry, my review was stale. I'm happy with where this is now.

@github-actions github-actions bot added awaiting merge and removed awaiting changes labels Mar 31, 2023
@eerhardt eerhardt merged commit f02d351 into apache:main Apr 4, 2023
@ursabot

ursabot commented Apr 4, 2023

Benchmark runs are scheduled for baseline = 7c307b1 and contender = f02d351. f02d351 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Failed ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] f02d3511 ec2-t3-xlarge-us-east-2
[Failed] f02d3511 test-mac-arm
[Finished] f02d3511 ursa-i9-9960x
[Failed] f02d3511 ursa-thinkcentre-m75q
[Finished] 7c307b14 ec2-t3-xlarge-us-east-2
[Failed] 7c307b14 test-mac-arm
[Finished] 7c307b14 ursa-i9-9960x
[Failed] 7c307b14 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023

* Closes: apache#33856
* Closes: apache#34737

Lead-authored-by: Will Jones <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
Signed-off-by: Eric Erhardt <[email protected]>
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023