
feat(csharp/src/Drivers/Apache/Spark): poc - Support for Apache Spark over HTTP (non-Arrow) #2018

Merged

Conversation


@birschick-bq birschick-bq commented Jul 17, 2024

Proof-of-concept

Adds support for the standard Apache Spark data source:

  • Requires the adbc.spark.type parameter option to specify the Spark server type.
  • Handles the possibility that the data source does not support a "catalog" in the table object namespace (i.e., only schema and table names are used).
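As a hedged illustration of the first bullet (the exact driver entry points are not shown in this PR, so everything here except the `adbc.spark.type` key itself is an assumption, including the `uri` key and the `"standard"` value), the new option would be passed alongside the usual connection parameters:

```csharp
using System;
using System.Collections.Generic;

class SparkTypeOptionExample
{
    static void Main()
    {
        // Hypothetical parameter set for the Spark ADBC driver. Only the
        // "adbc.spark.type" key is taken from this PR; the other key and
        // the value "standard" are illustrative assumptions.
        var parameters = new Dictionary<string, string>
        {
            ["adbc.spark.type"] = "standard", // selects the non-Arrow, HTTP-based Spark server type
            ["uri"] = "http://spark-host:10001/cliservice",
        };

        Console.WriteLine(parameters["adbc.spark.type"]);
    }
}
```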

Closes #2015

@birschick-bq birschick-bq changed the title feat(csharp/src/Drivers/Apache/Spark): poc - add support for standard Apache Spark (non-Arrow) feat(csharp/src/Drivers/Apache/Spark): poc - add support for Apache Spark over HTTP (non-Arrow) Jul 26, 2024
@birschick-bq birschick-bq marked this pull request as ready for review August 15, 2024 17:18
@birschick-bq birschick-bq requested a review from lidavidm as a code owner August 15, 2024 17:18
@github-actions github-actions bot added this to the ADBC Libraries 14 milestone Aug 15, 2024
Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! This is only a partial review as I've exceeded the time I had allocated for a first pass.

| BOOLEAN | Boolean | bool |
| CHAR | String | string |
| DATE* | *String* | *string* |
| DECIMAL* | *String* | *string* |
Contributor:

Is this a fundamental limitation or an implementation issue? Are we getting back enough information to do the conversion locally when the data is returned in a Thrift vs Arrow format?

Contributor Author:

@CurtHagenlocher
This is a "Thrift" limitation if we don't do any further conversion. We have the opportunity to convert the string types to native types, at the expense of memory/performance. Should we always do the conversion in the client driver, or possibly allow an option to decide?

Contributor:

I think the correct thing is to do the conversion locally, yes. I think it would be very surprising for a user to run a query that produces a decimal or date value and to get that back as a string. I understand that we've been avoiding this for structured types where it's perhaps a little more justifiable but this just seems wrong.

One thing we could do to unblock this check-in is to add an option for converting scalar types and to fail if the value is not explicitly passed as false. This would let us add support in a subsequent change without breaking consumers.

Contributor Author:

Introduced the option data_type_conv with the value none to indicate that no conversion should be done, and documented future support for a scalar value that will convert date, decimal, and timestamp types.
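A sketch of how the option could gate conversion (the option name data_type_conv and the values none and scalar come from this thread; the enum, the parsing helper, and its defaulting behavior are assumptions):

```csharp
using System;

// Conversion modes for the data_type_conv option discussed above:
// "none" is the value introduced in this PR; "scalar" is the planned
// future value for converting date, decimal, and timestamp types.
enum DataTypeConversion { None, Scalar }

class DataTypeConvOptionExample
{
    // Hypothetical helper: map the option string to a mode.
    // Treating an absent value as None is an assumption.
    static DataTypeConversion Parse(string value) =>
        (value ?? "").Trim().ToLowerInvariant() switch
        {
            "" or "none" => DataTypeConversion.None,
            "scalar" => DataTypeConversion.Scalar,
            _ => throw new ArgumentException($"Unknown data_type_conv value: {value}"),
        };

    static void Main()
    {
        Console.WriteLine(Parse("none"));
        Console.WriteLine(Parse("scalar"));
    }
}
```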

@CurtHagenlocher left a comment:

Thanks for the changes! I do think we can't just return e.g. dates or decimals as strings, but I've made a suggestion for a small change that would unblock this change and let us fix that separately.

var memory = offsetBuffer.AsMemory();
var typedMemory = Unsafe.As<Memory<byte>, Memory<int>>(ref memory).Slice(0, length + 1);

for (int _i197 = 0; _i197 < length; ++_i197)
{
    //typedMemory.Span[_i197] = offset;
    StreamExtensions.WriteInt32LittleEndian(offset, memory.Span, _i197 * IntSize);
Contributor:

For .NET 6+ we could use e.g. BinaryPrimitives.WriteInt32LittleEndian for what I suspect is an efficiency gain. This could be done as a followup and/or performance-tested.
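For reference, the suggested API lives in System.Buffers.Binary and writes directly into a span. A small self-contained sketch of the offset-writing loop above using it (the buffer layout and sample offsets are illustrative, not taken from the PR):

```csharp
using System;
using System.Buffers.Binary;

class WriteInt32Example
{
    static void Main()
    {
        // Write offsets as little-endian 32-bit ints, mirroring the
        // quoted loop; IntSize plays the same role as in the PR snippet.
        const int IntSize = sizeof(int); // 4 bytes
        int[] offsets = { 0, 5, 11 };    // illustrative sample values
        byte[] buffer = new byte[offsets.Length * IntSize];

        for (int i = 0; i < offsets.Length; i++)
        {
            BinaryPrimitives.WriteInt32LittleEndian(
                buffer.AsSpan(i * IntSize, IntSize), offsets[i]);
        }

        // Round-trip read to verify the second offset.
        Console.WriteLine(
            BinaryPrimitives.ReadInt32LittleEndian(buffer.AsSpan(IntSize, IntSize)));
    }
}
```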

Contributor Author:

Okay, I will create a follow-up to add .NET 6+ support for BinaryPrimitives.WriteInt32LittleEndian.


@birschick-bq birschick-bq changed the title feat(csharp/src/Drivers/Apache/Spark): poc - add support for Apache Spark over HTTP (non-Arrow) feat(csharp/src/Drivers/Apache/Spark): poc - Support for Apache Spark over HTTP (non-Arrow) Sep 9, 2024
@CurtHagenlocher CurtHagenlocher merged commit e564635 into apache:main Sep 10, 2024
8 checks passed
@birschick-bq birschick-bq deleted the dev/birschick-bq/spark-ds-variants branch October 15, 2024 18:30
Successfully merging this pull request may close these issues.

feat(csharp/src/Drivers/Apache/Spark): Add support for more versions and variants of Spark data source