Pinot Spark Connector for Spark3 #10394
Codecov Report
@@ Coverage Diff @@
## master #10394 +/- ##
============================================
- Coverage 67.87% 63.41% -4.47%
- Complexity 5742 5886 +144
============================================
Files 1521 2028 +507
Lines 80305 110627 +30322
Branches 12826 16846 +4020
============================================
+ Hits 54506 70152 +15646
- Misses 21957 35305 +13348
- Partials 3842 5170 +1328
... and 862 files with indirect coverage changes
Thanks for the contribution! Can you try testing it in cluster mode on a proper YARN, AWS EMR, or Dataproc cluster? Spark jobs inside Pinot sometimes fail in these environments because of conflicting runtime libraries. IMO the solution is simply to keep shading until the problems are resolved.
pinot-connectors/pinot-spark-3-connector/documentation/read_model.md
...onnector/src/main/scala/org/apache/pinot/connector/spark/v3/datasource/PinotDataSource.scala
pinot-connectors/pinot-spark-3-connector/src/test/resources/schema/pinot-schema.json
pinot-connectors/pinot-spark-3-connector/src/test/resources/schema/spark-schema.json
Disclosure: I copied this documentation from the Spark2-based implementation, since at this point the functionality is pretty much the same. However, the two are likely to diverge pretty soon, so I think they deserve separate doc pages.
from @KKcorps :
Good suggestion. I ended up testing this in our YARN environment with success. I should note, though, that our YARN environment runs Spark 3.0.2, so I had to downgrade the connector's Spark version for the test. Everything else worked as expected.
LGTM! Thanks for testing it out on YARN.
One suggestion would be to add the bash command you use to trigger this job in cluster mode on the YARN cluster.
Sometimes running the same command as in local mode but adding --deploy-mode cluster doesn't work properly.
LGTM as well. Thank you for the changes.
Thanks for the reviews folks!
Good idea. I added a sample.
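The sample itself is not reproduced in this conversation. As a rough sketch only, a cluster-mode submission on YARN might look like the following. The jar names and the main class are hypothetical placeholders, not the files from this PR; only `--master`, `--deploy-mode`, `--class`, and `--jars` are standard `spark-submit` flags. The function prints the command rather than running it, so it can be inspected first.

```shell
#!/usr/bin/env bash
# Sketch: build a spark-submit command for YARN cluster mode.
# All jar paths and the main class are hypothetical placeholders.

build_submit_cmd() {
  local app_jar="$1"        # application jar (hypothetical path)
  local main_class="$2"     # application entry point (hypothetical name)
  local connector_jar="$3"  # Pinot Spark 3 connector jar (hypothetical path)

  # Print the command instead of executing it, so it can be reviewed.
  echo spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class "$main_class" \
    --jars "$connector_jar" \
    "$app_jar"
}

build_submit_cmd "my-app.jar" "com.example.ReadFromPinot" "pinot-spark-3-connector.jar"
```

Once the paths point at real artifacts, the printed line can be pasted into a terminal on a machine with access to the cluster.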
LGTM! Thanks for your contribution @cbalci!
Background
Apache Spark changed the Datasource interface significantly between Spark2 and Spark3, so the current pinot-spark-connector doesn't work with Spark3. In a previous PR (#10321) I refactored the spark-connector into two modules (pinot-spark-common and pinot-spark-2-connector) so that shared logic can be reused, which gives us a clean base for implementing the new version.
Change
In this PR I'm implementing the DataSourceV2 interface as published by Spark3. Functionality is exactly the same as the Pinot Spark 2 Connector, and it supports all existing configuration options.
It can be used as a drop-in replacement when migrating from Spark2 to Spark3. Spark3 also brings new features and improvements, such as aggregation push-down, which can be taken advantage of in the future.
Testing
I added basic unit test coverage as well as a good list of integration tests under ExampleSparkPinotConnectorTest, similar to the Spark2 Connector.
feature
release-notes
(Added Spark3 support for Pinot Spark Connector)