Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pinot Spark Connector for Spark3 #10394

Merged
merged 4 commits into from
Mar 18, 2023
Merged

Conversation

cbalci
Copy link
Contributor

@cbalci cbalci commented Mar 8, 2023

Background
Apache Spark has changed the Datasource interface significantly between Spark2 and Spark3, so current pinot-spark-connector doesn't work for Spark3. In a previous PR(#10321) I refactored the spark-connector into two modules (pinot-spark-common and pinot-spark-2-connector) to be able to reuse shared logic which gives us a clean base to implement the new version.

Change
In this PR I'm implementing the DataSourceV2 interface as published by Spark3. Functionality is exactly same as Pinot Spark 2 Connector and it supports all existing configuration options such as:

  • Ability to read from REALTIME, OFFLINE or HYBRID Pinot tables
  • Ability to scan using HTTP or GRPC server endpoints
  • Column pruning and filter push down
  • etc. (see docs)

It can be used as a drop in replacement when migrating from Spark2 to Spark3. Spark3 also brings some new features and improvements such as 'Aggregation push down' which can be taken advantage of in the future.

Testing
I added basic unit test coverage as well as a good list of integration tests under ExampleSparkPinotConnectorTest similar to Spark2 Connector.

feature
release-notes (Added Spark3 support for Pinot Spark Connector)

@codecov-commenter
Copy link

codecov-commenter commented Mar 8, 2023

Codecov Report

Merging #10394 (a02e4a0) into master (5d0089a) will decrease coverage by 4.47%.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##             master   #10394      +/-   ##
============================================
- Coverage     67.87%   63.41%   -4.47%     
- Complexity     5742     5886     +144     
============================================
  Files          1521     2028     +507     
  Lines         80305   110627   +30322     
  Branches      12826    16846    +4020     
============================================
+ Hits          54506    70152   +15646     
- Misses        21957    35305   +13348     
- Partials       3842     5170    +1328     
Flag Coverage Δ
integration1 24.45% <0.00%> (?)
integration2 24.41% <0.00%> (?)
unittests1 68.02% <0.00%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../pinot/segment/local/upsert/ComparisonColumns.java 93.75% <0.00%> (-6.25%) ⬇️

... and 862 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@xiangfu0 xiangfu0 requested review from xiangfu0 and KKcorps March 9, 2023 09:01
@KKcorps
Copy link
Contributor

KKcorps commented Mar 9, 2023

Thanks for the contribution! Can you try to test it in cluster mode while running on a proper YARN, AWS EMR or DataProc cluster.

Some of the times spark jobs inside pinot fail in these environments because of some conflicting runtime libs.

Solution is simply to keep on shading until the problems get resolved imo.

@@ -0,0 +1,140 @@
<!--
Copy link
Contributor Author

@cbalci cbalci Mar 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disclosure: I copied this documentation from the Spark2 based implementation since at this point the functionality is pretty much same. However it is likely they will diverge pretty soon so I think they deserve separate doc pages.

@@ -0,0 +1,69 @@
<!--
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disclosure: I copied this documentation from the Spark2 based implementation since at this point the functionality is pretty much same. However it is likely they will diverge pretty soon so I think they deserve separate doc pages.

@cbalci
Copy link
Contributor Author

cbalci commented Mar 10, 2023

from @KKcorps :

Can you try to test it in cluster mode while running on a proper YARN, AWS EMR or DataProc cluster.

Good suggestion. I ended up testing this in our YARN environment with success. Although, I have to note that our YARN environment runs Spark 3.0.2 so I had to downgrade Spark version of the connector for the test. Everything else worked as expected.

@cbalci
Copy link
Contributor Author

cbalci commented Mar 10, 2023

Thanks for the reviews @KKcorps @GSharayu @jackjlli ! I tried to address/resolve your comments, ptal.

@KKcorps KKcorps added documentation release-notes Referenced by PRs that need attention when compiling the next release notes enhancement labels Mar 10, 2023
Copy link
Contributor

@KKcorps KKcorps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! THanks for testing it out on YARN.
One suggestion would be to add a bash command to trigger this job in cluster mode that you use on YARN cluster.

Sometimes running the same command as local but adding deploy-mode cluster doesn't work properly.

@GSharayu
Copy link
Contributor

LGTM as well. Thank you for the changes

@cbalci
Copy link
Contributor Author

cbalci commented Mar 14, 2023

Thanks for the reviews folks!

One suggestion would be to add a bash command to trigger this job in cluster mode that you use on YARN cluster.

Sometimes running the same command as local but adding deploy-mode cluster doesn't work properly.

Good idea. I a dded a sample spark-submit command to the documentation for running the included examples in cluster mode.

spark-submit \
    --class org.apache.pinot.connector.spark.v3.datasource.ExampleSparkPinotConnectorTest \
    --jars ./target/pinot-spark-3-connector-0.13.0-SNAPSHOT-shaded.jar \
    --master $SPARK_CLUSTER \
    --deploy-mode cluster \
  ./target/pinot-spark-3-connector-0.13.0-SNAPSHOT-tests.jar

@xiangfu0
Copy link
Contributor

LGTM! Thanks for your contribution @cbalci !

@xiangfu0 xiangfu0 merged commit d9c4315 into apache:master Mar 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation enhancement release-notes Referenced by PRs that need attention when compiling the next release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants