Add Spark Job Launcher tool #9288
Conversation
Codecov Report
@@             Coverage Diff              @@
##             master    #9288       +/-   ##
=============================================
- Coverage     68.66%   26.09%   -42.58%
+ Complexity     4680       44     -4636
=============================================
  Files          1859     1855        -4
  Lines         99120    99278      +158
  Branches      15075    15112       +37
=============================================
- Hits          68062    25904    -42158
- Misses        26174    70783    +44609
+ Partials       4884     2591     -2293
Force-pushed from 0479ea1 to 32fe741.
<groupId>org.apache.spark</groupId>
<artifactId>spark-launcher_${scala.version}</artifactId>
<version>${spark.version}</version>
<exclusions>
I do have some minor concerns about adding dependencies on Spark here. But since it's only pinot-tools, it's not so important.
This dep is pretty basic. It only contains a couple of classes and one additional dependency, so including it isn't a major concern. Earlier I had included spark-core, which was a major one and could have caused a lot of issues.
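For reference, the spark-launcher artifact essentially provides the SparkLauncher builder API for assembling a spark-submit invocation programmatically. A minimal sketch of that public API is below; the paths, master, and job spec values are placeholders, not the actual configuration used by this PR:

```java
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SparkLauncherSketch {
  public static void main(String[] args) throws Exception {
    // Build a spark-submit invocation programmatically instead of assembling it by hand.
    SparkAppHandle handle = new SparkLauncher()
        .setSparkHome("/usr/lib/spark")                                   // placeholder SPARK_HOME
        .setAppResource("local:///opt/pinot/lib/pinot-all-jar-with-dependencies.jar")
        .setMainClass("org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand")
        .setMaster("yarn")
        .setDeployMode("client")
        .addJar("/opt/pinot/plugins/pinot-s3-shaded.jar")                 // placeholder plugin jar
        .addAppArgs("-jobSpecFile", "ingestion_spec.yml")
        .setConf("spark.executor.cores", "3")
        .startApplication();

    // Poll the handle until the application reaches a terminal state.
    while (!handle.getState().isFinal()) {
      Thread.sleep(1000);
    }
    System.out.println("Final state: " + handle.getState());
  }
}
```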
@@ -34,6 +34,8 @@
  <properties>
    <pinot.root>${basedir}/..</pinot.root>
    <aws.version>2.14.28</aws.version>
    <scala.version>2.12</scala.version>
Once you start pulling in Scala code, you need to ensure that every dependency that's also using Scala is using the same version. So I think this should go in the top-level pom.xml file. Also note that pinot-kafka and pinot-spark have dependencies on Scala 2.11, which I believe will cause runtime problems if they are on the classpath when the tool is being run (and it's using 2.12).
I have actually excluded the Scala code from the plugin. It is just that the plugin requires it in the name.
lgtm, it simplifies the Spark story a lot
It is failing in some cases, like a local environment that is multi-threaded. Working on fixing those, after which we can merge.
// Kafka plugins need to be excluded as they contain scala dependencies which cause
// NoSuchMethodErrors with runtime spark.
// It is also fine to exclude Kafka plugins as they are not going to be used in batch ingestion in any case
@CommandLine.Option(names = {"-pluginsToExclude"}, defaultValue = "pinot-kafka-0.9:pinot-kafka-2.0", required =
I wonder, if we already have a properties file and a job spec file, do we still want to support these command-line options?
Yes, that can be done as well. My initial thought process when writing this was to not use the ingestion spec at all in the command code and only use it inside the Spark job, the reason being that it imposes the limitation of always providing the ingestion spec as a local file path and not S3.
However, I did end up using the spec because otherwise we can't load the PinotFS classes and find the appropriate plugin jars in S3, GCS, etc.
I will take this up as a separate PR.
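To illustrate the idea behind -pluginsToExclude (this is a hypothetical sketch, not the PR's actual code): the ':'-delimited list can be used to skip plugin directories before their jars are shipped to Spark, so the Scala 2.11 jars from the Kafka plugins never reach the Spark classpath. The helper name and layout below are made up for illustration:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PluginFilterSketch {
  // Hypothetical helper: collect plugin directories under pluginsRoot, skipping the
  // excluded ones (e.g. "pinot-kafka-0.9:pinot-kafka-2.0"), so their jars are never
  // added to the Spark job and cannot trigger NoSuchMethodError at runtime.
  static List<File> selectPluginDirs(File pluginsRoot, String pluginsToExclude) {
    Set<String> excluded = new HashSet<>(Arrays.asList(pluginsToExclude.split(":")));
    List<File> selected = new ArrayList<>();
    File[] dirs = pluginsRoot.listFiles(File::isDirectory);
    if (dirs != null) {
      for (File dir : dirs) {
        if (!excluded.contains(dir.getName())) {
          selected.add(dir);
        }
      }
    }
    return selected;
  }
}
```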
Force-pushed from 32fe741 to 6854682.
Users currently need to construct the whole spark-submit command to run a Spark job for batch ingestion. With so many plugins available inside Pinot, this leads to a lot of classpath errors, and you also need to take care of various arguments based on the environment in which you are running. This new command in pinot-admin aims to simplify this for the users.

Example
Previously, you had to run:

export PINOT_VERSION=0.11.0-SNAPSHOT
export PINOT_DISTRIBUTION_DIR=/Users/kharekartik/Documents/Developer/pinot/build/

spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode client \
  --jars ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar,${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar,${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar,${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3.2/pinot-batch-ingestion-spark-3.2-0.11.0-SNAPSHOT-shaded.jar \
  local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar \
  -jobSpecFile parquet_ingestion_spec_spark3_students.yml
Now you can use:

export SPARK_HOME=/usr/lib/spark/
bin/pinot-admin.sh LaunchSparkDataIngestionJob \
  -jobSpecFile parquet_ingestion_spec_spark3_students.yml \
  -pluginsToLoad pinot-parquet:pinot-s3 \
  -master yarn
Additional Options
You can also specify any additional Spark configurations using the -sparkConf option:

-sparkConf spark.executor.cores=3:num-executors=4
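Assuming the ':'-separated key=value format shown above, the option value could be turned into individual Spark settings roughly like this (a hypothetical sketch, not the actual parsing code in this PR):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SparkConfOptionSketch {
  // Hypothetical parser for "-sparkConf spark.executor.cores=3:num-executors=4":
  // entries are separated by ':' and each entry is a key=value pair.
  static Map<String, String> parseSparkConf(String sparkConf) {
    Map<String, String> confs = new LinkedHashMap<>();
    for (String entry : sparkConf.split(":")) {
      String[] kv = entry.split("=", 2);
      if (kv.length == 2) {
        confs.put(kv[0].trim(), kv[1].trim());
      }
    }
    return confs;
  }

  public static void main(String[] args) {
    // Prints {spark.executor.cores=3, num-executors=4}
    System.out.println(parseSparkConf("spark.executor.cores=3:num-executors=4"));
  }
}
```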
Users can also specify jars directly from S3/GCS instead of local disk for environments like EMR
-pinotBaseDir s3://your-bucket/apache-pinot-0.11.0-SNAPSHOT
You can choose whether to run Spark 2.x or 3.x with the following option (the default is SPARK_3):
-sparkVersion SPARK_2