Add Spark Job Launcher tool #9288
Conversation
Codecov Report
@@             Coverage Diff              @@
##             master    #9288       +/-   ##
=============================================
- Coverage     68.66%   26.09%   -42.58%
+ Complexity     4680       44     -4636
=============================================
  Files          1859     1855        -4
  Lines         99120    99278      +158
  Branches      15075    15112       +37
=============================================
- Hits          68062    25904    -42158
- Misses        26174    70783    +44609
+ Partials       4884     2591     -2293
Force-pushed from 0479ea1 to 32fe741.
<groupId>org.apache.spark</groupId>
<artifactId>spark-launcher_${scala.version}</artifactId>
<version>${spark.version}</version>
<exclusions>
I do have some minor concerns about adding dependencies on Spark here. But since it's only pinot-tools, it's not so important.
This dep is pretty basic. It only contains a couple of classes and one additional dependency, so including it isn't a major concern. Earlier I had included spark-core, which was a major one and could have caused a lot of issues.
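For reference, the spark-launcher artifact essentially provides the SparkLauncher builder API for assembling a spark-submit invocation programmatically. A minimal sketch of that public API is below; the paths, master, and job spec values are placeholders, not the actual configuration used by this PR:

```java
import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SparkLauncherSketch {
  public static void main(String[] args) throws Exception {
    // Build a spark-submit invocation programmatically instead of assembling it by hand.
    SparkAppHandle handle = new SparkLauncher()
        .setSparkHome("/usr/lib/spark")                                   // placeholder SPARK_HOME
        .setAppResource("local:///opt/pinot/lib/pinot-all-jar-with-dependencies.jar")
        .setMainClass("org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand")
        .setMaster("yarn")
        .setDeployMode("client")
        .addJar("/opt/pinot/plugins/pinot-s3-shaded.jar")                 // placeholder plugin jar
        .addAppArgs("-jobSpecFile", "ingestion_spec.yml")
        .setConf("spark.executor.cores", "3")
        .startApplication();

    // Poll the handle until the application reaches a terminal state.
    while (!handle.getState().isFinal()) {
      Thread.sleep(1000);
    }
    System.out.println("Final state: " + handle.getState());
  }
}
```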
@@ -34,6 +34,8 @@
  <properties>
    <pinot.root>${basedir}/..</pinot.root>
    <aws.version>2.14.28</aws.version>
    <scala.version>2.12</scala.version>
Once you start pulling in Scala code, you need to ensure that every dependency that's also using Scala is using the same version. So I think this should go in the top-level pom.xml file. Also note that pinot-kafka and pinot-spark have dependencies on Scala 2.11, which I believe will cause runtime problems if they are on the classpath when the tool is being run (and it's using 2.12).
I have actually excluded the Scala code from the plugin. It is just that the plugin requires it in the name.
lgtm, it simplifies the Spark story a lot
It is failing in some cases, like a local environment that is multi-threaded. Working on fixing those, after which we can merge.
// Kafka plugins need to be excluded as they contain scala dependencies which cause
// NoSuchMethodErrors with runtime spark.
// It is also fine to exclude Kafka plugins as they are not going to be used in batch ingestion in any case
@CommandLine.Option(names = {"-pluginsToExclude"}, defaultValue = "pinot-kafka-0.9:pinot-kafka-2.0", required =
I wonder, if we already have a properties file and a job spec file, do we still want to support these command-line options?
Yes, that can be done as well. My initial thought process when writing this was to not use the ingestion spec at all in the command code and only use it inside the Spark job, the reason being that it imposes the limitation of always providing the ingestion spec as a local file path and not S3.
However, I did end up using the spec because otherwise we can't load the PinotFS classes and find the appropriate plugin jars in S3, GCS, etc.
I will take this up as a separate PR.
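To illustrate the idea behind -pluginsToExclude (this is a hypothetical sketch, not the PR's actual code): the ':'-delimited list can be used to skip plugin directories before their jars are shipped to Spark, so the Scala 2.11 jars from the Kafka plugins never reach the Spark classpath. The helper name and layout below are made up for illustration:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PluginFilterSketch {
  // Hypothetical helper: collect plugin directories under pluginsRoot, skipping the
  // excluded ones (e.g. "pinot-kafka-0.9:pinot-kafka-2.0"), so their jars are never
  // added to the Spark job and cannot trigger NoSuchMethodError at runtime.
  static List<File> selectPluginDirs(File pluginsRoot, String pluginsToExclude) {
    Set<String> excluded = new HashSet<>(Arrays.asList(pluginsToExclude.split(":")));
    List<File> selected = new ArrayList<>();
    File[] dirs = pluginsRoot.listFiles(File::isDirectory);
    if (dirs != null) {
      for (File dir : dirs) {
        if (!excluded.contains(dir.getName())) {
          selected.add(dir);
        }
      }
    }
    return selected;
  }
}
```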
Force-pushed from 32fe741 to 6854682.
Users currently need to construct the whole spark-submit command to run a Spark job for batch ingestion. With so many plugins available inside Pinot, this leads to a lot of classpath errors, and you also need to take care of various arguments based on the environment in which you are running. This new command in pinot-admin aims to simplify this for the users.

Example
Previously, you had to run:

export PINOT_VERSION=0.11.0-SNAPSHOT
export PINOT_DISTRIBUTION_DIR=/Users/kharekartik/Documents/Developer/pinot/build/

spark-submit \
  --class org.apache.pinot.tools.admin.command.LaunchDataIngestionJobCommand \
  --master yarn \
  --deploy-mode client \
  --jars ${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar,${PINOT_DISTRIBUTION_DIR}/plugins/pinot-input-format/pinot-parquet/pinot-parquet-0.11.0-SNAPSHOT-shaded.jar,${PINOT_DISTRIBUTION_DIR}/plugins/pinot-file-system/pinot-s3/pinot-s3-0.11.0-SNAPSHOT-shaded.jar,${PINOT_DISTRIBUTION_DIR}/plugins-external/pinot-batch-ingestion/pinot-batch-ingestion-spark-3.2/pinot-batch-ingestion-spark-3.2-0.11.0-SNAPSHOT-shaded.jar \
  local://${PINOT_DISTRIBUTION_DIR}/lib/pinot-all-0.11.0-SNAPSHOT-jar-with-dependencies.jar \
  -jobSpecFile parquet_ingestion_spec_spark3_students.yml
Now you can use:

export SPARK_HOME=/usr/lib/spark/
bin/pinot-admin.sh LaunchSparkDataIngestionJob \
  -jobSpecFile parquet_ingestion_spec_spark3_students.yml \
  -pluginsToLoad pinot-parquet:pinot-s3 \
  -master yarn
Additional Options
You can also specify any additional Spark configurations using the -sparkConf option:

-sparkConf spark.executor.cores=3:num-executors=4
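Assuming the ':'-separated key=value format shown above, the option value could be turned into individual Spark settings roughly like this (a hypothetical sketch, not the actual parsing code in this PR):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SparkConfOptionSketch {
  // Hypothetical parser for "-sparkConf spark.executor.cores=3:num-executors=4":
  // entries are separated by ':' and each entry is a key=value pair.
  static Map<String, String> parseSparkConf(String sparkConf) {
    Map<String, String> confs = new LinkedHashMap<>();
    for (String entry : sparkConf.split(":")) {
      String[] kv = entry.split("=", 2);
      if (kv.length == 2) {
        confs.put(kv[0].trim(), kv[1].trim());
      }
    }
    return confs;
  }

  public static void main(String[] args) {
    // Prints {spark.executor.cores=3, num-executors=4}
    System.out.println(parseSparkConf("spark.executor.cores=3:num-executors=4"));
  }
}
```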
Users can also specify jars directly from S3/GCS instead of local disk for environments like EMR
-pinotBaseDir s3://your-bucket/apache-pinot-0.11.0-SNAPSHOT
You can choose whether to run Spark 2.x or 3.x with the following option (the default is SPARK_3):
-sparkVersion SPARK_2