[BUG] RapidsShuffleManager and external packages for Spark Standalone #5796

Open · Tracked by #5757
gerashegalov opened this issue Jun 8, 2022 · 0 comments
Labels: bug (Something isn't working), shuffle (things that impact the shuffle plugin)

gerashegalov commented Jun 8, 2022

Describe the bug
Document all the caveats regarding deployment of the plugin jar.

Probably due to a Spark bug, it appears one cannot use --jars to supply the jar from which the shuffle manager is to be loaded. We should check whether a fix needs to be contributed to upstream Apache Spark.

Conversely, we run into issues when an external package such as spark-avro is supplied via --jars or --packages while our plugin is added to Spark's internal classpath, either

  • by placing the plugin jar in $SPARK_HOME/jars, or
  • by using --driver-class-path / spark.*.extraClassPath,

as noted in #5758.
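
A quick way to tell which deployment path is in effect is to inspect the classpath-related settings from the driver. A minimal pyspark sketch for illustration (driver side only, not part of the original repro):

# Run inside a pyspark session to see how jars were supplied.
# --jars / --packages entries show up under spark.jars / spark.jars.packages,
# while extraClassPath entries land on the JVM's system classpath.
for key in ("spark.jars",
            "spark.jars.packages",
            "spark.driver.extraClassPath",
            "spark.executor.extraClassPath"):
    print(key, "=", spark.conf.get(key, "<unset>"))

# The driver JVM's system classpath: includes $SPARK_HOME/jars and any
# extraClassPath entries, but not jars supplied via --jars.
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.class.path"))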

Steps/Code to reproduce bug
To repro the shuffle manager issue specifically:

BROKEN:

$SPARK_HOME/bin/pyspark --jars dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
  --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
  --master spark://localhost:7077 \
  --num-executors 1

and observe the executor instances crashing with:

22/06/08 15:41:59 INFO TransportClientFactory: Successfully created connection to /10.0.0.132:33207 after 1 ms (0 ms spent in bootstraps)
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:419)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:408)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.spark321.RapidsShuffleManager
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
	at org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2642)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:315)
	at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:207)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:468)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	... 4 more
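
The stack trace shows the executor resolving the shuffle manager class while SparkEnv is being created, i.e. before any --jars have been fetched and added to the executor's classloader, which would explain the ClassNotFoundException. From the driver the jar is still visibly registered; a small illustrative pyspark check (not part of the original repro):

# Driver side, in the pyspark session launched with --jars above.
# The plugin jar is listed for distribution to executors ...
print(spark.conf.get("spark.jars", "<unset>"))
# ... but executors only download user jars after SparkEnv (and hence the
# shuffle manager) has been created, so the executor-side lookup fails.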

WORKS:

However, substituting --jars with the combination of --driver-class-path and spark.executor.extraClassPath works fine:

$SPARK_HOME/bin/pyspark \
  --driver-class-path $PWD/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.executor.extraClassPath=$PWD/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
  --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
  --master spark://localhost:7077 \
  --num-executors 1
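
To confirm that the executors can actually instantiate the RapidsShuffleManager in this configuration, it is enough to check the setting and force a shuffle. An illustrative pyspark snippet (not part of the original report):

# Inside the pyspark session started with the command above.
print(spark.conf.get("spark.shuffle.manager"))
# com.nvidia.spark.rapids.spark321.RapidsShuffleManager

# Any shuffling job exercises the shuffle manager on the executors;
# if the class could not be loaded, the executors would already have crashed.
print(spark.range(1000000).repartition(8).count())
# 1000000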

To reconcile this with #5758, placing all jars on Spark's initial classpath works (a sketch for creating the /tmp/a.avro test file used below follows after the transcript):

$SPARK_HOME/bin/pyspark \
  --driver-class-path $PWD/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar:$HOME/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.2.1.jar \
  --conf spark.executor.extraClassPath=$PWD/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar:$HOME/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.2.1.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.sql.format.avro.enabled=true \
  --conf spark.rapids.sql.format.avro.read.enabled=true \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
  --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
  --conf spark.rapids.memory.gpu.minAllocFraction=0 \
  --conf spark.rapids.memory.gpu.allocFraction=0.2 \
  --executor-cores=2 \
  --total-executor-cores=4 \
  --master spark://localhost:7077
>>> spark.read.format('avro').load('/tmp/a.avro').selectExpr('AVG(a)').collect()
22/06/08 16:48:57 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <HashAggregateExec> will run on GPU
      *Expression <AggregateExpression> partial_avg(a#0) will run on GPU
        *Expression <Average> avg(a#0) will run on GPU
      *Exec <FileSourceScanExec> will run on GPU

22/06/08 16:48:57 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <HashAggregateExec> will run on GPU
      *Expression <AggregateExpression> partial_avg(a#0) will run on GPU
        *Expression <Average> avg(a#0) will run on GPU
      *Exec <FileSourceScanExec> will run on GPU

22/06/08 16:48:57 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <HashAggregateExec> will run on GPU
      *Expression <AggregateExpression> partial_avg(a#0) will run on GPU
        *Expression <Average> avg(a#0) will run on GPU
      *Exec <FileSourceScanExec> will run on GPU

22/06/08 16:48:57 WARN GpuOverrides: 
*Exec <ShuffleExchangeExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
  *Exec <HashAggregateExec> will run on GPU
    *Expression <AggregateExpression> partial_avg(a#0) will run on GPU
      *Expression <Average> avg(a#0) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

22/06/08 16:48:58 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU

22/06/08 16:48:58 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU

[Row(avg(a)=5.0)] 
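The transcript above reads /tmp/a.avro, which is not included in the report. A minimal sketch for creating a compatible test file (assumes spark-avro is on the classpath as in the command above; the column name a matches the query, the values are arbitrary):

# Write a small Avro file with a single numeric column 'a' for the repro.
spark.range(10).selectExpr("CAST(id AS DOUBLE) AS a") \
    .write.format("avro").mode("overwrite").save("/tmp/a.avro")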

Expected behavior
A coherent doc, ideally providing a single supported method of deploying jars that always works.
Otherwise, explain why and when one method or the other is necessary.

Environment details (please complete the following information)

  • local, cloud providers