[BUG] RapidsShuffleManager and external packages for Spark Standalone #5796

Open · Tracked by #5757
gerashegalov opened this issue Jun 8, 2022 · 0 comments
Labels: bug (Something isn't working), shuffle (things that impact the shuffle plugin)

gerashegalov commented Jun 8, 2022

Describe the bug
Document all the caveats regarding deployment of the plugin jar.

Probably due to a Spark bug, it appears one cannot use --jars to supply the jar from which the shuffle manager is to be loaded. We should check whether a fix needs to be contributed to upstream Apache Spark.

Conversely, we run into issues when an external package such as spark-avro is supplied via --jars or --packages while our plugin is added to Spark's internal classpath, either

  • by placing the plugin jar in $SPARK_HOME/jars, or
  • by using --driver-class-path / spark.*.extraClassPath,

as noted in #5758.
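
A quick way to tell which deployment path is in effect is to inspect the classpath-related settings from the driver. A minimal pyspark sketch for illustration (driver side only, not part of the original repro):

# Run inside a pyspark session to see how jars were supplied.
# --jars / --packages entries show up under spark.jars / spark.jars.packages,
# while extraClassPath entries land on the JVM's system classpath.
for key in ("spark.jars",
            "spark.jars.packages",
            "spark.driver.extraClassPath",
            "spark.executor.extraClassPath"):
    print(key, "=", spark.conf.get(key, "<unset>"))

# The driver JVM's system classpath: includes $SPARK_HOME/jars and any
# extraClassPath entries, but not jars supplied via --jars.
print(spark.sparkContext._jvm.java.lang.System.getProperty("java.class.path"))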

Steps/Code to reproduce bug
To repro the shuffle manager issue specifically:

BROKEN:

$SPARK_HOME/bin/pyspark --jars dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
  --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
  --master spark://localhost:7077 \
  --num-executors 1

and observe the executor instances crashing with:

22/06/08 15:41:59 INFO TransportClientFactory: Successfully created connection to /10.0.0.132:33207 after 1 ms (0 ms spent in bootstraps)
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:419)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:408)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.spark321.RapidsShuffleManager
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
	at org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2642)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:315)
	at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:207)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:468)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	... 4 more
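
The stack trace shows the executor resolving the shuffle manager class while SparkEnv is being created, i.e. before any --jars have been fetched and added to the executor's classloader, which would explain the ClassNotFoundException. From the driver the jar is still visibly registered; a small illustrative pyspark check (not part of the original repro):

# Driver side, in the pyspark session launched with --jars above.
# The plugin jar is listed for distribution to executors ...
print(spark.conf.get("spark.jars", "<unset>"))
# ... but executors only download user jars after SparkEnv (and hence the
# shuffle manager) has been created, so the executor-side lookup fails.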

WORKS:

However, substituting --jars with the combination of --driver-class-path and spark.executor.extraClassPath works fine:

$SPARK_HOME/bin/pyspark \
  --driver-class-path $PWD/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.executor.extraClassPath=$PWD/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
  --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
  --master spark://localhost:7077 \
  --num-executors 1
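
To confirm that the executors can actually instantiate the RapidsShuffleManager in this configuration, it is enough to check the setting and force a shuffle. An illustrative pyspark snippet (not part of the original report):

# Inside the pyspark session started with the command above.
print(spark.conf.get("spark.shuffle.manager"))
# com.nvidia.spark.rapids.spark321.RapidsShuffleManager

# Any shuffling job exercises the shuffle manager on the executors;
# if the class could not be loaded, the executors would already have crashed.
print(spark.range(1000000).repartition(8).count())
# 1000000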

To reconcile this with #5758, placing all jars on Spark's initial classpath works (a sketch for creating the /tmp/a.avro test file used below follows after the transcript):

$SPARK_HOME/bin/pyspark \
  --driver-class-path $PWD/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar:$HOME/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.2.1.jar \
  --conf spark.executor.extraClassPath=$PWD/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar:$HOME/.ivy2/jars/org.apache.spark_spark-avro_2.12-3.2.1.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.sql.format.avro.enabled=true \
  --conf spark.rapids.sql.format.avro.read.enabled=true \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
  --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
  --conf spark.rapids.memory.gpu.minAllocFraction=0 \
  --conf spark.rapids.memory.gpu.allocFraction=0.2 \
  --executor-cores=2 \
  --total-executor-cores=4 \
  --master spark://localhost:7077
>>> spark.read.format('avro').load('/tmp/a.avro').selectExpr('AVG(a)').collect()
22/06/08 16:48:57 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <HashAggregateExec> will run on GPU
      *Expression <AggregateExpression> partial_avg(a#0) will run on GPU
        *Expression <Average> avg(a#0) will run on GPU
      *Exec <FileSourceScanExec> will run on GPU

22/06/08 16:48:57 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <HashAggregateExec> will run on GPU
      *Expression <AggregateExpression> partial_avg(a#0) will run on GPU
        *Expression <Average> avg(a#0) will run on GPU
      *Exec <FileSourceScanExec> will run on GPU

22/06/08 16:48:57 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU
  *Exec <ShuffleExchangeExec> will run on GPU
    *Partitioning <SinglePartition$> will run on GPU
    *Exec <HashAggregateExec> will run on GPU
      *Expression <AggregateExpression> partial_avg(a#0) will run on GPU
        *Expression <Average> avg(a#0) will run on GPU
      *Exec <FileSourceScanExec> will run on GPU

22/06/08 16:48:57 WARN GpuOverrides: 
*Exec <ShuffleExchangeExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
  *Exec <HashAggregateExec> will run on GPU
    *Expression <AggregateExpression> partial_avg(a#0) will run on GPU
      *Expression <Average> avg(a#0) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU

22/06/08 16:48:58 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU

22/06/08 16:48:58 WARN GpuOverrides: 
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU

[Row(avg(a)=5.0)] 
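The transcript above reads /tmp/a.avro, which is not included in the report. A minimal sketch for creating a compatible test file (assumes spark-avro is on the classpath as in the command above; the column name a matches the query, the values are arbitrary):

# Write a small Avro file with a single numeric column 'a' for the repro.
spark.range(10).selectExpr("CAST(id AS DOUBLE) AS a") \
    .write.format("avro").mode("overwrite").save("/tmp/a.avro")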

Expected behavior
A coherent doc, ideally providing a single supported method of deploying jars that always works.
Otherwise, explain why and when one method or the other is necessary.

Environment details (please complete the following information)

  • local, cloud providers