Describe the bug
Document all the caveats regarding the Plugin jar deployment.
Probably due to a Spark bug, it looks like one cannot use --jars to supply the jar from which the shuffle manager is loaded. See whether we need to contribute a fix to upstream Apache Spark.
On the other hand, we face issues when an external package such as spark-avro is used via --jars or --packages while our plugin is added to Spark's internal classpath, either
- by placing the plugin jar in $SPARK_HOME/jars
- or by using --driver-class-path/spark.*.extraClassPath
as noted in #5758

Steps/Code to reproduce bug
To repro the shuffle manager issue specifically:

BROKEN:
$SPARK_HOME/bin/pyspark --jars dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
  --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
  --master spark://localhost:7077 \
  --num-executors 1

and observe the executor instances crashing with:
22/06/08 15:41:59 INFO TransportClientFactory: Successfully created connection to /10.0.0.132:33207 after 1 ms (0 ms spent in bootstraps)
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:419)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:408)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.spark321.RapidsShuffleManager
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:216)
at org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2642)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:315)
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:207)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:468)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
... 4 more
WORKS:
However, substituting --jars with the combo --driver-class-path/spark.executor.extraClassPath works fine, presumably because the shuffle manager is instantiated while the executor's SparkEnv is being created, i.e. before jars shipped via --jars are fetched and added to the classloader:
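For example, a sketch along these lines (jar path and confs reused from the BROKEN command above; a spark-avro jar would additionally need to be on the same class paths for the avro query below):

$SPARK_HOME/bin/pyspark \
  --driver-class-path dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.executor.extraClassPath=dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executorEnv.UCX_ERROR_SIGNALS= \
  --conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
  --master spark://localhost:7077 \
  --num-executors 1

The avro query then runs on the GPU (test data can be generated with something like spark.range(1, 10).selectExpr('id as a').write.format('avro').save('/tmp/a.avro'), which yields the avg of 5.0 seen below):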
>>> spark.read.format('avro').load('/tmp/a.avro').selectExpr('AVG(a)').collect()
22/06/08 16:48:57 WARN GpuOverrides:
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU
*Exec <ShuffleExchangeExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> partial_avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
*Exec <FileSourceScanExec> will run on GPU
22/06/08 16:48:57 WARN GpuOverrides:
*Exec <ShuffleExchangeExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> partial_avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
*Exec <FileSourceScanExec> will run on GPU
22/06/08 16:48:58 WARN GpuOverrides:
*Exec <HashAggregateExec> will run on GPU
  *Expression <AggregateExpression> avg(a#0) will run on GPU
    *Expression <Average> avg(a#0) will run on GPU
  *Expression <Alias> avg(a#0)#2 AS avg(a)#3 will run on GPU
[Row(avg(a)=5.0)]

To reconcile this and #5758, placing all jars on Spark's initial classpath works:
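A sketch of that setup, assuming a spark-avro jar matching the Spark build, e.g. spark-avro_2.12-3.2.1.jar (file names illustrative):

# Put both the plugin jar and the external package on Spark's initial classpath
cp dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
   spark-avro_2.12-3.2.1.jar \
   $SPARK_HOME/jars/

$SPARK_HOME/bin/pyspark \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  --master spark://localhost:7077 \
  --num-executors 1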
Expected behavior
A coherent doc, ideally providing just one supported method of deploying jars that always works.
Otherwise, explain why and when one or the other method is necessary.
Environment details (please complete the following information)
local, cloud providers