[BUG] Fix IT discrepancy depending on TEST_PARALLEL #6044

Merged · 3 commits · Aug 16, 2022

Conversation

@res-life (Collaborator) commented Jul 21, 2022

Fixes #5714

Changes:

  • Use the spark.jars configuration instead of spark.executor.extraClassPath and spark.driver.extraClassPath.
    There are two code paths depending on TEST_PARALLEL:
    if ((${#TEST_PARALLEL_OPTS[@]} > 0));
    then
        exec python "${RUN_TESTS_COMMAND[@]}" "${TEST_PARALLEL_OPTS[@]}" "${TEST_COMMON_OPTS[@]}"
    else
        # We set the GPU memory size to be a constant value even if only running with a parallelism of 1
        # because it helps us have consistent test runs.
        exec "$SPARK_HOME"/bin/spark-submit --jars "${ALL_JARS// /,}" \
            --driver-java-options "$PYSP_TEST_spark_driver_extraJavaOptions" \
            $SPARK_SUBMIT_FLAGS \
            --conf 'spark.rapids.memory.gpu.allocSize='"$PYSP_TEST_spark_rapids_memory_gpu_allocSize" \
            "${RUN_TESTS_COMMAND[@]}" "${TEST_COMMON_OPTS[@]}"
    fi

Update the first path to also use spark.jars, which is the same as --jars.

spark.executor.extraClassPath is deprecated; see:
https://spark.apache.org/docs/latest/configuration.html

    spark.executor.extraClassPath:
    Extra classpath entries to prepend to the classpath of executors.
    This exists primarily for backwards-compatibility with older versions of Spark.
    Users typically should not need to set this option.

The spark.jars configuration is also documented at the above link.

  • Update the bash script to test for the jar file's existence before setting JAR_PATH; see the sketch below.
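
A minimal sketch of that existence check, assuming the script's LOCAL_JAR_PATH variable; the glob pattern and fallback below are illustrative, not the exact code in the PR:

# If the glob matches nothing, bash leaves the literal pattern in CANDIDATE,
# so test for a real file before committing to the path.
CANDIDATE=$(echo "$LOCAL_JAR_PATH"/rapids-4-spark-integration-tests*.jar)
if [[ -f "$CANDIDATE" ]]; then
    JAR_PATH="$CANDIDATE"
else
    JAR_PATH=""   # keep empty so a non-existent jar never reaches spark.jars
fi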

Signed-off-by: Chong Gao <[email protected]>

@res-life (Collaborator, Author)

build

@res-life (Collaborator, Author)

Note: this should be re-targeted to 22.10.

@tgravescs marked this pull request as draft July 21, 2022 12:51
@tgravescs (Collaborator)

Put in draft since this seems to be for 22.10.

@gerashegalov (Collaborator) left a comment

Looks good, just minor comments.

@res-life (Collaborator, Author)

The following command yields an invalid path (xxx/spark-avro*.jar) when there are no spark-avro jars; it should return an empty string instead.

AVRO_JARS=$(echo "$LOCAL_JAR_PATH"/spark-avro*.jar)

This will cause the following error if a non-existent jar is put into spark.jars:

22/07/22 04:03:12 ERROR SparkContext: Failed to add /home/non-exist.jar to Spark environment
java.io.FileNotFoundException: Jar /home/non-exist.jar not found
	at org.apache.spark.SparkContext.addLocalJarFile$1(SparkContext.scala:1949)
	at org.apache.spark.SparkContext.addJar(SparkContext.scala:2004)
	at org.apache.spark.SparkContext.$anonfun$new$12(SparkContext.scala:507)
	at org.apache.spark.SparkContext.$anonfun$new$12$adapted(SparkContext.scala:507)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:507)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:748)
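
As an aside, another way to guard against the unmatched-glob problem (not the approach this PR ends up taking) is bash's nullglob option, which makes an unmatched glob expand to nothing:

shopt -s nullglob
avro_candidates=( "$LOCAL_JAR_PATH"/spark-avro*.jar )   # empty array when nothing matches
shopt -u nullglob
if (( ${#avro_candidates[@]} > 0 )); then
    AVRO_JARS="${avro_candidates[*]}"
else
    AVRO_JARS=""
fi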

@gerashegalov (Collaborator)

The following command yields an invalid path (xxx/spark-avro*.jar) when there are no spark-avro jars; it should return an empty string instead.

AVRO_JARS=$(echo "$LOCAL_JAR_PATH"/spark-avro*.jar)

Note that I suggested replacing echo with readlink in #6044 (comment):

$ AVRO_JARS=$(readlink -f /non/existing/path/spark-avro*.jar)
$ echo -n "$AVRO_JARS" | wc -c
0

@res-life (Collaborator, Author) commented Jul 22, 2022

A bare readlink -f can't produce the right answer; see below.
See the latest revision of the code: readlink has now been added to canonicalize the path.

$ ls /home/chongg/local-disk/code/spark-rapids/integration_tests/target
run_dir
# the integration-tests jar is not present

$ ls /home/chongg/local-disk/code/spark-rapids/integration_tests/target/rapids-4-spark-integration-tests*-spark330.jar
ls: cannot access '/home/chongg/local-disk/code/spark-rapids/integration_tests/target/rapids-4-spark-integration-tests*-spark330.jar': No such file or directory
# ls confirms the integration-tests jar does not exist

$ readlink -f /home/chongg/local-disk/code/spark-rapids/integration_tests/target/rapids-4-spark-integration-tests*-spark330.jar
/home/chongg/local-disk/code/spark-rapids/integration_tests/target/rapids-4-spark-integration-tests*-spark330.jar
# but readlink -f still prints this non-existing path, so it can't give the right answer


@gerashegalov (Collaborator)

Re #6044 (comment):
I inadvertently gave you -f, a switch I often use with readlink, which allows the leaf path component to be absent. But there is also -e, which requires all path components to exist.

https://linuxcommand.org/lc3_man_pages/readlink1.html
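
For illustration, a quick transcript of the difference (the /tmp/demo path is hypothetical):

$ mkdir -p /tmp/demo                 # an existing directory containing no jars
$ readlink -f /tmp/demo/missing.jar  # -f tolerates a missing leaf component
/tmp/demo/missing.jar
$ readlink -e /tmp/demo/missing.jar  # -e requires every component to exist
$ echo $?
1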

@res-life (Collaborator, Author)

build

@res-life (Collaborator, Author)

Re #6044 (comment): I inadvertently gave you -f, a switch I often use with readlink, which allows the leaf path component to be absent. But there is also -e, which requires all path components to exist.

https://linuxcommand.org/lc3_man_pages/readlink1.html

Good idea, done.

@gerashegalov (Collaborator) left a comment

LGTM

gerashegalov previously approved these changes Jul 28, 2022

@gerashegalov (Collaborator) left a comment

LGTM.

Please verify that the script works with Iceberg. From talking to @jlowe, I recall it was sensitive to the extraClassPath vs --jars classloader difference:

if [[ "$ICEBERG_SPARK_VER" < "3.3" ]]; then
# Classloader config is here to work around classloader issues with
# --packages in distributed setups, should be fixed by
# https://github.com/NVIDIA/spark-rapids/pull/5646
SPARK_SUBMIT_FLAGS="$BASE_SPARK_SUBMIT_ARGS $SEQ_CONF \
--conf spark.rapids.force.caller.classloader=false \
--packages org.apache.iceberg:iceberg-spark-runtime-${ICEBERG_SPARK_VER}_2.12:${ICEBERG_VERSION} \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hadoop \
--conf spark.sql.catalog.spark_catalog.warehouse=/tmp/spark-warehouse-$$" \
./run_pyspark_from_build.sh -m iceberg --iceberg

@res-life changed the base branch from branch-22.08 to branch-22.10 July 29, 2022 05:05
@res-life (Collaborator, Author)

build

gerashegalov previously approved these changes Jul 29, 2022
@res-life (Collaborator, Author)

build

@gerashegalov marked this pull request as ready for review July 29, 2022 08:37
@gerashegalov (Collaborator)

build

@sameerz added the "test (Only impacts tests)" label Jul 29, 2022
Commit: …raClassPath and spark.driver.extraClassPath

Signed-off-by: Chong Gao <[email protected]>
@res-life (Collaborator, Author) commented Aug 1, 2022

build

@res-life marked this pull request as draft August 1, 2022 09:28
@res-life (Collaborator, Author) commented Aug 1, 2022

Investigating the class-not-found issue when running rapids_shuffle_smoke_test:
https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/jenkins/spark-premerge-build.sh#L91

22/08/01 05:25:08 INFO SecurityManager: Changing view acls groups to: 
22/08/01 05:25:08 INFO SecurityManager: Changing modify acls groups to: 
22/08/01 05:25:08 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
22/08/01 05:25:08 INFO TransportClientFactory: Successfully created connection to premerge-ci-1-jenkins-rapids-premerge-github-5274-b5jn1-05nrp/10.233.110.96:39717 after 3 ms (0 ms spent in bootstraps)
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:382)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: com.nvidia.spark.rapids.spark311.RapidsShuffleManager
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:207)
	at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:275)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:338)
	at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:205)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:442)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
	... 4 more

@gerashegalov (Collaborator)

Investigating the class not found issue when running rapids_shuffle_smoke_test https://github.com/NVIDIA/spark-rapids/blob/branch-22.08/jenkins/spark-premerge-build.sh#L91

This confirms my previous finding in #5796 that extraClassPath has been there from the beginning to deal with the bug in Spark Standalone. We can work around it and still remain consistent by inspecting whether spark.shuffle.manager is part of the config
https://github.com/NVIDIA/spark-rapids/blob/branch-22.10/integration_tests/run_pyspark_from_build.sh#L236
to decide whether to use --jars or extraClassPath. Then it is clearly documentable when to use which option.

@res-life (Collaborator, Author) commented Aug 2, 2022

build

@gerashegalov (Collaborator)

build

@res-life blossom-ci is disabled on draft PRs.

@pxLi maybe we could move the check into the yml, because there seems to be a Boolean "draft" field on the pull_request object: https://stackoverflow.com/questions/68349031/only-run-actions-on-non-draft-pull-request

Comment on lines 246 to 249
else
# If specified master, set `spark.executor.extraClassPath` due to issue https://github.com/NVIDIA/spark-rapids/issues/5796
# Remove this line if the issue is fixed
export PYSP_TEST_spark_executor_extraClassPath="${ALL_JARS}"

This is not what I meant in #6044 (comment). Whether to use extraClassPath is not decided based on the master. I just meant we need a similar check; let us undo this change.

We want to inspect whether PYSP_TEST_spark_shuffle_manager is set outside of `if ((NUM_LOCAL_EXECS > 0)); then ... else ... fi`. Please refer to the comment for L202.

Comment on lines 263 to 264
# `spark.jars` is the same as `--jars`, e.g.: --jars a.jar,b.jar...
exec "$SPARK_HOME"/bin/spark-submit --conf spark.jars=${PYSP_TEST_spark_jars} \

If the above is acceptable, here we can do:

Suggested change:

-    # `spark.jars` is the same as `--jars`, e.g.: --jars a.jar,b.jar...
-    exec "$SPARK_HOME"/bin/spark-submit --conf spark.jars=${PYSP_TEST_spark_jars} \
+    if [[ -n "$PYSP_TEST_spark_jars" ]]; then
+        jarOpts=(--conf spark.jars="$PYSP_TEST_spark_jars")
+    elif [[ -n "$PYSP_TEST_spark_driver_extraClassPath" ]]; then
+        jarOpts=(--driver-class-path "$PYSP_TEST_spark_driver_extraClassPath")
+    fi
+    # `spark.jars` is the same as `--jars`, e.g.: --jars a.jar,b.jar...
+    exec "$SPARK_HOME"/bin/spark-submit "${jarOpts[@]}" \

Comment on lines 198 to 202
export PYSP_TEST_spark_driver_extraClassPath="${ALL_JARS// /:}"
export PYSP_TEST_spark_executor_extraClassPath="${ALL_JARS// /:}"
export PYSP_TEST_spark_jars="${ALL_JARS}"

Here we can have something to the tune of:

    if [[ "${PYSP_TEST_spark_shuffle_manager}" =~ "RapidsShuffleManager" ]]; then
        export PYSP_TEST_spark_driver_extraClassPath="${ALL_JARS// /:}"
        export PYSP_TEST_spark_executor_extraClassPath="${ALL_JARS// /:}"
    else
        export PYSP_TEST_spark_jars="${ALL_JARS}"
    fi
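
(The ClassNotFoundException earlier in this thread suggests why this distinction matters: the executor's SparkEnv instantiates the spark.shuffle.manager class during startup, before jars shipped via --jars have been fetched onto the executor classpath, so a RapidsShuffleManager run still needs extraClassPath.)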

@pxLi (Member) commented Aug 3, 2022

build

@res-life blossom-ci is disabled on draft PRs.

@pxLi maybe we could move the check into the yml, because there seems to be a Boolean "draft" field on the pull_request object: https://stackoverflow.com/questions/68349031/only-run-actions-on-non-draft-pull-request

blossom-ci should work on draft PRs. This one actually timed out due to the above issues.

@gerashegalov (Collaborator)

blossom-ci should work on draft PRs. This one actually timed out due to the above issues.

Thanks @pxLi , good to know

@res-life marked this pull request as ready for review August 11, 2022 05:41
@res-life (Collaborator, Author)

build

@gerashegalov (Collaborator) left a comment

LGTM

Labels: test (Only impacts tests)
Closes: [BUG] discrepancy in the plugin jar deployment in run_pyspark_from_build.sh depending on TEST_PARALLEL (#5714)
5 participants