Add valid retry solution to mvn-verify [skip ci] #9609

Merged: 6 commits, Nov 8, 2023

Conversation

YanxuanLiu
Collaborator

@YanxuanLiu YanxuanLiu commented Nov 2, 2023

part of #9559

  1. Added script-level retry for mvn commands
  2. Added HTTP connection TTL options
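
For illustration, the connection TTL options could look roughly like the following. This is a minimal sketch, assuming Maven's Wagon HTTP transport is in use (Maven 3.9+ defaults to a different transport, where these properties may not apply). The variable name `MVN_HTTP_OPTS` and the values are hypothetical, not the ones this PR sets.

```shell
#!/usr/bin/env bash
# Hypothetical values; tune them for the CI environment.
# ttlSeconds: how long a pooled HTTP connection may be reused before it is
#             discarded, so stale connections are less likely to be reset.
# retryHandler.count: how many times the HTTP client retries a failed request.
MVN_HTTP_OPTS="-Dmaven.wagon.httpconnectionManager.ttlSeconds=25 -Dmaven.wagon.http.retryHandler.count=3"

# Example use (not executed here; word splitting of the variable is intended):
#   mvn -B $MVN_HTTP_OPTS verify
```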

Signed-off-by: YanxuanLiu <[email protected]>
@YanxuanLiu YanxuanLiu changed the title from "add retry in bash-level" to "Add valid retry solution to mvn-verify" Nov 2, 2023
Signed-off-by: YanxuanLiu <[email protected]>
Signed-off-by: YanxuanLiu <[email protected]>
@revans2
Collaborator

revans2 commented Nov 2, 2023

Why are we retrying verify? That includes running tests. If the connection to Maven Central is not ideal and needs to be retried, I am okay with doing that, but if a test is failing randomly and ends up being retried until it passes, I don't want that to happen.

@gerashegalov
Collaborator

Why are we retrying verify? That includes running tests. If the connection to Maven Central is not ideal and needs to be retried, I am okay with doing that, but if a test is failing randomly and ends up being retried until it passes, I don't want that to happen.

We don't run tests in GitHub Actions, only things like compile, RAT, scalastyle, and docgen.

Timeouts during spark-rapids-jni downloads are the most common reason for failed checks there. I think the proper solution is to implement a cache action (https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows) that would fetch heavy dependencies before the matrix jobs kick in. That would be equivalent to the similar logic in ./build/buildall

@sameerz sameerz added the build Related to CI / CD or cleanly building label Nov 2, 2023
@YanxuanLiu
Collaborator Author

Why are we retrying verify? That includes running tests. If the connection to Maven Central is not ideal and needs to be retried, I am okay with doing that, but if a test is failing randomly and ends up being retried until it passes, I don't want that to happen.

We don't run tests in GitHub Actions, only things like compile, RAT, scalastyle, and docgen.

Timeouts during spark-rapids-jni downloads are the most common reason for failed checks there. I think the proper solution is to implement a cache action (https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows) that would fetch heavy dependencies before the matrix jobs kick in. That would be equivalent to the similar logic in ./build/buildall

The cache action is used in the MONAILabel project, but maintaining the cache's disk space needs extra attention, and it may introduce more unexpected issues: https://github.com/Project-MONAI/MONAILabel/actions/runs/6563649076

For this feature, I think we can first check whether retrying solves the dependency download issue.

Signed-off-by: YanxuanLiu <[email protected]>
@YanxuanLiu
Collaborator Author

Why are we retrying verify? That includes running tests. If the connection to Maven Central is not ideal and needs to be retried, I am okay with doing that, but if a test is failing randomly and ends up being retried until it passes, I don't want that to happen.

@peixin and I tried to run mvn -U -B -Dmaven.repo.local=$WORKSPACE/.m2/repository dependency:go-offline dependency:resolve-plugins -Dcuda.version=cuda11 -Dbuildver=312 locally.

It failed with the following log:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for RAPIDS Accelerator for Apache Spark Root Project 23.12.0-SNAPSHOT:
[INFO] 
[INFO] RAPIDS Accelerator for Apache Spark Root Project ... SUCCESS [  0.469 s]
[INFO] rapids-4-spark-jdk-profiles_2.12 ................... SUCCESS [  0.129 s]
[INFO] rapids-4-spark-shim-deps-parent_2.12 ............... SUCCESS [  0.448 s]
[INFO] rapids-4-spark-sql-plugin-api_2.12 ................. SUCCESS [  7.887 s]
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin ..... FAILURE [  8.113 s]
[INFO] RAPIDS Accelerator for Apache Spark Shuffle Plugin . SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Scala UDF Plugin SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Delta Lake Stub  SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Aggregator ..... SKIPPED
[INFO] Data Generator ..................................... SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Distribution ... SKIPPED
[INFO] rapids-4-spark-integration-tests_2.12 .............. SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Tests .......... SKIPPED
[INFO] rapids-4-spark-api-validation_2.12 ................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  18.776 s
[INFO] Finished at: 2023-11-03T17:58:48+08:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project rapids-4-spark-sql_2.12: Could not resolve dependencies for project com.nvidia:rapids-4-spark-sql_2.12:jar:23.12.0-SNAPSHOT: Could not find artifact com.nvidia:rapids-4-spark-sql-plugin-api_2.12:jar:spark312:23.12.0-SNAPSHOT in snapshots-repo (https://oss.sonatype.org/content/repositories/snapshots) -> [Help 1]

These dependencies cannot be found on Sonatype, but they will be built by the subsequent mvn commands. In practice, we still need to download dependencies in the following steps, so we may still need a retry there.

We also ran mvn -U -B -Dmaven.repo.local=$WORKSPACE/.m2/repository dependency:go-offline -pl sql-plugin-api -Dcuda.version=cuda11 -Dbuildver=312 for a single module, and it succeeded.

Do you have any suggestions on using dependency:go-offline? Can we skip the dependencies not found, or upload these deps to sonatype?

@revans2
Collaborator

revans2 commented Nov 3, 2023

@gerashegalov yes I forgot about that. We don't run any tests so it is all about compiling.

@YanxuanLiu I think I found an alternative maven plugin that is designed to fix the problems that we are seeing.

mvn -U -B -Dmaven.repo.local=/home/roberte/src/rapids-plugin-4-spark/tmp-repo de.qaware.maven:go-offline-maven-plugin:1.2.8:resolve-dependencies -Dcuda.version=cuda11 -Dbuildver=330

The github page for it is at https://github.com/qaware/go-offline-maven-plugin

@gerashegalov
Collaborator

I am not sure we can rely on local paths to share dependency cache between matrix job runners. We can add some logging to actions to see if they share anything even when they land on the same VM.

However, I suggest using a front door github feature https://github.com/actions/setup-java#caching-packages-dependencies instead of cooking up a backdoor workaround.

Even with the cache action, our main issue is a DDOS fetching of spark-rapids-jni from parallel matrix jobs.

So we should add an upstream job doing this step with github dependency cache enabled

@pxLi
Member

pxLi commented Nov 6, 2023

Can we skip the dependencies not found, or upload these deps to sonatype?

We do not deploy intermediate snapshot artifacts of the plugin repo publicly, due to security requirements.

@pxLi
Member

pxLi commented Nov 6, 2023

Even with the cache action, our main issue is a DDOS fetching of spark-rapids-jni from parallel matrix jobs.

From my monitoring, it's not only JNI: when the Maven (Sonatype) service is unresponsive, any dependency download can fail randomly:

  1. conn reset from server side
  2. 502 internal service error
  3. Cannot find required artifact (search service down from their status page)

So we should add an upstream job doing this step with github dependency cache enabled

This could be an option to help reduce the frequency of similar issues.

It would also not entirely resolve the issue, since the pre-baked cache from that command would not cover all subsequent module builds; mvn verify/package could still fail to download non-pre-baked deps from the remote. But we should try everything that helps mitigate the issue @YanxuanLiu (the root cause is still the unstable Maven repo service, which we cannot do much about).

@YanxuanLiu
Collaborator Author

@gerashegalov yes I forgot about that. We don't run any tests so it is all about compiling.

@YanxuanLiu I think I found an alternative maven plugin that is designed to fix the problems that we are seeing.

mvn -U -B -Dmaven.repo.local=/home/roberte/src/rapids-plugin-4-spark/tmp-repo de.qaware.maven:go-offline-maven-plugin:1.2.8:resolve-dependencies -Dcuda.version=cuda11 -Dbuildver=330

The github page for it is at https://github.com/qaware/go-offline-maven-plugin

I tried the plugin locally, and the result does not look good.

[ERROR] Error downloading dependencies for project
[ERROR] The following artifacts could not be resolved: io.netty:netty-all:jar:sources:4.1.74.Final, io.netty:netty-all:jar:javadoc:4.1.74.Final, org.apache.hive:hive-common:jar:javadoc:2.3.9, org.apache.derby:derby:jar:sources:10.14.2.0, org.apache.derby:derby:jar:javadoc:10.14.2.0, io.netty:netty-transport-native-kqueue:jar:javadoc:4.1.74.Final, org.apache.xbean:xbean-asm9-shaded:jar:javadoc:4.20, org.apache.velocity:velocity:jar:sources:1.5, org.apache.velocity:velocity:jar:javadoc:1.5, org.apache.hive:hive-serde:jar:javadoc:2.3.9, io.netty:netty-transport-native-kqueue:jar:javadoc:4.1.74.Final, org.apache.hive:hive-llap-client:jar:javadoc:2.3.9, javax.activation:activation:jar:javadoc:1.1.1, io.netty:netty-transport-native-epoll:jar:javadoc:4.1.74.Final, org.apache.thrift:libfb303:jar:sources:0.9.3, org.apache.thrift:libfb303:jar:javadoc:0.9.3, org.apache.hive.shims:hive-shims-common:jar:javadoc:2.3.9, oro:oro:jar:javadoc:2.0.8, javax.transaction:transaction-api:jar:javadoc:1.1, io.netty:netty-transport-native-epoll:jar:javadoc:4.1.74.Final, org.apache.hadoop:hadoop-client-runtime:jar:sources:3.3.2, org.apache.hadoop:hadoop-client-runtime:jar:javadoc:3.3.2, org.apache.hive.shims:hive-shims-scheduler:jar:javadoc:2.3.9, org.apache.hive:hive-llap-common:jar:javadoc:2.3.9, org.apache.hadoop:hadoop-client-api:jar:javadoc:3.3.2, org.apache.hive:hive-vector-code-gen:jar:javadoc:2.3.9, org.apache.hive:hive-exec:jar:javadoc:2.3.9, stax:stax-api:jar:sources:1.0.1, stax:stax-api:jar:javadoc:1.0.1, org.apache.hive:hive-shims:jar:javadoc:2.3.9, org.apache.hive:hive-metastore:jar:javadoc:2.3.9, org.apache.parquet:parquet-jackson:jar:javadoc:1.12.2, org.apache.hive.shims:hive-shims-0.23:jar:javadoc:2.3.9, javolution:javolution:jar:javadoc:5.5.1, javax.transaction:jta:jar:javadoc:1.1: Could not find artifact io.netty:netty-all:jar:sources:4.1.74.Final in central (https://repo1.maven.org/maven2)
[WARNING] The following artifacts could not be resolved: io.netty:netty-all:jar:sources:4.1.74.Final, io.netty:netty-all:jar:javadoc:4.1.74.Final, org.apache.hive:hive-common:jar:javadoc:2.3.9, org.apache.derby:derby:jar:sources:10.14.2.0, org.apache.derby:derby:jar:javadoc:10.14.2.0, io.netty:netty-transport-native-kqueue:jar:javadoc:4.1.74.Final, org.apache.xbean:xbean-asm9-shaded:jar:javadoc:4.20, org.apache.velocity:velocity:jar:sources:1.5, org.apache.velocity:velocity:jar:javadoc:1.5, org.apache.hive:hive-serde:jar:javadoc:2.3.9, io.netty:netty-transport-native-kqueue:jar:javadoc:4.1.74.Final, org.apache.hive:hive-llap-client:jar:javadoc:2.3.9, javax.activation:activation:jar:javadoc:1.1.1, io.netty:netty-transport-native-epoll:jar:javadoc:4.1.74.Final, org.apache.thrift:libfb303:jar:sources:0.9.3, org.apache.thrift:libfb303:jar:javadoc:0.9.3, org.apache.hive.shims:hive-shims-common:jar:javadoc:2.3.9, oro:oro:jar:javadoc:2.0.8, javax.transaction:transaction-api:jar:javadoc:1.1, io.netty:netty-transport-native-epoll:jar:javadoc:4.1.74.Final, org.apache.hadoop:hadoop-client-runtime:jar:sources:3.3.2, org.apache.hadoop:hadoop-client-runtime:jar:javadoc:3.3.2, org.apache.hive.shims:hive-shims-scheduler:jar:javadoc:2.3.9, org.apache.hive:hive-llap-common:jar:javadoc:2.3.9, org.apache.hadoop:hadoop-client-api:jar:javadoc:3.3.2, org.apache.hive:hive-vector-code-gen:jar:javadoc:2.3.9, org.apache.hive:hive-exec:jar:javadoc:2.3.9, stax:stax-api:jar:sources:1.0.1, stax:stax-api:jar:javadoc:1.0.1, org.apache.hive:hive-shims:jar:javadoc:2.3.9, org.apache.hive:hive-metastore:jar:javadoc:2.3.9, org.apache.parquet:parquet-jackson:jar:javadoc:1.12.2, org.apache.hive.shims:hive-shims-0.23:jar:javadoc:2.3.9, javolution:javolution:jar:javadoc:5.5.1, javax.transaction:jta:jar:javadoc:1.1: Could not find artifact io.netty:netty-all:jar:sources:4.1.74.Final in central (https://repo1.maven.org/maven2)
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for RAPIDS Accelerator for Apache Spark Root Project 23.12.0-SNAPSHOT:
[INFO] 
[INFO] RAPIDS Accelerator for Apache Spark Root Project ... SUCCESS [01:18 min]
[INFO] rapids-4-spark-jdk-profiles_2.12 ................... SKIPPED
[INFO] rapids-4-spark-shim-deps-parent_2.12 ............... SKIPPED
[INFO] rapids-4-spark-sql-plugin-api_2.12 ................. SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark SQL Plugin ..... SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Shuffle Plugin . SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Scala UDF Plugin SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Delta Lake 2.1.x Support SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Delta Lake 2.2.x Support SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Delta Lake 2.3.x Support SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Aggregator ..... SKIPPED
[INFO] Data Generator ..................................... SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Distribution ... SKIPPED
[INFO] rapids-4-spark-integration-tests_2.12 .............. SKIPPED
[INFO] RAPIDS Accelerator for Apache Spark Tests .......... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:20 min
[INFO] Finished at: 2023-11-06T17:02:31+08:00
[INFO] ------------------------------------------------------------------------

Also, we would need to add all dynamic dependencies to the plugin config, and I'm not sure that is good practice to maintain.

@YanxuanLiu YanxuanLiu marked this pull request as ready for review November 7, 2023 02:46
@pxLi pxLi added the test Only impacts tests label Nov 7, 2023
Member

@pxLi pxLi left a comment

Changed the PR to a partial fix for #9559 to help mitigate the issue first (different CI runs are still failing every day).

Please retrigger the run multiple times and check the logs to see whether the retry actually resolves the issue in some scenarios (and include a cost analysis). Thanks.

@YanxuanLiu YanxuanLiu self-assigned this Nov 7, 2023
@YanxuanLiu
Collaborator Author

Triggered the action several times and caught one example of a retry:
https://github.com/NVIDIA/spark-rapids/actions/runs/6742625790/job/18430371626?pr=9609#step:5:3456
The stage took 5m12s with the retry, versus 4m51s without one, so the main difference is the sleep time.

I set the sleep times for the 3 retries to 30s, 60s, and 120s. In theory, the stage would cost less than 4 extra minutes if it retried 3 times.
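
The retry schedule described above can be sketched as a small shell helper. This is a minimal sketch under assumptions, not the exact script from this PR: the function name `retry` and the `RETRY_MAX` / `RETRY_DELAY` variables are hypothetical.

```shell
#!/usr/bin/env bash
# Run a command, retrying on failure with a delay that doubles each time.
# RETRY_MAX:   retries allowed after the first attempt (default 3)
# RETRY_DELAY: initial sleep in seconds (default 30, giving 30s, 60s, 120s)
retry() {
  local max_retry="${RETRY_MAX:-3}"
  local delay="${RETRY_DELAY:-30}"
  local attempt=1
  while true; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    if [ "$attempt" -gt "$max_retry" ]; then
      echo "command failed after ${max_retry} retries: $*" >&2
      return 1
    fi
    echo "command failed, retry ${attempt}/${max_retry} in ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
    delay=$(( delay * 2 ))
  done
}

# Example use (not executed here):
#   retry mvn -B verify -Dbuildver=312
```

With the defaults, three retries sleep 30 + 60 + 120 = 210s in total, which is consistent with the estimate of less than 4 extra minutes. Because the helper retries any non-zero exit, it fits only jobs where a failure is likely to be a transient download problem.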

@pxLi pxLi changed the title from "Add valid retry solution to mvn-verify" to "Add valid retry solution to mvn-verify [skip ci]" Nov 8, 2023
@pxLi
Member

pxLi commented Nov 8, 2023

build

@YanxuanLiu YanxuanLiu merged commit 9e10f26 into NVIDIA:branch-23.12 Nov 8, 2023
Labels
build Related to CI / CD or cleanly building test Only impacts tests
5 participants