[BUG] 22.06 testCudaAsyncMemoryResourceSize failed w/ latest cudf commit #287

Closed
pxLi opened this issue May 26, 2022 · 18 comments

pxLi commented May 26, 2022

Describe the bug
The nightly UT failed with 9e593b3:

10:50:06  [ERROR] testCudaAsyncMemoryResourceSize  Time elapsed: 0.008 s  <<< ERROR!
10:50:06  ai.rapids.cudf.CudfException: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-4-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/detail/dynamic_load_runtime.hpp:139: cudaErrorInvalidValue invalid argument
10:50:06  	at ai.rapids.cudf.Rmm.initializeInternal(Native Method)
10:50:06  	at ai.rapids.cudf.Rmm.initialize(Rmm.java:119)
10:50:06  	at ai.rapids.cudf.RmmTest.testCudaAsyncMemoryResourceSize(RmmTest.java:392)
10:50:06  	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
10:50:06  	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
10:50:06  	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
10:50:06  	at java.lang.reflect.Method.invoke(Method.java:498)
10:50:06  	at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
10:50:06  	at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
10:50:06  	at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
10:50:06  	at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
10:50:06  	at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
10:50:06  	at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
10:50:06  	at org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
10:50:06  	at org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
10:50:06  	at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
10:50:06  	at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
10:50:06  	at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
10:50:06  	at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
10:50:06  	at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
10:50:06  	at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
10:50:06  	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:214)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:210)
10:50:06  	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:135)
10:50:06  	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:66)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
10:50:06  	at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
10:50:06  	at java.util.ArrayList.forEach(ArrayList.java:1259)
10:50:06  	at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
10:50:06  	at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
10:50:06  	at java.util.ArrayList.forEach(ArrayList.java:1259)
10:50:06  	at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
10:50:06  	at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
10:50:06  	at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.submit(SameThreadHierarchicalTestExecutorService.java:35)
10:50:06  	at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:57)
10:50:06  	at org.junit.platform.engine.support.hierarchical.HierarchicalTestEngine.execute(HierarchicalTestEngine.java:54)
10:50:06  	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
10:50:06  	at org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
10:50:06  	at org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202)
10:50:06  	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181)
10:50:06  	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128)
10:50:06  	at org.junit.platform.surefire.provider.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:155)
10:50:06  	at org.junit.platform.surefire.provider.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:134)
10:50:06  	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:383)
10:50:06  	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:344)
10:50:06  	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:125)
10:50:06  	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:417)
10:50:06  
pxLi added the bug label on May 26, 2022
pxLi changed the title from "[BUG] 22.08 testCudaAsyncMemoryResourceSize failed w/ latest cudf commit" to "[BUG] 22.06 testCudaAsyncMemoryResourceSize failed w/ latest cudf commit" on May 26, 2022

pxLi commented May 26, 2022

Could be related to the frequent RMM changes at https://github.com/rapidsai/rmm/commits/branch-22.06.


Will try rebuilding later.


jlowe commented May 26, 2022

I was able to build 9e593b3 successfully via build-in-docker and all RmmTest tests passed.

@gerashegalov

Maybe the build machine has an older CUDA driver and it reproduces only there?


pxLi commented May 27, 2022

Actually it's due to the RMM change, since RMM is a submodule of the cudf build.

The build failed with one related change, and passed once the revert commit went in.



pxLi commented May 27, 2022

Closing this ticket since the RMM revert is in, and the CI build passes now.

pxLi closed this as completed on May 27, 2022
pxLi reopened this on May 27, 2022

pxLi commented May 27, 2022

Hmm, it failed again in the nightly test.
In RMM (https://github.com/rapidsai/rmm/commits/branch-22.06) they reverted the revert again, which reintroduced the dynamic_load_runtime.hpp change.

It's weird that we only fail this RMM test in the nightly CI (https://github.com/NVIDIA/spark-rapids-jni/blob/branch-22.06/ci/nightly-build.sh#L25-L29), but the RMM test in the submodule sync-up CI seems to work fine with the same image (CUDA 11.5 runtime) and Jenkins instance: https://github.com/NVIDIA/spark-rapids-jni/blob/branch-22.06/ci/submodule-sync.sh#L68-L71

Not sure if we missed something with verify vs package; maybe it's just bad timing...


jlowe commented May 27, 2022

Not sure if we missed something with verify vs package

No, there's nothing different between verify and package goals with respect to the test. Both goals will first go through the same test goal on their way to their ultimate destination goal, and the test goal is where it's failing.

Are we positive the two containers are running on the same host OS? Docker doesn't isolate the kernel and drivers, so there could be a problem with the driver on one host OS being a different version than the driver on another host OS even though they're running the same Docker image in both cases.


pxLi commented May 30, 2022

Are we positive the two containers are running on the same host OS? Docker doesn't isolate the kernel and drivers, so there could be a problem with the driver on one host OS being a different version than the driver on another host OS even though they're running the same Docker image in both cases.

Confirmed it's due to host driver versions:
460.58 failed the RMM tests,
465.07, 470.13, and 495.29 passed the test.

Looks like the recent RMM change relies on some new API of the driver (kernel-mode) library. It seems to only detect the CUDA version (11.5 in this case), but does not handle the driver/kernel library version (460.xx) well:

rapidsai/rmm@914cb4c#diff-3e3264777cc642f393b35c720e0c55626d678c5e8fc11f0b6e9f25053d7c6ca3R138

I am not a CUDA expert, can you help take a look at the above change? Thanks! @jlowe @gerashegalov
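
For reference, a minimal standalone sketch (not RMM's actual detection code, only an assumption about what such a guard could look like) of querying whether the current device/driver supports the cudaMallocAsync pools the async resource relies on, using standard CUDA runtime calls:

```cpp
// Standalone illustration: check driver support for cudaMallocAsync pools
// before attempting to construct an async memory resource.
#include <cuda_runtime_api.h>
#include <cstdio>

int main() {
  int device = 0;
  int driver_version = 0;   // e.g. roughly 11020 for a 460.xx driver, 11050 for 495.xx
  int pools_supported = 0;  // 0 on drivers that lack cudaMallocAsync support

  cudaDriverGetVersion(&driver_version);
  cudaDeviceGetAttribute(&pools_supported, cudaDevAttrMemoryPoolsSupported, device);

  std::printf("driver CUDA version: %d, memory pools supported: %d\n",
              driver_version, pools_supported);

  if (!pools_supported) {
    // On a 460.58 host this branch would be expected; constructing the async
    // resource anyway is what surfaces as cudaErrorInvalidValue above.
    std::printf("async pool allocator not supported by this driver\n");
  }
  return 0;
}
```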

@gerashegalov

Hi @pxLi, I filed an issue with RMM: rapidsai/rmm#1054

@tgravescs

We don't need the cudf change in 22.06 for this because, with the spark-rapids plugin, we would fall back to using ARENA for any driver version < 11.5, and we statically link the runtime to be 11.5.

The open things to clarify are:

  1. Which versions do we really need to support for async? We should update RmmTest to match; talking to the cudf team to clarify this.
  2. Our CI environment should be testing exactly what we want it to. In this case I think we want to make sure the driver version is always 11.5 so we test async. If we choose to test more versions, that should be explicitly decided and CI should be set up to be consistent.

Based on this, @pxLi, can we be more specific in our CI runs to make sure the nodes we run on have a minimum driver version of 11.5?
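
For illustration only, a minimal sketch of that fallback decision (not the actual spark-rapids plugin code; the names below are made up):

```cpp
// Sketch of the fallback described above: use the ASYNC allocator only when
// the driver reports support for CUDA 11.5 or newer, otherwise use ARENA.
#include <cuda_runtime_api.h>

enum class PoolMode { ASYNC, ARENA };

PoolMode select_pool_mode() {
  int driver_version = 0;  // encoded as major * 1000 + minor * 10, e.g. 11050 for 11.5
  if (cudaDriverGetVersion(&driver_version) != cudaSuccess) {
    return PoolMode::ARENA;  // be conservative if the query itself fails
  }
  return driver_version >= 11050 ? PoolMode::ASYNC : PoolMode::ARENA;
}
```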

@tgravescs

Leaving this targeted for 22.06 to see if we can get the CI environment clarified.

@tgravescs

So for number 1 above (the async support), Mark has responded on the PR with a more appropriate change, which I believe will handle that, so the test should work as is.


pxLi commented Jun 2, 2022

We don't need the cudf change in 22.06 for this because, with the spark-rapids plugin, we would fall back to using ARENA for any driver version < 11.5, and we statically link the runtime to be 11.5.

Based on this, @pxLi, can we be more specific in our CI runs to make sure the nodes we run on have a minimum driver version of 11.5?

There is actually no single driver version corresponding to CUDA 11.5 (I guess you mean the CUDA kernel-mode library?); the 450.XX, 460.XX, 470.XX, and 495.XX driver branches can all support CUDA 11.5 after certain minor XX versions. And the reason our statically linked CUDA 11.5 runtime can run well on drivers with a smaller major version is forward compatibility: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title
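
As a quick illustration of that split (a sketch only; the printed numbers are hypothetical for a 460.xx host): the statically linked runtime version is fixed at build time, while the driver reports the CUDA version it supports at run time.

```cpp
// Compare the compile-time runtime version with the driver-reported version.
#include <cuda_runtime_api.h>
#include <cstdio>

int main() {
  int driver_version = 0;
  cudaDriverGetVersion(&driver_version);
  // On a 460.xx data-center driver this might print something like
  // "runtime 11050, driver 11020": the 11.5 runtime still runs thanks to
  // the CUDA compatibility guarantees linked above, but features that need
  // a newer driver are unavailable.
  std::printf("runtime %d, driver %d\n", CUDART_VERSION, driver_version);
  return 0;
}
```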

So I would prefer to keep this, since the RAPIDS side does not cover as many scenarios as ours; otherwise, we may hide test failures like this one. People could fail to use our plugin (or the JNI in this case) in environments where they have a data-center GPU and drivers that support CUDA 11 forward compatibility. It's lucky that this time the plugin code has a selector for the specific RMM pool, but next time it could be our users reporting some other bug if we/RAPIDS do not handle forward compatibility well.

Our CI environment should be testing exactly what we want it to. In this case I think we want to make sure the driver version is always 11.5 so we test async. If we choose to test more versions, that should be explicitly decided and CI should be set up to be consistent.

All drivers on our CI/CD instances are announced as supporting CUDA 11 forward compatibility. So IMHO, it should be our code, not the CI environment, that covers these cases. I do agree with explicitly setting up some pipelines with specific drivers once we have more resources.


pxLi commented Jun 2, 2022

@tgravescs @sameerz
If we think the cudf/plugin JNI build & test should be a different case (no need to test forward compatibility) from the plugin build & test, please let me know. I could ask the Blossom team to help upgrade the driver versions on our CI machines to one specific version for the cudf & plugin JNI only.

BTW, since you decided not to put the fix into 22.06, I just updated the 22.06 cudf & plugin JNI build to use instances with newer drivers. For 22.08, @gerashegalov's fix should help.

@tgravescs

I do think it's good to test on various versions to catch things like this, but at the same time we should know what is being tested and make sure it's being tested regularly, i.e. cycling through different versions every other day rather than maybe hitting the same 460 kernel three days in a row. Like you said, "It's lucky this time the plugin code has a selector for the specific RMM pool"... we should not be relying on luck.

When it's run on the older driver versions, certain features (in this case async memory) are not supported, so the tests aren't verifying that they work. Async is our default configuration, so we definitely want it tested. Ideally, before a release we should test on all the versions. Is there any way to tell what version is being tested each day? This mostly comes down to resources and deciding what we want to test. @sameerz


pxLi commented Jun 6, 2022

Is there any way to tell what version is being tested each day?

We usually have the nvidia-smi output for some of the pipelines; if others do not have it, we can add it to them. And yes, the drivers are currently mixed in the nightly runs because there are not enough resources to schedule specific instances for the pipelines; we could make that happen when the new machines are ready (we requested 10+ machines in the latest infra resource plan).

For testing features with specific drivers, we may need a huge matrix. As we know, NVIDIA has many driver versions, and there is no clear table saying which driver major.minor version supports which specific CUDA features. Mostly it only tells us that if your driver is newer than a specific version and you have a data-center GPU, the environment should support forward compatibility. Or maybe we have a clear table somewhere? Thanks.

@tgravescs

Moving to 22.08 as we don't need anything for 22.06; the build just randomly fails depending on which host it gets.

rapidsai/rmm#1055 should fix this in 22.08.

Yes, we need to decide what testing matrix we want if we want to change anything here. @sameerz, do you want to discuss this offline or open a separate issue for it?

gerashegalov self-assigned this on Jun 20, 2022
@gerashegalov

rapidsai/rmm#1055 was propagated to spark-rapids-jni a while ago, and this issue is fixed as far as I see.
