[BUG] 22.06 testCudaAsyncMemoryResourceSize failed w/ latest cudf commit #287

Closed
pxLi opened this issue May 26, 2022 · 18 comments

pxLi commented May 26, 2022

Describe the bug
The nightly UT failed with 9e593b3:

10:50:06  [ERROR] testCudaAsyncMemoryResourceSize  Time elapsed: 0.008 s  <<< ERROR!
10:50:06  ai.rapids.cudf.CudfException: CUDA error at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-4-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/detail/dynamic_load_runtime.hpp:139: cudaErrorInvalidValue invalid argument
10:50:06  	at ai.rapids.cudf.Rmm.initializeInternal(Native Method)
10:50:06  	at ai.rapids.cudf.Rmm.initialize(Rmm.java:119)
10:50:06  	at ai.rapids.cudf.RmmTest.testCudaAsyncMemoryResourceSize(RmmTest.java:392)
10:50:06  	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
10:50:06  	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
10:50:06  	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
10:50:06  	at java.lang.reflect.Method.invoke(Method.java:498)
10:50:06  	at org.junit.platform.commons.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:725)
10:50:06  	at org.junit.jupiter.engine.execution.MethodInvocation.proceed(MethodInvocation.java:60)
10:50:06  	at org.junit.jupiter.engine.execution.InvocationInterceptorChain$ValidatingInvocation.proceed(InvocationInterceptorChain.java:131)
10:50:06  	at org.junit.jupiter.engine.extension.TimeoutExtension.intercept(TimeoutExtension.java:149)
10:50:06  	at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestableMethod(TimeoutExtension.java:140)
10:50:06  	at org.junit.jupiter.engine.extension.TimeoutExtension.interceptTestMethod(TimeoutExtension.java:84)
10:50:06  	at org.junit.jupiter.engine.execution.ExecutableInvoker$ReflectiveInterceptorCall.lambda$ofVoidMethod$0(ExecutableInvoker.java:115)
10:50:06  	at org.junit.jupiter.engine.execution.ExecutableInvoker.lambda$invoke$0(ExecutableInvoker.java:105)
10:50:06  	at org.junit.jupiter.engine.execution.InvocationInterceptorChain$InterceptedInvocation.proceed(InvocationInterceptorChain.java:106)
10:50:06  	at org.junit.jupiter.engine.execution.InvocationInterceptorChain.proceed(InvocationInterceptorChain.java:64)
10:50:06  	at org.junit.jupiter.engine.execution.InvocationInterceptorChain.chainAndInvoke(InvocationInterceptorChain.java:45)
10:50:06  	at org.junit.jupiter.engine.execution.InvocationInterceptorChain.invoke(InvocationInterceptorChain.java:37)
10:50:06  	at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:104)
10:50:06  	at org.junit.jupiter.engine.execution.ExecutableInvoker.invoke(ExecutableInvoker.java:98)
10:50:06  	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.lambda$invokeTestMethod$7(TestMethodTestDescriptor.java:214)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.invokeTestMethod(TestMethodTestDescriptor.java:210)
10:50:06  	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:135)
10:50:06  	at org.junit.jupiter.engine.descriptor.TestMethodTestDescriptor.execute(TestMethodTestDescriptor.java:66)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:151)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
10:50:06  	at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
10:50:06  	at java.util.ArrayList.forEach(ArrayList.java:1259)
10:50:06  	at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
10:50:06  	at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
10:50:06  	at java.util.ArrayList.forEach(ArrayList.java:1259)
10:50:06  	at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:41)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$6(NodeTestTask.java:155)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:141)
10:50:06  	at org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$9(NodeTestTask.java:139)
10:50:06  	at org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:138)
10:50:06  	at org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:95)
10:50:06  	at org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.submit(SameThreadHierarchicalTestExecutorService.java:35)
10:50:06  	at org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:57)
10:50:06  	at org.junit.platform.engine.support.hierarchical.HierarchicalTestEngine.execute(HierarchicalTestEngine.java:54)
10:50:06  	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
10:50:06  	at org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
10:50:06  	at org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202)
10:50:06  	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181)
10:50:06  	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128)
10:50:06  	at org.junit.platform.surefire.provider.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:155)
10:50:06  	at org.junit.platform.surefire.provider.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:134)
10:50:06  	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:383)
10:50:06  	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:344)
10:50:06  	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:125)
10:50:06  	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:417)
10:50:06  
pxLi added the bug label on May 26, 2022
pxLi changed the title from "[BUG] 22.08 testCudaAsyncMemoryResourceSize failed w/ latest cudf commit" to "[BUG] 22.06 testCudaAsyncMemoryResourceSize failed w/ latest cudf commit" on May 26, 2022

pxLi commented May 26, 2022

Could be related to the frequent RMM changes at https://github.com/rapidsai/rmm/commits/branch-22.06.


Will try rebuilding later.


jlowe commented May 26, 2022

I was able to build 9e593b3 successfully via build-in-docker and all RmmTest tests passed.

@gerashegalov

Maybe the build machine has an older CUDA driver and it reproduces only there?


pxLi commented May 27, 2022

Actually it's due to the RMM change, since RMM is a submodule of the cudf build.

The build failed with one related change, and passed once the revert commit went in.



pxLi commented May 27, 2022

Closing this ticket since the RMM revert is in, and the CI build passes now.

pxLi closed this as completed on May 27, 2022
pxLi reopened this on May 27, 2022

pxLi commented May 27, 2022

Hmm, it failed again in the nightly test.
In RMM (https://github.com/rapidsai/rmm/commits/branch-22.06) they reverted the revert again, which reintroduced the dynamic_load_runtime.hpp change.

It's weird that we only fail this RMM test in the nightly CI (https://github.com/NVIDIA/spark-rapids-jni/blob/branch-22.06/ci/nightly-build.sh#L25-L29), but the RMM test in the submodule sync-up CI seems to work fine with the same image (CUDA 11.5 runtime) and Jenkins instance: https://github.com/NVIDIA/spark-rapids-jni/blob/branch-22.06/ci/submodule-sync.sh#L68-L71

Not sure if we missed something with verify vs package; maybe it's just bad timing...


jlowe commented May 27, 2022

Not sure if we missed something with verify vs package

No, there's nothing different between verify and package goals with respect to the test. Both goals will first go through the same test goal on their way to their ultimate destination goal, and the test goal is where it's failing.

Are we positive the two containers are running on the same host OS? Docker doesn't isolate the kernel and drivers, so there could be a problem with the driver on one host OS being a different version than the driver on another host OS even though they're running the same Docker image in both cases.


pxLi commented May 30, 2022

Are we positive the two containers are running on the same host OS? Docker doesn't isolate the kernel and drivers, so there could be a problem with the driver on one host OS being a different version than the driver on another host OS even though they're running the same Docker image in both cases.

Confirmed it's due to host driver versions:
460.58 failed the RMM tests,
465.07, 470.13, and 495.29 passed the test.

Looks like the recent RMM change relies on some new API of the driver (kernel-mode) library. It seems to only detect the CUDA version (11.5 in this case), but does not handle the driver/kernel library version (460.xx) well:

rapidsai/rmm@914cb4c#diff-3e3264777cc642f393b35c720e0c55626d678c5e8fc11f0b6e9f25053d7c6ca3R138

I am not a CUDA expert, can you help take a look at the above change? Thanks! @jlowe @gerashegalov
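
For reference, a minimal standalone sketch (not RMM's actual detection code, only an assumption about what such a guard could look like) of querying whether the current device/driver supports the cudaMallocAsync pools the async resource relies on, using standard CUDA runtime calls:

```cpp
// Standalone illustration: check driver support for cudaMallocAsync pools
// before attempting to construct an async memory resource.
#include <cuda_runtime_api.h>
#include <cstdio>

int main() {
  int device = 0;
  int driver_version = 0;   // e.g. roughly 11020 for a 460.xx driver, 11050 for 495.xx
  int pools_supported = 0;  // 0 on drivers that lack cudaMallocAsync support

  cudaDriverGetVersion(&driver_version);
  cudaDeviceGetAttribute(&pools_supported, cudaDevAttrMemoryPoolsSupported, device);

  std::printf("driver CUDA version: %d, memory pools supported: %d\n",
              driver_version, pools_supported);

  if (!pools_supported) {
    // On a 460.58 host this branch would be expected; constructing the async
    // resource anyway is what surfaces as cudaErrorInvalidValue above.
    std::printf("async pool allocator not supported by this driver\n");
  }
  return 0;
}
```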

@gerashegalov

Hi @pxLi, I filed an issue with RMM: rapidsai/rmm#1054

@tgravescs

We don't need the cudf change in 22.06 for this because, with the spark-rapids plugin, we would fall back to using ARENA for any driver version < 11.5, and we statically link the runtime to be 11.5.

The open things to clarify are:

  1. Which versions do we really need to support for async? We should update RmmTest to match; talking to the cudf team to clarify this.
  2. Our CI environment should be testing exactly what we want it to. In this case I think we want to make sure the driver version is always 11.5 so we test async. If we choose to test more versions, that should be explicitly decided and CI should be set up to be consistent.

Based on this, @pxLi, can we be more specific in our CI runs to make sure the nodes we run on have a minimum driver version of 11.5?
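
For illustration only, a minimal sketch of that fallback decision (not the actual spark-rapids plugin code; the names below are made up):

```cpp
// Sketch of the fallback described above: use the ASYNC allocator only when
// the driver reports support for CUDA 11.5 or newer, otherwise use ARENA.
#include <cuda_runtime_api.h>

enum class PoolMode { ASYNC, ARENA };

PoolMode select_pool_mode() {
  int driver_version = 0;  // encoded as major * 1000 + minor * 10, e.g. 11050 for 11.5
  if (cudaDriverGetVersion(&driver_version) != cudaSuccess) {
    return PoolMode::ARENA;  // be conservative if the query itself fails
  }
  return driver_version >= 11050 ? PoolMode::ASYNC : PoolMode::ARENA;
}
```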

@tgravescs

Leaving this targeted for 22.06 to see if we can get the CI environment clarified.

@tgravescs

So for number 1 above (the async support), Mark has responded on the PR with a more appropriate change, which I believe will handle that, so the test should work as is.


pxLi commented Jun 2, 2022

We don't need the cudf change in 22.06 for this because, with the spark-rapids plugin, we would fall back to using ARENA for any driver version < 11.5, and we statically link the runtime to be 11.5.

Based on this, @pxLi, can we be more specific in our CI runs to make sure the nodes we run on have a minimum driver version of 11.5?

There is actually no single driver version corresponding to CUDA 11.5 (I guess you mean the CUDA kernel-mode library?); the 450.XX, 460.XX, 470.XX, and 495.XX driver branches can all support CUDA 11.5 after certain minor XX versions. And the reason our statically linked CUDA 11.5 runtime can run well on drivers with a smaller major version is forward compatibility: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title
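
As a quick illustration of that split (a sketch only; the printed numbers are hypothetical for a 460.xx host): the statically linked runtime version is fixed at build time, while the driver reports the CUDA version it supports at run time.

```cpp
// Compare the compile-time runtime version with the driver-reported version.
#include <cuda_runtime_api.h>
#include <cstdio>

int main() {
  int driver_version = 0;
  cudaDriverGetVersion(&driver_version);
  // On a 460.xx data-center driver this might print something like
  // "runtime 11050, driver 11020": the 11.5 runtime still runs thanks to
  // the CUDA compatibility guarantees linked above, but features that need
  // a newer driver are unavailable.
  std::printf("runtime %d, driver %d\n", CUDART_VERSION, driver_version);
  return 0;
}
```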

So I would prefer to keep this, since the RAPIDS side does not cover as many scenarios as ours; otherwise, we may hide test failures like this one. People could fail to use our plugin (or the JNI in this case) in environments where they have a data-center GPU and drivers that support CUDA 11 forward compatibility. It's lucky that this time the plugin code has a selector for the specific RMM pool, but next time it could be our users reporting some other bug if we/RAPIDS do not handle forward compatibility well.

Our CI environment should be testing exactly what we want it to. In this case I think we want to make sure the driver version is always 11.5 so we test async. If we choose to test more versions, that should be explicitly decided and CI should be set up to be consistent.

All drivers on our CI/CD instances are announced as supporting CUDA 11 forward compatibility. So IMHO, it should be our code, not the CI environment, that covers these cases. I do agree with explicitly setting up some pipelines with specific drivers once we have more resources.


pxLi commented Jun 2, 2022

@tgravescs @sameerz
If we think the cudf/plugin JNI build & test should be a different case (no need to test forward compatibility) from the plugin build & test, please let me know. I could ask the Blossom team to help upgrade the driver versions on our CI machines to one specific version for the cudf & plugin JNI only.

BTW, since you decided not to put the fix into 22.06, I just updated the 22.06 cudf & plugin JNI build to use instances with newer drivers. For 22.08, @gerashegalov's fix should help.

@tgravescs

I do think it's good to test on various versions to catch things like this, but at the same time we should know what is being tested and make sure it's being tested regularly, i.e. cycling through different versions every other day rather than maybe hitting the same 460 kernel three days in a row. Like you said, "It's lucky this time the plugin code has a selector for the specific RMM pool"... we should not be relying on luck.

When it's run on the older driver versions, certain features (in this case async memory) are not supported, so the tests aren't verifying that they work. Async is our default configuration, so we definitely want it tested. Ideally, before a release we should test on all the versions. Is there any way to tell what version is being tested each day? This mostly comes down to resources and deciding what we want to test. @sameerz


pxLi commented Jun 6, 2022

Is there any way to tell what version is being tested each day?

We usually have the nvidia-smi output for some of the pipelines; if others do not have it, we can add it to them. And yes, the drivers are currently mixed in the nightly runs because there are not enough resources to schedule specific instances for the pipelines; we could make that happen when the new machines are ready (we requested 10+ machines in the latest infra resource plan).

For testing features with specific drivers, we may need a huge matrix. As we know, NVIDIA has many driver versions, and there is no clear table saying which driver major.minor version supports which specific CUDA features. Mostly it only tells us that if your driver is newer than a specific version and you have a data-center GPU, the environment should support forward compatibility. Or maybe we have a clear table somewhere? Thanks.

@tgravescs

Moving to 22.08 as we don't need anything for 22.06; the build just randomly fails depending on which host it gets.

rapidsai/rmm#1055 should fix this in 22.08.

Yes, we need to decide what testing matrix we want if we want to change anything here. @sameerz, do you want to discuss this offline or open a separate issue for it?

gerashegalov self-assigned this on Jun 20, 2022
@gerashegalov

rapidsai/rmm#1055 was propagated to spark-rapids-jni a while ago, and this issue is fixed as far as I see.
