[BUG] 22.06 testCudaAsyncMemoryResourceSize failed w/ latest cudf commit #287
Describe the bug
Nightly CI failed this unit test with commit 9e593b3.

Comments
Could be related to https://github.com/rapidsai/rmm/commits/branch-22.06. There have been frequent RMM changes; I will try rebuilding later.
I was able to build 9e593b3 successfully.
Maybe the build machine has an older CUDA driver and it reproduces only there?
Closing this ticket since the RMM revert is in and the CI build passes now.
Hmm, it failed again in the nightly test, after the RMM change that introduced dynamic_load_runtime.hpp. It's weird that we only fail this RMM test in nightly CI: https://github.com/NVIDIA/spark-rapids-jni/blob/branch-22.06/ci/nightly-build.sh#L25-L29. Not sure if we missed something in how the verify and package goals are invoked there.
No, there's nothing different between the verify and package goals with respect to the test. Both goals first go through the same test goal on their way to their ultimate destination, and the test goal is where it's failing. Are we positive the two containers are running on the same host OS? Docker doesn't isolate the kernel and drivers, so the driver on one host could be a different version than the driver on another host even though both run the same Docker image.
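For reference, one quick way to see such a mismatch from inside a container is to compare the runtime and driver versions reported by the CUDA runtime API. This is only a minimal sketch for illustration, not code from the repo:

```cpp
#include <cuda_runtime_api.h>
#include <cstdio>

int main() {
  int runtime_ver = 0;
  int driver_ver  = 0;
  // Version of the CUDA runtime this binary is linked against (e.g. 11050 for 11.5).
  cudaRuntimeGetVersion(&runtime_ver);
  // Latest CUDA version supported by the kernel-mode driver installed on the host.
  cudaDriverGetVersion(&driver_ver);
  std::printf("runtime=%d driver=%d\n", runtime_ver, driver_ver);
  // If driver < runtime, newer features (such as cudaMallocAsync) may be unavailable
  // unless the host has the forward-compatibility packages installed.
  return 0;
}
```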
Confirmed it's due to the host driver versions. It looks like the recent RMM changes rely on some new API from the kernel-mode driver library: rapidsai/rmm@914cb4c#diff-3e3264777cc642f393b35c720e0c55626d678c5e8fc11f0b6e9f25053d7c6ca3R138. I am not a CUDA expert; can you help take a look at the above change? Thanks! @jlowe @gerashegalov
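As background (my own sketch, not the RMM code linked above): the async memory resource ultimately needs driver support for stream-ordered allocation, which can be probed at runtime along these lines:

```cpp
#include <cuda_runtime_api.h>
#include <cstdio>

// Rough capability check: the cudaMallocAsync / cudaMemPool APIs require the
// driver to support memory pools, which older host drivers may not.
bool async_alloc_supported(int device) {
  int pools_supported = 0;
  if (cudaDeviceGetAttribute(&pools_supported,
                             cudaDevAttrMemoryPoolsSupported,
                             device) != cudaSuccess) {
    return false;
  }
  return pools_supported != 0;
}

int main() {
  std::printf("async allocation supported: %s\n",
              async_alloc_supported(0) ? "yes" : "no");
  return 0;
}
```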
Hi @pxLi, I filed an issue with RMM: rapidsai/rmm#1054
We don't need the cuDF change in 22.06 for this, because with the spark-rapids plugin we would fall back to using ARENA for any driver version < 11.5, and we statically link the runtime at 11.5 (a sketch of that kind of fallback is below). The open things to clarify are (1) whether the async support detection works as intended, and (2) what driver versions our CI nodes actually run.
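A minimal sketch of the driver-gated fallback described above; the names and the version threshold here are illustrative, not the plugin's actual code:

```cpp
#include <cuda_runtime_api.h>

// Hypothetical pool selector: prefer the stream-ordered (ASYNC) allocator that
// the statically linked 11.5 runtime can use, but fall back to ARENA when the
// host driver reports an older supported CUDA version.
enum class PoolKind { ASYNC, ARENA };

PoolKind select_pool() {
  int driver_ver = 0;
  if (cudaDriverGetVersion(&driver_ver) != cudaSuccess) {
    return PoolKind::ARENA;  // be conservative if the query fails
  }
  // 11050 corresponds to CUDA 11.5.
  return driver_ver >= 11050 ? PoolKind::ASYNC : PoolKind::ARENA;
}
```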
Based on this, @pxLi, can we be more specific in our CI runs to make sure the nodes we run on have a minimum driver version of 11.5?
Leaving this targeted for 22.06 to see if we can get the CI environment clarified.
So for number 1 above (the async support), Mark has responded on the PR with a more appropriate change, which I believe will handle that, so the test should work as is.
There is actually no driver version with CUDA 11.5 (I guess you mean the CUDA kernel-mode driver?); all of the 450.XX, 460.XX, 470.XX, and 495.XX branches can support CUDA 11.5 after certain minor XX versions. The reason our statically linked CUDA 11.5 runtime runs fine on drivers with a smaller major version is forward compatibility: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title. So I would prefer to keep this setup, since the RAPIDS side does not cover as many scenarios as we do; otherwise we may hide test failures like this one. People could fail to use our plugin (or the JNI in this case) in environments where they have a data center GPU and drivers that support CUDA 11 forward compatibility. It's lucky that this time the plugin code has a selector for a specific RMM pool, but next time it could be our users reporting some other bug if we or RAPIDS do not handle forward compatibility well.
All drivers on our CI/CD instances are documented to support CUDA 11 forward compatibility. So IMHO it should be our code, not the CI environment, that handles these cases. I do agree with explicitly setting up some pipelines with specific drivers once we have more resources.
@tgravescs @sameerz BTW, since you decided not to put the fix into 22.06, I just updated the 22.06 cudf & plugin JNI builds to use instances with newer drivers. For 22.08, @gerashegalov's fix should help.
I do think it's good to test on various versions to catch things like this, but at the same time we should know what is being tested and make sure it's being tested regularly, i.e., cycling through different versions every other day rather than possibly hitting the same 460 kernel three days in a row. Like you said, "it's lucky this time the plugin code has a selector for a specific RMM pool"; we should not be relying on luck. When the tests run on older driver versions, certain features (in this case async memory) are not supported, so the tests aren't verifying that they work. Async is our default configuration, so we definitely want it tested. Ideally, before a release we should test on all the versions. Is there any way to tell what version is being tested each day? This mostly comes down to resources and deciding what we want to test. @sameerz
We usually have the nvidia-smi output for some of the pipelines; if others do not have it, we can add it to them. And yes, the drivers are currently mixed in the nightly runs because there are not enough resources to schedule specific instances for the pipelines; we can make that happen when new machines are ready (we requested 10+ machines in the latest infra resource plan). For testing features with specific drivers, we may need a huge matrix. As we know, NVIDIA has many driver versions, and there is no clear table saying which driver major.minor versions support which specific CUDA features. Mostly the documentation only says that if your driver is newer than a specific version and you have a data center GPU, the environment should support forward compatibility. Or maybe we have a clear table somewhere? Thanks.
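For the question about knowing which driver each run hits, one option (just a sketch, assuming NVML is available on the node; this is not something the CI does today) is a tiny helper that logs the driver version at the start of every pipeline:

```cpp
#include <nvml.h>
#include <cstdio>

// Log the host driver version so nightly runs record which driver they hit.
// Requires linking against NVML (libnvidia-ml), which ships with the driver.
int main() {
  if (nvmlInit_v2() != NVML_SUCCESS) {
    std::fprintf(stderr, "NVML init failed\n");
    return 1;
  }
  char version[NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE];
  if (nvmlSystemGetDriverVersion(version, sizeof(version)) == NVML_SUCCESS) {
    std::printf("host driver version: %s\n", version);  // e.g. "470.57.02"
  }
  nvmlShutdown();
  return 0;
}
```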
Moving to 22.08, as we don't need anything for 22.06; the build just fails randomly depending on which host it gets. rapidsai/rmm#1055 should fix this in 22.08. Yes, we need to decide what testing matrix we want if we want to change anything here. @sameerz do you want to discuss this offline or have a separate issue for it?
rapidsai/rmm#1055 was propagated to spark-rapids-jni a while ago, and this issue is fixed as far as I can see.