openmp.gesv failures with gcc/10+armpl/21 #1332

Closed
e10harvey opened this issue Feb 17, 2022 · 9 comments

@e10harvey
Contributor

@vqd8a, when standing up the A64FX CI testing I encountered these test failures. Can you investigate?

Snippet of ctest output

1: [ RUN      ] openmp.gesv_mrhs_double
1: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/blas/Test_Blas_gesv.hpp:229: Failure
1: Value of: true
1: Expected: test_flag
1: Which is: false
1: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/blas/Test_Blas_gesv.hpp:229: Failure
1: Value of: true
1: Expected: test_flag
1: Which is: false
1: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/blas/Test_Blas_gesv.hpp:229: Failure
1: Value of: true
1: Expected: test_flag
1: Which is: false
1: [  FAILED  ] openmp.gesv_mrhs_double (50 ms)
1: [ RUN      ] openmp.gesv_complex_double
1: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/blas/Test_Blas_gesv.hpp:121: Failure
1: Value of: true
1: Expected: test_flag
1: Which is: false
1: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/blas/Test_Blas_gesv.hpp:121: Failure
1: Value of: true
1: Expected: test_flag
1: Which is: false
1: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/blas/Test_Blas_gesv.hpp:121: Failure
1: Value of: true
1: Expected: test_flag
1: Which is: false
1: [  FAILED  ] openmp.gesv_complex_double (74 ms)
1: [ RUN      ] openmp.gesv_mrhs_complex_double
1: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/blas/Test_Blas_gesv.hpp:229: Failure
1: Value of: true
1: Expected: test_flag
1: Which is: false
1: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/blas/Test_Blas_gesv.hpp:229: Failure
1: Value of: true
1: Expected: test_flag
1: Which is: false
1: /path/to/workspace/KokkosKernels_PullRequest_Tpls_ARMPL2110_Tpls_ARMPL2030_GCC1020/kokkos-kernels/unit_test/blas/Test_Blas_gesv.hpp:229: Failure
1: Value of: true
1: Expected: test_flag
1: Which is: false
1: [  FAILED  ] openmp.gesv_mrhs_complex_double (91 ms)

Reproducer instructions

cd kokkos/
git checkout -f 0d19eebfa26d076f551d5b7a43230f627887df21
cd ../kokkos-kernels/
git checkout -f f5d7490dee7751a5a3cff8242e7de9f6ad6fe5b2
cd ../
mkdir testing
cd testing/
../kokkos-kernels/scripts/cm_test_all_sandia --spot-check-tpls armpl/21.1.0 --with-tpls=armpl --kokkos-path=../kokkos --kokkoskernels-path=../kokkos-kernels

e10harvey added the bug label Feb 17, 2022
@vqd8a
Contributor

vqd8a commented Feb 17, 2022

@e10harvey I will look at this issue.

vqd8a self-assigned this Feb 17, 2022
@vqd8a
Contributor

vqd8a commented Mar 1, 2022

@e10harvey I have looked into this. I think these failures are caused only by ARMPL: when I tested with the OpenBLAS TPL, these openmp gesv tests passed. (Note that gesv currently works only with either the MAGMA TPL or a BLAS TPL; we don't have a fall-back implementation yet, sorry for my laziness.)

An interesting thing is that these gesv tests fail with even values of OMP_NUM_THREADS. If we set OMP_NUM_THREADS to an odd number (e.g., 47, 21, ...), these tests pass.
Because it might be an ARMPL issue, can we set OMP_NUM_THREADS in the A64FX CI testing to an odd number for now?
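
One way to check the even/odd pattern described above is to run the same gesv tests under several thread counts and compare the results. The helper below is only a sketch: `sweep_threads` is a hypothetical name, and the test binary in the usage comment is a placeholder that has to be replaced with the unit-test executable from your own build.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a test command once per thread count and
# print PASS or FAIL for each value of OMP_NUM_THREADS.
# Usage: sweep_threads "<command>" <n1> <n2> ...
sweep_threads() {
  local cmd="$1"; shift
  local n
  for n in "$@"; do
    # OMP_NUM_THREADS is set only for this single invocation.
    if OMP_NUM_THREADS="$n" bash -c "$cmd" >/dev/null 2>&1; then
      echo "$n PASS"
    else
      echo "$n FAIL"
    fi
  done
}

# Example (placeholder binary path, adjust to your build tree):
# sweep_threads "./your_blas_openmp_test --gtest_filter='openmp.gesv*'" 46 47 48
```

If the observation holds on your machine, the even counts would be reported as FAIL and the odd ones as PASS.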

@ndellingwood
Contributor

The hip.gesv_double and hip.gesv_mrhs_double tests also fail with HIP when compiled with rocm/4.5.0 for the MI100 arch, at least when building through Trilinos. I'll open a separate issue, but wanted to comment here in case the HIP case and this issue share a common underlying cause.

@e10harvey
Contributor Author

@vqd8a: Are there any updates on this? Once this is resolved, we can enable ARMPL CI checks and improve our code coverage.

@vqd8a
Contributor

vqd8a commented Mar 31, 2022

@e10harvey Hi Evan, I do not have any recent updates. I will find time to look at these tests on HIP.
But for ARMPL, I think this is an ARMPL issue. As I mentioned above, can you set an odd OMP_NUM_THREADS in the A64FX CI testing to avoid these failures?

@e10harvey
Contributor Author

Thanks, @vqd8a. Yes, I will set OMP_NUM_THREADS. Do you already have a bug report filed for ARMPL?
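
For the CI workaround, something along these lines could derive an odd thread count instead of hard-coding one. This is a minimal sketch, not the actual CI configuration: `odd_thread_count` is a hypothetical helper, and 48 is just an illustrative core count.

```shell
#!/usr/bin/env bash
# Hypothetical helper: largest odd thread count that does not exceed
# the given core count (never less than 1).
odd_thread_count() {
  local n="$1"
  if [ "$n" -lt 1 ]; then n=1; fi
  # Step down by one when the count is even.
  if [ $((n % 2)) -eq 0 ]; then n=$((n - 1)); fi
  echo "$n"
}

# Illustrative value only; a CI script might pass "$(nproc)" instead.
export OMP_NUM_THREADS="$(odd_thread_count 48)"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

The export would need to happen before the test script is launched so that the test processes inherit the value.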

@vqd8a
Contributor

vqd8a commented Apr 1, 2022

@e10harvey I did not. I can do that.

@e10harvey
Contributor Author

Thank you, @vqd8a. Can this be closed?

@vqd8a
Contributor

vqd8a commented May 5, 2022

Yes. I am closing it.
