Zoltan2: New test failures, possibly due to recent Kokkos upgrade #9909
Comments
@kddevin thanks for adding me to the issue. I reproduced the zoltan2 failures on a kokkos-dev testing machine (SKX + Volta70) with sems modules. We run a nightly Cuda build on Power9+Volta70 (Trilinos+Kokkos develop branches) where the reported tests pass consistently. To tease out other differences between the failing build posted via the cdash link and our nightly build: the testing configurations also differ. While testing on SKX + Volta70, I reproduced the failures with Tpetra deprecated code both on and off, though the failures were more intermittent with deprecated code on. Architecture seems to be the key trigger for the failures (and may provide a hint for debugging), and is likely why this made it through PR testing (Trilinos tests Cuda builds on Power8+Pascal60). A couple of related questions regarding testing:
The SKX+Volta70 machine used for our nightly testing is heavily loaded, but I'll expand the downstream packages tested to include zoltan2.
Thanks, @ndellingwood. The problem was a change in the semantics of Kokkos::sort (kokkos/kokkos#4526), which appears to be fixed by kokkos/kokkos#4490. I'll need to wait until the fix is in Trilinos' version of Kokkos to verify. Meanwhile, there is a Zoltan2 workaround in #9924. I will not merge the workaround if the Kokkos fix can be patched into Trilinos soon.
Fix works; Zoltan2 tests are now passing. Thanks, @ndellingwood
Bug Report
@trilinos/zoltan2 @trilinos/kokkos @ndellingwood
Description
Zoltan2 tests on the ascicgpu machines with UVM=ON used to pass. Around the time of the Kokkos upgrade (#9836) they began to fail.
https://testing.sandia.gov/cdash/index.php?project=Trilinos&begin=2021-10-25&end=2021-11-07&filtercount=1&showfilters=1&field1=site&compare1=63&value1=ascicgpu031
Questions: