Zoltan2: New test failures, possibly due to recent Kokkos upgrade #9909

Closed
kddevin opened this issue Nov 8, 2021 · 3 comments
Labels
pkg: Kokkos pkg: Zoltan2 type: bug The primary issue is a bug in Trilinos code or tests

Comments

@kddevin
Contributor

kddevin commented Nov 8, 2021

Bug Report

@trilinos/zoltan2 @trilinos/kokkos @ndellingwood

Description

Zoltan2 tests on the ascicgpu machines with UVM=ON used to pass. Around the time of the Kokkos upgrade (#9836) they began to fail.
https://testing.sandia.gov/cdash/index.php?project=Trilinos&begin=2021-10-25&end=2021-11-07&filtercount=1&showfilters=1&field1=site&compare1=63&value1=ascicgpu031

Questions:

  • what is causing the failures?
  • why did PR testing pass?
@kddevin kddevin added type: bug The primary issue is a bug in Trilinos code or tests pkg: Kokkos pkg: Zoltan2 labels Nov 8, 2021
@kddevin kddevin self-assigned this Nov 8, 2021
@ndellingwood
Contributor

@kddevin thanks for adding me to the issue. I reproduced the zoltan2 failures on a kokkos-dev testing machine (SKX + Volta70) with sems modules. We run a nightly Cuda build on Power9+Volta70 (Trilinos+Kokkos develop branches) where the reported tests pass consistently. Looking for other differences between the failing build posted via the cdash link and our nightly build, I found that the testing configurations also differ: the posted cdash build sets -DTpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF, whereas our nightly Cuda testing sets -DTpetra_ENABLE_DEPRECATED_CODE:BOOL=ON. I'm assuming the Trilinos PR autotester uses the ON setting as well (I didn't see Tpetra_ENABLE_DEPRECATED_CODE in the configure output from PR cuda tests).

While testing on SKX + Volta70, I reproduced the failures with Tpetra deprecated code both on and off, though the failures were more intermittent with deprecated code on. Architecture seems to be the key trigger for the failures (and may provide a hint for debugging), and is likely why this made it through PR testing (Trilinos tests Cuda builds on Power8+Pascal60).

A couple related questions regarding testing:

  • Is -DTpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF the actively developed code path in Tpetra (checking to see if this option should be most heavily tested in our builds)?
  • Is it possible for the -DTpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF option to be set for one of the Trilinos PR tests for added stability of those code paths? Or will this be a later "wholesale" change for all PR configurations at some point down the road prior to a Trilinos version update?

The SKX+Volta70 machine used for our nightly testing is heavily loaded but I'll expand the downstream packages tested to include zoltan2.

kddevin added a commit that referenced this issue Nov 11, 2021
@kddevin
Contributor Author

kddevin commented Nov 11, 2021

Thanks @ndellingwood

The problem was a change in the semantics of Kokkos::sort (kokkos/kokkos#4526), which appears to be fixed by kokkos/kokkos#4490. I'll need to wait until the fix is in Trilinos' version of Kokkos to verify. Meanwhile, there is a Zoltan2 workaround in #9924. I will not merge the workaround if the Kokkos fix can be patched into Trilinos soon.

@kddevin
Contributor Author

kddevin commented Dec 1, 2021

Fix works; Zoltan2 tests are now passing. Thanks, @ndellingwood

@kddevin kddevin closed this as completed Dec 1, 2021