Framework: PR failures in unrelated packages 2022-07 #10782

Closed
jhux2 opened this issue Jul 20, 2022 · 22 comments
Labels
autotester - Issues related to the autotester.
MARKED_FOR_CLOSURE - Issue or PR is marked for auto-closure by the GitHub Actions bot.
PA: Framework - Issues that fall under the Trilinos Framework Product Area
type: bug - The primary issue is a bug in Trilinos code or tests

Comments

@jhux2
Member

jhux2 commented Jul 20, 2022

Bug Report

@trilinos/framework @jwillenbring

Description

A number of PRs are failing due to failures in packages that are apparently unrelated to changes in the PR.
See #10776 and #10777 and #10751 for examples.

@jhux2 jhux2 added the type: bug and PA: Framework labels Jul 20, 2022
@jhux2
Member Author

jhux2 commented Jul 20, 2022

Reported in TRILINOSHD-123.

@jhux2
Member Author

jhux2 commented Jul 20, 2022

Here's the error, which is completely unrelated to any changes in the PRs.

In file included from packages/muelu/src/Utils/ExplicitInstantiation/MueLu_Utilities_kokkos.cpp:55:0:
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-debug@2/Trilinos/packages/muelu/src/Utils/MueLu_Utilities_kokkos_def.hpp: In instantiation of ‘MueLu::DetectDirichletRows(const Xpetra::Matrix<Scalar, LocalOrdinal, GlobalOrdinal, Node>&, const typename Teuchos::ScalarTraits<T>::magnitudeType&, bool)::<lambda(LO)> [with SC = double; LO = int; GO = long long int; NO = Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial>]’:
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-debug@2/Trilinos/packages/muelu/src/Utils/MueLu_Utilities_kokkos_def.hpp:367:78:   required from ‘struct MueLu::DetectDirichletRows(const Xpetra::Matrix<Scalar, LocalOrdinal, GlobalOrdinal, Node>&, const typename Teuchos::ScalarTraits<T>::magnitudeType&, bool) [with SC = double; LO = int; GO = long long int; NO = Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial>; typename Node::device_type = Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>; typename Teuchos::ScalarTraits<T>::magnitudeType = double]::<lambda(int)>’
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-debug@2/Trilinos/packages/muelu/src/Utils/MueLu_Utilities_kokkos_def.hpp:357:27:   required from ‘Kokkos::View<bool*, typename Node::device_type> MueLu::DetectDirichletRows(const Xpetra::Matrix<Scalar, LocalOrdinal, GlobalOrdinal, Node>&, const typename Teuchos::ScalarTraits<T>::magnitudeType&, bool) [with SC = double; LO = int; GO = long long int; NO = Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial>; typename Node::device_type = Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>; typename Teuchos::ScalarTraits<T>::magnitudeType = double]’
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-debug@2/Trilinos/packages/muelu/src/Utils/MueLu_Utilities_kokkos_def.hpp:398:78:   required from ‘static Kokkos::View<bool*, typename Node::device_type> MueLu::Utilities_kokkos<Scalar, LocalOrdinal, GlobalOrdinal, Node>::DetectDirichletRows(const Xpetra::Matrix<Scalar, LocalOrdinal, GlobalOrdinal, Node>&, const typename Teuchos::ScalarTraits<T>::magnitudeType&, bool) [with Scalar = double; LocalOrdinal = int; GlobalOrdinal = long long int; Node = Kokkos::Compat::KokkosDeviceWrapperNode<Kokkos::Serial>; typename Node::device_type = Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>; typename Teuchos::ScalarTraits<T>::magnitudeType = double]’
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-debug@2/Trilinos/packages/muelu/src/Utils/MueLu_ETI_4arg.hpp:42:3:   required from here
/scratch/trilinos/jenkins/ascic143/workspace/Trilinos_PR_gcc-7.2.0-debug@2/Trilinos/packages/muelu/src/Utils/MueLu_Utilities_kokkos_def.hpp:364:49: error: uninitialized variable ‘colID’ in ‘constexpr’ function
                                decltype(length) colID;

@jhux2 jhux2 added the autotester label Jul 20, 2022
@brian-kelley
Contributor

@jhux2 I pushed a quick fix for this into #10777 (although I didn't try to replicate it - I don't have access to ascic143)

@jhux2
Member Author

jhux2 commented Jul 20, 2022

@brian-kelley Thanks. I don't understand how that wasn't flagged by PR testing. New machine/configuration, maybe?

@brian-kelley
Contributor

Yeah, must be something like that.

@rppawlo
Contributor

rppawlo commented Jul 20, 2022

> I don't understand how that wasn't flagged by PR testing. New machine/configuration, maybe?

Maybe it has to do with switching from c++14 to c++17 in the test scripts?

@csiefer2
Member

csiefer2 commented Jul 20, 2022

> Maybe it has to do with switching from c++14 to c++17 in the test scripts?

@rppawlo Wait. They did that?

@jwillenbring
Member

> Maybe it has to do with switching from c++14 to c++17 in the test scripts?

It appears this might be possible. I will speak with @srbdev about this.

@jwillenbring
Member

@csiefer2

> Wait. They did that?

We moved one build to C++17 about a year ago, and we moved a couple more this week. We are trying to move them over gradually to support the new version of Kokkos anticipated next month.

@jwillenbring
Member

After reviewing the failures in #10777 more carefully, we are pursuing reverting the C++17 changes from the last couple days temporarily until the resulting failures can be resolved.

@bartlettroscoe
Member

This is impacting my PR #10784 as well.

@bartlettroscoe
Member

> @brian-kelley Thanks. I don't understand how that wasn't flagged by PR testing. New machine/configuration, maybe?

@jhux2, the issue is that changes to the PR build configurations don't actually pass through PR testing. I would recommend making changes to the PR build configurations pass through PR testing. Knowing how the GenConfig repos are handled, I think I know how that could be done if people are interested.

@srbdev
Contributor

srbdev commented Jul 21, 2022

The switch to C++17 appeared to pass PR testing, but we suspect that the changes didn't trigger a full build and so passed without going through the full test suite. Since they "passed" and were set with the AT:AUTOMERGE label, the PRs were merged into develop and started causing issues. It looks like the PRs with the reverts were just merged into develop, so that should resolve the master merge for tonight. We'll check back in the morning.
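
One way to guard against a configuration change slipping through without a full build might look like the sketch below. This is only an illustration: the path prefixes and the function name `needs_full_build` are hypothetical and are not taken from the Trilinos autotester or the GenConfig repos.

```python
from typing import Iterable

# Hypothetical locations of PR build configuration files; the real
# Trilinos/GenConfig layout may differ.
CONFIG_PREFIXES = ("cmake/std/", "packages/framework/")


def needs_full_build(changed_files: Iterable[str]) -> bool:
    """Return True if any changed file lives under a configuration prefix."""
    # str.startswith accepts a tuple, so one call checks every prefix.
    return any(path.startswith(CONFIG_PREFIXES) for path in changed_files)
```

With a check like this, a PR that only touches configuration files would be forced through every package's build and tests rather than an empty package set.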

@bartlettroscoe
Member

You can add to this the PR builds that fail the compiler check because the machine ran out of disk space, as shown in the PR test iteration #10784 (comment) and on CDash here, with errors like:

-- Check for working C compiler: /projects/sems/install/rhel7-x86_64/sems/compiler/intel/17.0.1/mpich/3.2/bin/mpicc
-- Check for working C compiler: /projects/sems/install/rhel7-x86_64/sems/compiler/intel/17.0.1/mpich/3.2/bin/mpicc - broken
CMake Error at /projects/sems/install/rhel7-x86_64/sems/utility/cmake/3.19.1/share/cmake-3.19/Modules/CMakeTestCCompiler.cmake:66 (message):
  The C compiler

    "/projects/sems/install/rhel7-x86_64/sems/compiler/intel/17.0.1/mpich/3.2/bin/mpicc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: /scratch/trilinos/jenkins/ascic142/workspace/Trilinos_PR_intel-17.0.1/pull_request_test/CMakeFiles/CMakeTmp
    
    Run Build Command(s):/projects/sems/install/rhel7-x86_64/sems/utility/ninja_fortran/1.10.0/bin/ninja cmTC_3598b && [1/2] Building C object CMakeFiles/cmTC_3598b.dir/testCCompiler.c.o
    FAILED: CMakeFiles/cmTC_3598b.dir/testCCompiler.c.o 
    /projects/sems/install/rhel7-x86_64/sems/compiler/intel/17.0.1/mpich/3.2/bin/mpicc   -fPIE -MD -MT CMakeFiles/cmTC_3598b.dir/testCCompiler.c.o -MF CMakeFiles/cmTC_3598b.dir/testCCompiler.c.o.d -o CMakeFiles/cmTC_3598b.dir/testCCompiler.c.o -c testCCompiler.c
    icc: warning #10352: The directory '/tmp' is full.  Please check disk space.
    icc: error #10001: could not find directory in which g++ resides
    ninja: build stopped: subcommand failed.

and:

-- Detecting Fortran/C Interface
Failed to compile
-- Verifying Fortran/CXX Compiler Compatibility
Failed to compile
CMake Warning (dev) at /projects/sems/install/rhel7-x86_64/sems/utility/cmake/3.17.1/share/cmake-3.17/Modules/FortranCInterface.cmake:309 (message):
  No FortranCInterface mangling known for VerifyFortran
Call Stack (most recent call first):
  /projects/sems/install/rhel7-x86_64/sems/utility/cmake/3.17.1/share/cmake-3.17/Modules/FortranCInterface/Verify/CMakeLists.txt:16 (FortranCInterface_HEADER)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Verifying Fortran/CXX Compiler Compatibility - Failed
CMake Error at /projects/sems/install/rhel7-x86_64/sems/utility/cmake/3.17.1/share/cmake-3.17/Modules/FortranCInterface.cmake:383 (message):
  The Fortran compiler:

    /projects/sems/install/rhel7-x86_64/sems/compiler/clang/10.0.0/openmpi/1.10.1/bin/mpif90

  and the CXX compiler:

    /projects/sems/install/rhel7-x86_64/sems/compiler/clang/10.0.0/openmpi/1.10.1/bin/mpicxx

  failed to compile a simple test project using both languages.  The output
  was:

    Change Dir: /scratch/trilinos/jenkins/ascic142/workspace/Trilinos_PR_clang-10.0.0/pull_request_test/CMakeFiles/FortranCInterface/VerifyCXX
    
    Run Build Command(s):/projects/sems/install/rhel7-x86_64/sems/utility/ninja_fortran/1.10.0/bin/ninja VerifyFortranC && [1/8] Building Fortran preprocessed CMakeFiles/VerifyFortran.dir/VerifyFortran.f-pp.f
    [2/8] Building C object CMakeFiles/VerifyFortranC.dir/main.c.o
    [3/8] Building CXX object CMakeFiles/VerifyFortranC.dir/VerifyCXX.cxx.o
    [4/8] Generating Fortran dyndep file CMakeFiles/VerifyFortran.dir/Fortran.dd
    [5/8] Building C object CMakeFiles/VerifyFortranC.dir/VerifyC.c.o
    [6/8] Building Fortran object CMakeFiles/VerifyFortran.dir/VerifyFortran.f.o
    FAILED: CMakeFiles/VerifyFortran.dir/VerifyFortran.f.o 
    /projects/sems/install/rhel7-x86_64/sems/compiler/clang/10.0.0/openmpi/1.10.1/bin/mpif90  -I/projects/sems/install/rhel7-x86_64/sems/utility/cmake/3.17.1/share/cmake-3.17/Modules/FortranCInterface/Verify  -I. -g -O3   -fpreprocessed -c CMakeFiles/VerifyFortran.dir/VerifyFortran.f-pp.f -o CMakeFiles/VerifyFortran.dir/VerifyFortran.f.o
    /projects/sems/install/rhel7-x86_64/sems/utility/cmake/3.17.1/share/cmake-3.17/Modules/FortranCInterface/Verify/VerifyFortran.f:3:0:
    
           end
     ^
    Fatal Error: error writing to /tmp/ccVjYvRi.s: No space left on device
    compilation terminated.
    ninja: build stopped: subcommand failed.

It looks like these failures are all coming from one machine:

-- Trilinos_HOSTNAME='ascic142'

Is there no automated process checking disk space on these machines and sending notification emails when they get low?
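
As a rough sketch of how such failures could be sorted by machine automatically, the snippet below scans configure/build logs for the two disk-full messages seen above and groups them by the `Trilinos_HOSTNAME` line. The function name `disk_full_hosts` and the idea of passing logs in as strings are assumptions made for illustration only.

```python
import re
from collections import Counter
from typing import Iterable

# Signatures copied from the log excerpts above.
DISK_FULL_PATTERNS = (
    re.compile(r"No space left on device"),
    re.compile(r"The directory '.*' is full"),
)
HOSTNAME_PATTERN = re.compile(r"Trilinos_HOSTNAME='([^']+)'")


def disk_full_hosts(logs: Iterable[str]) -> Counter:
    """Count, per host, the logs that contain a disk-full signature."""
    counts: Counter = Counter()
    for log in logs:
        if not any(p.search(log) for p in DISK_FULL_PATTERNS):
            continue  # not a disk-space failure
        match = HOSTNAME_PATTERN.search(log)
        counts[match.group(1) if match else "unknown"] += 1
    return counts
```

A host that dominates the resulting counts is a good candidate for cleanup before more PRs are retried on it.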

@bartlettroscoe bartlettroscoe changed the title from "Framework: PR failures in unrelated packages" to "Framework: PR failures in unrelated packages 2022-07" Jul 21, 2022
@ndellingwood
Contributor

I ran into the out-of-disk-space issue with the intel/17 build in PR #10783; it was on the same machine Ross mentioned above:
-- Trilinos_HOSTNAME='ascic142'

#10783 (comment)

@srbdev
Contributor

srbdev commented Jul 21, 2022

I cleared up tmp on the node.

@bartlettroscoe I'm looking into the tools to help with resource monitoring and alerting, and eventually add some automation to keep the testing nodes healthy.

@jhux2
Member Author

jhux2 commented Jul 21, 2022

> I cleared up tmp on the node.

@srbdev Thank you!

@bartlettroscoe
Member

bartlettroscoe commented Jul 21, 2022

> @bartlettroscoe I'm looking into the tools to help with resource monitoring and alerting, and eventually add some automation to keep the testing nodes healthy.

@srbdev, here is a dirt-simple tool that does the job: monitor-disk-usage.sh. Just set up a Jenkins job or a cron job on each node to run it once a day and send out email to the Trilinos framework team email list. I have this running on testing.sandia.gov and testing-dev.sandia.gov so that we don't get caught flat-footed when the disk fills up too much. Don't let the perfect be the enemy of the better.
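
For readers without access to that script, the core of such a check can be sketched in a few lines. This is not the contents of monitor-disk-usage.sh; it is a minimal illustration using Python's standard `shutil.disk_usage`, and the 90% threshold, the mount points, and the function names are all assumptions.

```python
import shutil


def usage_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    return usage.used / usage.total if usage.total else 0.0


def over_threshold(path: str, threshold: float = 0.9) -> bool:
    """True if usage on `path` meets or exceeds `threshold` (assumed 90%)."""
    return usage_fraction(path) >= threshold


if __name__ == "__main__":
    for mount in ("/tmp", "/scratch"):  # hypothetical mount points to watch
        try:
            if over_threshold(mount):
                print(f"WARNING: {mount} is {usage_fraction(mount):.0%} full")
        except FileNotFoundError:
            print(f"skipping {mount}: not present on this node")
```

Run daily from cron or a Jenkins job, the printed warning would be replaced by whatever notification mechanism the team prefers, such as mail to the framework list.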

@bartlettroscoe
Member

And now it appears we have another set of failures in 'develop', not related to any particular PR, that is impacting PRs #10775 (@jhux2), #10777 (@brian-kelley), #10783 (@masterleinad), #10784 (@bartlettroscoe), and #10785 (@srbdev), as shown in this query, which shows the failures:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaMemcpyAsync(dst, src, n, cudaMemcpyDefault, instance.cuda_stream()) error( cudaErrorIllegalInstruction): an illegal instruction was encountered /gpfs/trilinos/workspace/Trilinos_PR_cuda-11.4.2-uvm-off@2/Trilinos/packages/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:99

This can't be related to the PRs because there is zero chance my PR #10784, for example, could trigger a failure like this.

Trilinos really needs a set of post-merge CI builds to catch errors like this. And we need to figure out how errors like this are getting into the 'develop' branch and adjust the processes so this happens less often.

@jwillenbring
Member

I force merged #10776 as explained in detail in that issue.

@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE label Jul 23, 2023
@jhux2
Member Author

jhux2 commented Jul 24, 2023

Closing, this was fixed long ago.

@jhux2 jhux2 closed this as completed Jul 24, 2023
8 participants