Framework: Mass gnu and clang PR test failures on new 'ascic0xyz' machines #10999

Closed
csiefer2 opened this issue Sep 8, 2022 · 11 comments
Labels
CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. PA: Framework Issues that fall under the Trilinos Framework Product Area type: bug The primary issue is a bug in Trilinos code or tests

Comments

@csiefer2
Member

csiefer2 commented Sep 8, 2022

Link:
https://trilinos-cdash.sandia.gov/test/11636999

Sample error:

/scratch/trilinos/workspace/Trilinos_PR_gcc-7.2.0-debug@2/pull_request_test/packages/amesos/example/Amesos_a_trivial_mpi_test.exe: symbol lookup error: /projects/sems/install/rhel7-x86_64/sems/compiler/gcc/7.2.0/openmpi/1.10.1/lib/libmca_common_verbs.so.7: undefined symbol: ompi_common_verbs_usnic_register_fake_driver

Reported in TRILINOSHD-203.
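The loader message above has a regular shape (executable, then `symbol lookup error`, then the library and the missing symbol), so failures of this kind can be picked out of test output mechanically. A minimal sketch, assuming that shape holds; the name `parse_symbol_lookup_error` and the regular expression are illustrative and not part of any Trilinos tooling:

```python
import re
from typing import NamedTuple, Optional


class SymbolLookupError(NamedTuple):
    executable: str
    library: str
    symbol: str


# Matches loader messages of the assumed form
#   <executable>: symbol lookup error: <library>: undefined symbol: <symbol>
# An optional leading "N: " prefix (as in one sample in this thread) is tolerated.
_PATTERN = re.compile(
    r"^(?:\d+:\s+)?(?P<executable>\S+): symbol lookup error: "
    r"(?P<library>\S+): undefined symbol: (?P<symbol>\S+)\s*$"
)


def parse_symbol_lookup_error(line: str) -> Optional[SymbolLookupError]:
    """Return the parsed fields, or None if the line is not such a message."""
    m = _PATTERN.match(line.strip())
    if m is None:
        return None
    return SymbolLookupError(m["executable"], m["library"], m["symbol"])


if __name__ == "__main__":
    sample = (
        "/scratch/trilinos/workspace/Trilinos_PR_gcc-7.2.0-debug@2/"
        "pull_request_test/packages/amesos/example/Amesos_a_trivial_mpi_test.exe: "
        "symbol lookup error: /projects/sems/install/rhel7-x86_64/sems/compiler/"
        "gcc/7.2.0/openmpi/1.10.1/lib/libmca_common_verbs.so.7: "
        "undefined symbol: ompi_common_verbs_usnic_register_fake_driver"
    )
    print(parse_symbol_lookup_error(sample))
```

Grouping the parsed results by `library` would show whether every failure points at the same OpenMPI install, which is what the samples in this thread suggest.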

@csiefer2 csiefer2 added type: bug The primary issue is a bug in Trilinos code or tests PA: Framework Issues that fall under the Trilinos Framework Product Area labels Sep 8, 2022
@csiefer2
Member Author

csiefer2 commented Sep 8, 2022

Nothing like watching over 2,000 tests fail...

@bartlettroscoe
Member

Hum, I saw the same error when trying to use one of the GenConfig builds on an 'hpws' machine showing runtime errors like:

1: /fgs/rabartl/Trilinos.base/BUILDS/ATDM/CHECKIN/sems-rhel7-intel-18.0.5-openmp-shared-release-debug/packages/teuchos/core/test/Allocator/TeuchosCore_Allocator_UnitTest.exe: symbol lookup error: /projects/sems/install/rhel7-x86_64/sems/compiler/intel/18.0.5/openmpi/1.10.1/lib/libmca_common_verbs.so.7: undefined symbol: ompi_common_verbs_usnic_register_fake_drivers

See TRILINOSHD-59 and:

@bartlettroscoe
Member

NOTE: This PR build is running on the machine ascic0193, which is one of the 'ascic0xy' machines that were supposed to have been taken out of the pool of PR testing machines (see #10893). How did it get back in?

@bartlettroscoe bartlettroscoe changed the title Framework: GNU 7.2 Build Failing Framework: GNU 7.2 tests failing on new 'ascic0xy' machines Sep 8, 2022
@bartlettroscoe
Member

As shown in this query (screenshot not preserved), this is taking out gnu-7.2.0, gnu-8.3.0, and clang-10.0.0 builds and impacting PRs #10962 and #109743 so far.

@bartlettroscoe bartlettroscoe changed the title Framework: GNU 7.2 tests failing on new 'ascic0xy' machines Framework: Mass gnu and clang test failures on new 'ascic0xy' machines Sep 8, 2022
@bartlettroscoe bartlettroscoe changed the title Framework: Mass gnu and clang test failures on new 'ascic0xy' machines Framework: Mass gnu and clang PR builds test failures on new 'ascic0xy' machines Sep 8, 2022
@jhux2 jhux2 pinned this issue Sep 9, 2022
@ndellingwood
Contributor

How do we view progress on TRILINOSHD-203? I have access to Jira but don't have permission to view this specific issue.

@csiefer2
Member Author

@ndellingwood You do now :)

@bartlettroscoe bartlettroscoe changed the title Framework: Mass gnu and clang PR builds test failures on new 'ascic0xy' machines Framework: Mass gnu and clang PR test failures on new 'ascic0xy' machines Sep 13, 2022
@bartlettroscoe bartlettroscoe changed the title Framework: Mass gnu and clang PR test failures on new 'ascic0xy' machines Framework: Mass gnu and clang PR test failures on new 'ascic0xyz' machines Sep 13, 2022
@bartlettroscoe bartlettroscoe unpinned this issue Sep 28, 2022
@bartlettroscoe
Member

bartlettroscoe commented Oct 23, 2022

@trilinos/framework, this is happening again (PR builds running on broken 'ascic0XYZ' nodes) as of 2022-10-21. As shown in this query, we have seen 30708 failed tests in these PR builds since then, with mpiexec init errors like:

/scratch/trilinos/workspace/Trilinos_PR_clang-10.0.0/pull_request_test/packages/stratimikos/adapters/amesos2/test/Stratimikos_test_single_amesos2_tpetra_solver_driver.exe: symbol lookup error: /projects/sems/install/rhel7-x86_64/sems/compiler/clang/10.0.0/openmpi/1.10.1/lib/libmca_common_verbs.so.7: undefined symbol: ompi_common_verbs_usnic_register_fake_drivers
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[18596,1],0]
  Exit code:    127
--------------------------------------------------------------------------

As shown by running:


$ ../TrilinosATDMStatus/create_trilinos_github_test_failure_issue_driver.sh -u "https://trilinos-cdash.sandia.gov/queryTests.php?project=Trilinos&begin=2022-10-20&end=2022-10-23&filtercount=4&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=Pull%20Request&field2=site&compare2=65&value2=ascic0&field3=status&compare3=61&value3=failed&field4=testoutput&compare4=95&value4=ompi_common_verbs_usnic_register_fake_drivers"

***
*** Getting data to create a new issue tracker
***

Downloading full list of nonpassing tests from CDash URL:

   https://trilinos-cdash.sandia.gov/queryTests.php?project=Trilinos&begin=2022-10-20&end=2022-10-23&filtercount=4&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=Pull%20Request&field2=site&compare2=65&value2=ascic0&field3=status&compare3=61&value3=failed&field4=testoutput&compare4=95&value4=ompi_common_verbs_usnic_register_fake_drivers

  Downloading CDash data from:
    https://trilinos-cdash.sandia.gov/api/v1/queryTests.php?project=Trilinos&begin=2022-10-20&end=2022-10-23&filtercount=4&showfilters=1&filtercombine=and&field1=groupname&compare1=61&value1=Pull%20Request&field2=site&compare2=65&value2=ascic0&field3=status&compare3=61&value3=failed&field4=testoutput&compare4=95&value4=ompi_common_verbs_usnic_register_fake_drivers

Total number of nonpassing tests over all days = 30708

Total number of unique nonpassing test/build pairs over all days = 30657

Number of test names = 2828

Number of build names = 28

Writing out new issue tracker text to 'newGithubMarkdownIssueBody.md'

Writing out list of test/biuld pairs for CSV file 'newTestsWithIssueTrackers.csv'

real    0m7.274s
user    0m1.500s
sys     0m0.686s
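The query URL passed to the script above follows a regular pattern: fixed parameters, then a numbered `fieldN`/`compareN`/`valueN` triple per filter. A minimal sketch of assembling such a URL, assuming only the parameter layout visible in that URL; the helper name `build_query_tests_url` is hypothetical, and the numeric compare codes are copied from the URL as opaque values rather than interpreted:

```python
from urllib.parse import quote, urlencode

CDASH_BASE = "https://trilinos-cdash.sandia.gov/queryTests.php"


def build_query_tests_url(project, begin, end, filters, combine="and"):
    """Build a queryTests.php URL.

    filters is a list of (field, compare_code, value) tuples; the compare
    codes are passed through unchanged.
    """
    params = [
        ("project", project),
        ("begin", begin),
        ("end", end),
        ("filtercount", len(filters)),
        ("showfilters", 1),
        ("filtercombine", combine),
    ]
    for i, (field, compare, value) in enumerate(filters, start=1):
        params += [
            (f"field{i}", field),
            (f"compare{i}", compare),
            (f"value{i}", value),
        ]
    # quote_via=quote encodes spaces as %20 (not +), matching the URL above.
    return CDASH_BASE + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(
        build_query_tests_url(
            "Trilinos",
            "2022-10-20",
            "2022-10-23",
            [
                ("groupname", 61, "Pull Request"),
                ("site", 65, "ascic0"),
                ("status", 61, "failed"),
                ("testoutput", 95, "ompi_common_verbs_usnic_register_fake_drivers"),
            ],
        )
    )
```

Changing only `begin` and `end` would let the same query be rerun for a later date range if the nodes reappear.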

This has taken out 28 individual PR builds in that time:

  • PR-10926-test-rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1308
  • PR-10926-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1292
  • PR-10926-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1300
  • PR-11053-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1283
  • PR-11136-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1293
  • PR-11136-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1301
  • PR-11162-test-rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1298
  • PR-11162-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1282
  • PR-11162-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1290
  • PR-11165-test-rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1302
  • PR-11165-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1286
  • PR-11165-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1294
  • PR-11170-test-rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1301
  • PR-11170-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1285
  • PR-11170-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1293
  • PR-11172-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1287
  • PR-11172-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1295
  • PR-11174-test-rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1306
  • PR-11174-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1290
  • PR-11174-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1298
  • PR-11180-test-rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1307
  • PR-11180-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1291
  • PR-11180-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1299
  • PR-11182-test-rhel7_sems-clang-10.0.0-openmpi-1.10.1-serial_release-debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1310
  • PR-11182-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1294
  • PR-11182-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1302
  • PR-11183-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1289
  • PR-11183-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1297
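Each build name in the list above starts with the PR number and the compiler, so the list can be condensed into a per-PR summary. A rough sketch, assuming the `PR-<number>-test-rhel7_sems-<compiler>-<version>-...` prefix seen in these names; the function `group_builds_by_pr` is illustrative only:

```python
import re
from collections import defaultdict

# Captures the PR number and the compiler/version pair from the assumed prefix.
_BUILD_NAME = re.compile(
    r"\bPR-(?P<pr>\d+)-test-rhel7_sems-(?P<compiler>[a-z]+-\d+(?:\.\d+)*)-"
)


def group_builds_by_pr(build_names):
    """Map PR number -> list of compiler strings, in input order."""
    grouped = defaultdict(list)
    for name in build_names:
        m = _BUILD_NAME.search(name)
        if m:  # names that do not fit the assumed prefix are skipped
            grouped[int(m["pr"])].append(m["compiler"])
    return dict(grouped)


if __name__ == "__main__":
    names = [
        "PR-11053-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared",
        "PR-11172-test-rhel7_sems-gnu-7.2.0-openmpi-1.10.1-serial_debug_shared",
        "PR-11172-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.1-openmp_release-debug_static",
    ]
    for pr, compilers in group_builds_by_pr(names).items():
        print(pr, compilers)
```

Applied to the 28 names listed above, this grouping yields 11 distinct PR numbers.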

How do these broken 'ascic0XYZ' nodes keep getting put back into the node pool for PR builds?

@e10harvey
Contributor

This should be resolved now.

@bartlettroscoe
Member

Unpinning because this should be resolved.

@bartlettroscoe bartlettroscoe unpinned this issue Oct 26, 2022
@github-actions

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Oct 28, 2023

This issue was closed due to inactivity for 395 days.

@github-actions github-actions bot added the CLOSED_DUE_TO_INACTIVITY Issue or PR has been closed by the GitHub Actions bot due to inactivity. label Nov 29, 2023