Document process for addressing failures in PR testing #10879

bartlettroscoe · 2022-08-15T20:12:32Z

Description

There needs to be a well defined process for addressing PR builds failures that creep into the 'develop' branch. Through various mechanisms, test failures (and rarely build failures) may pop up in the Trilinos Pull Request builds that are present on the 'develop' branch and are breaking many PR builds trying to test against 'develop'. These test failures must be addressed ASAP in order to allow innocent unrelated PRs to pass their builds and merge. (By not identifying and addressing these failures quickly, the PRs will get backed up and create a logjam that will take a long time and a lot of effort to get out of.)

This issue will be to create a Trilinos Wiki Page that documents this process.

bartlettroscoe · 2022-08-15T20:14:18Z

I was requested by @csiefer2 to do this in:

https://sems-atlassian-son.sandia.gov/jira/servicedesk/customer/portal/7/TRILINOSHD-168

I am writing put a draft for the wiki page now.

bartlettroscoe · 2022-08-15T20:17:56Z

@csiefer2, please review the initial draft:

https://github.com/trilinos/Trilinos/wiki/Addressing-test-failures

csiefer2 · 2022-08-15T20:22:20Z

Are those ini files in Trilinos or GenConfig?

Other than that, it looks good.

bartlettroscoe · 2022-08-15T20:26:42Z

Are those ini files in Trilinos or GenConfig?

They are in the Trilinos repo, see 6e31532 from PR #10871.

bartlettroscoe · 2022-08-15T20:27:31Z

Other than that, it looks good.

@csiefer2, great!

Now I will put in review others to take a look.

bartlettroscoe · 2022-08-15T23:20:14Z

@jhux2, @rppawlo, @e10harvey, @srbdev, can you please browse over to see if you can find any issues with:

https://github.com/trilinos/Trilinos/wiki/Addressing-test-failures

?

We need a solid process in place for quickly and effectively dealing with randomly failing tests bringing down PR builds (e.g. #10847).

srbdev · 2022-08-16T18:42:20Z

re:option B and disabling a defective test - wouldn't we run into the issue of having code being merged while the test is disabled that has error(s) that would be caught by the working cases of that test?

bartlettroscoe · 2022-08-16T18:52:05Z

re:option B and disabling a defective test - wouldn't we run into the issue of having code being merged while the test is disabled that has error(s) that would be caught by the working cases of that test?

That is a risk you take if you disable that test in all PR builds. But for the example in #10847 and PR #10871, we are only disabling these two tests in one of the CUDA builds and they are is still running in the other CUDA build and in all of the other PR builds. So you loose almost no real code or test coverage. So there is zero excuse to keep the failing tests enabled in that CUDA-11 build.

But the negative impact on your PR testing process is too great to let these tests fail. If the functionality protected by that failing test is large enough that you can't afford to have to test off, then you "Stop the Line", turn off all PR testing, and select option A) Fix the failure and then get it fixed ASAP before restarting the PR testing process.. That is what a real Lean/Agile project does in cases like this.

But it is extremely rare that one of these tests is so important that you can't disable it for some period of time or justifies stopping everything to fix it.

bartlettroscoe · 2022-08-16T19:11:57Z

re:option B and disabling a defective test - wouldn't we run into the issue of having code being merged while the test is disabled that has error(s) that would be caught by the working cases of that test?

A related question worth asking is why are there over 530 tests being disabled in the GenConfig PR builds a shown by:

$ cd Trilinos/

$ git log-short --name-status -1
6d19428c6ed "Merge Pull Request #10796 from srbdev/Trilinos/framework-7423-add-citation-message"
Author: trilinos-autotester <[email protected]>
Date:   Tue Aug 16 11:16:04 2022 -0600 (2 hours ago)

$ grep "_DISABLE " packages/framework/ini-files/config-specs.ini | wc -l
539

?

What is documented in:

https://github.com/trilinos/Trilinos/wiki/Addressing-test-failures

is just a careful process for what has already been going on in mass but in a way that is visible to the Trilinos developers and gives them control over the process. For example, I can't find Trilinos GitHub issues that mention most of those disabled tests. Do the Trilinos developers even know those tests are being disabled in PR testing? And if so, how was that communicated?

srbdev · 2022-08-16T19:29:51Z

Good point - I didn't think of the intersection of test cases across builds, and being covered in terms of code coverage that way. And I didn't know that was already going on with the already disabled tests.

👍🏼 for documenting the process

bartlettroscoe · 2022-08-16T20:20:19Z

👍🏼 for documenting the process

@srbdev, if you see any major issues with that documentation, please comment here. If you find any typos, please go ahead and fix them :-) (its a wiki)

bartlettroscoe · 2022-10-26T18:58:33Z

This has been done for a while in:

https://github.com/trilinos/Trilinos/wiki/Addressing-test-failures

Closing as complete

bartlettroscoe self-assigned this Aug 15, 2022

bartlettroscoe added Framework tasks Framework tasks (used internally by Framework team) PA: Framework Issues that fall under the Trilinos Framework Product Area impacting: documentation The issue is primarily about documentation vs. a problem with the build or tests labels Aug 15, 2022

bartlettroscoe closed this as completed Oct 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Document process for addressing failures in PR testing #10879

Document process for addressing failures in PR testing #10879

bartlettroscoe commented Aug 15, 2022 •

edited

Loading

bartlettroscoe commented Aug 15, 2022

bartlettroscoe commented Aug 15, 2022

csiefer2 commented Aug 15, 2022 •

edited

Loading

bartlettroscoe commented Aug 15, 2022 •

edited

Loading

bartlettroscoe commented Aug 15, 2022

bartlettroscoe commented Aug 15, 2022

srbdev commented Aug 16, 2022

bartlettroscoe commented Aug 16, 2022 •

edited

Loading

bartlettroscoe commented Aug 16, 2022

srbdev commented Aug 16, 2022

bartlettroscoe commented Aug 16, 2022

bartlettroscoe commented Oct 26, 2022

Document process for addressing failures in PR testing #10879

Document process for addressing failures in PR testing #10879

Comments

bartlettroscoe commented Aug 15, 2022 • edited Loading

Description

bartlettroscoe commented Aug 15, 2022

bartlettroscoe commented Aug 15, 2022

csiefer2 commented Aug 15, 2022 • edited Loading

bartlettroscoe commented Aug 15, 2022 • edited Loading

bartlettroscoe commented Aug 15, 2022

bartlettroscoe commented Aug 15, 2022

srbdev commented Aug 16, 2022

bartlettroscoe commented Aug 16, 2022 • edited Loading

bartlettroscoe commented Aug 16, 2022

srbdev commented Aug 16, 2022

bartlettroscoe commented Aug 16, 2022

bartlettroscoe commented Oct 26, 2022

bartlettroscoe commented Aug 15, 2022 •

edited

Loading

csiefer2 commented Aug 15, 2022 •

edited

Loading

bartlettroscoe commented Aug 15, 2022 •

edited

Loading

bartlettroscoe commented Aug 16, 2022 •

edited

Loading