Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document process for addressing failures in PR testing #10879

Closed
bartlettroscoe opened this issue Aug 15, 2022 · 12 comments
Closed

Document process for addressing failures in PR testing #10879

bartlettroscoe opened this issue Aug 15, 2022 · 12 comments
Assignees
Labels
Framework tasks Framework tasks (used internally by Framework team) impacting: documentation The issue is primarily about documentation vs. a problem with the build or tests PA: Framework Issues that fall under the Trilinos Framework Product Area

Comments

@bartlettroscoe
Copy link
Member

bartlettroscoe commented Aug 15, 2022

CC: @csiefer2, @rppawlo, @jhux2

Description

There needs to be a well defined process for addressing PR builds failures that creep into the 'develop' branch. Through various mechanisms, test failures (and rarely build failures) may pop up in the Trilinos Pull Request builds that are present on the 'develop' branch and are breaking many PR builds trying to test against 'develop'. These test failures must be addressed ASAP in order to allow innocent unrelated PRs to pass their builds and merge. (By not identifying and addressing these failures quickly, the PRs will get backed up and create a logjam that will take a long time and a lot of effort to get out of.)

This issue will be to create a Trilinos Wiki Page that documents this process.

@bartlettroscoe bartlettroscoe self-assigned this Aug 15, 2022
@bartlettroscoe bartlettroscoe added Framework tasks Framework tasks (used internally by Framework team) PA: Framework Issues that fall under the Trilinos Framework Product Area impacting: documentation The issue is primarily about documentation vs. a problem with the build or tests labels Aug 15, 2022
@bartlettroscoe
Copy link
Member Author

I was requested by @csiefer2 to do this in:

I am writing put a draft for the wiki page now.

@bartlettroscoe
Copy link
Member Author

@csiefer2
Copy link
Member

csiefer2 commented Aug 15, 2022

  1. Are those ini files in Trilinos or GenConfig?

Other than that, it looks good.

@bartlettroscoe
Copy link
Member Author

bartlettroscoe commented Aug 15, 2022

  1. Are those ini files in Trilinos or GenConfig?

They are in the Trilinos repo, see 6e31532 from PR #10871.

@bartlettroscoe
Copy link
Member Author

Other than that, it looks good.

@csiefer2, great!

Now I will put in review others to take a look.

@bartlettroscoe
Copy link
Member Author

@jhux2, @rppawlo, @e10harvey, @srbdev, can you please browse over to see if you can find any issues with:

?

We need a solid process in place for quickly and effectively dealing with randomly failing tests bringing down PR builds (e.g. #10847).

@srbdev
Copy link
Contributor

srbdev commented Aug 16, 2022

re:option B and disabling a defective test - wouldn't we run into the issue of having code being merged while the test is disabled that has error(s) that would be caught by the working cases of that test?

@bartlettroscoe
Copy link
Member Author

bartlettroscoe commented Aug 16, 2022

re:option B and disabling a defective test - wouldn't we run into the issue of having code being merged while the test is disabled that has error(s) that would be caught by the working cases of that test?

That is a risk you take if you disable that test in all PR builds. But for the example in #10847 and PR #10871, we are only disabling these two tests in one of the CUDA builds and they are is still running in the other CUDA build and in all of the other PR builds. So you loose almost no real code or test coverage. So there is zero excuse to keep the failing tests enabled in that CUDA-11 build.

But the negative impact on your PR testing process is too great to let these tests fail. If the functionality protected by that failing test is large enough that you can't afford to have to test off, then you "Stop the Line", turn off all PR testing, and select option A) Fix the failure and then get it fixed ASAP before restarting the PR testing process.. That is what a real Lean/Agile project does in cases like this.

But it is extremely rare that one of these tests is so important that you can't disable it for some period of time or justifies stopping everything to fix it.

@bartlettroscoe
Copy link
Member Author

re:option B and disabling a defective test - wouldn't we run into the issue of having code being merged while the test is disabled that has error(s) that would be caught by the working cases of that test?

A related question worth asking is why are there over 530 tests being disabled in the GenConfig PR builds a shown by:

$ cd Trilinos/

$ git log-short --name-status -1
6d19428c6ed "Merge Pull Request #10796 from srbdev/Trilinos/framework-7423-add-citation-message"
Author: trilinos-autotester <[email protected]>
Date:   Tue Aug 16 11:16:04 2022 -0600 (2 hours ago)

$ grep "_DISABLE " packages/framework/ini-files/config-specs.ini | wc -l
539

?

What is documented in:

is just a careful process for what has already been going on in mass but in a way that is visible to the Trilinos developers and gives them control over the process. For example, I can't find Trilinos GitHub issues that mention most of those disabled tests. Do the Trilinos developers even know those tests are being disabled in PR testing? And if so, how was that communicated?

@srbdev
Copy link
Contributor

srbdev commented Aug 16, 2022

Good point - I didn't think of the intersection of test cases across builds, and being covered in terms of code coverage that way. And I didn't know that was already going on with the already disabled tests.

👍🏼 for documenting the process

@bartlettroscoe
Copy link
Member Author

👍🏼 for documenting the process

@srbdev, if you see any major issues with that documentation, please comment here. If you find any typos, please go ahead and fix them :-) (its a wiki)

@bartlettroscoe
Copy link
Member Author

This has been done for a while in:

Closing as complete

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Framework tasks Framework tasks (used internally by Framework team) impacting: documentation The issue is primarily about documentation vs. a problem with the build or tests PA: Framework Issues that fall under the Trilinos Framework Product Area
Projects
None yet
Development

No branches or pull requests

3 participants