Document process for addressing failures in PR testing #10879
Comments
I was asked by @csiefer2 to do this in: I am putting together a draft for the wiki page now.
@csiefer2, please review the initial draft:
Other than that, it looks good.
@csiefer2, great! Now I will put it in review and ask others to take a look.
@jhux2, @rppawlo, @e10harvey, @srbdev, can you please look over this and see if you can find any issues with: ? We need a solid process in place for quickly and effectively dealing with randomly failing tests bringing down PR builds (e.g. #10847).
re: option B and disabling a defective test - wouldn't we run the risk of merging code, while the test is disabled, that has errors the working cases of that test would have caught?
That is a risk you take if you disable the test in all PR builds. But for the example in #10847 and PR #10871, we are only disabling these two tests in one of the CUDA builds; they are still running in the other CUDA build and in all of the other PR builds, so you lose almost no real code or test coverage. Given that, there is no good reason to keep the failing tests enabled in that CUDA 11 build, and the negative impact on the PR testing process is too great to let them keep failing. If the functionality protected by a failing test is important enough that you cannot afford to turn the test off, then you "Stop the Line": turn off all PR testing, select option A (fix the failure), and get it fixed ASAP before restarting the PR testing process. That is what a real Lean/Agile project does in cases like this. But it is extremely rare that one of these tests is so important that you can't disable it for some period of time, or that it justifies stopping everything to fix it.
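For concreteness, here is a minimal sketch of what disabling a single test in just one PR build's configure step might look like. It assumes the TriBITS-style `<fullTestName>_DISABLE` cache option and a made-up test name; the actual Trilinos PR driver scripts and GenConfig fragments may set this up differently.

```bash
# Minimal sketch (not the actual Trilinos PR driver): configure one CUDA PR
# build with a single failing test disabled, leaving every other PR build
# unchanged. The test name below is a placeholder, and other required
# configure options are omitted.
cmake \
  -D Trilinos_ENABLE_TESTS=ON \
  -D SomePackage_SomeFailingUnitTest_MPI_4_DISABLE=ON \
  ../Trilinos
```

Because only this one build sets the `_DISABLE` option, the test keeps running in the other CUDA build and in all remaining PR builds, which is why the lost coverage is minimal.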
A related question worth asking is why there are over 530 tests being disabled in the GenConfig PR builds, as shown by:
? What is documented in: is just a careful process for what has already been going on en masse, but done in a way that is visible to the Trilinos developers and gives them control over the process. For example, I can't find Trilinos GitHub issues that mention most of those disabled tests. Do the Trilinos developers even know those tests are being disabled in PR testing? And if so, how was that communicated?
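As a rough way to gauge the scale of what is already being disabled, one could grep the PR build configuration fragments for per-test disable settings. This is only a hypothetical sketch; the directory path is an assumption about the repository layout, not the real GenConfig structure, and the authoritative list is whatever the query above reports.

```bash
# Hypothetical sketch: count per-test disable settings in a PR build's CMake
# configuration fragments. The path is an assumption, not the real layout.
grep -R --include='*.cmake' '_DISABLE' packages/framework | wc -l
```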
Good point - I didn't think of the intersection of test cases across builds and the code coverage you keep that way. And I didn't know that was already going on with the already-disabled tests. 👍🏼 for documenting the process
@srbdev, if you see any major issues with that documentation, please comment here. If you find any typos, please go ahead and fix them :-) (it's a wiki)
This has been done for a while in:
Closing as complete.
CC: @csiefer2, @rppawlo, @jhux2
Description
There needs to be a well-defined process for addressing PR build failures that creep into the 'develop' branch. Through various mechanisms, test failures (and, rarely, build failures) can show up in the Trilinos Pull Request builds that are present on the 'develop' branch and break many PR builds trying to test against 'develop'. These failures must be addressed ASAP in order to allow innocent, unrelated PRs to pass their builds and merge. (If these failures are not identified and addressed quickly, PRs will get backed up and create a logjam that will take a long time and a lot of effort to clear.)
This issue tracks creating a Trilinos wiki page that documents this process.