Tpetra, Xpetra, Amesos2, MueLu, and PanzerAdaptersSTK_* tests failing in many ATDM cuda 9.2 builds with Kokkos view bounds errors starting before 2019-05-15 #5179
Comments
Panzer has not changed recently. According to EMPIRE testing (thanks @jmgate!), these are the candidate commits that could have caused the new Panzer failure.
@kyungjoo-kim, @tjfulle, do you know if your recent commits listed above might have caused this?
@rppawlo @bartlettroscoe No, my commits are only intended for the Ifpack2 BlockTriDiContainer. The commits are not related to Panzer.
Does Panzer use Ifpack2?
Panzer may use other Ifpack2 components, but it does not use the BlockTriDiContainer solver. The solver I am working on is only used by SPARC.
@rppawlo and I talked about this over e-mail. The issue is that Trilinos does not yet work correctly when deprecated Tpetra code is disabled; @trilinos/tpetra is working on fixing these issues. The work-around for now is not to disable deprecated code.
@mhoemmen correct me if I'm wrong, but these failures don't have Tpetra_ENABLE_DEPRECATED_CODE:BOOL=OFF set.
The deprecated code is enabled as @bathmatt mentioned. So the errors from the tests are different. One test shows:
While the two other failures show:
@mhoemmen - are there any changes to Tpetra in the last 2 days that might trigger this?
I don't think so, but it's possible. @trilinos/tpetra For debugging, to see if this is a Panzer issue, we could adjust that print threshold temporarily.
Try also setting the environment variable
I'm seeing the second error Roger mentioned in EMPIRE now with the EMPlasma Trilinos. So this isn't a new bug; it looks like it is an older bug that is starting to pop up more often.
My statement might be incorrect: I wiped everything clean and it looks like it isn't popping up anymore.
After rebuilding from scratch, this looks like the parallel test level is too high and the CUDA card is running out of memory with multiple tests hitting the same card. In the steps to reproduce, the tests are run with "cmake -j20". I could not reproduce the errors running the tests manually or when the cmake parallel level was reduced. I think we run the other CUDA machines at -j8. Maybe we need to do that here also?
@rppawlo, looking at the Jenkins driver at: it shows:
Therefore, it is running them with that setting. But that may be too much for some of these Panzer tests?
I think that is ok. The instructions at the top of this ticket have -j20, so I assumed that is what the tests were running. With -j20 I see a bunch of failures; with -j8 nothing fails. Do the ATDM build scripts wipe the build directory? Some of the reported failures went away for both Matt and me with a clean build.
@rppawlo asked:
By default, all of the nightly ATDM Trilinos builds build from scratch each day. We can look on Jenkins to be sure that is the case. For example, at: it shows:
and does not show any errors, so I would assume that it is blowing away the directories.
It looks like these are also failing on non-waterman builds. There are 74 failing PanzerAdaptersSTK_* tests across 4 builds between 2019-05-01 and 2019-05-29, shown here. Note that the above link filters out builds on white and ride because we have seen a lot of failures on those machines recently, but these tests may be failing there too: failures on white/ride in the last 2 weeks
All failures are in CUDA builds using the Tpetra deprecated dynamic profile. I've tracked the multiblock test failure to a separate issue and will push a fix shortly. The majority of the random errors look to be in fillComplete on the CrsMatrix. I have not had good luck reproducing them in raw Panzer tests. EMPIRE is also seeing similar failures, and @bathmatt was able to get the following stack trace:
The failures occur in different ways - maybe a race condition? Sometimes we see a raw seg fault, and sometimes we get the following two different errors reported from Tpetra:
@mhoemmen and @tjfulle - have there been any changes recently to Tpetra that might cause these kinds of errors?
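For context, here is a minimal, generic sketch of the Tpetra assembly pattern these tests exercise: insert entries into a CrsMatrix and then call fillComplete(), which is where the random errors were reported. This is standard tutorial-style usage with default template parameters; the sizes, the diagonal fill, and all names here are illustrative assumptions, not the actual Panzer code.

```cpp
#include <Tpetra_Core.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include <Tpetra_Map.hpp>
#include <Teuchos_RCP.hpp>
#include <Teuchos_Tuple.hpp>

int main(int argc, char* argv[]) {
  // ScopeGuard initializes and finalizes MPI and Kokkos.
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);
  {
    using map_type    = Tpetra::Map<>;
    using matrix_type = Tpetra::CrsMatrix<>;  // default Scalar/Ordinal/Node
    using GO          = map_type::global_ordinal_type;

    auto comm = Tpetra::getDefaultComm();
    const Tpetra::global_size_t numGlobalRows = 100;
    auto rowMap = Teuchos::rcp(new map_type(numGlobalRows, 0, comm));

    // Assemble a simple diagonal matrix: at most one entry per row.
    matrix_type A(rowMap, 1);
    for (GO gblRow = rowMap->getMinGlobalIndex();
         gblRow <= rowMap->getMaxGlobalIndex(); ++gblRow) {
      A.insertGlobalValues(gblRow,
                           Teuchos::tuple<GO>(gblRow),    // column index
                           Teuchos::tuple<double>(2.0));  // value
    }

    // fillComplete() builds the local (Kokkos) data structures; this is
    // the call the randomly failing tests were reported to be inside.
    A.fillComplete();
  }
  return 0;
}
```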
Tpetra changes to that code haven't landed yet. Try a debug build, and set the environment variable
I'm trying that now. I have a test that fails with optimized code in 3 out of 4 runs; however, in a debug build it hasn't failed yet. Maybe this flag will show something.
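As background on why a debug build helps here: with bounds checking enabled, Kokkos turns an out-of-range View access into an immediate, labeled abort instead of silent corruption. Below is a minimal sketch (not one of the failing tests); the option name Kokkos_ENABLE_DEBUG_BOUNDS_CHECK is the recent CMake spelling and may differ by Kokkos version.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::View<double*> x("x", 10);

    // Pass any command-line argument to deliberately walk one element
    // past the end of x.  In a debug build with bounds checking enabled,
    // Kokkos aborts right away with a view bounds error that names the
    // view label and the bad index.  In an optimized build the same
    // access may appear to work, corrupt memory, or segfault somewhere
    // else entirely, which is part of why these failures were so hard
    // to reproduce.
    const int n = (argc > 1) ? 11 : 10;

    Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
      x(i) = static_cast<double>(i);  // i == 10 is out of bounds for x
    });
    Kokkos::fence();

    std::printf("done\n");
  }
  Kokkos::finalize();
  return 0;
}
```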
I have the log files from two failed runs; they are different. Nothing is reported, but it looks like a CUDA memory error. Should I rerun with verbose?
@jjellio wrote:
I made
Let's take this conversation to: I don't want to distract from this problem. The link above has info on how to run with CUDA on our machines, as well as LLNL ones, with CUDA enabled or without.
@nmhamster @jjellio @jhux2 this gets my run past the MPI error; I ran all the way to the end with this flag. Not sure what it does. We don't have CUDA direct MPI enabled.
@bathmatt - I think it changes the memory regions where MPI/network has visibility. So this fixes your issue completely, based on what I think you have above?
Si, sorry, didn’t follow that
atrocities to make CUDA error in github issue #5179 reproducible. now even worse
@trilinos/tpetra This is part of trilinos#5179 debugging.
Tpetra: More permanent fixes for FixedHashTable issue in #5179
FYI: Looking at this query there were 35 failures of Panzer tests beginning with
In this query, you see there are zero other failures. Therefore, I don't know that we are seeing these Panzer tests failing anymore, and I think we can remove tracking of these Panzer tests.
We should probably leave them in. There is an underlying issue and it can affect Panzer. The team has a good idea of what it is, but fixing it may take a bit still.
FYI: PR #6425 disabled these Panzer tests for the build:
@trilinos/tpetra, @trilinos/xpetra, @trilinos/amesos2, @trilinos/muelu, @trilinos/panzer, @bathmatt, FYI: According to this query, in the last 90 days there was just one failing test that showed
whose output showed:
That is just one test failure in the last 3 months. Has this issue finally been resolved? Can we close this Issue? If not, why are we not seeing any more of these failures in the last 2.5 months? Is this just dumb luck? (I wish we had kept clear statistics for how frequent these failures were before so we had something to compare against.)
@bartlettroscoe I think this is the elusive so-called "UVM error." It has not been resolved. I'm going with dumb luck for the recent successes, or maybe some sort of system upgrade. It is interesting that the error has not appeared for 2.5 months; thanks for tracking it.
FYI: I just heard from @crtrott yesterday that a reproducer was created for this, with help from Stan Moore using EMPIRE, which allowed the defect to be discovered and fixed. A modified debug-enabled EMPIRE executable was able to trigger it about 50% of the time after several invocations (I don't remember the actual number of invocations). This turned out to be a defect in Kokkos. The fix was just merged to Trilinos 'develop' in PR #6627 and was merged to the Kokkos release and 'develop' branches. Stan verified that the problem was fixed with EMPIRE. To make a long story short, a race condition existed between different threads on the GPU computing the scan result that would (in very rare cases) compute the wrong result. The fix was to put in a fence to avoid the race. The reason this was triggering errors like:
was that a bad integer index was being computed that was invalid for creating the view. So it would appear that the native Trilinos test suite was in fact demonstrating the defect! If someone had run one of these failing Trilinos tests hundreds or thousands of times, they might have been able to reproduce the failures. Why we saw fewer of these failures in the last 90 days is unclear; perhaps other code was changed that lessened how often this was triggered. So it would appear that this was NOT a UVM bug!
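To make the failure pattern concrete, here is a hypothetical sketch (not the actual Kokkos defect, which was internal to the CUDA parallel_scan implementation) of how a scan result that is read before the device has finished computing it can hand a garbage extent to a View constructor. The UVM memory space and all names and sizes here are illustrative assumptions.

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
#if defined(KOKKOS_ENABLE_CUDA)
    // UVM memory is directly readable from the host, as in the ATDM CUDA builds.
    using mem_space = Kokkos::CudaUVMSpace;
    const int n = 1000;

    Kokkos::View<int*, mem_space> counts("counts", n);
    Kokkos::deep_copy(counts, 1);

    // Exclusive prefix sum: offsets(i) = sum of counts(0..i-1);
    // offsets(n) ends up holding the total, used below to size an allocation.
    Kokkos::View<int*, mem_space> offsets("offsets", n + 1);
    Kokkos::parallel_scan("build_offsets", n,
      KOKKOS_LAMBDA(const int i, int& update, const bool final_pass) {
        if (final_pass) offsets(i) = update;
        update += counts(i);
        if (final_pass && i == n - 1) offsets(n) = update;
      });

    // The scan launches asynchronously.  Without this fence, the host read
    // of offsets(n) below (through UVM) can observe a stale or partially
    // written value, and that bad extent then goes straight into a View
    // constructor: the same "bad integer index used to create the view"
    // failure described above.
    Kokkos::fence();

    const int total = offsets(n);  // host read through UVM
    Kokkos::View<double*, mem_space> values("values", total);
    std::printf("allocated %d entries\n", total);
#else
    std::printf("This sketch is only meaningful in a CUDA (UVM) build.\n");
#endif
  }
  Kokkos::finalize();
  return 0;
}
```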
I will put this issue in review for now and then we can close after I discuss this with a few people.
FYI: As shown in the comments above, this random failure likely occurred not just in Power9+GPU builds but also in x86+CUDA builds, because we saw these view bounds errors in tests in the builds
@bathmatt reports this is fixed so we can close. Yea!
Bug Report
CC: @trilinos/panzer, @kddevin (Trilinos Data Services Product Lead), @srajama1 (Trilinos Linear Solver Data Services), @mperego (Trilinos Discretizations Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
Since PR #5346, which fixed a file read/write race in the test, was merged on 6/7/2019, there has been only one failing Panzer test on any ATDM Trilinos platform as of 6/11/2019 that looks to be related. Also, on 6/11/2019 @bathmatt reported that EMPIRE is not failing in a similar way in his recent tests. Next: Watch results over the next few weeks to see if more random failures like this occur ...
Description
As shown in this query the tests:
are failing in the build:
Additionally the test:
is failing in a different build on the same machine:
New commits on 2019-05-14
Current Status on CDash
Results for the current testing day
Steps to Reproduce
One should be able to reproduce this failure on waterman as described in:
More specifically, the commands given for waterman are provided at:
The exact commands to reproduce this issue should be: