TeuchosCore_ScalarTraits_test_MPI_1 Tests Fail on POWER8 with GCC 4.9.2 and CUDA 7.5 #239
Comments
Hm, NaN never equals NaN; that's how you're supposed to test for NaN... |
So is the test working? It looks like no. |
The test does not actually try to compare NaNs. It calls
(These tests should be converted to use the Teuchos Unit Test Harness. But at the very least, they should use the unit test macros to show what is actually being tested.) My guess is that the following function is not correctly flagging NaNs on this system:
It looks like this compiler/runtime is not handling floating-point inequality involving NaNs correctly. That is a little scary. This test was taken from NOX and has worked on many compilers and platforms for a long time. It looks like this function has worked untouched for 6 years, since the commit:
From looking at that patch, you can see that the only thing that was added was the test for INF. The test for NaN was unchanged. This NOX test for NaNs appears to go back 12 years to:
This was pre-git, pre-checkin-test, etc. I think that means that in 12 years, we have yet to see a platform where this test for NaN did not work, until this machine. The function isnaninf() is used in several numerical algorithms to detect major failures and to restart many numerical methods. It is used in line search methods for backtracking, etc. If you can't detect Infs and NaNs on a given platform, I would not trust most of the numerical algorithms. This function gets used in a lot of places:
@nmhamster, how urgent is this? What other test failures do you see on this platform or is this the tip of the iceberg? I suspect that many other skilled numerical programmers could look into this. Therefore, I have added the "help wanted" label. |
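For readers following along, here is a minimal sketch of the kind of generic comparison-based test being discussed. It is not the Teuchos or NOX source (that code is not reproduced in this thread); the name `generic_isnaninf` is hypothetical. It relies only on two IEEE 754 facts: every ordered comparison involving NaN is false, and Inf times zero is NaN.

```cpp
#include <cassert>  // only needed for the usage checks mentioned below
#include <limits>

// Hypothetical name; a sketch of the idea, not the library implementation.
template <typename Scalar>
bool generic_isnaninf(Scalar x) {
  // Any finite reference value works; NaN fails both ordered comparisons.
  const Scalar ref = static_cast<Scalar>(1);
  if (!(x <= ref) && !(x > ref)) return true;   // x is NaN
  // Inf * 0 is NaN under IEEE 754, so the same comparison catches +/-Inf.
  const Scalar z = static_cast<Scalar>(0) * x;
  if (!(z <= ref) && !(z > ref)) return true;   // x is +Inf or -Inf
  return false;
}
```

With IEEE-conforming floating point, `generic_isnaninf(std::numeric_limits<double>::quiet_NaN())` and `generic_isnaninf(std::numeric_limits<double>::infinity())` should both return true, and `generic_isnaninf(1.5)` should return false. If a compiler is allowed to assume that NaN and Inf never occur, it may fold these comparisons away, which would match the symptom reported here.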
@bartlettroscoe is right -- those tests look legit. We could try splicing the C++11 / C99 functions isinf / isnan into the test temporarily on white. Teuchos can't use them unconditionally, because Teuchos must not require C++11 (Teuchos is like Kokkos, in that it has other customers besides Trilinos). |
If C++11 has standard isinf() and isnan() functions, then we should use them. We can just put in an ifdef based on Can we give that a try? |
I think it makes sense to use the C++11 functions conditionally, protected by the macro as you said. I just got back and am digging out of e-mail purgatory still. If you're in the mood, feel free to take this. Otherwise, you're welcome to assign to me, but it will take a while for me to pick this up. |
@bartlettroscoe @mhoemmen this isn't urgent. I agreed with Mike H that I would start fielding bugs in the Advanced Architecture Test Bed bring up so that the new code efforts and SIERRA would get more robust products when they got onto the machines. I'm slowly working my way through all the configurations and tests. Remember - these are using OpenBLAS so it's possible we have an issue there, although I doubt that would affect NaN. This is also running with a fair amount of optimization enabled so I wonder if it is short circuiting something assuming values won't be NaN or INF. |
That could very well be the case. But that level of optimization is also a bit scary. At some point, you can't write robust numerical algorithms if compilers/runtimes throw out too much of the (already minimal) IEEE floating point standards (e.g. if the compiler/runtime does not respect '()' in expressions, then you are sunk). |
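One way to check the optimization hypothesis, sketched under assumptions: the thread does not say which flags were used on white. GCC and Clang predefine `__FINITE_MATH_ONLY__` to 1 when `-ffinite-math-only` is in effect (it is implied by `-ffast-math`), and in that mode the compiler may assume NaN and Inf never occur. The helper names below are hypothetical, and the bit-level check assumes `double` is IEEE 754 binary64 and that `<cstdint>` is available.

```cpp
#include <cassert>  // only needed for the usage checks mentioned below
#include <cstdint>
#include <cstring>
#include <limits>

// Reports whether this translation unit was compiled with the
// "no NaN/Inf" assumption enabled (GCC/Clang macro; other compilers
// fall through to false, which means "unknown" rather than "safe").
inline bool finite_math_assumed() {
#if defined(__FINITE_MATH_ONLY__) && __FINITE_MATH_ONLY__
  return true;
#else
  return false;
#endif
}

// Bit-level NaN/Inf check that avoids floating-point comparisons.
// In binary64, an all-ones exponent field means NaN or +/-Inf.
inline bool isnaninf_bits(double x) {
  std::uint64_t bits;
  std::memcpy(&bits, &x, sizeof bits);  // well-defined type punning
  const std::uint64_t exp_mask = 0x7FF0000000000000ULL;
  return (bits & exp_mask) == exp_mask;
}
```

If `finite_math_assumed()` returns true in the failing build, that would support the short-circuiting theory. `isnaninf_bits` is offered only as a diagnostic cross-check, not as a proposed replacement for the library function.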
@bartlettroscoe agreed, that's why we want to get these tests reported before code teams get to the machines. Let me know what you want to try (patch?) and I'll get it run as soon as I have SSH. Appreciate you're swamped right now. |
It is a very small change. I will give it a try soon and push if it passes with GCC 4.8.4 on hansen. |
The ATTB machine 'white', which is a POWER8 with GCC 4.9.2 and CUDA 7.5, fails the unit tests for ST::isnaninf(), which uses the generic inequality test for NaNs. This may be due to heavy compiler optimizations. Therefore, we are going to try the C++11 functions std::isnan() and std::isinf(). I have ifdefed this based on C++11. Without C++11, it just uses the old generic implementation. I have tested this with and without C++11 enabled and both passed on the machine hansen using the GCC 4.8.4 compiler.
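A minimal sketch of what such a conditional implementation could look like. This is not the committed code. The guard `SKETCH_HAVE_CXX11` is a hypothetical stand-in derived from `__cplusplus`; the project's actual configuration macro is not named in this thread. The function name is also hypothetical.

```cpp
#include <cassert>  // only needed for the usage checks mentioned below
#include <cmath>
#include <limits>

// Assumption: use the language version macro in place of the project's
// own C++11 configuration macro.
#if __cplusplus >= 201103L
#  define SKETCH_HAVE_CXX11 1
#endif

// Hypothetical name; returns true if x is NaN or +/-Inf.
inline bool isnaninf_sketch(double x) {
#ifdef SKETCH_HAVE_CXX11
  // C++11 path: defer to the standard library classification functions.
  return std::isnan(x) || std::isinf(x);
#else
  // Pre-C++11 fallback: generic comparison-based test.
  const double ref = 1.0;
  if (!(x <= ref) && !(x > ref)) return true;  // NaN fails both comparisons
  const double z = 0.0 * x;                    // Inf * 0 is NaN
  return !(z <= ref) && !(z > ref);
#endif
}
```

Either path should report true for `std::numeric_limits<double>::quiet_NaN()` and for `std::numeric_limits<double>::infinity()`, and false for ordinary finite values, provided the build does not enable finite-math-only optimizations.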
I tried to push from hansen and a bunch of Trilinos tests timed out or failed. I am running the checkin-test.py script now on the ORNL machine th232 and everything is passing so far (the full PT MPI_DEBUG build and tests passed). |
The ATTB machine 'white', which is a POWER8 with GCC 4.9.2 and CUDA 7.5, fails the unit tests for ST::isnaninf(), which uses the generic inequality test for NaNs. This may be due to heavy compiler optimizations. Therefore, we are going to try the C++11 functions std::isnan() and std::isinf(). I have ifdefed this based on C++11. Without C++11, it just uses the old generic implementation. I have tested this with and without C++11 enabled and both passed on the machine hansen using the GCC 4.8.4 compiler.
Build/Test Cases Summary
Enabled Packages: TeuchosCore
Disabled Packages: PyTrilinos,Pliris,Claps,STK,TriKota
Enabled all Forward Packages
0) MPI_DEBUG => passed: passed=1403,notpassed=0 (54.55 min)
1) SERIAL_RELEASE => passed: passed=1323,notpassed=0 (40.93 min)
@nmhamster, I just pushed the commit:
Please try to pull the current version of Trilinos 'master' off github with this commit and then try this again on @mhoemmen, note that I also improved the unit tests so that they should be more clear now. |
@bartlettroscoe thanks!!! :-D |
@bartlettroscoe @mhoemmen The rebuild on POWER is underway. The full build takes a considerable amount of time (days), but I will try to get a Teuchos check ASAP. |
Got the Teuchos only answer back much quicker than I expected -
|
Excellent! I am closing as complete. |
Awesome, thanks @bartlettroscoe !!! @nmhamster -- would you happen to know what optimization flags cause this? I'm a little scared if Inf and NaN behavior break. These are pretty tame bits of the floating-point standard. |
This file should be merged to Trilinos only after pull request trilinos#239 in TriBITSPub has been propagated to Trilinos. (TriBITSPub/TriBITS#239) It merely sets the override for the sierra builds to be the c11 posix standard.
Origin repo remote tracking branch: 'github/master'
Origin repo remote repo URL: 'github = [email protected]:TriBITSPub/TriBITS.git'
At commit:
commit 6e3d7d2be883b43cbc4dacf27e083f0283ec01b8
Author: Roscoe A. Bartlett <[email protected]>
Date: Thu Nov 9 19:39:57 2017 -0700
Summary: Define <Project>_C_Standard as cache var and document (#239)
@bartlettroscoe - I am getting these errors in the Trilinos bring up on POWER8 with GCC 4.9.2 and CUDA 7.5. This is being tested on the white Sandia system. Just want to check in whether these would be expected failures or not?