TeuchosCore_ScalarTraits_test_MPI_1 Tests Fail on POWER8 with GCC 4.9.2 and CUDA 7.5 #239

nmhamster · 2016-03-22T02:36:24Z

@bartlettroscoe - I am getting these errors in the Trilinos bring up on POWER8 with GCC 4.9.2 and CUDA 7.5. This is being tested on the white Sandia system. Just want to check in whether these would be expected failures or not?

34: Testing: Teuchos::ScalarTraits<float> ...
34:  Type chain (ascending) : float -> double
34:  Type chain (descending): float
34:
34:  Testing that squareroot(NaN) == NaN! ...
34:  squareroot(nan) = nan == nan : failed
34:
34:  Testing that squareroot(-NaN) == NaN! ...
34:  squareroot(-nan) = -nan == nan : failed
34:
34:  Testing that squareroot(-1) == NaN! ...
34:  squareroot(-1) = nan == nan : failed
34:
34: Testing: Teuchos::ScalarTraits<double> ...
34:  Type chain (ascending) : double
34:  Type chain (descending): double -> float
34:
34:  Testing that squareroot(NaN) == NaN! ...
34:  squareroot(nan) = nan == nan : failed
34:
34:  Testing that squareroot(-NaN) == NaN! ...
34:  squareroot(-nan) = -nan == nan : failed
34:
34:  Testing that squareroot(-1) == NaN! ...
34:  squareroot(-1) = nan == nan : failed

The text was updated successfully, but these errors were encountered:

mhoemmen · 2016-03-22T07:40:40Z

Hm, NaN never equals NaN; that's how you're supposed to test for NaN...
C++11 would give us isnan in , but Teuchos isn't supposed to require C++11 (or C99).

nmhamster · 2016-03-22T10:22:27Z

So is the test working? It looks like no.

bartlettroscoe · 2016-03-22T17:00:42Z

Hm, NaN never equals NaN; that's how you're supposed to test for NaN...

The test does not actually try to compare NaNs. It calls ScalarTraits<T>:isnaninf(). For example:

  out << "\nTesting that squareroot(NaN) == NaN! ...\n";
  {
    const Scalar sqrtNan = ST::squareroot(nan);
    result = (ST::isnaninf(sqrtNan));
    if (!result) success = false;
    out
      << "squareroot("<<nan<<") = " << sqrtNan << " == " << nan << " : "
      << passfail(result) << "\n";
  }

(These tests should be converted to use the Teuchos Unit Test Harness. But at the very least, they should use the unit test macros to show what is actually being tested.)

My guess is that the following function is not correctly flagging NaNs on this system:

template<class Scalar>
bool generic_real_isnaninf(const Scalar &x)
{
  typedef std::numeric_limits<Scalar> STD_NL;
  // IEEE says this should fail for NaN (not all compilers do not obey IEEE)
  const Scalar tol = 1.0; // Any (bounded) number should do!
  if (!(x <= tol) && !(x > tol)) return true;
  // Use fact that Inf*0 = NaN (again, all compilers do not obey IEEE)
  Scalar z = static_cast<Scalar>(0.0) * x;
  if (!(z <= tol) && !(z > tol)) return true;
  // As a last result use comparisons
  if (x == STD_NL::infinity() || x == -STD_NL::infinity()) return true;
  // We give up and assume the number is finite
  return false;
}

Looks this compiler/runtime is not doing floating point inequality correctly involving NaNs. That is a little scary. This test was taken from NOX and has worked on many compilers and platforms for a long time. It looks like this function has worked untouched for 6 years, since the commit:

commit a1704c6087e867f91027a05426096d8fc24b55b6
Author: Roscoe A. Bartlett <[email protected]>
Date:   Fri Mar 19 14:11:31 2010 -0600

    Fixing ScalarTraits::isnaninf(...) on Intel 10.1 compiler

    The isnaninf(...) test taken from NOX did not correctly flag Infs.  I
    modified the rountines for float and double to use a new generic
    template function that adds another test to compare to
    std::numeric_limits::infinity().  This test will be a little more
    expensive now but you should only use this test in production code on
    a reduced quantity like a norm.

    My guess is that the Intel 10.1 C++ compiler saw the 0.0
    multiplication and just set it to 0.0 without checking that x was
    actually NaN or Inf.
...
diff --git a/packages/teuchos/src/Teuchos_ScalarTraits.hpp b/packages/teuchos/src/Teuchos_ScalarTraits.hpp
index 05297f2..663201b 100644
--- a/packages/teuchos/src/Teuchos_ScalarTraits.hpp
+++ b/packages/teuchos/src/Teuchos_ScalarTraits.hpp
@@ -75,6 +75,23 @@ namespace Teuchos {
 void throwScalarTraitsNanInfError( const std::string &errMsg );


+template<class Scalar>
+bool generic_real_isnaninf(const Scalar &x)
+{
+  typedef std::numeric_limits<Scalar> NL;
+  // IEEE says this should fail for NaN (not all compilers do not obey IEEE)
+  const Scalar tol = 1.0; // Any (bounded) number should do!
+  if (!(x <= tol) && !(x > tol)) return true;
+  // Use fact that Inf*0 = NaN (again, all compilers do not obey IEEE)
+  Scalar z = static_cast<Scalar>(0.0) * x;
+  if (!(z <= tol) && !(z > tol)) return true;
+  // As a last result use comparisons
+  if (x == NL::infinity() || x == -NL::infinity()) return true;
+  // We give up and assume the number is finite
+  return false;
+}
+
+
 #define TEUCHOS_SCALAR_TRAITS_NAN_INF_ERR( VALUE, MSG ) \
   if (isnaninf(VALUE)) { \
     std::ostringstream omsg; \
@@ -383,11 +400,8 @@ struct ScalarTraits<float>
     return flt_nan;
 #endif
   }
-  static inline bool isnaninf(float x) { // RAB: 2004/05/28: Taken from NOX_StatusTest_FiniteValue.C
-    const float tol = 1e-6f; // Any (bounded) number should do!
-    if( !(x <= tol) && !(x > tol) ) return true;                 // IEEE says this should fail for NaN
-    float z=0.0f*x; if( !(z <= tol) && !(z > tol) ) return true;  // Use fact that Inf*0 = NaN
-    return false;
+  static inline bool isnaninf(float x) {
+    return generic_real_isnaninf<float>(x);
   }
   static inline void seedrandom(unsigned int s) { 
     std::srand(s); 
@@ -495,11 +509,8 @@ struct ScalarTraits<double>
     return dbl_nan;
 #endif
   }
-  static inline bool isnaninf(double x) { // RAB: 2004/05/28: Taken from NOX_StatusTest_FiniteValue.C
-    const double tol = 1e-6; // Any (bounded) number should do!
-    if( !(x <= tol) && !(x > tol) ) return true;                  // IEEE says this should fail for NaN
-    double z=0.0*x; if( !(z <= tol) && !(z > tol) ) return true;  // Use fact that Inf*0 = NaN
-    return false;
+  static inline bool isnaninf(double x) {
+    return generic_real_isnaninf<double>(x);
   }
   static inline void seedrandom(unsigned int s) { 
     std::srand(s); 
...

From looking at that patch, you can see that the only thing that was added was the test for INF. The test for NaN was unchanged. This NOX test for NaNs appears to go back 12 years to:

commit 2055302e84a0ae05099891ca64b59cea1a3e4f79
Author: Roscoe A. Bartlett <[email protected]>
Date:   Sat Jun 5 15:39:26 2004 +0000

    I updated scalar traits some.

    One of the major changes I made was to add a configure-time test for <limits> and std::numeric_limits<>.  I then put in an #ifdef so that Teuchos::ScalarTraits can use this standard class for the definition of certain constants inst

    For now I only updated the eps() function but I can also update some of the other functions too.

    Another change was to add support of generating NaN (the new nan() function) and a test for NAN or Inf (the new isnaninf() function).  I took the test for NaN/Inf from NOX and this seems to be a really good and portable test.  This 

    I have run the test suite for both debug and optimized versions and they check out.

diff --git a/packages/teuchos/src/Teuchos_ScalarTraits.hpp b/packages/teuchos/src/Teuchos_ScalarTraits.hpp
index d70956f..590b9a4 100644
--- a/packages/teuchos/src/Teuchos_ScalarTraits.hpp
+++ b/packages/teuchos/src/Teuchos_ScalarTraits.hpp
...
     //! Returns a random number (between -one() and +one()) of this scalar type.
@@ -143,13 +146,21 @@ namespace Teuchos {
     static inline std::string name() { return "int"; };
     static inline int squareroot(int x) { return (int) sqrt((double) x); };
   };
-  
+
+  extern float flt_nan;
+ 
   template<>
   struct ScalarTraits<float>
   {
     typedef float magnitudeType;
     static inline bool haveMachineParameters() { return true; };
-    static inline float eps()   { LAPACK<int, float> lp; return lp.LAMCH('E'); };
+    static inline float eps()   {
+#ifdef HAVE_NUMERIC_LIMITS
+                       return std::numeric_limits<float>::epsilon();
+#else
+                       LAPACK<int, float> lp; return lp.LAMCH('E');
+#endif
+               }
     static inline float sfmin() { LAPACK<int, float> lp; return lp.LAMCH('S'); };
     static inline float base()  { LAPACK<int, float> lp; return lp.LAMCH('B'); };
     static inline float prec()  { LAPACK<int, float> lp; return lp.LAMCH('P'); };
@@ -162,21 +173,33 @@ namespace Teuchos {
     static inline magnitudeType magnitude(float a) { return fabs(a); };    
     static inline float zero()  { return(0.0); };
     static inline float one()   { return(1.0); };    
+    static inline float nan()   { return flt_nan; };
+    static inline bool isnaninf(float x) { // RAB: 2004/05/28: Taken from NOX_StatusTest_FiniteValue.C
+                       const float tol = 1e-6; // Any (bounded) number should do!
+                       if( !(x <= tol) && !(x > tol) ) return true;                 // IEEE says this should fail for NaN
+                       float z=0.0*x; if( !(z <= tol) && !(z > tol) ) return true;  // Use fact that Inf*0 = NaN
+                       return false;
+               }
     static inline void seedrandom(unsigned int s) { srand(s); };
     static inline float random() { float rnd = (float) rand() / RAND_MAX; return (float)(-1.0 + 2.0 * rnd); };
     static inline std::string name() { return "float"; };
     static inline float squareroot(float x) { return sqrt(x); };
   };
-

This was pre-git, pre-checkin-test, etc. I think that means that in 12 years, we have yet to see a platform where this test for NaN did not work. This machine white is really a special platform!

The function isnaninf() is used in several numerical algorithms to detect major failures and is used to restart many numerical methods. It is used in line search methods for backtracking, etc. If you can't detect Inf and NaNs on a given platform, I would not trust most of the numerical algorithms. This function gets used in a lots of places:

$ cd Trilinos/
$ find . -name "*" -exec grep -nH isnaninf {} \; | wc -l
174

@nmhamster, how urgent is this? What other test failures do you see on this platform or is this the tip of the iceberg?

I suspect that many other skilled numerical programmers could look into this. Therefore, I have added the "help wanted" label.

mhoemmen · 2016-03-22T17:08:49Z

@bartlettroscoe is right -- those tests look legit. We could try splicing the C++11 / C99 functions isinf / isnan into the test temporarily on white. Teuchos can't use them unconditionally, because Teuchos must not require C++11 (Teuchos is like Kokkos, in that it has other customers besides Trilinos).

bartlettroscoe · 2016-03-22T17:15:19Z

If C++11 has standard isinf() an isnan() function, then we should use them. We can just put in an ifdef based on HAVE_TEUCHOSCORE_CXX11 to call these functions or not.

Can we give that a try?

mhoemmen · 2016-03-22T17:20:36Z

I think it makes sense to use the C++11 functions conditionally, protected by the macro as you said.

I just got back and am digging out of e-mail purgatory still. If you're in the mood, feel free to take this. Otherwise, you're welcome to assign to me, but it will take a while for me to pick this up.

nmhamster · 2016-03-22T17:37:40Z

@bartlettroscoe @mhoemmen this isn't urgent. I agreed with Mike H that I would start fielding bugs in the Advanced Architecture Test Bed bring up so that the new code efforts and SIERRA would get more robust products when they got onto the machines. I'm slowly working my way through all the configurations and tests. Remember - these are using OpenBLAS so it's possible we have an issue there, although I doubt that would affect NaN. This is also running with a fair amount of optimization enabled so I wonder if it is short circuiting something assuming values won't be NaN or INF.

bartlettroscoe · 2016-03-22T17:45:36Z

This is also running with a fair amount of optimization enabled so I wonder if it is short circuiting something assuming values won't be NaN or INF.

That could very well be the case. But that level of optimization is also a bit scarry. At some point, you can't write robust numerical algorithms if compilers/runtimes throw out too much of the (already minimal) IEEE floating point standards (e.g. if compiler/runtime does not respect '()' in expressions, then you are sunk).

nmhamster · 2016-03-22T17:47:46Z

@bartlettroscoe agreed, that's why we want to get these tests reported before code teams get to the machines. Let me know what you want to try (patch?) and I'll get it run as soon as I have SSH. Appreciate you're swamped right now.

bartlettroscoe · 2016-03-22T17:55:02Z

It is a very small change. I will give it a try soon and push if it passes with GCC 4.8.4 on hansen.

The ATTB machine 'white' which is a POWER8 with GCC 4.9.2 and CUDA 7.5 fails the unit tests for ST::isnaninf() which uses the genetic inequality test for NaNs. This may be due to heavy compiler optimizations. Therefore, we are going to try to C++11 functions std::isnan() and std::isinf(). I have ifdefed this based on C++11. Without C++11, it just uses the old generic implementaion. I have tested this with and without C++11 enabled and they both passed on the machine hansen using the GCC 4.8.4 compiler.

bartlettroscoe · 2016-03-23T16:46:45Z

I tired to push from hansen and a bunch of Trilinos tests timed out or failed. I am running the checkin-test.py script now on the ORNL machine th232 and everything is passing so far (the full PT MPI_DEBUG build and tests passed).

The ATTB machine 'white' which is a POWER8 with GCC 4.9.2 and CUDA 7.5 fails the unit tests for ST::isnaninf() which uses the genetic inequality test for NaNs. This may be due to heavy compiler optimizations. Therefore, we are going to try to C++11 functions std::isnan() and std::isinf(). I have ifdefed this based on C++11. Without C++11, it just uses the old generic implementaion. I have tested this with and without C++11 enabled and they both passed on the machine hansen using the GCC 4.8.4 compiler. Build/Test Cases Summary Enabled Packages: TeuchosCore Disabled Packages: PyTrilinos,Pliris,Claps,STK,TriKota Enabled all Forward Packages 0) MPI_DEBUG => passed: passed=1403,notpassed=0 (54.55 min) 1) SERIAL_RELEASE => passed: passed=1323,notpassed=0 (40.93 min)

bartlettroscoe · 2016-03-23T17:34:53Z

@nmhamster, I just pushed the commit:

commit 7363684de86e76d7d4a61722216cb524dc203349
Author: Roscoe A. Bartlett <[email protected]>
Date:   Tue Mar 22 18:36:55 2016 -0600

    Use std::isnan() and isinf() with C++11 (Trilinos #239)

    The ATTB machine 'white' which is a POWER8 with GCC 4.9.2 and CUDA 7.5 fails
    the unit tests for ST::isnaninf() which uses the genetic inequality test for
    NaNs.  This may be due to heavy compiler optimizations.  Therefore, we are
    going to try to C++11 functions std::isnan() and std::isinf().  I have ifdefed
    this based on C++11.  Without C++11, it just uses the old generic
    implementaion.

    I have tested this with and without C++11 enabled and they both passed on the
    machine hansen using the GCC 4.8.4 compiler.

    Build/Test Cases Summary
    Enabled Packages: TeuchosCore
    Disabled Packages: PyTrilinos,Pliris,Claps,STK,TriKota
    Enabled all Forward Packages
    0) MPI_DEBUG => passed: passed=1403,notpassed=0 (54.55 min)
    1) SERIAL_RELEASE => passed: passed=1323,notpassed=0 (40.93 min)

M       packages/teuchos/core/src/Teuchos_ScalarTraits.hpp
M       packages/teuchos/core/test/ScalarTraits/ScalarTraits_test.cpp

Please try to pull the current version of Trilinos 'master' off github with this commit and then try this again on white.

@mhoemmen, note that I also improved the unit tests so that they should be more clear now.

mhoemmen · 2016-03-23T18:30:39Z

@bartlettroscoe thanks!!! :-D

nmhamster · 2016-03-23T18:49:30Z

@bartlettroscoe @mhoemmen The rebuild on POWER is underway. This takes some considerable amount of time (days) for the full build but I will try get a Teuchos check ASAP.

nmhamster · 2016-03-23T18:53:29Z

Got the Teuchos only answer back much quicker than I expected - SUCCESS. Thanks guys!

$ ctest -R TeuchosCore_ScalarTraits_test_MPI_1
Test project /ascldap/users/sdhammo/git/trilinos-github-repo/build-power8-xl-long-long-ross-fix/packages/teuchos
    Start 24: TeuchosCore_ScalarTraits_test_MPI_1
1/1 Test #24: TeuchosCore_ScalarTraits_test_MPI_1 ...   Passed    0.94 sec

100% tests passed, 0 tests failed out of 1

Label Time Summary:
Teuchos    =   0.94 sec (1 test)

Total Test time (real) =   0.98 sec

bartlettroscoe · 2016-03-23T19:00:34Z

Excellent! I am closing as complete.

mhoemmen · 2016-03-23T19:44:32Z

Awesome, thanks @bartlettroscoe !!!

@nmhamster -- would you happen to know what optimization flags cause this? I'm a little scared if Inf and NaN behavior break. These are pretty tame bits of the floating-point standard.

This file should be merged to Trilinos only after pull request trilinos#239 in TriBITSPub has been propigated to Trilinos. (TriBITSPub/TriBITS#239) It merely sets the override for the sierra builds to be the c11 posix standard.

Origin repo remote tracking branch: 'github/master' Origin repo remote repo URL: 'github = [email protected]:TriBITSPub/TriBITS.git' At commit: commit 6e3d7d2be883b43cbc4dacf27e083f0283ec01b8 Author: Roscoe A. Bartlett <[email protected]> Date: Thu Nov 9 19:39:57 2017 -0700 Summary: Define <Project>_C_Standard as cache var and document (#239)

This file should be merged to Trilinos only after pull request #239 in TriBITSPub has been propigated to Trilinos. (TriBITSPub/TriBITS#239) It merely sets the override for the sierra builds to be the c11 posix standard.

nmhamster added type: bug The primary issue is a bug in Trilinos code or tests pkg: Teuchos Issues primarily dealing with the Teuchos Package labels Mar 22, 2016

nmhamster assigned bartlettroscoe Mar 22, 2016

bartlettroscoe added the type: help wanted label Mar 22, 2016

bartlettroscoe added this to the Advanced Technology Test Bed Issue milestone Mar 22, 2016

bartlettroscoe added the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Mar 23, 2016

bartlettroscoe assigned nmhamster and unassigned bartlettroscoe Mar 23, 2016

bartlettroscoe closed this as completed Mar 23, 2016

bartlettroscoe removed the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Mar 23, 2016

prwolfe mentioned this issue Nov 8, 2017

Change C standard of SIERRA builds of Trilinos from c99 to c11 #1974

Merged

trilinos-autotester mentioned this issue Nov 22, 2018

Trilinos Master Merge PR Generator: Auto PR created to promote from master_merge_20181122_020612 branch to master #3922

Closed

davoodansari mentioned this issue Jun 4, 2019

Some of the tests in Trilinos 12.8.1 are timing out at 1500 seconds #5303

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TeuchosCore_ScalarTraits_test_MPI_1 Tests Fail on POWER8 with GCC 4.9.2 and CUDA 7.5 #239

TeuchosCore_ScalarTraits_test_MPI_1 Tests Fail on POWER8 with GCC 4.9.2 and CUDA 7.5 #239

nmhamster commented Mar 22, 2016

mhoemmen commented Mar 22, 2016

nmhamster commented Mar 22, 2016

bartlettroscoe commented Mar 22, 2016

mhoemmen commented Mar 22, 2016

bartlettroscoe commented Mar 22, 2016

mhoemmen commented Mar 22, 2016

nmhamster commented Mar 22, 2016

bartlettroscoe commented Mar 22, 2016

nmhamster commented Mar 22, 2016

bartlettroscoe commented Mar 22, 2016

bartlettroscoe commented Mar 23, 2016

bartlettroscoe commented Mar 23, 2016

mhoemmen commented Mar 23, 2016

nmhamster commented Mar 23, 2016

nmhamster commented Mar 23, 2016

bartlettroscoe commented Mar 23, 2016

mhoemmen commented Mar 23, 2016

TeuchosCore_ScalarTraits_test_MPI_1 Tests Fail on POWER8 with GCC 4.9.2 and CUDA 7.5 #239

TeuchosCore_ScalarTraits_test_MPI_1 Tests Fail on POWER8 with GCC 4.9.2 and CUDA 7.5 #239

Comments

nmhamster commented Mar 22, 2016

mhoemmen commented Mar 22, 2016

nmhamster commented Mar 22, 2016

bartlettroscoe commented Mar 22, 2016

mhoemmen commented Mar 22, 2016

bartlettroscoe commented Mar 22, 2016

mhoemmen commented Mar 22, 2016

nmhamster commented Mar 22, 2016

bartlettroscoe commented Mar 22, 2016

nmhamster commented Mar 22, 2016

bartlettroscoe commented Mar 22, 2016

bartlettroscoe commented Mar 23, 2016

bartlettroscoe commented Mar 23, 2016

mhoemmen commented Mar 23, 2016

nmhamster commented Mar 23, 2016

nmhamster commented Mar 23, 2016

bartlettroscoe commented Mar 23, 2016

mhoemmen commented Mar 23, 2016