Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TeuchosCore_ScalarTraits_test_MPI_1 Tests Fail on POWER8 with GCC 4.9.2 and CUDA 7.5 #239

Closed
nmhamster opened this issue Mar 22, 2016 · 17 comments
Assignees
Labels
pkg: Teuchos Issues primarily dealing with the Teuchos Package type: bug The primary issue is a bug in Trilinos code or tests type: help wanted

Comments

@nmhamster
Copy link
Contributor

@bartlettroscoe - I am getting these errors in the Trilinos bring up on POWER8 with GCC 4.9.2 and CUDA 7.5. This is being tested on the white Sandia system. Just want to check in whether these would be expected failures or not?

34: Testing: Teuchos::ScalarTraits<float> ...
34:  Type chain (ascending) : float -> double
34:  Type chain (descending): float
34:
34:  Testing that squareroot(NaN) == NaN! ...
34:  squareroot(nan) = nan == nan : failed
34:
34:  Testing that squareroot(-NaN) == NaN! ...
34:  squareroot(-nan) = -nan == nan : failed
34:
34:  Testing that squareroot(-1) == NaN! ...
34:  squareroot(-1) = nan == nan : failed
34:
34: Testing: Teuchos::ScalarTraits<double> ...
34:  Type chain (ascending) : double
34:  Type chain (descending): double -> float
34:
34:  Testing that squareroot(NaN) == NaN! ...
34:  squareroot(nan) = nan == nan : failed
34:
34:  Testing that squareroot(-NaN) == NaN! ...
34:  squareroot(-nan) = -nan == nan : failed
34:
34:  Testing that squareroot(-1) == NaN! ...
34:  squareroot(-1) = nan == nan : failed
@nmhamster nmhamster added type: bug The primary issue is a bug in Trilinos code or tests pkg: Teuchos Issues primarily dealing with the Teuchos Package labels Mar 22, 2016
@mhoemmen
Copy link
Contributor

Hm, NaN never equals NaN; that's how you're supposed to test for NaN...
C++11 would give us isnan in , but Teuchos isn't supposed to require C++11 (or C99).

@nmhamster
Copy link
Contributor Author

So is the test working? It looks like no.

@bartlettroscoe
Copy link
Member

Hm, NaN never equals NaN; that's how you're supposed to test for NaN...

The test does not actually try to compare NaNs. It calls ScalarTraits<T>:isnaninf(). For example:

  out << "\nTesting that squareroot(NaN) == NaN! ...\n";
  {
    const Scalar sqrtNan = ST::squareroot(nan);
    result = (ST::isnaninf(sqrtNan));
    if (!result) success = false;
    out
      << "squareroot("<<nan<<") = " << sqrtNan << " == " << nan << " : "
      << passfail(result) << "\n";
  }

(These tests should be converted to use the Teuchos Unit Test Harness. But at the very least, they should use the unit test macros to show what is actually being tested.)

My guess is that the following function is not correctly flagging NaNs on this system:

template<class Scalar>
bool generic_real_isnaninf(const Scalar &x)
{
  typedef std::numeric_limits<Scalar> STD_NL;
  // IEEE says this should fail for NaN (not all compilers do not obey IEEE)
  const Scalar tol = 1.0; // Any (bounded) number should do!
  if (!(x <= tol) && !(x > tol)) return true;
  // Use fact that Inf*0 = NaN (again, all compilers do not obey IEEE)
  Scalar z = static_cast<Scalar>(0.0) * x;
  if (!(z <= tol) && !(z > tol)) return true;
  // As a last result use comparisons
  if (x == STD_NL::infinity() || x == -STD_NL::infinity()) return true;
  // We give up and assume the number is finite
  return false;
}

Looks this compiler/runtime is not doing floating point inequality correctly involving NaNs. That is a little scary. This test was taken from NOX and has worked on many compilers and platforms for a long time. It looks like this function has worked untouched for 6 years, since the commit:

commit a1704c6087e867f91027a05426096d8fc24b55b6
Author: Roscoe A. Bartlett <[email protected]>
Date:   Fri Mar 19 14:11:31 2010 -0600

    Fixing ScalarTraits::isnaninf(...) on Intel 10.1 compiler

    The isnaninf(...) test taken from NOX did not correctly flag Infs.  I
    modified the rountines for float and double to use a new generic
    template function that adds another test to compare to
    std::numeric_limits::infinity().  This test will be a little more
    expensive now but you should only use this test in production code on
    a reduced quantity like a norm.

    My guess is that the Intel 10.1 C++ compiler saw the 0.0
    multiplication and just set it to 0.0 without checking that x was
    actually NaN or Inf.
...
diff --git a/packages/teuchos/src/Teuchos_ScalarTraits.hpp b/packages/teuchos/src/Teuchos_ScalarTraits.hpp
index 05297f2..663201b 100644
--- a/packages/teuchos/src/Teuchos_ScalarTraits.hpp
+++ b/packages/teuchos/src/Teuchos_ScalarTraits.hpp
@@ -75,6 +75,23 @@ namespace Teuchos {
 void throwScalarTraitsNanInfError( const std::string &errMsg );


+template<class Scalar>
+bool generic_real_isnaninf(const Scalar &x)
+{
+  typedef std::numeric_limits<Scalar> NL;
+  // IEEE says this should fail for NaN (not all compilers do not obey IEEE)
+  const Scalar tol = 1.0; // Any (bounded) number should do!
+  if (!(x <= tol) && !(x > tol)) return true;
+  // Use fact that Inf*0 = NaN (again, all compilers do not obey IEEE)
+  Scalar z = static_cast<Scalar>(0.0) * x;
+  if (!(z <= tol) && !(z > tol)) return true;
+  // As a last result use comparisons
+  if (x == NL::infinity() || x == -NL::infinity()) return true;
+  // We give up and assume the number is finite
+  return false;
+}
+
+
 #define TEUCHOS_SCALAR_TRAITS_NAN_INF_ERR( VALUE, MSG ) \
   if (isnaninf(VALUE)) { \
     std::ostringstream omsg; \
@@ -383,11 +400,8 @@ struct ScalarTraits<float>
     return flt_nan;
 #endif
   }
-  static inline bool isnaninf(float x) { // RAB: 2004/05/28: Taken from NOX_StatusTest_FiniteValue.C
-    const float tol = 1e-6f; // Any (bounded) number should do!
-    if( !(x <= tol) && !(x > tol) ) return true;                 // IEEE says this should fail for NaN
-    float z=0.0f*x; if( !(z <= tol) && !(z > tol) ) return true;  // Use fact that Inf*0 = NaN
-    return false;
+  static inline bool isnaninf(float x) {
+    return generic_real_isnaninf<float>(x);
   }
   static inline void seedrandom(unsigned int s) { 
     std::srand(s); 
@@ -495,11 +509,8 @@ struct ScalarTraits<double>
     return dbl_nan;
 #endif
   }
-  static inline bool isnaninf(double x) { // RAB: 2004/05/28: Taken from NOX_StatusTest_FiniteValue.C
-    const double tol = 1e-6; // Any (bounded) number should do!
-    if( !(x <= tol) && !(x > tol) ) return true;                  // IEEE says this should fail for NaN
-    double z=0.0*x; if( !(z <= tol) && !(z > tol) ) return true;  // Use fact that Inf*0 = NaN
-    return false;
+  static inline bool isnaninf(double x) {
+    return generic_real_isnaninf<double>(x);
   }
   static inline void seedrandom(unsigned int s) { 
     std::srand(s); 
...

From looking at that patch, you can see that the only thing that was added was the test for INF. The test for NaN was unchanged. This NOX test for NaNs appears to go back 12 years to:

commit 2055302e84a0ae05099891ca64b59cea1a3e4f79
Author: Roscoe A. Bartlett <[email protected]>
Date:   Sat Jun 5 15:39:26 2004 +0000

    I updated scalar traits some.

    One of the major changes I made was to add a configure-time test for <limits> and std::numeric_limits<>.  I then put in an #ifdef so that Teuchos::ScalarTraits can use this standard class for the definition of certain constants inst

    For now I only updated the eps() function but I can also update some of the other functions too.

    Another change was to add support of generating NaN (the new nan() function) and a test for NAN or Inf (the new isnaninf() function).  I took the test for NaN/Inf from NOX and this seems to be a really good and portable test.  This 

    I have run the test suite for both debug and optimized versions and they check out.

diff --git a/packages/teuchos/src/Teuchos_ScalarTraits.hpp b/packages/teuchos/src/Teuchos_ScalarTraits.hpp
index d70956f..590b9a4 100644
--- a/packages/teuchos/src/Teuchos_ScalarTraits.hpp
+++ b/packages/teuchos/src/Teuchos_ScalarTraits.hpp
...
     //! Returns a random number (between -one() and +one()) of this scalar type.
@@ -143,13 +146,21 @@ namespace Teuchos {
     static inline std::string name() { return "int"; };
     static inline int squareroot(int x) { return (int) sqrt((double) x); };
   };
-  
+
+  extern float flt_nan;
+ 
   template<>
   struct ScalarTraits<float>
   {
     typedef float magnitudeType;
     static inline bool haveMachineParameters() { return true; };
-    static inline float eps()   { LAPACK<int, float> lp; return lp.LAMCH('E'); };
+    static inline float eps()   {
+#ifdef HAVE_NUMERIC_LIMITS
+                       return std::numeric_limits<float>::epsilon();
+#else
+                       LAPACK<int, float> lp; return lp.LAMCH('E');
+#endif
+               }
     static inline float sfmin() { LAPACK<int, float> lp; return lp.LAMCH('S'); };
     static inline float base()  { LAPACK<int, float> lp; return lp.LAMCH('B'); };
     static inline float prec()  { LAPACK<int, float> lp; return lp.LAMCH('P'); };
@@ -162,21 +173,33 @@ namespace Teuchos {
     static inline magnitudeType magnitude(float a) { return fabs(a); };    
     static inline float zero()  { return(0.0); };
     static inline float one()   { return(1.0); };    
+    static inline float nan()   { return flt_nan; };
+    static inline bool isnaninf(float x) { // RAB: 2004/05/28: Taken from NOX_StatusTest_FiniteValue.C
+                       const float tol = 1e-6; // Any (bounded) number should do!
+                       if( !(x <= tol) && !(x > tol) ) return true;                 // IEEE says this should fail for NaN
+                       float z=0.0*x; if( !(z <= tol) && !(z > tol) ) return true;  // Use fact that Inf*0 = NaN
+                       return false;
+               }
     static inline void seedrandom(unsigned int s) { srand(s); };
     static inline float random() { float rnd = (float) rand() / RAND_MAX; return (float)(-1.0 + 2.0 * rnd); };
     static inline std::string name() { return "float"; };
     static inline float squareroot(float x) { return sqrt(x); };
   };
-  

This was pre-git, pre-checkin-test, etc. I think that means that in 12 years, we have yet to see a platform where this test for NaN did not work. This machine white is really a special platform!

The function isnaninf() is used in several numerical algorithms to detect major failures and is used to restart many numerical methods. It is used in line search methods for backtracking, etc. If you can't detect Inf and NaNs on a given platform, I would not trust most of the numerical algorithms. This function gets used in a lots of places:

$ cd Trilinos/
$ find . -name "*" -exec grep -nH isnaninf {} \; | wc -l
174

@nmhamster, how urgent is this? What other test failures do you see on this platform or is this the tip of the iceberg?

I suspect that many other skilled numerical programmers could look into this. Therefore, I have added the "help wanted" label.

@bartlettroscoe bartlettroscoe added this to the Advanced Technology Test Bed Issue milestone Mar 22, 2016
@mhoemmen
Copy link
Contributor

@bartlettroscoe is right -- those tests look legit. We could try splicing the C++11 / C99 functions isinf / isnan into the test temporarily on white. Teuchos can't use them unconditionally, because Teuchos must not require C++11 (Teuchos is like Kokkos, in that it has other customers besides Trilinos).

@bartlettroscoe
Copy link
Member

If C++11 has standard isinf() an isnan() function, then we should use them. We can just put in an ifdef based on HAVE_TEUCHOSCORE_CXX11 to call these functions or not.

Can we give that a try?

@mhoemmen
Copy link
Contributor

I think it makes sense to use the C++11 functions conditionally, protected by the macro as you said.

I just got back and am digging out of e-mail purgatory still. If you're in the mood, feel free to take this. Otherwise, you're welcome to assign to me, but it will take a while for me to pick this up.

@nmhamster
Copy link
Contributor Author

@bartlettroscoe @mhoemmen this isn't urgent. I agreed with Mike H that I would start fielding bugs in the Advanced Architecture Test Bed bring up so that the new code efforts and SIERRA would get more robust products when they got onto the machines. I'm slowly working my way through all the configurations and tests. Remember - these are using OpenBLAS so it's possible we have an issue there, although I doubt that would affect NaN. This is also running with a fair amount of optimization enabled so I wonder if it is short circuiting something assuming values won't be NaN or INF.

@bartlettroscoe
Copy link
Member

This is also running with a fair amount of optimization enabled so I wonder if it is short circuiting something assuming values won't be NaN or INF.

That could very well be the case. But that level of optimization is also a bit scarry. At some point, you can't write robust numerical algorithms if compilers/runtimes throw out too much of the (already minimal) IEEE floating point standards (e.g. if compiler/runtime does not respect '()' in expressions, then you are sunk).

@nmhamster
Copy link
Contributor Author

@bartlettroscoe agreed, that's why we want to get these tests reported before code teams get to the machines. Let me know what you want to try (patch?) and I'll get it run as soon as I have SSH. Appreciate you're swamped right now.

@bartlettroscoe
Copy link
Member

It is a very small change. I will give it a try soon and push if it passes with GCC 4.8.4 on hansen.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 23, 2016
The ATTB machine 'white' which is a POWER8 with GCC 4.9.2 and CUDA 7.5 fails
the unit tests for ST::isnaninf() which uses the genetic inequality test for
NaNs.  This may be due to heavy compiler optimizations.  Therefore, we are
going to try to C++11 functions std::isnan() and std::isinf().  I have ifdefed
this based on C++11.  Without C++11, it just uses the old generic
implementaion.

I have tested this with and without C++11 enabled and they both passed on the
machine hansen using the GCC 4.8.4 compiler.
@bartlettroscoe
Copy link
Member

I tired to push from hansen and a bunch of Trilinos tests timed out or failed. I am running the checkin-test.py script now on the ORNL machine th232 and everything is passing so far (the full PT MPI_DEBUG build and tests passed).

bartlettroscoe added a commit that referenced this issue Mar 23, 2016
The ATTB machine 'white' which is a POWER8 with GCC 4.9.2 and CUDA 7.5 fails
the unit tests for ST::isnaninf() which uses the genetic inequality test for
NaNs.  This may be due to heavy compiler optimizations.  Therefore, we are
going to try to C++11 functions std::isnan() and std::isinf().  I have ifdefed
this based on C++11.  Without C++11, it just uses the old generic
implementaion.

I have tested this with and without C++11 enabled and they both passed on the
machine hansen using the GCC 4.8.4 compiler.

Build/Test Cases Summary
Enabled Packages: TeuchosCore
Disabled Packages: PyTrilinos,Pliris,Claps,STK,TriKota
Enabled all Forward Packages
0) MPI_DEBUG => passed: passed=1403,notpassed=0 (54.55 min)
1) SERIAL_RELEASE => passed: passed=1323,notpassed=0 (40.93 min)
@bartlettroscoe
Copy link
Member

@nmhamster, I just pushed the commit:

commit 7363684de86e76d7d4a61722216cb524dc203349
Author: Roscoe A. Bartlett <[email protected]>
Date:   Tue Mar 22 18:36:55 2016 -0600

    Use std::isnan() and isinf() with C++11 (Trilinos #239)

    The ATTB machine 'white' which is a POWER8 with GCC 4.9.2 and CUDA 7.5 fails
    the unit tests for ST::isnaninf() which uses the genetic inequality test for
    NaNs.  This may be due to heavy compiler optimizations.  Therefore, we are
    going to try to C++11 functions std::isnan() and std::isinf().  I have ifdefed
    this based on C++11.  Without C++11, it just uses the old generic
    implementaion.

    I have tested this with and without C++11 enabled and they both passed on the
    machine hansen using the GCC 4.8.4 compiler.

    Build/Test Cases Summary
    Enabled Packages: TeuchosCore
    Disabled Packages: PyTrilinos,Pliris,Claps,STK,TriKota
    Enabled all Forward Packages
    0) MPI_DEBUG => passed: passed=1403,notpassed=0 (54.55 min)
    1) SERIAL_RELEASE => passed: passed=1323,notpassed=0 (40.93 min)

M       packages/teuchos/core/src/Teuchos_ScalarTraits.hpp
M       packages/teuchos/core/test/ScalarTraits/ScalarTraits_test.cpp

Please try to pull the current version of Trilinos 'master' off github with this commit and then try this again on white.

@mhoemmen, note that I also improved the unit tests so that they should be more clear now.

@bartlettroscoe bartlettroscoe added the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Mar 23, 2016
@mhoemmen
Copy link
Contributor

@bartlettroscoe thanks!!! :-D

@nmhamster
Copy link
Contributor Author

@bartlettroscoe @mhoemmen The rebuild on POWER is underway. This takes some considerable amount of time (days) for the full build but I will try get a Teuchos check ASAP.

@nmhamster
Copy link
Contributor Author

Got the Teuchos only answer back much quicker than I expected - SUCCESS. Thanks guys!

$ ctest -R TeuchosCore_ScalarTraits_test_MPI_1
Test project /ascldap/users/sdhammo/git/trilinos-github-repo/build-power8-xl-long-long-ross-fix/packages/teuchos
    Start 24: TeuchosCore_ScalarTraits_test_MPI_1
1/1 Test #24: TeuchosCore_ScalarTraits_test_MPI_1 ...   Passed    0.94 sec

100% tests passed, 0 tests failed out of 1

Label Time Summary:
Teuchos    =   0.94 sec (1 test)

Total Test time (real) =   0.98 sec

@bartlettroscoe
Copy link
Member

Excellent! I am closing as complete.

@bartlettroscoe bartlettroscoe removed the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Mar 23, 2016
@mhoemmen
Copy link
Contributor

Awesome, thanks @bartlettroscoe !!!

@nmhamster -- would you happen to know what optimization flags cause this? I'm a little scared if Inf and NaN behavior break. These are pretty tame bits of the floating-point standard.

prwolfe added a commit to prwolfe/Trilinos that referenced this issue Nov 8, 2017
This file should be merged to Trilinos only after
pull request trilinos#239 in TriBITSPub has been propigated
to Trilinos. (TriBITSPub/TriBITS#239)
It merely sets the override for the sierra builds to
be the c11 posix standard.
bartlettroscoe added a commit that referenced this issue Nov 10, 2017
Origin repo remote tracking branch: 'github/master'
Origin repo remote repo URL: 'github = [email protected]:TriBITSPub/TriBITS.git'

At commit:

commit 6e3d7d2be883b43cbc4dacf27e083f0283ec01b8
Author:  Roscoe A. Bartlett <[email protected]>
Date:    Thu Nov 9 19:39:57 2017 -0700
Summary: Define <Project>_C_Standard as cache var and document (#239)
prwolfe added a commit that referenced this issue Nov 13, 2017
This file should be merged to Trilinos only after
pull request #239 in TriBITSPub has been propigated
to Trilinos. (TriBITSPub/TriBITS#239)
It merely sets the override for the sierra builds to
be the c11 posix standard.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg: Teuchos Issues primarily dealing with the Teuchos Package type: bug The primary issue is a bug in Trilinos code or tests type: help wanted
Projects
None yet
Development

No branches or pull requests

3 participants