Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v3.0.0rc1: XLC-12.1 build errors w/ -q32 and atomics #3816

Closed
PHHargrove opened this issue Jul 6, 2017 · 30 comments
Closed

v3.0.0rc1: XLC-12.1 build errors w/ -q32 and atomics #3816

PHHargrove opened this issue Jul 6, 2017 · 30 comments

Comments

@PHHargrove
Copy link
Member

With xlc-12.1 on a ppc64 host, I am trying to build for ILP32, and configure as follows:

[path-to]/configure --prefix=[...] --enable-debug CC=xlc CXX=xlC FC=xlf \
         CFLAGS="-q32"  --with-wrapper-cflags="-q32"  \
         CXXFLAGS="-q32"  --with-wrapper-cxxflags="-q32"  \
         FCFLAGS="-q32"  --with-wrapper-fcflags="-q32" --disable-oshmem-fortran

This build was previously failing due to #3811, but a patch from @ggouaillardet gets me past that and on to the next problem:

libtool: compile:  xlc -DHAVE_CONFIG_H -I. -I/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1/ompi/mpi/c/profile -I../../../../opal/include -I../../../../ompi/include -I../../../../oshmem/include -I../../../../opal/mca/hwloc/hwloc1113/hwloc/include/private/autogen -I../../../../opal/mca/hwloc/hwloc1113/hwloc/include/hwloc/autogen -I../../../../ompi/mpiext/cuda/c -DOMPI_BUILD_MPI_PROFILING=1 -I/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1 -I../../../.. -I/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1/opal/include -I/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1/orte/include -I../../../../orte/include -I/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1/ompi/include -I/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1/oshmem/include -I/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/BLD/opal/mca/event/libevent2022/libevent/include -I/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1/opal/mca/event/libevent2022/libevent -I/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1/opal/mca/event/libevent2022/libevent/include -I/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/BLD/opal/mca/hwloc/hwloc1113/hwloc/include -I/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1/opal/mca/hwloc/hwloc1113/hwloc/include -D_REENTRANT -q32 -g -c paddress.c -Wp,-qmakedep=gcc,-MF.deps/paddress.TPlo  -qpic -DPIC -o .libs/paddress.o
"/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1/opal/include/opal/sys/sync_builtin/atomic.h", line 35.5: 1506-754 (S) The parameter type is not valid for a function of this linkage type.
"/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1/opal/include/opal/sys/sync_builtin/atomic.h", line 40.5: 1506-754 (S) The parameter type is not valid for a function of this linkage type.
"/home/hargrove/SCRATCH/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/openmpi-3.0.0rc1/opal/include/opal/sys/sync_builtin/atomic.h", line 45.5: 1506-754 (S) The parameter type is not valid for a function of this linkage type.

All thee errors are on lines with just __sync_synchronize().
I have attached the config.log

@ggouaillardet
Copy link
Contributor

this is really odd, since sync_synchronize() was tested in configure

out of curiosity, can you please give the inline patch below a try ?

diff --git a/opal/include/opal/sys/sync_builtin/atomic.h b/opal/include/opal/sys/sync_builtin/atomic.h
index 0f18039..b52e82f 100644
--- a/opal/include/opal/sys/sync_builtin/atomic.h
+++ b/opal/include/opal/sys/sync_builtin/atomic.h
@@ -13,6 +13,8 @@
  * Copyright (c) 2011      Sandia National Laboratories. All rights reserved.
  * Copyright (c) 2014-2016 Los Alamos National Security, LLC. All rights
  *                         reserved.
+ * Copyright (c) 2017      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -30,17 +32,17 @@
  *********************************************************************/
 #define OPAL_HAVE_ATOMIC_MEM_BARRIER 1
 
-static inline void opal_atomic_mb(void)
+static inline void opal_atomic_mb()
 {
     __sync_synchronize();
 }
 
-static inline void opal_atomic_rmb(void)
+static inline void opal_atomic_rmb()
 {
     __sync_synchronize();
 }
 
-static inline void opal_atomic_wmb(void)
+static inline void opal_atomic_wmb()
 {
     __sync_synchronize();
 }

@PHHargrove
Copy link
Member Author

I can verify that the following 4-line C program compiles fine with the same arguments to xlc other than the -o:

static inline void opal_atomic_mb(void)
{
    __sync_synchronize();
}

So, I don't think the configure check was wrong.

Even without the patch above from @ggouaillardet to remove void, I find that many files that include atomic.h are just fine.

With the patch, the one observed failure on ompi/mpi/c/profile/paddress.c is fixed.

Makes no sense to me.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jul 7, 2017
just declare inline subroutines with
static inline ... ()
in order to make some compilers (xlc 12.1) happy

Thanks Paul Hargrove for the report

Refs. open-mpi#3816

Signed-off-by: Gilles Gouaillardet <[email protected]>
@jsquyres
Copy link
Member

jsquyres commented Jul 7, 2017

@PHHargrove Just curious: is paddress.o the first time this error appears in the build process? Or is it happening in lots of places, and you just happened to cite the paddress.c compilation line in this github issue?

I ask because if it first shows up in paddress.c, that's really odd. On the one hand, it's very deep into the compilation process. The atomics have been used (and apparently successfully compiled) in a bazillion places (e.g., all of OPAL and ORTE) before this error occurs. On the other hand, paddress.c is the first file compiled in the ompi/mpi/c/profile directory. Perhaps there's something unique about the preprocessor macros in that directory that is triggering this behavior...?

@PHHargrove
Copy link
Member Author

@jsquyres
I also observed that the atomics compiled many times, before paddress.c which is the first failure.
It was not the first file compiled in ompi/mca/c/profile either:

make: Entering directory `/gpfs/mira-fs1/projects/EarlyPerf_theta/hargrove/OMPI/openmpi-3.0.0rc1-linux-ppc32-xlc-12.1/BLD/ompi/mpi/c/profile'
  CC       pabort.lo
  CC       padd_error_class.lo
  CC       padd_error_code.lo
  CC       padd_error_string.lo
  CC       paddress.lo

This is not problem with LP64 (the preferred ABI) and I really think this looks to be a compiler bug.

While 12.1 is the latest XLC on this system (the front-end to a BG/Q), it is not the latest release for big-endian PPC (which is 13.x).
So, I think this should be documented rather than applying Gilles's patch to remove void.

@bwbarrett
Copy link
Member

I agree with @PHHargrove; I'll put together a README note this afternoon.

@jsquyres
Copy link
Member

jsquyres commented Jul 7, 2017

@PHHargrove @bwbarrett 👍

@PHHargrove
Copy link
Member Author

@bwbarrett
This system/compiler has been build Open MPI RCs for years without problem.
So there is something different from previous releases.
I suspect that something like --disable-builtin-atomics might work-around the problem.
So, you might want to hold off on writing the README note until I've had a chance to try that,

@gpaulsen
Copy link
Member

gpaulsen commented Jul 7, 2017

I asked our compiler person and he said:

By looking at the code change, it seems that the failure is caused by something else.

In the C standard, there is no difference between foo() {...} and foo(void) {...} (i.e. function definitions), both mean that the function has no parameter. However, there is a difference between foo() and foo(void).

Two relevant references in the C11 standard:

"The special case of an unnamed parameter of type void as the only item in the list specifies that the function has no parameters." [133p10]

"An empty list in a function declarator that is part of a definition of that function specifies that the function has no parameters. The empty list in a function declarator that is not part of a definition of that function specifies that no information about the number or types of the parameters is supplied." [134p14]

@bwbarrett
Copy link
Member

@PHHargrove , yeah, I'd suspect it's the builtin atomics, since those are new to the 3.0.x series. Will wait for your feedback.

@gpaulsen
Copy link
Member

gpaulsen commented Jul 7, 2017

I like the README with possible workaround of disabling atomics in README.

@PHHargrove
Copy link
Member Author

@gpaulsen
Not only is there no difference by specification for the code before and after Gilles's change, but additionally the error code from XLC appears to indicate incorrect argument types to a compiler intrinsic (such as the vector extensions), which points to the call the __sync_synchronize() which did not change.

@PHHargrove
Copy link
Member Author

@bwbarrett
As suspected, disabling builtin atomics resolved the problem.
When documenting, also please note that with is only a problem w/ the ILP32 ABI (the -q32 compiler flag). A build on ppc64 w/o setting CFLAGS (etc.) is fine with the builtin atomics.

@gpaulsen
Copy link
Member

So how should we resolve this? I'd think that compiler generated atomics for -q32 case would be acceptable. Do we just need to add some configure logic to make that happen?

@jsquyres
Copy link
Member

It might be worthwhile to put in a specific configure test for this broken compiler scenario.

@PHHargrove
Copy link
Member Author

@jsquyres wrote:

It might be worthwhile to put in a specific configure test for this broken compiler scenario.

Based on the odd circumstances (only the profile build, and not until the 5th file), I doubt anything can be done at configure time other than explicitly identifying the compiler by version and the ABI from something like sizeof(void*).

@bwbarrett bwbarrett added this to the v3.0.0 milestone Jul 11, 2017
@bwbarrett bwbarrett self-assigned this Jul 13, 2017
@hjelmn
Copy link
Member

hjelmn commented Aug 1, 2017

@bwbarrett Adding the target-v3.x tag because I think we have a simple fix that can be included in v3.0.0.

@hjelmn
Copy link
Member

hjelmn commented Aug 1, 2017

Opps, wrong one.

@hjelmn
Copy link
Member

hjelmn commented Aug 1, 2017

@bwbarrett This one will be fixed in v3.0.1 when we disable the builtin atomics by default again. Are we just waiting on documenting this issue for v3.0.0?

@bwbarrett
Copy link
Member

why are we waiting for 3.0.1 to disable builtin atomics? If it's broken, we should fix it before we release...

@PHHargrove
Copy link
Member Author

@bwbarrett wrote

why are we waiting for 3.0.1 to disable builtin atomics? If it's broken, we should fix it before we release...

Because there are trolls living under that particular bridge.
See #4001 for one I just found.

@hjelmn
Copy link
Member

hjelmn commented Aug 1, 2017

@bwbarrett We discussed just this at the developer meeting (or on the phone-- can't remember). It is deemed too late in the release to change it.

@hjelmn
Copy link
Member

hjelmn commented Aug 1, 2017

#4001 would indeed need to be fixed though.....

@bwbarrett
Copy link
Member

So this is a 3.0.0 or not for 3.0.x decision; it's absolutely the wrong thing to plan on swapping something like atomics around mid-release series. I'm not even sure we should enable built-in for 3.0 and then disable it for 3.1; that feels like the wrong thing for our users.

@hjelmn
Copy link
Member

hjelmn commented Aug 1, 2017

@bwbarrett Exactly. I would be more comfortable not using the builtins. They are slower right now and we haven't determined why.

FWIW, it is a single line change to configury to go back to the v2.x default.

@hjelmn
Copy link
Member

hjelmn commented Aug 1, 2017

I think it really is time to kill the pre-built assembly. I don't know of any work done on supporting that code in the last couple of years. Killing that would help fix #4001.

@bwbarrett
Copy link
Member

now that we require perl (sigh), the pre-built stuff can die. It was a workaround for us not wanting to require perl.

@bwbarrett
Copy link
Member

@hjelmn, do you want to put together the build-ins not by default patch or should I?

@hjelmn
Copy link
Member

hjelmn commented Aug 3, 2017

@bwbarrett I will put it together.

@jjhursey
Copy link
Member

jjhursey commented Aug 16, 2017

Per issue #4053 and corresponding PR #4090 we are dropping support for the PowerPC Big Endian architecture and XL compilers older than 13.1. Once that PR is in to the release branch then we can close this issue.

@jjhursey
Copy link
Member

This has been resolved by PR #4103

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants