
I found this presentation very enlightening. Page 45 contains a schematic of the node-local network map:

http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf

Configuration

XLC

On JUQUEEN:

../configure --enable-spi --enable-qpx --enable-omp --enable-alignment=32 \
--without-bgldram --with-limedir=$YOURLIMEDIR --with-lemondir=$YOURLEMONDIR \
--enable-mpi --with-mpidimension=4 --enable-gaugecopy --enable-halfspinor \
--enable-largefile --with-lapack="-L/opt/ibmmath/lib64 \
-L/usr/local/bg_soft/lapack/3.5.0_simd/lib -lesslsmpbg -llapack -lesslsmpbg \
-lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64 -lxl -lxlopt -lxlf90_r -lxlfmath \
 -L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 -lxlsmp -lpthread" \
CC=/bgsys/drivers/ppcfloor/comm/xl/bin/mpixlc_r F77=bgxlf_r \
LDFLAGS="-L/opt/ibmcmp/xlf/bg/14.1/lib64 -L/usr/local/bg_soft/lapack/3.5.0_simd/lib \
-lxl -lxlopt -lxlf90_r -lxlfmath -L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 \
-lxlsmp -lpthread -L/opt/ibmmath/lib64" FC=bgxlf_r

Note that the Fortran libraries are used only in their thread-safe '_r' versions. I'm not sure whether setting FC is necessary.

GCC

For testing purposes you might want to compile with gcc:

../configure --with-alignment=32 --with-limedir=/homea/hch02/hch028/lime-1.3.2/install_bgq/ \
--enable-omp --enable-mpi --with-mpidimension=4 --enable-gaugecopy --disable-halfspinor \
--enable-largefile --with-lapack="-L/bgsys/local/lib/ -L/usr/local/bg_soft/lapack/3.3.0/lib \
-lesslbg -llapack -lesslbg -lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64 -lxl -lxlopt -lxlf90_r -lxlfmath \
-L/opt/ibmcmp/xlsmp/bg/3.1/bglib64" CC=/bgsys/drivers/ppcfloor/comm/gcc/bin/mpicc \
CFLAGS="-I/bgsys/drivers/ppcfloor/comm/gcc/include" F77=/bgsys/drivers/ppcfloor/comm/gcc/bin/mpif77 \
LDFLAGS="-L/opt/ibmcmp/xlf/bg/14.1/lib64 -L/usr/local/bg_soft/lapack/3.3.0 -lxl -lxlopt -lxlf90_r \
-lxlfmath -L/bgsys/ibm_essl/prod/opt/ibmmath/lib64"

and as one line:

../configure --with-alignment=32 --with-limedir=/homea/hch02/hch028/lime-1.3.2/install_bgq/ --enable-omp --enable-mpi --with-mpidimension=4 --enable-gaugecopy --disable-halfspinor --enable-largefile --with-lapack="-L/bgsys/local/lib/ -L/usr/local/bg_soft/lapack/3.3.0/lib -lesslbg -llapack -lesslbg -lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64 -lxl -lxlopt -lxlf90_r -lxlfmath -L/opt/ibmcmp/xlsmp/bg/3.1/bglib64" CC=/bgsys/drivers/ppcfloor/comm/gcc/bin/mpicc CFLAGS="-I/bgsys/drivers/ppcfloor/comm/gcc/include" F77=/bgsys/drivers/ppcfloor/comm/gcc/bin/mpif77 LDFLAGS="-L/opt/ibmcmp/xlf/bg/14.1/lib64 -L/usr/local/bg_soft/lapack/3.3.0 -lxl -lxlopt -lxlf90_r -lxlfmath -L/bgsys/ibm_essl/prod/opt/ibmmath/lib64"

Note that the resulting code is very slow so far.

Optimization

NOTE THAT THESE ARE PROBABLY UNSAFE, STICK TO THE ONES DEFINED IN configure.in!

Currently the best optimization results seem to come from "--with-alignment=32" and these OPTARGS:

-O5 -qstrict -qprefetch=aggressive -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qsmp

although, funnily enough, -O0 gives the same performance (while -On for n < 5 is slower). It has not yet been analyzed whether -qnostrict leads to numerical problems, but the performance difference is minor. Note that the compiler is bgxlc or mpixlc on JUQUEEN; with plain xlc you will get code for the front-end node. Further, to compile the hybrid or OpenMP version it is important to use the thread-safe version of the compiler, i.e. bgxlc_r or mpixlc_r.

Jobscript example

# @ job_name         = BGQ_benchmark_etmc_4096
# @ error            = $(job_name).$(jobid).out
# @ output           = $(job_name).$(jobid).out
# @ environment      = COPY_ALL;
# @ wall_clock_limit = 00:15:00
# @ notification     = always
# @ notify_user      = [email protected]
# @ job_type         = bluegene
# @ bg_connectivity  = TORUS
# @ bg_size          = 64
# @ queue

export NAME=BGQ_benchmark_etmc_4096

if [[ ! -d ${WORK}/${NAME} ]]
then
  mkdir -p ${WORK}/${NAME}
fi

cd ${WORK}/${NAME}

cp /homea/hch02/hch028/tmLQCD/build_bgq_etmc/benchmark.input ${WORK}/${NAME}/benchmark.input

runjob --ranks-per-node 64 --np 4096 --cwd ${WORK}/${NAME} --exe ${HOME}/tmLQCD/build_bgq_etmc/benchmark

More involved jobscript

# @ job_name         = examplejob
# @ error            = $(job_name).$(jobid).out
# @ output           = $(job_name).$(jobid).out
# @ environment      = COPY_ALL;
# @ wall_clock_limit = 00:30:00
# @ notification     = always
# @ notify_user      = [email protected]
# @ job_type         = bluegene
# @ bg_connectivity  = TORUS
# @ bg_size          = 1024
# @ queue

export NP=$LOADL_BG_SIZE
export NPN=1
export OMP_NUM_THREADS=64
export NAME=${LOADL_JOB_NAME}
export RUNDIR=${NAME}

# working directory
export WDIR=${WORK}/${RUNDIR}

# executable
export EXEC=${HOME}/tmlqcd_binaries/4D_LEMON/hmc_tm

# the output file
export OFILE=${LOADL_STEP_OUT}
echo "output file: ${OFILE}" 

if [ ! -d ${WDIR} ]
then
  mkdir -p ${WDIR}
fi

cd ${WDIR}

# determine optimal mapping to network topology
echo loadl shape is $LOADL_BG_SHAPE
# default mapping for bg_size=32 and bg_size=512 
export MP=EABCDT

# different mappings for different machine partitions, the configuration of processes is given
# in parentheses, so for bg_size=1024, for example, we would have 16 processes in T, and 4 processes
# each in X, Y and Z directions

case ${LOADL_BG_SHAPE} in
  # mappings for BG_SIZE=1024 (16x4x4x4)
  2x1x1x1 )
    MP=EABCDT 
  ;;
  1x2x1x1 )
    MP=EBACDT
  ;;    
  1x1x2x1 )
    MP=ECABDT
  ;;    
  1x1x1x2 )
    MP=EDABCT
  ;;
  # mappings for BG_SIZE=64 (8x2x2x2)
  2x2x2x4x2 )
    MP=EDABCT
  ;;
  2x2x4x2x2 )
    MP=ECABDT
  ;;
  2x4x2x2x2 )
    MP=EBACDT
  ;;
  4x2x2x2x2 )
    MP=EABCDT
  ;;
  # mappings for bg_size=128 (16x2x2x2)
  2x2x4x4x2 )
    MP=CDABET
  ;;
  2x4x2x4x2 )
    MP=BDACET
  ;;
  4x2x2x4x2 )
    MP=ADBCET
  ;;
  2x4x4x2x2 )
    MP=BCADET
  ;;
  4x2x4x2x2 )
    MP=ACBDET
  ;;
  4x4x2x2x2 )
    MP=ABCDET
  ;;
  # mappings for bg_size=256 (4x4x4x4)
  2x4x4x4x2 )
    MP=BCDAET
  ;;
  4x4x2x4x2 )
    MP=ABDCET
  ;;
  4x2x4x4x2 )
    MP=ACDBET
  ;;
  4x4x4x2x2 )
    MP=ABCDET
  ;;
esac

echo mapping is ${MP}

runjob --mapping ${MP} --envs "MUSPI_NUMINJFIFOS=8" --envs "MUSPI_NUMRECFIFOS=8" --envs "MUSPI_NUMBATIDS=2" --ranks-per-node ${NPN} --np ${NP} --cwd ${WDIR} --exe ${EXEC}

The runjob option --mapping (here ${MP}, default "ABCDET") might need tuning.

Mappings

This case statement provides appropriate mappings for 4D parallelization with one process per node and 64 threads per process. The process geometry that should be set in the tmLQCD input file is given in parentheses above each section. The case statement should be placed before the runjob call in the job command file (job script).

## determine optimal mapping to network topology
echo loadl shape is $LOADL_BG_SHAPE

# default mapping for BG_SIZE=512 and 32
export MP=EABCDT

case ${LOADL_BG_SHAPE} in
  # mappings for BG_SIZE=1024 (16x4x4x4)
  2x1x1x1 )
    MP=EABCDT 
  ;;
  1x2x1x1 )
    MP=EBACDT
  ;;    
  1x1x2x1 )
    MP=ECABDT
  ;;    
  1x1x1x2 )
    MP=EDABCT
  ;;
  # mappings for BG_SIZE=64 (8x2x2x2)
  2x2x2x4x2 )
    MP=EDABCT
  ;;
  2x2x4x2x2 )
    MP=ECABDT
  ;;
  2x4x2x2x2 )
    MP=EBACDT
  ;;
  4x2x2x2x2 )
    MP=EABCDT
  ;;
  # mappings for bg_size=128 (16x2x2x2)
  2x2x4x4x2 )
    MP=CDABET
  ;;
  2x4x2x4x2 )
    MP=BDACET
  ;;
  4x2x2x4x2 )
    MP=ADBCET
  ;;
  2x4x4x2x2 )
    MP=BCADET
  ;;
  4x2x4x2x2 )
    MP=ACBDET
  ;;
  4x4x2x2x2 )
    MP=ABCDET
  ;;
  # mappings for bg_size=256 (4x4x4x4)
  2x4x4x4x2 )
    MP=BCDAET
  ;;
  4x4x2x4x2 )
    MP=ABDCET
  ;;
  4x2x4x4x2 )
    MP=ACDBET
  ;;
  4x4x4x2x2 )
    MP=ABCDET
  ;;
esac

Scalasca

Initial Scalasca runs of the benchmark executable seem to suggest that the MPI and OpenMP overheads on BG/Q are much lower than on any other architecture tested so far. Of course these are relative measurements, and the picture could well change with improvements to single-thread performance.

Note that Scalasca does not understand the conditional loop headers in my omp-final branch. Before trying to use Scalasca, they need to be removed in Hopping_Matrix (3 or 7 of them, depending on halfspinor usage), clover.c and clover_leaf.c.

QPX intrinsics

Here are some of the compiler intrinsic functions that we will probably need for optimising the Dirac operator for the Blue Gene/Q (they require the XL C compiler); the corresponding old BG/P intrinsics are noted where applicable:

add, subtract, multiply:

d = vec_add(a, b); // former __fpadd
d = vec_sub(a, b); // former __fpsub
d = vec_mul(a, b); // former __fpmul

multiply-add operations:

d = vec_madd(a, b, c); // d = a*b + c // former __fpmadd
d = vec_msub(a, b, c); // d = a*b - c // former __fpmsub
d = vec_nmadd(a, b, c); // d = -(a*b + c) // former __fpnmadd
d = vec_nmsub(a, b, c); // d = -(a*b - c) // former __fpnmsub

load and store:

d = vec_ld(a, b); // loads a full vector4double (4 doubles)
d = vec_lda(a, b);
d = vec_ld2(a, b); // loads two doubles, former __lfpd
d = vec_lda2(a, b);
vec_st(a, b, c); // stores a full vector4double
vec_st2(a, b, c); // stores two doubles, former __stfpd

distributing:

d = vec_splat(a, b);
d = vec_splats(a);    

where vec_splat(a, b) sets all elements of d to a[b], and vec_splats(a) sets all elements of d to the scalar a.
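
To make the splat semantics concrete, here is a minimal sketch (not from tmLQCD; it assumes the XL C QPX built-ins vec_ld/vec_st behave as listed above):

#include <stdio.h>

int main(void)
{
  /* sketch only: requires the IBM XL C compiler on BG/Q (vector4double, vec_*) */
  double in[4]  __attribute__((aligned(32))) = {1.0, 2.0, 3.0, 4.0};
  double out[4] __attribute__((aligned(32)));

  vector4double a  = vec_ld(0, in);
  vector4double d1 = vec_splat(a, 2); /* every slot becomes a[2] = 3.0 */
  vector4double d2 = vec_splats(5.0); /* every slot becomes the scalar 5.0 */

  vec_st(d1, 0, out);
  printf("splat:  %g %g %g %g\n", out[0], out[1], out[2], out[3]);
  vec_st(d2, 0, out);
  printf("splats: %g %g %g %g\n", out[0], out[1], out[2], out[3]);
  return 0;
}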

cross multiply and multiply-add operations:

d = vec_xmadd(a,b,c); // former __fxmadd

which does

d[0] = (a[0] * b[0]) + c[0];
d[1] = (a[0] * b[1]) + c[1];
d[2] = (a[2] * b[2]) + c[2];
d[3] = (a[2] * b[3]) + c[3];

and

d = vec_xmul(a,b); // former __fxpmul

which does

d[0] = (a[0] * b[0]); 
d[1] = (a[0] * b[1]); 
d[2] = (a[2] * b[2]); 
d[3] = (a[2] * b[3]); 

and

d = vec_xxcpnmadd(a, b, c);

which does

d[0] = +( a[1] * b[1] ) + c[0];
d[1] = -( a[0] * b[1] ) + c[1];
d[2] = +( a[3] * b[3] ) + c[2];
d[3] = -( a[2] * b[3] ) + c[3];

and

d = vec_xxmadd(a, b, c);

which does

d[0] =  ( a[1] * b[1] ) + c[0];
d[1] =  ( a[0] * b[1] ) + c[1];
d[2] =  ( a[3] * b[3] ) + c[2];
d[3] =  ( a[2] * b[3] ) + c[3];

and

d = vec_xxnpmadd(a, b, c);

which does

d[0] =  -( a[1] * b[1] ) + c[0];
d[1] =  +( a[0] * b[1] ) + c[1];
d[2] =  -( a[3] * b[3] ) + c[2];
d[3] =  +( a[2] * b[3] ) + c[3];

Each core has 32 QPX registers of 256 bits each, and no separate scalar floating-point registers. You need to use an

__align(32)

statement for pointers, the new data type is vector4double, and possibly

#pragma disjoint ...

is needed in order to allow the vector unit to do its work. vector4double can be used like a native C type, so you may write a*b, a+b, ...
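
As an illustration of these points, here is a minimal axpy-like sketch (not from tmLQCD; it assumes XL C accepts the GCC-style aligned attribute and that n is a multiple of 4). The pointers are declared non-aliasing with #pragma disjoint and the vector4double values are combined with ordinary arithmetic operators:

/* sketch only: z = alpha*x + y with QPX, assuming 32-byte aligned pointers */
void qpx_axpy(double alpha, double *x, double *y, double *z, int n)
{
#pragma disjoint(*x, *y, *z)
  vector4double va = vec_splats(alpha);
  int i;
  for (i = 0; i < n; i += 4) {
    vector4double vx = vec_ld(0, &x[i]);
    vector4double vy = vec_ld(0, &y[i]);
    vector4double vz = va * vx + vy;  /* vector4double behaves like a native type */
    vec_st(vz, 0, &z[i]);
  }
}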

Sample Code with vector instructions

Here is sample code for performing a complex multiplication d = a*b using the new vector instructions, but using only half of the 4 doubles:

/* needs #include <complex.h> and #include <stdio.h> */
complex double ca, cb, cc;
vector4double a, b, c, d;

cb = 0.5 + I;
cc = 3. + 0.1*I;
ca = cb*cc;
printf("%e %e\n", creal(ca), cimag(ca));
a = vec_ld2(0, (double*) &cb);
b = vec_ld2(0, (double*) &cc);
c = vec_xmul(a, b);
d = vec_xxnpmadd(b, a, c);
vec_st2(d, 0, (double*) &ca);
printf("vec %e %e\n", creal(ca), cimag(ca));

or alternatively (more symmetric with the following d=a*conj(b) case):

a = vec_ld2(0, (double*) &cb);
b = vec_ld2(0, (double*) &cc);
c = vec_xmul(b, a);
d = vec_xxnpmadd(a, b, c);
vec_st2(d, 0, (double*) &ca);

and for d=a*conj(b)

a = vec_ld2(0, (double*) &cb);
b = vec_ld2(0, (double*) &cc);
c = vec_xmul(b, a);
d = vec_xxcpnmadd(a, b, c);
vec_st2(d, 0, (double*) &ca);

and for d=conj(a)*b

a = vec_ld2(0, (double*) &cb);
b = vec_ld2(0, (double*) &cc);
c = vec_xmul(a, b);
d = vec_xxcpnmadd(b, a, c);
vec_st2(d, 0, (double*) &ca);

Some Performance results

I have coded scalar_prod_r with QPX intrinsics; at least there, the ratio of the intrinsics version to -O5 -qsimd=auto is 0.75/9, which is quite encouraging. It turns out that

__prefetch_by_load(addr);

also speeds up the code further. I have not yet done a careful, systematic study as a function of problem size etc.
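
For illustration only, here is a sketch of how __prefetch_by_load might be placed inside such a reduction loop. This is not the actual scalar_prod_r code, and the look-ahead distance of 32 doubles is an untuned guess:

/* sketch only: dot-product-like reduction with software prefetch,
   assuming s and r are 32-byte aligned and n is a multiple of 4 */
double qpx_dot(double *s, double *r, int n)
{
  vector4double acc = vec_splats(0.0);
  double out[4] __attribute__((aligned(32)));
  int i;
  for (i = 0; i < n; i += 4) {
    __prefetch_by_load(&s[i + 32]); /* touch data a few vectors ahead */
    __prefetch_by_load(&r[i + 32]);
    acc = vec_madd(vec_ld(0, &s[i]), vec_ld(0, &r[i]), acc);
  }
  vec_st(acc, 0, out);
  return out[0] + out[1] + out[2] + out[3];
}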

New Features

  • thread-level speculation with the compiler flag -qsmp=speculative

    `#pragma speculative for` 
    `#pragma speculative section`
    

    The runtime tries to optimise overhead and conflict probability. Useful for us? (A minimal sketch follows after this list.)

  • transactional memory with -qtm, useful for us??

  • I'm not sure, actually. I think to a certain extent we have very good data separation between threads, and the "transparent locking" provided by transactional memory might add more overhead than it saves. In principle the usefulness could be evaluated by a) benchmarks and b) checking how often memory is actually being locked and unlocked. Forcing -qnotm does not seem to make a difference in the benchmark executable, but of course this could be due to the current low efficiency.

  • list-based prefetching, provided by a new library... useful?

  • L2 Atomic Operations (very efficient):

    • LoadClear
    • LoadIncrement,...
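
As mentioned in the first item above, here is a minimal sketch of what a speculatively parallelized loop could look like (hypothetical example; compile with -qsmp=speculative, and whether this pays off for us is exactly the open question):

/* sketch only: the runtime executes iterations speculatively in parallel
   and rolls back when a conflict through idx[] is detected */
void speculative_update(double *a, const int *idx, int n)
{
  int i;
#pragma speculative for
  for (i = 0; i < n; i++) {
    a[idx[i]] += 1.0;
  }
}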

Environment variables

PAMID_ASYNC_PROGRESS 

This variable determines whether one or more communication threads are started to make asynchronous progress. It is required for maximum performance in message-throughput-bound cases, but it requires MPI_Init_thread to be called instead of MPI_Init.
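
For reference, a minimal sketch of the corresponding initialization; requesting MPI_THREAD_MULTIPLE is an assumption here, and the application should ask for whatever thread level it actually needs:

/* sketch only: MPI_Init_thread instead of MPI_Init,
   as needed when PAMID_ASYNC_PROGRESS is enabled */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE)
    fprintf(stderr, "warning: requested thread level not available (got %d)\n", provided);
  /* ... application ... */
  MPI_Finalize();
  return 0;
}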

PAMID_THREAD_MULTIPLE 

Specifies the messaging execution environment. It specifically selects whether there can be multiple independent communications occurring in parallel, driven by internal communications threads.

PAMID_CONTEXT_POST

This variable must be enabled to allow parallelism across multiple contexts; it is the only way to do so, although it might increase latency.

PAMID_CONTEXT_MAX 

This variable sets the maximum allowable number of contexts. Contexts are a method of dividing hardware resources among a Parallel Active Messaging Interface (PAMI) client (for example, MPI) to set how many parallel operations can occur at one time. Contexts are similar to channels in a communications system. The practical maximum is usually 64 contexts per node.