BlueGene Q stuff
I found this presentation very enlightening. Page 45 contains a schematic of the node-local network map:
http://www.training.prace-ri.eu/uploads/tx_pracetmo/BG-Q-_Vezolle.pdf
On JUQUEEN:
```
../configure --enable-spi --enable-qpx --enable-omp --enable-alignment=32 \
--without-bgldram --with-limedir=$YOURLIMEDIR --with-lemondir=$YOURLEMONDIR \
--enable-mpi --with-mpidimension=4 --enable-gaugecopy --enable-halfspinor \
--enable-largefile --with-lapack="-L/opt/ibmmath/lib64 \
-L/usr/local/bg_soft/lapack/3.5.0_simd/lib -lesslsmpbg -llapack -lesslsmpbg \
-lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64 -lxl -lxlopt -lxlf90_r -lxlfmath \
-L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 -lxlsmp -lpthread" \
LDFLAGS="-L/opt/ibmcmp/xlf/bg/14.1/lib64 -L/usr/local/bg_soft/lapack/3.5.0_simd/lib \
-lxl -lxlopt -lxlf90_r -lxlfmath -L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 \
-lxlsmp -lpthread -L/opt/ibmmath/lib64" CC=/bgsys/drivers/ppcfloor/comm/xl/bin/mpixlc_r F77=bgxlf_r FC=bgxlf_r
```
Note that the Fortran libraries are used only in their thread-safe '_r' versions. I'm not sure whether setting FC is necessary.
For testing purposes you might want to compile with gcc:
```
../configure --with-alignment=32 --with-limedir=/homea/hch02/hch028/lime-1.3.2/install_bgq/
--enable-omp --enable-mpi --with-mpidimension=4 --enable-gaugecopy --disable-halfspinor
--enable-largefile --with-lapack=-L/bgsys/local/lib/ -L/usr/local/bg_soft/lapack/3.3.0/lib
-lesslbg -llapack -lesslbg -lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64 -lxl -lxlopt -lxlf90_r -lxlfmath
-L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 CC=/bgsys/drivers/ppcfloor/comm/gcc/bin/mpicc
CFLAGS="-I/bgsys/drivers/ppcfloor/comm/gcc/include" F77=/bgsys/drivers/ppcfloor/comm/gcc/bin/mpif77
LDFLAGS="-L/opt/ibmcmp/xlf/bg/14.1/lib64 -L/usr/local/bg_soft/lapack/3.3.0 -lxl -lxlopt -lxlf90_r
-lxlfmath -L/bgsys/ibm_essl/prod/opt/ibmmath/lib64"
```
and as one line:
```
../configure --with-alignment=32 --with-limedir=/homea/hch02/hch028/lime-1.3.2/install_bgq/ --enable-omp --enable-mpi --with-mpidimension=4 --enable-gaugecopy --disable-halfspinor --enable-largefile --with-lapack=-L/bgsys/local/lib/ -L/usr/local/bg_soft/lapack/3.3.0/lib -lesslbg -llapack -lesslbg -lxlf90_r -L/opt/ibmcmp/xlf/bg/14.1/lib64 -lxl -lxlopt -lxlf90_r -lxlfmath -L/opt/ibmcmp/xlsmp/bg/3.1/bglib64 CC=/bgsys/drivers/ppcfloor/comm/gcc/bin/mpicc CFLAGS="-I/bgsys/drivers/ppcfloor/comm/gcc/include" F77=/bgsys/drivers/ppcfloor/comm/gcc/bin/mpif77 LDFLAGS="-L/opt/ibmcmp/xlf/bg/14.1/lib64 -L/usr/local/bg_soft/lapack/3.3.0 -lxl -lxlopt -lxlf90_r -lxlfmath -L/bgsys/ibm_essl/prod/opt/ibmmath/lib64"
```
Note that the resulting code is very slow so far.
NOTE THAT THESE ARE PROBABLY UNSAFE, STICK TO THE ONES DEFINED IN configure.in!
Currently the best optimization results seem to come from "--with-alignment=32" and these OPTARGS:
```
-O5 -qstrict -qprefetch=aggressive -qarch=qp -qtune=qp -qmaxmem=-1 -qsimd=noauto -qsmp
```
although, funnily enough, -O0 gives the same performance (while -On, n < 5, is slower). It has not yet been analyzed whether -qnostrict leads to numerical problems, but the performance difference is minor. Note that the compiler is `bgxlc` or `mpixlc` on JUQUEEN; with only `xlc` you'll get code for the frontend node. Further, to compile the hybrid or OpenMP version it is important to use the thread-safe version of the compiler, e.g. `bgxlc_r` and `mpixlc_r`.

The following LoadLeveler job command file runs the benchmark executable on 64 nodes (bg_size = 64) with 64 MPI ranks per node:
```
# @ job_name = BGQ_benchmark_etmc_4096
# @ error = $(job_name).$(jobid).out
# @ output = $(job_name).$(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 00:15:00
# @ notification = always
# @ notify_user = [email protected]
# @ job_type = bluegene
# @ bg_connectivity = TORUS
# @ bg_size = 64
# @ queue
export NAME=BGQ_benchmark_etmc_4096
if [[ ! -d ${WORK}/${NAME} ]]
then
mkdir -p ${WORK}/${NAME}
fi
cd ${WORK}/${NAME}
cp /homea/hch02/hch028/tmLQCD/build_bgq_etmc/benchmark.input ${WORK}/${NAME}/benchmark.input
runjob --ranks-per-node 64 --np 4096 --cwd ${WORK}/${NAME} --exe ${HOME}/tmLQCD/build_bgq_etmc/benchmark
```
A more general example job script, here with bg_size = 1024, one MPI rank per node and 64 OpenMP threads per rank, including the selection of a suitable mapping onto the torus network:

```
# @ job_name = examplejob
# @ error = $(job_name).$(jobid).out
# @ output = $(job_name).$(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 00:30:00
# @ notification = always
# @ notify_user = [email protected]
# @ job_type = bluegene
# @ bg_connectivity = TORUS
# @ bg_size = 1024
# @ queue
export NP=$LOADL_BG_SIZE
export NPN=1
export OMP_NUM_THREADS=64
export NAME=${LOADL_JOB_NAME}
export RUNDIR=${NAME}
# working directory
export WDIR=${WORK}/${RUNDIR}
# executable
export EXEC=${HOME}/tmlqcd_binaries/4D_LEMON/hmc_tm
# the output file
export OFILE=${LOADL_STEP_OUT}
echo "output file: ${OFILE}"
if [ ! -d ${WDIR} ]
then
mkdir -p ${WDIR}
fi
cd ${WDIR}
# determine optimal mapping to network topology
echo loadl shape is $LOADL_BG_SHAPE
# default mapping for bg_size=32 and bg_size=512
export MP=EABCDT
# different mappings for different machine partitions, the configuration of processes is given
# in parentheses, so for bg_size=1024, for example, we would have 16 processes in T, and 4 processes
# each in X, Y and Z directions
case ${LOADL_BG_SHAPE} in
# mappings for BG_SIZE=1024 (16x4x4x4)
2x1x1x1 )
MP=EABCDT
;;
1x2x1x1 )
MP=EBACDT
;;
1x1x2x1 )
MP=ECABDT
;;
1x1x1x2 )
MP=EDABCT
;;
# mappings for BG_SIZE=64 (8x2x2x2)
2x2x2x4x2 )
MP=EDABCT
;;
2x2x4x2x2 )
MP=ECABDT
;;
2x4x2x2x2 )
MP=EBACDT
;;
4x2x2x2x2 )
MP=EABCDT
;;
# mappings for bg_size=128 (16x2x2x2)
2x2x4x4x2 )
MP=CDABET
;;
2x4x2x4x2 )
MP=BDACET
;;
4x2x2x4x2 )
MP=ADBCET
;;
2x4x4x2x2 )
MP=BCADET
;;
4x2x4x2x2 )
MP=ACBDET
;;
4x4x2x2x2 )
MP=ABCDET
;;
# mappings for bg_size=256 (4x4x4x4)
2x4x4x4x2 )
MP=BCDAET
;;
4x4x2x4x2 )
MP=ABDCET
;;
4x2x4x4x2 )
MP=ACDBET
;;
4x4x4x2x2 )
MP=ABCDET
;;
esac
echo mapping is ${MP}
runjob --mapping ${MP} --envs "MUSPI_NUMINJFIFOS=8" --envs "MUSPI_NUMRECFIFOS=8" --envs "MUSPI_NUMBATIDS=2" --ranks-per-node ${NPN} --np ${NP} --cwd ${WDIR} --exe ${EXEC}
```
The `runjob` option `--mapping $MAP` with default "ABCDET" might need tuning.
This case statement provides appropriate mappings for 4D parallelization with one process per node and 64 threads per process. The process geometry that should be set in the tmLQCD input file is given in parentheses above each section. The case statement should be placed before the runjob call in the job command file (job script).
```
## determine optimal mapping to network topology
echo loadl shape is $LOADL_BG_SHAPE
# default mapping for BG_SIZE=512 and 32
export MP=EABCDT
case ${LOADL_BG_SHAPE} in
# mappings for BG_SIZE=1024 (16x4x4x4)
2x1x1x1 )
MP=EABCDT
;;
1x2x1x1 )
MP=EBACDT
;;
1x1x2x1 )
MP=ECABDT
;;
1x1x1x2 )
MP=EDABCT
;;
# mappings for BG_SIZE=64 (8x2x2x2)
2x2x2x4x2 )
MP=EDABCT
;;
2x2x4x2x2 )
MP=ECABDT
;;
2x4x2x2x2 )
MP=EBACDT
;;
4x2x2x2x2 )
MP=EABCDT
;;
# mappings for bg_size=128 (16x2x2x2)
2x2x4x4x2 )
MP=CDABET
;;
2x4x2x4x2 )
MP=BDACET
;;
4x2x2x4x2 )
MP=ADBCET
;;
2x4x4x2x2 )
MP=BCADET
;;
4x2x4x2x2 )
MP=ACBDET
;;
4x4x2x2x2 )
MP=ABCDET
;;
# mappings for bg_size=256 (4x4x4x4)
2x4x4x4x2 )
MP=BCDAET
;;
4x4x2x4x2 )
MP=ABDCET
;;
4x2x4x4x2 )
MP=ACDBET
;;
4x4x4x2x2 )
MP=ABCDET
;;
esac
```
Initial Scalasca runs of the benchmark executable seem to suggest that the MPI and OpenMP overheads on BG/Q are much lower than on any other architecture so far. Of course these are relative measurements and the indication could well change with improvements to single-thread performance.
Note that Scalasca does not understand the conditional loop headers in my omp-final branch. Before trying to use Scalasca they need to be removed in Hopping_Matrix (3 or 7, depending on halfspinor usage), clover.c and clover_leaf.c.
Here are some of the compiler intrinsic functions that we will probably need for optimising the Dirac operator for the Blue Gene/Q (needs the xlc compiler); the corresponding old BG/P intrinsics are given where applicable:
Add, subtract, multiply:

```c
d = vec_add(a, b);   // former __fpadd
d = vec_sub(a, b);   // former __fpsub
d = vec_mul(a, b);   // former __fpmul
```
Multiply-add operations:

```c
d = vec_madd(a, b, c);    // d = a*b + c     // former __fpmadd
d = vec_msub(a, b, c);    // d = a*b - c     // former __fpmsub
d = vec_nmadd(a, b, c);   // d = -(a*b + c)  // former __fpnmadd
d = vec_nmsub(a, b, c);   // d = -(a*b - c)  // former __fpnmsub
```
Load and store:

```c
d = vec_ld(a, b);    // loads a full vector4double
d = vec_lda(a, b);
d = vec_ld2(a, b);   // former __lfpd
d = vec_lda2(a, b);
vec_st(a, b, c);
vec_st2(a, b, c);    // former __stfpd
```
Distributing:

```c
d = vec_splat(a, b);   // sets all elements of d to a[b]
d = vec_splats(a);     // sets all elements of d to the scalar a
```
Cross multiply and multiply-add operations:

```c
d = vec_xmadd(a, b, c);       // former __fxmadd
//   d[0] = (a[0] * b[0]) + c[0];
//   d[1] = (a[0] * b[1]) + c[1];
//   d[2] = (a[2] * b[2]) + c[2];
//   d[3] = (a[2] * b[3]) + c[3];

d = vec_xmul(a, b);           // former __fxpmul
//   d[0] = (a[0] * b[0]);
//   d[1] = (a[0] * b[1]);
//   d[2] = (a[2] * b[2]);
//   d[3] = (a[2] * b[3]);

d = vec_xxcpnmadd(a, b, c);
//   d[0] = +( a[1] * b[1] ) + c[0];
//   d[1] = -( a[0] * b[1] ) + c[1];
//   d[2] = +( a[3] * b[3] ) + c[2];
//   d[3] = -( a[2] * b[3] ) + c[3];

d = vec_xxmadd(a, b, c);
//   d[0] = ( a[1] * b[1] ) + c[0];
//   d[1] = ( a[0] * b[1] ) + c[1];
//   d[2] = ( a[3] * b[3] ) + c[2];
//   d[3] = ( a[2] * b[3] ) + c[3];

d = vec_xxnpmadd(a, b, c);
//   d[0] = -( a[1] * b[1] ) + c[0];
//   d[1] = +( a[0] * b[1] ) + c[1];
//   d[2] = -( a[3] * b[3] ) + c[2];
//   d[3] = +( a[2] * b[3] ) + c[3];
```
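As a sanity check, the element-wise formulas above can be mirrored in plain C. This is a minimal sketch (my own helper, not part of tmLQCD; `v4d` is a hypothetical scalar stand-in for `vector4double`) that can be used off the machine to test the complex-arithmetic recipes further below:

```c
typedef struct { double v[4]; } v4d;   /* hypothetical stand-in for vector4double */

/* reference model of vec_xmul, following the formulas above */
static v4d ref_xmul(v4d a, v4d b) {
  v4d d;
  d.v[0] = a.v[0] * b.v[0];
  d.v[1] = a.v[0] * b.v[1];
  d.v[2] = a.v[2] * b.v[2];
  d.v[3] = a.v[2] * b.v[3];
  return d;
}

/* reference model of vec_xxnpmadd, following the formulas above */
static v4d ref_xxnpmadd(v4d a, v4d b, v4d c) {
  v4d d;
  d.v[0] = -(a.v[1] * b.v[1]) + c.v[0];
  d.v[1] =  (a.v[0] * b.v[1]) + c.v[1];
  d.v[2] = -(a.v[3] * b.v[3]) + c.v[2];
  d.v[3] =  (a.v[2] * b.v[3]) + c.v[3];
  return d;
}
```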
The cores have 32 registers of 256 bits each, and no other floating-point registers. You need to use the `__align(32)` statement for pointers, the new data type is `vector4double`, and possibly `#pragma disjoint ...` is needed in order to allow the vector unit to work. `vector4double` can be used like a native C type, so you may write `a*b`, `a+b`, ...
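For illustration, here is a minimal sketch (my own example, not tmLQCD code; it assumes 32-byte aligned arrays and a length that is a multiple of 4) combining the native-type arithmetic with the load/store intrinsics listed above:

```c
/* z[i] = x[i]*y[i] + x[i], four doubles at a time; x, y, z are assumed
 * to be 32-byte aligned and n a multiple of 4 (xlc with -qarch=qp). */
void fused_mul_add(double *z, double *x, double *y, int n)
{
  int i;
  for (i = 0; i < n; i += 4) {
    vector4double a = vec_ld(0, &x[i]);   /* load 4 doubles each */
    vector4double b = vec_ld(0, &y[i]);
    vector4double c = a * b + a;          /* vector4double used like a native C type */
    vec_st(c, 0, &z[i]);
  }
}
```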
Here is sample code for performing a complex multiplication `d = a*b` using the new vector instructions, but only half of the 4 doubles:
```c
#include <stdio.h>
#include <complex.h>   /* QPX intrinsics need the xlc compiler with -qarch=qp */

int main(void)
{
  complex double ca, cb, cc;
  vector4double a, b, c, d;

  cb = 0.5 + I;
  cc = 3. + 0.1*I;
  ca = cb*cc;                        /* reference result with native complex arithmetic */
  printf("%e %e\n", creal(ca), cimag(ca));

  a = vec_ld2(0, (double*) &cb);     /* real and imaginary part into elements 0,1 */
  b = vec_ld2(0, (double*) &cc);
  c = vec_xmul(a, b);
  d = vec_xxnpmadd(b, a, c);
  vec_st2(d, 0, (double*) &ca);
  printf("vec %e %e\n", creal(ca), cimag(ca));

  return 0;
}
```
or alternatively (more symmetrical to the following `d = a*conj(b)` case):
```c
a = vec_ld2(0, (double*) &cb);
b = vec_ld2(0, (double*) &cc);
c = vec_xmul(b, a);
d = vec_xxnpmadd(a, b, c);
vec_st2(d, 0, (double*) &ca);
```
and for `d = a*conj(b)`:
```c
a = vec_ld2(0, (double*) &cb);
b = vec_ld2(0, (double*) &cc);
c = vec_xmul(b, a);
d = vec_xxcpnmadd(a, b, c);
vec_st2(d, 0, (double*) &ca);
```
and for `d = conj(a)*b`:
```c
a = vec_ld2(0, (double*) &cb);
b = vec_ld2(0, (double*) &cc);
c = vec_xmul(a, b);
d = vec_xxcpnmadd(b, a, c);
vec_st2(d, 0, (double*) &ca);
```
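If these patterns are needed in several places, they can be wrapped in small inline helpers. The following is a minimal sketch (the function names are my own, not tmLQCD's) that just repackages the three recipes above:

```c
/* d = a*b, per complex pair held in elements (0,1) and (2,3) */
static inline vector4double qpx_cmul(vector4double a, vector4double b)
{
  return vec_xxnpmadd(b, a, vec_xmul(a, b));
}

/* d = a*conj(b) */
static inline vector4double qpx_cmul_conj(vector4double a, vector4double b)
{
  return vec_xxcpnmadd(a, b, vec_xmul(b, a));
}

/* d = conj(a)*b */
static inline vector4double qpx_conj_cmul(vector4double a, vector4double b)
{
  return vec_xxcpnmadd(b, a, vec_xmul(a, b));
}
```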
I have coded `scalar_prod_r` with QPX intrinsics, and at least there the ratio of intrinsics to `-O5 -qsimd=auto` is 0.75/9, so quite encouraging. It turns out that `__prefetch_by_load(addr);` also speeds up the code again. I didn't do a careful and systematic study depending on the problem size etc. yet.
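This is not the actual tmLQCD `scalar_prod_r`, but a minimal sketch (my own; it assumes 32-byte aligned input, a length that is a multiple of 4, and a fixed prefetch distance) of how the QPX intrinsics and `__prefetch_by_load` can be combined for a real scalar product:

```c
double qpx_scalar_prod_r(double *x, double *y, int n)
{
  __align(32) double partial[4];            /* aligned buffer for the final reduction */
  vector4double acc = vec_splats(0.0);      /* all four accumulator lanes = 0 */
  int i;
  for (i = 0; i < n; i += 4) {
    __prefetch_by_load(&x[i + 32]);         /* prefetch a few iterations ahead
                                               (may touch addresses past the arrays) */
    __prefetch_by_load(&y[i + 32]);
    vector4double a = vec_ld(0, &x[i]);     /* load 4 doubles each */
    vector4double b = vec_ld(0, &y[i]);
    acc = vec_madd(a, b, acc);              /* acc += a*b, element-wise */
  }
  vec_st(acc, 0, partial);                  /* reduce the four partial sums */
  return partial[0] + partial[1] + partial[2] + partial[3];
}
```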
- Thread-level speculation with the compiler flag `-qsmp=speculative`: `#pragma speculative for` and `#pragma speculative section`. Tries to optimise overhead and conflict probability. Useful for us? (A small sketch follows after this list.)
- Transactional memory with `-qtm`, useful for us?
  - I'm not sure actually. I think to a certain extent we have very good data separation between threads and the "transparent locking" provided by transactional memory might be more of an overhead than not. I guess in principle the usefulness can be evaluated by a) benchmarks b) looking at how often memory is actually being locked and unlocked. Forcing `-qnotm` does not seem to make a difference in the benchmark executable, but of course this could be due to the current low efficiency.
- List-based prefetching, via a new library... useful?
- L2 atomic operations (very efficient):
  - LoadClear
  - LoadIncrement, ...
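As a minimal sketch of the thread-level speculation pragma mentioned in the first item (my own toy example, compiled with `-qsmp=speculative`; whether this pays off for our kernels is exactly the open question):

```c
/* Iterations may conflict through idx[]; with -qsmp=speculative the runtime
 * executes them speculatively in parallel and rolls back on a detected conflict. */
void scatter_add(double *out, const int *idx, const double *in, int n)
{
  int i;
  #pragma speculative for
  for (i = 0; i < n; i++) {
    out[idx[i]] += in[i];
  }
}
```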
Some potentially relevant PAMID environment variables:

`PAMID_ASYNC_PROGRESS`
This variable determines whether one or more communications threads are started to make asynchronous progress. This variable is required for maximum performance in message throughput cases, but it requires `MPI_Init_thread` to be called instead of `MPI_Init`.
`PAMID_THREAD_MULTIPLE`
Specifies the messaging execution environment. It specifically selects whether there can be multiple independent communications occurring in parallel, driven by internal communications threads.
`PAMID_CONTEXT_POST`
This variable must be enabled to allow parallelism of multiple contexts. It might increase latency. Enabling this variable is the only method to allow parallelism between contexts.
`PAMID_CONTEXT_MAX`
This variable sets the maximum allowable number of contexts. Contexts are a method of dividing hardware resources among a Parallel Active Messaging Interface (PAMI) client (for example, MPI) to set how many parallel operations can occur at one time. Contexts are similar to channels in a communications system. The practical maximum is usually 64 contexts per node.
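Since `PAMID_ASYNC_PROGRESS` requires `MPI_Init_thread` instead of `MPI_Init`, the initialisation would look roughly like this (a minimal sketch; requesting `MPI_THREAD_MULTIPLE` is my assumption, not something mandated by the text above):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int provided;
  /* request full thread support instead of calling MPI_Init() */
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  if (provided < MPI_THREAD_MULTIPLE)
    fprintf(stderr, "warning: only thread level %d provided\n", provided);

  /* ... communication as usual ... */

  MPI_Finalize();
  return 0;
}
```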