JSC test setup
This document describes how to set up a special test version of tmLQCD linked against QUDA and another one linked against QPhiX in order to perform benchmarks and scaling tests.
tmLQCD uses the lemon library, https://github.com/etmc/lemon for parallel I/O. In addition, the lime library is required: https://github.com/usqcd-software/c-lime.
Note that for lemon an MPI compiler is required and this should always be matched with the MPI compiler used for tmLQCD and any libraries it depends on.
Both can be compiled using the usual configure/make steps and should be installed into local directories using --prefix and make install. Note that two versions should be compiled: one using GCC for the QUDA-enabled build and one using ICC for the QPhiX-enabled build.
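As a rough sketch (the source paths and the PREFIX variable are placeholders, not part of the repositories), building lime and lemon for the GCC/MVAPICH2 stack might look as follows; the ICC versions are built analogously into a separate prefix. If building from a git checkout rather than a release tarball, the configure scripts may first have to be generated, e.g. with autoreconf.
$ PREFIX=${HOME}/local/gcc             # separate prefix for the GCC/MVAPICH2 stack
$ cd ${HOME}/src/c-lime                # lime: serial C library, plain gcc is sufficient
$ ./configure --prefix=${PREFIX} CC=gcc
$ make && make install
$ cd ${HOME}/src/lemon                 # lemon: parallel I/O, same MPI compiler as tmLQCD
$ ./configure --prefix=${PREFIX} CC=mpicc
$ make && make install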
The test configurations for these benchmarks are ETMC 2+1+1 configurations at two lattice spacings (~0.09 fm and ~0.065 fm) and have 32c64 and 48c96 lattice points respectively.
Ensemble | configurations | kappa_c | a mu |
---|---|---|---|
A30.32 | 1000, 1200, 1400, 1600, 1800 | 0.163272 | 0.0030 |
D15.48 | 100, 200, 300, 400, 500, 600 | 0.156361 | 0.0015 |
This test was set up on commit 923a8a6592a9f50b135dd8e9419c969432249a0e
of the feature/multigrid branch of http://github.com/lattice/quda.
In any case, in order to maximise QUDA performance at scale, the recommendations of https://github.com/lattice/quda/wiki/Multi-GPU-Support should be followed; the multigrid solver itself is documented at https://github.com/lattice/quda/wiki/Multigrid-Solver. A sketch of the kind of run-time settings discussed on the multi-GPU page is given below.
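Purely as an illustration (the variable names and suitable values should be checked against the wiki page and the QUDA commit actually in use, and depend on the machine), QUDA's multi-GPU behaviour is steered through environment variables such as:
$ export QUDA_ENABLE_P2P=1   # peer-to-peer copies between GPUs on a node, if supported
$ export QUDA_ENABLE_GDR=1   # GPUDirect RDMA for inter-node transfers, needs a CUDA-aware MPI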
QUDA currently cannot be compiled with the Intel compiler, so we proceed with GCC. For concreteness, these instructions are for Jureca.
$ module list
Currently Loaded Modules:
1) GCCcore/.5.5.0 (H) 7) MVAPICH2/2.3a-GDR (g)
2) binutils/.2.30 (H) 8) ncurses/.6.0 (H)
3) StdEnv (H) 9) CMake/3.11.1
4) nvidia/.driver (H) 10) Bison/.3.0.4 (H)
5) CUDA/9.1.85 (g) 11) flex/2.6.4
6) GCC/5.5.0 12) imkl/2018.2.199
Where:
g: Built for GPU
H: Hidden Module
$ git clone https://github.com/lattice/quda -b feature/multigrid
$ cd quda
$ git checkout 923a8a6592a9f50b135dd8e9419c969432249a0e
$ mkdir build
$ cd build
$ cmake \
-DCMAKE_CXX_COMPILER=/usr/local/software/jureca/Stages/2018a/software/GCCcore/5.5.0/bin/g++ \
-DCMAKE_C_COMPILER=/usr/local/software/jureca/Stages/2018a/software/GCCcore/5.5.0/bin/gcc \
-DCMAKE_CUDA_HOST_COMPILER=/usr/local/software/jureca/Stages/2018a/software/GCCcore/5.5.0/bin/g++ \
-DMPI_CXX_COMPILER=/usr/local/software/jureca/Stages/2018a/software/MVAPICH2/2.3a-GCC-5.5.0-GDR/bin/mpicxx \
-DMPI_C_COMPILER=/usr/local/software/jureca/Stages/2018a/software/MVAPICH2/2.3a-GCC-5.5.0-GDR/bin/mpicc \
-DQUDA_DIRAC_CLOVER=ON \
-DQUDA_DIRAC_DOMAIN_WALL=OFF \
-DQUDA_DIRAC_NDEG_TWISTED_MASS=ON \
-DQUDA_DIRAC_STAGGERED=OFF \
-DQUDA_DIRAC_TWISTED_CLOVER=ON \
-DQUDA_DIRAC_TWISTED_MASS=ON \
-DQUDA_DIRAC_WILSON=ON \
-DQUDA_DYNAMIC_CLOVER=OFF \
-DQUDA_MPI=ON \
-DQUDA_INTERFACE_MILC=OFF \
-DQUDA_INTERFACE_QDP=ON \
-DQUDA_MULTIGRID=ON \
-DQUDA_GPU_ARCH=sm_37 \
.. # the sources are in the parent directory
$ make -j
For QUDA_GPU_ARCH, the compute architecture of the target GPU must be specified, e.g. K80 -> sm_37, P100 (Pascal) -> sm_60, V100 (Volta) -> sm_70.
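For example, to target a machine with P100 GPUs instead of the K80s assumed above, the same build can simply be reconfigured with the corresponding architecture (all other options are taken from the cmake cache):
$ cmake -DQUDA_GPU_ARCH=sm_60 ..   # sm_70 for V100
$ make -j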
This test is based on commit b010542a9bb4e17142fa1288b91f73d630ac0400 of the jsc_benchmark branch of https://github.com/etmc/tmLQCD. BLAS/LAPACK is required for tmLQCD to compile, but for the purpose of this test it does not need to be a highly optimized implementation. On Jureca, we use imkl.
$ module list
Currently Loaded Modules:
1) GCCcore/.5.5.0 (H) 7) MVAPICH2/2.3a-GDR (g)
2) binutils/.2.30 (H) 8) ncurses/.6.0 (H)
3) StdEnv (H) 9) CMake/3.11.1
4) nvidia/.driver (H) 10) Bison/.3.0.4 (H)
5) CUDA/9.1.85 (g) 11) flex/2.6.4
6) GCC/5.5.0 12) imkl/2018.2.199
Where:
g: Built for GPU
H: Hidden Module
$ git clone https://github.com/etmc/tmLQCD tmLQCD.quda -b jsc_benchmark
$ cd tmLQCD.quda
$ autoconf # only autoconf and perhaps aclocal should be executed
# tmLQCD does not use the full autotools chain
# this generates 'configure'
$ mkdir build
$ cd build
$ ../configure --enable-halfspinor --enable-gaugecopy \
--with-lemondir=${PATH_TO_LEMON} \
--with-limedir=${PATH_TO_LIME} \
--enable-mpi --with-mpidimension=4 --enable-omp \
--disable-sse2 --disable-sse3 --enable-alignment=32 \
--with-lapack="-L${MKLROOT}/lib/intel64 -lmkl_blas95_lp64 -lmkl_avx2 -lmkl_core -lmkl_intel_thread -lmkl_intel_lp64" \
F77=gfortran CC=mpicc CFLAGS="-fopenmp -std=c99 -O3 -march=haswell -mtune=haswell" \
CXXFLAGS="-fopenmp --std=c++11 -O3 -march=haswell -mtune=haswell" \
CXX=mpicxx --with-qudadir=${PATH_TO_QUDA} --with-cudadir=${CUDA_HOME} \
LDFLAGS=-lcuda
$ make -j
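As a quick, purely optional sanity check (assuming the invert executable ends up in the top-level build directory), one can verify that the binary was indeed linked against the CUDA libraries pulled in via QUDA:
$ ldd ./invert | grep -i cuda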
If the job scheduler exports the environment automatically, running can be as simple as shown below for two Jureca nodes.
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
#SBATCH --output=out.2nodes.%j.out
#SBATCH --error=err.2nodes.%j.err
#SBATCH --partition=gpus
#SBATCH --gres=gpu:4
module load CUDA
module load GCC/5.5.0
module load MVAPICH2/2.3a-GDR
module load imkl
# QUDA needs a path to store the results of the auto-tuning step
# and we identify it via its git commit hash to keep track of the
# exact quda version that it is intended for
export QUDA_RESOURCE_PATH=~/misc/jureca/quda_resources/923a8a6592a9f50b135dd8e9419c969432249a0e
if [ ! -d ${QUDA_RESOURCE_PATH} ]; then
mkdir -p ${QUDA_RESOURCE_PATH}
fi
# we disable the device memory pool to reduce memory consumption
OMP_NUM_THREADS=6 \
MV2_USE_CUDA=1 MV2_GPUDIRECT_LIMIT=1000000000 \
CUDA_DEVICE_MAX_CONNECTIONS=1 \
QUDA_ENABLE_DEVICE_MEMORY_POOL=0 \
srun ${path_to_tmlqcd}/invert -v -f invert.A30.32.quda.input
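Saved as a batch script (the file name below is just a placeholder), this is then submitted in the usual way:
$ sbatch job.A30.32.2nodes.sh
$ squeue -u ${USER}   # check that the job is queued/running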
This test is set up on top of commit dd937915e507c5eb87e7f4c42c76a6103adda3c1 of the juelich_qphix-tmf branch of https://github.com/kostrzewa/qphix.
QMP (https://github.com/usqcd-software/qmp) is a required dependency of QPhiX. Make sure to use the MPI comms type (the default is SINGLE):
$ cd $qmp_build_dir
$ CC=mpicc ${path_to_qmp_source}/configure --prefix=${qmp_installation_dir} CFLAGS=-std=c99 --with-qmp-comms-type=MPI
$ make
$ make install
Compilation proceeds using the autotools chain.
$ module list
Currently Loaded Modules:
1) GCCcore/.5.4.0 7) ParaStationMPI/5.1.5-1
2) binutils/.2.27 8) Bison/.3.0.4
3) icc/.2017.0.098-GCC-5.4.0 9) flex/2.6.0
4) ifort/.2017.0.098-GCC-5.4.0 10) ncurses/.6.0
5) Intel/2017.0.098-GCC-5.4.0 11) CMake/3.6.2
6) pscom/.Default 12) imkl/2017.0.098
$ git clone -b juelich_qphix-tmf https://github.com/kostrzewa/qphix
$ cd qphix
$ autoreconf -fi # generate all necessary autotools files
$ mkdir build
$ cd build
# for KNL, --enable-proc=AVX512
$ CC=mpicc CXX=mpicxx \
../configure --enable-openmp --enable-parallel-arch=parscalar --enable-soalen=4 \
--enable-proc=AVX2 --enable-twisted-mass \
--with-qmp=${path_to_qmp} \
--prefix=${installation_directory}
$ make -j
$ make install
The QPhiX-enabled benchmark is based on commit 996046b8a794b7587ca9af4d984120e0899580c9 of the juelich_qphix_devel branch of https://github.com/kostrzewa/tmLQCD.
In the commands below, the use of MKL is entirely optional: no LAPACK routines are called during the benchmark, so OpenBLAS would work just as well.
$ module list
Currently Loaded Modules:
1) GCCcore/.5.4.0 7) ParaStationMPI/5.1.5-1
2) binutils/.2.27 8) Bison/.3.0.4
3) icc/.2017.0.098-GCC-5.4.0 9) flex/2.6.0
4) ifort/.2017.0.098-GCC-5.4.0 10) ncurses/.6.0
5) Intel/2017.0.098-GCC-5.4.0 11) CMake/3.6.2
6) pscom/.Default 12) imkl/2017.0.098
$ git clone -b juelich_qphix_devel https://github.com/kostrzewa/tmLQCD tmLQCD.qphix
$ cd tmLQCD.qphix
# run at most aclocal and autoconf, tmLQCD does not use the full autotools set
$ autoconf   # optionally preceded by aclocal
$ mkdir build
$ cd build
$ ../configure --with-limedir=${path_to_lime} \
--with-lemondir=${path_to_lemon} \
--with-mpidimension=4 --enable-omp --enable-mpi \
--disable-sse2 --disable-sse3 \
--with-lapack="-Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -ldl" \
--enable-halfspinor --enable-gaugecopy \
--with-qphixdir=${path_to_qphix} \
--with-qmpdir=${path_to_qmp} \
CC=mpicc CXX=mpicxx F77=ifort \
CFLAGS="-O3 -std=c99 -qopenmp -xCORE-AVX2" \
CXXFLAGS="-O3 -std=c++11 -qopenmp -xCORE-AVX2" \
LDFLAGS="-qopenmp"
$ make -j
For KNL, one would probably use -xMIC-AVX512 in CFLAGS and CXXFLAGS. Note that you will also need to cross-compile, because the test executables built by configure will not run on the compiling system:
../configure [...] --host=x86_64-linux-gnu [...] \
CFLAGS="-O3 -std=c99 -qopenmp -xMIC-AVX512" \
CXXFLAGS="-O3 -std=c++11 -qopenmp -xMIC-AVX512" \
LDFLAGS="-qopenmp"
With a job scheduler which automatically exports the environment, running the QPhiX-enabled tmLQCD can be done as below for two Jureca nodes.
## load modules
OMP_NUM_THREADS=12 KMP_AFFINITY=balanced srun --ntasks=4 --ntasks-per-node=2 \
--cpus-per-task=12 \
${path_to_tmLQCD}/invert -v -f invert.A30.32.qphix.input
On KNL, thread-placement probably requires fine-tuning. On dual-socket machines, it is generally beneficial to have two MPI tasks per node.
The input files are located in this gist and contain platform-related documentation in the form of comments.
A30.32
- mixed precision CG
  - QUDA: invert.A30.32.quda.CG.input
  - QPhiX: invert.A30.32.qphix.input
- QUDA MULTIGRID

D15.48
- mixed precision CG
  - QUDA: invert.D15.48.quda.CG.input
  - QPhiX: invert.D15.48.qphix.input
- QUDA MULTIGRID