
CINECA Test Setup for QPhiX

Bartosz Kostrzewa edited this page May 25, 2017 · 8 revisions

Observed Problems

The performance of the QPhiX library is not as good as expected.

Single Node Performance

The single node performance lags behind that of other KNL systems. The following chart shows the performance on a single KNL node, using one MPI rank and 64×4 OpenMP threads with a lattice size of 48³×96. DEEP-ER is a KNL 7230 machine in Jülich; Marconi A2 is the KNL 7250 machine at CINECA.

One can see that the performance on Marconi A2 is worse with the exact same setup. Both machines used the Intel C++ 17 compiler and the Intel MPI implementation.

During the KNL meeting at Cineca in Casalecchio di Reno in March 2017, Peter Labus and Martin Ueding gave a talk which contained the above diagram. Another speaker from Jülich commented that DEEP-ER uses water cooling whereas Marconi A2 has air cooling. Other people said that there had been thermal issues and that the clock speed had been lowered.

Multi Node Performance

The strong scaling does not work as expected, either. We have run 270 jobs with various lattice sizes and numbers of MPI ranks. In each case we have used 68×4 OpenMP threads. QPhiX supports various 4D domain decompositions (more about that below). We have tried all possible ones and have found a significant difference between good and bad decompositions. Below are the best results for a particular number of nodes.

One would expect a straight line from the origin, but the actual results are rather unsatisfying. The absolute numbers are also fairly low. We have seen results from other KNL machines, namely an early Cori and Frioul. There, both the scaling and the total performance are much better.

One would expect to sustain around 150-200 Gflop/s per node in single precision up to 64 nodes or so, or, in other words, to achieve around 10 Tflop/s on 64 nodes. That is a factor of 10 more than what is observed.
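The expectation above is plain multiplication under perfect strong scaling, and it can be written down as a minimal sketch. The function names `ideal_tflops` and `parallel_efficiency` are made up for this illustration and are not part of QPhiX or the benchmarking scripts.

```python
def ideal_tflops(nodes, gflops_per_node):
    """Aggregate performance in Tflop/s if every node keeps its single-node rate."""
    return nodes * gflops_per_node / 1000.0


def parallel_efficiency(measured_tflops, nodes, gflops_per_node):
    """Fraction of the ideal aggregate performance that was actually measured."""
    return measured_tflops / ideal_tflops(nodes, gflops_per_node)


# Per-node range quoted in the text: 150 to 200 Gflop/s in single precision.
low = ideal_tflops(64, 150)
high = ideal_tflops(64, 200)
print(f"expected on 64 nodes: {low:.1f} to {high:.1f} Tflop/s")
```

Feeding a measured aggregate number into `parallel_efficiency` gives a single figure per run, which makes it easier to compare different decompositions than reading it off a plot.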

Test Setup

Installing the Software

For the tests one needs to compile QPhiX and its dependencies QDP++ and libxml2. The easiest way is to use the bootstrap script. It loads the appropriate compiler module, then downloads and installs all the dependencies (QMP, libxml2, QDP++).

The git submodules of QDP++ are fetched via SSH from GitHub. This means that you have to register an SSH key from the frontend with GitHub. If you do not want to add the frontend SSH key to your GitHub account, just create some dummy repository and add that SSH key as a “deploy key”. If you get errors like “access denied, public key”, this is where they come from.

If you have problems with the compilation script, please contact Martin Ueding via [email protected].

Testing Parameters

Physical options:

  • Lattice size: XYZT = 48³×96
  • Precision: double, float
  • Dslash performance measurement: -dslash
  • Conjugate Gradient performance measurement: -cg

Single node options:

  • SoA length: 4, 8
  • SMT Threads: 4
  • Blocking: -by 4 -bz 4
  • Padding: -pxy 1 -pxyz 0

Domain Decomposition / Number of MPI Ranks

  • Up to 72 nodes
  • Geometry parameters -geom gx gy gz gt.

The local lattice size is x = X / gx, y = Y / gy, and so forth. Unfortunately, there are lots of constraints to take into account when choosing the domain decomposition. These are:

  • x % soalen == 0
  • y % by == 0
  • z % bz == 0
  • veclen % soalen == 0
  • gx * gy * gz * gt == number_of_ranks
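The geometry-dependent constraints above can be checked mechanically. The following is a minimal sketch, not the actual benchmarking script: `valid_geometries` and `divisors` are hypothetical helper names, and the defaults `by=4`, `bz=4` are taken from the blocking options listed earlier. The constraint relating `soalen` and `veclen` does not depend on the decomposition and is therefore left out.

```python
from itertools import product


def divisors(n):
    """All positive divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]


def valid_geometries(lattice, ranks, soalen, by=4, bz=4):
    """Return all (gx, gy, gz, gt) that satisfy the constraints listed above."""
    X, Y, Z, T = lattice
    result = []
    for gx, gy, gz in product(divisors(X), divisors(Y), divisors(Z)):
        if ranks % (gx * gy * gz) != 0:
            continue
        gt = ranks // (gx * gy * gz)  # enforces gx * gy * gz * gt == ranks
        if T % gt != 0:
            continue  # the local t extent must be an integer
        x, y, z = X // gx, Y // gy, Z // gz
        if x % soalen == 0 and y % by == 0 and z % bz == 0:
            result.append((gx, gy, gz, gt))
    return result


# Example: 48³×96 lattice on 8 ranks with SoA length 8.
for geom in valid_geometries((48, 48, 48, 96), ranks=8, soalen=8):
    print("-geom", *geom)
```

Each printed line is a candidate argument list for the `-geom` option. Which of the candidates performs best still has to be measured.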

From the DDalphaAMG library with an optimal MG aggregation of 4^4 blocks and the ability to do even-odd preconditioning in a 3-level MG setup, we also get additional constraints:

  • Local lattice size must be at least 8 in three directions and at least 16 in the fourth. Choosing x >= 16, y >= 8, z >= 8 and t >= 8 is probably the best variant for QPhiX.

Lattice Size | Maximum Ranks (gx×gy×gz×gt) | Total Ranks
32³×64       | 2×4×4×8                     | 256
48³×96       | 3×6×6×12                    | 1296
64³×128      | 4×8×8×16                    | 4096
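
The maximum decomposition for a lattice follows from dividing each global extent by the minimum local extent in that direction. A short sketch of that arithmetic, assuming the choice x >= 16 and y, z, t >= 8 from the bullet above (`max_ranks` is a hypothetical helper name, and the QPhiX blocking constraints are not checked here):

```python
def max_ranks(lattice, min_local=(16, 8, 8, 8)):
    """Largest (gx, gy, gz, gt) and rank count for the given minimum local extents."""
    # Floor division; for the lattices considered here the division is exact.
    geom = tuple(extent // minimum for extent, minimum in zip(lattice, min_local))
    total = 1
    for g in geom:
        total *= g
    return geom, total


for lattice in [(32, 32, 32, 64), (48, 48, 48, 96), (64, 64, 64, 128)]:
    geom, total = max_ranks(lattice)
    print(lattice, "x".join(str(g) for g in geom), total)
```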

There are various ways to factor a fixed number of ranks. From our experience, it is best to choose gx < gy < gz < gt, or at least somewhat in that direction. It is hard to say exactly how to choose these factors; the scripts (below) therefore generate all possibilities for you.
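
To illustrate the ordering rule of thumb, here is a minimal sketch that lists only the factorizations with non-decreasing factors. It uses gx <= gy <= gz <= gt rather than strict inequalities, since strict ordering is often impossible for small rank counts. `ordered_factorizations` is a hypothetical name, the lattice constraints are not checked, and this is not how the benchmarking scripts are implemented.

```python
def ordered_factorizations(ranks):
    """All (gx, gy, gz, gt) with product == ranks and gx <= gy <= gz <= gt."""
    out = []
    for gx in range(1, ranks + 1):
        if ranks % gx:
            continue
        rest_yzt = ranks // gx
        for gy in range(gx, rest_yzt + 1):
            if rest_yzt % gy:
                continue
            rest_zt = rest_yzt // gy
            for gz in range(gy, rest_zt + 1):
                if rest_zt % gz:
                    continue
                gt = rest_zt // gz
                if gt >= gz:
                    out.append((gx, gy, gz, gt))
    return out


for geom in ordered_factorizations(16):
    print("-geom", *geom)
```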

There are benchmarking scripts that can generate job files for Marconi A2. These will have to be adapted for the particular tests that one wants to run.