Jugene Benchmark
Testing the current Hybrid code on Jugene, I've come to the following results when running on just one compute card. All results are in Mflops as reported by the benchmark application. The local lattice size is 2x24x24x24 with j_max preset to 2048. MPI parallelisation is in one dimension only. All these results are from a single run, so statistical anomalies could be present.
The hybrid/OpenMP version of the code compiles cleanly only with -O2 or with -O3 -qstrict. (Plain -O3 stalls in the IPA step and sits there for hours, using more and more memory.)
So far, the halfspinor version of the BG/X code does not seem to work with OpenMP.
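For orientation, here is a rough sketch of how a bandwidth figure like the ones quoted with the results below could be derived from the with/without-communication Mflops numbers and the lattice geometry above. This is not necessarily how the benchmark computes it; the flop count per site (1320, the usual Wilson hopping-matrix count), the assumption that a full spinor of 24 doubles is exchanged per boundary site, and the attribution of the whole com/nocom time difference to the halo exchange are assumptions on my part.

```c
/* Back-of-the-envelope bandwidth estimate (NOT the benchmark's own accounting). */
#include <stdio.h>

int main(void)
{
    const double flops_per_site = 1320.0;           /* assumed Wilson hopping-matrix count */
    const long   lt = 2, lx = 24, ly = 24, lz = 24; /* local lattice 2x24x24x24            */
    const long   volume = lt * lx * ly * lz;
    const long   boundary_sites = 2 * lx * ly * lz; /* two t-faces, 1-D parallelisation    */
    const double bytes_per_site = 24.0 * 8.0;       /* full spinor: 24 doubles (assumed)   */

    const double mflops_com   = 645.0;              /* example pair from the results below */
    const double mflops_nocom = 771.0;

    /* Time per operator application with and without communication. */
    const double t_com   = flops_per_site * volume / (mflops_com   * 1.0e6);
    const double t_nocom = flops_per_site * volume / (mflops_nocom * 1.0e6);
    const double t_comm  = t_com - t_nocom;         /* time attributed to communication    */

    const double bw = boundary_sites * bytes_per_site / t_comm / 1.0e6; /* MB/s */
    printf("estimated communication bandwidth: %.0f MB/s\n", bw);
    return 0;
}
```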
645 / nocom: 771
Bandwidth ~ 548 MB/s
717 / nocom: 789
Bandwidth ~ 1100 MB/s
754 / nocom: 803
Bandwidth ~ 1700 MB/s
1142 per task (571 per thread) / nocom: 1466 per task (733 per thread)
Bandwidth ~ 720 MB/s
638 / nocom: 755
Bandwidth ~ 570 MB/s
633 / nocom: 749
Bandwidth ~ 570 MB/s
575 / nocom: 734
Bandwidth ~ 370 MB/s
These runs illustrate the overhead of the MPI calls when a task exchanges data with itself; a minimal sketch of such a self-exchange follows the numbers below.
2182 (545 per thread) / nocom: 2918 (729 per thread)
Bandwidth ~ 1200 MB/s (of course, this figure is nonsensical)
1231 (615 per thread) / nocom: 1497 (748 per thread)
662 / nocom: 762
Bandwidth ~ 700 MB/s
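As a minimal illustration (not taken from the benchmark source), a self-exchange can be reproduced with a single task posting the same kind of point-to-point call it would use for a real neighbour, with itself as both source and destination; the buffer size below (one 24^3 face of full spinors) is an assumption.

```c
/* Minimal sketch of a task exchanging a halo buffer with itself, as happens
 * when the process grid has extent 1 in a parallelised direction: the MPI
 * calls are still issued, so their overhead remains even though no data
 * crosses the network. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 24 * 24 * 24 * 24;   /* one 24^3 face, 24 doubles per spinor (assumed) */
    double *sendbuf = malloc(n * sizeof(double));
    double *recvbuf = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++)
        sendbuf[i] = (double)i;

    const double t0 = MPI_Wtime();
    /* The "neighbour" in a direction with only one process is the task itself. */
    MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, rank, 0,
                 recvbuf, n, MPI_DOUBLE, rank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    const double t1 = MPI_Wtime();

    printf("rank %d: self-exchange of %zu bytes took %g s\n",
           rank, n * sizeof(double), t1 - t0);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```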
This is a special test because it might show whether the coexistence of MPI and OpenMP is causing a slowdown. At the same time, it shows how well clock() works as a timer on BG/P.
3129 (782 per thread)
This seems to be by far the fastest setup on one compute card, but I'm not sure how reliable the time measurement is, since without MPI I'm forced to use clock() for timing (a small sketch of the issue follows below the numbers).
1593 (796 per thread)
808
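Regarding the clock()-based timing mentioned above: on many systems clock() returns CPU time summed over all threads rather than wall-clock time, which would skew Mflops figures for threaded runs. The following standalone sketch (not benchmark code) compares clock() against omp_get_wtime() for an OpenMP loop.

```c
/* Compare clock() with omp_get_wtime() around an OpenMP region.  If clock()
 * accumulates CPU time over all threads, it reports roughly nthreads times
 * the wall-clock time of the loop. */
#include <omp.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    const long n = 50 * 1000 * 1000;
    double sum = 0.0;

    const clock_t c0 = clock();
    const double  w0 = omp_get_wtime();

    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; i++)
        sum += 1.0 / (double)(i + 1);

    const double  w1 = omp_get_wtime();
    const clock_t c1 = clock();

    printf("sum             = %f\n", sum);
    printf("clock()         = %g s\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    printf("omp_get_wtime() = %g s\n", w1 - w0);
    return 0;
}
```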
1307 (326 per thread) / nocom: 2918 (729 per thread)
Bandwidth ~ 328 MB/s
569 / nocom: 771
Bandwidth ~ 300 MB/s
NrTProcs=8, NrXProcs=8, NrYProcs=4, 16x32^3
1059 (264 per thread) / nocom: 2731 (682 per thread)
Bandwidth ~ 420 MB/s
NrTProcs=16, NrXProcs=8, NrYProcs=4, 32^4
762 (381 per thread) / nocom: 1403 (701 per thread)
Bandwidth ~ 405 MB/s
NrTProcs=8, NrXProcs=8, NrYProcs=4, NrZProcs=4
Fails with MPI null pointers despite sufficient local lattice size!
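As a quick sanity check of the process-grid settings above (not benchmark code), the implied local lattice sizes can be computed by dividing each global extent by the corresponding number of processes; NrZProcs is assumed to be 1 where it is not given.

```c
/* Local lattice size implied by splitting the global lattice evenly over the
 * process grid: T/NrTProcs x X/NrXProcs x Y/NrYProcs x Z/NrZProcs. */
#include <stdio.h>

static void local_size(const char *label, int t, int x, int y, int z,
                       int nt, int nx, int ny, int nz)
{
    printf("%-24s -> local lattice %d x %d x %d x %d\n",
           label, t / nt, x / nx, y / ny, z / nz);
}

int main(void)
{
    /* 16x32^3 on an 8 x 8 x 4 x 1 process grid */
    local_size("NrT=8, NrX=8, NrY=4", 16, 32, 32, 32, 8, 8, 4, 1);
    /* 32^4 on a 16 x 8 x 4 x 1 process grid */
    local_size("NrT=16, NrX=8, NrY=4", 32, 32, 32, 32, 16, 8, 4, 1);
    return 0;
}
```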
The Hybrid code is really slow on the BG/P. This will be investigated further with Scalasca.