==============================================================
Performance Tuning and setting up the input data file HPL.dat
Current as of release HPL - 2.2 - February 24, 2016
==============================================================
Check out the website www.netlib.org/benchmark/hpl for the
latest information.
After building the executable hpl/bin/<arch>/xhpl, one
may want to modify the input data file HPL.dat. This file
must reside in the same directory as the executable
hpl/bin/<arch>/xhpl. An example HPL.dat file is provided by
default. This file contains information about the problem
sizes, machine configuration, and algorithm features to be
used by the executable. It is 31 lines long. All the selected
parameters are printed in the output generated by the
executable.
At the end of this file, there are a couple of experimental
guidelines that you may find useful.
==============================================================
File HPL.dat (description):
Line 1: (unused) This line is typically used to summarize
the content of the input file. By default this line reads:
HPL Linpack benchmark input file
Line 2: (unused) same as line 1. By default this line reads:
Innovative Computing Laboratory, University of Tennessee
Line 3: the user can choose where the output should be
redirected. In the case of a file, a name is necessary, and
this is the line where one specifies it. Only the first name
on this line is significant. By default, the line reads:
HPL.out output file name (if any)
This means that if one chooses to redirect the output to a
file, the file will be called "HPL.out". The rest of the line
is unused, and this space can be used for an informative
comment on the meaning of this line.
Line 4: This line specifies where the output should go. The
line is formatted; it must begin with a positive integer, and
the rest is ignored. Three choices are possible for this
integer: 6 means that the output will go to standard output,
7 means that the output will go to standard error, and any
other integer means that the output should be redirected to a
file, whose name has been specified in the line above.
This line by default reads:
6 device out (6=stdout,7=stderr,file)
which means that the output generated by the executable
should be redirected to the standard output.
Line 5: This line specifies the number of problem sizes to be
executed. This number should be less than or equal to 20. The
first integer is significant, the rest is ignored. If the
line reads:
3 # of problems sizes (N)
this means that the user is willing to run 3 problem sizes
that will be specified in the next line.
Line 6: This line specifies the problem sizes one wants to
run. Assuming the line above started with 3, the first 3
positive integers are significant, the rest is ignored. For
example:
3000 6000 10000 Ns
means that one wants xhpl to run 3 (specified in line 5) pro-
blem sizes, namely 3000, 6000 and 10000.
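As an aside, this file does not say how to pick a problem size; a common rule of thumb (an assumption of mine, not part of this document) is to size the N x N double-precision matrix to fill roughly 80% of the total memory of the machine, leaving room for the OS and HPL's workspace, and to round N down to a multiple of the block size NB. A small sketch of that calculation:

```python
import math

def suggest_n(total_mem_bytes, fraction=0.80, nb=128):
    """Suggest an HPL problem size N whose N x N matrix of 8-byte
    doubles fills roughly `fraction` of total memory, rounded down
    to a multiple of the block size NB (rule of thumb only)."""
    n = math.isqrt(int(fraction * total_mem_bytes) // 8)
    return (n // nb) * nb

# e.g. 64 GiB of total memory across all nodes, NB = 128
print(suggest_n(64 * 1024**3, nb=128))   # 82816
```

The 80% fraction is only a starting point; systems with heavy OS or MPI memory overhead may need a smaller fraction.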
Line 7: This line specifies the number of block sizes to be
run. This number should be less than or equal to 20.
The first integer is significant, the rest is ignored. If the
line reads:
5 # of NBs
this means that the user is willing to use 5 block sizes that
will be specified in the next line.
Line 8: This line specifies the block sizes one wants to run.
Assuming the line above started with 5, the first 5 positive
integers are significant, the rest is ignored. For example:
80 100 120 140 160 NBs
means that one wants xhpl to use 5 (specified in line 7)
block sizes, namely 80, 100, 120, 140 and 160.
Line 9 specifies how the MPI processes should be mapped onto
the nodes of your platform. There are currently two possible
mappings, namely row- and column-major. This feature is main-
ly useful when these nodes are themselves multi-processor
computers. A row-major mapping is recommended.
Line 10: This line specifies the number of process grids to
be run. This number should be less than or equal to 20.
The first integer is significant, the rest is ignored. If the
line reads:
2 # of process grids (P x Q)
this means that you are willing to try 2 process grid sizes
that will be specified in the next line.
Lines 11-12: These two lines specify the number of process
rows and columns of each grid you want to run on. Assuming
the line above (line 10) started with 2, the first 2 positive
integers of those two lines are significant, the rest is
ignored. For example:
1 2 Ps
6 8 Qs
means that one wants to run xhpl on 2 process grids (line
10), namely 1 by 6 and 2 by 8. Note: in this example, xhpl
must then be started on at least 16 nodes (the maximum of
P_i x Q_i). The runs on the two grids will be consecutive. If
one were to start xhpl on more than 16 nodes, say 52, only 6
would be used for the first grid (1x6) and then 16 (2x8)
would be used for the second grid. The fact that you started
the MPI job on 52 nodes will not make HPL use all of them; in
this example, only 16 would be used. If one wants to run xhpl
with 52 processes, one needs to specify a grid of 52
processes. For example, the following lines would do the job:
4 2 Ps
13 8 Qs
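Since every grid must satisfy P x Q = (number of MPI processes), the candidate grids for a given process count are just its factor pairs. This helper (my own sketch, not part of HPL) lists them; as discussed in guideline 3 below, the near-square pairs at the end of the list are usually the ones worth trying:

```python
def process_grids(nprocs):
    """List all P x Q factorizations of nprocs with P <= Q.
    HPL prefers "square" or slightly flat grids, so the pairs
    near the end of this list are usually the best candidates."""
    return [(p, nprocs // p) for p in range(1, int(nprocs**0.5) + 1)
            if nprocs % p == 0]

print(process_grids(52))   # [(1, 52), (2, 26), (4, 13)]
```

For the 52-process example above, this suggests 4 x 13 as the most promising grid.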
Line 13: This line specifies the threshold to which the
residuals should be compared. The residuals should be of
order 1, but are in practice slightly less than this,
typically 0.001. This line begins with a real number; the
rest is ignored. For example:
16.0 threshold
In practice, a value of 16.0 will cover most cases. For
various reasons, it is possible that some of the residuals
become slightly larger, say for example 35.6. xhpl will flag
those runs as failed, but they can be considered correct. A
run can be considered failed if the residual is a few orders
of magnitude bigger than 1, for example 10^6 or more. Note:
if one were to specify a threshold of 0.0, all tests would be
flagged as failed, even though the answer is likely to be
correct. It is allowed to specify a negative value for this
threshold, in which case the checks will be bypassed, no
matter what the value is, as soon as it is negative. This
feature allows one to save time when performing a lot of
experiments, say for instance during the tuning phase.
Example:
-16.0 threshold
The remaining lines allow one to specify algorithmic
features. xhpl will run all possible combinations of those
for each problem size, block size, and process grid
combination. This is handy when one looks for an "optimal"
set of parameters. To understand this a little better, let us
first say a few words about the algorithm implemented in HPL.
Basically, it is a right-looking version with row-partial
pivoting. The panel factorization is matrix-matrix operation
based and recursive, dividing the panel into NDIV subpanels
at each step. This part of the panel factorization is denoted
below by "recursive panel fact. (RFACT)". The recursion stops
when the current panel is made of less than or equal to NBMIN
columns. At that point, xhpl uses a matrix-vector operation
based factorization denoted below by "PFACTs". Classic
recursion would then use NDIV=2, NBMIN=1. There are
essentially 3 numerically equivalent LU factorization
algorithm variants (left-looking, Crout and right-looking).
In HPL, one can choose any one of those for the RFACT, as
well as for the PFACT. The following lines of HPL.dat allow
you to set those parameters.
Lines 14-21: (Example 1)
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
This example would try all variants of PFACT, 4 values for
NBMIN, namely 1, 2, 4 and 8, 3 values for NDIV, namely 2, 3
and 4, and all variants of RFACT. Lines 14-21: (Example 2)
2 # of panel fact
2 0 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
4 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
This example would try 2 variants of PFACT, namely
right-looking and left-looking, 2 values for NBMIN, namely 4
and 8, 1 value for NDIV, namely 2, and one variant of RFACT.
In the main loop of the algorithm, the current panel of
columns is broadcast in process rows using a virtual ring
topology. HPL offers various choices, and one most likely
wants to use the increasing-ring modified topology, encoded
as 1. 4 is also a good choice. Lines 22-23: (Example 1):
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
This will cause HPL to broadcast the current panel using the
increasing ring modified topology. Lines 22-23: (Example 2):
2 # of broadcast
0 4 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
This will cause HPL to broadcast the current panel using the
increasing-ring virtual topology and the long message
algorithm.
Lines 24-25 allow one to specify the look-ahead depth used by
HPL. A depth of 0 means that the next panel is factorized
after the update by the current panel is completely finished.
A depth of 1 means that the next panel is factorized
immediately after being updated; the update by the current
panel is then finished. A depth of k means that the k next
panels are factorized immediately after being updated; the
update by the current panel is then finished. It turns out
that a depth of 1 seems to give the best results, but it may
need a large problem size before one can see the performance
gain. So use 1 if you do not know better; otherwise you may
want to try 0. Look-ahead of depths 2 and larger will
probably not give you better results. Lines 24-25:
(Example 1):
1 # of lookahead depth
1 DEPTHs (>=0)
This will cause HPL to use a look-ahead of depth 1.
Lines 24-25: (Example 2):
2 # of lookahead depth
0 1 DEPTHs (>=0)
This will cause HPL to use a look-ahead of depths 0 and 1.
Lines 26-27 allow one to specify the swapping algorithm used
by HPL for all tests. There are currently two swapping
algorithms available, one based on "binary exchange" and the
other one based on a "spread-roll" procedure (also called
"long" below). For large problem sizes, the latter is likely
to be more efficient. The user can also choose to mix both
variants, that is, "binary exchange" for a number of columns
less than a threshold value, and then the "spread-roll"
algorithm. This threshold value is then specified on Line 27.
Lines 26-27: (Example 1):
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
This will cause HPL to use the "long" or "spread-roll" swap-
ping algorithm. Note that a threshold is specified in that
example but not used by HPL. Lines 26-27: (Example 2):
2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
This will cause HPL to use the "long" or "spread-roll"
swapping algorithm as soon as there are more than 60 columns
in the row panel. Otherwise, the "binary exchange" algorithm
will be used instead.
Line 28 allows one to specify whether the upper triangle of
the panel of columns should be stored in no-transposed or
transposed form. Example:
0 L1 in (0=transposed,1=no-transposed) form
Line 29 allows one to specify whether the panel of rows U
should be stored in no-transposed or transposed form.
Example:
0 U in (0=transposed,1=no-transposed) form
Line 30 enables/disables the equilibration phase. This option
will not be used unless you selected 1 or 2 in Line 26. Ex:
1 Equilibration (0=no,1=yes)
Line 31 allows one to specify the alignment in memory for the
memory space allocated by HPL. On modern machines, one
probably wants to use 4, 8 or 16. This may result in a tiny
amount of memory being wasted. Example:
4 memory alignment in double (> 0)
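Putting the line-by-line descriptions above together, a complete 31-line HPL.dat could look like the following. This is an illustrative sketch, not a tuned configuration: all numeric values are merely plausible starting points, and the exact comment text on each line (in particular the "PMAP" token on line 9) is my rendering rather than a quotation from this document. With the 2 x 2 grid shown, xhpl would be started on 4 MPI processes.

```
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10000        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
2            Qs
16.0         threshold
1            # of panel fact
1            PFACTs (0=left, 1=Crout, 2=Right)
2            # of recursive stopping criterium
4 8          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
```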
==============================================================
Guidelines:
1) Figure out a good block size for the matrix-matrix
multiply routine. The best method is to try a few out. If you
happen to know the block size used by the matrix-matrix
multiply routine, a small multiple of that block size will do
fine.
HPL uses the block size NB for the data distribution as well
as for the computational granularity. From a data
distribution point of view, the smaller NB, the better the
load balance. You definitely want to stay away from very
large values of NB. From a computation point of view, too
small a value of NB may limit the computational performance
by a large factor because almost no data reuse will occur in
the highest level of the memory hierarchy. The number of
messages will also increase. Efficient matrix-multiply
routines are often internally blocked. Small multiples of
this blocking factor are likely to be good block sizes for
HPL. The bottom line is that "good" block sizes are almost
always in the [32..256] interval. The best values depend on
the computation / communication performance ratio of your
system. To a much lesser extent, the problem size matters as
well. Say, for example, you empirically found that 44 was a
good block size with respect to performance. Then 88 or 132
are likely to give slightly better results for large problem
sizes because of a slightly higher flop rate.
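The "small multiples of the kernel blocking factor, within [32..256]" advice above can be sketched as a tiny helper (my own illustration; the kernel_block value 44 below is just the example number used in the text):

```python
def candidate_nbs(kernel_block, lo=32, hi=256):
    """Candidate HPL block sizes NB: multiples of the matrix-multiply
    kernel's internal blocking factor, kept within the [lo..hi]
    interval where good NB values almost always fall."""
    return [k * kernel_block for k in range(1, hi // kernel_block + 1)
            if lo <= k * kernel_block <= hi]

print(candidate_nbs(44))   # [44, 88, 132, 176, 220]
```

Each of the resulting values would then be tried in lines 7-8 of HPL.dat to find the best one empirically.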
2) The process mapping should not matter if the nodes of
your platform are single processor computers. If these nodes
are multi-processors, a row-major mapping is recommended.
3) HPL likes "square" or slightly flat process grids. Unless
you are using a very small process grid, stay away from the
1-by-Q and P-by-1 process grids.
4) Panel factorization parameters: a good start is the
following for lines 14-21:
1 # of panel fact
1 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
4 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
5) Broadcast parameters: at this time, it is far from obvious
to me what the best setting is, so I would probably try them
all. If I had to guess, I would probably start with the
following for lines 22-23:
2 # of broadcast
1 3 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
The best broadcast depends on your problem size and hardware
performance. My take is that 4 or 5 may be competitive for
machines featuring very fast nodes relative to the network.
6) Look-ahead depth: as mentioned above 0 or 1 are likely to
be the best choices. This also depends on the problem size
and machine configuration, so I would try "no look-ahead (0)"
and "look-ahead of depth 1 (1)". That is for lines 24-25:
2 # of lookahead depth
0 1 DEPTHs (>=0)
7) Swapping: one can select only one of the three algorithms
in the input file. Theoretically, mix (2) should win, but
long (1) might just be good enough. The difference should be
small between those two, assuming a swapping threshold of the
order of the block size (NB) selected. If this threshold is
very large, HPL will use bin-exch (0) most of the time, and
if it is very small (< NB), long (1) will always be used. In
short, and assuming the block size (NB) used is, say, 60, I
would choose the following for lines 26-27:
2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
I would also try the long variant. For a very small number
of processes in every column of the process grid (say < 4),
very little performance difference should be observable.
8) Local storage: I do not think Line 28 matters. Pick 0 if
in doubt. Line 29 is more important. It controls how the
panel of rows should be stored. No doubt 0 is better. The
caveat is that in that case the matrix-multiply function is
called with ( Notrans, Trans, ... ), that is C := C - A B^T.
Unless the computational kernel you are using has a very poor
(with respect to performance) implementation of that case and
is much more efficient with ( Notrans, Notrans, ... ), just
pick 0 as well. So, my choice:
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
9) Equilibration: It is hard to tell whether equilibration
should always be performed or not. Not knowing much about the
random matrix generated and because the overhead is so small
compared to the possible gain, I turn it on all the time.
1 Equilibration (0=no,1=yes)
10) For alignment, 4 should be plenty, but just to be safe,
one may want to pick 8 instead.
8 memory alignment in double (> 0)
==============================================================