review Cilkscale reference
Bruce Hoppe authored and Bruce Hoppe committed Sep 19, 2022
1 parent ef9520f commit fcbcc0d
Showing 4 changed files with 34 additions and 48 deletions.
68 changes: 27 additions & 41 deletions src/doc/reference/cilkscale.md
@@ -11,7 +11,7 @@ eleventyNavigation:

The OpenCilk Cilkscale tool comprises three main components:

- Infrastructure in the OpenCilk compiler and runtime for work/span analysis.
- Infrastructure in the OpenCilk compiler and runtime system for work/span analysis.
- A C/C++ API for fine-grained analysis of program regions.
- A Python script that automates scalability analysis, benchmarking on multiple
cores, and visualization of parallel performance results.
@@ -36,27 +36,22 @@ page](/doc/users-guide/install/#example).
Cilkscale work/span analysis reports contain the following measurements for
each analyzed program region.

- **Work**
The total {% defn "work" %} $(T_1)$ of the computation, measured as CPU time.
- {% defn "Work" %}: the CPU time of the computation when run on one processor, sometimes denoted $(T_1)$.
The actual wall-clock time it takes to run the computation will generally be
smaller than the work, since the latter adds together the time spent on
different CPU cores in parallel.

- **Span**
The {% defn "span" %} $(T_{\infty})$ of the computation, measured as CPU
time. The span is the maximum amount of work along any path in the {% defn
"parallel trace" %} of the computation. One way of understanding the span is
as the expected wall-clock execution time if the computation was run on an
infinite number of parallel cores.

- **Parallelism**
The {% defn "parallelism" %} of a computation is its work-to-span ratio $(T_1
/ T_{\infty})$. Parallelism can be thought of as the maximum possible
parallel speedup of the computation, or as the maximum number of cores that
could theoretically yield perfect linear speedup.

- **Burdened span**
The burdened span is similar to the span after accounting for worst-case
- {% defn "Span" %}: the theoretically fastest CPU time of the computation
when run on an infinite number of parallel processors (discounting overheads for communication and scheduling),
sometimes denoted $(T_{\infty})$. The span is the maximum amount of work along any path in the {% defn
"parallel trace" %} of the computation.

- {% defn "Parallelism" %}: the ratio of work to span for a computation $(T_1 / T_{\infty})$,
which is the maximum speedup it could attain when run on an infinite number of processors.
Parallelism can also be interpreted as the maximum number of processors that
could theoretically yield {% defn "perfect linear speedup" %}.

- ***Burdened span***: similar to span after accounting for worst-case
scheduling overhead. The scheduling burden overhead is based on a heuristic
estimate of the costs associated with migrating and synchronizing parallel
tasks among processors. The worst-case scenario is when every time it is
@@ -65,9 +60,8 @@ each analyzed program region.
slow down parallel execution, such as insufficient memory bandwidth,
contention on parallel resources, false sharing, etc.)

- **Burdened parallelism**
The burdened parallelism is the ratio of work to burdened span. It can be
thought of as a lower bound for the parallelism of the computation assuming
- ***Burdened parallelism***: the ratio of work to burdened span. It can be
interpreted as a lower bound for the parallelism of the computation assuming
worst-case parallel scheduling.
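
The relationships among these measurements can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the `Measurements` struct and the function names are hypothetical and are not part of the Cilkscale API, and the times are assumed to be given in seconds.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for one analyzed region; NOT the Cilkscale wsp_t type.
struct Measurements {
  double work;           // T_1: CPU time on one processor, in seconds
  double span;           // T_inf: time along the longest path, in seconds
  double burdened_span;  // span plus estimated worst-case scheduling overhead
};

// Parallelism is the ratio of work to span (T_1 / T_inf).
double parallelism(const Measurements& m) {
  assert(m.span > 0.0);
  return m.work / m.span;
}

// Burdened parallelism is the ratio of work to burdened span.
double burdened_parallelism(const Measurements& m) {
  assert(m.burdened_span > 0.0);
  return m.work / m.burdened_span;
}

// Speedup on p processors cannot exceed p (perfect linear speedup)
// and cannot exceed the parallelism.
double max_speedup(const Measurements& m, int p) {
  assert(p > 0);
  return std::min(static_cast<double>(p), parallelism(m));
}
```

For instance, a region with 8 s of work, 0.5 s of span, and 1 s of burdened span has parallelism 16 and burdened parallelism 8, so adding processors beyond 16 could not improve its speedup even in theory.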

{% alert "info" %}
@@ -231,7 +225,7 @@ the printed row are, in order: the `tag` string, work, span, parallelism,
burdened span, and burdened parallelism.
See also: [Cilkscale work/span analysis
measurements](#workspan-analysis-measurements-reported-by-cilkscale).
measurements](#workspan-analysis-measurements).
### C++ operator overloads
@@ -243,7 +237,7 @@ variables:
- The `<<` operator can be used with a prefix argument of type `std::ostream`
or `std::ofstream` to print work/span measurements. The `<<` operator
behaves similarly to `wsp_dump()`, except that (1) it does not print a tag
field and (2) its output stream is unaffected by the `CILKSCALE_OUT`
field, and (2) its output stream is unaffected by the `CILKSCALE_OUT`
environment variable.
### Examples
@@ -321,7 +315,7 @@ $ python3 /opt/opencilk/share/Cilkscale_vis/cilkscale.py ARGUMENTS
`-fcilktool=cilkscale-benchmark`.

- `-cpus CPU_COUNTS`, `--cpu-counts CPU_COUNTS`
_(Optional)_ Comma-separated list of CPU counts to use when running empirical
_(Optional)_ Comma-separated list of how many cores to use when running empirical
performance benchmarks. On systems with [simultaneous multithreading
(SMT)](https://en.wikipedia.org/wiki/Simultaneous_multithreading) (aka
"hyper-threading" on Intel CPUs), Cilkscale only uses distinct physical
@@ -355,14 +349,6 @@ $ python3 /opt/opencilk/share/Cilkscale_vis/cilkscale.py ARGUMENTS

_**Example:**_

{% alert "danger" %}

_**BUG:**_ The following `shell-session` code block only gets rendered badly if
it is within an alert-box. It seems there are generally some styling issues
with reference pages.

{% endalert %}

```shell-session
$ /opt/opencilk/bin/clang qsort.c -fopencilk -fcilktool=cilkscale -O3 -o qsort_cs
$ /opt/opencilk/bin/clang qsort.c -fopencilk -fcilktool=cilkscale-benchmark -O3 -o qsort_cs_bench
@@ -372,12 +358,12 @@ $ python3 /opt/opencilk/share/Cilkscale_vis/cilkscale.py \
--args 100000000
Namespace(args=['100000000'], cilkscale='./qsort_cs', cilkscale_benchmark='./qsort_cs_bench', cpu_counts=None, output_csv='qsort-bench.csv', output_plot='qsort-scalability-plots.pdf', rows_to_plot='all')
>> STDOUT (./qsort_cilkscale 100000000)
\>> STDOUT (./qsort_cilkscale 100000000)
Sorting 100000000 random integers
Sort succeeded
<< END STDOUT
>> STDERR (./qsort_cilkscale 100000000)
\>> STDERR (./qsort_cilkscale 100000000)
<< END STDERR
INFO:runner:Generating scalability data for 8 cpus.
@@ -396,15 +382,15 @@ INFO:plotter:Generating plot (2 subplots)

### Performance and scalability analysis plots

An example set of plots that are produced by the `cilkscale.py` script is shown
An example set of plots produced by the `cilkscale.py` script is shown
below. In this example, the instrumented application is a parallel quicksort
and the Cilkscale API was used to analyze one program region (tagged as
"sampled_qsort" in the relevant call to `wsp_dump()`) in addition to the whole
program which is always analyzed by Cilkscale. Details on how these plots were
generated can be found in the [Cilkscale user's
guide](/doc/users-guide/cilkscale).

{% img "/img/qsort-cilkscale-scalability-plots.png", "1000" %}
{% img "/img/qsort-cilkscale-scalability-plots.png", "100%" %}

The Cilkscale visualization plots are arranged in two columns and as many rows
as calls to the Cilkscale API `wsp_dump()` function (plus one untagged row for
@@ -419,11 +405,11 @@ Specifically, these figures plot four types of measurements:
measurement overheads.
- A dark green line shows what the execution time would be if the computation
exhibited _perfect linear speedup_, that is, if the time on $P$ processors
were to be $P$ times smaller than the time it took on $1$ processor.
were to be $P$ times smaller than the time it took on one processor.
- A teal line shows the heuristic _burdened-dag bound_ of the execution time
(the parallel trace of the computation is sometimes also referred to as its
directed acyclic graph or dag). In the absence of other sources of parallel
slowdown such as insufficient memory bandwidth, contention, etc, the
slowdown such as insufficient memory bandwidth, contention, etc., the
burdened-dag bound serves as a heuristic lower bound for the execution time
if the parallel computation does not exhibit sufficient parallelism and is
not too fine-grained.
@@ -433,8 +419,8 @@ Specifically, these figures plot four types of measurements:
etc.

**Parallel speedup.** The right-column plots contain the same information as
those in the left column, except that the $y$-axis shows parallel speedup.
those in the left column, except that the y-axis shows parallel speedup.
That is, all execution time measurements are divided by the execution time of
the computation on $1$ processor. The horizontal line for parallelism (serial
the computation on one processor. The horizontal line for parallelism (serial
execution time divided by span) is not visible in the speedup plots if its
value falls outside the range of the $y$-axis.
value falls outside the range of the y-axis.
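
The two simplest reference curves in these plots can be reproduced by hand. The following is a minimal sketch, assuming speedup is taken as the one-processor time divided by the $P$-processor time; the function names are hypothetical and do not come from `cilkscale.py`. The heuristic burdened-dag bound is deliberately left out, since its exact formula is not given here.

```cpp
#include <algorithm>
#include <cassert>

// Execution time under perfect linear speedup: P times smaller than on one processor.
double perfect_linear_time(double t1, int p) {
  assert(p > 0);
  return t1 / p;
}

// Parallel speedup of a run that took tp seconds, relative to t1 seconds on one processor.
double speedup(double t1, double tp) {
  assert(tp > 0.0);
  return t1 / tp;
}

// No schedule can beat either the perfect-linear-speedup time or the span,
// so the larger of the two is a lower bound on execution time.
double time_lower_bound(double t1, double span, int p) {
  assert(p > 0);
  return std::max(t1 / p, span);
}
```

With 8 s on one processor and a 0.5 s span, the lower bound is 2 s on 4 processors (limited by work) and 0.5 s on 64 processors (limited by span), which corresponds to the point where the observed curve can no longer follow the perfect-linear-speedup line.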
14 changes: 7 additions & 7 deletions src/doc/users-guide/cilkscale.md
@@ -30,7 +30,7 @@ how to use them to diagnose parallel performance limitations of your
application. For details on the Cilkscale components, user options, and output
information, see the [Cilkscale reference page](/doc/reference/cilkscale).

{% img "/img/qsort-cilkscale-scalability-plots-sample-qsort-only.png", "1000" %}
{% img "/img/qsort-cilkscale-scalability-plots-sample-qsort-only.png", "100%" %}

{% alert "info" %}

@@ -192,18 +192,18 @@ achieve this, we make the following three changes to our code.
line 35 in `qsort.cpp`:

```cpp
wsp_t wsp_tic = wsp_getworkspan();
wsp_t start = wsp_getworkspan();
sample_qsort(a.data(), a.data() + a.size());
wsp_t wsp_toc = wsp_getworkspan();
wsp_t end = wsp_getworkspan();
```
3. Evaluate the work and span between the relevant snapshots and print the
analysis results with a descriptive tag. E.g., just before the program
terminates in line 39 in `qsort.cpp`:
```cpp
wsp_t wsp_elapsed = wsp_sub(wsp_toc, wsp_tic);
wsp_dump(wsp_elapsed, "qsort_sample");
wsp_t elapsed = wsp_sub(end, start);
wsp_dump(elapsed, "qsort_sample");
```

Then, we save our edited program as `qsort_wsp.cpp`, compile it with Cilkscale
@@ -395,7 +395,7 @@ page](/doc/reference/cilkscale/#performance-and-scalability-analysis-plots).

Here are the plots in `csplots_qsort.pdf` for the above example:

{% img "/img/qsort-cilkscale-scalability-plots.png", "1000" %}
{% img "/img/qsort-cilkscale-scalability-plots.png", "100%" %}


## Discussion: diagnosing performance limitations
@@ -454,4 +454,4 @@ serial and whose cost is linear with respect to the size of the input array.

We will not cover parallel partition algorithms for quicksort here, but warn
that designing and implementing efficient parallel partitions is an interesting
and nontrivial exercise!
and nontrivial exercise!
Binary file modified src/img/qsort-cilkscale-scalability-plots.png