add cilk_for content from Cilk Plus programmer's guide

OpenCilk · Aug 16, 2022 · f5353af · f5353af
1 parent 39e2894
commit f5353af
Showing 3 changed files with 217 additions and 4 deletions.
diff --git a/src/doc/tutorials/introduction-to-cilk_for.md b/src/doc/tutorials/introduction-to-cilk_for.md
@@ -8,11 +8,12 @@ attribution: true
 ---
 ## Context
 
-Below is a rough collection of content about `cilk_for` taken from
+Below is a collection of content about `cilk_for` taken from
 - https://www.intel.sg/content/dam/www/public/apac/xa/en/pdfs/ssg/Introduction_to_Intel_Cilk.pdf
 - 6.172 Lecture 8 https://canvas.mit.edu/courses/11151/files/1723140?module_item_id=444341
+- https://github.com/OpenCilk/documentation/tree/master/source_documents/Intel_Cilk%2B%2B_Programmers_Guide
 
-It includes a slide snapshot that I haven't yet redone. See also
+See also
 - https://www.smcm.iqfr.csic.es/docs/intel/compiler_c/main_cls/index.htm#cref_cls/common/cilk_for.htm 
 - https://cilk.mit.edu/programming/
 
@@ -35,19 +36,30 @@ several additional constraints compared to `for` loops.
 - Since the loop body is executed in parallel, it must not modify the control variable nor should it
 modify a nonlocal variable, as that would cause a data race. (You can use Cilksan to detect races.)
 
+These general restrictions have numerous specific consequences, which are listed under "Specific restrictions on `cilk_for` loops" later in this tutorial.
+
 ### Serial/parallel structure of cilk_for
 
 Note that using `cilk_for` is not the same as spawning each iteration of a `for` loop. In fact, the OpenCilk
 compiler converts the loop body to a function that is called recursively using a divide-and-conquer strategy that allows the OpenCilk scheduler to provide significantly better performance. 
 Here is a graphical depiction of how OpenCilk runs the eight iterations of the example `cilk_for` loop (above),
 where the numbers indicate which loop iteration is being computed:
 
-{% img "/img/divide-conquer-cilk_for.png", "600" %}
+{% img "/img/divide-conquer-cilk_for-8-iter.png", "700" %}
 
 Note that at each division of work, half of the remaining work is done in the child and half in the continuation. Importantly, the
 overhead of both the loop itself and of spawning new work is divided evenly along with the cost
 of the loop body.
 
+Here is the DAG for a serial loop that spawns each iteration. In this case, the work is not well
+balanced, because each child does the work of only one iteration before incurring the scheduling
+overhead inherent in entering a sync. For a short loop, or a loop in which the work in the body is
+much greater than the control and spawn overhead, there will be little measurable performance
+difference. However, for a loop of many cheap iterations, the overhead cost will overwhelm any
+advantage provided by parallelism.
+
+{% img "/img/sequential-spawn-cilk_for-8-iter.png", "700" %}
+
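 As a rough illustration of the difference between the two DAGs, the sketch below counts how many loop-control steps must happen one after another in each scheme. It is a serial model in plain C, not OpenCilk code, and the function names `serial_spawn_steps` and `divide_conquer_levels` are hypothetical. For 8 iterations the serial loop issues 8 spawns in sequence, while halving reaches single iterations after 3 levels. For 1,000,000 iterations the figures are 1,000,000 and 20.

 ```c
 #include <stddef.h>

 /* Spawns that a serial loop issues one after another when it spawns
    each of its n iterations (assumption: one spawn per iteration). */
 size_t serial_spawn_steps(size_t n) {
   size_t steps = 0;
   for (size_t i = 0; i < n; ++i) ++steps;  /* spawns are issued in order */
   return steps;
 }

 /* Levels of halving needed before a range of n iterations is reduced
    to single iterations (assumption: a grain size of 1). */
 size_t divide_conquer_levels(size_t n) {
   size_t levels = 0;
   while (n > 1) {
     n = (n + 1) / 2;  /* follow the larger half */
     ++levels;
   }
   return levels;
 }
 ```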
 ## In-place matrix transpose
 
 Let's look at in-place matrix transpose as an example of parallel loop computation.
@@ -103,4 +115,205 @@ cilk_for (int i=1; i<n; ++i) {
     A[j][i] = temp;
   }
 }
-```
+```
+
+### Specific restrictions on `cilk_for` loops
+
+In order to parallelize a loop using the "divide-and-conquer" technique, the runtime system must
+pre-compute the total number of iterations and must be able to pre-compute the value of the loop
+control variable at every iteration. To enable this computation, the control variable must act as
+an integer with respect to addition, subtraction, and comparison, even if it is a user-defined type.
+Integers, pointers, and random access iterators from the standard template library all have
+integer behavior and thus satisfy this requirement.
+
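 To make the pre-computation concrete, here is a minimal sketch in plain C of the two quantities involved: the total iteration count and the value of the control variable at a given iteration. It assumes a loop of the form `for (i = first; i < last; i += stride)` with a positive stride. The helper names `trip_count` and `index_at` are hypothetical and are not part of OpenCilk. Because both values depend only on `first`, `last`, and `stride`, any subrange of iterations can be handed to a worker without running the iterations before it.

 ```c
 #include <assert.h>

 /* Number of iterations of: for (i = first; i < last; i += stride). */
 long trip_count(long first, long last, long stride) {
   assert(stride > 0);               /* this sketch covers upward loops only */
   if (last <= first) return 0;      /* empty range */
   return (last - first + stride - 1) / stride;  /* ceiling division */
 }

 /* Value of the control variable at the start of iteration k (0-based). */
 long index_at(long first, long stride, long k) {
   return first + k * stride;
 }
 ```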
+In addition, a `cilk_for` loop has the following limitations, which do not apply to a standard
+C/C++ `for` loop. The compiler reports an error or warning for most violations.
+
+- There must be exactly one loop control variable, and the loop initialization clause must
+assign it an initial value.
+{% alert "danger" %}
+Not supported:
+```c
+cilk_for (unsigned int i, j = 42; j < 1; i++, j++)
+```
+{% endalert %}
+{% alert "success" %}
+Supported:
+```c
+cilk_for (unsigned int j = 42; j < 101; j++)
+```
+{% endalert %}
+
+- The control variable must be declared in the loop header, not outside the loop.
+{% alert "danger" %}
+Not supported:
+```c
+int i; 
+cilk_for (i = 0; i < 100; i++)
+```
+{% endalert %}
+{% alert "success" %}
+Supported:
+```c
+cilk_for (int i = 0; i < 100; i++)
+```
+{% endalert %}
+
+- The loop control variable must not be modified in the loop body.
+{% alert "danger" %}
+Not supported:
+```c
+cilk_for (unsigned int i = 1; i < 16; ++i) i = f();
+```
+{% endalert %}
+{% alert "success" %}
+Supported:
+```c
+cilk_for (unsigned int i = 1; i < 16; ++i) {
+  unsigned int j = f();  // j is local to each iteration, so there is no data race
+}
+```
+{% endalert %}
+
+- The termination and increment values are evaluated once before starting the loop and will
+not be re-evaluated at each iteration. Thus, modifying either value within the loop body will
+not add or remove iterations. 
+{% alert "danger" %}
+Not supported:
+```c
+cilk_for (unsigned int i = 1; i < x; ++i) x = f();
+```
+{% endalert %}
+{% alert "success" %}
+Supported:
+```c
+cilk_for (unsigned int i = 1; i < 16; ++i) x = f();
+```
+{% endalert %}
+
+- A `break` or `return` statement will NOT work within the body of a `cilk_for` loop; the
+compiler will generate an error message. `break` and `return` in this context are reserved for
+future speculative parallelism support.
+- A `goto` can only be used within the body of a `cilk_for` loop if the target is within the loop
+body. The compiler will generate an error message if a `goto` transfers control into or out of a
+`cilk_for` loop body.
+- A `cilk_for` loop may not be used in a constructor or destructor. It may be used in a
+function called from a constructor or destructor.
+- A `cilk_for` loop may not "wrap around." For example, in C/C++ you can write
+```c
+for (unsigned int i = 0; i != 1; i += 3);
+```
+and this has well-defined, if surprising, behavior; it means execute the loop 2,863,311,531
+times. Such a loop produces unpredictable results in OpenCilk when converted to a `cilk_for`.
+
+- A `cilk_for` may not be an infinite loop.
+{% alert "danger" %}
+Not supported:
+```c
+cilk_for (unsigned int i = 0; i < 16; i += 0);
+```
+{% endalert %}
+{% alert "success" %}
+Supported:
+```c
+cilk_for (unsigned int i = 0; i < 16; i += 2);
+```
+{% endalert %}
+
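 The iteration count quoted for the wrap-around loop in the list above can be checked without running the loop. The sketch below assumes a 32-bit `unsigned int` and uses the hypothetical helper `after_steps`, which returns the value of `i` after `k` executions of `i += 3` starting from 0. The first time `i` wraps past $2^{32}$ (after 1,431,655,766 steps) it lands on 2, not 1. It only reaches 1 on the second wrap, after 2,863,311,531 steps. Since 3 is odd, it has exactly one multiplicative inverse modulo $2^{32}$, so no smaller count works.

 ```c
 #include <stdint.h>

 /* Value of i after k executions of `i += 3`, starting from i = 0,
    in 32-bit unsigned arithmetic (which wraps modulo 2^32). */
 uint32_t after_steps(uint32_t k) {
   return (uint32_t)(3u * k);
 }
 ```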
+## `cilk_for` grain size
+
+The `cilk_for` statement divides the loop into chunks containing one or more loop iterations.
+Each chunk is executed serially and is spawned as a unit during the execution of the loop.
+The maximum number of iterations in each chunk is the grain size.
+In a loop with many iterations, a relatively large grain size can significantly reduce overhead.
+Alternatively, in a loop with few iterations, a small grain size can increase the parallelism of
+the program and thus improve performance as the number of processors increases.
+
+### Setting the Grain Size
+
+Use the `cilk_grainsize` pragma to specify the grain size for one `cilk_for` loop:
+```c
+#pragma cilk_grainsize = expression
+```
+For example, you might write:
+```c
+#pragma cilk_grainsize = 1
+cilk_for (int i=0; i<IMAX; ++i) { . . . }
+```
+If you do not specify a grain size, the system calculates a default that works well for most loops.
+The default value is set as if the following pragma were in effect:
+```c
+#pragma cilk_grainsize = min(512, N / (8*p))
+```
+where $N$ is the number of loop iterations, and $p$ is the number of workers created during the
+current program run. This formula aims for roughly 8 chunks per worker while capping the grain
+size at 512. For loops with few iterations (fewer than $8 * p$) the grain size will be set to 1, and
+each loop iteration may run in parallel. For loops with more than $4096 * p$ iterations, the grain
+size will be set to 512.
+
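 The default described above can be written out as a small plain C function. This is a sketch of the formula as stated in this tutorial, not the runtime's actual implementation, and the name `default_grainsize` is hypothetical. The lower clamp reflects the statement that loops with fewer than $8 * p$ iterations get a grain size of 1, since the integer division alone would yield 0. For example, 100 iterations on 16 workers give 1, 1024 iterations on 4 workers give 32, and 1,000,000 iterations on 8 workers give 512.

 ```c
 #include <assert.h>

 /* Default grain size for n iterations on p workers:
    min(512, n / (8*p)), but never less than 1. */
 long default_grainsize(long n, long p) {
   assert(n >= 0 && p > 0);
   long g = n / (8 * p);   /* aim for about 8 chunks per worker */
   if (g > 512) g = 512;   /* upper cap */
   if (g < 1) g = 1;       /* short loops: one iteration per chunk */
   return g;
 }
 ```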
+If you specify a grain size of zero, the default formula will be used. The result is undefined if you
+specify a grain size less than zero.
+
+Note that the expression in the pragma is evaluated at run time. For example, the following
+pragma sets the grain size based on the number of workers:
+```c
+#pragma cilk_grainsize = n/(4*cilk::current_worker_count())
+```
+
+### Loop Partitioning at Run Time
+
+The number of chunks that are executed is approximately the number of iterations $N$ divided by the grain size $K$.
+The OpenCilk compiler generates a divide-and-conquer recursion to execute the loop. In pseudocode, the control structure looks like this:
+```c
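+void run_loop(first, last)
+{
+  if ((last - first) <= grainsize)
+  {
+    // The chunk is small enough: run its iterations serially.
+    for (int i=first; i<last; ++i) LOOP_BODY;
+  }
+  else
+  {
+    // Split at the midpoint and run the two halves in parallel.
+    int mid = first + (last-first)/2;
+    cilk_scope {
+      cilk_spawn run_loop(first, mid);
+                 run_loop(mid, last);
+    }
+  }
+}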
+```
+
+In other words, the loop is split in half repeatedly until the chunk remaining is less than or equal
+to the grain size. The actual number of iterations run as a chunk will often be less than the grain
+size.
+For example, consider a `cilk_for` loop of 16 iterations:
+```c
+cilk_for (int i=0; i<16; ++i) { ... }
+```
+With a grain size of 4, this will execute exactly 4 chunks of 4 iterations each. Because each range
+is always split in half, a grain size of 5 produces the same 4 chunks of 4 iterations: each half of
+8 iterations still exceeds 5 and is split once more. Likewise, if you work through the algorithm in
+detail, you will see that for the same loop of 16 iterations, grain sizes of 2 and 3 both result in
+exactly the same partitioning of 8 chunks of 2 iterations each.
+
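 One way to work through the partitioning by hand is a serial model of the recursion in plain C. The sketch below uses a hypothetical helper, `partition`, that records the size of every chunk instead of running a loop body. It stops splitting once a range is less than or equal to the grain size, as described above, and otherwise splits at the midpoint. The caller passes an array with room for at least `last - first` entries and an initial `count` of 0. For 16 iterations it reports 4 chunks of 4 with a grain size of 4, and 8 chunks of 2 with a grain size of either 2 or 3.

 ```c
 #include <assert.h>
 #include <stddef.h>

 /* Records the size of each chunk, left to right, in sizes[] and
    returns the total number of chunks recorded so far. */
 size_t partition(size_t first, size_t last, size_t grainsize,
                  size_t *sizes, size_t count) {
   assert(grainsize >= 1);            /* a grain size of 0 would never terminate */
   if (last - first <= grainsize) {
     sizes[count] = last - first;     /* this range runs as one serial chunk */
     return count + 1;
   }
   size_t mid = first + (last - first) / 2;
   count = partition(first, mid, grainsize, sizes, count);  /* left half */
   return partition(mid, last, grainsize, sizes, count);    /* right half */
 }
 ```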
+### Selecting a Good Grain Size Value
+The default grain size usually performs well. However, here are guidelines for selecting a
+different value:
+
+- If the amount of work per iteration varies widely and if the longer iterations are likely to be
+unevenly distributed, it might make sense to reduce the grain size. This will decrease the
+likelihood that there is a time-consuming chunk that continues after other chunks have
+completed, which would result in idle workers with no work to steal.
+- If the amount of work per iteration is uniformly small, then it might make sense to increase
+the grain size. However, the default usually works well in these cases, and you don't want to
+risk reducing parallelism.
+- If you change the grain size, carry out performance testing to ensure that you've made the
+loop faster, not slower.
+- Use Cilkscale to estimate a program's work, span, and spawn overhead.
+This information can help determine the best granularity and whether it is
+appropriate to override the default grain size.
+
+Several examples (from the Cilk Plus programmer's guide) use the grain size pragma:
+
+- matrix-transpose
+- cilk-for
+- sum-cilk
diff --git a/src/img/divide-conquer-cilk_for-8-iter.png b/src/img/divide-conquer-cilk_for-8-iter.png
diff --git a/src/img/sequential-spawn-cilk_for-8-iter.png b/src/img/sequential-spawn-cilk_for-8-iter.png