\documentclass[sigconf]{acmart}
\usepackage{booktabs} % For formal tables
% Copyright
%\setcopyright{none}
%\setcopyright{acmcopyright}
%\setcopyright{acmlicensed}
\setcopyright{rightsretained}
%\setcopyright{usgov}
%\setcopyright{usgovmixed}
%\setcopyright{cagov}
%\setcopyright{cagovmixed}
% DOI
\acmDOI{10.475/123_4}
% ISBN
\acmISBN{123-4567-24-567/08/06}
%Conference
\acmConference[SPAA '19]{ACM Symposium on Parallelism in Algorithms and Architectures}{July 2019}{Phoenix, AZ, USA}
\acmYear{2019}
\copyrightyear{2019}
\acmArticle{4}
\acmPrice{15.00}
\usepackage{enumitem}
\usepackage{subcaption}
\usepackage{tikz,pgfplots}
\usepackage{etoolbox}
%% This makes the colors annoyingly bright, but at least they're easy to distinguish.
\pgfplotsset{
every tick/.style={red,}, minor x tick num=1,
cycle list={teal,every mark/.append style={fill=teal!80!black},mark=*\\%
orange,every mark/.append style={fill=orange!80!black},mark=square*\\%
cyan!60!black,every mark/.append style={fill=cyan!80!black},mark=otimes*\\%
red!70!white,mark=star\\%
lime!80!black,every mark/.append style={fill=lime},mark=diamond*\\%
red,densely dashed,every mark/.append style={solid,fill=red!80!black},mark=*\\%
yellow!60!black,densely dashed,
every mark/.append style={solid,fill=yellow!80!black},mark=square*\\%
black,every mark/.append style={solid,fill=gray},mark=otimes*\\%
blue,densely dashed,mark=star,every mark/.append style=solid\\%
red,densely dashed,every mark/.append style={solid,fill=red!80!black},mark=diamond*\\%
}
}
\input{paralleltable.tex}
\input{serialtable.tex}
\newcommand{\dec}{\operatorname{dec}}
\newcommand{\github}{\url{github.com/williamkuszmaul/Parallel-Partition}}
\newcommand{\defn}[1] {{\textit{\textbf{\boldmath #1}}}}
\renewcommand{\paragraph}[1]{\vspace{0.09in}\noindent{\bf \boldmath #1.}}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{amsthm}
\usepackage{todonotes}
\newtheorem{thm}{Theorem}[section]
\newtheorem{lem}[thm]{Lemma}
\newtheorem{prop}[thm]{Proposition}
\newtheorem{clm}[thm]{Claim}
\newtheorem{cor}[thm]{Corollary}
\newtheorem{conj}[thm]{Conjecture}
\theoremstyle{remark}
\newtheorem{rem}[thm]{Remark}
\newtheorem{ex}[thm]{Example}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{definition}[thm]{Definition}
\newtheorem{lemma}[thm]{Lemma}
\newtheorem{proposition}[thm]{Proposition}
\newtheorem{claim}[thm]{Claim}
\newtheorem{corollary}[thm]{Corollary}
\newtheorem{conjecture}[thm]{Conjecture}
\theoremstyle{remark}
\newtheorem{remark}[thm]{Remark}
\newtheorem{example}[thm]{Example}
\newtheorem{observation}[thm]{Observation}
\usepackage{hyperref}
\begin{document}
\title{Simple and Space-Efficient Parallel-Partition Algorithms using Exclusive-Read-and-Write Memory}
\subtitle{}
\author{William Kuszmaul}
\authornote{Supported by a Hertz Fellowship and an NSF GRFP Fellowship}
\affiliation{%
\institution{Massachusetts Institute of Technology}
}
\email{[email protected]}
% The default list of authors is too long for headers.
\renewcommand{\shortauthors}{William Kuszmaul}
\begin{abstract}
We present a simple in-place algorithm for parallel partition that has
work $O(n)$ and span $O(\log n \cdot \log \log n)$. The algorithm uses
only exclusive read/write shared variables, and can be implemented
using parallel-for-loops without any additional concurrency
considerations (i.e., the algorithm is in the EREW PRAM model). As an
immediate consequence, we also get a simple in-place quicksort
algorithm with work $O(n \log n)$ and span $O(\log^2 n \log \log n)$.
Using our algorithmic techniques, we implement an (almost) in-place
parallel partition. In addition to achieving much better memory
utilization, the algorithm leverages its improved cache behavior to
achieve a speedup over its out-of-place counterpart.
We also present an alternative in-place parallel-partition algorithm
with a larger span of $O(\sqrt{n \log n})$, but which is designed to
have very small overhead. We show that when the algorithm
is tuned appropriately, and given a large enough input to achieve high
parallelism, the algorithm can be made to outperform its lower-span
peers.
\end{abstract}
%
% The code below should be generated by the tool at
% http://dl.acm.org/ccs.cfm
% Please copy and paste the code instead of the example below.
%
\begin{CCSXML}
<ccs2012>
<concept>
<concept_id>10003752.10003809.10010170.10010171</concept_id>
<concept_desc>Theory of computation~Shared memory algorithms</concept_desc>
<concept_significance>500</concept_significance>
</concept>
</ccs2012>
\end{CCSXML}
\ccsdesc[500]{Theory of computation~Shared memory algorithms}
\keywords{Parallel Partition, EREW PRAM, in-place algorithms}
\maketitle
\section{Introduction}
A \defn{parallel partition} operation rearranges the elements in an
array so that the elements satisfying a particular \defn{pivot
property} appear first. In addition to playing a central role in
parallel quicksort, the parallel partition operation is used as a
primitive throughout parallel algorithms.\footnote{In several
well-known textbooks and surveys on parallel algorithms
\cite{AcarBl16,Blelloch96}, for example, parallel partitions are
implicitly used extensively to perform what are referred to as
\emph{filter} operations.}
A parallel algorithm can be measured by its \defn{work}, the time
needed to execute in serial, and its \defn{span}, the time to execute
on infinitely many processors. There is a well-known algorithm for
parallel partition on arrays of size $n$ with work $O(n)$ and span
$O(\log n)$ \cite{Blelloch96,AcarBl16}. Moreover, the algorithm uses
only exclusive read/write shared memory variables (i.e., it is an
\defn{EREW} algorithm). This eliminates the need for concurrency
mechanisms such as locks and atomic variables, and ensures good
behavior even if the time to access a location is a function of the
number of threads trying to access it (or its cache line)
concurrently. EREW algorithms also have the advantage that their
behavior is internally deterministic, meaning that
the behavior of the algorithm will not differ from run to run, which
makes test coverage, debugging, and reasoning about performance
substantially easier \cite{BlellochFi12}.
The parallel-partition algorithm suffers from using a large amount of
auxiliary memory, however. Whereas the serial algorithm is typically
implemented in place, the parallel algorithm relies on the use of two
auxiliary arrays of size $n$. To the best of our knowledge, the only
known linear-work and $\operatorname{polylog}(n)$-span algorithms for
parallel partition that are in-place require the use of atomic
operations (e.g., fetch-and-add)
\cite{HeidelbergerNo90,AxtmannWi17,TsigasZh03}.
An algorithm's memory efficiency can be critical on large inputs. The
memory consumption of an algorithm determines the largest problem size
that can be executed in memory. Many external memory algorithms (i.e.,
algorithms for problems too large to fit in memory) perform large
subproblems in memory; the size of these subproblems is again
bottlenecked by the algorithm's memory-overhead \cite{Vitter08}. In
multi-user systems, memory efficiency is also important on small
inputs, since processes with larger memory-footprints can hog the
cache, slowing down other processes.
For sorting algorithms, in particular, special attention to memory
efficiency is often given. This is because (a) a user calling the sort
function may already be using almost all of the memory in the system;
and (b) sorting algorithms, and especially parallel sorting
algorithms, are often bottlenecked by memory bandwidth.
In the context of parallel algorithms, however, the most practically
efficient sorting algorithms fail to run in place, at least without
the additional use of atomic-fetch-and-add variables
\cite{HeidelbergerNo90, AxtmannWi17, TsigasZh03}, or the loss of
theoretical guarantees on parallelism \cite{FrancisPa92}. Parallel
merge sort \cite{Hagerup89} was made in-place by Katajainen
\cite{Katajainen93}, but has proven too sophisticated for practical
applications. Bitonic sort \cite{BlellochLe98} is naturally in-place,
and can be practical in certain applications on supercomputers, but
suffers in general from requiring work $\Theta(n \log^2 n)$ rather
than $O(n \log n)$. Parallel quicksort, on the other hand, despite the
many efforts to optimize it \cite{HeidelbergerNo90, AxtmannWi17,
TsigasZh03, FrancisPa92, Frias08}, has eluded any in-place EREW (or
even CREW) algorithms due to its reliance on parallel
partition.\footnote{In a \defn{CREW} algorithm, reads may be
concurrent, but writes may not. CREW stands for
\emph{concurrent-read exclusive-write}.}
\paragraph{Results}
We present a simple in-place algorithm for parallel partition that has
work $O(n)$ and span $O(\log n \cdot \log \log n)$. The algorithm uses
only exclusive read/write shared variables, and can be implemented
using parallel-for-loops without any additional concurrency
considerations. As an immediate consequence, we also get a simple
in-place quicksort algorithm with work $O(n \log n)$ and span
$O(\log^2 n \log \log n)$.
Using our algorithmic techniques, we implement and optimize a
space-efficient parallel partition. Because the in-place algorithm
eliminates the use of large auxiliary arrays, the algorithm is able to
achieve a significant reduction in cache misses over its out-of-place
counterpart, resulting in performance improvements both in serial and
in parallel.
We also present an alternative in-place parallel-partition algorithm
with a larger span of $O(\sqrt{n \log n})$, but which is designed to
have very small engineering overhead. On a single core, the algorithm
performs within 25\% of the serial GNU Libc quicksort partition
algorithm. We show that when the algorithm is tuned appropriately, and
given a large enough input to achieve high parallelism, the algorithm
can be made to outperform its lower-span peers. The low-span
algorithms, on the other hand, ensure good scaling with less
sensitivity to the number of cores and the input size.
\section{Preliminaries}\label{secprelim}
We begin by describing the parallelism and memory model used in
the paper, and by presenting background on parallel partition.
\paragraph{Workflow Model} We consider a language-based model in which algorithms may achieve parallelism
through the use of \defn{parallel-for-loops}, though our algorithm
also works in the less restrictive PRAM model
\cite{Blelloch96,AcarBl16}. A parallel-for-loop is given a range $R
\in \mathbb{N}$, a constant number of arguments $\arg_1, \arg_2,
\ldots, \arg_c$, and a body of code. For each $i \in \{1, \ldots,
R\}$, the loop launches a thread that is given loop-counter $i$ and
local copies of the arguments $\arg_1, \arg_2, \ldots, \arg_c$. The
threads then perform the body of the loop.\footnote{Note that
parallel-for-loops also implicitly allow for the implementation of
parallel recursion by placing recursive function calls in the body
of the parallel-for-loop.}
A parallel algorithm may be run on an arbitrary number $p$ of
processors. The algorithm itself is oblivious to $p$, however, leaving
the assignment of threads to processors up to a scheduler.
The \defn{work} $T_1$ of an algorithm is the time that the algorithm
would require to execute on a single processor. The \defn{span}
$T_\infty$ of an algorithm is the time to execute on infinitely many
processors. The scheduler is assumed to contribute no overhead to the
span. In particular, if the body of a
parallel-for-loop has span $s$, then the full parallel loop has span
$s + O(1)$ \cite{Blelloch96,AcarBl16}.
The work $T_1$ and span $T_\infty$ can be used to quantify the time $T_p$
that an algorithm requires to execute on $p$ processors using a greedy
online scheduler. If the scheduler is assumed to contribute no
overhead, then Brent's Theorem \cite{Brent74} states that for any
$p$,
$$T_1 / p \le T_p \le T_1 / p + T_\infty.$$
The work-stealing algorithms used in the Cilk extension of C/C++ realize
the guarantee offered by Brent's Theorem within a constant factor
\cite{BlumofeJo96,BlumofeLe99}, with the added caveat that parallel-for-loops typically induce an additional additive overhead of $O(\log
R)$.
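For example, instantiating Brent's Theorem with the bounds achieved
by the in-place parallel-partition algorithm of Section \ref{secalg}
(work $T_1 = O(n)$ and span $T_\infty = O(\log n \cdot \log \log n)$)
gives
$$T_p \le O\left(\frac{n}{p} + \log n \cdot \log \log n\right),$$
which is within a constant factor of the ideal $T_1 / p$ whenever $p
\le O(n / (\log n \cdot \log \log n))$.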
\paragraph{Memory Model} Memory is \defn{exclusive-read} and \defn{exclusive-write}. That is, no two threads are ever permitted to attempt to read or write to the same variable concurrently. The exclusive-read exclusive-write memory model is sometimes referred to as the \defn{EREW model} (see, e.g., \cite{Hagerup89}).
In an \defn{in-place} algorithm, each thread is given $O(\log n)$
memory upon creation that is deallocated when the thread dies. This
memory can be shared with the thread's children. The depth of the
parent-child tree is not permitted to exceed $O(\log n)$.\footnote{The
algorithm in this paper satisfies a slightly stronger property that
the total memory being used is never more than $O(\log n) \cdot p$,
where $p$ is an upper-bound on the number of worker threads.}
\paragraph{The Parallel Partition Problem}
The parallel partition problem takes an input array $A$ of size $n$,
and a \defn{decider function} $\dec$ that determines for each element
$A[i] \in A$ whether or not $A[i]$ is a \defn{predecessor} or a
\defn{successor}. That is, $\dec(A[i]) = 1$ if $A[i]$ is a
predecessor, and $\dec(A[i]) = 0$ if $A[i]$ is a successor. The
behavior of the parallel partition is to reorder the elements in the
array $A$ so that the predecessors appear before the successors.
\paragraph{The (Standard) Linear-Space Parallel Partition} The linear-space implementation of parallel partition consists of two phases \cite{Blelloch96,AcarBl16}:
\noindent\emph{The Parallel-Prefix Phase: }In this phase, the algorithm
constructs an array $B$ whose $i$-th element $B[i] = \sum_{j = 1}^i
\dec(A[j])$ is the number of predecessors in the first $i$ elements of
$A$. The transformation from $A$ to $B$ is called a \defn{parallel
prefix sum} and can be performed with $O(n)$ work and $O(\log n)$
span using a simple recursive algorithm: (1) First construct an array
$A'$ of size $n / 2$ with $A'[i] = A[2i - 1] + A[2i]$; (2)
Recursively construct a parallel prefix sum $B'$ of $A'$; (3) Build
$B$ by setting each $B[i] = B'[\lfloor i / 2 \rfloor] + A[i]$ for odd
$i$ and $B[i] = B'[i / 2]$ for even $i$.
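For concreteness, the following is a minimal serial C++ sketch of this
recursive construction (the function name and types are ours; both
marked loops are parallel-for-loops in the actual algorithm, and the
sketch assumes that $n$ is a power of two and that the input array
already holds the values $\dec(A[i])$):
\begin{verbatim}
#include <cstddef>
#include <vector>

// Serial sketch of the recursive prefix sum described above.
// Input: d[0..n-1] holding dec(A[1]), ..., dec(A[n]).
// Output: b[0..n-1] with b[i] = d[0] + ... + d[i].
// Assumes n is a power of two for simplicity.
std::vector<long> prefix_sum(const std::vector<long>& d) {
  std::size_t n = d.size();
  if (n == 1) return {d[0]};
  // Step (1): pairwise sums (a parallel-for-loop in the algorithm).
  std::vector<long> dp(n / 2);
  for (std::size_t i = 0; i < n / 2; i++)
    dp[i] = d[2 * i] + d[2 * i + 1];
  // Step (2): recurse on the half-length array.
  std::vector<long> bp = prefix_sum(dp);
  // Step (3): expand back out (a parallel-for-loop in the algorithm).
  std::vector<long> b(n);
  for (std::size_t i = 0; i < n; i++) {
    if (i % 2 == 0)   // an odd position in the paper's 1-indexing
      b[i] = (i == 0 ? 0 : bp[i / 2 - 1]) + d[i];
    else              // an even position in the paper's 1-indexing
      b[i] = bp[i / 2];
  }
  return b;
}
\end{verbatim}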
\noindent\emph{The Reordering Phase: }In this phase, the algorithm
constructs an output-array $C$ by placing each predecessor $A[i] \in A$
in position $B[i]$ of $C$. If there are $t$ predecessors in $A$, then
the first $t$ elements of $C$ will now contain those $t$ predecessors
in the same order that they appear in $A$. The algorithm then places
each successor $A[i] \in A$ in position $t + i - B[i]$. Since $i - B[i]$
is the number of successors in the first $i$ elements of $A$, this
places the successors in $C$ in the same order that they appear in
$A$. Finally, the algorithm copies $C$ into $A$, completing the
parallel partition.
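A corresponding serial sketch of the Reordering Phase is below (helper
names are ours; in the algorithm both loops are parallel-for-loops,
and the writes into $C$ never conflict because each element is
assigned a distinct position):
\begin{verbatim}
#include <cstddef>
#include <vector>

// Serial sketch of the reordering phase. b[i] is the number of
// predecessors among a[0..i] (the prefix sums computed above), and
// is_pred plays the role of the decider function dec.
template <typename T, typename Dec>
void reorder(std::vector<T>& a, const std::vector<long>& b, Dec is_pred) {
  long n = (long)a.size();
  long t = (n > 0) ? b[n - 1] : 0;        // total number of predecessors
  std::vector<T> c(a.size());
  for (long i = 0; i < n; i++) {          // parallel-for in the algorithm
    if (is_pred(a[i])) c[b[i] - 1] = a[i];     // the b[i]-th predecessor
    else               c[t + i - b[i]] = a[i]; // successors keep their order
  }
  for (long i = 0; i < n; i++) a[i] = c[i];    // copy back (parallel-for)
}
\end{verbatim}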
Both phases can be implemented with $O(n)$ work and $O(\log n)$
span. Like its serial out-of-place counterpart, the algorithm is
stable but not in place. The algorithm uses two auxiliary arrays of
size $n$. Liu, Knowles, and Davis \cite{LiuKn05} were able to reduce
the extra space consumption to $n + p$ under the assumption that the
number of processors $p$ is hard-coded; their algorithm breaks the
array $A$ into $p$ parts and assigns one part to each thread. Reducing
the extra space consumption to $o(n)$ has remained open until now, even when the
number of threads is fixed.
\section{A Simple In-Place Algorithm}\label{secalg}
In this section, we present a simple in-place algorithm for parallel
partition.
We assume without loss of generality that the total number of
successors in $A$ exceeds the number of predecessors, since otherwise
their roles can simply be swapped in the algorithm. Further, we assume
for simplicity that the elements of $A$ are distinct; this assumption
is removed at the end of the section.
\paragraph{Algorithm Outline}
We begin by presenting an overview of the key algorithmic ideas needed
to construct an in-place algorithm.
Consider how to remove the auxiliary array $C$ from the Reordering
Phase. If one attempts to simply swap in parallel each predecessor
$A[i]$ with the element in position $j = B[i]$ of $A$, then the swaps
will almost certainly conflict. Indeed, $A[j]$ may also be a
predecessor that needs to be swapped with $A[B[j]]$. Continuing like
this, there may be an arbitrarily long list of dependencies on the
swaps.
To combat this, we begin the algorithm with a Preprocessing Phase in
which $A$ is rearranged so that every prefix is
\defn{successor-heavy}, meaning that for all $t$, the first $t$ elements contain at
least $\frac{t}{4}$ successors. Then we compute the
prefix-sum array $B$, and begin the Reordering Phase. Using the
fact that the prefixes of $A$ are successor-heavy, the reordering can
now be performed in place as follows: (1) We begin by recursively
reordering the prefix $P$ of $A$ consisting of the first $4/5 \cdot n$
elements, so that the predecessors appear before the successors; (2)
Then we simply swap each predecessor $A[i]$ outside of $P$ with the
element in position $B[i]$ of $A$. The fact that the prefix $P$ is
successor-heavy ensures that the final $\frac{1}{5} \cdot n$ elements
of the reordered $P$ will consist of successors. This implies that for
each of the swaps between predecessors $A[i]$ and earlier positions
$B[i]$, the latter position will contain a successor. In other words,
the swaps are now conflict free.
Next consider how to remove the array $B$ from the Parallel-Prefix
Phase. At face value, this would seem quite difficult since the
reordering phase relies heavily on $B$. Our solution is to
\emph{implicitly} store the value of every $O(\log n)$-th element of
$B$ in the ordering of the elements of $A$. That is, we break $A$ into
blocks of size $O(\log n)$, and use the order of the elements in each
block to encode an entry of $B$. (If the elements are not all
distinct, then a slightly more sophisticated encoding is necessary.)
Moreover, we modify the algorithm for building $B$ to only construct
every $O(\log n)$-th element and to perform the construction using
implicitly stored values as well. The new parallel-prefix sum performs
$O(n / \log n)$ arithmetic operations on values that are implicitly
encoded in blocks; since each such operation requires $O(\log n)$
work, the total work remains linear.
In the remainder of the section, we present the algorithm in detail.
It proceeds in three phases.
\paragraph{A Preprocessing Phase} Recall that for each $t \in \{1, \ldots, n\}$, we call the $t$-prefix $A[1], \ldots, A[t]$ of $A$ successor-heavy if it contains at least $\frac{t}{4}$ successors. The goal of the preprocessing phase is to rearrange $A$ so that every prefix is successor-heavy.
We begin with a parallel-for-loop: For each $i = 1, \ldots, \lfloor n
/ 2\rfloor$, if $A[i]$ is a predecessor and $A[n - i + 1]$ is a
successor, then we swap their positions in $A$.
This ensures that at least half the successors in $A$ reside in the
first $\lceil n / 2 \rceil$ positions. Thus the first $\lceil n / 2
\rceil$ positions contain at least $\lceil n / 4\rceil$ successors,
making every $t$-prefix with $t \ge \lceil n / 2 \rceil$
successor-heavy.
Since $\lceil n / 4 \rceil \ge \frac{\lceil n / 2 \rceil}{2}$, the
first $\lceil n / 2 \rceil$ positions of $A$ now contain at least as
many successors as predecessors. Thus we can recursively repeat the
same process on the subarray $[A[1], \ldots, A[\lceil n / 2 \rceil]]$
in order to make each of its prefixes successor-heavy.
Each recursive step has constant span and performs work proportional
to the size of the subarray being considered. The preprocessing phase
therefore has total work $O(n)$ and span $O(\log n)$.
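A serial sketch of this phase is as follows (names are ours; the inner
loop is the parallel-for-loop, and the recursion on the first half is
unrolled into a while-loop):
\begin{verbatim}
#include <cstddef>
#include <utility>
#include <vector>

// Serial sketch of the preprocessing phase: after the call, every
// prefix of a is successor-heavy (assuming successors are in the
// majority, as in the text).
template <typename T, typename Dec>
void make_prefixes_successor_heavy(std::vector<T>& a, Dec is_pred) {
  std::size_t m = a.size();   // length of the prefix being processed
  while (m > 1) {
    // Parallel-for-loop in the algorithm: pair position i with its
    // mirror m-1-i, and swap whenever that moves a successor forward.
    for (std::size_t i = 0; i < m / 2; i++)
      if (is_pred(a[i]) && !is_pred(a[m - 1 - i]))
        std::swap(a[i], a[m - 1 - i]);
    // Recurse on (i.e., shrink to) the first ceil(m/2) positions.
    m = (m + 1) / 2;
  }
}
\end{verbatim}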
\paragraph{An Implicit Parallel Prefix Sum} Pick a \defn{block-size} $b \in \Theta(\log n)$ satisfying $b \ge 2 \lceil \log (n + 1) \rceil$. Consider $A$ as a series of $\lfloor n / b \rfloor$ blocks of size $b$, with the final block of size between $b$ and $2b - 1$. Denote the blocks by $X_1, \ldots, X_{\lfloor n / b \rfloor}$.
Within each block $X_i$, we can implicitly store a value in the range
$0, \ldots, n$ through the ordering of the elements. In particular,
$X_i$ can be broken into (at least) $\lceil \log (n + 1) \rceil$
disjoint pairs of adjacent elements, and by rearranging the order in
which a given pair $(x_j, x_{j + 1})$ occurs, the lexicographic
comparison of whether $x_j < x_{j + 1}$ can be used to encode one bit
of information. Values $v \in [0,n]$ can therefore be read from and written to $X_i$ with
work $O(b) = O(\log n)$ and span $O(\log b) = O(\log \log n)$ using a
simple divide-and-conquer recursive approach.
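As an illustration, here is a serial sketch of reading and writing a
value within a block (names are ours; it assumes the elements of the
block are distinct, and in the algorithm the loops are replaced by a
divide-and-conquer recursion so that the span is $O(\log \log n)$
rather than $O(\log n)$):
\begin{verbatim}
#include <cstdint>
#include <utility>
#include <vector>

// Pair j of the block encodes bit j of v: block[2j] < block[2j+1]
// encodes a 1, and block[2j] > block[2j+1] encodes a 0.
void encode_value(std::vector<std::uint64_t>& block, std::uint64_t v,
                  int num_bits) {
  for (int j = 0; j < num_bits; j++) {
    bool want_one = (v >> j) & 1;
    bool is_one = block[2 * j] < block[2 * j + 1];
    if (want_one != is_one) std::swap(block[2 * j], block[2 * j + 1]);
  }
}

std::uint64_t decode_value(const std::vector<std::uint64_t>& block,
                           int num_bits) {
  std::uint64_t v = 0;
  for (int j = 0; j < num_bits; j++)
    if (block[2 * j] < block[2 * j + 1]) v |= std::uint64_t(1) << j;
  return v;
}
\end{verbatim}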
After the preprocessing phase, our algorithm performs a parallel-for
loop through the blocks, and stores in each block $X_i$ a value $v_i$
equal to the number of predecessors in the block. This can be done
in place with work $O(n)$ and span $O(\log \log n)$.
The algorithm then performs an in-place parallel-prefix operation on
the values $v_1, \ldots, v_{\lfloor n / b \rfloor}$ stored in the
blocks. This is done by first resetting each even-indexed value
$v_{2i}$ to $v_{2i} + v_{2i - 1}$; then recursively performing a
parallel-prefix sum on the even-indexed values; and then replacing
each odd-indexed $v_{2i + 1}$ with $v_{2i + 1} + v_{2i}$, where $v_0$
is defined to be zero. If the $v_i$'s could be read and written in
constant time, then the prefix sum would take work $O(n)$
and span $O(\log n)$. Since each $v_i$ actually requires work $O(\log
n)$ and span $O(\log \log n)$ to read/write, the prefix sum takes work
$O(n)$ and span $O(\log n \cdot \log \log n)$.
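The following serial sketch shows this in-place prefix sum on an
explicit array of values (names are ours; in the algorithm each $v_i$
is instead read and written implicitly from block $X_i$, and both
loops are parallel-for-loops):
\begin{verbatim}
#include <cstddef>
#include <vector>

// Serial sketch of the in-place prefix sum on v[1..k] (1-indexed to
// match the text; v has size k+1 and v[0] is unused). At stride s,
// the "active" values are v[s], v[2s], v[3s], ...
void in_place_prefix_sum(std::vector<long>& v, std::size_t k,
                         std::size_t stride = 1) {
  if (2 * stride > k) return;  // at most one active value: nothing to do
  // Fold each odd-indexed active value into the even-indexed one after it.
  for (std::size_t i = 2 * stride; i <= k; i += 2 * stride)
    v[i] += v[i - stride];
  // Recurse on the even-indexed active values.
  in_place_prefix_sum(v, k, 2 * stride);
  // Add each even-indexed prefix into the odd-indexed value after it
  // (the first active value has v_0 = 0 before it, so it is unchanged).
  for (std::size_t i = 3 * stride; i <= k; i += 2 * stride)
    v[i] += v[i - stride];
}
\end{verbatim}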
At the end of this phase of the algorithm, the array $A$ satisfies two
important properties: (1) Every block $X_i$ encodes a value $v_i$
counting the number of predecessors in the prefix $X_1 \circ X_2 \circ
\cdots \circ X_i$; and (2) Each prefix $X_1 \circ X_2 \circ \cdots
\circ X_i$ is successor-heavy.
\paragraph{In-Place Reordering} In the final phase of the algorithm, we reorder $A$ so that the predecessors appear before the successors. Let $P = X_1 \circ X_2 \circ \cdots \circ X_t$ be the smallest prefix of blocks that contains at least $4/5$ of the elements in $A$. We begin by recursively reordering the elements in $P$ so that the predecessors appear before the successors; as a base case, when $|P| \le 5b = O(\log n)$, we simply perform the reordering in serial.
After $P$ has been reordered, it will be of the form $P_1
\circ P_2$ where $P_1$ contains only predecessors and $P_2$ contains
only successors. Because $P$ is successor-heavy, we have that $|P_2|
\ge |P| / 4$, and thus that $|P_2| \ge |X_{t + 1} \circ \cdots \circ
X_{\lfloor n / b \rfloor}|$.
To complete the reordering of $A$, we perform a parallel-for-loop
through each of the blocks $X_{t + 1}, \ldots, X_{\lfloor n / b \rfloor}$. For each block
$X_i$, we first extract $v_i$ (with work $O(\log n)$ and span $O(\log
\log n)$). We then create an auxiliary array $Y_i$ of size $|X_i|$,
using $O(\log n)$ memory from the thread in charge of $Y_i$ in the
parallel-for-loop. Using a parallel-prefix sum (with work $O(\log n)$
and span $O(\log \log n)$), we set each $Y_i[j]$ equal to $v_{i - 1}$
plus the number of predecessors in $X_i[1], \ldots, X_i[j]$, where
$v_{i - 1}$ is obtained locally as $v_i$ minus the number of
predecessors in $X_i$. In other words, $Y_i[j]$ equals the number of
predecessors in $A$ appearing at or before $X_i[j]$.
After creating $Y_i$, we then perform a parallel-for-loop through the
elements $X_i[j]$ of $X_i$ (note we are still within another parallel
loop through the $X_i$'s), and for each predecessor $X_i[j]$, we swap
it with the element in position $Y_i[j]$ of the array $A$. Critically,
because $|P_2| \ge |X_{t + 1} \circ \cdots \circ X_{\lfloor n / b \rfloor}|$, we are
guaranteed that the element with which $X_i[j]$ is being swapped is a
successor in $P_2$. After the swaps have been performed, all of the
elements of $X_i$ are now successors.
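The swap step for a single block can be sketched serially as follows
(names are ours; a pivot comparison stands in for the decider
function, the parameter preds\_before plays the role of $v_{i - 1}$,
and in the algorithm the loop is a parallel-for-loop over the block,
with $Y_i$ built by a small parallel prefix sum rather than
incrementally):
\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Serial sketch of the swap step for one block X_i, which occupies
// a[lo..hi). preds_before is the number of predecessors in A strictly
// before the block (v_{i-1}). Assumes the prefix P has already been
// reordered, so every target position holds a successor of P_2 and
// the swaps never conflict.
void swap_block_predecessors(std::vector<std::uint64_t>& a,
                             std::size_t lo, std::size_t hi,
                             std::uint64_t pivot, std::size_t preds_before) {
  std::size_t y = preds_before;    // running value of Y_i[j]
  for (std::size_t j = lo; j < hi; j++) {
    if (a[j] < pivot) {            // a[j] is a predecessor
      y++;                         // predecessors at or before a[j]
      std::swap(a[j], a[y - 1]);   // position Y_i[j] in 1-indexing
    }
  }
}
\end{verbatim}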
Once the outer for-loop through the $X_i$'s is complete, so too is
the parallel partition of $A$. The total work in the reordering phase
is $O(n)$ since each $X_i$ appears in a parallel-for-loop at exactly
one level of the recursion, and incurs $O(\log n)$ work. The total
span of the reordering phase is $O(\log n \cdot \log \log n)$, since
there are $O(\log n)$ levels of recursion, and within each level of
recursion each $X_i$ in the parallel-for-loop incurs span $O(\log \log
n)$.
Combining the phases, the full algorithm has work $O(n)$ and span
$O(\log n \cdot \log \log n)$. Thus we have:
\begin{theorem}
There exists an in-place algorithm using exclusive-read-write
variables that performs parallel-partition with work $O(n)$ and span
$O(\log n \cdot \log \log n)$.
\label{thminplace}
\end{theorem}
\paragraph{Allowing for Repeated Elements} In proving Theorem \ref{thminplace} we assumed for simplicity that the elements of $A$ are distinct. This plays an important role in how we store the values $v_i$ in the blocks $X_i$. To eliminate this requirement without changing the work and span of the algorithm, we can require that $b \ge 4 \lceil \log (n + 1) \rceil + 2$, and use the following slightly more complex encoding of the $v_i$'s.
Consider the first $b$ elements of $X_i$ as a sequence of pairs, given
by $(x_1, x_2), \ldots, (x_{b - 1}, x_b)$. If at least half of the
pairs consist of distinct elements, then we can reorder those pairs to
appear at the front of $X_i$, and use them to encode values
$v_i$. (For each $X_i$ this reordering can be done once before the
Implicit-Parallel-Prefix-Sum phase, adding only linear work and
logarithmic span to the full algorithm.) If, on the other hand, at
least half the pairs consist of equal-value elements, then we can
reorder the pairs so that the first $\lceil \log (n + 1) \rceil + 1$
of them satisfy this property. That is, if we reindex the reordered
elements as $x_1, x_2, \ldots$, then $x_{2j + 1} = x_{2j + 2}$ for each
$j = 0, 1, \ldots, \lceil \log (n + 1) \rceil$. To encode a value
$v_i$, we explicitly overwrite the second element in each of the pairs
$(x_3, x_4), (x_5, x_6), \ldots$ with the bits of $v_i$, overwriting
each element with one bit.
To read the value $v_i$, we check whether $x_1 = x_2$ in order to
determine which encoding is being used and then unencode the bits
appropriately. In the Reordering phase of the algorithm, once the
blocks $X_i$ are no longer required to encode values, we can replace
each overwritten $x_i$ with its correct value (given by the value of
the preceding element).
\section{Experimental Evaluation}\label{secexp}
\begin{figure*}
\begin{center}
\CILKsorttable
\end{center}
\caption{We compare the performance of the low-space and high-span
sorting implementations running on varying numbers of threads and
input sizes. The $x$-axis is the number of worker threads, the
$y$-axis is the multiplicative speedup when compared to the serial
baseline, and the log-base-two size of the input is indicated for
each curve in the key. Each time (including each serial baseline)
is averaged over five trials.}
\label{tablesort}
\end{figure*}
\begin{figure*}
\begin{center}
\CILKtable
\end{center}
\caption{For a fixed input size $n = 2^{28}$, we compare each
implementation's runtime to a serial baseline, which takes 0.96
seconds to complete (averaged over five trials). The $x$-axis
plots the number of worker threads being used, and the $y$-axis
plots the multiplicative speedup over the serial baseline. Each
time is averaged over five trials.}
\label{tablecilk}
\end{figure*}
\begin{figure*}
\begin{center}
\cilktwotabletwo
\end{center}
\caption{We compare the performance of the implementations running
on eighteen worker threads on varying input sizes. The $x$-axis is
the log-base-$2$ of the input size, and the $y$-axis is the
multiplicative speedup when compared to the serial baseline. Each
time (including each serial baseline) is averaged over five
trials.}
\label{tablecilk2}
\end{figure*}
\begin{figure*}
\begin{center}
\serialtable
\end{center}
\caption{We compare the performance of the implementations in
serial, with no scheduling overhead. The $x$-axis is the
log-base-$2$ of the input size, and the $y$-axis is the
multiplicative slowdown when compared to the serial baseline. Each
time (including each serial baseline) is averaged over five
trials.}
\label{tableserial}
\end{figure*}
\begin{figure*}
\begin{center}
\partitionbandwidthboundtable
\end{center}
\caption{We compare the performances of the low-space and high-span
parallel-partition algorithms to their ideal performance
determined by memory-bandwidth constraints on inputs of size
$2^{28}$. The $x$-axis is the number of worker threads, and the
$y$-axis is the multiplicative speedup when compared to the serial
baseline (which is computed by an average over five trials). Each
data-point is averaged over five trials.}
\label{tablebandwidth}
\end{figure*}
In this section, we implement the techniques from Section \ref{secalg}
to build a space-efficient parallel-partition function. Our
implementation considers an array of $n$ 64-bit integers, and
partitions them based on a pivot. The integers in the array are
initially generated so that each is randomly either larger or smaller
than the pivot.
In Subsection \ref{subseclowspan}, we compare the performance of three
parallel-partition implementations: (1) The \defn{high-space}
implementation which follows the standard parallel-partition algorithm
exactly; (2) a \defn{medium-space} implementation which reduces the
space used for the parallel-prefix step; and (3) a \defn{low-space}
implementation which further eliminates the auxiliary space used in the
reordering phase of the algorithm. The low-space implementation still
uses a small amount of auxiliary memory for the parallel-prefix,
storing every $O(\log n)$-th element of the parallel-prefix array
explicitly rather than using the implicit-storage approach in Section
\ref{secalg}. Nonetheless the space consumption is several orders of
magnitude smaller than the original algorithm.
In addition to achieving a space reduction, the better cache behavior
of the low-space implementation allows it to achieve a speed
advantage over its peers, in some cases completing roughly twice as
fast as the medium-space implementation and four times as fast as the
high-space implementation.
In Subsection \ref{subsechighspan}, we present a fourth
parallel-partition algorithm which we call the \defn{two-layer
algorithm}, and which runs fully in place but has a polynomial span
of $\Theta(\sqrt{n \log n})$. The polynomial span of the algorithm
makes it so that a naive implementation performs poorly. Nonetheless,
we show that by tuning the algorithm to the number of worker threads,
further speedup can often be achieved over the low-space algorithm.
The two-layer algorithm has the advantage that it is very simple to
implement, and runs in serial at almost the same speed as GNU Libc
quicksort's serial algorithm. On the other hand the algorithm's
performance is much more sensitive to input size and number of cores
than is the low-space implementation. On a machine with sufficiently
many cores (and sufficiently large memory bandwidth), the
polylogarithmic span of the low-space implementation is desirable.
\paragraph{Machine Details}
Our experiments are performed on a two-socket machine with eighteen
2.9 GHz Intel Xeon E5-2666 v3 processors. To maximize the memory
bandwidth of the machine, we use a NUMA memory-placement policy in
which memory allocation is spread out evenly across the nodes of the
machine; this is achieved using the \emph{interleave=all} option in
the Linux \emph{numactl} tool \cite{Kleen05}. Worker threads in our
experiments are each given their own core, with hyperthreading
disabled.
Our algorithms are implemented using the CilkPlus task parallelism
library in C++. The implementations avoid the use of concurrency
mechanisms and atomic operations, but do allow for concurrent reads to
be performed on shared values such as $n$ and the pointer to the input
array. Our code is compiled using g++ 7.3.0, with \emph{march=native}
and at optimization level three.
Our implementations are available at \\ \github.
\subsection{Comparing Low-Span Algorithms}\label{subseclowspan}
In this section, we compare four partition implementations:
\begin{itemize}[leftmargin = .15in]
\item \emph{A Serial Baseline:} This uses the serial in-place
partition implementation from GNU Libc quicksort, with minor
adaptations to optimize it for the case of sorting 64-bit integers
(i.e., inlining the comparison function, etc.).
\item \emph{The High-Space Parallel Implementation:} This uses the
standard parallel partition algorithm \cite{Blelloch96,AcarBl16}, as
described in Section \ref{secprelim}. The space overhead is roughly
$2n$ eight-byte words.
\item \emph{The Medium-Space Parallel Implementation:} Starting with
the high-space implementation, we reduce the space used by the
Parallel-Prefix Phase by only constructing every $O(\log n)$-th
element of the prefix-sum array $B$, as in Section
\ref{secalg}. (Here $O(\log n)$ is hard-coded as 64.) The array $B$
is initialized to be of size $n / 64$, with each component equal to
a sum of 64 elements, and then a parallel prefix sum is computed on
the array. Rather than implicitly encoding the elements of $B$ in
$A$, we use an auxiliary array of size $n / 64$ to explicitly store
the prefix sums.\footnote{We suspect that an implementation in which
the values are implicitly stored could also be made fast. In
particular, the value 64 can be increased to compensate for
whatever constant overhead is induced by the implicit storage of
values. Nonetheless, the auxiliary array is already quite small
relative to the input and is more practical to implement.} The algorithm
has a space overhead of $\frac{n}{32} + n$ eight-byte
words.\footnote{In addition to the auxiliary array of size $n / 64$,
we use a series of smaller arrays of sizes $n / 128, n / 256,
\ldots$ in the recursive computation of the prefix sum. The
alternative of performing the parallel-prefix sum in place, as in
Section \ref{secalg}, tends to be less cache-friendly in
practice.}
\item \emph{The Low-Space Parallel Implementation:}
Starting with the medium-space implementation, we make the reordering
phase completely in-place using the preprocessing technique in Section
\ref{secalg}.\footnote{Depending on whether the majority of elements
are predecessors or successors, the algorithm goes down separate
(but symmetric) code paths. In our timed experiments we test only
with inputs containing more predecessors than successors, since this
is the slower of the two cases (by a very slight amount) for our
implementation.} The only space overhead in this implementation is
the $\frac{n}{32}$ additional 8-byte words used in the prefix sum.
\end{itemize}
All three parallel implementations have work $O(n)$. The high- and
medium-space implementations have span $O(\log n)$, while the
low-space implementation has span $O(\log^2 n)$ (due to the fact that,
for convenience of implementation, parallel-for-loops are broken into
chunks of size $64 = O(\log n)$).
We remark that the ample parallelism of the low-space algorithm makes
it so that for large inputs the value $64$ can easily be increased
substantially without negatively affecting algorithm performance. For
example, on an input of size $2^{28}$, increasing it to $4096$ has
essentially no effect on the empirical runtime while bringing the
auxiliary space-consumption down to a $\frac{1}{2048}$-fraction of the
input size. (In fact, the increase from 64 to 4096 results in roughly
a 5\% speedup.)
\paragraph{An Additional Optimization for The High-Space Implementation}
The optimization of reducing the size of the prefix-sum problem by a factor of $O(\log n)$
at the top level of recursion, rather than simply by a factor of two,
can also be applied to the standard parallel-prefix algorithm when
constructing a prefix-sum array of size $n$. Even without the space
reduction, this reduces the (constant) overhead in the parallel prefix
sum, while keeping the overall span of the parallel-prefix operation
at $O(\log n)$. We perform this optimization in the high-space
implementation.
\paragraph{Performance Comparison}
Figure \ref{tablecilk} graphs the speedup of each of the parallel
algorithms over the serial algorithm, using varying numbers of worker
threads on an 18-core machine with a fixed input size of $n =
2^{28}$. Both space optimizations result in performance improvements,
with the low-space implementation performing almost twice as well as
the medium-space implementation on eighteen threads, and almost four
times as well as the high-space implementation. Similar speedups are
achieved on smaller inputs; see Figure \ref{tablecilk2}, which graphs
speedup for input sizes starting at $2^{23}$.
Figure \ref{tableserial} compares the performances of the
implementations in serial. Parallel-for-loops are replaced with serial
for-loops to eliminate scheduler overhead. As the input-size varies,
the ratios of the runtimes vary only slightly. The low-space
implementation performs within a factor of roughly 1.8 of the serial
implementation. As in Figures \ref{tablecilk} and \ref{tablecilk2},
both space optimizations result in performance improvements.
\paragraph{The Source of the Speedup}
If we compare the number of instructions performed by the three
parallel implementations, then the medium-space algorithm would seem
to be the clear winner. Using Cachegrind to profile the number of
instructions performed in a (serial) execution on an input of size
$2^{28}$, the high-space, medium-space, and low-space implementations
perform 4.4 billion, 2.9 billion, and 4.6 billion instructions,
respectively.
Cache misses tell a different story, however. Using Cachegrind to
profile the number of top-level cache misses in a (serial) execution
on an input of size $2^{28}$, the high-space, medium-space, and
low-space implementations incur 305 million, 171 million, and 124
million cache misses, respectively.
To a first approximation, the number of cache misses by each algorithm
is proportional to the number of times that the algorithm scans
through a large array. By eliminating the use of large auxiliary
arrays, the low-space implementation has the opportunity to achieve a
reduction in the number of such scans. Additionally, the low-space
algorithm allows for steps from adjacent phases of the algorithm to
sometimes be performed in the same pass. For example, the enumeration
of the number of predecessors and the top level of the Preprocessing
Phase can be performed together in a single pass on the input
array. Similarly, the later levels of the Preprocessing Phase (which
focus on only one half of the input array) can be combined with the
construction of (one half of) the auxiliary array used in the Parallel
Prefix Sum Phase, saving another half of a pass.
%% There are additional slightly more tricky potential optimizations
%% that we did not make. One could, for example, also combine the
%% construction of the other half of the auxiliary array used in the
%% Parallel Prefix Sum Phase with the top level of the Preprocessing
%% Phase; this is made more subtle by the fact that when performing the
%% top level of the Preprocessing Phase, we do not know yet whether we
%% will be recursing on the left or right half of the array. One could
%% also implement the later levels of the Preprocessing Phase to have a
%% more cache-friendly recursive structure, starting with the final step
%% of the Preprocessing Phase and recursively working backwords to
%% perform the entire phase in a depth-first recursive tree. Evaluating
%% these optimizations would be an interesting direction of future work.
\paragraph{The Memory-Bandwidth Limitation}
The comparison of cache misses suggests that, as the number of worker
threads grows, the performance of the low-space algorithm becomes
bottlenecked by memory bandwidth. To evaluate whether this is the
case, we measure for each $t \in \{1, \ldots, 18\}$ the memory
throughput of $t$ threads attempting to scan through disjoint portions
of a large array in parallel. We measure two types of bandwidth, the
\defn{read-bandwidth}, in which the threads are simply trying to read
from the array, and the \defn{read/write bandwidth}, in which the
threads are attempting to immediately overwrite entries to the array
after reading them. Given read-bandwidth $r$ bytes/second and
read/write bandwidth $w$ bytes/second, the time needed for the
low-space algorithm to perform its memory operations on an input of
$m$ bytes will be roughly $3.5 m / w + 0.5 m / r$ seconds. We call this
the \defn{bandwidth constraint}. Assuming that large scans through
arrays are unaided by caching, the runtime of the low-space
implementation is limited by the bandwidth
constraint.\footnote{Empirically, the total number of cache misses is
within $8\%$ of what this assumption would predict,
suggesting that the bandwidth constraint is within a small amount of
the true bandwidth-limited runtime.}
Figure \ref{tablebandwidth} compares the time taken by the low-space
algorithm to the bandwidth constraint as the number of threads $t$
varies from $1$ to $18$. As the number of threads grows, the algorithm
becomes bandwidth limited, achieving its best possible parallel
performance on the machine. The algorithm scales particularly well on
the first socket of the machine, achieving a speedup of roughly six
on nine cores over its performance on a single core, and
then scales more poorly on the second socket as it becomes
bottlenecked by memory bandwidth.
\paragraph{A Full Quicksort}
In Figure \ref{tablesort}, we graph the performance of a parallel
quicksort implementation using our low-space algorithm. We compare the
algorithm's performance to GNU Libc quicksort with varying numbers of
worker threads and input sizes; the input array is initially in a
random permutation.
Our parallel quicksort uses the parallel-partition algorithm at the
top levels of recursion, and then swaps to the serial-partitioning
algorithm once the input size has been reduced by at least a factor of
$8p$, where $p$ is the number of worker threads. By using the
serial-partitioning algorithm on the small recursive subproblems we
avoid the overhead of the parallel algorithm, while still achieving
parallelism between subproblems. Small recursive problems also exhibit
better cache behavior than larger ones, reducing the effects of
memory-bandwidth limitations on the performance of the parallel
quicksort, and improving the scaling.
\paragraph{Implementation Details}
In each implementation, the parallelism is achieved through simple
parallel-for-loops, with one exception at the beginning of the
low-space implementation, when the number of predecessors in the input
array is computed. Although CilkPlus Reducers (or OpenMP Reductions)
could be used to perform this parallel summation within a
parallel-for-loop \cite{FrigoLe09}, we found a slightly more ad-hoc
approach to be faster: Using a simple recursive structure, we manually
implemented a parallel-for-loop with Cilk spawns and syncs, allowing
the summation to be performed within the recursion and amortizing
the cost of the Cilk thread spawns.
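A minimal sketch of this approach is shown below (this is not the code
from our implementation; the function name, grain size, and the
pivot-based decider are placeholders):
\begin{verbatim}
#include <cilk/cilk.h>
#include <cstddef>
#include <cstdint>

// Count predecessors with a hand-rolled recursive parallel-for,
// using Cilk spawns and syncs so that the partial sums are combined
// inside the recursion itself.
std::int64_t count_predecessors(const std::uint64_t* a, std::size_t n,
                                std::uint64_t pivot,
                                std::size_t grain = 1 << 14) {
  if (n <= grain) {                // small ranges are summed serially
    std::int64_t count = 0;
    for (std::size_t i = 0; i < n; i++) count += (a[i] < pivot);
    return count;
  }
  std::size_t half = n / 2;
  // Spawn the left half; compute the right half in the current thread.
  std::int64_t left = cilk_spawn count_predecessors(a, half, pivot, grain);
  std::int64_t right = count_predecessors(a + half, n - half, pivot, grain);
  cilk_sync;                       // wait for the spawned half
  return left + right;
}
\end{verbatim}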
\subsection{An In-Place Algorithm with Polynomial Span}\label{subsechighspan}
In this subsection, we consider a simple in-place parallel-partition
algorithm with polynomial span. We evaluate the algorithm as a simple
and even-lower-overhead alternative to the low-space algorithm in the
previous subsection.
The algorithm takes two steps (a serial sketch is given after the list):
\begin{itemize}
\item \textbf{Step 1:} The algorithm breaks the input array $A$ into
$t$ equal-sized parts, $P_1, \ldots, P_t$, for some parameter $t$. A
serial partition is performed on each of the $P_i$'s in
parallel. This step takes work $\Theta(n)$ and span $\Theta(n / t)$.
\item \textbf{Step 2:} The algorithm loops in serial through each of
the $t$ parts $P_1, \ldots, P_t$. Upon visiting $P_i$, the algorithm
has already performed a complete partition on the subarray $P_1
\circ \cdots \circ P_{i - 1}$. Let $j$ denote the number of
predecessors in $P_1, \ldots, P_{i - 1}$, and $k$ denote the number
of predecessors in $P_i$. The algorithm computes $k$ through a
simple binary search in $P_i$. The algorithm then moves the $k$
elements at the start of $P_i$ to take the place of the $k$ elements
$A[j + 1], \ldots, A[j + k]$. If the two sets of $k$ elements are
disjoint, then they are swapped with one-another in a
parallel-for-loop; otherwise, the non-overlapping portions of the
two sets of $k$ elements are swapped in parallel, while the
overlapping portion is left untouched. This completes the
partitioning of the parts $P_1 \circ \cdots \circ P_i$. Performing
this step for $i = 1, \ldots, t$ requires work $O(t \log n + n)$ and
span $\Theta(t \log n)$.
\end{itemize}
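A serial sketch of the two-layer algorithm is given below (names are
ours; std::partition and std::partition\_point stand in for the serial
partition and the binary search, and in the real implementation the
loop over parts in Step 1 and the swap loop in Step 2 are
parallel-for-loops):
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Serial sketch of the two-layer algorithm. Returns the final number
// of predecessors (the split point of the partition).
std::size_t two_layer_partition(std::vector<std::uint64_t>& a,
                                std::uint64_t pivot, std::size_t t) {
  std::size_t n = a.size();
  if (t == 0) t = 1;                     // guard: at least one part
  std::size_t part = (n + t - 1) / t;    // size of each part
  auto is_pred = [pivot](std::uint64_t x) { return x < pivot; };
  // Step 1: serially partition each part (done in parallel across parts).
  for (std::size_t p = 0; p * part < n; p++)
    std::partition(a.begin() + p * part,
                   a.begin() + std::min(n, (p + 1) * part), is_pred);
  // Step 2: serial sweep over the parts.
  std::size_t j = 0;                     // predecessors placed so far
  for (std::size_t p = 0; p * part < n; p++) {
    std::size_t lo = p * part, hi = std::min(n, (p + 1) * part);
    // Binary search for the number k of predecessors in this part.
    std::size_t k = std::partition_point(a.begin() + lo,
                                         a.begin() + hi, is_pred)
                    - (a.begin() + lo);
    // Move the k predecessors at the front of the part to a[j..j+k-1],
    // swapping only the non-overlapping portions (a parallel-for-loop).
    std::size_t d = std::min(k, lo - j);    // elements outside the overlap
    std::size_t src = std::max(lo, j + k);  // start of non-overlapping source
    for (std::size_t i = 0; i < d; i++)
      std::swap(a[j + i], a[src + i]);
    j += k;
  }
  return j;
}
\end{verbatim}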
Setting $t = \sqrt{n}$, the algorithm runs in linear time with span
$\sqrt{n} \log n$; refining $t$ to an optimal value of $\sqrt{n / \log
n}$ results in a span of $\sqrt{n \log n}$. In practice,
however, this leaves too little parallelism in the parallel-for-loops
in Step 2, resulting in poor scaling.\footnote{On 18 threads and on an
input of size $2^{28}$, for example, setting $t = \sqrt{n}$ results
in a performance a factor of two slower than the low-space
implementation, and setting $t = \sqrt{n / \log n}$ makes only
partial progress towards closing that gap.} To mitigate this, we
tune our implementation of the algorithm to the number of processors
$p$ on which it is being run, setting $t = 8 \cdot p$, in order to
maximize the parallelism in the for-loops in Step 2, while still
providing sufficient parallelism for Step 1.
Figures \ref{tablecilk} and \ref{tablecilk2} compare the parallel
performance of the algorithm, which is referred to as the
\defn{two-layer algorithm}, to its lower-span peers. On 18 cores and
on an input of size $2^{28}$, the two-layer algorithm offers a speedup of
roughly 50\% over the low-space algorithm. The algorithm is more
sensitive to input-size and number of cores, however, requiring a
large enough ratio between the two to compensate for the algorithm's
large span (see Figure \ref{tablecilk2}).
Figure \ref{tableserial} compares the performance of the two-layer
algorithm in serial to GNU Libc quicksort. The two-layer algorithm
runs less than $25\%$ slower than the serial implementation.
Figure \ref{tablebandwidth} evaluates the degree to which the
algorithm is memory-bandwidth bound on an input of size $2^{28}$. If
the read/write bandwidth of the algorithm is $w$ bytes/second, then
the bandwidth constraint for the algorithm on an input of $m$ bytes is
given by $2m / w$. In particular, Step 1 of the algorithm makes one
scan through the array, requiring time $m / w$; and then Step 2
rearranges the predecessors (which constitute half of the array and
each must be moved to a new location), requiring additional time $m /
w$. Figure \ref{tablebandwidth} compares the time taken by the
algorithm to the bandwidth constraint as the number of threads
varies from $1$ to $18$. Like the low-space algorithm, as the number
of threads grows, the algorithm becomes close to bandwidth limited.
Figure \ref{tablesort} compares the performance of a quicksort
implementation using the two-layer partition algorithm, to the
performance of an implementation using the low-space algorithm. The
implementation using the two-layer algorithm achieves a modest speedup
over the low-space algorithm, but also demonstrates its larger span by
suffering on smaller inputs as the number of cores grows.
\paragraph{Related Algorithms}
Despite the simplicity of the two-layer algorithm, we are unaware of
any discussion of the algorithm in prior work. A related algorithm is
that of Francis and Pannan \cite{FrancisPa92}, which is the only
previously known algorithm of which we are aware that performs
parallel-partition in-place using only exclusive-read-and-write
memory. The algorithm, which we call the \defn{Strided Algorithm},
consists of two steps (a serial sketch is given after the list):
\begin{itemize}
\item Partition the array $A$ logically into $t$ equal-size parts, denoted by
$P_1, P_2, \ldots, P_t$ for some parameter $t$. Unlike the two-layer
algorithm, the parts are interleaved, with each $P_i$ consisting of
array entries $A[i], A[i + t], A[i + 2t], \ldots$. The first step of
the algorithm is to perform a serial partition on each of the
$P_i$'s, rearranging the elements within the $P_i$ so that the
predecessors come first. This step has work $\Theta(n)$ and span
$\Theta(n/t)$.
\item For each $P_i$, define the \defn{splitting position} $v_i$ to be
the position in $A$ of the final predecessor in (the already
partitioned) $P_i$. Define $v_{\text{min}} = \min\{v_1, \ldots,
v_t\}$ and define $v_{\text{max}} = \max\{v_1, \ldots, v_t\}$. Then the
second step of the algorithm is to perform a serial partition on the
sub-array $A[v_{\text{min}} : v_{\text{max}}]$. This completes the
full partition.
\end{itemize}
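A serial sketch of the Strided Algorithm, as described above, is given
below (names are ours; the loop over the $t$ interleaved parts is the
parallel-for-loop, and std::partition stands in for the serial
partition used in the second step):
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Serial sketch of the Strided Algorithm. Returns the final number
// of predecessors (the split point of the partition).
std::size_t strided_partition(std::vector<std::uint64_t>& a,
                              std::uint64_t pivot, std::size_t t) {
  std::size_t n = a.size();
  if (t == 0) t = 1;             // guard: at least one part
  auto is_pred = [pivot](std::uint64_t x) { return x < pivot; };
  std::size_t v_min = n, v_max = 0;
  // Step 1: partition each interleaved part P_i = a[i], a[i+t], ...
  // (done in parallel across the parts in the actual algorithm).
  for (std::size_t i = 0; i < t && i < n; i++) {
    std::size_t lo = i, hi = ((n - 1 - i) / t) * t + i;  // first/last index
    while (lo < hi) {            // two-pointer partition within P_i
      while (lo < hi && is_pred(a[lo])) lo += t;
      while (lo < hi && !is_pred(a[hi])) hi -= t;
      if (lo < hi) std::swap(a[lo], a[hi]);
    }
    // Splitting position: just past the final predecessor of P_i.
    std::size_t split = is_pred(a[lo]) ? lo + 1 : lo;
    v_min = std::min(v_min, split);
    v_max = std::max(v_max, split);
  }
  // Step 2: one serial partition of the middle region a[v_min..v_max).
  auto mid = std::partition(a.begin() + v_min, a.begin() + v_max, is_pred);
  return (std::size_t)(mid - a.begin());
}
\end{verbatim}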
Note that Step 2 of the Strided Algorithm has no parallelism, with
span $\Theta(v_{\text{max}} - v_{\text{min}})$. In general, this
results in an algorithm with linear span (i.e., no parallelism
guarantee). When the number of predecessors in each of the $P_i$'s is
close to equal, however, the quantity $v_{\text{max}} -
v_{\text{min}}$ can be much smaller. For example, if $A$ is randomly
ordered, then one can use Chernoff bounds to prove that with high
probability $v_{\text{max}} - v_{\text{min}} \le O(\sqrt{n \cdot t
\cdot \log n})$. The full span of the algorithm is then
$\tilde{O}(n/t + \sqrt{n \cdot t})$, which optimizes at $t = n^{1/3}$
to $\tilde{O}(n^{2/3})$.
A more cache-friendly version of the algorithm, in which each part
$P_i$ consists of blocks of $b$ elements separated from each other by
runs of length $(t - 1) b$, was considered by Frias and Petit
\cite{Frias08}. With this optimization, one advantage of the Strided
Algorithm is that when $v_{\text{max}} - v_{\text{min}}$ is small, the
total number of cache misses by the algorithm is close to the same as