Linear time suffix array generator in Go.
Suffix arrays are a data structure that allows for very fast searching
of a large corpus in O(m log n)
time, where n
is the length of the corpus
and m
is the length of the search string.
The basic idea for using a suffix array is that you have a search string
(na
), a corpus (banana
), and the sorted suffix tree for the corpus:
0: $ [ptr to 6]
1: a$ [ptr to 5]
2: ana$ [ptr to 3]
3: anana$ [ptr to 1]
4: banana$ [ptr to 0]
5: na$ [ptr to 4]
6: nana$ [ptr to 2]
We then do a binary search to locate the region where our search string, na
,
is a prefix of the suffixes. That's indices 5 and 6 in this case, which are
pointers to corpus offsets 4 and 2 respectively.
The actual suffix array only contains the pointers themselves, with the text for each array slot being reconstructed on demand from the corpus and the slot's pointer. This implementation uses variable-length pointers, removing the space blow-up that some other implementations have for small corpora.
Construction of the suffix array uses the SA-IS algorithm first defined by Nong, Zhang, and Chan in "Linear Suffix Array Construction by Almost Pure Induced-Sorting" link, using Screwtape's "A walk through the SA-IS Suffix Array Construction Algorithm" link as a practical guide to the implementation.
SA-IS runs in O(n)
time. In practice, a reasonably modern laptop can index
a hundreds-of-megabytes corpus in tens-of-minutes.