Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

team-spectre/go-suffixarray

Repository files navigation

go-suffixarray

Linear time suffix array generator in Go.

License GoDoc Build Status Issues Pull Requests Latest Release

Overview

Suffix arrays are a data structure that allows for very fast searching of a large corpus in O(m log n) time, where n is the length of the corpus and m is the length of the search string.

The basic idea for using a suffix array is that you have a search string (na), a corpus (banana), and the sorted suffix tree for the corpus:

0: $       [ptr to 6]
1: a$      [ptr to 5]
2: ana$    [ptr to 3]
3: anana$  [ptr to 1]
4: banana$ [ptr to 0]
5: na$     [ptr to 4]
6: nana$   [ptr to 2]

We then do a binary search to locate the region where our search string, na, is a prefix of the suffixes. That's indices 5 and 6 in this case, which are pointers to corpus offsets 4 and 2 respectively.

The actual suffix array only contains the pointers themselves, with the text for each array slot being reconstructed on demand from the corpus and the slot's pointer. This implementation uses variable-length pointers, removing the space blow-up that some other implementations have for small corpora.

Construction

Construction of the suffix array uses the SA-IS algorithm first defined by Nong, Zhang, and Chan in "Linear Suffix Array Construction by Almost Pure Induced-Sorting" link, using Screwtape's "A walk through the SA-IS Suffix Array Construction Algorithm" link as a practical guide to the implementation.

SA-IS runs in O(n) time. In practice, a reasonably modern laptop can index a hundreds-of-megabytes corpus in tens-of-minutes.

About

Linear time suffix array generator in Go.

Resources

License

Stars

Watchers

Forks

Packages

No packages published