go-suffixarray

Linear-time suffix array generator in Go.

Overview

A suffix array is a data structure that supports searching a large corpus in O(m log n) time, where n is the length of the corpus and m is the length of the search string.

The basic idea is that you have a search string (na), a corpus (banana), and the sorted list of the corpus's suffixes, each paired with a pointer to the offset where it begins ($ marks the end of the corpus):

0: $       [ptr to 6]
1: a$      [ptr to 5]
2: ana$    [ptr to 3]
3: anana$  [ptr to 1]
4: banana$ [ptr to 0]
5: na$     [ptr to 4]
6: nana$   [ptr to 2]

We then do a binary search to locate the region where our search string, na, is a prefix of the suffixes. That's indices 5 and 6 in this case, which are pointers to corpus offsets 4 and 2 respectively.
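The lookup described above can be sketched in a few lines of Go. This is an illustration only, not this package's API: `buildNaive` and `lookup` are hypothetical names, and the array is built by directly sorting suffixes rather than by SA-IS. The empty suffix at offset 6 stands in for the `$` row.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildNaive returns the suffix array of corpus, including the empty suffix
// at offset len(corpus) (the "$" row in the table). It sorts the suffixes
// directly, which is much slower than SA-IS and is for illustration only.
func buildNaive(corpus string) []int {
	sa := make([]int, len(corpus)+1)
	for i := range sa {
		sa[i] = i
	}
	sort.Slice(sa, func(a, b int) bool { return corpus[sa[a]:] < corpus[sa[b]:] })
	return sa
}

// lookup returns the half-open range [lo, hi) of suffix array slots whose
// suffixes start with query, using two binary searches.
func lookup(corpus string, sa []int, query string) (lo, hi int) {
	// First slot whose suffix is not less than the query.
	lo = sort.Search(len(sa), func(i int) bool {
		return corpus[sa[i]:] >= query
	})
	// From there, first slot whose suffix no longer starts with the query.
	hi = lo + sort.Search(len(sa)-lo, func(i int) bool {
		return !strings.HasPrefix(corpus[sa[lo+i]:], query)
	})
	return lo, hi
}

func main() {
	corpus := "banana"
	sa := buildNaive(corpus)
	lo, hi := lookup(corpus, sa, "na")
	for i := lo; i < hi; i++ {
		fmt.Printf("slot %d -> corpus offset %d\n", i, sa[i])
	}
}
```

Each comparison inspects at most m bytes of a suffix and each binary search takes O(log n) steps, which is where the O(m log n) bound comes from.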

The actual suffix array only contains the pointers themselves, with the text for each array slot being reconstructed on demand from the corpus and the slot's pointer. This implementation uses variable-length pointers, removing the space blow-up that some other implementations have for small corpora.
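As a rough sketch of how pointer-only storage might look, the code below packs each offset into the smallest whole number of bytes that can address the corpus and rebuilds a slot's text on demand. It assumes "variable-length" means a per-corpus pointer width; the package's actual layout may differ, and `pointerWidth`, `packedArray`, and `suffixAt` are hypothetical names.

```go
package main

import "fmt"

// pointerWidth returns the smallest number of bytes able to hold every
// offset in [0, n], where n is the corpus length.
func pointerWidth(n int) int {
	w := 1
	for v := uint64(n); v > 0xFF; v >>= 8 {
		w++
	}
	return w
}

// packedArray stores one offset per slot, width bytes each, little-endian.
// Hypothetical layout for illustration; not taken from this package.
type packedArray struct {
	width int
	data  []byte
}

func newPackedArray(slots, corpusLen int) *packedArray {
	w := pointerWidth(corpusLen)
	return &packedArray{width: w, data: make([]byte, slots*w)}
}

// set writes offset into slot i.
func (p *packedArray) set(i, offset int) {
	base := i * p.width
	for b := 0; b < p.width; b++ {
		p.data[base+b] = byte(offset >> (8 * b))
	}
}

// get reads the offset stored in slot i.
func (p *packedArray) get(i int) int {
	base := i * p.width
	v := 0
	for b := 0; b < p.width; b++ {
		v |= int(p.data[base+b]) << (8 * b)
	}
	return v
}

// suffixAt reconstructs the text for slot i from the corpus and the pointer.
func suffixAt(corpus string, p *packedArray, i int) string {
	return corpus[p.get(i):]
}

func main() {
	corpus := "banana"
	offsets := []int{6, 5, 3, 1, 0, 4, 2} // pointers from the table
	p := newPackedArray(len(offsets), len(corpus))
	for i, off := range offsets {
		p.set(i, off)
	}
	fmt.Println("bytes per pointer:", p.width)
	fmt.Println("slot 5 text:", suffixAt(corpus, p, 5))
}
```

With a scheme like this, a corpus shorter than 256 bytes needs one byte per slot, while a fixed 8-byte pointer would use eight times as much.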

Construction

Construction of the suffix array uses the SA-IS algorithm introduced by Nong, Zhang, and Chan in "Linear Suffix Array Construction by Almost Pure Induced-Sorting", with Screwtape's "A walk through the SA-IS Suffix Array Construction Algorithm" as a practical guide to the implementation.

SA-IS runs in O(n) time. In practice, a reasonably modern laptop can index a corpus of hundreds of megabytes in tens of minutes.
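To give a flavor of the algorithm, here is a minimal sketch of the first step of SA-IS: a single right-to-left pass that labels each suffix as S-type (smaller than the suffix to its right) or L-type (larger), then picks out the leftmost-S (LMS) positions that seed the induced sort. It is not the code in sais.go; `classify` and `lmsPositions` are hypothetical names, and the sentinel is treated as an implicit final position.

```go
package main

import "fmt"

// classify labels every suffix of text, plus an implicit sentinel suffix at
// position len(text). isS[i] is true for S-type and false for L-type.
func classify(text string) []bool {
	n := len(text)
	isS := make([]bool, n+1)
	isS[n] = true // the sentinel suffix is S-type by definition
	for i := n - 1; i >= 0; i-- {
		switch {
		case i == n-1:
			isS[i] = false // any real byte is larger than the sentinel
		case text[i] < text[i+1]:
			isS[i] = true
		case text[i] == text[i+1]:
			isS[i] = isS[i+1] // ties inherit the type of the next suffix
		}
	}
	return isS
}

// lmsPositions returns each position i whose suffix is S-type while the
// suffix at i-1 is L-type.
func lmsPositions(isS []bool) []int {
	var lms []int
	for i := 1; i < len(isS); i++ {
		if isS[i] && !isS[i-1] {
			lms = append(lms, i)
		}
	}
	return lms
}

func main() {
	isS := classify("banana")
	types := make([]byte, len(isS))
	for i, s := range isS {
		if s {
			types[i] = 'S'
		} else {
			types[i] = 'L'
		}
	}
	fmt.Println(string(types))     // one letter per position, sentinel last
	fmt.Println(lmsPositions(isS)) // LMS positions
}
```

For banana this labels the positions LSLSLLS and yields LMS positions 1, 3, and 6. The full algorithm then sorts the LMS suffixes, recursing on a reduced string if needed, and induces the order of all remaining suffixes from them.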
