Suffix Array

1. Introduction

Suffix Array is a sorted array of all suffixes of a string T with usually long length n. It is a simple, yet powerful data structure which is used, among others, in full text indices, data compression algorithms, and within the field of bioinformatics. This data structure is very related to Suffix Tree data structure.

2. Visualisation

后缀数组的可视化仅仅是一个表,其中每行代表一个后缀,每一列代表后缀的属性。

3. Available operations

All available operations on the Sufix Array are listed below.

  1. Construct Suffix Array (SA) is the O(n log n) Suffix Array construction algorithm based on the idea by Karp, Miller, & Rosenberg (1972) that sort prefixes of the suffix in increasing length (1, 2, 4, 8, ...).
  2. Search utilizes the fact that the suffixes in Suffix Array are sorted and call two binary searches in O(m log n) to find the first and the last occurrence(s) of pattern string P of length m.
  3. Longest Common Prefix (LCP) between two adjacent suffixes (excluding the first suffix) can be computed in O(n) using the Permuted LCP (PLCP) theorem.
  4. Longest Repeated Substring (LRS) is a simple O(n) algorithm that finds the suffix with the highest LCP value.
  5. Longest Common Substring (LCS) is a simple O(n) algorithm that finds the suffix with the highest LCP value that comes from two different strings.

4. Implementation

You are allowed to use/modify our implementation code for fast Suffix Array+LCP:
sa_lcp.cpp
sa_lcp.java
sa_lcp.py
sa_lcp.ml