Suffix Array

1. Introduction

Suffix Array is a sorted array of all suffixes of a given (usually long) text string T of length n characters (n can be in order of hundred thousands characters).


Suffix Array is a simple, yet powerful data structure which is used, among others, in full text indices, data compression algorithms, and within the field of bioinformatics.


This data structure is very related to the Suffix Tree data structure. Both data structures are usually studied together.

2. Suffix Array Visualization

The visualization of Suffix Array is simply a table where each row represents a suffix and each column represents the attributes of the suffixes.


The four (basic) attributes of each row i are:

  1. index i, ranging from 0 to n-1,
  2. SA[i]: the i-th lexicographically smallest suffix of T is the SA[i]-th suffix,
  3. LCP[i]: the Longest Common Prefix between the i-th and the (i-1)-th lexicographically smallest suffixes of T is LCP[i] (we will see the application of this attribute later), and
  4. Suffix T[SA[i]:] - the i-th lexicographically smallest suffix of T is from index SA[i] to the end (index n-1).

Some operations may add more attributes to each row and are explained when that operations are discussed.

3. 可用的操作

后缀数组的所有可用操作如下所列。

  1. 构建后缀数组 (SA) 是基于 Karp,Miller,和 Rosenberg (1972) 的想法的 O(n log n) 后缀数组构建算法,该算法按照增长长度 (1, 2, 4, 8, ...) 对后缀的前缀进行排序。
  2. 搜索 利用后缀数组中的后缀已排序的事实,并在 O(m log n) 中调用两个二进制搜索,以找到模式字符串 P 的长度 m 的第一次和最后一次出现。
  3. 最长公共前缀 (LCP) 可以在 O(n) 中使用排列 LCP (PLCP) 定理计算两个相邻后缀(不包括第一个后缀)之间的最长公共前缀。这个算法的名字叫做 Kasai's 算法。
  4. 最长重复子串 (LRS) 是一个简单的 O(n) 算法,它找到具有最高 LCP 值的后缀。
  5. 最长公共子串 (LCS) 是一个简单的 O(n) 算法,它找到来自两个不同字符串的具有最高 LCP 值的后缀。

3-1. Construct Suffix Array - UI

在这个可视化中,我们展示了基于Karp,Miller和Rosenberg(1972)的想法,通过按递增长度(1,2,4,8,...)排序后缀的前缀,即所谓的前缀倍增算法,来正确地构建后缀数组的O(n log n)方法。


我们限制输入只接受12个(由于可用的绘图空间,不能太长 - 但在后缀树的实际应用中,n可以是十万到百万个字符的顺序)大写(我们会删除您的小写输入)字母和特殊终止符'$'字符(即,[A-Z$])。如果您没有在输入字符串的末尾写一个终止符'$',我们将自动这样做。如果您在输入字符串的中间放一个'$',它们将被忽略。如果您输入一个空的输入字符串,我们将默认为"GATAGACA$"。


为了方便,我们提供了一些通常在后缀树/数组讲座中找到的经典测试用例输入字符串,但为了展示这个可视化工具的强大,我们鼓励您输入任何您选择的12个字符的字符串(以字符'$'结束)。


请注意,LCP数组列在此操作中保持为空。它们将通过最长公共前缀操作单独计算。

3-2. The Prefix Doubling Algorithm

这个前缀倍增算法在 O(log n) 迭代中运行,其中每次迭代,它比较子字符串 T[SA[i]:SA[i+k]] 与 T[SA[i+k]:SA[i+2*k]],即,首先比较两对字符,然后比较前两个字符与下一个两个字符,然后比较前四个字符与下一个四个字符,依此类推。


通过可视化最好探索此算法,看 ConstructSA("GATAGACA$") 的动作。


时间复杂度:有 O(log n) 前缀倍增迭代,每次迭代我们调用 O(n) 基数排序,因此它在 O(n log n) 中运行 - 足以处理最多 n ≤ 200K 字符的典型编程竞赛问题中涉及的长字符串。

3-3. Search

After we construct the Suffix Array of T in O(n log n), we can search for the occurrence of Pattern string T in O(m log n) by binary searching the sorted suffixes to find the lower bound (the first occurrence of P as a prefix of any suffix of T) and the upper bound positions (thelast occurrence of P as a prefix of any suffix of T).


Time complexity: O(m log n) and it will return an interval of size k where k is the total number of occurrences.


For example, on the Suffix Array of T = "GATAGACA$" above, try these scenarios:

  1. P returns a range of rows: Search("GA"), occurrences = {4, 0}
  2. P returns one row only: Search("CA"), occurrences = {2}
  3. P is not found in T: Search("WONKA"), occurrences = {NIL}

3-4. Longest Common Prefix (LCP)

我们可以使用Kasai算法的三个阶段在O(n)时间内计算两个相邻后缀(在后缀数组顺序中)的最长公共前缀(LCP)。这个算法利用了一个优点,即如果我们在两个相邻后缀(在后缀数组顺序中)之间有一个长的LCP,那么当其第一个字符被移除时,这个长的LCP与另一个在位置顺序中的后缀有很多重叠。


第一阶段:计算Phi[]的值,其中Phi[SA[i]] = SA[i-1]在O(n)中。这是为了帮助算法在$O(1)时间内知道哪个后缀在后缀数组顺序中位于后缀-SA[i]之后。


第二阶段:计算位置顺序中的后缀-i与后缀-Phi[i](在后缀数组顺序中位于后缀-i之后的那个)之间的PLCP[]值。当我们前进到位置顺序中的下一个索引i+1时,我们将移除后缀的最前面的字符,但可能保留大量的后缀-(i+1)和后缀-Phi[(i+1)]之间的LCP值。PLCP定理(未证明)显示,LCP值只能增加到n次,因此也只能减少到最多n次,使得第二阶段的总体复杂性也是O(n)。


第三阶段:我们计算LCP[]的值,其中LCP[i] = PLCP[SA[i]]在O(n)中。这些LCP值是我们稍后用于其他后缀数组应用的值。


时间复杂性:Kasai的算法利用了PLCP定理,其中LCP值的增加(和减少)操作的总数最多是O(n)。因此,Kasai的算法总体上运行在O(n)中。因此,O(n log n)后缀数组构造(通过前缀倍增算法)和使用这个Kasai的算法计算LCP数组的O(n)计算对于处理涉及长字符串的典型编程比赛问题(最多n ≤ 200K字符)是足够的。

3-5. Longest Repeated Substring (LRS)

After we construct the Suffix Array of T in O(n log n) and compute its LCP Array in O(n), we can find the Longest Repeated Substring (LRS) in T by simply iterating through all LCP values and reporting the largest one.


This is because each value LCP[i] the LCP Array means the longest common prefix between two lexicographically adjacent suffixes: Suffix-i and Suffix-(i-1). This corresponds to an internal vertex of the equivalent Suffix Tree of T that branches out to at least two (or more) suffixes, thus this common prefix of these adjacent suffixes are repeated.


The longest common (repeated) prefix is the required answer, which can be found in O(n) by going through the LCP array once.


Without further ado, try LRS("GATAGACA$"). We have LRS = "GA".


It is possible that T contains more than one LRS, e.g., try LRS("BANANABAN$").
We have LRS = "ANA" (actually overlap) or "BAN" (without overlap).

3-6. Longest Common Substring (LCS)

After we construct the generalized Suffix Array of the concatenation of both strings T1$T2# of length n = n1+n2 in O(n log n) and compute its LCP Array in O(n), we can find the Longest Repeated Substring (LRS) in T by simply iterating through all LCP values and reporting the largest one that comes from two different strings.


Without further ado, try LCS("GATAGACA$", "CATA#") on the generalized Suffix Array of string T1 = "GATAGACA$" and T2 = "CATA#". We have LCS = "ATA".

4. Implementation

You are allowed to use/modify our implementation code for fast Suffix Array+LCP: sa_lcp.cpp | py | java | ml to solve programming contest problems that need it.