A Suffix Array is a sorted array of all suffixes of a given (usually long) text string T of length n characters (n can be on the order of hundreds of thousands of characters).
A Suffix Array is a simple yet powerful data structure that is used, among other things, in full-text indices, data compression algorithms, and bioinformatics.
This data structure is closely related to the Suffix Tree data structure, and the two are usually studied together.
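As a minimal sketch of the definition above (not the construction method used by the visualization), the Suffix Array can be built naively by sorting the starting indices of all suffixes. The function name `naive_suffix_array` is a hypothetical name chosen for this illustration, and the sketch assumes the terminator '$' sorts before the letters A-Z, which holds in ASCII.

```python
def naive_suffix_array(t):
    """Return the starting indices of all suffixes of t in sorted order.

    Naive approach: sorting compares whole suffixes, so this costs
    O(n^2 log n) in the worst case. It is only meant to show what the
    Suffix Array contains.
    """
    return sorted(range(len(t)), key=lambda i: t[i:])


if __name__ == "__main__":
    t = "GATAGACA$"
    # Print one row per suffix: row number, starting index, suffix text.
    for row, start in enumerate(naive_suffix_array(t)):
        print(row, start, t[start:])
```

Each printed row pairs a position in sorted order with the starting index of the suffix and the suffix itself.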
The visualization of Suffix Array is simply a table where each row represents a suffix and each column represents the attributes of the suffixes.
The four (basic) attributes of each row i are:
Some operations may add more attributes to each row; these are explained when those operations are discussed.
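All available operations on the Suffix Array are listed below.
In this visualization, we show the O(n log n) way to correctly construct a Suffix Array, based on the idea of Karp, Miller, and Rosenberg (1972): sort the prefixes of the suffixes in increasing length (1, 2, 4, 8, ...), the so-called Prefix Doubling algorithm.
We restrict the input to 12 characters (it cannot be too long because of the available drawing space, but in real applications of Suffix Tree, n can be on the order of hundreds of thousands to millions of characters), consisting of UPPERCASE letters (we delete your lowercase input) and the special terminating symbol '$' (i.e., [A-Z$]). If you do not write a terminating symbol '$' at the end of your input string, we add it automatically. If you place a '$' in the middle of the input string, it is ignored. If you enter an empty input string, we default to "GATAGACA$".
For convenience, we provide a few classic test-case input strings usually found in Suffix Tree/Array lectures, but to show the strength of this visualization tool, we encourage you to enter any 12-character string of your choice (ending with the character '$').
Note that the LCP Array column remains empty in this operation. It is computed separately by the Longest Common Prefix operation.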
This Prefix Doubling Algorithm runs in O(log n) iterations. In the iteration with prefix length k, the suffix starting at SA[i] is ranked by the pair of substrings T[SA[i]:SA[i]+k] and T[SA[i]+k:SA[i]+2*k]. In layman's terms: first pair each character with the next character, then pair the first two characters with the next two, then the first four characters with the next four, and so on.
This algorithm is best explored by watching the visualization in action (it is advisable that you exit this e-Lecture mode, run the algorithm in exploration mode, pause the algorithm, and replay it frame by frame, as there are too many elements changing, especially during the first sorting iteration).
Time complexity: There are O(log n) prefix doubling iterations, and in each iteration we call an O(n) Radix Sort, so the algorithm runs in O(n log n) overall. This is good enough to handle up to n ≤ 200K characters in typical programming competition problems involving long strings.
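The prefix doubling idea can be sketched as follows. This is an illustrative simplification, not the visualized implementation: it uses Python's built-in comparison sort in place of Radix Sort, so it runs in O(n log^2 n) rather than O(n log n). The function name `suffix_array_doubling` and the helper `pair_key` are hypothetical names chosen for this sketch.

```python
def suffix_array_doubling(t):
    """Build the Suffix Array of t by prefix doubling.

    Assumption: a comparison sort replaces the Radix Sort described in
    the text, giving O(n log^2 n) instead of O(n log n).
    """
    n = len(t)
    if n == 0:
        return []
    sa = list(range(n))
    rank = [ord(c) for c in t]  # ranks by the first character only
    k = 1
    while True:
        def pair_key(i):
            # (rank of the first k characters, rank of the next k);
            # -1 marks a suffix that ends before position i + k.
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=pair_key)
        # Re-rank: suffixes with equal pairs share the same new rank.
        new_rank = [0] * n
        for j in range(1, n):
            differs = pair_key(sa[j]) != pair_key(sa[j - 1])
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (1 if differs else 0)
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: fully sorted
            return sa
        k *= 2


if __name__ == "__main__":
    print(suffix_array_doubling("GATAGACA$"))
```

The loop stops as soon as every suffix has a distinct rank, which can happen before k reaches n.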
After we construct the Suffix Array of T in O(n log n), we can search for the occurrences of a pattern string P of length m in O(m log n) by binary searching the sorted suffixes to find the lower bound (the first occurrence of P as a prefix of any suffix of T) and the upper bound (the last occurrence of P as a prefix of any suffix of T).
Time complexity: O(m log n). The search returns an interval of size k, where k is the total number of occurrences.
For example, on the Suffix Array of T = "GATAGACA$" above, try these scenarios:
PS: There is a slightly faster O(m+log n) variant that has not been visualized yet.
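The O(m log n) search can be sketched with two binary searches over the sorted suffixes. This is a minimal illustration under assumptions: the function name `find_occurrences` is hypothetical, the returned interval is half-open, and the Suffix Array in the usage example is built naively just to keep the block self-contained.

```python
def find_occurrences(t, sa, p):
    """Return (lo, hi) such that sa[lo:hi] holds every start of p in t.

    Each of the O(log n) steps compares at most m = len(p) characters.
    """
    m = len(p)
    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound: first suffix whose m-char prefix >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # upper bound: first suffix whose m-char prefix > p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


if __name__ == "__main__":
    t = "GATAGACA$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])  # naive construction
    lo, hi = find_occurrences(t, sa, "GA")
    print(sorted(sa[lo:hi]))  # starting positions of "GA" in t
```

An empty interval (lo == hi) means the pattern does not occur in T.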
We can compute the Longest Common Prefix (LCP) of every two adjacent suffixes (in Suffix Array order) in O(n) time using the three phases of Kasai's algorithm. This algorithm takes advantage of the fact that if two adjacent suffixes (in Suffix Array order) have a long LCP, then most of that LCP is still shared with another suffix when we move to the next suffix in positional order, i.e., when the first character is removed.
The first phase: Compute the values of Phi[], where Phi[SA[i]] = SA[i-1], in O(n). This lets the algorithm know in O(1) time which suffix is behind Suffix-SA[i] in Suffix Array order. Try this operation in the visualization and focus on the first part, which fills column Phi[] (it is advisable that you exit this e-Lecture mode, run the algorithm in exploration mode, pause the algorithm, and replay it frame by frame, as there are too many elements changing).
The second phase: Compute the PLCP[] values between Suffix-i in positional order and Suffix-Phi[i] (the one behind Suffix-i in Suffix Array order). When we advance to the next index i+1 in positional order, we remove the frontmost character of the suffix, but possibly retain most of the LCP value between Suffix-(i+1) and Suffix-Phi[i+1].
The PLCP Theorem (not proven here) shows that the LCP value can be incremented at most n times in total, and thus can be decremented at most n times too, so the overall complexity of the second phase is also O(n).
Now, retry the operation and focus on the middle part, which fills column PLCP[] (again, it is advisable that you exit this e-Lecture mode, run the algorithm in exploration mode, pause the algorithm, and replay it frame by frame, as there are too many elements changing).
The third phase: Compute the values of LCP[], where LCP[i] = PLCP[SA[i]], in O(n). These LCP values are the ones that we use for other Suffix Array applications later.
Finally, retry the operation and focus on the last part, which fills column LCP[] (as usual, exit this e-Lecture mode, run the algorithm in exploration mode, pause the algorithm, and replay it frame by frame, as there are too many elements changing).
Time complexity: Kasai's algorithm relies on the PLCP theorem, by which the total number of increase (and decrease) operations on the LCP value is O(n), so the algorithm runs in O(n) overall. The combination of the O(n log n) Suffix Array construction (via the Prefix Doubling algorithm) and the O(n) computation of the LCP Array using Kasai's algorithm is therefore good enough to handle up to n ≤ 200K characters in typical programming competition problems involving long strings.
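The three phases described above can be sketched as follows. The array names `phi` and `plcp` follow the text; the function name `lcp_array`, the use of -1 for "no previous suffix", and the convention LCP[0] = 0 are assumptions made for this illustration. The usage example builds the Suffix Array naively only to keep the block self-contained.

```python
def lcp_array(t, sa):
    """Compute the LCP Array of t from its Suffix Array sa in O(n)."""
    n = len(t)
    # Phase 1: phi[i] = the suffix right before Suffix-i in SA order.
    phi = [-1] * n  # -1 marks the first suffix in SA order (assumption)
    for i in range(1, n):
        phi[sa[i]] = sa[i - 1]
    # Phase 2: PLCP in positional order, reusing the previous length.
    plcp = [0] * n
    length = 0
    for i in range(n):
        if phi[i] == -1:
            length = 0
            continue
        j = phi[i]
        while i + length < n and j + length < n and t[i + length] == t[j + length]:
            length += 1
        plcp[i] = length
        length = max(length - 1, 0)  # drop the frontmost character
    # Phase 3: reorder from positional order to Suffix Array order.
    return [plcp[sa[i]] for i in range(n)]


if __name__ == "__main__":
    t = "GATAGACA$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])  # naive construction
    print(lcp_array(t, sa))
```

The variable `length` is only decremented by one per step, which is what bounds the total work of the second phase.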
After we construct the Suffix Array of T in O(n log n) and compute its LCP Array in O(n), we can find the Longest Repeated Substring (LRS) in T by simply iterating through all LCP values and reporting the largest one.
This is because each value LCP[i] in the LCP Array is the length of the longest common prefix between two lexicographically adjacent suffixes: Suffix-SA[i] and Suffix-SA[i-1]. This corresponds to an internal vertex of the equivalent Suffix Tree of T that branches out to two or more suffixes, so the common prefix of these adjacent suffixes is repeated.
The longest common (repeated) prefix is the required answer, which can be found in O(n) by going through the LCP array once.
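The scan over the LCP values can be sketched as below. To keep the block short and self-contained, this illustration builds the Suffix Array naively and computes each adjacent LCP by direct comparison rather than with the O(n log n) construction and Kasai's algorithm; the function name `longest_repeated_substring` is hypothetical. When several substrings tie for the longest, this sketch reports the first one in Suffix Array order.

```python
def longest_repeated_substring(t):
    """Return one longest substring that occurs at least twice in t.

    Assumption: naive Suffix Array and naive adjacent LCP computation,
    used only to illustrate the 'largest LCP value' idea.
    """
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])
    best_len, best_start = 0, 0
    for i in range(1, n):
        a, b = sa[i - 1], sa[i]
        k = 0  # LCP of the two lexicographically adjacent suffixes
        while a + k < n and b + k < n and t[a + k] == t[b + k]:
            k += 1
        if k > best_len:
            best_len, best_start = k, b
    return t[best_start:best_start + best_len]


if __name__ == "__main__":
    print(longest_repeated_substring("GATAGACA$"))
```

The two occurrences found this way may overlap, since adjacent suffixes are compared regardless of their starting positions.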
Without further ado, try this operation in the visualization. We have LRS = "GA".
It is possible that T contains more than one LRS, e.g., an input for which we have LRS = "ANA" (whose occurrences actually overlap) or "BAN" (without overlap).
After we construct the generalized Suffix Array of the concatenation of both strings, T1$T2#, of length n = n1+n2 in O(n log n) and compute its LCP Array in O(n), we can find the Longest Common Substring (LCS) of T1 and T2 by simply iterating through all LCP values and reporting the largest one whose two adjacent suffixes come from two different strings.
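The idea can be sketched as follows, assuming T1 already ends with '$' and T2 already ends with '#', and that neither terminator appears elsewhere. As a simplification, the generalized Suffix Array is built naively and each adjacent LCP is computed by direct comparison; the function name `longest_common_substring` is hypothetical.

```python
def longest_common_substring(t1, t2):
    """Return one longest substring common to t1 and t2.

    Assumptions: t1 ends with '$', t2 ends with '#', and the terminators
    occur nowhere else, so a common prefix never crosses a boundary.
    """
    t = t1 + t2
    n, n1 = len(t), len(t1)
    sa = sorted(range(n), key=lambda i: t[i:])  # naive construction
    best = ""
    for i in range(1, n):
        a, b = sa[i - 1], sa[i]
        # Only consider adjacent suffixes that start in different strings.
        if (a < n1) == (b < n1):
            continue
        k = 0
        while a + k < n and b + k < n and t[a + k] == t[b + k]:
            k += 1
        if k > len(best):
            best = t[a:a + k]
    return best


if __name__ == "__main__":
    print(longest_common_substring("GATAGACA$", "CATA#"))
```

A suffix is attributed to T1 when its starting index is below len(t1), and to T2 otherwise.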
Without further ado, try this operation on the generalized Suffix Array of strings T1 = "GATAGACA$" and T2 = "CATA#". We have LCS = "ATA".
You are allowed to use/modify our implementation code for fast Suffix Array+LCP: sa_lcp.cpp | py | java | ml to solve programming contest problems that need it.