Suffix Array is a sorted array of all suffixes of a given (usually long) text string T of length n characters (n can be in order of hundred thousands characters).
Suffix Array is a simple, yet powerful data structure which is used, among others, in full text indices, data compression algorithms, and within the field of bioinformatics.
This data structure is very related to the Suffix Tree data structure. Both data structures are usually studied together.
The visualization of Suffix Array is simply a table where each row represents a suffix and each column represents the attributes of the suffixes.
The four (basic) attributes of each row i are:
Some operations may add more attributes to each row and are explained when that operations are discussed.
All available operations on the Suffix Array are listed below.
In this visualization, we show the proper O(n log n) construction of Suffix Array based on the idea of Karp, Miller, & Rosenberg (1972) that sort prefixes of the suffix in increasing length (1, 2, 4, 8, ...), a.k.a. the prefix doubling algorithm.
We limit the input to only accept 12 (cannot be too long due to the available drawing space — but in the real application of Suffix Tree, n can be in order of hundred thousand to million characters) UPPERCASE (we delete your lowercase input) alphabet and the special terminating symbol '$' characters (i.e., [A-Z$]). If you do not write a terminating symbol '$' at the back of your input string, we will automatically do so. If you place a '$' in the middle of the input string, they will be ignored. And if you enter an empty input string, we will resort to the default "GATAGACA$".
For convenience, we provide a few classic test case input strings usually found in Suffix Tree/Array lectures, but to showcase the strength of this visualization tool, you are encouraged to enter any 12-characters string of your choice (ending with character '$').
Note that the LCP Array column remains empty in this operation. They are to be computed separately via the Longest Common Prefix operation.
This Prefix Doubling Algorithm runs in O(log n) iterations, where for each iteration, it compares substring T[SA[i]:SA[i+k]] with T[SA[i+k]:SA[i+2*k]], i.e., first compare two pairs of characters, then compare first two characters with the next two, then compare the first four characters with the next four, and so on.
This algorithm is best explored via visualization, see
in action.Time complexity: There are O(log n) prefix doubling iterations, and each iteration we call O(n) Radix Sort, thus it runs in O(n log n) — good enough to handle up to n ≤ 200K characters in typical programming competition problems involving long strings.
After we construct the Suffix Array of T in O(n log n), we can search for the occurrence of Pattern string T in O(m log n) by binary searching the sorted suffixes to find the lower bound (the first occurrence of P as a prefix of any suffix of T) and the upper bound positions (thelast occurrence of P as a prefix of any suffix of T).
Time complexity: O(m log n) and it will return an interval of size k where k is the total number of occurrences.
For example, on the Suffix Array of T = "GATAGACA$" above, try these scenarios: