A Suffix Array is a sorted array of all suffixes of a given (usually long) text string T of length n characters (n can be in the order of hundreds of thousands of characters).
Suffix Array is a simple, yet powerful data structure which is used, among others, in full text indices, data compression algorithms, and within the field of bioinformatics.
This data structure is closely related to the Suffix Tree data structure. Both data structures are usually studied together.
The visualization of Suffix Array is simply a table where each row represents a suffix and each column represents the attributes of the suffixes.
The four (basic) attributes of each row i are:
- index i, ranging from 0 to n-1,
- SA[i]: the i-th lexicographically smallest suffix of T is the SA[i]-th suffix,
- LCP[i]: the Longest Common Prefix between the i-th and the (i-1)-th lexicographically smallest suffixes of T is LCP[i] (we will see the application of this attribute later), and
- Suffix T[SA[i]:] - the i-th lexicographically smallest suffix of T is from index SA[i] to the end (index n-1).
Some operations may add more attributes to each row; these are explained when those operations are discussed.
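To make the four attributes concrete, here is a minimal sketch that prints the table for T = "GATAGACA$". It uses a naive construction (sorting the suffixes directly and comparing adjacent rows character by character), not the algorithms discussed later, and the function name `suffix_table` is only an illustrative choice.

```python
def suffix_table(T):
    """Return rows (i, SA[i], LCP[i], suffix), built naively for illustration."""
    n = len(T)
    # Naive construction: sort the starting indices by the suffix they begin.
    SA = sorted(range(n), key=lambda i: T[i:])
    rows = []
    for i in range(n):
        lcp = 0
        if i > 0:
            # LCP[i]: common prefix length of row i and row i-1 (row 0 gets 0).
            a, b = T[SA[i - 1]:], T[SA[i]:]
            while lcp < len(a) and lcp < len(b) and a[lcp] == b[lcp]:
                lcp += 1
        rows.append((i, SA[i], lcp, T[SA[i]:]))
    return rows


for row in suffix_table("GATAGACA$"):
    print(*row, sep="\t")
```

Each printed line corresponds to one row of the visualization: index i, SA[i], LCP[i], and the suffix T[SA[i]:].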
All available operations on the Suffix Array are listed below.
- Construct Suffix Array (SA) is the O(n log n) Suffix Array construction algorithm based on the idea by Karp, Miller, & Rosenberg (1972) that sorts prefixes of the suffixes in increasing length (1, 2, 4, 8, ...).
- Search utilizes the fact that the suffixes in the Suffix Array are sorted and calls two binary searches in O(m log n) to find the first and the last occurrences of pattern string P of length m.
- Longest Common Prefix (LCP) between two adjacent suffixes (excluding the first suffix) can be computed in O(n) using the Permuted LCP (PLCP) theorem. The name of this algorithm is Kasai's algorithm.
- Longest Repeated Substring (LRS) is a simple O(n) algorithm that finds the suffix with the highest LCP value.
- Longest Common Substring (LCS) is a simple O(n) algorithm that finds the suffix with the highest LCP value that comes from two different strings.
In this visualization, we show the proper O(n log n) construction of Suffix Array based on the idea of Karp, Miller, & Rosenberg (1972) that sorts prefixes of the suffixes in increasing length (1, 2, 4, 8, ...), a.k.a. the prefix doubling algorithm.
We limit the input to at most 12 characters (it cannot be too long due to the available drawing space, but in real applications of Suffix Tree/Array, n can be in the order of hundreds of thousands to millions of characters). Only UPPERCASE letters and the special terminating symbol '$' are accepted (i.e., [A-Z$]); we delete lowercase input. If you do not write a terminating symbol '$' at the back of your input string, we will automatically add one. If you place a '$' in the middle of the input string, it will be ignored. And if you enter an empty input string, we will resort to the default "GATAGACA$".
For convenience, we provide a few classic test case input strings usually found in Suffix Tree/Array lectures, but to showcase the strength of this visualization tool, you are encouraged to enter any 12-character string of your choice (ending with character '$').
Note that the LCP Array column remains empty in this operation. It is computed separately via the Longest Common Prefix operation.
This Prefix Doubling Algorithm runs in O(log n) iterations. In the iteration with step size k, it ranks each suffix by the pair of substrings T[SA[i]:SA[i]+k] and T[SA[i]+k:SA[i]+2*k], i.e., it first sorts by pairs of adjacent characters, then combines the first two characters with the next two, then the first four characters with the next four, and so on.
This algorithm is best explored by watching the visualization in action.
Time complexity: There are O(log n) prefix doubling iterations, and in each iteration we call the O(n) Radix Sort, thus it runs in O(n log n), which is good enough to handle up to n ≤ 200K characters in typical programming competition problems involving long strings.
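The doubling idea can be sketched in a few lines of Python. Note the swapped-in technique: this sketch uses Python's built-in comparison sort in each iteration instead of Radix Sort, so it runs in O(n log^2 n) rather than O(n log n). The function name `build_sa_prefix_doubling` is a hypothetical name chosen for this example.

```python
def build_sa_prefix_doubling(T):
    """Suffix Array via prefix doubling (built-in sort instead of Radix Sort)."""
    n = len(T)
    if n == 0:
        return []
    rank = [ord(c) for c in T]  # initial ranks: compare the first 1 character
    SA = list(range(n))
    k = 1
    while True:
        # Key of suffix i: (rank of its first k chars, rank of the next k chars).
        # A suffix shorter than k + 1 characters gets -1 for the second part.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        SA.sort(key=key)
        # Re-rank: suffixes with equal keys share a rank, so the new ranks
        # describe the order of the first 2k characters.
        new_rank = [0] * n
        for j in range(1, n):
            bump = 1 if key(SA[j]) != key(SA[j - 1]) else 0
            new_rank[SA[j]] = new_rank[SA[j - 1]] + bump
        rank = new_rank
        if rank[SA[-1]] == n - 1:  # all ranks distinct: suffixes fully sorted
            return SA
        k *= 2


print(build_sa_prefix_doubling("GATAGACA$"))
```

Replacing `SA.sort(key=key)` with a two-pass Radix Sort on the rank pairs is what brings each iteration down to O(n).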
After we construct the Suffix Array of T in O(n log n), we can search for the occurrences of pattern string P in O(m log n) by binary searching the sorted suffixes to find the lower bound (the first occurrence of P as a prefix of any suffix of T) and the upper bound (the last occurrence of P as a prefix of any suffix of T).
Time complexity: O(m log n). The search returns an interval of size k, where k is the total number of occurrences.
For example, on the Suffix Array of T = "GATAGACA$" above, try these scenarios:
- P returns a range of rows: , occurrences = {4, 0}
- P returns one row only: , occurrences = {2}
- P is not found in T: , occurrences = {NIL}
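The two binary searches can be sketched as follows. This is a minimal illustration, not the site's implementation: the function name `find_occurrences` and the patterns "GA", "TA", and "XYZ" are chosen here for the example, and the Suffix Array of "GATAGACA$" is written out as a literal.

```python
def find_occurrences(T, SA, P):
    """Return (lo, hi, positions): rows lo..hi-1 of SA hold suffixes starting with P."""
    n, m = len(SA), len(P)
    # Lower bound: first row whose length-m prefix is >= P.
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first row whose length-m prefix is > P.
    hi = n
    while lo < hi:
        mid = (lo + hi) // 2
        if T[SA[mid]:SA[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return first, lo, [SA[i] for i in range(first, lo)]


T = "GATAGACA$"
SA = [8, 7, 5, 3, 1, 6, 4, 0, 2]
print(find_occurrences(T, SA, "GA"))   # a range of two rows
print(find_occurrences(T, SA, "TA"))   # one row only
print(find_occurrences(T, SA, "XYZ"))  # empty interval: not found
```

Each comparison inspects at most m characters and each binary search takes O(log n) steps, which gives the O(m log n) bound.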
We can compute the Longest Common Prefix (LCP) of two adjacent suffixes (in Suffix Array order) in O(n) time using the three phases of Kasai's algorithm. This algorithm takes advantage of the fact that if we have a long LCP between two adjacent suffixes (in Suffix Array order), that long LCP still overlaps substantially with another suffix in positional order when its first character is removed.
The first phase: Compute the value of Phi[], where Phi[SA[i]] = SA[i-1], in O(n). This lets the algorithm know in O(1) time which suffix is behind Suffix-SA[i] in Suffix Array order.
The second phase: Compute the PLCP[] values between Suffix-i in positional order and Suffix-Phi[i] (the one behind Suffix-i in Suffix Array order). When we advance to the next index i+1 in positional order, we remove the frontmost character of the suffix, but possibly retain much of the LCP value between Suffix-(i+1) and Suffix-Phi[i+1]. The PLCP theorem (not proven here) shows that the LCP value can only be incremented up to n times, and thus can only be decremented at most n times too, making the overall complexity of the second phase O(n) as well.
The third phase: Compute the value of LCP[], where LCP[i] = PLCP[SA[i]], in O(n). These LCP values are the ones that we use for other Suffix Array applications later.
Time complexity: Kasai's algorithm utilizes the PLCP theorem, where the total number of increase (and decrease) operations on the LCP value is at most O(n), so it runs in O(n) overall. Thus, the combination of the O(n log n) Suffix Array construction (via the Prefix Doubling algorithm) and the O(n) computation of the LCP Array using Kasai's algorithm is good enough to handle up to n ≤ 200K characters in typical programming competition problems involving long strings.
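The three phases map directly onto three short loops. The following is a sketch under the assumption that SA is a valid Suffix Array of T; the function name `compute_lcp` is illustrative, and resetting L to 0 for the suffix without a predecessor is a defensive choice made in this sketch.

```python
def compute_lcp(T, SA):
    """LCP array from T and its Suffix Array, following the three phases."""
    n = len(T)
    # Phase 1: Phi[SA[i]] = SA[i-1]; the first row has no predecessor (-1).
    Phi = [-1] * n
    for i in range(1, n):
        Phi[SA[i]] = SA[i - 1]
    # Phase 2: PLCP in positional order. L carries over the match length
    # from Suffix-i to Suffix-(i+1), minus the one removed front character.
    PLCP = [0] * n
    L = 0
    for i in range(n):
        if Phi[i] == -1:
            L = 0  # nothing to compare against; start fresh (sketch's choice)
            continue
        while i + L < n and Phi[i] + L < n and T[i + L] == T[Phi[i] + L]:
            L += 1
        PLCP[i] = L
        L = max(L - 1, 0)
    # Phase 3: permute PLCP back into Suffix Array order.
    return [PLCP[SA[i]] for i in range(n)]


print(compute_lcp("GATAGACA$", [8, 7, 5, 3, 1, 6, 4, 0, 2]))
```

L increases at most n times in total and decreases by at most one per position, which is why the inner while loop does not make the second phase quadratic.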
After we construct the Suffix Array of T in O(n log n) and compute its LCP Array in O(n), we can find the Longest Repeated Substring (LRS) in T by simply iterating through all LCP values and reporting the largest one.
This is because each value LCP[i] in the LCP Array is the length of the longest common prefix between two lexicographically adjacent suffixes: Suffix-SA[i] and Suffix-SA[i-1]. This corresponds to an internal vertex of the equivalent Suffix Tree of T that branches out to at least two suffixes, thus the common prefix of these adjacent suffixes is repeated.
The longest common (repeated) prefix is the required answer, which can be found in O(n) by going through the LCP array once.
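A single pass over the LCP array can be sketched as below. The SA and LCP literals are those of T = "GATAGACA$"; the function name `longest_repeated_substrings` is hypothetical, and returning every distinct substring that reaches the maximum (rather than just one) is a design choice of this sketch.

```python
def longest_repeated_substrings(T, SA, LCP):
    """Return all distinct longest repeated substrings of T (empty list if none)."""
    best = max(LCP, default=0)
    if best == 0:
        return []  # no substring occurs twice
    found = []
    for i in range(1, len(SA)):
        if LCP[i] == best:
            # Row i and row i-1 share this prefix, so it occurs at least twice.
            s = T[SA[i]:SA[i] + best]
            if s not in found:
                found.append(s)
    return found


T = "GATAGACA$"
SA = [8, 7, 5, 3, 1, 6, 4, 0, 2]
LCP = [0, 0, 1, 1, 1, 0, 0, 2, 0]
print(longest_repeated_substrings(T, SA, LCP))
```

The loop itself is O(n); the `s not in found` check is kept simple for readability and could be replaced by a set for long inputs with many ties.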
Without further ado, try the Longest Repeated Substring operation. On T = "GATAGACA$", we have LRS = "GA".
It is possible that T contains more than one LRS, e.g., a string where we have LRS = "ANA" (the occurrences overlap) or "BAN" (without overlap).
After we construct the generalized Suffix Array of the concatenation of both strings T1$T2# of length n = n1+n2 in O(n log n) and compute its LCP Array in O(n), we can find the Longest Common Substring (LCS) of T1 and T2 by simply iterating through all LCP values and reporting the largest one whose two adjacent suffixes come from two different strings.
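The "comes from two different strings" check can be sketched as follows. For brevity this sketch builds the generalized Suffix Array naively and computes each LCP directly, so it does not achieve the O(n log n) bound described above; the function name `longest_common_substring` is illustrative, and both inputs are assumed to end with distinct terminators such as '$' and '#'.

```python
def longest_common_substring(T1, T2):
    """LCS of T1 and T2, which are assumed to end with distinct terminators."""
    T = T1 + T2
    n, n1 = len(T), len(T1)
    # Naive generalized Suffix Array (for illustration only).
    SA = sorted(range(n), key=lambda i: T[i:])
    best_len, best_pos = 0, 0
    for i in range(1, n):
        a, b = SA[i - 1], SA[i]
        # A suffix belongs to T1 if it starts before index n1, else to T2.
        if (a < n1) == (b < n1):
            continue  # both adjacent suffixes come from the same string
        lcp = 0
        while a + lcp < n and b + lcp < n and T[a + lcp] == T[b + lcp]:
            lcp += 1
        if lcp > best_len:
            best_len, best_pos = lcp, b
    return T[best_pos:best_pos + best_len]


print(longest_common_substring("GATAGACA$", "CATA#"))
```

Because the terminators are distinct and each appears once, a common prefix of two suffixes from different strings never extends across a terminator.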
Without further ado, try the Longest Common Substring operation on the generalized Suffix Array of string T1 = "GATAGACA$" and T2 = "CATA#". We have LCS = "ATA".
You are allowed to use/modify our implementation code for fast Suffix Array+LCP: sa_lcp.cpp | py | java | ml to solve programming contest problems that need it.
Note that if you notice any bug in this visualization or if you want to request a new visualization feature, do not hesitate to email the project leader, Dr Steven Halim, at stevenhalim at gmail dot com.