

A Suffix Tree is a compressed tree containing all the suffixes of a given (usually long) text string T of length n characters (n can be on the order of hundreds of thousands of characters).
The starting position of each suffix in the text string T is recorded as an integer index at a leaf of the Suffix Tree, whereas the path label of each leaf (the concatenation of edge labels starting from the root) spells out the suffix.
A Suffix Tree provides a particularly fast implementation of many important (long) string operations.
This data structure is closely related to the Suffix Array data structure, and the two are usually studied together.
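Suffix i (or the i-th suffix) of a (usually long) text string T is a 'special case' of a substring: it starts at the i-th character of the string and runs to its last character.
For example, if T = "STEVEN$", then suffix 0 of T is "STEVEN$" (0-based indexing), suffix 2 is "EVEN$", suffix 4 is "EN$", and so on.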
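The definition can be checked with a few lines of Python. This is only a minimal sketch; the helper name `suffixes` is made up for this illustration.

```python
def suffixes(text: str) -> list[str]:
    """Return a list whose element i is suffix i of `text` (0-based)."""
    return [text[i:] for i in range(len(text))]


T = "STEVEN$"
for i, s in enumerate(suffixes(T)):
    print(i, s)  # suffix 2 is "EVEN$", suffix 4 is "EN$"
```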
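The visualization of the Suffix Tree of a string T is basically a rooted tree in which the path label (concatenation of edge labels) from the root to each leaf describes one suffix of T. Each leaf vertex represents one suffix (the terminating symbol $ ensures this property), and the integer written inside the leaf vertex is the suffix number.
An internal vertex branches out to more than one child vertex, so more than one suffix passes through it on the way from the root to a leaf. The path label of an internal vertex is the common prefix of those suffixes.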
The Suffix Tree above is built from the string T = "GATAGACA$", which has these 9 suffixes:
| i | Suffix |
|---|---|
| 0 | GATAGACA$ |
| 1 | ATAGACA$ |
| 2 | TAGACA$ |
| 3 | AGACA$ |
| 4 | GACA$ |
| 5 | ACA$ |
| 6 | CA$ |
| 7 | A$ |
| 8 | $ |
Now verify that the path labels of suffix 7/6/2 are "A$"/"CA$"/"TAGACA$", respectively (there are 6 other suffixes). The internal vertices with path label "A"/"GA" branch out to 4 suffixes {7, 5, 3, 1}/2 suffixes {4, 0}, respectively. Root vertex branches out to all 9 suffixes.
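These claims can be checked without drawing the tree, because the leaves below the vertex with path label X are exactly the suffixes that start with X. The sketch below relies on that observation only; the helper name `suffixes_below` is made up for this illustration and is not part of the visualization.

```python
def suffixes_below(text: str, label: str) -> list[int]:
    """Indices of the suffixes of `text` that start with `label`.

    These are the leaves found in the subtree of the vertex (or edge position)
    whose path label is `label`.
    """
    return [i for i in range(len(text)) if text.startswith(label, i)]


T = "GATAGACA$"
print(suffixes_below(T, "A"))   # [1, 3, 5, 7]
print(suffixes_below(T, "GA"))  # [0, 4]
print(suffixes_below(T, ""))    # the root: all 9 suffixes
```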
In order to ensure that every suffix of the input string T ends in a leaf vertex, we enforce that string T ends with a special terminating symbol '$' that is not used in the original string T and has ASCII value lower than the lowest allowable character in T (which is character 'A' in this visualization). This way, edge label '$' always appears at the leftmost edge of the root vertex of this Suffix Tree visualization.
For the Suffix Tree example above (for T = "GATAGACA$"), notice what happens without the terminating symbol '$': suffix 7 "A" (without the '$') is a prefix of other suffixes, so it does NOT end in a leaf vertex, which can complicate some operations later.
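A suffix fails to end at a leaf exactly when it is a proper prefix of another suffix. The following sketch lists such suffixes with and without the terminating symbol; the function name `hidden_suffixes` is an assumption made for this example.

```python
def hidden_suffixes(text: str) -> list[int]:
    """Indices i such that suffix i is a proper prefix of another suffix.

    Such a suffix would end inside the tree instead of at its own leaf.
    """
    n = len(text)
    return [
        i
        for i in range(n)
        if any(j != i and text.startswith(text[i:], j) for j in range(n))
    ]


print(hidden_suffixes("GATAGACA"))   # [7]: suffix "A" hides inside other suffixes
print(hidden_suffixes("GATAGACA$"))  # []: with '$', every suffix gets its own leaf
```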
As we have ensured that all suffixes end at a leaf vertex, there are at most n leaves (one per suffix) in a Suffix Tree. All internal vertices (including the root vertex if it is an internal vertex) are always branching, so there can be at most n-1 such vertices, as shown with one of the extreme test cases on the right.
The maximum number of vertices in a Suffix Tree is thus n (leaves) + (n-1) (internal vertices) = 2n-1 ∈ O(n). As a Suffix Tree is a tree, its maximum number of edges is (2n-1)-1 = 2n-2 ∈ O(n).
When all characters in string T are distinct (e.g., T = "ABCDE$"), we get the following very short Suffix Tree with exactly n+1 vertices (the +1 is the root vertex).
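The vertex counts can be checked by brute force. The sketch below uses one observation: in the Suffix Tree of a '$'-terminated string, a substring w (the empty string stands for the root) is the path label of an internal vertex exactly when w is followed by at least two different characters somewhere in T. The function name `suffix_tree_size` and the test string "AAAA$" are choices made for this illustration; the code assumes n ≥ 2 and takes quadratic time, which is fine only for short strings.

```python
from collections import defaultdict


def suffix_tree_size(text: str) -> tuple[int, int]:
    """Return (leaves, internal vertices) of the Suffix Tree of `text`.

    Assumes `text` ends with a unique terminating symbol such as '$'.
    """
    n = len(text)
    followers = defaultdict(set)  # substring -> characters that follow it
    for i in range(n):
        for j in range(i, n):
            followers[text[i:j]].add(text[j])
    internal = sum(1 for chars in followers.values() if len(chars) >= 2)
    return n, internal  # one leaf per suffix


for t in ("GATAGACA$", "ABCDE$", "AAAA$"):
    leaves, internal = suffix_tree_size(t)
    print(t, "vertices:", leaves + internal, "bound 2n-1:", 2 * len(t) - 1)
```

For "ABCDE$" the only internal vertex is the root (n+1 vertices in total), whereas a string such as "AAAA$" reaches the 2n-1 bound.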
All available operations on the Suffix Tree in this visualization are listed below:
- Build Suffix Tree (instant/details omitted): instantly build the Suffix Tree from string T.
- Search: find the vertex in the Suffix Tree of a (usually longer) string T whose path label contains the (usually much shorter) pattern/search string P.
- Longest Repeated Substring (LRS): find the deepest internal vertex (the one with the longest path label), as its path label is a common prefix of two (or more) suffixes of T.
- Longest Common Substring (LCS): find the deepest internal vertex that contains suffixes from two different original strings.
There are a few other possible operations of Suffix Tree that are not included in this visualization.
In this visualization, we only show the fully constructed Suffix Tree without describing the details of the O(n) Suffix Tree construction algorithm — it is a bit too complicated. Interested readers can explore this instead.
We limit the input to at most 25 ASCII (or even Unicode) characters. The limit is due to the available drawing space; in real applications of Suffix Tree, n can be on the order of hundreds of thousands to millions of characters. If you do not write a terminating symbol '$' at the end of your input string, we will append it automatically. If you place the character '$' in the middle of the input string, it will be ignored. If you enter an empty input string, we fall back to the default "GATAGACA$".
For convenience, we provide a few classic test case input strings usually found in Suffix Tree/Array lectures, but to showcase the strength of this visualization tool, you are encouraged to enter any up-to-25-characters string of your choice (ending with character '$'). You can use Chinese characters, e.g., "四是四十是十十四不是四十四十不是十四$".
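The input rules can be summarized in a few lines. This is only a sketch of the described behavior, not the visualization's actual code: the names `normalize_input` and `DEFAULT_TEXT` are hypothetical, the 25-character limit is not modelled, and treating an input that consists only of '$' as empty is an assumption.

```python
DEFAULT_TEXT = "GATAGACA$"  # default string named in the text


def normalize_input(raw: str) -> str:
    """Apply the described rules: ignore inner '$', append a final '$'."""
    body = raw.replace("$", "")  # drop every '$', wherever it was typed
    if not body:                 # empty input falls back to the default
        return DEFAULT_TEXT
    return body + "$"            # exactly one terminating symbol at the end


print(normalize_input("BANANA"))   # BANANA$
print(normalize_input("BAN$ANA"))  # BANANA$
print(normalize_input(""))         # GATAGACA$
```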
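Assuming that the Suffix Tree of a (usually longer) string T (of length n) has already been built, we want to find all occurrences of a pattern/search string P (of length m).
To do this, we search the Suffix Tree of T for the vertex x whose path label (the concatenation of edge labels from the root to x) has P as a prefix. Once we find this vertex x, all the leaves in the subtree rooted at x are the occurrences.
Time complexity: O(m+k), where k is the total number of occurrences.
For example, on the Suffix Tree of T = "GATAGACA$" above, try these scenarios:
- P matches the path label of vertex x exactly: P = "A", occurrences = {7, 5, 3, 1}, or P = "GA", occurrences = {4, 0}.
- P matches only part of the path label of vertex x: occurrences = {2} or occurrences = {0}.
- P is not found in T: occurrences = {NIL}.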
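The search can be sketched in Python as follows. Note the assumptions: the tree is built here with a naive O(n²) insertion of one suffix at a time, not with the O(n) construction mentioned elsewhere in this e-Lecture; the input must end with a unique terminating symbol; and the names `Node`, `build_suffix_tree`, `leaves`, and `search` are made up for this illustration. The patterns "T", "GAT", and "CAT" are example inputs chosen here.

```python
class Node:
    def __init__(self, suffix=None):
        self.children = {}    # first char of edge label -> (edge label, child)
        self.suffix = suffix  # suffix index for a leaf, None for internal vertices


def build_suffix_tree(text):
    """Naive construction: insert the suffixes one by one, splitting edges."""
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while True:
            if rest[0] not in node.children:         # no edge to follow: new leaf
                node.children[rest[0]] = (rest, Node(i))
                break
            label, child = node.children[rest[0]]
            k = 0                                    # length of the common part
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):                       # mismatch inside the edge: split it
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[label[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]


    return root


def leaves(node):
    """Collect the suffix indices stored in the subtree rooted at `node`."""
    if node.suffix is not None:
        return [node.suffix]
    return [i for _, child in node.children.values() for i in leaves(child)]


def search(root, pattern):
    """Return the sorted start positions of `pattern`, or [] if it is absent."""
    node, rest = root, pattern
    while rest:
        if rest[0] not in node.children:
            return []
        label, child = node.children[rest[0]]
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:                    # mismatch along this edge
            return []
        node, rest = child, rest[k:]                 # pattern may end mid-edge
    return sorted(leaves(node))


root = build_suffix_tree("GATAGACA$")
print(search(root, "A"))    # [1, 3, 5, 7]  exact match of a path label
print(search(root, "GAT"))  # [0]           pattern ends in the middle of an edge
print(search(root, "CAT"))  # []            not found
```

Walking down the tree touches at most m characters, and because every internal vertex branches, collecting the k leaves below x visits O(k) vertices, which matches the O(m+k) bound stated above.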
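Assuming that the Suffix Tree of a (usually longer) string T (of length n) has already been built, we can find the Longest Repeated Substring (LRS) in T by simply finding the deepest internal vertex (the one with the longest path label) of the Suffix Tree of T.
This works because each internal vertex of the Suffix Tree of T branches out to at least two suffixes, i.e., its path label (the common prefix of those suffixes) is repeated.
The deepest internal vertex (the one with the longest path label) is the required answer, and it can be found in O(n) time with a simple tree traversal.
Without further ado, try it. We get LRS = "GA".
It is possible for T to contain more than one LRS.
In one such example we have LRS = "ANA" (its occurrences actually overlap) or "BAN" (no overlap).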
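The idea can be sketched with an uncompressed suffix trie instead of a real Suffix Tree: the branching vertices of the trie (two or more children) are exactly the internal vertices of the Suffix Tree. This costs O(n²) space, so it is only a sketch for short strings, not the O(n) traversal described above. The function name `longest_repeated_substrings` and the second test string "BANANABAN$" are choices made for this illustration.

```python
def longest_repeated_substrings(text):
    """Return all longest repeated substrings of a '$'-terminated `text`."""
    trie = {}
    for i in range(len(text)):            # insert every suffix, char by char
        node = trie
        for ch in text[i:]:
            node = node.setdefault(ch, {})

    best, best_len = [], 0
    stack = [(trie, "")]                  # (vertex, path label) pairs
    while stack:
        node, label = stack.pop()
        if len(node) >= 2 and label:      # branching vertex other than the root
            if len(label) > best_len:
                best, best_len = [label], len(label)
            elif len(label) == best_len:
                best.append(label)
        for ch, child in node.items():
            stack.append((child, label + ch))
    return sorted(best)


print(longest_repeated_substrings("GATAGACA$"))   # ['GA']
print(longest_repeated_substrings("BANANABAN$"))  # ['ANA', 'BAN']
```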
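This time, we need two input strings T1 and T2 that end with the symbols '$' and '#', respectively. We then build the generalized Suffix Tree of the two strings T1+T2 in O(n) time, where n = n1+n2 (the sum of the lengths of both strings). We can find the Longest Common Substring (LCS) of T1 and T2 by simply finding the deepest valid internal vertex of the generalized Suffix Tree of T1+T2.
To be valid and be considered as an LCS candidate, an internal vertex must represent suffixes from both strings, i.e., a common substring found in both T1 and T2.
Moreover, as an internal vertex of a Suffix Tree branches out to at least two suffixes, its path label (the common prefix of those suffixes) is repeated. If that internal vertex is also valid, its path label is a repeated common substring.
The deepest (longest path label) valid internal vertex is the required answer, and it can be found in O(n) time with a simple tree traversal.
Without further ado, try it on the generalized Suffix Tree of T1 = "GATAGACA$" and T2 = "CATA#" (notice that the UI switches to the generalized Suffix Tree version). We get LCS = "ATA".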
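A brute-force sketch of the same idea: insert the suffixes of both strings into one uncompressed trie, mark each vertex with the strings that pass through it, and report the deepest branching vertex marked by both. This uses O(n1² + n2²) space rather than the O(n) generalized Suffix Tree described above, and the name `longest_common_substrings` is made up for this illustration.

```python
def longest_common_substrings(t1, t2):
    """Return all longest common substrings of t1 and t2.

    Assumes t1 and t2 end with two different terminating symbols
    (e.g. '$' and '#') that appear nowhere else.
    """
    root = [0, {}]                          # [owner bitmask, children]
    for bit, text in ((1, t1), (2, t2)):    # bit 1 marks T1, bit 2 marks T2
        for i in range(len(text)):
            node = root
            node[0] |= bit
            for ch in text[i:]:
                node = node[1].setdefault(ch, [0, {}])
                node[0] |= bit

    best, best_len = [], 0
    stack = [(root, "")]
    while stack:
        (mask, children), label = stack.pop()
        if mask != 3:                       # not shared by both strings: prune
            continue
        if label and len(children) >= 2:    # valid internal (branching) vertex
            if len(label) > best_len:
                best, best_len = [label], len(label)
            elif len(label) == best_len:
                best.append(label)
        for ch, child in children.items():
            stack.append((child, label + ch))
    return sorted(best)


print(longest_common_substrings("GATAGACA$", "CATA#"))  # ['ATA']
```

Because the two terminating symbols differ, a vertex shared by both strings can never be a leaf, so the deepest shared vertex is always a branching one.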
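There are a few other things we can do with a Suffix Tree, such as "find the longest repeated substring without overlap" and "find the longest common substring of ≥ 2 strings", but we leave them for later.
We will continue the discussion of this string-specific data structure with the more versatile Suffix Array data structure.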
If you notice any bug in this visualization or want to request a new visualization feature, do not hesitate to email the project leader, Dr Steven Halim, at stevenhalim at gmail dot com.