后缀树

1. Introduction

A Suffix Tree is a compressed tree containing all the suffixes of the given (usually long) text string T of length n characters (n can be in order of hundred thousands characters).


The positions of each suffix in the text string T are recorded as integer indices at the leaves of the Suffix Tree whereas the path labels (concatenation of edge labels starting from the root) of the leaves describe the suffixes.


Suffix Tree provides a particularly fast implementation for many important (long) string operations.


This data structure is very related to the Suffix Array data structure. Both data structures are usually studied together.

1-1. Suffix of a String T

The suffix i (or the i-th suffix) of a (usually long) text string T is a 'special case' of substring that goes from the i-th character of the string up to its last character.


For example, if T = "STEVEN$", then suffix 0 of T is "STEVEN$" (0-based indexing), suffix 2 of T is "EVEN$", suffix 4 of T is "EN$", etc.

2. Suffix Tree Visualization

The visualization of Suffix Tree of a string T is basically a rooted tree where path label (concatenation of edge label(s)) from root to each leaf describes a suffix of T. Each leaf vertex is a suffix and the integer value written inside the leaf vertex (we ensure this property with terminating symbol $) is the suffix number.


An internal vertex will branch to more than one child vertex, therefore there are more than one suffix from the root to the leaves via this internal vertex. The path label of an internal vertex is a common prefix among those suffix(es).

2-1. Example with T = "GATAGACA$"

The Suffix Tree above is built from string T = "GATAGACA$" that have these 9 suffixes:

iSuffix
0GATAGACA$
1ATAGACA$
2TAGACA$
3AGACA$
4GACA$
5ACA$
6CA$
7A$
8$

Now verify that the path labels of suffix 7/6/2 are "A$"/"CA$"/"TAGACA$", respectively (there are 6 other suffixes). The internal vertices with path label "A"/"GA" branch out to 4 suffixes {7, 5, 3, 1}/2 suffixes {4, 0}, respectively (we ignore the trivial internal vertex = root vertex that branches out to all 9 suffixes).

2-2. Terminating Symbol $

In order to ensure that every suffix of the input string T ends in a leaf vertex, we enforce that string T ends with a special terminating symbol '$' that is not used in the original string T and has ASCII value lower than the lowest allowable character in T (which is character 'A' in this visualization). This way, edge label '$' always appear at the leftmost edge of the root vertex of this Suffix Tree visualization.


For the Suffix Tree example above (for T = "GATAGACA$"), if we do not have terminating symbol '$', notice that suffix 7 "A" (without the '$') does NOT end in a leaf vertex and can complicate some operations later.

2-3. Suffix Tree has O(n) Vertices

As we have ensured that all suffixes end at a leaf vertex, there are at most n leaves/suffixes in a Suffix Tree. All internal vertices (including the root vertex if it is an internal vertex) are always branching thus there can be at most n-1 such vertices, as shown with one of the extreme test case on the right.


The maximum number of vertices in a Suffix Tree is thus = n (leaves) + (n-1) internal vertices = 2n-1 = O(n) vertices. As Suffix Tree is a tree, the maximum number of edges in a Suffix Tree is also (2n-1)-1 = O(n) edges.

2-4. Much Shorter Suffix Tree

When all the characters in string T is all distinct (e.g., T = "ABCDE$"), we can have the following very short Suffix Tree with exactly n+1 vertices (+1 due to root vertex).

3. 可执行的操作

All available operations on the Suffix Tree in this visualization are listed below:

  1. Build Suffix Tree (instant/details omitted) — instant-build the Suffix Tree from string T.
  2. Search — Find the vertex in Suffix Tree of a (usually longer) string T that has path label containing the (usually shorter) pattern/search string P.
  3. Longest Repeated Substring (LRS) — Find the deepest (the one that has the longest path label) internal vertex (as that vertex shares common prefix between two (or more) suffixes of T).
  4. Longest Common Substring (LCS) — Find the deepest internal vertex that contains suffixes from two different original strings.

There are a few other possible operations of Suffix Tree that are not included in this visualization.

3-1. Build Suffix Tree (instant)

In this visualization, we only show the fully constructed Suffix Tree without describing the details of the O(n) Suffix Tree construction algorithm — it is a bit too complicated. Interested readers can explore this instead.


We limit the input to only accept 12 (cannot be too long due to the available drawing space — but in the real application of Suffix Tree, n can be in order of hundred thousand to million characters) UPPERCASE (we delete your lowercase input) alphabet and the special terminating symbol '$' characters (i.e., [A-Z$]). If you do not write a terminating symbol '$' at the back of your input string, we will automatically do so. If you place a '$' in the middle of the input string, they will be ignored. And if you enter an empty input string, we will resort to the default "GATAGACA$".


For convenience, we provide a few classic test case input strings usually found in Suffix Tree/Array lectures, but to showcase the strength of this visualization tool, you are encouraged to enter any 12-characters string of your choice (ending with character '$').