A **Suffix Tree** is a compressed tree containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix Tree provides a particularly fast implementation for many important string operations. This data structure is very related to __Suffix Array__ data structure.

The suffix **i** (or the **i**-th suffix) of a (usually long) text string **T** is a 'special case' of substring that goes from the **i**-th character of the string up to the *last* character of the string.

For example, if **T** = "STEVEN$", then suffix 0 of **T** is "STEVEN$" (0-based indexing), suffix 2 of **T** is "EVEN$", suffix 4 of **T** is "EN$", etc.

The visualization of Suffix Tree of a string **T** is basically a rooted tree where path label (concatenation of edge label(s)) from root to each leaf describes a suffix of **T**. Each leaf vertex is a suffix and the integer value written inside the leaf vertex is the suffix number.

An internal vertex will branch to more than one child vertex, therefore there are more than one suffix from the root to the leaves via this internal vertex. The path label of an internal vertex is a common prefix among those suffix(es).

The Suffix Tree above is built from string **T** = "GATAGACA$" that have these 9 suffixes:

i | Suffix |
---|---|

0 | GATAGACA$ |

1 | ATAGACA$ |

2 | TAGACA$ |

3 | AGACA$ |

4 | GACA$ |

5 | ACA$ |

6 | CA$ |

7 | A$ |

8 | $ |

Now verify that the path labels of suffix 7/6/2 are "A$"/"CA$"/"TAGACA$", respectively (there are 6 other suffixes). The internal vertices with path label "A"/"GA" branch out to 4 suffixes {7, 5, 3, 1}/2 suffixes {4, 0}, respectively (we ignore the trivial internal vertex = root vertex that branches out to all 9 suffixes).

In order to ensure that every suffix of the input string **T** ends in a leaf vertex, we enforce that string **T** ends with a special terminating symbol '$' that is not used in the original string **T** and has ASCII value lower than the lowest allowable character in **T** (which is character 'A'). This way, edge label '$' always appear at the leftmost edge of an internal vertex of this Suffix Tree visualization.

For the Suffix Tree example above (for **T** = "GATAGACA$"), if we do not have terminating symbol '$', notice that suffix 7 "A" does NOT end in a leaf vertex and can complicate some operations later.

As we have ensured that all suffixes end at a leaf vertex, there are *at most* **n** leaves/suffixes in a Suffix Tree. All internal vertices (including the root vertex if it is an internal vertex) are always branching thus there can be at most **n**-1 such vertices, as shown with one of the extreme test case on the right.

The maximum number of vertices in a Suffix Tree is thus = **n** (leaves) + (**n**-1) internal vertices = **2n**-1 = O(**n**) vertices. As Suffix Tree is a tree, the maximum number of edges in a Suffix Tree is also (**2n**-1)-1 = O(**n**) edges.

When all the characters in string **T** is all distinct (e.g. **T** = "ABCDE$"), we can have the following very short Suffix Tree with exactly **n**+1 vertices (+1 due to root vertex).

All available operations on the Suffix Tree in this visualization are listed below:

**Build Suffix Tree (instant)**— instant-build the Suffix Tree from string**T**.**Search**— Find the vertex in Suffix Tree of a (usually longer) string**T**that has path label containing the (usually shorter) pattern/search string**P**.**Longest Repeated Substring (LRS)**— Find the deepest internal vertex (as that vertex shares common prefix between two (or more) suffixes of**T**).**Longest Common Substring (LCS)**— Find the deepest internal vertex that contains suffixes from two different original strings.

In this visualization, we only show the fully constructed Suffix Tree *without describing the details of the O( n) Suffix Tree construction algorithm* — it is a bit too complicated.

We limit the input to only accept 12 UPPERCASE alphabet and the special terminating symbol '$' characters (ie.g [A-Z$]). If you do not write a terminating symbol '$' at the back of your input string, we will automatically do so. If you place a '$' in the middle of the input string, they will be ignored. And if you enter an empty input string, we will resort to the default "GATAGACA$".

For convenience, we provide a few classic test case input strings usually found in Suffix Tree/Array lectures.

Assuming that the Suffix Tree of a (usually longer) string **T** (of length **n**) has been built, we want to find all occurrences of pattern/search string **P** (of length **m**).

To do this, we search for the vertex **x** in the suffix Tree of **T** which has path label that represents **P**. Once we find this vertex **x**, all the leaves in the subtree rooted at **x** are the occurrences.

Time complexity: O(**m+occ**) where **occ** is the total number of occurrences.

For example, on the Suffix Tree of **T** = "GATAGACA$" above, let's try finding:

- , occurrences = {7, 5, 3, 1}
- , occurrences = {4, 0}
**P**= "T", should return occurrences = {2}, but there is a silly bug that we have not killed yet**P**= "Z", should return occurrences = {NIL}, but there is a silly bug that we have not killed yet

Assuming that the Suffix Tree of a (usually longer) string **T** (of length **n**) has been built, we can find the Longest Repeated Substring (LRS) in **T** by simply finding the deepest internal vertex of the Suffix Tree of **T**.

This is because each internal vertex of the Suffix Tree of **T** branches out to at least two (or more) suffixes, i.e. the path label (common prefix of these suffixes) are **repeated**.

The internal vertex with the deepest/longest path label is the required answer, which can be found in O(**n**) with a simple tree traversal.

Without further ado, try **T** = "GATAGACA$" above.

This time, we need two input strings **T1** and **T2** that terminate with symbol '$'/'#', respectively. We then create the **generalized** Suffix Tree of these two strings **T1+T2**. Then, we can find the Longest Common Substring (LCS) of those two strings **T1** and **T2** by simply finding the deepest **and valid** internal vertex of the generalized Suffix Tree of **T1+T2**.

This is because each internal vertex of the Suffix Tree of **T** branches out to at least two (or more) suffixes, i.e. the path label (common prefix of these suffixes) are **repeated**. Then, we add an additional constraint where an internal vertex is considered valid (to be considered as LCS candidate) only if it represents suffixes from **both strings**, i.e. not just **repeated**, but a **common** substring found in both **T1** and **T2**.

The valid internal vertex with the deepest/longest path label is the required answer, which can be found in O(**n**) with a simple tree traversal.

Without further ado, try **T1** = "GATAGACA$" and **T2** = "CATABB#" (notice that the UI will change to generalized Suffix Tree version).

There are a few other things that we can do with Suffix Tree like "Finding Longest Repeated Substring without overlap", "Finding Longest Common Substring of ≥ 2 strings", etc, but we will keep that for later.

We will continue the discussion of this String-specific data structure with the more versatile to __Suffix Array__ data structure.