display | more...

A suffix tree is a particular type of data structure that lends itself greatly to the solution of the exact string matching problem in the field of computer science. It was first introduced by P. Weiner at the 14th IEEE Annual Symposium on Switching and Automata Theory in 1973 during his presentation on linear pattern matching, and has been continually refined since then.

The exact string matching problem is stated as follows, in a rather formal fashion:

Given a string P called the pattern and a longer string T called the text, the exact string matching problem is to find all occurrences, if any, of pattern P in text T.

In simpler terms, the goal is to take a document and search through that document for a particular word or phrase of some length shorter than the document, and do this search as fast as possible. Solutions to this problem are abound in computer science; this writeup focuses on the use of the suffix tree to solve the problem.

A basic definition of the suffix tree data structure is as follows. If text=t1t2 ... ti ... tn is a string, then Ti=titi+1 ... tn is the suffix of text that starts at position i. These suffixes are then assembled into a tree. From this tree, simple searches can be made starting at the root and following the branches that match the particular pattern you desire.

A simple example of the suffix tree data structure is as follows. Let's say we have text="fordprefect". The following suffixes are a part of that string:

t1  = fordprefect
t2  = ordprefect
t3  = rdprefect
t4  = dprefect
t5  = prefect
t6  = refect
t7  = efect
t8  = fect
t9  = ect
t10 = ct
t11 = t
t12 = null

In order to make things more clear, let's sort these prefixes into alphabetical order and toss out the null:

t10 = ct
t4  = dprefect
t9  = ect
t7  = efect
t8  = fect
t1  = fordprefect
t2  = ordprefect
t5  = prefect
t3  = rdprefect
t6  = refect
t11 = t

From these fragments, a tree can be assembled. Simply by starting at the root of this tree and following the branches, any particular piece that can be found in the text fordprefect can quickly be found here. Wherever the tree branches, you simply follow the branch that matches the next letter in the pattern. If there are no matches for the next letter in the pattern, there is not a match for it in the text; if you run out of pattern without a mismatch, then there is a match for it in the text.

tree root-->|---ct
            |
            |---dprefect
            |
            |---e-->|---ct
            |       |
            |       |---fect
            |
            |---f-->|---ect
            |       |
            |       |---ordprefect
            |
            |---ordprefect
            |
            |---prefect
            |
            |---r-->|---dprefect
            |       |
            |       |---efect
            |
            |---t

Let's look at how this tree was built in the simplest way. We start out by adding each character in the tree as we find it.

           tree1

tree root-->|---f...

The ... after the f indicates that as each subsequent character from the text is added to the tree, that character is added to this particular branch of the tree. Let's continue.

           tree2

tree root-->|---fo...
            |
            |---o...

Now we add r...

           tree3

tree root-->|---for...
            |
            |---or...
            |
            |---r...

and then d...

           tree4

tree root-->|---d...
            |
            |---ford...
            |
            |---ord...
            |
            |---rd...

and then p. Branching should be pretty clear at this point.

           tree5

tree root-->|---dp...
            |
            |---fordp...
            |
            |---ordp...
            |
            |---p...
            |
            |---rdp...

Now comes the first trick. We're adding another r, so we will have to split up the branch that begins with r. One could also wait until the next step, since r does start off the prefix rdpr, but we will branch on this step because it makes the concept more clear.

           tree6

tree root-->|---dpr...
            |
            |---fordpr...
            |
            |---ordpr...
            |
            |---pr...
            |
            |---r--|-->dpr...
                   |
                   |-->null...

Now we add another new branch for e.

           tree7

tree root-->|---dpre...
            |
            |---e...
            |
            |---fordpre...
            |
            |---ordpre...
            |
            |---pre...
            |
            |---r--|-->dpre...
                   |
                   |-->e...

At this point, with the addition of f, another branching occurs.

           tree8

tree root-->|---dpref...
            |
            |---ef...
            |
            |---f--|-->ordpref...
            |      |
            |      |-->null...
            |
            |---ordpref...
            |
            |---pref...
            |
            |---r--|-->dpref...
                   |
                   |-->ef...

And now we branch again with the addition of the second e.

           tree9

tree root-->|---dprefe...
            |
            |---e--|-->fe...
            |      |
            |      |-->null...
            |
            |---f--|-->ordprefe...
            |      |
            |      |-->e...
            |
            |---ordprefe...
            |
            |---prefe...
            |
            |---r--|-->dprefe...
                   |
                   |-->efe...

Now we add new branches from the root for c and t, and having reached the end of the string, we have our final tree.

           treefinal

tree root-->|---ct
            |
            |---dprefect
            |
            |---e-->|---ct
            |       |
            |       |---fect
            |
            |---f-->|---ect
            |       |
            |       |---ordprefect
            |
            |---ordprefect
            |
            |---prefect
            |
            |---r-->|---dprefect
            |       |
            |       |---efect
            |
            |---t

The trees here were visually assembled using an off-the-cuff program I wrote in C for a programming class long ago. There are many public domain sources for code for this type of tree assembly, however.

Once the tree is built, searching for any substring is very quick and easy; simply traverse the tree. This makes the time limitation of a search rely much more on the preprocessing, which can take on the order of T time, where T is the length of the text. The search, though, only takes on the order of P time, where P is the length of the much shorter pattern.

Applications for the suffix tree include search engines, bioinformatics studies, and much more; suffix trees are often applied any time one uses any sort of search.

Code for suffix trees is available in many places. Try searching through Google for suffix tree code; you will find many examples in many languages.

Log in or register to write something here or to contact authors.