display | more...

The P-against-all problem is a large-scale string alignment problem that asks for a great amount of related alignment information. It is particularly interesting because there are so many ways to improve the speed of the solution by coming up with inventive methods for reducing the large amount of possibly redundant computation of the related alignment information.

The problem, simply stated, is as follows.

Assume that string P has length n and that string T has length m > n. Find the substring of T that has the shortest edit distance between itself and P. The edit distance between two strings is the minimum number of edit operations - insertions, deletions, and substitutions - needed to transform the first string into the second.

Note: The exact method of calculating the edit distance between two given substrings varies wildly between specific implementation and need. Values for the cost of insertion, deletion, and substitution are usually given by the description of a specific problem.

The most naive solution is to compute the edit distance between P and every single substring of T, of which there is a huge number. This takes on the order of n*m^3 time to calculate, which is ridiculously slow for a string T of any reasonable size.

The first speedup for solving the problem comes with the realization that we only need to calculate the edit distances between P and each suffix S of T. In the process of calculating edit distance, we already calculate the value of each prefix of each element in S, so simply by retaining this information, we save ourselves time on the order of m, reducing the overall time to n*m^2.

We can reduce this even further if we assume that T is very large; this can be done by generating a suffix tree of T. Then we traverse the tree in a depth-first fashion, calculating every substring along a given path. Once you traverse back up to a node, simply store the value of P's edit distance with the particular substring between that node and the root.

By calculating the edit distances in a non-linear fashion, we take advantage of any and all redundancy in T. We then reduce the time needed to be on the order of n*(length of tree) + m^2. For a large T, this is another significant increase in speed.

This problem constantly pops up in inexact string matching, where two long strings need to be compared not for an exact match, but for the best of all the inexact matches. An example of this problem is when you compare a genetic sequence against GenBank, when you assemble contigs, or when trying to find passages in long texts, such as bible verse searches.

Log in or register to write something here or to contact authors.