search for similar sub-strings (idea) by parmentier

When one searches for string similar to another inside a bigger string, one can not decide surely that the searched string is at any location of the analyzed string before to traverse it entirely. Once the traversal is done, one has yet to decide whether it is in, and if it is, at which location. In a strict search, on can go through the string to be searched from left to right and stop as soon as the string is found. In an approximate search, it is not possible,except when one actually finds strictly the same string. Let us interest to the other cases.

For example, one can look for belle in corbeilles. Taking at every time a part of corbeilles having the same size than belle, one obtains edition distances between 3 and 1. Table 1 shows that similarity increases until the most alike string, and then decreases.

Table 1: Evolving of the the strings similarity during scanning.

                         +-------+-------+-------+-------+-------+-------+ 
                         | corbe | orbei | rbeil | beill | eille | illes |
            +------------+-------+-------+-------+-------+-------+-------+
            |  distance  |   3   |   3   |   2   |   1   |   1   |   2   | 
            +------------+-------+-------+-------+-------+-------+-------+
            | similarity |  40%  |  40%  |  60%  |  80%  |  80%  |  60%  |
            +------------+-------+-------+-------+-------+-------+-------+

It is obvious that a string like corbe has almost nothing common with belle. It is thus useful to set a threshold under which it is not worthwhile to stop to say: "the searched string is perhaps here." Here, one can say that under 75% of similarity, the strings are enough dissimilar to be left away. The algorithm could then say that it perhaps found belle in corbeilles, but that is or beill (the first sub-string of the same length, which similarity value exceeds the 75% threshold), or eille (a sub-string of same length, exceeding the threshold and having the maximal similarity value).

Once that a similar string is found, one looks for the recognition peak (the higher similarity score) over the threshold. If the threshold was 60% for Table 1, the peak would have been 80% for the sub-string beill (that is to say the sub-string going from third position to seventh one). One chose to return only the first matching sub-string (in a left-to-right traversal), to allow the continuation of the search in the remaining string (right to the returned sub-string). The string to be found is sometimes present in several samples in the searched string.

It happens that the string to be found is nearer from sub-string with a different length. For example, for belle in corbeilles: the most adapted sub-string is really beille (with the costs defined in the edition distance node, it would be at a 0.5 distance, and would have a similarity rate of 91.66%). That's why, when the similarity score of the recognition peak is not optimal (100%), on runs again the process, but comparing the string to be found to sub-strings with different lengths (difference from -2 to +2).

separator seeker	exact string matching	Smith Waterman algorithm	BAsCET
instance seeker	Building Hierarchical Structures in the Blackboard	similarity	suffix tree
char *	Alignment	approximate	similar
alliteration	Operator overloading	search	strict