sequence clustering - Everything2.com

Sequence clustering refers to the grouping of strings of characters based on some criteria, usually similarity in their sequence. Sequence clustering is an important first step in several complex string-related computations, such as the construction of contigs in bioinformatics or the construction of a search table.

The Procedure of Sequence Clustering

Sequence clustering usually follows this step of procedures for a set of sequences.

1. Compare a given sequence to a clustered sequence that has not yet been compared to the given sequence.
2. If the given sequence is similar enough to the "clustered" sequence, add the given sequence to the same "cluster" of sequences.
3. If there are still sequences to compare, then go to the first step.
4. If you've reached this point, start a new "cluster" with this sequence.
5. If there are still sequences waiting to be clustered, choose one of these sequences and return to step 1.
6. If you reach this step, the clustering is finished.

The result should be several clusters of sequences, which are groupings of sequences that are highly similar to one another.

How Does One Determine Similarity?

There are many known algorithms for determining the similarity of two known sequences. Among these are the Needleman Wunsch algorithm, the Smith Waterman algorithm, and proper application of the suffix tree.

In bioinformatics, a field where sequence clustering plays a vital role as a preliminary step in the assembly of contigs (contigs are usually built from clusters), the sequence similarity tool blast is often used in building the clusters. Using blast meshes together some of the steps in the above algorithm, but it provides a quick and highly efficient tool for high-throughput sequence comparison.

Generally, one determines a necessary amount of similarity needed between two sequences by determining a particular value for matches, mismatches, insertion of blank space, and hanging ends, then counts the optimum score between two sequences. If this score meets a particular threshold, then the two sequences are considered to be similar.

A Clustering Example

Let's say that we have these six sequences and we want to cluster them:

Sequence 1: I dragged a comb across m
Sequence 2: I made the bus in s
Sequence 3: dragged a comb across my head
Sequence 4: the bus in seconds flat
Sequence 5: ged a comb
Sequence 6: cross my head

A match is constituted as 90% sequence similarity between two sequences, ignoring hanging ends. We start with sequence 1 which, since there are no clusters, immediately forms its own cluster. We then compare sequence 2 to the already clustered sequence 1:

               Sequence 2: I made the bus in s
                           |                 
Sequence 1 (in cluster 1): I dragged a comb across m

Clearly, sequence 2 and sequence 1 are different enough to not belong in the same cluster. So, sequence 2 becomes the start of a second cluster. Now we start with sequence 3:

               Sequence 3:   dragged a comb across my head
                             |||||||||||||||||||||||                 
Sequence 1 (in cluster 1): I dragged a comb across m

This meets our criteria of similarity, so we add sequence 3 to cluster 1. Our clusters now look like this:

Cluster 1 consists of:
Sequence 1: I dragged a comb across m
Sequence 3: dragged a comb across my head

Cluster 2 consists of:
Sequence 2: I made the bus in s

Sequence 4 is now compared to sequence 1:

Sequence 4:      the bus in seconds flat
                   |              |   
Sequence 1: I dragged a comb across m

Clearly not a match. On to the second sequence:

Sequence 4:        the bus in seconds flat
                   ||||||||||||
Sequence 2: I made the bus in s

A match! So we add sequence 4 to cluster 2 since sequence 2 is a member of cluster 2. Our clusters now look like this:

Cluster 1 consists of:
Sequence 1: I dragged a comb across m
Sequence 3: dragged a comb across my head

Cluster 2 consists of:
Sequence 2: I made the bus in s
Sequence 4: the bus in seconds flat

Sequences 5 and 6 both compare strongly to sequence 1, so both are added to the first cluster, giving a final clustering of:

Cluster 1 consists of:
Sequence 1: I dragged a comb across m
Sequence 3: dragged a comb across my head
Sequence 5: ged a comb
Sequence 6: cross my head

Cluster 2 consists of:
Sequence 2: I made the bus in s
Sequence 4: the bus in seconds flat

From these two clusters, we can now build contigs, which would express the overall sequence of all of the sequences in each cluster, or we could construct other data sets.

contig	Needleman Wunsch algorithm	(Don't Fear) The Reaper	blast
String	suffix tree	Smith Waterman algorithm	graph clustering