Shortest Common Superstring

What's the problem?

You want to know the value of a string. You had the string, in fact you had several copies, but someone's smashed them all into bits. What you have now is fragments of the original string with overlaps, but no clue to where they overlap, or to the order they were in. What you want to do, of course, is to recreate the original string and find whatever interesting messages are lurking within.

Is this the sort of problem I'm likely to have?

That depends. Are you interested in assembling the human genome? If so, you're in luck. The technique used to read huge amounts of genes (such as the 3,000,000,000 in the human genome) is known as shotgun sequencing. Basically, the DNA is smashed to bits and then read; leaving exactly the sort of problem outlined above.

So how do I do it?

What you're trying to do is find the shortest common superstring of all these substrings. Unfortunately, this problem is NP-Hard, which is basically Computer Science speak for 'bloody time-consuming'. It is difficult to get the correct solution, but sometimes a very good solution is good enough, and there's a quicker way to a near-as-dammit answer.

These things are always easiest to explain with an example, so let's get one of those, shall we?

Example Problem

We have the following fragment of a string floating about and want to recreate the original:

maryha		ryhad		teass
aswhit		hite		tlela
lambit		ssnow		yhadali
little		lelam		tsfle
ecewas		leecew

Now, I'm sure the slightly perceptive of you can probably guess what the original string was, but imagine thousands of strings, each thousands of characters long, containing only the characters a, c, g and t. Not so smug now, are you?

Step One: Build a graph

Now, ASCII art doesn't lend itself well to this sort of thing, so you're going to have to use your imaginations (or a pen and paper). Ready? Good. For each fragment draw an arrow from it to any other fragment it overlaps with marked a numberm, which is the length of the overlap. For example, for the fragment 'maryha' the overlaps look like this^*:

maryha
     aswhit
     ^------ 1

maryha
  ryhad
  ^^^^------ 4

maryha
   yhadali
   ^^^------ 3

So you draw lines from 'maryha' to each of those other fragments, marked with the correct overlap size. Now draw links to the fragments that overlap with the three other fragments and then link other fragments linked to other fragments and so on, until the page is a mess of fragments, lines, arrowheads and numbers.

Step 2: Sort edges by length

Look round the graph and pick out all the edges (the links) and put them in a list sorted by the length (the value).

leecew	-> ecewas  (4) 
maryha	-> ryhad   (4) 
ryhad	-> yhadali (4)
aswhit	-> hite	   (3) 
maryha	-> yhadali (3)
lelam	-> lambit  (3)
hite	-> teass   (2)
teass	-> ssnow   (2)
tlela	-> lambit  (2)
ecewas	-> aswhit  (2)
yhadali	-> little  (2)
tsfle	-> leecew  (2)
little	-> leecew  (2)
little	-> lelam   (2)
maryha	-> aswhit  (1)
aswhit	-> tlela   (1)
lambit  -> teass   (1)
lambit	-> tsfle   (1)
tsfle	-> ecewas  (1)

Step 3: Iteratively choose largest edge

For each iteration, take the largest (highest weighted) edge with the following properties:

It does not start or finish the same as the finish of a previously chosen edge (or it would form a cycle)
It does not start the same as the start of a previously chosen edge (or we'd build a tree)

Step 3.1

We choose leecew -> ecewas, since it's the highest edge.

Step 3.2

We can remove some of the edges from the list according to the rules. The list now looks like this:

leecew	-> ecewas  (4)
maryha	-> ryhad   (4) 
ryhad	-> yhadali (4)
aswhit	-> hite	   (3) 
maryha	-> yhadali (3)
lelam	-> lambit  (3)
hite	-> teass   (2)
teass	-> ssnow   (2)
tlela	-> lambit  (2)
ecewas	-> aswhit  (2)
yhadali	-> little  (2)
tsfle	-> leecew  (2)
little	-> leecew  (2)
little	-> lelam   (2)
maryha	-> aswhit  (1)
aswhit	-> tlela   (1)
lambit  -> teass   (1)
lambit	-> tsfle   (1)
tsfle	-> ecewas  (1)

So we now choose the highest weighted edge left, which is maryha -> ryhad.

Step 3.3

Let's take some more edges out of our list:

leecew	-> ecewas  (4)
maryha	-> ryhad   (4)
ryhad	-> yhadali (4)
aswhit	-> hite	   (3) 
maryha	-> yhadali (3)
lelam	-> lambit  (3)
hite	-> teass   (2)
teass	-> ssnow   (2)
tlela	-> lambit  (2)
ecewas	-> aswhit  (2)
yhadali	-> little  (2)
tsfle	-> leecew  (2)
little	-> leecew  (2)
little	-> lelam   (2)
maryha	-> aswhit  (1)
aswhit	-> tlela   (1)
lambit  -> teass   (1)
lambit	-> tsfle   (1)
tsfle	-> ecewas  (1)

So we chose: ryhad -> yhadali

Steps 3.4 - 3.9

The following steps all follow the same pattern as the previous ones. I think it would make a pretty boring node if I showed all the steps so we'll skip them and go on to the last step - I promise you can trust me that this is indeed how it works out.

Step 4: Putting it all together

So let's have a look at all the edges we've chosen:

leecew -> ecewas
maryha -> ryhad	
ryhad  -> yhadali
aswhit -> hite	
lelam  -> lambit
hite   -> teass	
teass  -> ssnow	
yhadali-> little
lambit -> tsfle

Then we can put these together in an order using full overlaps, as follows:

maryha > ryhad ryhad > yhadali yhadali > little  
lelam > lambit lambit > tsfle
leecew > ecewas
aswhit > hite hite > teass teass > ssnow

Or, to show it better:

maryhadalittle
lelambitsfle
leecewas
aswhiteassnow

We still have 4 fragments which still overlap, if you can't see what the message was by now you really need help. In other words, we still have a problem, but the problem is much easier than the mess of random fragments we had earlier, and we have them in order.

In real life, these inaccuracies can be ironed out further by using this method again and again with different random orderings of fragments and other statistical tools can be used.

Thanks goes to m_turner for help clearing up the overlap graph bit

shotgun sequencing	NP-Hard	Splenda	When life gives you lemons, suck on them. Seriously, lemons taste awesome.
DNA sequencing	Bioinformatics	Human Genome Project	BrevityQuest12
Apple of Sodom	gene chip	intron	greedy algorithm
iteration	superstring	computer science