Tell me, and I'll forget. Show me, and I may not remember. Involve me, and I'll understand.

Undoubtedly you will have heard something like 'Chimpanzee DNA is 96% similar to humans', and may be wondering where a figure like that comes from. First, a quick crash course in DNA:

DNA

DNA is a complex molecule which would take a book to explain in any depth, but to understand sequence comparisons, the important thing to know is that DNA is a long string of four different chemicals, which can be represented by the letters A, C, G and T. For a more in-depth discussion, see the writeups in the DNA node.

A string of three of these chemicals (a codon) codes either for one of 20 amino acids or for a stop signal; the usual start signal, ATG, also codes for the amino acid methionine. Most amino acids can be coded for by several different codons. For example, ATT codes for isoleucine, but so do ATC and ATA. The same string of amino acids can therefore be coded for by very different DNA sequences. These amino acids are strung together to form proteins. In turn, the 'same' protein (cytochrome b, say) can be made of a somewhat different string of amino acids in different species.
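
To make the codon idea concrete, here is a minimal Python sketch that builds the standard genetic code table and translates DNA three letters at a time. It assumes the standard code (some genomes, mitochondrial ones for instance, use slightly different tables), and the function names are my own, not taken from any library.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string codon by codon into one-letter amino acids."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def synonymous_codons(amino_acid):
    """List every codon that codes for the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


print(translate("ATTATCATA"))   # III: three different codons, all isoleucine
print(synonymous_codons("I"))   # ['ATA', 'ATC', 'ATT']
```

Two DNA strings can differ at several letters and still translate to exactly the same amino acids, which is why DNA-level and protein-level comparisons can give different percentages.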

Comparing

There are, then, different kinds of sequence that can be compared: DNA sequences, and the amino acid sequences of proteins. In straightforward terms, comparing sequences is easy: two sequences are lined up against each other to see how similar they are.

As an example, we can compare the first 10 (of about 380) amino acids of the protein 'cytochrome b' of several organisms:

Again, it's a little more complicated than all that, but I have a piece of good news for you: I'm not just going to produce a dry, factual writeup about the procedure, I'm going to talk you through doing it for yourself. Within 30 minutes you can have done some sequence comparisons of your own. Test evolutionary predictions from your very own desktop!

Your first genetic comparison or Hello, Chimpanzee!

The first thing you are going to need is the sequences you want to compare. As of the time of writing, a good resource for this is the National Center for Biotechnology Information, or NCBI (http://www.ncbi.nlm.nih.gov/). At the top of the page select 'Protein' from the dropdown menu and type in the name of your protein and the species you are looking for. Example: 'cytochrome b homo sapiens'. You will be presented with a list. You might need to hunt around for a good entry, as not all of the entries on the list are complete sequences. For Homo sapiens you are looking for something with about 378 amino acids in it (cytochrome c is another one to look for, and it is only 104 amino acids long). At the top of each entry it says something like:
LOCUS       AP_000651                378 aa            

where 378 aa means it is 378 amino acids long.

Once you have selected the list entry you are happy with, you can either scroll to the bottom to see the amino acid sequence or, much better for our purposes, use the 'Display' drop-down menu to display the protein in FASTA format. Here is the human amino acid sequence for your enjoyment:

MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDASTAFSSIAHIT
RDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSETWNIGIILLLATMATAFMGYVLPWGQMSF
WGATVITNLLSAIPYIGTDLVQWIWGGYSVDSPTLTRFFTFHFILPFIIAALATLHLLFLHETGSNNPLG
ITSHSDKITFHPYYTIKDALGLLLFLLSLMTLTLFSPDLLGDPDNYTLANPLNTPPHIKPEWYFLFAYTI
LRSVPNKLGGVLALLLSILILAMIPILHMSKQQSMMFRPLSQSLYWLLAADLLILTWIGGQPVSYPFTII
GQVASVLYFTTILILMPTISLIENKMLK
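
FASTA is just plain text: a header line beginning with '>' followed by the sequence, usually wrapped over several lines as above. If you would rather handle it in a script than by hand, a minimal Python sketch for reading it might look like this. The function name is my own invention, and the example holds only the first two lines of the human sequence under a '>Human' header.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # a header line starts a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence lines belong to the last header
    # Join the wrapped lines of each record into one unbroken string.
    return {n: "".join(parts) for n, parts in records.items()}


example = """>Human
MTPMRKINPLMKLINHSFIDLPTPSNISAWWNFGSLLGACLILQITTGLFLAMHYSPDASTAFSSIAHIT
RDVNYGWIIRYLHANGASMFFICLFLHIGRGLYYGSFLYSETWNIGIILLLATMATAFMGYVLPWGQMSF
"""

records = parse_fasta(example)
print(len(records["Human"]))  # number of amino acids read for this record
```

Counting the letters this way is a handy check that you copied a whole sequence: the full human entry above should come out at 378.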

Now that you have the amino acid sequence for humans, it's time to get the chimpanzee sequence. Follow the same procedure as above, only search for Pan troglodytes (the Latin species name for the chimpanzee). I'm not going to give this one to you; you'll have to find it for yourself. Don't worry if you can't find one that is 378 aa long, just use the longest one you can find.

OK, so now we have two long sequences of amino acids. We could compare them by hand, but I'd rather use a program to do it for me. There is no shortage of these things out there, from web-based ones to clever graphical programs you can download. I'm going to use ClustalW, which (for the time being) can be found here: (http://align.genome.jp/). For more information see also: Blast.
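
For the curious, here is a minimal sketch of the kind of thing an alignment program does under the hood: a textbook Needleman-Wunsch global alignment in Python. This is not ClustalW's actual method or scoring (real tools use substitution matrices and more careful gap penalties); the function name and the match, mismatch and gap values are illustrative choices of mine, and the second test sequence is made up by deleting one letter from the start of the human sequence.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return them padded with '-' for gaps."""
    n, m = len(a), len(b)

    def pair(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align the two letters
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Walk back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


top, bottom = needleman_wunsch("MTPMRKINPL", "MTPRKINPL")
print(top)     # MTPMRKINPL
print(bottom)  # MTP-RKINPL
```

The point to notice is the gap: the program slides the shorter sequence along so that as many letters as possible line up, which is exactly the tedious part of doing it by hand.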

Using the website above you will be presented with a form. In the textarea box you should first type '>Human' followed by a new line, and then the sequence. It should look something like this:

>Human
MTPMRKINPLMKLIN...LIENKMLK

On a separate line, do the same for your chimp sequence (starting with '>Chimp') and click 'Execute Multiple Alignment'. This is the kind of output you should expect to see:
Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: Human           378 aa
Sequence 2: Chimp           333 aa
Start of Pairwise alignments
Aligning...


Sequences (1:2) Aligned. Score: 95.4955

And there you have your result - 95.5% similarity between the amino acid sequences! Careful though, this can get addictive... you can compare multiple sequences (chimp/human/alligator/mouse) and even find programs that will draw trees to show relatedness.
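
If you want to see where a number like that comes from, here is a minimal Python sketch. It assumes, as I understand the ClustalW pairwise score, that the score is the percentage of identical letters among the aligned positions, with positions opposite a gap left out. The function name and the short test strings are mine. The arithmetic at the end is an inference from the printed score rather than something read off the alignment: 318 matching positions out of the 333 chimp residues would give exactly the 95.4955 reported.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical letters over aligned, non-gap positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # positions opposite a gap are not counted
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        raise ValueError("no positions to compare")
    return 100.0 * identical / compared


# One substitution in ten aligned letters.
print(percent_identity("MTPMRKINPL", "MTPMRKIDPL"))  # 90.0

# A gap is skipped, so the nine remaining letters all match.
print(percent_identity("MTPMRKINPL", "MTP-RKINPL"))  # 100.0

# 318 identical positions out of 333 compared.
print(round(100 * 318 / 333, 4))  # 95.4955
```

So the result above amounts to roughly 15 differing amino acids across the stretch of the protein that both sequences cover.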

Just comparing individual proteins is fraught with possible error - the sample size is very small - but it does give an interesting insight into how the much grander and cleverer studies are done. The subject of bioinformatics is vast, but I think this could serve as an interesting springboard for those who are interested in learning more.

The key to all this is to play! Experiment, set up your own conditions, compare 25 different species with one another, create trees, compare raw DNA sequences, download even more clever programs and plug more and more data in. As you go along you will be learning more and more about bioinformatics, and more importantly it may help you ask questions you had never thought to ask before. Below are both my sources and some interesting resources to help those who are interested.

Sources/interesting resources:
www.kijko.com/evolution/comparison/comparison.htm a quick and dirty home project that I did (with 'full' cytochrome b sequences for a variety of species)
www.nmsr.org/round1a.htm An interesting look at how and why this comparative technique is so powerful.
A molecular timescale for vertebrate evolution - By calibrating a genetic clock, Kumar & Hedges are able to show with good accuracy a time scale for evolutionary divergence. This paper was published in Nature, but can be found online at various places in PDF format
http://www.ebi.ac.uk/clustalw/clustalw_detail.html Discusses the ClustalW program
National Center for Biotechnology Information - This is where the protein sequences were found
http://evcforum.net/cgi-bin/dm.cgi?action=msg&f=5&t=588&m=1 This thread, on the evc forums, was the thread that started my interest in the topic. An interesting read which discusses the subject well including some common pitfalls.
Evolutionary genetics
Protein Databases