display | more...
(bioinformatics, computational biology:)
Algorithm for calculating global sequence similarity and sequence alignment. The 2 sequences are considered words over some small alphabet (typically 4 letters for DNA sequences, 20 for protein sequences). The algorithm is allowed to insert gaps into each sequence, in order to make as many identical letters line up. Each configuration is scored in the following way:
• An exact letter match gets MATCH.
• A mismatch (2 different letters) gets MISMATCH.
• Every gap gets GAPEXT
• Additionally, GAPOPEN is paid for every gap opened, i.e. for every sequence of consecutive gaps.

Thus, if MATCH=11, MISMATCH=-4, GAPOPEN=-3, and GAPEXT=-2, then this configuration:

A-CGTTTACGG-T
AAGG--TACG--C
would score 6*GAPEXT+4*GAPOPEN+2*MISMATCH+6*MATCH = 34.

The input of the algorithm is a pair of sequences; the output is the alignment with a highest score (an optimal alignment); both entire sequences are required to participate in it. In 1979, Needleman and Wunsch published a paper with their dynamic programming algorithm for calculating optimal alignments. Of course, Bellman had already solved similar problems (e.g. edit distance for words) in the 50s and early 60s. But those were considered in different fields.

An implementation of the Needleman Wunsch algorithm in perl

As one of the assignments of my Bioinformatics module at University we had to write one of these in whatever language we liked. Naturally I chose perl.

The assignment stated that the program had to ask for 2 files, one with the substitution matrix and the other with the dna sequences delimited by newlines, so that's what I did. For my ease of use, though, you can also put them on the command line in that same order.

The format of the dna file is quite obvious, but I've included a sample substitution matrix below for reference.

The Code

#!/usr/bin/perl
use warnings;
use strict;

my @pointers;
my (\$subFname,\$dnaFname);
(\$subFname,\$dnaFname)=@ARGV if @ARGV==2;
(\$subFname,\$dnaFname)=askForFnames() unless \$subFname && \$dnaFname;
my @dna=getDNA(\$dnaFname);
my (\$gp,%sub)=getSubMtx(\$subFname);
my @matrix=createMatrix(@{\$dna},@{\$dna},\$gp);
calcMatrix();
my @result=traceback();
print \$result."\n".\$result."\n";

print "Please enter name of file containing ";
print "substitution matrix and gap penalty\nFilename: ";
chomp (my \$subFname=<STDIN>);
print "Please enter name of file containing ";
print "DNA sequences\nFilename: ";
chomp(my \$dnaFname=<STDIN>);
return (\$subFname,\$dnaFname);
}

sub getDNA{
chomp(my \$fname=shift());
-e \$fname or die
open DNA, \$fname or die "Error opening \$fname.\n\$!";
chomp(my @dna=<DNA>);

(grep(/^[ACGT]*\$/i, @dna) == 2) and (@dna==2)
or die "DNA file \'\$fname\' in wrong format.";

my @d1=split //,\$dna;
my @d2=split //,\$dna;
return (\@d1,\@d2);
}

sub getSubMtx{
my (\$gp,@order,%sub);
chomp(my \$fname=shift());
-e \$fname or die
open MTX, \$fname or die "Error opening \$fname.\n\$!";

while(<MTX>){
next if /^\s*\$/;
next if /^\s*#/;
(\$gp)=\$_=~/^gp\s*=\s*(\d*)/i;
@order=split /\s*/ if (/^\s*([ACTG]\s*){4}\$/);
shift @order if @order>4;
if (/^[ACTG]/){
die ("Substitution matrix bad. ".
"Found matrix rows before heading or ".
"not enough heading columns.") unless @order;
my @row=(split /\s/);
my \$y=shift @row;
for my \$x (@order){
\$sub{\$y}{\$x}=shift @row;
}
}
}
close MTX;

return (abs \$gp,%sub);
}

sub createMatrix(\$\$\$){
my (\$maxX,\$maxY,\$gp)=@_;
my @matrix;
for my \$y (0..\$maxY){
\$matrix[\$y]=[];
\$matrix[\$y]=-1*\$y*\$gp;
push @pointers, [\$y,0,\$y-1,0];
}
for my \$x (0..\$maxX){
\$matrix[\$x]=-1*\$x*\$gp;
push @pointers, [0,\$x,0,\$x-1];
}
return @matrix;
}

sub calcMatrix{
for my \$x (1..(@{\$matrix}-1)){
for my \$y (1..@matrix-1){
\$matrix[\$y][\$x]=calcCellAt(\$x,\$y);
}
}
}

sub calcCellAt{
my (\$x, \$y)=@_;
my @t=( \$matrix[\$y-1][\$x-1]+\$sub{\$dna[\$y-1]}{\$dna[\$x-1]},
\$matrix[\$y][\$x-1]+-1*\$gp,
\$matrix[\$y-1][\$x]+-1*\$gp);
my @z=sort {\$b <=> \$a} @t;
push @pointers, [\$x,\$y,\$x-1,\$y]   if \$t==\$z;
push @pointers, [\$x,\$y,\$x,\$y-1]   if \$t==\$z;
push @pointers, [\$x,\$y,\$x-1,\$y-1] if \$t==\$z;
return \$z;
}

sub traceback{
my (\$p,\$q)=(scalar(@{\$dna}),scalar(@{\$dna}));
my (\$out1,\$out2);
until (\$p==0 && \$q==0){
for my \$z (@pointers){
if (\$z->==\$p && \$z->==\$q){
if (\$z->==\$z->+1){
\$out1.=\$dna[--\$p];
}else{
\$out1.="-";
}
if (\$z->==\$z->+1){
\$out2.=\$dna[--\$q];
}else{
\$out2.="-";
}
}
}
}
\$out1 = reverse \$out1;
\$out2 = reverse \$out2;
return (\$out1,\$out2);
}

The Substitution Matrix File

# This is the substitution matrix.
# The contents are whitespace delimited, and the program DOES
# take notice of the headings.
#
# Please note, blank lines and lines starting with a # are ignored
# but any other extra information will cause the program to refuse
# this file

A C G T
A 1 -1 -1 -1
C -1 1 -1 -1
G -1 -1 1 -1
T -1 -1 -1 1

# This sets the gap penalty

gp=1