An implementation of the Needleman Wunsch algorithm in perl

As one of the assignments of my Bioinformatics module at University we had to write one of these in whatever language we liked. Naturally I chose perl.

The assignment stated that the program had to ask for 2 files, one with the substitution matrix and the other with the dna sequences delimited by newlines, so that's what I did. For my ease of use, though, you can also put them on the command line in that same order.

The format of the dna file is quite obvious, but I've included a sample substitution matrix below for reference.

The Code

#!/usr/bin/perl
use warnings;         
use strict; 

my @pointers;
my ($subFname,$dnaFname);
($subFname,$dnaFname)=@ARGV if @ARGV==2;    
($subFname,$dnaFname)=askForFnames() unless $subFname && $dnaFname; 
my @dna=getDNA($dnaFname);      
my ($gp,%sub)=getSubMtx($subFname);     
my @matrix=createMatrix(@{$dna[0]},@{$dna[1]},$gp);   
calcMatrix();        
my @result=traceback();       
print $result[0]."\n".$result[1]."\n";     

sub askForFnames{
 print "Please enter name of file containing ";
 print "substitution matrix and gap penalty\nFilename: ";
 chomp (my $subFname=<STDIN>);     
 print "Please enter name of file containing ";
 print "DNA sequences\nFilename: ";
 chomp(my $dnaFname=<STDIN>);     
 return ($subFname,$dnaFname);     
}

sub getDNA{
 chomp(my $fname=shift());     
 -e $fname or die  
  "DNA Sequence  file \'$fname\' not found.\n$!";  
 open DNA, $fname or die "Error opening $fname.\n$!";  
 chomp(my @dna=<DNA>);      
         
 
 (grep(/^[ACGT]*$/i, @dna) == 2) and (@dna==2)   
   or die "DNA file \'$fname\' in wrong format.";  
     
 my @d1=split //,$dna[0];     
 my @d2=split //,$dna[1];
 return (\@d1,\@d2);      
}

sub getSubMtx{    
 my ($gp,@order,%sub);
 chomp(my $fname=shift());     
 -e $fname or die 
  "Substitution Matrix File \'$fname\' not found.\n$!"; 
 open MTX, $fname or die "Error opening $fname.\n$!";  
 
 while(<MTX>){       
  next if /^\s*$/;     
  next if /^\s*#/;     
  ($gp)=$_=~/^gp\s*=\s*(\d*)/i;    
  @order=split /\s*/ if (/^\s*([ACTG]\s*){4}$/);  
  shift @order if @order>4;    
  if (/^[ACTG]/){      
   die ("Substitution matrix bad. ".
    "Found matrix rows before heading or ".
    "not enough heading columns.") unless @order;
   my @row=(split /\s/);    
   my $y=shift @row;    
   for my $x (@order){    
    $sub{$y}{$x}=shift @row;  
   }
  }
 }
 close MTX;       
 
 return (abs $gp,%sub);      
}

sub createMatrix($$$){
 my ($maxX,$maxY,$gp)=@_;     
 my @matrix;
 for my $y (0..$maxY){      
  $matrix[$y]=[];      
  $matrix[$y][0]=-1*$y*$gp;
  push @pointers, [$y,0,$y-1,0];
 } 
 for my $x (0..$maxX){
  $matrix[0][$x]=-1*$x*$gp;
  push @pointers, [0,$x,0,$x-1];
 } 
 return @matrix;       
}

sub calcMatrix{
 for my $x (1..(@{$matrix[0]}-1)){
  for my $y (1..@matrix-1){
   $matrix[$y][$x]=calcCellAt($x,$y);
  }
 }
}

sub calcCellAt{
 my ($x, $y)=@_;       
 my @t=( $matrix[$y-1][$x-1]+$sub{$dna[1][$y-1]}{$dna[0][$x-1]}, 
  $matrix[$y][$x-1]+-1*$gp,           
  $matrix[$y-1][$x]+-1*$gp);      
 my @z=sort {$b <=> $a} @t;     
 push @pointers, [$x,$y,$x-1,$y]   if $t[1]==$z[0];  
 push @pointers, [$x,$y,$x,$y-1]   if $t[2]==$z[0];  
 push @pointers, [$x,$y,$x-1,$y-1] if $t[0]==$z[0];  
 return $z[0];       
}

sub traceback{
 my ($p,$q)=(scalar(@{$dna[0]}),scalar(@{$dna[1]}));             
 my ($out1,$out2);      
 until ($p==0 && $q==0){      
  for my $z (@pointers){     
   if ($z->[0]==$p && $z->[1]==$q){  
    if ($z->[0]==$z->[2]+1){  
     $out1.=$dna[0][--$p];  
    }else{     
       $out1.="-";   
    }     
    if ($z->[1]==$z->[3]+1){
     $out2.=$dna[1][--$q];
    }else{
       $out2.="-";
    }
   }
  }
 }
 $out1 = reverse $out1;      
 $out2 = reverse $out2;      
 return ($out1,$out2);      
}

The Substitution Matrix File

# This is the substitution matrix.
# The contents are whitespace delimited, and the program DOES
# take notice of the headings.
#
# Please note, blank lines and lines starting with a # are ignored
# but any other extra information will cause the program to refuse 
# this file

 A C G T     
A 1 -1 -1 -1
C -1 1 -1 -1
G -1 -1 1 -1
T -1 -1 -1 1

# This sets the gap penalty

gp=1

Node your Homework

Levenshtein distance	dynamic programming	Smith Waterman algorithm	The Fight Club / Calvin and Hobbes Connection
obscure variable names	optimal difference algorithm	Node your homework	Use meaningful variable names
sequence clustering	generic programming	algorithm	confounding variable
creating a spell checker	The last written words of Martin Luther	Computational biology	variable star
Bob Jones University	Bellman	global	Cheese
Reggae