DNA codes for protein using a three-letter genetic code. So, even in a coding region of DNA, looking in the right direction, there are 2 wrong ways to translate the DNA to amino acids and only 1 right way. The correct modulo 3 offset to read is known as the coding frame.
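
To make the three offsets concrete, here is a minimal sketch that splits a sequence into codons at each offset. The function name `codons_in_frame` and the sample sequence are made up for illustration, and only the forward strand is considered.

```python
def codons_in_frame(dna, offset):
    """Split dna into codons starting at offset (0, 1 or 2).

    Any trailing partial codon (1 or 2 leftover bases) is dropped.
    """
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


seq = "ATGGCCTAAG"  # hypothetical toy sequence
for offset in range(3):
    print(offset, codons_in_frame(seq, offset))
```

The same ten bases give three unrelated codon lists; only one of them (at most) is the one the cell actually reads.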

Finding the correct frame is fairly easy, given a clean sequence. There are 3 different stop codons (which terminate translation into protein, and therefore cannot appear in the middle of a coding region in the correct frame). 3 "bad" codons out of 4^3 = 64 codons total is a bit under 1 stop codon every 21 codons, assuming random DNA. Surprisingly, roughly the same statistic holds for coding regions in the wrong frame. So an open reading frame which goes on for over 120bp (40 codons), say, is a reasonable candidate for the coding frame: a random stretch that long stays free of stops only about 15% of the time. At 300bp (100 codons) that chance drops below 1%, and the call becomes almost certain.
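
The stop-codon test can be sketched as follows. This is a rough illustration rather than a real gene finder: the names `longest_open_run` and `p_no_stop` are invented here, the standard stop codons TAA, TAG and TGA are assumed, and the probability assumes every base is independent and equally likely.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def longest_open_run(dna, offset):
    """Length in codons of the longest stop-free run in the given frame."""
    best = current = 0
    for i in range(offset, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            current = 0  # a stop codon ends the current open run
        else:
            current += 1
            best = max(best, current)
    return best


def p_no_stop(n_codons):
    """Chance that n_codons random codons contain no stop (61 of 64 are fine)."""
    return (61 / 64) ** n_codons


seq = "ATGAAATAGCCC"  # hypothetical toy sequence
for offset in range(3):
    print(offset, longest_open_run(seq, offset))
print(p_no_stop(40), p_no_stop(100))
```

Under this random model `p_no_stop(40)` comes out near 0.15 and `p_no_stop(100)` below 0.01, so the evidence for a frame strengthens quickly as the open run gets longer.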

For eukaryotes the problem is a bit harder, since coding regions appear only in exons, which are separated by noncoding introns. And even for exons, some are (much) smaller than 120bp -- good luck finding them without a stronger statistic. But the above technique is still often helpful.

For regions with no genes, it's not so good. A long tandem repeat which contains no A's (or no T's) cannot contain a stop codon (all 3 of TAA, TAG and TGA require both a T and an A), and so will appear to code in all 3 frames. Of course, that in itself (3 open frames, rather than just one) is grounds for suspicion...
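
A small sketch of that warning sign, using a made-up repeat with no T's. The helper `open_frames` is hypothetical and simply reports which of the 3 forward-strand frames are free of the standard stop codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def open_frames(dna):
    """Return the offsets (0, 1, 2) whose frame contains no stop codon."""
    return [
        offset
        for offset in range(3)
        if not any(
            dna[i:i + 3] in STOP_CODONS
            for i in range(offset, len(dna) - 2, 3)
        )
    ]


repeat = "CAG" * 50            # 150bp tandem repeat with no T at all
print(open_frames(repeat))     # every frame is open
print(open_frames("ATGAAATAGCCC"))  # toy sequence with stops in two frames
```

A sequence that comes back open in all three frames is more likely a low-complexity repeat than a gene, so it deserves a second look before being called coding.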
