You can see HIV-1's genetic code for yourself, courtesy of the National Institutes of Health.
One thing that has always impressed me about HIV is how SMALL it is. The DNA strand that codes for it takes up only 8379 base pairs, or 2793 base pair triplets, each of which codes for one amino acid block. Since a base pair triplet can code for any of 21 amino acid types (counting the 'end' instruction), each one is worth ln(21)/ln(2) ≈ 4.4 bits of information. This means that the code for HIV only takes up about 2793 * 4.4 / 8 ≈ 1537 bytes.
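For anyone who wants to check the arithmetic, here is a minimal Python sketch of the estimate. The function name `codon_estimate_bytes` is made up for illustration, and the sketch assumes every triplet is equally likely to be any of the 21 symbols. With bits per triplet rounded to 4.4 it gives the figure above; with the unrounded value of log2(21) it comes out a few bytes smaller.

```python
import math

BASE_PAIRS = 8379   # genome length quoted in the text
SYMBOLS = 21        # 20 amino acids plus the 'end' instruction


def codon_estimate_bytes(base_pairs, bits_per_codon):
    """Bytes needed if every triplet is stored in bits_per_codon bits."""
    codons = base_pairs // 3                  # 8379 // 3 == 2793 triplets
    return math.ceil(codons * bits_per_codon / 8)


exact_bits = math.log2(SYMBOLS)               # roughly 4.39 bits per triplet

print(codon_estimate_bytes(BASE_PAIRS, 4.4))         # 1537 (rounded value)
print(codon_estimate_bytes(BASE_PAIRS, exact_bits))  # 1534 (unrounded value)
```

Either way the estimate lands at about a kilobyte and a half.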
It's kind of unnerving to know that you can write Hello World programs in certain programming languages that take up more data than a virus that can hijack a human immune system.
CrazyIvan pointed out to me that any DNA sequence has 6 possible reading frames: each strand can be read from one of three possible offsets, and the two strands are read in opposite directions. This means that we cannot make the 6-bit to 21-amino acid compression, and must treat each base pair as an incompressible 2 bits of information. Now, our code size is 8379 base pairs * 1 byte/4 base pairs ≈ 2,095 bytes. (Although the implementation becomes 6 times more complex, most of this complexity is in the processor mechanism, and not in the encoding.)
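To make the six frames concrete, here is a rough Python sketch. It assumes the usual convention of treating the second strand as the reverse complement of the first. The helper names (`reverse_complement`, `reading_frames`, `packed_size_bytes`) are my own for illustration and don't come from any library.

```python
import math

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """The opposite strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def reading_frames(seq):
    """Return six lists of triplets: three offsets on each of the two strands."""
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            # Only complete triplets are kept; trailing bases are dropped.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


def packed_size_bytes(base_pairs):
    """Size at a flat 2 bits per base pair, i.e. 4 base pairs per byte."""
    return math.ceil(base_pairs * 2 / 8)


for frame in reading_frames("ATGCGT"):
    print(frame)
```

The same string of bases yields a different list of triplets in each frame, which is why the per-triplet saving no longer applies and the storage cost falls back to `packed_size_bytes(n)` for a genome of `n` base pairs.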