Nintendo used a strange data compression scheme in many of their games during the SNES era and beyond. Why Nintendo chose a hand-coded, fairly inefficient scheme over a more accepted method such as Huffman coding isn't known. (But Nintendo programmers were and still are infamous for coding things in odd, roundabout ways, as any experienced ROM hacker can attest to.)
(Note: lj and yerricde have informed me that Huffman decoding is slow. It is true that the 65816, the processor inside the SNES, had no cache and only three registers, only one of which could be used for any math. However, I still question why Nintendo felt it necessary to reinvent the wheel instead of using a known compression scheme.)
In various forms, this scheme, which is called "LZ" or "Lunar compression" in some circles, was used in quite a few games:
Format 1 was used in The Legend of Zelda: A Link to the Past and the Japanese version of Super Mario World.
Format 2 was also used in Zelda as well as the American version of Super Mario World and the much more advanced Super Mario World 2: Yoshi's Island.
Format 3 was used in both Pokémon Gold and Silver.
Format 5 was used in Super Metroid, Super Mario Kart, and also, incidentally, Sim City.
For such a common compression format, it's surprising that so little documentation is available. The only documentation that existed until now was some sparse information on Format 3. A library, Lunar Compress, does exist for decoding these and several other formats, but it sadly remains closed source. (Minor rant: Keeping one's discoveries secret only harms the hacker community until someone more generous decides to repeat your work.)
This documentation is incomplete in some places. I haven't gotten around to deciphering all the opcodes present in each format. Any information would be appreciated!
Here, then, is the compression scheme in all its gory detail. During this specification, I will use a $ to denote numbers in base-16 (hexadecimal) notation and a $$ to denote numbers in base-2 (binary) form.
The compressed data consists of a repeating sequence of instructions, which contain an opcode value followed by a single parameter, and data, which is written to the output in a manner specified by the instruction. To begin, the decompression program must read one byte. The first three bits of this byte determine how to extract the opcode and parameter:
If the first three bits are anything other than $$111, the instruction is in short form. Short form instructions take the form $$oooppppp, where ooo represents the opcode and ppppp represents the parameter - in other words, the opcode is the first three bits and the parameter is the last five.
However, if the first three bits are $$111, the instruction is in long form. Instructions in long form are two bytes long and have the form $$111ooopp pppppppp, where the parameter is a full ten bits long. The long form extends the range of the parameter from 32 values to 1024 values at the cost of an extra byte of space.
Now that the opcode and parameter have been determined, the decompression program can now begin to decode the data. The interpretation of the data, of course, varies depending on the opcode. The four incarnations of Nintendo compression diverge here.
Opcode $$000: [No compression.] Read bytes from the input and copy them to the output unchanged. The parameter plus one specifies the number of bytes to copy.
Opcode $$001: [Byte run length encoding.] Read one byte from the input. Copy it repeatedly to the output. The parameter plus one specifies the number of times to copy the byte.
Opcode $$010: [Word run length encoding.] Read two bytes from the input. Copy them repeatedly to the output. The parameter plus one specifies the total number of bytes to copy (not the number of times to copy the byte; if the parameter plus one comes to an odd number, copy the first byte again to round off).
Opcode $$011, Format 2: [Gradient.] Read one byte from the input. Write it to the output a given number of times, incrementing it by one each time. When the value reaches $ff, the byte "wraps around" to $00. The parameter plus one specifies the total number of bytes to write.
Opcode $$011, Format 3: [Zero padding.] Copy a zero byte repeatedly to the output. The parameter plus one specifies the number of bytes to copy.
Opcode $$100, Format 2: [Simple data repetition.] Starting at the two-byte offset which immediately follows this byte (in big-endian format), copy bytes from the decompressed data to the output. The parameter plus one specifies the number of bytes to copy.
Opcode $$100, Format 3: [Data repetition with reverse bit order.] Read one byte from the input. If its highest bit is clear, the two bytes following the opcode, plus one, specify the offset from which to copy. If its highest bit is set, the single byte following the opcode specifies the number of bytes backward, from the end of the data, from which to copy. Copy bytes from this offset into the already-decompressed data to the output, reversing the bit order of each byte but maintaining the same byte order. For example, $$11001100 10101010 would become $$00110011 01010101. The parameter plus one specifies the number of bytes to copy.
Opcode $$110, Format 3: [Data repetition with reverse byte order.] The offset into the decompressed data is encoded in the same method as in opcode $$100, format 3. This time, however, the byte order is reversed, so, for instance, $$11001100 10101010 would become $$10101010 11001100. Again, the parameter plus one specifies the number of bytes to copy.
Opcode $$111: [End of data.] In long form only, this opcode specifies the end of the data. The parameter is ignored. (In short form, this is an illegal opcode - it serves to specify long form, above.)
Once the end of the data has been reached, the decompressor should ensure that the size of the decompressed data is either a power of 2 or a multiple of 1024. If the data's size is any other value, the decompressed data is invalid. This restriction is very useful when scanning for compressed data in a ROM, as it greatly reduces the number of false matches.
This information should be enough to write a decompressor for format 3 (Pokémon) and give a head start on writing a decompressor for formats 1 and 2. (I have not studied format 1 closely, but Lunar Compress states that it's almost identical to format 2, so most of this information should transfer over.) Again, any information would be appreciated.
Sources: Necrosaro's Pokémon compression documentation: http://members.aol.com/sabindude/pokemon.html
FuSoYa's Lunar Compress: http://fusoya.cg-games.net/
and much analysis of game images.