This subject probably fills most everythingians with the kind of vaguely irritable apathy that you just can't buy in stores, or else pure dread, or maybe just mild revulsion.

But that's okay, because if we're not here to be nauseated, why in the name of God are we here, eh?

Furthermore, this is of abiding interest to some of us: Many computer programs (that's "applications" to the young folks) store data in all manner of obscure ways, and programmers often want to get at it and make use of it. If it's not a text file, it can be really hard to figure out what's in there, because all you see is a lot of arbitrary zeroes and ones. It's meaningless if you don't know how it's organized: All ones and zeroes look alike. Each file format is different, because the needs of different programs differ: A zip file needs information about filenames, paths, original file sizes and dates, and a whole lot of arcane stuff related to compression. An image file such as a jpeg or a tiff doesn't need anything about filenames, paths and dates; a jpeg has compression, but it's a completely different kind of compression from zip compression. And so on. The two simply have nothing meaningful in common.
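Just to make that concrete, here's a rough sketch (in Python, going by PKWARE's published appnote for the zip format; the function name and the decision to stop after the first entry are mine, not anybody's official API) of what it takes to pull the first "local file header" out of a zip. Every field in it is bookkeeping a jpeg would have no earthly use for.

    import struct

    def read_first_zip_header(path):
        """Peek at the first local file header in a zip archive."""
        with open(path, "rb") as f:
            raw = f.read(30)                     # the fixed-size part of the header
            (sig, version, flags, method, mtime, mdate,
             crc32, csize, usize, namelen, extralen) = struct.unpack("<IHHHHHIIIHH", raw)
            if sig != 0x04034b50:                # the "PK\x03\x04" magic number
                raise ValueError("doesn't start with a zip local file header")
            name = f.read(namelen).decode("cp437", "replace")
        return {"name": name, "compression_method": method,
                "compressed_size": csize, "uncompressed_size": usize,
                "dos_time": mtime, "dos_date": mdate}

Run it over any zip you've got lying around and the filename, the sizes and the DOS-style date pop right out; a jpeg decoder wouldn't know what to do with any of it.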

So, we need documentation. A smart, experienced programmer with a hell of a lot of time to kill can work it out ab initio with nothing but the program that generates the file (for making simple test cases with known properties): I once wasted a fun weekend reverse-engineering the easy bits of the arj format, just for the hell of it. It was slow and painful, and I never even got close to the compression itself. Of course, I'm not real smart. A really smart programmer (or one with a truly unthinkable amount of time to kill) could probably dispense with the test cases.
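If you want to try the same trick yourself, the whole "simple test cases with known properties" business boils down to something like this: archive two files that differ in exactly one known way, then see which bytes in the output moved. A toy sketch in Python (the archive filenames below are placeholders, obviously; point it at whatever your archiver spits out):

    def byte_diff(path_a, path_b):
        """Return the offsets at which two files differ (assumes equal length,
        which a one-character filename swap should give you)."""
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            a, b = fa.read(), fb.read()
        return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]

    # e.g. archive AAAA.txt and then BBBB.txt (same size, same contents,
    # different name) and compare the results:
    #
    #     byte_diff("test_a.arj", "test_b.arj")
    #
    # The offsets that change are good candidates for "this is where the
    # filename lives"; the ones that don't are signatures, sizes, CRCs and
    # other things you get to puzzle out next.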

On the whole, reverse-engineering is a pain in the ass and we only do it when a) we absolutely must, or b) we're just doing it for fun. We prefer documentation. Naturally, the documentation's never complete anyway and there's always a little reverse-engineering to do.

So here it is. Just recently, somebody noded uuencode, but without hard detail on the algorithm. I thought, "well gee, that's yummy stuff." It's a clever and very simple algorithm, so I thought I'd do a quick writeup trying (however spastically) to explain it. Then I couldn't remember how the damn thing went (there's a rough sketch of it down near the bottom now, for the record), so I went to Google to have a look around for it, and look what turned up:

http://www.codemanual.net/main/file_formats/file_formats[1].html

Jumping Jesus on a polo mallet! They've got more file format documentation than you can shake a stick at. It's not quite the Holy Goddamn Grail, but they've got (or claim to have) all manner of goodies: .wav, .arj, pdf, .rtf, .gif, elf, half-a-dozen obsolete dos/windows exe formats (and something only a year old on the current "Portable Executable Format", though MSDN's got that pretty well covered for a change), MIDI, and so on. Four pages of links to little zip files (yep, they've got zip too). Is it all crap? Good question. I only checked one, "url" files (a goofy thing probably unique to windows, for storing a URL in -- you guessed it! -- a "link" file that you can click on), and it was accurate. Of course, a retarded child could reverse-engineer that one since it's just text anyhow. How d'you think I knew it was accurate, eh? I'm not going to sit here at midnight before I've had my dinner and compare their "Portable Executable" format docs in detail with what I've got already, sorry. That's a buttload of struct members to squint at. So caveat emptor, but it looks like fun.
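For the record, just about the whole of a typical .url file looks something like this (there can be a few more key=value lines for icons and whatnot, but this is the guts of it):

    [InternetShortcut]
    URL=http://www.wotsit.org/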

Is it a bit windows-centric? Hm, well... yeah, but not exclusively so.

And here's another, even better one, thanks to Certified Geek, who clued me in:

http://www.wotsit.org/
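Oh, and that uuencode trick I couldn't remember: as best I can reconstruct it, you take the input three bytes at a time, slice each 24-bit group into four 6-bit values, and add 32 to each so everything lands in printable ASCII. Each output line starts with a character encoding how many real bytes it carries, 45 at most. A minimal Python sketch (it skips the begin/end framing and the old space-versus-backquote quirk, so treat it as the gist rather than a reference encoder):

    def uu_line(chunk):
        """Encode up to 45 bytes as one uuencoded line."""
        out = [chr(32 + len(chunk))]                 # length character
        padded = chunk + b"\0" * (-len(chunk) % 3)   # pad to a multiple of 3
        for i in range(0, len(padded), 3):
            word = (padded[i] << 16) | (padded[i + 1] << 8) | padded[i + 2]
            for shift in (18, 12, 6, 0):
                out.append(chr(32 + ((word >> shift) & 0x3F)))   # 6 bits + 32
        return "".join(out)

    def uuencode(data):
        return "\n".join(uu_line(data[i:i + 45]) for i in range(0, len(data), 45))

    print(uuencode(b"Cat"))   # prints #0V%T -- '#' says "three bytes follow"

Decoding is the same thing run backwards: subtract 32, reassemble the 6-bit groups into bytes, and keep only as many bytes as the length character promises.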



8/07/2001: Thanks also to Certified Geek for a correction on codemanual's URL, which had changed.