Ok, Microsoft. Really, what the storming, steaming poop-soaked hell? MS Exchange 2003, their flagship mail server, has a bug. Now, this shouldn't surprise anyone. Of course, the nature of the bug is a bit more surprising.
Last night, our Exchange server fell over and died, apropos of nothing. We restarted the services; they choked again. So, we rebooted. For about 5 minutes it worked, and then someone sent an email and it went into a mass vomit frenzy again and failed. All the while, it never emitted a single useful error. (Even something as general as "Ack, snark detected while processing outbound mail queue" would have been useful!) Repeat this useless cycle twice.
Finally, utterly fed up, we turned to El Goog. First suggestion - information store corrupted, reinstall, restore mailboxes from backup. The admin chanted the spell of YUCK! But nothing happened. Second option - corrupt message in queue. Well, that's odd enough - corrupt how, I wonder - but at least it's painless. NOT! So, we poked around the outbound mail queue. Sure enough, there's a message that arrived right about the time the rope-pee frenzy began. Nuke that, try again. Still pukes.
Argh. So, back to the Great IP Oracle we go. Same guy who suggested the corrupt message says, "Oh, sometimes nuking the message doesn't do it. You have to nuke the queue directory and create a new one." Oh.
Wait, what? How does this make sense? There are no other regular or hidden files in the directory, and the corrupt message is deleted. Why should this be necessary? Nevertheless, we try it, just to wave a dead chicken over the possibility before breaking out the backup tapes, which would entail a day of lost mail and at least a day of downtime. Lo and behold, it worked!
So my question is, WTH? Is Exchange referring to messages in queue by fucking sector addresses or something? Rebooting the server would clear any open files, deleting the file means it's no longer accessible by name. So what the fuck is it doing here? Why would recreating the queue dirs by hand do a blesséd thing?
It's quite bad enough that it gets wedged and dies on a corrupt email message (which is TEXT, I might add. Even any attachments are uuencoded or BASE64-encoded first!) - but it's indefensible that it keeps getting wedged on the same thing after it's been deleted and the server's been rebooted. How the fuck is it still accessing the corrupt data? And if there's a copy of it sitting in the Exchange DB blob, then why does rebuilding the queue dirs get rid of it? Only thing I can think of is that the new creation time on the dirs prompts Exchange to discard cached copies, but if that's the case, why isn't there a simple option to force that behavior? Also, is NTFS really so slow that sucking messages from an outgoing queue, where they sit as text files, into the database actually makes sense and is a net win? I can't quite imagine that. And if it's something about FS access being that slow in general, I'd expect things like sendmail and Postfix to do much the same. (They don't, and AFAIK can't be made to.) Is this just Microsoft being gratuitously bogus?
You know, on second thought, that doesn't even cover all the WTFs here. First off, if using the filesystem is so slow, why do it? Why is there even a queue dir if it stores it all in the DB anyway? (That's assuming it does, because the other option I can think of - addressing the queue files by sector not name - is something even more terrifying, from the Deep Dark Days of DOS.) Second, how does it get corrupted? If the client sends you something that doesn't look valid, drop it on the floor and make the client resubmit. If it's valid coming in, then how is integrity lost writing it to disk? Is ZFS really the only filesystem that checks this kind of thing? Third, even if it is corrupt on disk, why doesn't the queue runner process just spit out "Ack, could not make sense of queued message $FOO, skipping", and move on to the next one? If the data can tie the queue runner in knots, that's a really bad design! On top of this, why can't the Exchange System Attendant at least emit a useful error before facefaulting? How hard is it to detect that the queue runner's jumped off into east hyperspace and is no longer responding in any kind of useful way? I mean, glorking the whole mail server is clearly bad and wrong, but why not at least say "Outbound queue runner died, shutting down Exchange", before eating yourself?
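For the record, here's roughly what I mean by "skip it and move on". This is a minimal sketch in Python, not anything Exchange actually does: the names `run_queue`, `parse_message`, and `bad_dir` are made up for illustration, the "is it valid?" check (has a From and a To header) is a stand-in for whatever a real MTA would check, and `deliver` is whatever callable you hand it. The point is only that one unreadable file gets logged and shoved aside instead of taking the whole runner down with it.

```python
import logging
import shutil
from email import message_from_bytes
from pathlib import Path

log = logging.getLogger("queue_runner")


def parse_message(raw):
    """Hypothetical validity check: insist on at least From and To headers."""
    msg = message_from_bytes(raw)
    if not msg["From"] or not msg["To"]:
        raise ValueError("missing From or To header")
    return msg


def run_queue(queue_dir, bad_dir, deliver):
    """Make one pass over the queue; return (delivered, quarantined) counts.

    queue_dir and bad_dir are pathlib.Path objects; deliver is any callable
    that takes a parsed message. All three are assumptions of this sketch.
    """
    bad_dir.mkdir(parents=True, exist_ok=True)
    delivered = quarantined = 0
    for path in sorted(queue_dir.iterdir()):
        if not path.is_file():
            continue
        try:
            deliver(parse_message(path.read_bytes()))
        except Exception as exc:
            # Say which message was the problem, park it, and keep going.
            log.error("could not make sense of queued message %s (%s), skipping",
                      path.name, exc)
            shutil.move(str(path), str(bad_dir / path.name))
            quarantined += 1
        else:
            # Delivered fine, so it can leave the queue.
            path.unlink()
            delivered += 1
    return delivered, quarantined
```

Moving the bad file to a side directory rather than deleting it means someone can still look at it afterwards and work out how it got mangled, which is more than we got.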