A Brief Review of E2's Server Errors
If you have used Everything2 for any length of time you have probably discovered that fanciful bit of nonsense known as the server error. Cryptic, technical, and unwelcome messages requiring (often indifferent) higher powers to decipher and address. They're sort of like tax law in this regard.
Basically, with E2, there are two flavors of server errors: the 500 and the 503.
The 503 means Apache is down. This is the inevitable result of a cascading failure that looks something like this:
- Our database design has historically been pretty minimal, and we have a lot of SQL queries that could (and should) be optimized but are not. To put it in perspective, most queries we run clock in at less than a tenth of a second (quite a few at less than a hundredth). But the queries at Voting Oracle, for example, take about 7 seconds to run, which is a lifetime in terms of database queries. And since we use MyISAM tables instead of InnoDB (the InnoDB engine not being available in MySQL when E2 was first written back in the early 17th century), this invokes a table lock on the vote table. So anybody else trying to vote, or read Voting Oracle, or do anything that involves voting, has to wait their turn. This causes lag. However, once that Voting Oracle query clears, everything catches up pretty readily.
- However. Sometimes a query is run that takes 30 seconds or more. Sometimes a minute. Sometimes more. And depending on the tables it is locking up, this causes MySQL to basically come to a standstill.
- Meanwhile, people who have a frozen page naively hit refresh or try to load up two other pages while they're waiting. The child processes in Apache start loading up, and then ...
- 503! Until Apache restarts itself, anyway.
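The queueing effect in the steps above can be sketched with a toy model. This is a simplification and not E2's code: it treats the table lock as a single exclusive, first-come-first-served lock (MyISAM's real read/write locking is more nuanced), and every name and timing other than the 7-second Voting Oracle figure is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Query:
    name: str        # hypothetical label, for illustration only
    arrival: float   # seconds since the start of the simulation
    duration: float  # seconds the query holds the table lock


def simulate_table_lock(queries):
    """Serve queries one at a time in arrival order.

    Returns a dict mapping each query name to the seconds it spent
    waiting for the lock before it could start.
    """
    waits = {}
    lock_free_at = 0.0
    for q in sorted(queries, key=lambda q: q.arrival):
        start = max(q.arrival, lock_free_at)  # wait until the lock is released
        waits[q.name] = start - q.arrival
        lock_free_at = start + q.duration
    return waits


queries = [
    Query("voting_oracle", arrival=0.0, duration=7.0),  # the slow query
    Query("vote_a", arrival=1.0, duration=0.01),        # normally near-instant
    Query("vote_b", arrival=2.0, duration=0.01),
]
waits = simulate_table_lock(queries)
for name, waited in waits.items():
    print(f"{name}: waited {waited:.2f}s")
```

In this model the two quick votes each wait several seconds behind the slow query, even though they would normally finish in a hundredth of a second. Stretch the slow query to 30 seconds or more and add impatient refreshes as extra arrivals, and the backlog grows accordingly.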
503s have been caused by other things, too: memcached filled up a hard drive with its log once, our Apache .conf mysteriously vomited at certain URL calls for a while, and we've been DoSed on at least one occasion by a (I think benevolent) webcrawler.
A 500 error, on the other hand, is different. The server is just fine, but the request you're attempting is no good. There are a handful of common kinds of 500 errors, which we will now explain.
- No Default Pages Loaded. Our most popular 500 error, this is the error message you get when you choose a displaytype on a node that does not have an htmlpage associated with that particular displaytype and node. For example, if you go to a nodelet node (say Epicenter, node_id 262) and add "&displaytype=printable" to the URL, you'll get a 500 error. This mostly happens to webcrawlers.
Solution: Set the displaytype to "display" if an htmlpage cannot be found. Implemented February 19, 2008.
Solution: Get better try-and-catch built into the nodegroup process.
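The displaytype fallback from the first solution above might look roughly like this. It is only a sketch of the idea: the lookup table, the page names, and the function `resolve_htmlpage` are all hypothetical stand-ins, not how the E2 engine actually stores or resolves htmlpages.

```python
# Hypothetical registry mapping (nodetype, displaytype) to an htmlpage name.
HTMLPAGES = {
    ("nodelet", "display"): "nodelet display page",
    ("writeup", "display"): "writeup display page",
    ("writeup", "printable"): "writeup printable page",
}


def resolve_htmlpage(nodetype, displaytype, pages=HTMLPAGES):
    """Find the htmlpage for a request, falling back to "display".

    Raises LookupError only when the node type has no page at all,
    which is the case that should still be reported as an error.
    """
    page = pages.get((nodetype, displaytype))
    if page is None:
        # No page for the requested displaytype: fall back to the default.
        page = pages.get((nodetype, "display"))
    if page is None:
        raise LookupError(f"no htmlpage for nodetype {nodetype!r}")
    return page


# A nodelet has no "printable" page, so the request degrades to "display".
print(resolve_htmlpage("nodelet", "printable"))
print(resolve_htmlpage("writeup", "printable"))
```

The point of the design is that a crawler appending an unsupported displaytype gets the ordinary page rather than a 500.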
- Malformed XML. Generally our XML parsers and renderers are pretty good. The most common offender is the catbox: people's Unicode and obscure characters (and our attempts to modify what people use from time to time) occasionally produce malformed XML, and thus 500 errors on the sites that read that XML.
Solution: Work harder to conform to XML spec. But really, not that big a deal.
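As an illustration of what conforming to the spec involves, here is a minimal sketch that strips characters XML 1.0 does not allow and escapes the rest before building a message element. The element and attribute names are assumptions made for the example, not the real catbox ticker format, and the character filter only covers control characters plus U+FFFE and U+FFFF.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Control characters (other than tab, newline, carriage return) and
# U+FFFE/U+FFFF may not appear in an XML 1.0 document at all.
_ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def _clean(text):
    """Drop characters that can never be represented in XML 1.0."""
    return _ILLEGAL_XML_CHARS.sub("", text)


def catbox_message_xml(author, message):
    """Build one well-formed <message> element (hypothetical format)."""
    return "<message author={}>{}</message>".format(
        quoteattr(_clean(author)),  # escapes and quotes the attribute value
        escape(_clean(message)),    # escapes &, < and > in the body
    )


# Markup characters, a non-ASCII letter, and a stray BEL control character.
xml = catbox_message_xml('a "quoted" <user>', "1 < 2 & caf\u00e9\x07")
parsed = ET.fromstring(xml)  # raises ParseError if the XML is malformed
print(parsed.get("author"), "|", parsed.text)
```

Legitimate Unicode passes through untouched; only the characters a parser would choke on are removed or escaped.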
- "Near Matches" searching. This throws errors every once in a while, and I'm not sure how to reproduce it. But there is some magic entry somewhere that makes the SQL query it runs go bad, which causes the whole page to crash and burn. It actually happens fairly frequently, so if you experience this error, please let me know.
- Sex crawlers. I'm not sure what causes this one, exactly. But people using some form of adult content search engine are coming across our writeups regarding sex (mostly newsgroup-y stuff like bestiality and animation porn) and then from there submitting a form POST (?), which throws up a 500. Needless to say, Googling this is too embarrassing to be helpful.
So there you have it, the common 500 errors. Generally they occur due to a lack of appropriate error catching and handling in our code. On the positive side, there used to be a lot more such errors, and they have been addressed on an ad hoc basis. To say that we only have 3 or 4 common causes of 500 errors is not so bad.
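To make "appropriate error catching" concrete, here is a sketch of a search that degrades to an empty result instead of taking the whole page down. It is an assumption-laden stand-in: it uses Python's built-in sqlite3 rather than MySQL, the `node` table and `near_matches` function are invented for the example, and since the real cause of the "Near Matches" failure is unknown, this does not claim to fix it.

```python
import sqlite3


def near_matches(conn, term):
    """Return titles containing `term`, or [] if the query fails."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    try:
        rows = conn.execute(
            # The search term is bound as a parameter, never pasted into the SQL.
            "SELECT title FROM node WHERE title LIKE ? ESCAPE '\\' ORDER BY title",
            ("%" + escaped + "%",),
        ).fetchall()
    except sqlite3.Error as exc:
        # Log the failure and show "no matches" rather than a 500 page.
        print(f"near-match search failed for {term!r}: {exc}")
        return []
    return [title for (title,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (title TEXT)")
conn.executemany(
    "INSERT INTO node VALUES (?)",
    [("Voting Oracle",), ("Epicenter",), ("100% cotton",)],
)
print(near_matches(conn, "oracle"))
print(near_matches(conn, "it's; DROP TABLE node"))
```

Logging the offending search term in the except branch is also the cheapest way to eventually discover which "magic entry" triggers a failure.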
Hope this was enlightening!