Apologies that this root log took so long in coming. I'm going to try to document here mostly the things that happened during July concerning the server downtime and the recovery up until the point when I'm writing this. Although coding slowed down for June, stuff still happened, and I've asked OldMiner to document that, which he should post soon hereafter.

Downtime and recovery

The exact chronology of events is fuzzy in my head. I know I have the order in which things happened correct, but exactly when, I don't know.

Summary

The actual complete downtime was three or four days around July 7, 2009. I know I managed to bring the site back to a flaky state on Friday the 10th; I remember it being a Friday quite well, because I stayed very late at the office at work until I knew the site could at least get back on its feet, however shakily. It took maybe ten more days to get the site into a usable state, and I've been tweaking it almost daily since then, up until this noding.

The dirty details

The beginning

Out of necessity this may get a bit technical, but even with the jargon, you should still get the general idea of what happened.

The site went down due to overheating. The AC in the university server room where e2 was hosted malfunctioned. Now, e2 was hosted on basically three machines:

web2
Load balancer, web frontend, runs cronjobs (maintenance tasks like updating New Writeups and cleaning out node row)
web5
Main webhead, runs all the Perl, serves requests for web2
db2
Database server, hosts the actual data, doesn't do much else.

In addition we have

web4
What we use for development, completely sandboxed from everything else, running its own database and cronjobs
web6
A webhead supposedly identical in function to web5, which wasn't doing much of anything other than contributing to global warming, not to mention server room warming.

All of them were running various Ubuntu versions in various states of update. The proposed scheduled downtime that probably jinxed us into enduring unscheduled downtime had as its goal bringing all of the Ubuntu installs up to the latest long-term support server release, Hardy Heron.

So clampe received notice that not all was right in the server room. He went in. Blown fuse or something. He powered down all the servers. And then...

web5 and db2 didn't come back.

So we were out of a database server and the main webhead we had. nate remotely rerouted power to web6, but without a database, all it could serve was Nate's word galaxy. web4 was also powered down, and it was unclear at the time whether it had suffered damage too, meaning we were running on pretty much nothing.

Emeritus commander-in-chief nate decided that we needed on-site backup or else we'd never recover from this. Taking nothing but some rations, a jackknife, a standard-issue potato gun, and bare equipment donated by Kurt, nate infiltrated the tremendous heat of the server room through the ventilation ducts, suspended horizontally from a safety harness so as to not touch the Hades the place had become. alex and I offered remote backup.

nate soon confirmed that indeed our main webhead and database server were out of commission, their motherboards fried. Performing careful surgery, he removed the RAIDed hard drives from the database server and backed up our database, reportedly also burning a copy to DVD. He brought web4, which wasn't fried, back online, much to my relief. I quickly proceeded to back up our development work to our spare web6, in case anything else might happen. nate worked most of that day getting the basic layout ready for e2 recovery. db2 was replaced with kurt1, our new database server, and web5 was replaced with tcwest, our new webhead, both running Fedora on the replacement equipment that Kurt had temporarily donated.

In the meantime, we had already planned to use the Word Galaxy to redirect people to an IRC room in case they wanted to know more about the server outage or just plain see some familiar faces for comfort. While nate was working, alex and I did what we could with the Word Galaxy and checked things over in IRC. Once she seemed networthy, the rest could be done remotely. nate retreated from the site and left most of the remaining work to me and alex.

Except for the RAID1 that we lost in db2, our replacement hardware was superior to what got burned. More RAM, more disk space, faster processors.

First attempts at recovery

The new servers had to be readied for serving e2. I worked on installing ecore on tcwest, our replacement webhead (which all the other servers still called "web5"). This involved fetching all of the necessary Perl modules from CPAN and, for both of us (alex and me), getting acquainted with package management in Fedora. Ecore's configuration had to be adapted to the new server, and the Perl configuration also took a little while. alex handled the routing, getting web2 to serve requests from both tcwest and web6. In addition, alex handled most of the Apache webserver configuration on web5, with just a little help from me for pointing out the e2-specific configuration and our URL-rewriting rules. web6 only needed a small ecore update since she was already more or less configured to handle serving e2.
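
To give a rough picture of what that routing looks like, here is a minimal sketch of a frontend spreading requests over two webheads, written as an Apache mod_proxy_balancer snippet. This is not our actual web2 configuration: the hostnames match the servers described above, but everything else here is an illustrative assumption only.

    # Illustrative only: a frontend that balances requests across two webheads.
    # Requires mod_proxy, mod_proxy_http and mod_proxy_balancer to be loaded.
    <Proxy balancer://e2webheads>
        BalancerMember http://tcwest:80
        BalancerMember http://web6:80
    </Proxy>
    ProxyPass        / balancer://e2webheads/
    ProxyPassReverse / balancer://e2webheads/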

The MySQL database configuration on kurt1 was much more difficult. All three of us tried a couple of stock configurations for MySQL, but when we tried to bring up the site with them, she keeled over in minutes and we had tons of locked queries. We were also running MySQL 5.1 instead of 5.0; although 5.1 has been declared stable, most server distributions still ship 5.0 as their stable version and 5.1 as experimental. This is a consequence of Fedora being primarily a desktop distribution, not a server one.

At any rate, we had no better idea of how to restore site operation, so we decided to see if switching the database engine would help. This was part of the scheduled upgrades anyway. So, after a few failed tries due to various bugs, I stayed up late one night at the office at work, because I had miscalculated how long it would take to switch engines and couldn't leave the operation unattended midway. After several hours, we switched from the MyISAM engine to the InnoDB engine, which is supposed to handle locks much better but makes different assumptions about how it will be used. InnoDB locks rows rather than whole tables, so it handles concurrent queries better, but it has other issues we would soon discover.
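
The switch itself is conceptually simple, which is part of why I underestimated how long it would take: every table has to be rebuilt. Something along these lines, table by table (an illustrative sketch, not the actual script I ran; the table and schema names here are made up):

    -- Convert one table from MyISAM to InnoDB. MySQL rebuilds the whole
    -- table, which is why this takes hours on tables with millions of rows.
    ALTER TABLE writeup ENGINE = InnoDB;

    -- List the tables still on the old engine (schema name is hypothetical):
    SELECT table_name, engine
      FROM information_schema.tables
     WHERE table_schema = 'everything'
       AND engine = 'MyISAM';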

This was about three or four days of downtime. In the meantime, the IRC channel became a refugee camp. Around the third or fourth night, after the move to a new database engine was complete, I hit the gas, started the cronjobs, redirected web2 to serve ecore pages, and with much difficulty, we got the website to 88 miles per hour and our first few pageloads started trickling in.

Improving pageloads, losing and recovering one engine

Once the website was up and at least staying up, it looked like most of our work was done, and I suggested that the IRC refugee camp could be liberated. However, pageloads were unbearably slow for about ten days or so. We tried everything we could think of, and I spent a lot of time more or less randomly tinkering with our MySQL configuration.

Somewhere along the line, I also tried to synchronise the ecore installs and homenode images between tcwest and web6, so that it would be easier to keep ecore updated with our work and you wouldn't get a semirandom homenode image depending on which of the two webheads served your request. Unfortunately, the core temperature in web6 was probably too high and she shut down, most likely due to overheating. She was the one hosting the single ecore and the single set of homenode images, so that incurred a couple of hours of downtime. alex rerouted all power to tcwest and questioned the wisdom of serving ecore and homenode images from a single server. nate rebooted web6 remotely, observed she was still webworthy, and I begrudgingly kept the ecores separate, but stubbornly insisted on mounting homenode images from one place only, still web6.
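
For the curious, the syncing and the "mounting homenode images from one place only" amount to something like the following, assuming rsync for the code and an NFS export for the images. The paths and export options are placeholders for illustration, not what is actually on our servers:

    # Keep the ecore checkout on tcwest in step with the one on web6
    # (hypothetical paths; run from web6).
    rsync -az --delete /var/e2/ecore/ tcwest:/var/e2/ecore/

    # Serve homenode images from a single place by exporting them from web6.
    # Hypothetical /etc/exports line on web6:
    #   /var/e2/homenodeimages  tcwest(ro,sync,no_subtree_check)
    # Then, on tcwest:
    mount -t nfs web6:/var/e2/homenodeimages /var/e2/homenodeimages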

Back to MySQL and InnoDB configuration. I tried increasing InnoDB's memory usage to allow it to keep as much of the database as possible in RAM, to alleviate the biggest bottleneck in almost anything: hard disk input/output. This marginally improved pageloads and 503s became rarer, but load times were still unacceptable, and occasionally web2 would give up waiting for tcwest or web6 to respond, hence the 503s.
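
For non-DBAs, "increasing InnoDB's memory usage" mostly means growing its buffer pool. A sketch of the relevant my.cnf knobs follows, with made-up numbers rather than our actual settings; the right sizes depend entirely on how much RAM the box has:

    [mysqld]
    # Cache as much table data and as many indexes in RAM as we can spare.
    innodb_buffer_pool_size = 4G
    # The redo log should grow roughly in step with the buffer pool
    # (changing it requires a clean shutdown and removing the old ib_logfile*).
    innodb_log_file_size = 256M
    # Flush the log to disk once a second instead of at every commit:
    # trades a sliver of durability for much better write throughput.
    innodb_flush_log_at_trx_commit = 2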

I spent a lot of time reading about InnoDB and MySQL optimisation, trying various things, but I was out of my element. We all were. Reasoning that our hardware was in fact better than what we had burned, except for the lost RAID, I couldn't give up on finding the magic combination that could bring pageloads back to what they were before the server crash. Finally, after asking around for help wherever I could, I found an IRC chap by the name of Raymond DeRoo who was kind enough to teach me a few basic optimisation tricks. It turns out that kurt1 had tons of sleeping processes hogging MySQL connections, so the first thing Raymond suggested was to reduce the timeout on sleeping connections from 8 hours to 3 seconds, and this had the very noticeable effect of bringing pageload averages down from a few minutes or longer to just a few seconds. He offered some advice on how to set up our my.cnf for improved performance, as well as teaching me how to track down and monitor bad queries. Some queries that worked well with MyISAM are terrible for InnoDB, so with the tricks I had learned, I spent the next week tracking down and modifying, or sometimes outright killing, bad queries. OldMiner proved to be an invaluable sounding board and also squashed a few bad queries of his own. After a few iterations of this cycle (site is slow, find bad query, kill bad query), we arrived at where we are now. During this process, pageloads were generally acceptable enough that edev could now offer some much-needed help.
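
In concrete terms, the sleeping-connection fix and the query hunt boil down to a handful of commands like these. This is a sketch, not a transcript of what we actually ran; the only values taken from the story above are the 8-hours-to-3-seconds timeout change:

    -- See who is connected and what they are doing; the sleepers show up
    -- with Command = 'Sleep' and a large Time value.
    SHOW FULL PROCESSLIST;

    -- Drop the idle-connection timeout from the 8-hour default to 3 seconds.
    -- This only affects new connections; existing sleepers still have to be
    -- killed or left to expire. The same values go into my.cnf so that they
    -- survive a restart.
    SET GLOBAL wait_timeout = 3;
    SET GLOBAL interactive_timeout = 3;

    -- Log anything slow so bad queries can be hunted down one by one
    -- (in my.cnf: slow_query_log = 1 and long_query_time = 2).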

So it looks like the Sourceforge netops' wisdom, as retold by nate, worked: we fixed our fucking code, and pageloads seem to be good.

From here on

For the past few days, it seems to me like every pageload has been very good; subjectively, they look even better than they were before the crash. I'm hoping that we killed all the major sources of slowdown, but it's almost certain that more still exist, so work in this direction should still be relevant. Also, long pageloads can result in database burps: writeup reputations not being accrued properly, votes or cools not registering, and that sort of thing. We should work to reduce the possibility of this happening.

I've been keeping tabs on core temperatures on all of our servers, as well as load. They're mostly doing ok, but web6's core is constantly at an alarming 70 C, so it's no surprise she went offline. Thankfully, she's not mission-critical at this point, so if we lose her to heat as well, the site shouldn't suffer considerably.

Additionally, nate has hinted that we will probably get replacement hardware again near the beginning of August. He describes kurt1 and tcwest as "loaners", so they're not intended to be permanent replacements. I am thus loosely documenting here the work necessary to bring e2 up to speed, in hopes that changing hardware again can go a bit more smoothly next time.

Other odds and ends

Amidst bringing the site back up to speed, a few bugs arose due to the various changes. For one, we had to use the development ecore, which we then had to patch into shape as quickly as possible. I think most of this instability is now behind us. Part of this work had the side effect of making our URLs a bit more readable, as they now should be in almost all cases, but it also temporarily disabled things like favouriting noders or bookmarking nodes.

Additionally, since site stability seemed dubious, at GrouchyOldMan's insistence I coded up node backup. No more offsite unmaintained clients for backing up nodes! It seems to work, but it's possible it's still a little broken.

About bugs, let me remind you that all software has bugs. Bugs are a fact of life. We can't get rid of all of them, but we can squash the biggest ones. Please help us do so by reporting bugs to e2 bugs whenever you encounter them. Don't tolerate them, and don't hesitate to report them just because you think they only affect you: they shouldn't be tolerated, and they probably affect other people besides you. We can't guarantee that we'll be able to squash all the bugs, but we should always try.

A personal note

I am going to be taking an extended vacation from e2, starting this Saturday, my birthday, one day after sysadmin day. I am aiming for four or five months, however long I can manage or need. The grand goal is to not come back to e2 until 2010, but I can't guarantee I'll be able to stay away from e2 for this long. I'm already suffering premature withdrawal symptoms. ;-)

The problem is that I've been feeling way too responsible for the site, and I need a little time to cool off from e2 and also dedicate more time to number theory and similar endeavours. It's not my job to fix e2, just something I do in my spare time, so it shouldn't be something that makes me feel ultimately responsible for site operations. As soon as I get permission from alex and/or nate, I'd like to give fellow splat OldMiner a stroll through the backend servers, so that we can still have someone reliable who can fix backend stuff and is familiar with the server layout. OldMiner has privately agreed to pick up where I leave off, which gives me peace of mind.

Of course, this shouldn't mean that you should now direct all coding issues to OldMiner. The primary venues of communication with coders are still e2 bugs and suggestions for e2, at least until we get the ticketing system ready. Even if one person happens to hog the code like I have since February, we still have a team of people available, and we should exploit everyone's strengths if we are to bring e2 out of late-20th-century web development practices.

For the rest of this week, I want to go on a bug-squashing spree and clean out as much of e2 bugs as I can. On the Sabbath, I rest. I will come back to keep working on e2. I can't stay away from this site for too long, but I need a vacation.


1 RAID: Several hard drives, in this case two, working in tandem as if they were one, for speed and backup purposes.
