Unplanned Downtime
There have been quite a few questions about the downtime Everything2 experienced early this month. Everything2 was effectively unreachable for about half the day on March 10th. This downtime was unplanned and appears to have been the result of a hardware failure. I have had no contact with the people who have physical access to the machines, so I'm not entirely sure what happened.
What I can tell you is that we lost one of the machines E2 was running on and gained another. Plus one server, minus one server, E2 should be at the same capacity as before, but... the new machine has an issue, which is why we're seeing some HTTP 500 error messages. If we get our dead server back or resolve the issue on the new one, things should be peachy. Meanwhile, they're 80% peachy.
For the time being, if you get a 500 error, please just reload the page.
The People
If E2 goes down, it's a decent idea to try to reach somebody associated with E2. E-mail to webmaster@everything2.com will get to some important people (myself not included). If you have a means to reach alex, Oolong, avalyn, myself, or nate, we can all remotely access the servers to check on problems. My AIM is listed on my homenode and may be the single most reliable way to get hold of me. #everything2 occasionally has an admin in it, but it's less active than back in the good old Swappian days. People other than those listed above have access to our servers as well, but they're less likely to be reachable.
More Technical Rambling
For those of you interested in the more technical details of the matter, please continue reading. Who knows, maybe you can help us out here?
E2's configuration looks roughly like this:
                   +- web3 ---------+
everything2.com   /                  \
----> web2 -*----*--- tcwest(web5) ---*----> db2
             \    \                  /
              \    +- web6 ---------+
               \
                +----------------------- web4
                                          devel.everything2.com
All requests come through web2. Requests destined for production (www.)everything2.(com|org|net) get load balanced between our webheads: web3, tcwest, and web6. You'll note that tcwest is listed with two names: it inherited an IP address from the old web5, and since that name is written into all of the machines' hosts files, tcwest answers to both.
If the webheads (web6, tcwest, and web3) can't be reached reliably, web2 serves up the "hamster negotiations" page with a link to the #everything2 chatroom.
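For the technically curious, a Pound configuration that implements this kind of routing and failover might look roughly like the sketch below. This is purely illustrative: the addresses, ports, and layout are made up rather than web2's actual settings, and the devel routing to web4 is left out.

# Illustrative sketch of a web2-style Pound setup (all values are placeholders).
ListenHTTP
    Address 0.0.0.0
    Port    80

    Service
        # Production traffic is load balanced across the webheads.
        BackEnd
            Address 10.0.0.13    # web3
            Port    80
        End
        BackEnd
            Address 10.0.0.15    # tcwest (web5)
            Port    80
        End
        BackEnd
            Address 10.0.0.16    # web6
            Port    80
        End

        # Used only when every backend above is marked dead:
        # a local server holding the "hamster negotiations" page.
        Emergency
            Address 127.0.0.1
            Port    8080
        End
    End
End

Pound takes a backend out of rotation when it stops responding and periodically retries it, which is why the hamster page only appears when all three webheads are unreachable at once.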
In this case, I noticed the site was serving the hamster page a few minutes after it went belly up, somewhere around 4am Colorado time. I logged into web2 and tried to rectify things. web2's disk had been remounted read-only after errors were detected. I ran fsck and acknowledged far too many repairs to count. I then *gulp* rebooted the machine. Not a good move: I should have waited until we had somebody on site before risking losing contact with the gateway to E2. I sent out an e-mail to a bunch of the admin types. A few hours later, someone on site rebooted web2. I was asleep by that point.
alex logged in once this was done and realized that web6 was down, in addition to the disk weirdness on web2. He got the site back up running on just tcwest. Running on a single machine, the site was predictably slow and often dropped requests on the floor.
Until the downtime on March 10th, tcwest and web6 were the only two machines in our load-balanced pool. Once I woke up and read that web6 went down, I configured and added web3.
web3, web6, and db2 are all new boxes (quad-core Xeon X3360, 8GB RAM) that nate kindly got for E2 after the downtime in July. You may recall the new web6 and db2 were put into rotation in October after a short period of (planned) downtime. web3 probably could have been added to the webhead pool much sooner, but I didn't move quickly on that.
The Scourge of the 500 Errors
kthejoker previously spoke about 500 and 503 errors. Both of these errors are generated by Pound running on web2 when it has trouble reaching the webheads. 500 errors generally indicate that the destination Apache process crashed before it served its response. 503 errors generally indicate that load is overwhelming our servers.
When we were running on only tcwest, we were getting very slow responses and 503s out the yin-yang. Now, with web3 & tcwest, we have faster responses but also the occasional 500. (Less than 1% of requests are getting a 500 error, but I'd much rather this were in the 0.01% region.)
I've established that the only Apache crashes we're seeing are on web3 and, further, that these crashes always happen when accessing MySQL. Not a single 500 has been generated for requests for static content. An example backtrace from an Apache crash looks like this:
#0 0x00007f6b9364efbe in mysql_send_query () from /usr/lib/libmysqlclient_r.so.15
#1 0x00007f6b9364f029 in mysql_real_query () from /usr/lib/libmysqlclient_r.so.15
#2 0x00007f6b7aa75dc1 in mysql_st_internal_execute () from /usr/lib/perl5/auto/DBD/mysql/mysql.so
#3 0x00007f6b7aa764dc in mysql_st_execute () from /usr/lib/perl5/auto/DBD/mysql/mysql.so
#4 0x00007f6b7aa7cc07 in XS_DBD__mysql__st_execute () from /usr/lib/perl5/auto/DBD/mysql/mysql.so
#5 0x00007f6b8a13e9fe in XS_DBI_dispatch () from /usr/lib/perl5/auto/DBI/DBI.so
#6 0x00007f6b8e0b66d0 in Perl_pp_entersub () from /usr/lib/libperl.so.5.10
#7 0x00007f6b8e0b4972 in Perl_runops_standard () from /usr/lib/libperl.so.5.10
#8 0x00007f6b8e0b22c8 in Perl_call_sv () from /usr/lib/libperl.so.5.10
#9 0x00007f6b8e391244 in modperl_callback () from /usr/lib/apache2/modules/mod_perl.so
#10 0x00007f6b8e391954 in modperl_callback_run_handlers () from /usr/lib/apache2/modules/mod_perl.so
#11 0x00007f6b8e391f4f in modperl_callback_per_dir () from /usr/lib/apache2/modules/mod_perl.so
#12 0x00007f6b8e38b9a0 in ?? () from /usr/lib/apache2/modules/mod_perl.so
#13 0x00007f6b8e38bb59 in modperl_response_handler_cgi () from /usr/lib/apache2/modules/mod_perl.so
#14 0x00007f6b95200c13 in ap_run_handler () from /usr/sbin/apache2
#15 0x00007f6b952043af in ap_invoke_handler () from /usr/sbin/apache2
#16 0x00007f6b95211aa0 in ap_internal_redirect () from /usr/sbin/apache2
#17 0x00007f6b8de01bd5 in ?? () from /usr/lib/apache2/modules/mod_rewrite.so
#18 0x00007f6b95200c13 in ap_run_handler () from /usr/sbin/apache2
#19 0x00007f6b952043af in ap_invoke_handler () from /usr/sbin/apache2
#20 0x00007f6b95211c7e in ap_process_request () from /usr/sbin/apache2
#21 0x00007f6b9520eab8 in ?? () from /usr/sbin/apache2
#22 0x00007f6b952085e3 in ap_run_process_connection () from /usr/sbin/apache2
#23 0x00007f6b95216f65 in ?? () from /usr/sbin/apache2
#24 0x00007f6b949441ad in dummy_worker () from /usr/lib/libapr-1.so.0
#25 0x00007f6b947083ba in start_thread () from /lib/libpthread.so.0
#26 0x00007f6b94474fcd in clone () from /lib/libc.so.6
#27 0x0000000000000000 in ?? ()
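To take E2's own code out of the picture, a stripped-down handler along these lines (the package name, DSN, and credentials are hypothetical placeholders, and this is only a sketch) exercises the same DBI -> DBD::mysql -> libmysqlclient path shown in the frames above:

# E2/CrashProbe.pm -- hypothetical test handler, not part of E2's codebase.
# Runs a trivial query through DBI/DBD::mysql from inside mod_perl, the
# same path the crashing requests take.
package E2::CrashProbe;

use strict;
use warnings;

use Apache2::RequestRec ();
use Apache2::RequestIO  ();
use Apache2::Const -compile => qw(OK);
use DBI ();

sub handler {
    my $r = shift;

    # Placeholder connection details; substitute the real db2 values.
    my $dbh = DBI->connect(
        'DBI:mysql:database=everything;host=db2',
        'e2user', 'secret',
        { RaiseError => 1, AutoCommit => 1 },
    );

    # A trivial query is enough to reach mysql_send_query().
    my ($one) = $dbh->selectrow_array('SELECT 1');
    $dbh->disconnect;

    $r->content_type('text/plain');
    $r->print("query returned: $one\n");

    return Apache2::Const::OK;
}

1;

Mapped to a URL with SetHandler perl-script and PerlResponseHandler, then hit with a loop of requests, it should tell us whether plain DBI use from mod_perl is enough to kill an Apache child on web3, or whether something in E2's own code is needed to trigger it.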
Why the heck is it doing this? Good question. I don't know. I've updated Apache and MySQL to their most recent packages. Research suggests this sort of thing has sometimes happened when MySQL and Apache are built against mismatched versions, so I recompiled Apache from a source package. No improvement.
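One way to chase the version-mismatch theory further is to ask DBD::mysql what it is actually linked against at runtime. A quick script along these lines (connection details are again placeholders) prints the relevant versions; run on web3 and, once it's back, on web6, the outputs can be compared directly:

#!/usr/bin/perl
# Print the MySQL client library versions as seen by DBD::mysql.
use strict;
use warnings;
use DBI ();
use DBD::mysql ();

print "DBI version:        $DBI::VERSION\n";
print "DBD::mysql version: $DBD::mysql::VERSION\n";

# Placeholder DSN and credentials; use whatever the site itself uses.
my $dbh = DBI->connect('DBI:mysql:database=everything;host=db2',
                       'e2user', 'secret', { RaiseError => 1 });

# These attributes report the libmysqlclient loaded at runtime and the
# server on the other end of the connection.
print "client info:     $dbh->{mysql_clientinfo}\n";
print "client version:  $dbh->{mysql_clientversion}\n";
print "server info:     $dbh->{mysql_serverinfo}\n";

$dbh->disconnect;

Running it both from the command line and from inside a mod_perl handler would also show whether Apache somehow ends up with a different libmysqlclient than a plain perl process does.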
web6 theoretically had the same configuration as web3, and web6 didn't have these problems. It'd be nice to compare the configuration of the two servers to track down the difference, but with web6 offline, that isn't possible.
This is an open issue we're working on resolving. Feel free to /msg me if you have a suggestion.
Takeaways
-
Every time something like this happens, there's a lot of "shouldn't E2 be running...?" That's the benefit of having a bunch of geeks among our most vocal users. Although my knee-jerk reaction is to take these statements as negative criticism, they're driven by a desire to make things better, and that's a good motivation.
Some perspective for those who want to help out: all of our servers have to fit into a relatively small space in a professor's office, which means anything that generates more heat (including an extra server) is a liability to all of E2. Further, E2's resources for on-site maintenance are even more limited than our software resources. Even if you have hardware that would be perfect for E2, are willing to give it to us for free, and can ship it to where it needs to be, we still might not have the resources to plug it in and test it for weeks or months.
-
Along those lines, though, it's clear web2 should be made more highly available. Since the loss of that one machine takes out all of E2, either the machine or its disks should be redundant. I don't know its current disk configuration, but adding a mirrored RAID setup (if one isn't already present) seems easier than replicating web2 and having a way to fail over to the copy.
-
It's helpful to have error reports, even if you think we might already know about a problem; it's possible we thought we fixed it, or never really knew about it. But for this one: yes, I know we have 500s. You can stop reporting that one for now.
-
We really need web6 back up. Even if the present Apache issue on web3 is fixed, E2 can only barely run on either web3 or web6 alone, and if web3 goes down at this point, we're in trouble.