Hey everybody, my name is Jay, and I'm a professional coder type. There's a long story about me getting my start screwing up E2 back in the day, but the short version is that I'm back.
- I've pushed a change live that should significantly reduce the load on the webheads. The story is that we were never really caching the result of Everything::NodeCache::deriveType properly, so we had to make tons and tons of calls to it to rebuild each type on every pageload, even though we were supposed to be using staticNodetypes. One-line fix. Devel::NYTProf was a great help in figuring this out. I've got a local Vagrant VM set up, installed with Chef, that made this possible. I'll be sharing the base image and the configurations in a few days once I'm confident they're correct and usable.
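For readers who want to see the shape of that kind of fix, here is a minimal sketch of caching a derived type per process. It is not the actual E2 patch: the %TYPES table, derive_type, derive_type_uncached and the field names are all made up for illustration, and the real deriveType does considerably more work.

```perl
use strict;
use warnings;

# Hypothetical stand-in data: each type inherits the fields of its parent.
my %TYPES = (
    node     => { parent => undef,      fields => ['title'] },
    document => { parent => 'node',     fields => ['doctext'] },
    writeup  => { parent => 'document', fields => ['parent_e2node'] },
);

my $derive_calls = 0;   # counts how often the slow path actually runs
my %derived_cache;      # type name => derived type, kept for the process lifetime

# The expensive part: merge a type's own fields with its inherited ones.
sub derive_type_uncached {
    my ($name) = @_;
    $derive_calls++;
    my $type   = $TYPES{$name} or die "unknown type: $name";
    my @fields = @{ $type->{fields} };
    if (defined $type->{parent}) {
        # Inherited fields come first; the parent lookup goes through the cache too.
        unshift @fields, @{ derive_type($type->{parent})->{fields} };
    }
    return { name => $name, fields => \@fields };
}

# The cached entry point: store the result on first use, reuse it afterwards.
sub derive_type {
    my ($name) = @_;
    return $derived_cache{$name} //= derive_type_uncached($name);
}

derive_type('writeup') for 1 .. 1000;    # simulate many lookups
print "derive calls: $derive_calls\n";
print join(',', @{ derive_type('writeup')->{fields} }), "\n";
```

With the cache in place, the slow path runs once per type in the inheritance chain rather than once per lookup.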
- Jan 3 (Early) - I've installed SizeLimit on dom01, and it seems to be working well, even though I'm not happy with how long it is lasting, nor with the speed of the leak. I'm going to make this change live globally to protect us against swap thrashing, and then move on to figuring out what is causing the issue. Also, since we're not running any memcached instances, I'm going to change up that part of the configuration for now, I think.
Tue Jan 3 11:04:47 2012 (4517) Apache2::SizeLimit httpd process too big, exiting at SIZE=301712/300000 KB SHARE=4352/0 KB UNSHARED=297360/0 KB REQUESTS=81 LIFETIME=4511 seconds
Tue Jan 3 11:05:02 2012 (4511) Apache2::SizeLimit httpd process too big, exiting at SIZE=304576/300000 KB SHARE=4340/0 KB UNSHARED=300236/0 KB REQUESTS=86 LIFETIME=4629 seconds
Tue Jan 3 11:06:59 2012 (4524) Apache2::SizeLimit httpd process too big, exiting at SIZE=313300/300000 KB SHARE=4364/0 KB UNSHARED=308936/0 KB REQUESTS=81 LIFETIME=4779 seconds
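A setup along these lines would produce log lines like the ones above. This is only a sketch of a mod_perl 2 startup file, assuming a reasonably recent Apache2::SizeLimit; the 300000 KB ceiling is taken from the log output, while the file name and the check interval are assumptions, not the production values.

```perl
# startup.pl (hypothetical name), loaded by Apache at server start.
# It only runs inside mod_perl, not as a standalone script.
use strict;
use warnings;
use Apache2::SizeLimit ();

# Retire a child once its total size passes 300000 KB,
# the limit visible after the slash in SIZE=.../300000 KB.
Apache2::SizeLimit->set_max_process_size(300_000);

# Assumed value: measure only every 5th request to keep the check cheap.
Apache2::SizeLimit->set_check_interval(5);

1;

# httpd.conf also has to register the handler:
#   PerlCleanupHandler Apache2::SizeLimit
```

The SHARE and UNSHARED limits show as 0 in the log, which suggests only the total process size was being capped.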
- Jan 3 (Late) - I've pushed the Apache2::SizeLimit stuff to production, and the webheads are much happier now that it has settled down. There is still a bit where processes will occasionally spin out of control, and I'm not sure what causes that, as it seems somewhat random. I'm hoping that it will go away as we continue cleanups. Also, I think that Apache2::SizeLimit is going to take care of the 503s nearly entirely. If it does not, I'll have to add in more profiling.
- Renamed the boa/images/People.gif to People.gif.bak so case-insensitive platforms won't complain. LMK if that breaks anything. There were 6 or so of these.
- Removed INDEX.PL (all caps) for the same reason as above.
- Removed the memcached configuration from the initEverything calls, since we aren't using memcacheds right now. We'll revisit that later if need be, and if hardware permits. This represents about a 3-4% speed increase in NYTProf not having to make the call.
- I've changed the haproxy config to point to /node/fullpage/haproxyok, instead of /title/ENN, because that was crazy. It had to render ENN every time, including all of the links, which was poisoning the caches
- I've found that the infinite loop is caused by a bot in at least one circumstance, but I'm not sure of the underlying technical cause.
- Update: I have a 100% reproducible case that causes the infinite loop. If I have time, I'll pull dom02 out of rotation to debug it. Funny thing is that it's a poisonous node. No, I won't tell you which one.
- Made a change to nodeName to never reference keyword settings, as it's not something we have or use in e2, and I'm not worried about general engine cases anymore
- Infinite loop problem solved! 100 Worst Britons was one of the triggers. A bot found it while I had the forensics running. It was a bad starter <td> tag. I've patched it for now, but I'd like to know the actual cause of it.
Jan 5 -
- Changed the configuration parser to happen once per apache restart instead of once per page load. It's a simple change and one less file read per pageload.
- Did some investigation into the speed stuff with vars, as that is a huge load. Found some performance there:
edev: vars unescape inlining. Pageloads should be several percent faster overall.
- Pre-loaded CGI.pm in an effort to stop it from trying to be all smart about loading its symbols. It might be a smidge more memory, but I want to get out of calling Exporter::import where I can
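The once-per-restart configuration change can be pictured roughly like this. It is a sketch under assumptions, not the engine's code: parse_config_file, get_config and the "key = value" file format are hypothetical, and the point is only that a file-scoped variable survives across requests inside one Apache child.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $parse_count = 0;   # how many times the file was actually read
my $CONFIG;            # lives as long as the process does

# Hypothetical parser for simple "key = value" lines.
sub parse_config_file {
    my ($path) = @_;
    $parse_count++;
    open my $fh, '<', $path or die "cannot open $path: $!";
    my %conf;
    while (my $line = <$fh>) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ($key, $value) = $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/ or next;
        $conf{$key} = $value;
    }
    close $fh;
    return \%conf;
}

# Parse on first use, then hand back the same hashref on every later call.
sub get_config {
    my ($path) = @_;
    return $CONFIG //= parse_config_file($path);
}

# Demo with a throwaway file standing in for the real configuration.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "# sample settings\nsite_name = Everything2\ncache_size = 500\n";
close $tmp_fh;

get_config($tmp_path) for 1 .. 100;    # simulate 100 page loads
print "parsed $parse_count time(s); cache_size = ",
    get_config($tmp_path)->{cache_size}, "\n";
```

The trade-off is that a configuration change only takes effect after the process restarts, which matches the "once per apache restart" behavior described above.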
Jan 8 -
- Nuked the schema dbtable, because it doesn't exist in the db, and it is fouling up my bootstrapping code
- Huge progress made on bootstrapping code, such that it's almost ready for public consumption. Had to create a dbtable entry for rating, as it doesn't have one, yet it is required by nodetypes in the database
- Been a busy few days in infrastructure land, but here's where we're at. The code is mostly imported over at GitHub. It's not really fit for public contribution yet, but if you want to keep an eye on it, it's easy to guess the URL for it. Bootstrapping continues along and the node export is pretty well polished. There are a few items off the top of my head that need to be properly done up: the stored procedures need some kind of node sanity built around them, and we still need some code to properly diff and handle deltas and database transitions, but we'll get there. Unlike Rails, which needs to handle the general case, we can get away with just ordering some ALTER TABLE statements and calling it a day.
- Fixes in GH for Vagrant 0.9, which I did not have installed (I had 0.8.10), and stylesheet gen was not being run on install
- Still working on infrastructure, which is important because we need to have very tight data control over how we manipulate nodes out of source-control. I'm exercising the code and finding minor issues as I go. I've created a default mapping branch for generally untested vagrant-specific fixes, until I can get WebDriver tests up and running. After we get this infrastructure stuff together, making rapid changes should be a lot easier.
- Working on some business-end type stuff now, but I am going to need to change over the adsense account, so if you see an ad error anytime shortly, please let me know.
- More work on ecoretool.pl, starting the very careful process of the import procedure, which will allow rapid, sane iteration over the database
- Still totally buried around coding infrastructure. The good news is that I am making pretty good headway. I am in the process of going through each of the types and reducing the node pieces down to the most vital bits of non-duplicated information and storing that to XML, along with rules on how to rebuild and insert the structures.
Overall, several problems remain:
- Somewhere we are leaking memory. Apache2::SizeLimit "solves" it for now, in that it preserves the machine, but I'll need to take a really in-depth look at the NodeCache/NodeQueue to see what's wrong. I can reproduce the problem in my local virtual machine, so it's a codebase issue, but I don't know what it is. 500 Guest User page loads show a noticeable resident increase (even though the nodecache is full). It happens quicker than that in production with a greater variety of pageloads and users, so I'm assuming it has something to do with NodeCache churn, but no smoking guns yet.
- My gut says that we can buy some performance on object copy by not keeping an _ORIGINAL_VALUES hash item around solely so that we can compare against it on updates. That item is being cached too, and it is effectively halving the performance of the nodecache.
- deriveNode might want to use Clone.pm instead of the raw copy, but that's a really small performance pickup
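To make the copy question concrete, here is a small sketch of the difference between a plain hash copy and a deep copy. It assumes that "raw copy" means a top-level hash copy, which the notes above do not confirm. Storable's dclone (a core module) stands in for Clone::clone so the example runs without extra installs, and the node fields are toy data.

```perl
use strict;
use warnings;
use Storable qw(dclone);   # core module, used here in place of Clone::clone

# Toy node; the field names are illustrative only.
my $node = {
    title            => 'example node',
    group            => [ 1, 2, 3 ],
    _ORIGINAL_VALUES => { title => 'example node' },
};

# Plain copy: a new top-level hash, but nested references are shared.
my %shallow = %$node;

# Deep copy: every nested structure is duplicated.
my $deep = dclone($node);

push @{ $shallow{group} }, 4;   # also changes $node->{group}
push @{ $deep->{group} }, 5;    # leaves $node untouched

print "original group: @{ $node->{group} }\n";
print "deep group:     @{ $deep->{group} }\n";
```

Note that a deep copy also duplicates the nested _ORIGINAL_VALUES hash, so the cost of carrying that item shows up in every copy.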