Hey folks. It's been a busy couple of months while my wife and I (mostly my wife) had a child, and I have been working on a longer-term technical debt project that stands in front of our redesign. I talked a bit about this in the State of the Site writeup I did a few months back.
Lessons learned on the road to UTF8
You can hit up a bunch of different websites about some of the technical details here, but this is a cleanup many years in the making. It's taken me a while to really grasp how and why our database situation is the way it is.
Currently, if you take a look at the HTTP headers that go back and forth between your browser and the site, you'll notice that we always answer in ISO-8859-1. This is a fancy way of saying only certain English and Swedish characters are available. We've long worked around this limitation by using HTML numeric character references for the code point. For instance, &#928; produces: Π. We don't store these characters natively in most cases, which means you can't search for them.
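To make that workaround concrete, here is a small Python sketch of what the browser does with a numeric character reference. The browser decodes the reference itself, so it works even when the page is served as ISO-8859-1:

```python
import html

# "&#928;" is the decimal numeric character reference for U+03A0,
# GREEK CAPITAL LETTER PI. The browser decodes the reference, so the
# page can stay iso-8859-1 while still displaying the character.
print(html.unescape("&#928;"))  # Π
print(ord("Π"))                 # 928
```

The catch, as noted above, is that the database only ever sees the literal string `&#928;`, so you can't search for the character itself.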
Here's where it gets weird. MySQL will let you place data in the database as whatever you'd like. Even if you have a column that wants latin1 data, you can jam UTF-8 bytes in there, and it won't really complain much. You just can't pull it out cleanly or select on the bytes, and it ends up as question marks on the display side. Any homogenization happened whenever the data was consumed by a utf8 client, namely mysqldump.
Why now? Well, there are a large number of problems associated with this dual-encoding strategy. The first is that JSON, on which this site heavily relies, is always Unicode (UTF-8, -16, or -32). This is part of the reason why some of our outputs behave strangely. Also, just having one good and holy encoding to rule them all makes sense.
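You can see the JSON constraint directly in Python's standard library: a serializer either escapes everything outside ASCII or emits real Unicode text; there is no latin1 option.

```python
import json

# JSON text is defined over Unicode. A latin1-only pipeline has to
# escape everything outside ASCII, or hand the parser broken bytes.
data = {"title": "Π"}

print(json.dumps(data))                      # {"title": "\u03a0"}
print(json.dumps(data, ensure_ascii=False))  # {"title": "Π"}
```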
I went through a lot of trouble doing everything I could to verify that the procedure was automated, hit every table, fixed the database, and handled all of the corner cases. On my laptop, the convert script takes about 6 hours to run, and that includes creating character-length verification tables to make sure that we don't truncate anything on conversion.
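The character-length verification idea can be sketched like this (this is an illustration of the approach, not the actual convert script): snapshot the character length of every value before converting, then compare afterwards and flag any row whose length changed.

```python
# Sketch of a truncation check: any row whose character length
# differs after conversion got truncated (or otherwise mangled).
def length_snapshot(rows):
    """rows: iterable of (primary_key, text_value) pairs."""
    return {pk: len(value) for pk, value in rows}

def find_truncations(before, after):
    """Return primary keys whose character length changed."""
    return sorted(pk for pk, n in before.items() if after.get(pk) != n)

before = length_snapshot([(1, "jåybonci"), (2, "naïve")])
after = length_snapshot([(1, "jåybonci"), (2, "naï")])  # row 2 lost chars
print(find_truncations(before, after))  # [2]
```

In MySQL terms, the same check would compare `CHAR_LENGTH()` of each column before and after the `CONVERT TO CHARACTER SET` pass.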
Side effects of utf8
- Due to the collation, the site will treat 'jaybonci' and 'jÅybonci' as the same.
- You should now be able to directly enter Japanese and other non-Western characters with ease, and they should always display properly.
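The first side effect comes from the collation's accent- and case-insensitivity. A rough Python approximation of that behavior (the real MySQL collation uses its own mapping tables; this NFD-strip-casefold sketch just illustrates the idea):

```python
import unicodedata

def collation_key(s):
    """Approximate an accent- and case-insensitive collation key:
    decompose, strip combining marks, then casefold."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed
                       if unicodedata.category(c) != "Mn")
    return stripped.casefold()

print(collation_key("jaybonci") == collation_key("jÅybonci"))  # True
```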
The downtime to address this might last as long as 6 hours, though I think it's probably going to be more like 3. I'm scheduling it for 10pm EST on Saturday, April 20.
Not as a vote of timidity, but I'll be taking a backup ahead of time.
- UTF8 conversion done
- Updated Vagrant to version 1.12
- Removed unMSify from Everything::HTML's parseLinks.
- Removed unMSify from devparselinks
- Worked locally for a bit. (Note to self:
  SELECT n.node_id, n.title, n2.node_id, n2.title
  FROM node n, node n2
  WHERE n.title = n2.title
    AND n.type_nodetype = n2.type_nodetype
    AND md5(LOWER(n.title)) != md5(LOWER(n2.title))
  INTO OUTFILE '/tmp/e2_different_titles.csv'
  FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n';)
- Same fix for linkstylesheet
- Let Yandex back in after talking with their search relations team a bit.
- Took out the captcha stuff everywhere
- Fixed a problem with zen stdcontainer that caused the base href to be set incorrectly in the development environment.
- Nuked test newwriteups, as I am doing a sweep for items that expect $NODEs to be HASHes.
- Took out the 'stars' stuff from zen writeup display page
- Fixed a check in writeup maintenance create where we were looking for NODEs to be a HASH
- Removed the hash check in addwriteup
- Neutered the security hole that is displayNODE. Replaced the HASH check with one that also accepts Everything::node.
- Fixed a paging bug in Drafts which caused a server error in Development.