SEO analysis of Everything2.com
The internet is a big place, and Everything2 is a big website - but there are a lot of reasons why Everything2 isn't getting the levels of traffic it deserves. The below is a search engine optimization analysis for Everything2.com, done on 12 November 2008. Hopefully some of the suggested changes can be implemented in order to make E2 a higher-traffic site all around.
It's worth noting that I don't have access to E2's Google Analytics account or its server logs, so this is an analysis based on the same information as you and I can see - it could be made a lot more accurate with further information.
Everything2 historically had a lot of uptime problems, including being down for several hours each night for back-up purposes; in the past year, however, the site has had relatively few technical issues, and those that did occur weren't significant enough to cause major problems for search engines, even though they might occasionally be annoying to, say, E2 addicts.
HTML / CSS compliance
Seeing as how E2 is a site which allows its users to upload their own articles in HTML (and how many users don't close their tags, use them incorrectly, or have other habits which fall short of full standards compliance), it's unrealistic to expect full compliance.
Having said this, the standard E2 theme (i.e. the one seen by non-signed-in users and search engine spiders) should be as close as possible to being HTML and CSS compliant.
Problems found on the home page:
- E2 reports HTML 4.0 Transitional as its document type, yet the system identifier doesn't match.
- E2 is using XHTML-style self-closing tags (<foo /> instead of <foo>), which aren't valid in HTML 4.0 Transitional
- The home page has closing tags without opening tags
- E2 is wrapping hidden input fields with div tags, which doesn't seem to make sense
See the W3 validator for more.
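The "closing tags without opening tags" problem is the kind of thing that's easy to check mechanically. Here's a rough sketch of such a check in Python (E2 itself runs on Perl, so this is illustrative only, and the class name is my own invention), using the standard library's lenient HTML parser and a simple tag stack:

```python
from html.parser import HTMLParser

# Elements that are void in HTML 4.01 and never take a closing tag.
VOID = {"br", "hr", "img", "input", "meta", "link", "area", "base", "col", "param"}

class TagBalanceChecker(HTMLParser):
    """Rough check for closing tags that have no matching opening tag."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching opener, implicitly closing anything inside.
            while self.stack and self.stack.pop() != tag:
                pass
        else:
            self.errors.append("closing </%s> without opener" % tag)

checker = TagBalanceChecker()
checker.feed("<div><p>hello</div></p>")
print(checker.errors)  # ['closing </p> without opener']
```

This is far cruder than the W3 validator, of course, but it shows why a spider's parser can get confused: once the tag stack goes wrong, everything after the mismatch is ambiguous.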
The problem with not being HTML compliant is that while the page may look correct in most browsers, it may still throw search engines a wobbly, confusing them to the point that parts of the site don't get spidered correctly.
Impact: 7/10 / Rating: 4/10
HTML document length
Because E2 allows users to enter nigh-on any amount of text (not many write-ups exceed the maximum character limit), it's highly likely that many documents within E2 exceed the maximum recommended HTML document length. Despite this, most of the important HTML is near the top of the document, which means that this isn't a huge deal overall.
Impact: 1/10 / Rating: 7/10
E2 doesn't use meta description tags, which is a missed trick: these are often used by search engines to give a preview of a document to users, enticing them to come visit the page. Having said that, try a search on dance on E2 in Google, and you'll see that Google is already picking up this information correctly. For correctness, it's worth adding them in anyway: every page should have a <meta name="description" content="Page summary (up to 20 words) goes here"> - especially special pages such as 'about us' pages and the home page.
Page summaries for all content-driven pages could be made up of the first 20 words of the first write-up on that page, which would cover most of the write-ups.
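Generating that summary is a one-liner's worth of work. As a sketch (Python rather than E2's Perl, and the function name is mine), assuming the raw write-up text has already had its HTML stripped:

```python
import html

def meta_description(writeup_text, max_words=20):
    """Build a meta description tag from the first few words of a write-up."""
    summary = " ".join(writeup_text.split()[:max_words])
    # Escape quotes and angle brackets so the attribute value stays valid.
    return '<meta name="description" content="%s">' % html.escape(summary)

print(meta_description("Tea is a beverage made by steeping processed leaves in hot water."))
```

The only real subtlety is escaping: a write-up that opens with a quotation mark would otherwise break out of the attribute.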
Impact: 4/10 / Rating: 7/10
Impact: 2/10 / Rating: 9/10
Menu and navigation system
Unavoidably, a site with well over 3M unique URLs depends on being crawled extremely deeply. E2's unique linking policy means that, through the use of user-selected hard links and user-generated soft links, pretty much all of E2 is well connected, even without a dedicated navigation system.
An illustration of how well this works - let's pretend to be a search engine, and click on the first content link we come across in a user journey through the site, starting from the home page:
Korean War - World War II - Germany - Denmark - findings: - Slogging through the molasses of sleep-deprived conscious thought - I, Insomniac - 1997 - Princess Diana - Diana, Princess of Wales - courtesy title - title - tilde - approximately
And so on, and so forth. Bearing in mind that a search engine doesn't follow only the first link, but the first 20 or so links it finds, it's pretty safe to assume that it will eventually find every write-up which has been decently linked - so as long as the E2 community continues to encourage this behaviour, we're in good shape.
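What the spider is doing here is a breadth-first traversal of the softlink graph. A toy model (hypothetical page names, and a fanout of 20 links per page as assumed above) makes the "eventually finds everything" claim concrete:

```python
from collections import deque

def crawl(links, start, fanout=20):
    """Breadth-first crawl following at most `fanout` links per page,
    the way a spider discovers E2's web of soft and hard links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, [])[:fanout]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Toy softlink graph: every node is reachable from the home page.
graph = {
    "home": ["Korean War"],
    "Korean War": ["World War II"],
    "World War II": ["Germany", "Denmark"],
}
print(sorted(crawl(graph, "home")))
```

As long as the graph stays connected - which is exactly what the softlink culture encourages - every node ends up in `seen`.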
Impact: 10/10 / Rating: 10/10
E2 doesn't have a site map, but it's difficult to see how it could build a good one anyway. There has been some research into this recently, and much of it suggests that site maps in general can do more harm than good on extremely big sites (which E2 most certainly is). Given that the inter-write-up navigation is so strong (see the above point), I see no good reason to implement one.
Impact: 0/10 / Rating: 0/10
Unique and unambiguous URLs
Ah, now this is probably one of E2's biggest problems: it suffers from a tremendous amount of content duplication. Take the node Fuck, for example: there are 5 write-ups inside this node, which means that you can get to each individual write-up in several different ways (by visiting Fuck or Fuck (idea), for example). This is a bad thing: as a search engine, would you send a person searching for fuck to the node or to the individual write-up?
On top of this redundancy, however, there's the URL policy E2 has adopted. For Fuck, all of the following are valid:
but it gets worse: Everything2 lives on a series of different URLs. In fact, all of the following are perfectly valid:
which means that when all's said and done, there are 30 ways of getting to the same piece of text - a huge problem.
Some would argue that there are advantages to having this many different domain names (such as if you want to have multiple E2 accounts), but from a search engine's point of view, this gets confusing. A search for Site:everything2.* intitle:Butterfinger McFlurry@Everything2.* with omitted results included gives 17 results. A good number here would be 1.
There are a couple of options for fixing this. We could serve a different robots.txt depending on which domain the request came in on: if the domain is http://everything2.com, serve an allow-all; if it's anything else, serve a 'Disallow: /'. This won't make any difference to users of the site, but it will keep bots which comply with robots.txt directives out of every domain except the www-less dot-com.
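The per-domain robots.txt decision is a few lines of code. Here's a minimal sketch (Python for illustration, not E2's actual codebase; the canonical host is my assumption about which domain we'd keep):

```python
CANONICAL_HOST = "everything2.com"  # assumption: the one domain we want indexed

def robots_txt(host):
    """Serve a permissive robots.txt on the canonical domain, and a
    blanket Disallow everywhere else, so compliant bots index one host only."""
    if host.lower() == CANONICAL_HOST:
        # An empty Disallow line means "allow everything" in the robots.txt protocol.
        return "User-agent: *\nDisallow:\n"
    return "User-agent: *\nDisallow: /\n"

print(robots_txt("everything2.net"))
```

Note the classic robots.txt protocol expresses "allow all" as an empty `Disallow:` line rather than an `Allow` directive, which keeps the file understood by even the oldest bots.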
The second option is to create a 100% unique permalink for each write-up, and redirect anyone who tries to access it from any other URL to the canonical permalink. This means that if you go to any of the combinations above, you'll be 301 redirected to, say, http://everything2.com/e2node/fuck.
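The redirect logic itself is simple. A sketch of the idea (Python, function name and URL scheme assumed per the example above):

```python
from urllib.parse import quote

CANONICAL = "http://everything2.com"  # assumed preferred domain

def canonical_redirect(host, path, node_slug):
    """Return a (status, Location) pair: 301 to the one true permalink
    whenever the request came in on any other host or URL form."""
    target = "%s/e2node/%s" % (CANONICAL, quote(node_slug))
    current = "http://%s%s" % (host, path)
    if current == target:
        return (200, None)  # already canonical, serve the page normally
    return (301, target)

print(canonical_redirect("www.everything2.net", "/?node=fuck", "fuck"))
```

Using a permanent (301) rather than temporary (302) redirect matters: it's the signal that tells search engines to transfer the old URL's weight to the canonical one and drop the duplicates from their index.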
The secondary problem here is what to do about individual write-ups versus nodes. Blogs often solve the problem by having a cut-off point: If you're reading a front page, you get the first hundred or so words of a blog entry, and then have to click through to the individual blog post to see the whole thing. Search engines understand that one is an index, but the other is the finished product.
On E2, such an approach would be unpopular (I'd hate it), and not really workable either. The other option is to let spiders index the parent node (say, fuck), but put a noindex META tag in the header of all individual write-ups: <meta name="robots" content="noindex" />. This is transparent to site users, so we could still browse the site as usual, but robots would turn back from the individual write-ups.
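The per-page decision boils down to one branch at template-render time. A sketch (the `page_type` values are my assumption about how E2 distinguishes nodes from write-ups internally):

```python
def robots_meta(page_type):
    """Emit the robots META tag for a page: parent nodes stay indexable,
    individual write-up pages get noindex."""
    if page_type == "writeup":
        return '<meta name="robots" content="noindex">'
    return '<meta name="robots" content="index,follow">'

print(robots_meta("writeup"))
```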
Seeing as the softlinks and hardlinks of an individual write-up are already present in the parent node, I'd argue that the individual write-up pages are redundant from a search engine's point of view. This would probably reduce the number of indexed pages on Everything2.com significantly (and we'd lose the long tail of searches which might apply marginally better to an individual write-up than to the parent node), but it would significantly improve the way a search engine perceives the nodes - which would then be seen as comprehensive collections of unique information.
Impact: 9/10 / Rating: 2/10
URL structure and permalinks
The URL structure of E2 was changed quite significantly not that long ago, but it still doesn't quite cut it, and I think it wouldn't be too much work to settle on a proper RESTful architecture for the URLs.
Currently, we have a few different systems in use.
which is much better than http://everything2.com/index.pl?node_id=25927 , but the latter is still a valid URL, synonymous with the one above, and should therefore be 301 redirected to the permalink.
Having said that, http://everything2.com/e2node/Seven-words-you-can-never-say-on-television would be a much better URL still, as it's a lot more readable than the URL-encoded spaces, even though it might present some issues with keeping some nodes apart (know it all is an encouragement on a commercial banner, while know-it-all is nearly exactly the opposite).
The URL for tea would be:
Oolong's write-up in Tea would be:
http://everything2.com/e2node/tea/by-oolong (but this would be hidden from search engines)
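The hyphenated-slug idea above can be sketched in a few lines, and the sketch also demonstrates the collision risk mentioned earlier (Python for illustration; the function name is mine):

```python
import re

def slugify(title):
    """Turn a node title into a readable URL path segment:
    lowercase, hyphen-separated, stripped of punctuation."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

print(slugify("Seven words you can never say on television"))
# -> seven-words-you-can-never-say-on-television
```

Note that `slugify("know it all")` and `slugify("know-it-all")` produce the same slug, which is exactly the ambiguity flagged above - any real implementation would need a disambiguation scheme (a numeric suffix, say) for the handful of titles that collide.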
Relevance and copy content
Since almost all of E2 is user-contributed, and since style guides for E2 have been well and truly rebuffed (trust me, I was there for the last attempt, and it wasn't pretty... Wrinkly, if he was still around, would back me up on this one), it's just something we have to live with. The biggest problem here is that E2 has too many semantic mark-up styles.
Ideally, one page should contain exactly one H1 header tag, fewer than three H2 header tags, and a limited number of H3...H6 tags. Sadly, there's no way of policing this; the home page currently contains three H1s, four H2s, and seven H3s - not too bad, but it could be better.
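While we can't police users' markup, the heading tally for a rendered page is trivial to measure, which at least lets us keep an eye on the templates we do control. A sketch using the standard library parser (class name is my own):

```python
from html.parser import HTMLParser
from collections import Counter

class HeadingCounter(HTMLParser):
    """Tally H1..H6 tags in a rendered page."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.counts[tag] += 1

hc = HeadingCounter()
hc.feed("<h1>a</h1><h1>b</h1><h2>c</h2>")
print(dict(hc.counts))  # {'h1': 2, 'h2': 1}
```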
Having said that, E2 is a very yummy site to search engines because of its extremely high signal-to-noise ratio - I'm not talking about IWRTFTSIYDSU antics, but about the fact that there's very little CSS and imagery, and a vast amount of text on this place. For that reason alone, it should be the absolute king of content on the internet.
Impact: 8/10 / Rating: 9/10
E2 doesn't really do news, but it might be worth starting: create a blog-style paginated page with the first 150 characters of all write-ups. Treat logs (dream-, editor-, root-, and day-) as separate news streams, and write-ups as a final news stream - kind of like 'new write-ups on steroids'. Promote these streams on the home page (in a tabbed box, perhaps?), and the resulting pages would be a fantastic way for search engines to keep track of what's new - and hell, I might be interested myself.
It would be relatively trivial to hack up a WordPress-like front-end that uses the E2 database as a back-end. Fragment-cache and memcache the hell out of it, then let it loose - you'd be amazed how many people prefer to interact with a site this way.
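The 150-character excerpt for such a stream needs one small nicety - cutting at a word boundary rather than mid-word. A sketch (function name assumed; input is plain text with HTML already stripped):

```python
def excerpt(text, limit=150):
    """First `limit` characters of a write-up, cut at a word boundary,
    for a blog-style 'what's new' stream."""
    if len(text) <= limit:
        return text
    # Drop the (possibly truncated) final word, then mark the cut.
    cut = text[:limit].rsplit(" ", 1)[0]
    return cut + "..."

print(excerpt("A short write-up."))
```

Each stream page is then just a date-ordered query over the relevant write-up types, passed through this function.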
Impact: 6/10 / Rating: n/a
Everything2 has been around for so long, and has such a good place in search engines, that a separate linking strategy would almost be redundant. Having said that, it would be quite interesting to see the impact of the new social bookmarking functionality across E2.
There's a lot of great content on E2 which can (and does) make a big splash in the outside world. Standing on a mountain top (...) with a baseball bat by sam512, for example, would be a huge hit among some social networks. Personally, I would have suggested using ShareThis instead of re-inventing the wheel - and encouraging people to link to the parent node rather than to individual write-ups - but what we've got now is a great start.
The lack of outbound links from E2 could become an issue over time, but introducing external links wouldn't be in the spirit of E2, so I'd advise against it.
Impact: 7/10 / Rating: 8/10
Going to http://everything2.com/asdasd shows there's room for a better 404, although E2 makes a very good effort to help users elsewhere on the site, using the Findings: page.
Impact: 3/10 / Rating: 9/10
So, what are we going to do about it...?
I realise I've thrown down a rather big gauntlet to a team of volunteer coders, but then, perhaps an increased number of visitors to E2 would increase the Google AdSense revenue (which means we could spend more money on whatever it is we're spending money on), and the influx of baby noders would make it all worth it.
Also, as with everything else, we don't have to do everything. My list of priorities would be (written up as user stories):
E2 SEO product backlog user stories
As a user, I would like all URLs to be canonical, easy to read, and unique, so I know what the 'correct' URL is at all times.
As a search engine, I would like to be told which domain I should be spidering, and which ones I should ignore, so I can tidy up my search index and send more traffic your way.
As a search engine, I would like all the content you tell me to spider to be unique, so please tell me which pages not to index.
As a user, I would like a 'what's new' section, where I can browse all write-ups by date and type, so I don't risk missing a single editor log or write-up.
As a search engine (and as users on a low-bandwidth connection), I would like all unused HTML to be removed from the source code.
As a search engine, I would like E2 to use META description tags, so I can present searchers with an accurate summary of what each page is all about. (Corollary: encourage users to continue adding a short summary at the top of their write-ups.)