PageRank is the ranking system utilized by the Google search engine. The algorithm was developed by Sergei Brin and Larry Page at Stanford University between 1995 and 1998 and described in their paper The Anatomy of a Large-Scale Hypertextual Web Search Engine.
The Logic Behind The Algorithm
The idea for PageRank comes from the standard practice in academia that the best way to judge the relative merit of a particular paper is in the number of citations that paper has from other research papers. Furthermore, the citations have more value if they come from papers that themselves are frequently cited.
Brin and Page realized that in many ways, hyperlinks embedded in web pages are much like citations; they often link to more relevant information on a particular subtopic, much as citations in academic papers do. Thus, Brin and Page sought to model this phenomenon in web pages. This required the development of an algorithm that would come to be known as PageRank.
The PageRank Algorithm In Detail
The PageRank algorithm consists of two distinct functions that are used to calculate the numerical level of quality and quantity of links to a particular page.
The first function, which we will call C(), is the number of outgoing hyperlinks from a given page. So, for example, if page A has six links, then C(A) = 6. This factor helps reduce the effect of pages with huge numbers of links.
The second function is the actual PageRank function, which we will call PR(). The calculation of PR() for a particular page returns that page's PageRank value.
This function also includes a dampening factor, designated d, that has a value somewhere between 0 and 1. Google uses a value of d of approximately 0.85.
So, the PR() for a given page A looks like this:
PR(B1) PR(B2) PR(Bn)
PR(A) = (1 - d) + d * ( ------- + ------- + ... + ------- )
C(B1) C(B2) C(Bn)
where B1, B2, and so on to Bn are all of the pages with a link to page A.
Whenever a page (or a series of pages) are added to a collection of pages, the PageRank of all of the pages must be re-calculated. Usually, the PageRank of all of the new pages are calculated using the original values of the old pages, then the old pages are recalculated in reverse chronological order, so that newer pages have their PageRank recalculated first.
Those with mathematical backgrounds will note that PageRank is basically just the eigenvalue of a particular page.
Benefits of PageRank
The primary benefit of PageRank is that it provides a quantitative method for comparing the relative usefulness of web pages. By describing the web in a numerical fashion, one can use strict numerical comparison to compare pages, which makes finding useful resources very fast.
Another major benefit of PageRank is the simplicity of the calculation. It only requires a small amount of arithmetic to calculate the ranking of a particular page which then gives an easily usable numeric value by which one can compare a page to others.
Drawbacks of PageRank
One drawback of PageRank is that it is, by its nature, partial to older web pages that have been around for a while and are well-linked. Newer pages that surpass the older page often take a long time to exceed the original because the older page has a large number of established links. Thus, in practice, Google utilizes other algorithms such as text analysis to provide a bit of a "boost" to newer pages.
Another drawback is that useful long resources often have little impact on another page's PageRank. If someone builds a long, well-written resource on a particular topic, the linked pages will often only get a tiny boost because of the large number of links from the new resource. One potential method of getting around this is capping the value of C() at a certain maximum; whether Google is actually doing this is unclear.
An Example of PageRank
Let's say we're building a fresh version of Google, and we're going to start with just six pages. Let's call them A through F. These pages contain the following links:
A B C D E F
- - - - - -
B C B A C E
C D D B
D F C
E E
F
We are going to add the pages to our index in alphabetical order, and we decide that d has a value of 0.85. So, we start with A:
PR(A) = (1 - 0.85) + 0.85 * ()
= 0.15
On our first calculation, we don't know of any pages that link to A, so the value is set to 0.15, which is the minimum PageRank a page can have. Now we calculate B, and then using that result, we go backwards and recalculate A:
B is linked to from A, so
PR(B) = (1 - 0.85) + 0.85 * (PR(A) / C(A))
= 0.15 + 0.85 * (0.15 / 4)
= 0.15 + 0.85 * 0.0375
= 0.15 + 0.031875
= 0.181875
A isn't linked to from A or B, so
PR(A) = (1 - 0.85) + 0.85 * ()
= 0.15
At this stage, B has a higher PageRank than A because A links to B, but no known pages link to A yet. Let's repeat this with C, then re-calculate B, then A.
C is linked to from A and B, so
PR(C) = (1 - 0.85) + 0.85 * ((PR(A) / C(A)) + (PR(B) / C(B)))
= 0.15 + 0.85 * ((0.15 / 4) + (0.181875 / 3))
= 0.15 + 0.85 * (0.0375 + 0.04546875)
= 0.15 + 0.85 * 0.08296875
= 0.15 + 0.0705234375
= 0.2205234375
B is linked to from A and C, so
PR(B) = (1 - 0.85) + 0.85 * ((PR(A) / C(A)) + (PR(C) / C(C)))
= 0.15 + 0.85 * ((0.15 / 4) + (0.2205234375 / 2))
= 0.15 + 0.85 * (0.0375 + 0.11026171875)
= 0.15 + 0.85 * 0.14776171875
= 0.15 + 0.1255974609375
= 0.2755974609375
A is still not linked to from any known page, so
PR(A) = (1 - 0.85) + 0.85 * ()
= 0.15
B is the most popular page, due to the fact that both A and C are both linked to it, and both have relatively few links.
At this point, the cycle would continue calculating D's PageRank, then E's, then F's. Once the PageRank values are completed, these are used to actually rank the search results, so that all pages that match a given term are then returned based on their PageRank.
PageRank At Work In Google
Google takes the above algorithm and applies it to more than three billion web pages. Their exact methodology of retrieving pages based on their term and ordering them based on PageRank is a trade secret, but it is known that they utilize a farm of several thousand Linux machines in order to execute the massive amount of arithmetic involved in the calculations, and then later operations needed to assemble the data structures so that all of the PageRank data is actually usable.
In other words, Google is an example of an extremely wide-scale application of the PageRank algorithm.