
PageRank is the ranking system used by the Google search engine. It was invented by Sergey Brin and Larry Page of Stanford University in 1998. The system models a random surfer on the Web: the surfer starts on a random page, picks one of its links at random, and follows it. That is the basic idea.

If you have some background in probability, you can think of this as a Markov process in which the states are pages and the transitions, all equally probable, are the links between pages. As you can also see if "Markov process" tells you something, a page with no links to other pages becomes a sink, which makes the whole thing unusable, because sink pages would trap the random visitors forever.

However, the solution is quite simple: if the random surfer arrives at a sink page, it picks another URL at random and continues surfing. To be fair to pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability of usually q = 0.15.
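This random-surfer process is easy to simulate directly. Below is a minimal sketch; the four-page web, the function name, and the parameter choices are all made up for illustration. With probability q the surfer teleports to a uniformly random page, and a sink page always teleports:

```python
import random

# Hypothetical four-page web for illustration: page -> pages it links to.
web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": [],          # a sink: no outgoing links
}

def surf(web, steps=100_000, q=0.15, seed=1):
    """Simulate the random surfer; return how often each page is visited."""
    rng = random.Random(seed)
    pages = list(web)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        # With probability q, or whenever the page is a sink,
        # jump to a page chosen uniformly at random.
        if not web[page] or rng.random() < q:
            page = rng.choice(pages)
        else:
            page = rng.choice(web[page])
    return {p: n / steps for p, n in visits.items()}

freqs = surf(web)
```

The visit frequencies approximate each page's PageRank: pages with more incoming links (like "c" here) get visited more often than unlinked ones (like "d").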

So, the equation is as follows:

Pagerank(i) = (q/N) + (1-q) Sum(j={pages that point to i}; Pagerank(j)/OutLinks(j))

where N is the total number of pages and OutLinks(j) is the number of links going out of page j.
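The equation can be solved by fixed-point iteration: start with uniform values and repeatedly apply it, dividing each page's rank among its outgoing links. A rough sketch (the three-page graph and the iteration count are illustrative, and every page is assumed to have at least one outgoing link):

```python
def pagerank(links, q=0.15, iterations=50):
    """Fixed-point iteration of
    Pagerank(i) = q/N + (1-q) * sum over pages j linking to i of
                  Pagerank(j) / (number of links out of j).
    `links` maps each page to its list of outgoing links;
    sinks are assumed to have been patched out beforehand."""
    n = len(links)
    pr = dict.fromkeys(links, 1.0 / n)
    for _ in range(iterations):
        new = dict.fromkeys(links, q / n)
        for page, outgoing in links.items():
            share = (1 - q) * pr[page] / len(outgoing)
            for target in outgoing:
                new[target] += share
        pr = new
    return pr

# Three pages in a cycle: by symmetry each ends up with rank 1/3.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

Because the values sum to 1, they can be read as the random surfer's long-run probability of being on each page.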

It's worth noticing, and that's why PageRank is so appealing in terms of elegance (which is, of course, the whole point of doing mathematics), that the Pagerank values are the entries of the principal eigenvector of the modified adjacency matrix.

This algorithm really rocks, because it's fast to calculate (only a few iterations are needed) and in practice it gives good results. The main disadvantage is that it favors older pages, because a new page, even a very good one, will not have many links yet. That's why it has to be combined with textual analysis or other ranking methods.

PageRank is the ranking system utilized by the Google search engine. The algorithm was developed by Sergey Brin and Larry Page at Stanford University between 1995 and 1998 and described in their paper The Anatomy of a Large-Scale Hypertextual Web Search Engine.

The Logic Behind The Algorithm

The idea for PageRank comes from the standard practice in academia that the best way to judge the relative merit of a particular paper is by the number of citations that paper has from other research papers. Furthermore, the citations have more value if they come from papers that are themselves frequently cited.

Brin and Page realized that in many ways, hyperlinks embedded in web pages are much like citations; they often link to more relevant information on a particular subtopic, much as citations in academic papers do. Thus, Brin and Page sought to model this phenomenon in web pages. This required the development of an algorithm that would come to be known as PageRank.

The PageRank Algorithm In Detail

The PageRank algorithm consists of two distinct functions that together turn the quantity and quality of the links to a particular page into a numerical score.

The first function, which we will call C(), is the number of outgoing hyperlinks from a given page. So, for example, if page A has six links, then C(A) = 6. This factor helps reduce the effect of pages with huge numbers of links.

The second function is the actual PageRank function, which we will call PR(). The calculation of PR() for a particular page returns that page's PageRank value.

This function also includes a damping factor, designated d, that has a value somewhere between 0 and 1. Google uses a value of d of approximately 0.85.

So, the PR() for a given page A looks like this:

```
                        PR(B1)    PR(B2)          PR(Bn)
PR(A) = (1 - d) + d * ( ------- + ------- + ... + ------- )
                         C(B1)     C(B2)            C(Bn)
```

where B1, B2, and so on to Bn are all of the pages with a link to page A.
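A single evaluation of this formula can be sketched as follows; the helper name and its input format (a pair per backlink) are made up for illustration:

```python
def pr(backlinks, d=0.85):
    """One evaluation of PR(A) = (1 - d) + d * (PR(B1)/C(B1) + ... + PR(Bn)/C(Bn)).
    `backlinks` holds a (PR(Bi), C(Bi)) pair for each page Bi linking to A."""
    return (1 - d) + d * sum(rank / count for rank, count in backlinks)

# A page with no known backlinks gets the minimum value, 1 - d = 0.15.
minimum = pr([])
# A single backlink from a page with PR 0.15 and four outgoing links:
boosted = pr([(0.15, 4)])   # 0.15 + 0.85 * (0.15 / 4) = 0.181875
```

Note how each backlink's contribution is its own PageRank divided by its own link count C(Bi), so a link from a page that links everywhere counts for little.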

Whenever a page (or a series of pages) is added to a collection of pages, the PageRank of all of the pages must be re-calculated. Usually, the PageRank of each new page is calculated using the existing values of the old pages; then the old pages are recalculated in reverse chronological order, so that newer pages have their PageRank recalculated first.

Those with mathematical backgrounds will note that the vector of PageRank values is basically just the principal eigenvector of the link matrix of the web.

Benefits of PageRank

The primary benefit of PageRank is that it provides a quantitative method for comparing the relative usefulness of web pages. By describing the web in a numerical fashion, one can use strict numerical comparison to compare pages, which makes finding useful resources very fast.

Another major benefit of PageRank is the simplicity of the calculation. It only requires a small amount of arithmetic to calculate the ranking of a particular page which then gives an easily usable numeric value by which one can compare a page to others.

Drawbacks of PageRank

One drawback of PageRank is that it is, by its nature, partial to older web pages that have been around for a while and are well-linked. A newer page that surpasses an older one in quality often takes a long time to exceed it in PageRank, because the older page has a large number of established links. Thus, in practice, Google utilizes other algorithms, such as text analysis, to provide a bit of a "boost" to newer pages.

Another drawback is that large, useful resources often have little impact on another page's PageRank. If someone builds a long, well-written resource on a particular topic with many outgoing links, each linked page will often get only a tiny boost, because the value passed along is divided by the large number of links. One potential method of getting around this is capping the value of C() at a certain maximum; whether Google actually does this is unclear.
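Such a cap is trivial to sketch. The threshold below is a purely made-up illustration (as the text above notes, it is unclear whether Google does anything like this):

```python
MAX_COUNTED_LINKS = 50  # hypothetical cap; not a known Google value

def effective_c(outgoing_link_count):
    """C() with a ceiling: a page with hundreds of outgoing links still
    passes at least PR(page) / MAX_COUNTED_LINKS along each link."""
    return min(outgoing_link_count, MAX_COUNTED_LINKS)

small = effective_c(10)    # small pages are unaffected
large = effective_c(400)   # heavily-linked pages are treated as having 50
```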

An Example of PageRank

Let's say we're building a fresh version of Google, and we're going to start with just six pages. Let's call them A through F. These pages contain the following links:

```
A  B  C  D  E  F
-  -  -  -  -  -
B  C  B  A  C  E
C  D  D  B
D  F     C
F        E
```

We are going to add the pages to our index in alphabetical order, and we decide that d has a value of 0.85. So, we start with A:

```
PR(A) = (1 - 0.85) + 0.85 * ()
      = 0.15
```

On our first calculation, we don't know of any pages that link to A, so the value is set to 0.15, which is the minimum PageRank a page can have. Now we calculate B, and then using that result, we go backwards and recalculate A:

```
B is linked to from A, so

PR(B) = (1 - 0.85) + 0.85 * (PR(A) / C(A))
      = 0.15 + 0.85 * (0.15 / 4)
      = 0.15 + 0.85 * 0.0375
      = 0.15 + 0.031875
      = 0.181875

A isn't linked to from A or B, so

PR(A) = (1 - 0.85) + 0.85 * ()
      = 0.15
```

At this stage, B has a higher PageRank than A because A links to B, but no known pages link to A yet. Let's repeat this with C, then re-calculate B, then A.

```
C is linked to from A and B, so

PR(C) = (1 - 0.85) + 0.85 * ((PR(A) / C(A)) + (PR(B) / C(B)))
      = 0.15 + 0.85 * ((0.15 / 4) + (0.181875 / 3))
      = 0.15 + 0.85 * (0.0375 + 0.060625)
      = 0.15 + 0.85 * 0.098125
      = 0.15 + 0.08340625
      = 0.23340625

B is linked to from A and C, so

PR(B) = (1 - 0.85) + 0.85 * ((PR(A) / C(A)) + (PR(C) / C(C)))
      = 0.15 + 0.85 * ((0.15 / 4) + (0.23340625 / 2))
      = 0.15 + 0.85 * (0.0375 + 0.116703125)
      = 0.15 + 0.85 * 0.154203125
      = 0.15 + 0.13107265625
      = 0.28107265625

A is still not linked to from any known page, so

PR(A) = (1 - 0.85) + 0.85 * ()
      = 0.15
```

B is the most popular page so far, because both A and C link to it, and both have relatively few links.

At this point, the cycle would continue: calculate D's PageRank, then E's, then F's. Once the PageRank values are computed, they are used to rank the search results, so that all pages matching a given term are returned in order of their PageRank.
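Rather than interleaving recalculations page by page, one can also simply iterate the PR() formula over the whole six-page graph until the values settle. A sketch under the example's assumptions (d = 0.85, and a link table in which C(A) = 4, matching the worked arithmetic):

```python
def pagerank(links, d=0.85, iterations=100):
    """Repeatedly apply PR(A) = (1 - d) + d * sum(PR(B)/C(B)) over all
    pages B linking to A, until the values converge."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new = {page: 1 - d for page in links}
        for page, outgoing in links.items():
            for target in outgoing:
                new[target] += d * pr[page] / len(outgoing)
        pr = new
    return pr

# The example's link table: each page maps to the pages it links to.
links = {
    "A": ["B", "C", "D", "F"],
    "B": ["C", "D", "F"],
    "C": ["B", "D"],
    "D": ["A", "B", "C", "E"],
    "E": ["C"],
    "F": ["E"],
}
ranks = pagerank(links)
```

Run to convergence, C ends up with the highest rank, which fits intuition: it has the most incoming links (from A, B, D, and E).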