Cute, but mathematically and statistically unsound. When you try to say

g(x,y) = xf*yf*T

you assume that the presence of X on a page and the presence of Y on a page are statistically independent; that is, that a web page author puts Y in a page regardless of whether he or she has put X in it. This simply isn't true! Most web pages are built to convey ideas, and words appear together in pages when there is a memetic connection between them. The stronger the connection, the more likely they are to appear together. A page which contains "tectonic" is much more likely to contain "subduction" than "meringue".
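To see what the formula commits you to: taking xf = g(x)/T and yf = g(y)/T to be the fractions of pages containing each word, it rearranges to T = g(x)*g(y)/g(x,y). A minimal Python sketch of that estimator (the function name is mine, and the n-word generalization is just the natural extension of the same independence assumption, not something the formula above spells out):

from math import prod

def estimate_index_size(counts, joint_count):
    # Under independence, g(w1,...,wn) = T * prod(g(wi)/T),
    # so T^(n-1) = prod(g(wi)) / g(w1,...,wn).
    n = len(counts)
    return (prod(counts) / joint_count) ** (1.0 / (n - 1))

# Hypothetical counts, purely to show the mechanics:
print(estimate_index_size([200_000, 50_000], 1_000))   # -> 10000000.0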

So, let's try the technique on some unrelated words:

g(aardvark) =    238,000
g(blunderbuss) =  16,800
g(carioca) =     323,000
g(aardvark,blunderbuss,carioca) = 29

T = 2.39e14

That's a factor of 100,000 over Google's claims. It's interesting to note that all 29 pages containing all three words are word lists, so the words are as "unrelated" as they can get. Of course, I'd have been delighted to find "A carioca paused to discharge a blunderbuss at a passing aardvark, but missed and resumed samba-ing down the street" in Rio Expresso!

You might think the results would be more accurate when the words are more or less randomly distributed through web pages, or at least when the words carry little semantic content. So let's try the technique with the most common words in the English language:

g(the) = 2.89e9
g(a) =   1.77e9
g(an) =  3.84e8
g(of) =  1.92e9

g(the, a, an, of) = 10,900,000

T = 1.7e34

Not even the Defense Department has servers with that sort of capacity. So, something's really wrong here.

Now, to our original purpose: estimating the number of pages indexed by Google. Notice the result for "the": 2.89e9. It's reasonable to expect "the" to appear in virtually every Web page written in English, and in quite a few not written in English, so that count is roughly a floor on the size of the index. Assuming that the number of pages not written in English is less than the number of pages written in English, doubling it gives a rough ceiling, and Google probably caches somewhere between 3 and 5 billion pages.
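For concreteness, the bounding arithmetic as a quick Python sketch (assuming, as above, that essentially every English page contains "the" and that English pages make up at least half of the index; the bounds are rough, hence the 3-to-5-billion figure):

g_the = 2.89e9        # Google's reported count for "the"
lower = g_the         # floor: the English pages alone
upper = 2 * g_the     # ceiling: non-English pages no more numerous than English ones
print(f"between {lower:.2e} and {upper:.2e} pages")
# -> between 2.89e+09 and 5.78e+09 pages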