Inspired by the recent googlewhacking fad, I thought that by using the overlap of two searches, you should be able to estimate the size of the thing you are searching in. This assumes lots of unreasonable things about distribution, both in what you search for and in what exists on the web.

Here is the technical part. The results are at the end.

let T = the total number of pages indexed by Google.
define g(x) = number of hits google finds for x.
define g(x,y) = number of hits google finds for x y.
define g(x,y,z) = number of hits google finds for x y z.
etc.
let xf be the fraction of pages indexed by Google that contain the term x (a value from 0 to 1). e.g. if the term "x" appears on 50% of all pages, then xf = 0.5


So, assuming the terms show up on pages independently of each other,
g(x) = xf * T
g(x,y) = xf * yf * T
g(x,y,z) = xf * yf * zf * T
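(For instance, with completely made-up numbers: if xf = 0.5, yf = 0.01 and T = 1e9, the assumption says g(x,y) = 0.5 * 0.01 * 1e9 = 5e6.)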

Solve for T in the second equation:
T = g(x,y) / (xf * yf)
T = g(x,y) / ((g(x)/T) * (g(y)/T)) (since xf = g(x)/T)
T = g(x,y) / ((g(x) * g(y)) / T^2) (simplifying)
T = g(x,y) * T^2 / (g(x) * g(y)) (simplifying more)

Divide both sides by T, rearrange, and you get
Result 1
T = g(x) * g(y) / g(x,y).

Since you get a value of g by using google, you can get trial values for T whenever you want!
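If you want to play with this, here is what Result 1 looks like as a few lines of Python (just the arithmetic; you still have to type the searches into Google yourself and copy down the hit counts):

    # Result 1: estimate the index size from two hit counts and their overlap.
    def estimate_pages(g_x, g_y, g_xy):
        return g_x * g_y / g_xy

    # e.g. Test 1 below: estimate_pages(46_100_000, 94, 24) gives about 1.8e8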

You can do a nastier version of the above to find that with n terms x1, x2, ... , xn

Result 2
in that case,
T = ((g(x1) * g(x2) * ... * g(xn)) / g(x1,x2, ... ,xn))^(1/(n-1))
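Result 2 is just as easy to turn into code. Here is a sketch (same caveats as above; it collapses back to Result 1 when there are only two terms):

    # Result 2: estimate the index size from n single-term hit counts
    # and the hit count for all n terms together.
    def estimate_pages_n(single_counts, joint_count):
        product = 1
        for g in single_counts:
            product *= g
        n = len(single_counts)
        return (product / joint_count) ** (1 / (n - 1))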

Test Results!

Ok, now the tests! I will do 3 tests with pairs, and 3 tests with triplets. I will get the test words by doing "random node" and taking the first word besides "the" in the node title.

T is the number of pages indexed by Google.
g(x) is the number of hits for "x" on google.

Test 1
g(house) = 46,100,000
g(Estrangle) = 94
g(house,Estrangle) = 24
T = 1.8e8
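(To spell the arithmetic out: T = 46,100,000 * 94 / 24, which is about 1.8e8.)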

Test 2
g(Hostess)= 450,000
g(gene)= 6,240,000
g(Hostess, gene) = 7,310
T = 3.8e8

Test 3
g(Toronto) = 7,050,000
g(Brassclaw) = 836
g(Toronto,Brassclaw) = 2
T = 2.9e9

Test 4
g(Unlikeness)= 3890
g(Ophidiophobia) = 956
g(Adobe) = 7,160,000
g(Unlikeness,Ophidiophobia,Adobe) = 0
T = ? (with zero hits for the triple, the formula divides by zero, so this test gives no estimate)

Test 5
g(condom) = 689,000
g(open) = 60,200,000
g(Failing) = 2,200,000
g(condom,open,Failing) = 5,410
T = 1.3e8
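(Spelled out with Result 2, n = 3: T = ((689,000 * 60,200,000 * 2,200,000) / 5,410)^(1/2), which is about 1.3e8.)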

Test 6
g(Two) = 112,000,000
g(Genesis) = 2,880,000
g(brilliant) = 2,560,000
g(Two,Genesis,brilliant) = 43,000
T = 1.4e8

Results

I got 5 results for T: 1.8e8, 3.8e8, 2.9e9, 1.3e8 and 1.4e8. The average is 7.6e8 = 760,000,000.
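If you want to check my arithmetic, here is a throwaway Python sketch that plugs the hit counts above into Results 1 and 2 (Test 4 is skipped, since its joint count is zero):

    # Re-run the five usable tests and average the estimates.
    tests = [
        ([46_100_000, 94], 24),                        # Test 1
        ([450_000, 6_240_000], 7_310),                 # Test 2
        ([7_050_000, 836], 2),                         # Test 3
        ([689_000, 60_200_000, 2_200_000], 5_410),     # Test 5
        ([112_000_000, 2_880_000, 2_560_000], 43_000), # Test 6
    ]

    estimates = []
    for singles, joint in tests:
        product = 1
        for g in singles:
            product *= g
        estimates.append((product / joint) ** (1 / (len(singles) - 1)))

    print([f"{t:.1e}" for t in estimates])           # ['1.8e+08', '3.8e+08', '2.9e+09', '1.3e+08', '1.4e+08']
    print(f"{sum(estimates) / len(estimates):.1e}")  # 7.6e+08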

After doing all this, someone told me that Google actually claims to index 2e9 = 2,000,000,000 pages. So, the accuracy is not too horrible.

I am sure this has already been done in real math, and has a real name. Does anyone know it?

I know that Google publicizes the number of pages it claims to index. However, this technique can also be used on other search engines whose credibility is in doubt.

Cute, but mathematically and statistically unsound. When you try to say

g(x,y) = xf*yf*T

you assume that g(x,y) is independent of both g(x) and g(y), that is, it assumes a web page author puts Y in a page regardless of whether he or she has put X in it or not. This simply isn't true! Most web pages are built to convey ideas, and words appear together in pages if there is a memetic connection between them. The stronger the connection, the more likely they are to appear together. A page which contains tectonic is much more likely to contain subduction than meringue.
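To see how much that matters, here's a toy simulation (completely made up, nothing to do with Google's real counts): build a fake web of 100,000 pages and apply Result 1 once with independent words and once with correlated ones.

    import random

    random.seed(0)

    T_TRUE = 100_000      # size of the pretend web
    PX, PY = 0.05, 0.02   # how often words x and y appear on a page

    def estimate(p_y_given_x):
        # p_y_given_x is the chance that a page containing x also contains y;
        # independence would make it equal to PY.
        g_x = g_y = g_xy = 0
        for _ in range(T_TRUE):
            has_x = random.random() < PX
            has_y = random.random() < (p_y_given_x if has_x else PY)
            g_x += has_x
            g_y += has_y
            g_xy += has_x and has_y
        return g_x * g_y / g_xy   # Result 1

    print(estimate(PY))    # independent words: comes out near the true 100,000
    print(estimate(0.5))   # strongly correlated words: comes out far too small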

So, let's try the technique on some unrelated words:

g(aardvark) =    238,000
g(blunderbuss) =  16,800
g(carioca) =     323,000
g(aardvark,blunderbuss,carioca) = 29

T = 2.39e14

That's a factor of 100,000 over Google's claims. It might be interesting to notice that all 29 pages with all three words are word lists, and so the words are as "unrelated" as they can get. Of course, I'd have been delighted to find "A carioca paused to discharge a blunderbuss at a passing aardvark, but missed and resumed samba-ing down the street." in Rio Expresso!

So, you might think the results would be more accurate when the words are more or less randomly distributed through web pages, or at least carry little semantic content of their own. We'll try the technique with the most common words in the English language:

g(the) = 2.89e9
g(a) =   1.77e9
g(an) =  3.84e8
g(of) =  1.92e9

g(the, a, an, of) = 10,900,000

T = 1.7e34

Not even the Defense Department has servers with that sort of capacity. So, something's really wrong here.

Now, to our original purpose: estimating the number of pages indexed by Google. Notice the result for "the": 2.89e9. It's reasonable to expect "the" to appear in every Web page written in English, and in quite a few not written in English. Assuming that the number of pages not written in English is less than the number of pages written in English, Google probably caches somewhere between 3 and 5 billion pages.
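(For scale: that's at least the 2.89e9 pages containing "the", and, if the non-English web really is smaller than the English one, less than about twice that.)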
