Inspired by the recent googlewhacking fad, I thought that by using the overlap of two searches, you should be able to estimate the size of the thing you are searching in. This assumes lots of unreasonable things about distribution, both in what you search for and in what exists on the web.

Here is the technical part. The results are at the end.

let T = the total number of pages indexed by Google.
define g(x) = number of hits google finds for x.
define g(x,y) = number of hits google finds for x y.
define g(x,y,z) = number of hits google finds for x y z.
etc.
let xf be the fraction of pages indexed by Google that contain the term x (a value from 0 to 1). e.g. if the term "x" appears on 50% of all pages, then xf = 0.5


So, assuming the terms show up on pages independently of each other,
g(x) = xf * T
g(x,y) = xf * yf * T
g(x,y,z) = xf * yf * zf * T
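(For instance, with completely made-up numbers: if xf = 0.5, yf = 0.01 and T = 1e9, the assumption says g(x,y) = 0.5 * 0.01 * 1e9 = 5e6.)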

Solve for T in the second equation:
T = g(x,y) / (xf * yf)
T = g(x,y) / ((g(x)/T) * (g(y)/T)) (since xf = g(x)/T)
T = g(x,y) / ((g(x) * g(y)) / T^2) (simplifying)
T = g(x,y) * T^2 / (g(x) * g(y)) (simplifying more)

Divide both sides by T, rearrange, and you get
Result 1
T = g(x) * g(y) / g(x,y).

Since you get a value of g by using google, you can get trial values for T whenever you want!
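If you want to play with this, here is what Result 1 looks like as a few lines of Python (just the arithmetic; you still have to type the searches into Google yourself and copy down the hit counts):

    # Result 1: estimate the index size from two hit counts and their overlap.
    def estimate_pages(g_x, g_y, g_xy):
        return g_x * g_y / g_xy

    # e.g. Test 1 below: estimate_pages(46_100_000, 94, 24) gives about 1.8e8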

You can do a nastier version of the above to find that with n terms x1, x2, ... , xn

Result 2
in that case,
T = ((g(x1) * g(x2) * ... * g(xn)) / g(x1,x2, ... ,xn))^(1/(n-1))
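Result 2 is just as easy to turn into code. Here is a sketch (same caveats as above; it collapses back to Result 1 when there are only two terms):

    # Result 2: estimate the index size from n single-term hit counts
    # and the hit count for all n terms together.
    def estimate_pages_n(single_counts, joint_count):
        product = 1
        for g in single_counts:
            product *= g
        n = len(single_counts)
        return (product / joint_count) ** (1 / (n - 1))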

Test Results!

Ok, now the tests! I will do 3 tests with pairs, and 3 tests with triplets. I will get the test words by doing "random node" and taking the first word besides "the" in the node title.

T is the number of pages indexed by Google.
g(x) is the number of hits for "x" on google.

Test 1
g(house) = 46,100,000
g(Estrangle) = 94
g(house,Estrangle) = 24
T = 1.8e8
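(To spell the arithmetic out: T = 46,100,000 * 94 / 24, which is about 1.8e8.)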

Test 2
g(Hostess)= 450,000
g(gene)= 6,240,000
g(Hostess, gene) = 7,310
T = 3.8e8

Test 3
g(Toronto) = 7,050,000
g(Brassclaw) = 836
g(Toronto,Brassclaw) = 2
T = 2.9e9

Test 4
g(Unlikeness)= 3890
g(Ophidiophobia) = 956
g(Adobe) = 7,160,000
g(Unlikeness,Ophidiophobia,Adobe) = 0
T = ? (with zero hits for the triple, the formula divides by zero, so this test gives no estimate)

Test 5
g(condom) = 689,000
g(open) = 60,200,000
g(Failing) = 2,200,000
g(condom,open,Failing) = 5,410
T = 1.3e8
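(Spelled out with Result 2, n = 3: T = ((689,000 * 60,200,000 * 2,200,000) / 5,410)^(1/2), which is about 1.3e8.)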

Test 6
g(Two) = 112,000,000
g(Genesis) = 2,880,000
g(brilliant) = 2,560,000
g(Two,Genesis,brilliant) = 43,000
T = 1.4e8

Results

I got 5 results for T: 1.8e8, 3.8e8, 2.9e9, 1.3e8 and 1.4e8. The average is 7.6e8 = 760,000,000.
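If you want to check my arithmetic, here is a throwaway Python sketch that plugs the hit counts above into Results 1 and 2 (Test 4 is skipped, since its joint count is zero):

    # Re-run the five usable tests and average the estimates.
    tests = [
        ([46_100_000, 94], 24),                        # Test 1
        ([450_000, 6_240_000], 7_310),                 # Test 2
        ([7_050_000, 836], 2),                         # Test 3
        ([689_000, 60_200_000, 2_200_000], 5_410),     # Test 5
        ([112_000_000, 2_880_000, 2_560_000], 43_000), # Test 6
    ]

    estimates = []
    for singles, joint in tests:
        product = 1
        for g in singles:
            product *= g
        estimates.append((product / joint) ** (1 / (len(singles) - 1)))

    print([f"{t:.1e}" for t in estimates])           # ['1.8e+08', '3.8e+08', '2.9e+09', '1.3e+08', '1.4e+08']
    print(f"{sum(estimates) / len(estimates):.1e}")  # 7.6e+08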

After doing all this, someone told me that Google actually claims to index 2e9 = 2,000,000,000 pages. So, the accuracy is not too horrible.

I am sure this has already been done in real math, and has a real name. Does anyone know it?

I know that Google publicizes the number of pages it claims to index. However, this technique can also be used on other search engines whose credibility is in doubt.

Cute, but mathematically and statistically unsound. When you try to say

g(x,y) = xf*yf*T

you assume that g(x,y) is independent of both g(x) and g(y), that is, it assumes a web page author puts Y in a page regardless of whether he or she has put X in it or not. This simply isn't true! Most web pages are built to convey ideas, and words appear together in pages if there is a memetic connection between them. The stronger the connection, the more likely they are to appear together. A page which contains tectonic is much more likely to contain subduction than meringue.
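To see how much that matters, here's a toy simulation (completely made up, nothing to do with Google's real counts): build a fake web of 100,000 pages and apply Result 1 once with independent words and once with correlated ones.

    import random

    random.seed(0)

    T_TRUE = 100_000      # size of the pretend web
    PX, PY = 0.05, 0.02   # how often words x and y appear on a page

    def estimate(p_y_given_x):
        # p_y_given_x is the chance that a page containing x also contains y;
        # independence would make it equal to PY.
        g_x = g_y = g_xy = 0
        for _ in range(T_TRUE):
            has_x = random.random() < PX
            has_y = random.random() < (p_y_given_x if has_x else PY)
            g_x += has_x
            g_y += has_y
            g_xy += has_x and has_y
        return g_x * g_y / g_xy   # Result 1

    print(estimate(PY))    # independent words: comes out near the true 100,000
    print(estimate(0.5))   # strongly correlated words: comes out far too small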

So, let's try the technique on some unrelated words:

g(aardvark) =    238,000
g(blunderbuss) =  16,800
g(carioca) =     323,000
g(aardvark,blunderbuss,carioca) = 29

T = 2.39e14

That's a factor of 100,000 over Google's claims. It might be interesting to notice that all 29 pages with all three words are word lists, and so the words are as "unrelated" as they can get. Of course, I'd have been delighted to find "A carioca paused to discharge a blunderbuss at a passing aardvark, but missed and resumed samba-ing down the street." in Rio Expresso!

So, you might think the results would be more accurate when the words are more or less randomly distributed through web pages, or at least carry little semantic content of their own. We'll try the technique with the most common words in the English language:

g(the) = 2.89e9
g(a) =   1.77e9
g(an) =  3.84e8
g(of) =  1.92e9

g(the, a, an, of) = 10,900,000

T = 1.7e34

Not even the Defense Department has servers with that sort of capacity. So, something's really wrong here.

Now, to our original purpose: estimating the number of pages indexed by Google. Notice the result for "the": 2.89e9. It's reasonable to expect "the" to appear in every Web page written in English, and in quite a few not written in English. Assuming that the number of pages not written in English is less than the number of pages written in English, Google probably caches somewhere between 3 and 5 billion pages.
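(For scale: that's at least the 2.89e9 pages containing "the", and, if the non-English web really is smaller than the English one, less than about twice that.)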
