Search Engine Mechanics - Everything2.com

How They Work & Which Is Best

Disclaimer: This was a uni assignment and is an original piece of work by me - it isn't cut & pasted :)

Ever since the Internet has existed, people have had a need for some facility to efficiently find the information they are after. Most people use the ubiquitous search engines such as Yahoo or Alta Vista and many articles have been written on how to effectively search the Web. However, very few articles are written on how search engines work. Why is it important that a person knows the mechanics of a search engine?

The method a search engine uses to catalogue the Internet determines how comprehensive and accurate the results of your searches will be. It is important to remember that not all search engines are equal in performance, accuracy, or comprehensiveness.

This node describes in simple terms the major techniques used by search engines to classify sites. Please note that this is not a highly technical discussion - it is enough to get a grasp on the major differences between search engines though.

Web Directories
Yahoo is one of the best-known search engines on the Internet, although it is more a subject guide or directory than a search engine. The difference is that Yahoo's reviewers visit each site in its directory, whereas other search engines create their site lists through statistical analysis of web pages.

Yahoo has a series of pre-defined categories under which a site is listed. For instance, a site that tracks weather in Australia is classified under REGIONAL: COUNTRIES: AUSTRALIA: NEWS AND MEDIA: WEATHER.

When Yahoo becomes aware of a web site, one of its reviewers visits the site and determines which category best describes that site and which categories that site should be cross-referenced with. So far, Yahoo has classified 200 000 sites under about 20 000 categories.

The use of these categories means that Yahoo’s results are divided into topics which makes it very easy to find what you are looking for.
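To make the idea of a hand-built directory concrete, here is a minimal sketch of how category paths and cross-references could be stored. It is not Yahoo's actual system: the function names, the example site name, and the SCIENCE: METEOROLOGY cross-reference are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical store: category path (a tuple of names) -> sites filed there.
directory = defaultdict(list)


def classify(site, primary, cross_refs=()):
    """File a site under its primary category and any cross-referenced ones."""
    directory[primary].append(site)
    for path in cross_refs:
        directory[path].append(site + " (see also)")


def browse(prefix):
    """Return every site filed at or below the given category path."""
    return [
        site
        for path, sites in directory.items()
        if path[: len(prefix)] == prefix
        for site in sites
    ]


# A reviewer picks the best-fit category plus one (invented) cross-reference.
weather = ("Regional", "Countries", "Australia", "News and Media", "Weather")
classify(
    "Example Australian weather site",
    weather,
    cross_refs=[("Science", "Meteorology")],
)

print(browse(("Regional", "Countries", "Australia")))
print(browse(("Science",)))
```

The point of the sketch is that every entry exists only because a person called classify on it, which is where both the quality and the bottleneck come from.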

However, Yahoo has several shortcomings. Because reviewers determine the ‘best-fit’ category, the classification of sites is a subjective process.

For example, when the Messianic Jewish Alliance of America submitted its URL to Yahoo, it was classified under SOCIETY AND CULTURE: RELIGION: JUDAISM. This was understandable given that the site mentioned the Star of David and had articles on Israel.

However, Messianic Jews, while born to Jewish mothers and hence Jews by definition, believe that Jesus Christ is the Messiah.

When ‘true’ Jews saw this, they bitterly complained to Yahoo, claiming that the MJAA were actually Christians. Yahoo then reclassified the MJAA under Christianity, which upset the MJAA. Finally, a compromise was reached: a new category was created and the MJAA was classified under SOCIETY AND CULTURE: RELIGION: CHRISTIANITY: MESSIANIC JUDAISM.

Another shortcoming of Yahoo is its inability to keep pace with the rate at which new Web sites are appearing. With more than a million sites estimated to be on the Web, Yahoo’s 300 000 catalogued sites fall well short of the total.

Hiring more reviewers to catalogue more sites introduces a new set of problems. More reviewers means that the classification process becomes even more subjective and therefore Yahoo would become less consistent.

If Yahoo maintains the same number of reviewers, it will be unable to keep up with the rapid growth of the Web, and hence will no longer be able to cover the breadth of the Internet.

Inverted Indexes
Besides Yahoo, some of the other major search engines are Lycos, Alta Vista, and Hotbot. These search engines use an inverted index to classify each site on the Web.

An inverted index is simply a large table where the rows represent documents and the columns represent words. For example, consider a document called cat.txt which contains the phrase ‘The cat sat on the mat’, and another document called dog.txt which contains the phrase ‘The dog ate the cat’. An inverted index of these documents would be:

	cat	dog	sat	mat	ate	the	on
cat.txt	1	0	1	1	0	1	1
dog.txt	1	1	0	0	1	1	0

A search for the word ‘dog’ results in the ‘dog’ column being examined and the documents with a binary ‘1’ in the cell being returned.
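The lookup described above can be sketched in a few lines. This is only an illustration, not how any particular search engine is implemented: each column of the table is stored as the set of documents that would have a ‘1’ in that column, and the helper names are my own.

```python
import re

documents = {
    "cat.txt": "The cat sat on the mat",
    "dog.txt": "The dog ate the cat",
}


def tokenize(text):
    """Lower-case the text and split it into plain words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each word (column) to the set of documents containing it."""
    index = {}
    for name, text in docs.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(name)
    return index


def search(index, word):
    """Return the documents whose cell for this word would hold a 1."""
    return sorted(index.get(word.lower(), set()))


index = build_index(documents)
print(search(index, "dog"))
print(search(index, "cat"))
```

Note that a search never has to open the documents themselves, only the one column for the word being searched.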

The inverted index also allows Boolean expressions to be used. For example, if a search for ‘cat’ is performed, the search engine would produce both cat.txt and dog.txt. Using a Boolean expression such as, ‘cat AND NOT dog’, only cat.txt will be produced.
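As a rough sketch of why Boolean expressions come almost for free, the columns can be treated as sets and combined with ordinary set operations. The helper name docs_with is made up for this example, and real engines will do this differently.

```python
documents = {
    "cat.txt": "the cat sat on the mat",
    "dog.txt": "the dog ate the cat",
}

# Build the index: word -> set of documents that contain the word.
index = {}
for name, text in documents.items():
    for word in text.split():
        index.setdefault(word, set()).add(name)


def docs_with(word):
    """Return the set of documents containing the word (empty if none)."""
    return index.get(word, set())


cat_and_dog = docs_with("cat") & docs_with("dog")      # AND is intersection
cat_or_dog = docs_with("cat") | docs_with("dog")       # OR is union
cat_and_not_dog = docs_with("cat") - docs_with("dog")  # AND NOT is difference

print(sorted(cat_and_not_dog))
```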

The advantage of inverted indexes is their speed, as they only need to search columns to get their results. The use of automated software programs, called spiders or robots, enables the Web to be indexed automatically, which means that the search engines that use inverted indexes often have indexed most of the Web.

Inverted indexes also have a number of handicaps. One disadvantage is that they cannot perform proximity searches on words, e.g. ‘cat NEAR sat’.
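The reason is that a table of 1s and 0s throws away word order. The sketch below shows two made-up sentences that produce identical rows even though ‘cat’ and ‘sat’ are adjacent in only one of them. The positions and near helpers are not part of the simple table described here; they are an assumption added to show what extra information a NEAR search would need.

```python
def binary_row(text, vocabulary):
    """Return the 1/0 row the simple table would store for this text."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


def positions(text):
    """Map each word to the list of positions at which it occurs."""
    result = {}
    for i, word in enumerate(text.lower().split()):
        result.setdefault(word, []).append(i)
    return result


def near(text, a, b, max_gap=1):
    """True if words a and b occur within max_gap positions of each other."""
    pos = positions(text)
    return any(
        abs(i - j) <= max_gap
        for i in pos.get(a, [])
        for j in pos.get(b, [])
    )


close = "the cat sat on the mat"
far = "the cat on the mat sat"
vocabulary = sorted(set(close.split()))

# Same words, so the binary rows cannot tell the two texts apart.
print(binary_row(close, vocabulary) == binary_row(far, vocabulary))
# With positions stored, the difference becomes visible.
print(near(close, "cat", "sat"), near(far, "cat", "sat"))
```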

Another disadvantage is that the context of the document is not provided.

Specifically, inverted indexes cannot handle homonyms or synonyms. Homonyms are words that are spelled the same but have different meanings, e.g. photo film and a film of oil. Synonyms are words that are spelled differently but have the same meaning, e.g. film and movie.
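A small sketch makes both problems visible. The three documents and their file names are invented for this example; the index is the same word-to-documents mapping as before.

```python
documents = {
    "camera.txt": "load the film into the camera",
    "engine.txt": "a thin film of oil coats the engine",
    "cinema.txt": "the movie opens at the cinema tonight",
}

# Build the index: word -> set of documents that contain the word.
index = {}
for name, text in documents.items():
    for word in text.split():
        index.setdefault(word, set()).add(name)


def search(word):
    """Return the documents that contain exactly this word."""
    return sorted(index.get(word, set()))


# Homonym problem: both senses of 'film' come back together.
print(search("film"))
# Synonym problem: a search for 'film' misses cinema.txt, which says 'movie'.
print(search("movie"))
```

Someone searching for ‘film’ in the cinema sense gets the engine oil document and misses the one document that is actually about a movie.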

Context Searching
One search engine, Excite, has been developed to solve the problem of context searching. Excite uses an inverted index but performs an additional step when classifying sites. After creating the inverted index, Excite groups sites that have similar profiles. This way, even if one site uses the word ‘film’ and another uses ‘movie’, these sites will be grouped together because they will have many other words that are similar. In this way, Excite overcomes the two major hurdles of searching for sites.

The classification categories under which Excite stores documents are based on a statistical analysis of each document. This way, the classification scheme can be created from the bottom up and not imposed from the top down like Yahoo.
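Excite's actual statistical technique is not described here, so the following is only a sketch of the general idea of grouping by similar profiles. It assumes a document's profile is simply its set of words and uses Jaccard similarity (shared words divided by total distinct words) as a stand-in measure. The documents are invented.

```python
def profile(text):
    """A crude profile: the set of distinct words in the text."""
    return set(text.lower().split())


def jaccard(a, b):
    """Shared words divided by all distinct words across both profiles."""
    return len(a & b) / len(a | b)


documents = {
    "review1.txt": "the film has a great director and cast",
    "review2.txt": "the movie has a great director and cast",
    "garage.txt": "a film of oil covered the garage floor",
}
profiles = {name: profile(text) for name, text in documents.items()}


def most_similar(name):
    """Return the other document whose profile overlaps most with this one."""
    others = [other for other in documents if other != name]
    return max(others, key=lambda o: jaccard(profiles[name], profiles[o]))


print(most_similar("review1.txt"))
print(most_similar("review2.txt"))
```

Even though review1.txt shares the word ‘film’ with garage.txt, its other words pull it towards review2.txt, which never mentions ‘film’ at all.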

However, despite its key features, Excite still has a number of failings. First, its statistical analysis techniques are still relatively simple and are prone to errors and second, Excite still doesn’t provide the level of context found at Yahoo such as cross-referencing.

Page Ranking
There is a relatively new search engine on the Internet – Google. This uses a system of page ranking to achieve accurate search results (Google calls this technology PageRank – highly imaginative name ... not). From Google’s own web pages, here is their explanation of how their system works:

PageRank capitalizes on the uniquely democratic characteristic of the web by using its vast link structure as an organizational tool. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. Google assesses a page's importance by the votes it receives. But Google looks at more than sheer volume of votes, or links; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."

These important, high-quality results receive a higher PageRank and will be ordered higher in results. In this way, PageRank is Google's general indicator of importance and does not depend on a specific query. Rather, it is a characteristic of a page itself based on data from the web that Google analyzes using complex algorithms that assess link structure.

This technique tends to generate very accurate results.
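The voting idea in the quote can be sketched as a simple repeated calculation. This is a simplified textbook-style version, not Google's production system: the four-page link graph is invented, and the damping value of 0.85 and the iteration count are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate importance scores for pages given who links to whom."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no links spreads its vote over every page.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical web of four pages: page -> pages it links to.
links = {
    "A": ["B"],
    "B": ["C"],
    "C": ["B"],
    "D": ["B", "C"],
}

ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph B collects the most votes. C receives fewer links than B, but one of them comes from B itself, so C still ends up far ahead of A and D, which nobody links to.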

Which Is Best?
There is no universal best search engine – however, depending on what you are searching for, there is an optimum search engine.

Yahoo and other subject guides are a good place to start researching any topic. The context and cross-referencing that subject directories provide help build a general understanding of the topic.

However, once the basic information has been found, the researcher should switch to a search engine which uses an inverted index or page ranking. With knowledge of the limitations of these search engines, a researcher can tailor their searches to find information not catalogued in the subject guides.

In general, subject guides are the best place to start researching a topic, but searches should also be made with inverted index search engines to ensure most of the available documents on the Internet are found.

Reference
Steinberg, S. (1996) “Seek and ye shall find (maybe)” Wired, May 1996, p 108
Berkeley Digital Library (1997) “Internet Search Tool Details” http://sunsite.berkeley.edu/Help/searchdetails.html
Eagan, Bender (1996) “Spiders and Worms and Crawlers, Oh My: Searching on the World Wide Web” http://192.114.206.1/~natang/meagan.html

Reference Web Sites
In addition to the above documents, several web sites were visited with multiple documents being used from these sites.

SearchEngine Watch: www.searchenginewatch.com
Yahoo: www.yahoo.com
Alta Vista: www.altavista.com
HotBot: www.hotbot.com
Excite: www.excite.com
Netscape: home.netscape.com
Google: www.google.com
