Information Retrieval - Everything2.com

Information Retrieval (IR) is the area of computer science devoted to indexing and searching large bodies of textual material.

It was developed in the 1950s and 1960s, mainly to address the needs of library automation, and this clearly shows in its approach and techniques.

IR concentrates on the issues of how to find meaningful index keywords and how to organise them. It believes in the 'magical' approach: users without deeper knowledge of how the system organises its information will enter simple queries, the system processes these queries very intelligently, and provides exactly the answers the user didn't even know s/he was looking for.

IR relies on quantitative methods (statistics, 'ranking' of results) to achieve its aims.

Queries apply to a pool of items: either electronically available documents, or just bibliographic descriptions of those documents, perhaps adorned with keywords and other information. For each query, the system assigns a degree of relevance to each item, then ranks the items (sorts them by relevance).
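The score-then-sort idea can be shown with a minimal Python sketch. The scoring function here is an assumption made purely for illustration (it counts how often the query's words occur in an item); real systems use far more refined statistics, and the names `relevance`, `rank` and `pool` are hypothetical.

```python
from collections import Counter

def relevance(query, item):
    """Toy relevance score (an assumption): occurrences of the query's words in the item."""
    words = Counter(item.lower().split())
    return sum(words[term] for term in query.lower().split())

def rank(query, pool):
    """Score every item in the pool, then sort by decreasing relevance."""
    scored = [(relevance(query, item), item) for item in pool]
    # sorted() is stable, so items with equal scores keep their pool order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

pool = [
    "library automation and card catalogues",
    "indexing large bodies of text",
    "text indexing for library text collections",
]
ranking = rank("library text", pool)
for score, item in ranking:
    print(score, item)
```

Note that every item gets a score, even a score of zero; the system returns a ranked list rather than a yes/no answer.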

Given the relevance relation between n individual index terms and documents, we can think of documents as points in an n-dimensional vector space and use the distance in this space to determine the extent to which documents are related. In the same way, treating each index term as a point whose coordinates are the documents, we can determine correlations between index terms. This vector space model can therefore be used to navigate the document space or the index term space, an approach completely opposite to the explicit navigation links provided by hypertext.
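A small Python sketch may make the vector space model concrete. It assumes raw term counts as coordinates and cosine similarity as the measure of closeness; both are common choices but are assumptions here, not something the model dictates. The function names and the three index terms are hypothetical.

```python
import math

def to_vector(text, index_terms):
    """Place a document in term space: one coordinate (a raw count) per index term."""
    words = text.lower().split()
    return [words.count(term) for term in index_terms]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document with no index terms is related to nothing
    return dot / (norm_a * norm_b)

index_terms = ["library", "index", "query"]  # n = 3 dimensions
doc_a = to_vector("library index library", index_terms)
doc_b = to_vector("index query", index_terms)
doc_c = to_vector("library library index index", index_terms)

print(cosine_similarity(doc_a, doc_c))  # close to 1: strongly related
print(cosine_similarity(doc_a, doc_b))  # much lower: weakly related
```

Swapping the roles of documents and terms (one coordinate per document, one vector per term) gives the term-to-term correlations in the same way.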

Two quality measures are associated with query results: precision, which is the percentage of returned documents that are indeed relevant to the user's real information need, and recall, which is the percentage of all relevant documents that the query actually returns. A third measure, fallout, is sometimes mentioned as well: the percentage of non-relevant documents that the query returns.

The 'real information need' of a user can only be measured afterwards, if at all. These measures are useful for the empirical validation of a system's performance.
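Once someone has judged afterwards which documents were relevant, precision and recall are simple to compute. The following Python sketch assumes documents are identified by hypothetical ids such as "d1" and returns fractions rather than percentages (multiply by 100 for the latter).

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # returned documents that are indeed relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

returned = ["d1", "d2", "d3", "d4"]  # what the query gave back
relevant = ["d2", "d4", "d7"]        # what the user actually needed
precision, recall = precision_recall(returned, relevant)
print(precision, recall)
```

Here two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67). Returning everything in the pool drives recall to 1 at the expense of precision, which is why the two are quoted together.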

IR is what makes the simple search in AltaVista so bloody irritating: you enter a keyword and it returns scores of documents that don't even contain the keyword. It's the "we're so much smarter than you are, we can't even tell you how" attitude.

Techniques for language analysis are also considered part of IR. They mainly deal with words and their morphology: stemming associates word forms (such as plurals) with a common base form, while stop lists remove words that are too common to be useful as index terms.

Stemming is really a primitive form of thesaurus-based indexing.
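Both techniques can be sketched in a few lines of Python. The stop list and the two suffix rules below are toy assumptions chosen for illustration; real stemmers apply many more rules, and the names `STOP_WORDS`, `stem` and `index_terms` are hypothetical.

```python
STOP_WORDS = {"the", "a", "of", "and", "in", "to"}  # assumed toy stop list

def stem(word):
    """Crude plural stripping: 'queries' -> 'query', 'documents' -> 'document'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def index_terms(text):
    """Drop stop words, then reduce the remaining words to their base form."""
    words = text.lower().split()
    return [stem(word) for word in words if word not in STOP_WORDS]

terms = index_terms("The queries and the documents of a class")
print(terms)
```

A rule set this small makes mistakes quickly ("indexes" becomes "indexe", for instance), which is a fair picture of why stemming counts as a primitive technique.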

Both the quantitative approach and the stress on empirical validation have a strong 1950s feel to them. In this sense, IR is not part of computer science, but rather a precursor: computer scientists work with a discrete ('qualitative'), structure-based approach grounded in formal languages, as exemplified by database theory. Today, in the Web age, integration of the two approaches is necessary and inevitable.

A popular introductory textbook is C.J. van Rijsbergen's "Information Retrieval", available on the Web (use Google).
