
When you are designing a search engine or constructing an index or concordance, you might start with the assumption that you want full-text indexing -- that is, you want every word in the corpus to have an index entry so that you can find all the places where it occurs.

Unfortunately, some words are so common as to be not worth indexing. Most people would not want a concordance to include every instance of the word "the", or of other articles. When indexing text on a specific subject, other words might also not be worth indexing: if I were indexing a collection of texts on Florida history, for example, I would probably not index the word "Florida". Not only would it appear frequently, but the fact that a sample of text contains the word "Florida" would not distinguish it from any other text in the collection. Indexing it would thus cost time and index space for no benefit to the searcher.
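As a rough illustration of the idea, here is a minimal Python sketch of an index builder that skips excluded words. The word lists, the function name build_index, and the two sample documents are all made up for this example; a real system would use its own lists and a more careful tokenizer.

```python
import re
from collections import defaultdict

# Hypothetical word lists, chosen only for illustration.
GENERAL_STOPWORDS = {"a", "an", "and", "it", "of", "the"}
DOMAIN_STOPWORDS = {"florida"}  # subject-specific, as in the Florida history example


def build_index(documents, stopwords):
    """Map each indexed word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in stopwords:
                index[word].add(doc_id)
    return dict(index)


# Made-up sample corpus.
documents = {
    1: "The history of Florida and the railroad",
    2: "Florida citrus and the great freeze",
}
index = build_index(documents, GENERAL_STOPWORDS | DOMAIN_STOPWORDS)
```

With these inputs, the index holds entries only for "citrus", "freeze", "great", "history", and "railroad"; neither "the" nor "florida" is indexed, so a search on either word alone would find nothing.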

The opposite problem arises when a common word is the main way of describing something. If you want Andy Warhol's novel A, Debora Greger's poetry collection And, or (in some cases) Stephen King's novel It, you'll likely have trouble searching by title, because the title consists entirely of words the index leaves out. The best way around this, usually, is to search by the author's name and find the book in a list of everything by that author. (Some search engines have ways around it: search for +a in Google, put quotes around "a" in some other databases.)

A word that is left out of an index for these reasons is generally called a stopword. Expert searchers (reference librarians, etc.) know which terms are likely to be stopworded and which aren't; they sometimes consult the list of stopwords for a particular database to refine a search.
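Checking a query against a database's stopword list can be sketched in a few lines of Python. The list below and the function name split_query are hypothetical, invented for this example; the real list depends entirely on the database being searched.

```python
# Hypothetical stopword list for some database; real lists vary.
STOPWORDS = {"a", "an", "and", "it", "of", "the"}


def split_query(query, stopwords):
    """Return (kept, dropped): the terms that would be searched and those ignored."""
    kept, dropped = [], []
    for term in query.lower().split():
        if term in stopwords:
            dropped.append(term)
        else:
            kept.append(term)
    return kept, dropped


kept, dropped = split_query("The history of Florida", STOPWORDS)
title_kept, title_dropped = split_query("It", STOPWORDS)
```

Here the first query keeps "history" and "florida" and drops "the" and "of", while the one-word title query "It" keeps nothing at all, which is the signal to fall back on an author search.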
