The Linguistic Data Consortium (LDC) is an open consortium of universities, companies, and government research laboratories that compiles and distributes linguistic databases for research and development purposes. Founded in 1992 with a grant from the Advanced Research Projects Agency (ARPA), the LDC is located at the University of Pennsylvania.
When natural language processing and related technologies became popular in the 1970s and 1980s, researchers realized that developing any sophisticated language-processing program required massive amounts of linguistic data to train it. Everything from language translators to grammar checkers needs to be trained on corpora of real-world data in order to learn the patterns and nuances of a particular language: meanings change with context and pragmatic considerations, and pronunciation varies with accent and inflection, to name just two examples. These massive databases are extremely expensive to create and maintain because they require systematic collection and cataloguing of material from many different forms of communication. The task of building such a database is generally too large for any one company to take on, and when a company does build one privately, only that company has access to the data. To promote general research in the area, the LDC was founded to develop databases for different languages and to act as a public repository for them.
The LDC builds its databases primarily from recorded telephone conversations and by parsing and analyzing text corpora such as newspapers and newsgroups. It produces audio files of taped recordings, transcriptions, tone data, word frequency analyses, and linguistically annotated transcripts (in which nouns, verbs, etc. are marked). Example lexicons include:
The lexicons and corpora are available on CD-ROM, and through their sales the LDC has become fully self-supporting. In addition to these databases, the LDC has produced several resources that are available free of charge, including the Open Language Archives Community (an organization that keeps track of which resources are available at which institutions around the world) and Transcriber, a tool for assisting the creation of speech corpora that was released under the GPL.
Through the efforts of the LDC and language researchers, much progress has been made. In the US, the word error rate of speech recognition systems has been cut in half every two years for the past six years. Similarly, the performance of (text) message understanding and retrieval systems, measured by metrics such as precision and recall, has improved at a rate of between 20% and 50% per year.
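The metrics named above can be illustrated with a minimal sketch. It assumes the common textbook definitions: word error rate as the word-level edit distance between a reference transcript and a recognizer's output, divided by the number of reference words; precision as the fraction of retrieved items that are relevant; and recall as the fraction of relevant items that are retrieved. The function names and sample data are hypothetical and are not taken from any LDC tool or evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example data, chosen only for illustration.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
precision, recall = precision_recall(["d1", "d2", "d3", "d4"],
                                     ["d1", "d2", "d5"])
```

In this example the hypothesis contains one substitution and one deletion against six reference words, so the word error rate is 2/6. As a matter of arithmetic, halving an error rate every two years for six years compounds to 0.5 ** 3 = 0.125 of the starting value.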