In linguistics, the lexicon is the mental repository of what may informally be called words: the arbitrary links between sound and meaning. These differ from language to language, and each speaker of a language internalizes their own list of the words they know, together with their understanding of those words' meanings and use.

The lexicon is part of what is stored in long-term memory. Some linguistic properties are predictable and need not be stored individually: regular plurals and past tenses, for example, can be generated when needed from other elements. The elements, and at least some of the rules, differ from language to language, so these need to be stored too. The term lexicon normally refers to the stored static elements, not to the rules of derivation.

It also refers to the linguistic aspect of knowledge, not to more general encyclopaedic knowledge. We know a vast number of things about cats, but the linguistically pertinent facts about the word cat are that it is pronounced [kæt], that it is a count noun, that it has a regular plural, and a small amount of semantics: that it denotes something animate but non-human, which entails the grammatical fact that we can refer to it with it, or optionally she or he. Sentences are produced and understood using only this lexical information. A great many more inferences are possible at the level of pragmatics, outside the strictly linguistic domain, using our general knowledge of cats.
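A minimal sketch of what such a stored entry might look like, in Python; the field and function names are my own illustration, not any particular theory's apparatus:

```python
from dataclasses import dataclass

@dataclass
class LexicalEntry:
    """One stored sound-meaning pairing; the fields are illustrative."""
    phonology: str        # pronunciation, e.g. in IPA
    category: str         # part of speech
    count_noun: bool      # count vs. mass
    regular_plural: bool  # whether the plural can be generated by rule
    animate: bool         # the small piece of stored semantics
    human: bool

CAT = LexicalEntry(phonology="kæt", category="noun", count_noun=True,
                   regular_plural=True, animate=True, human=False)

def pronouns(entry: LexicalEntry) -> list[str]:
    """Animate non-human nouns take 'it', with 'she'/'he' as options."""
    if entry.human:
        return ["she", "he"]
    if entry.animate:
        return ["it", "she", "he"]
    return ["it"]

assert pronouns(CAT) == ["it", "she", "he"]
```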

Contents

So instead of the informal 'words' we need to specify more precisely what kinds of things appear in the lexicon. There are several abstract terms for members of the lexicon: lexical item, or lexeme, or listeme, possibly with differences between them that don't concern me at this level of discussion. Then we need to say what properties are stored at the linguistic level and how they are connected.

First there need to be elements smaller than what are normally thought of as words. Morphemes build up into words. Derivational morphemes are those like un- and -able in undecipherable, and inflectional morphemes are the grammatically active ones like those in walks and walked. Derivational morphology is only partly predictable: we get patterns like impress ~ impression shared by a large number of words, sometimes in more altered forms like describe ~ description, but the final result is still a word that needs to be remembered in the lexicon. In contrast, inflectional morphology is usually fully predictable, so the lexicon (like a printed dictionary) doesn't need to store all the regular forms, though it does need entries for irregular forms like sang.
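A toy model of that division of labour, assuming a crude suffixing rule and a small store of irregulars (both the rule and the store are my own simplifications):

```python
# Irregular past tenses must be stored; regular ones are generated by rule.
IRREGULAR_PAST = {"sing": "sang", "go": "went", "eat": "ate"}

def past_tense(verb: str) -> str:
    """A stored irregular form blocks the regular rule; otherwise add -ed."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"

assert past_tense("walk") == "walked"  # generated when needed, not stored
assert past_tense("sing") == "sang"    # stored, and blocks *singed
```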

On the other hand, some entities larger than a traditional word also need to be stored. While compounding is sometimes transparent, so that we know what a wild horse is just from the two components, there are huge numbers of expressions that we do need to remember, such as gift horse and wild rice. Ray Jackendoff has proposed theories of semantics and the lexicon in which the lexicon is supposed to include all linguistic expressions held in long-term memory, regardless of size. This would include not just idioms like get over it but also those that are at or approach full sentence size, such as get (your) knickers in a twist and the shit hits the fan, which behave in some ways like fully inflected grammatical constructions but tend to have strange syntactic properties, such as not permitting the usual range of grammatical variation: you can't normally say *the fan was hit by the shit.
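One way to picture such stored-but-still-inflectable units is as lexical entries of arbitrary size carrying idiosyncratic syntactic flags. The representation below is purely illustrative; the field names are mine, not Jackendoff's:

```python
from dataclasses import dataclass

@dataclass
class StoredExpression:
    """A memorized expression of any size, with its syntactic quirks."""
    words: tuple[str, ...]       # the memorized material
    open_slots: tuple[str, ...]  # parts that still inflect or vary
    passivizable: bool           # an idiosyncratic restriction

SHIT_FAN = StoredExpression(
    words=("the", "shit", "hits", "the", "fan"),
    open_slots=("hits",),   # tense still varies: hit, hits, will hit
    passivizable=False,     # *the fan was hit by the shit
)
```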

Jackendoff would also include memorized quotations in the lexicon. This is getting perilously far away from the instantaneous-access component of the modularity hypothesis for language, and is straying into slow-access encyclopaedic information. It might depend on the quotation. Intuitively there seems to me to be a gap between the linguistic fragments 'to the manner born' and 'more honoured in the breach than the observance', which I can access and use pretty much instantly, and the matrix quotation that includes them both. Even though I know it well, I can't recall or use it in ordinary speech, or make puns or other variations on it, anywhere near as easily as its two component phrases. So I think there might be a limit to how fully any textual memory is integrated into the memory part of the language module.

Connexions

A lexical item has a sound and a meaning. These are present at the interfaces to the phonetic and logical apparatus. The syntax mediates between them, converting phonetic forms to logical forms and vice versa by constructing syntactic representations that differ from language to language. Items must have some syntactic information in them: the part of speech, subcategorization properties (such as think requiring an animate subject and a propositional object), and irregular or suppletive allomorphy, at least.
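A sketch of how such stored requirements might be checked when a sentence is built; the feature labels and frame format are assumptions of mine for illustration:

```python
# Subcategorization frame for 'think', as described above: an animate
# subject and a propositional (clausal) object.
THINK_FRAME = {"subject": {"animate"}, "object": {"proposition"}}

def satisfies(frame: dict, subject_feats: set, object_feats: set) -> bool:
    """True if subject and object carry all the required features."""
    return (frame["subject"] <= subject_feats
            and frame["object"] <= object_feats)

# 'The cat thinks that it is hungry': animate subject, clausal object.
assert satisfies(THINK_FRAME, {"noun", "animate"}, {"proposition"})
# *'The rock thinks the cheese': fails both requirements.
assert not satisfies(THINK_FRAME, {"noun"}, {"noun"})
```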

The traditional picture of how these get into sentences has been called 'syntactocentric': a deep structure is composed of choices from the lexicon, and operations such as transformation or movement then relate this structure to logical and phonetic structures, without (during the syntactic phase) using any logical or phonetic features. This has been compared to each word carrying two locked suitcases through its syntactic processing.

Jackendoff's alternative is that each of semantics, phonology, and syntax is a generative system, and they work in parallel, connected across the lexicon. A lexical item is a correspondence between elements of the three systems.

In one system, that of phonology, the word /kæt/ in English adheres to the language's constraints on the nuclei and codas of the syllable, receives stress and aspiration, and engages in voice harmony with the plural morpheme. This is almost completely independent of the syntactic fact that it is an animate non-human count noun, and of whatever genuinely semantic information needs to be associated with it within the language module. However, larger-scale phonological processes, such as what intonation to give it in a question, do seem to depend on knowledge of the syntactic structure.

A morpheme like the English regular plural /z/ ~ /s/ ~ /əz/ belongs in the lexicon. It is common to treat the plural cats as lexical, that is, as one of the elements inserted at the beginning of the syntactic derivation, but another possibility is to treat this part of morphology as a part of the syntax, the morphosyntax, created by a kind of generative process. On the other hand, some would say that morphosyntax is 'in the lexicon', and that the lexicon is therefore itself generative. Related but different considerations apply to common and perhaps lexicalized formations like catlike, to parallel nonce-formations like caterpillar-like, and to semi-productive patterns of formation like impress ~ impression ~ impressionable, where some of the semantics needs to be remembered.
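The conditioning of the regular plural is simple enough to state as a rule over the stem's final segment. Here is a rough sketch; the segment classes are the familiar ones, but the code and its coverage are my own simplification:

```python
SIBILANTS = set("szʃʒ") | {"tʃ", "dʒ"}  # trigger the /əz/ allomorph
VOICELESS = set("ptkfθ")                 # trigger /s/ (unless sibilant)

def plural_allomorph(final_segment: str) -> str:
    """Pick /z/, /s/, or /əz/ from the stem's final segment."""
    if final_segment in SIBILANTS:
        return "əz"   # bus-es, judg-es
    if final_segment in VOICELESS:
        return "s"    # cat-s: voice harmony with a voiceless stop
    return "z"        # dog-s: voice harmony with a voiced segment

assert plural_allomorph("t") == "s"   # cats
assert plural_allomorph("g") == "z"   # dogs
assert plural_allomorph("s") == "əz"  # buses
```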

Anomalous contents

If the lexicon is a kind of memory, anything produced on the spot, like caterpillars or caterpillar-like, should not be counted as being in it. Chomsky's theoretical position is that nothing redundant should be in it: only what cannot be predicted. But in fact it is quite possible that early acquisition of regular forms is a kind of memorization, so that a lexicalized hands or walked is not displaced when the component morphemes are later inferred from such forms and the rules for creating them are internalized.

There could be lexical items that are defective in one of the three kinds of feature: hello and gosh have no syntax, and purely syntactic agreement and case marking have no semantics. Indeed, theoretical syntacticians do posit a feature [-interpretable] on some grammatical markers, requiring them to be checked off and stripped out at some point in the syntax, after they have given rise to their phonetic form but before they reach the interpretation of their logical form. They also posit elements having no phonetics, so-called empty categories, to give more consistent syntactic structures where there is no overt word. These too would have distinct properties noted in the lexicon.
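On the parallel-architecture picture sketched above, such defective items fall out naturally as correspondences with one component missing. A sketch of my own, with illustrative values only:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TripartiteEntry:
    """A lexical item as a phonology-syntax-semantics correspondence,
    any one of which may be absent."""
    phonology: Optional[str]
    syntax: Optional[str]
    semantics: Optional[str]

HELLO = TripartiteEntry("həloʊ", None, "greeting")               # no syntax
AGREEMENT = TripartiteEntry("z", "3sg, [-interpretable]", None)  # no semantics
TRACE = TripartiteEntry(None, "NP (empty category)", "bound variable")  # no phonetics
```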