The complete Chomsky hierarchy consists of the following sets of languages, each described by a class of grammars and accepted by a corresponding class of machines:
- The regular languages, which are described by left- or right-linear grammars and can be recognized by a finite automaton. Noam Chomsky called these Type 3 grammars, and they were part of his original hierarchy.
- The deterministic context-free languages, which are described by a subset of the context-free grammars and can be accepted by a deterministic pushdown automaton. This class was not part of Chomsky's original hierarchy, but it comes up often enough in compiler design (languages in this class are used by tools such as Yacc to generate parsers) and language theory to be very important.
- The context-free languages, which are described by context-free grammars and can be accepted by a (nondeterministic) pushdown automaton. Noam Chomsky called these Type 2 grammars.
- The context-sensitive languages, which are described by context-sensitive grammars and can be accepted by a linear bounded automaton. Noam Chomsky called these Type 1 grammars.
- The recursive languages, which can be recognized by Turing machines that eventually halt on all inputs. This class is not part of the original hierarchy, and it is in general undecidable whether a given recursively enumerable language is actually recursive (deciding this would amount to solving David Hilbert's famous Entscheidungsproblem; see the description number wu for a proof of why).
- The partial recursive languages or recursively enumerable languages, which can be enumerated by arbitrary Turing machines and described by semi-Thue grammars, also known as Type 0 (Chomsky's original name for them), unrestricted, or phrase-structure grammars.
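A minimal illustration of the bottom of the hierarchy (the functions and names here are mine, not from the text above): a finite automaton has only finitely many states and no auxiliary memory, so it can recognize a language like (ab)*, but not a^n b^n, whose recognition requires unbounded counting, supplied by a pushdown automaton's stack. A sketch in Python:

```python
def accepts_ab_star(s):
    # Finite automaton for (ab)*: two states, no auxiliary memory.
    state = 0  # 0 = expecting 'a', 1 = expecting 'b'
    for ch in s:
        if state == 0 and ch == 'a':
            state = 1
        elif state == 1 and ch == 'b':
            state = 0
        else:
            return False
    return state == 0

def accepts_an_bn(s):
    # a^n b^n is context-free but not regular: recognizing it needs
    # unbounded counting. The counter here stands in for a pushdown
    # automaton's stack (push on 'a', pop on 'b').
    count = 0
    seen_b = False
    for ch in s:
        if ch == 'a':
            if seen_b:
                return False
            count += 1
        elif ch == 'b':
            seen_b = True
            count -= 1
            if count < 0:
                return False
        else:
            return False
    return count == 0
```

No fixed number of states suffices for `accepts_an_bn`, which is why the counter (stack) is essential there but absent from the finite-automaton version.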
Every class of languages in the list is a proper subset of the classes that come after it, every grammar formalism can express the constructions of any of the formalisms before it, and every machine in the list can be simulated by the machines that come after it. Turing machines can, of course, be constructed that are capable of simulating any arbitrary Turing machine as well (universal Turing machines).
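The strictness of these inclusions is usually shown with classic separating languages: a^n b^n is context-free but not regular, and a^n b^n c^n is context-sensitive but not context-free (one stack cannot match three counts). A quick membership check for the latter (the function name is my own):

```python
def in_an_bn_cn(s):
    # Membership test for {a^n b^n c^n : n >= 0}, the textbook example
    # of a context-sensitive language that no pushdown automaton can
    # accept. A linear bounded automaton handles it easily.
    n = len(s) // 3
    return len(s) == 3 * n and s == 'a' * n + 'b' * n + 'c' * n
```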
Natural languages are in general not context-free, but linguists have found that while the context-sensitive languages are a large enough class to describe natural languages, the class is also far larger than necessary. Even worse, recognizing CSLs can take up to exponential time on ordinary computers (membership testing is PSPACE-complete, so this holds even on a hypothetical computer endowed with nondeterminism), making them unworkable for practical use. Ongoing research in computational linguistics has thus focused on formulating other classes of languages that are "mildly" context-sensitive and can be recognized in polynomial time, such as the tree adjoining grammars, coupled context-free grammars, and linear context-free rewriting systems. The languages described by these formalisms lie properly between the context-free and context-sensitive languages.
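A standard example of such a mildly context-sensitive language is the copy language {ww : w over {a, b}}, which tree adjoining grammars can generate: it is not context-free, yet membership is trivially checkable in polynomial time. A sketch (the function name is my own):

```python
def in_copy_language(s):
    # The copy language {ww} is not context-free, but tree adjoining
    # grammars generate it, and membership takes only linear time:
    # split the string in half and compare the halves.
    half, rem = divmod(len(s), 2)
    return rem == 0 and s[:half] == s[half:]
```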
The tree adjoining grammars mentioned above are enjoying a renaissance of research: not only do they seem to be useful as a tool for the study of natural language, but they also appear natural for characterizing RNA secondary structures in molecular biology and virology.
Thanks to alfimp for suggestions on the relationship between natural languages and the Chomsky hierarchy, and for looking for some resources on the topic.