the expressive power of regular expressions

The expressive power of regular expressions is the set of languages they can express. This is relevant when comparing different kinds of regular expressions, or when comparing them to other formalisms that express languages.

Many different variants of regular expressions exist but the kind that is used in theory only has literal characters, choice, and Kleene star. For example,


  a u b^*
    # the string "a" or any string consisting of "b"s

  (a u b)^*
    # any string consisting of "a"s and "b"s

  (ab^*)^*
    # the same

Comparing to other formalisms, this type of regular expression can express exactly the class of languages expressed by finite state automata: the regular languages. There is no difference in expressive power between them. There is, however, a significant difference in compactness: some classes of regular languages can only be described by automata that grow exponentially in size, while the required regular expressions only grow linearly. (Although I haven't checked, the reverse is probably true as well.)

We can also study expressive power within the formalism. As the example shows, different regular expressions can express the same language: the formalism is redundant. To what extent can this redundancy be eliminated? Can we find an interesting subset of regular expressions that is still fully expressive? Kleene star and choice operator are obviously required, but perhaps we can restrict their use.

This turns out to be a surprisingly difficult problem. As simple as the regular expressions are, it turns out there is no method to systematically rewrite them to some normal form that minimizes the star depth. They are not finitely axiomatizable. So we have to resort to other methods.

This leads to the star height problem.

Many tools that support regular expressions (grep, Perl, PHP, many editors, etc.) support extensions to the syntax. These extensions allow more compact expressions and sometimes serve to optimize recognition; but they rarely add expressive power of the language. The most notable exception is the backreference:


  (ab+)\1
    # an egrep (or Perl) regexp describing {xx | x in {a,b}^*}

Note that the bracketed part can be arbitrarily long, which means that left-to-right recognition of this language requires takes an arbitrary large amount of memory, while it only takes a fixed amount of memory for regular languages, since they can be recognized by finite state automata. This is, in fact, one way to define the regular languages. (If backreferencing is limited to expressions that describe only strings up to a certain maximum length, it remains limited to regular languages.)

star height problem	Mastering Regular Expressions	Kleene star	"a^n b^n" language
Leaning Toothpick Syndrome	regex	finite state automaton	regular expression
steps to UNIX familiarity	Differential equation	we're cool	punctuation etymology
Myhill-Nerode Theorem	monitor bezel	Regular expressions for computation theory	Miles, baby, hundreds of highways and truckstops I've used.
Of Lawyers and Artists	Myhill / Nerode Theorem	pumping lemma	.NET
asterisk	formal language theory	Redundant