The Pumping Lemma for Context-Free Languages

If a language is a context-free language (CFL), then there exists a number called the pumping length such that any string in the language which has length equal to or greater than the pumping length can be divided into five pieces which satisfy the following conditions:

First, let's define some variables...
A is the language
p is the pumping length
s is a string in the language A made up of 5 substrings such that s = uvxyz

  1. For all i ≥ 0, uvixyiz is in the language A
  2. |vy| > 0
  3. |vxy| ≤ p
Or for those of us who prefer plain English:
  1. The v and y parts of s can be removed or repeated any number of times. As long as they are both removed or both repeated the same number of times, the resulting string will be in the language A. For example uxz, uvvxyyz, and uvv...vxyy...yz are all in A.
  2. v and y can not both be the empty string (ε); however, one of them can be.
  3. If you add the lengths of v, x and y, the sum can not be greater than p.

The first condition is really the heart of the lemma. Without the second condition, the lemma would be true for all languages, therefore useless. The third condition is not necessary, but it makes using the pumping lemma easier.

The Proof

We must show that for a CFL A, any string string s which is long enough can be pumped, and the resulting string will also be in A.1

Assume that s is a very long string (we'll be more specific about what "very long" is later). If G is a context-free grammar (CFG) which generates A, then s is derivable from G, and therefore has a parse tree.2 Since the right-hand side (RHS) of rules in a CFG are of finite length, strings generated by G which are beyond a certain length require increasingly taller parse trees. More simply, very long strings lead to very tall parse trees.

Let V be the set of variables for G. If the height of the parse tree for s is greater than |V| + 1, then by the pigeonhole principle, we know that in the longest branch of the parse tree, at least one of the variables is repeated. This is intuitive, since all but the last node in a parse tree are variables, then the longest branch has more than |V| variables. If we have only |V| different ones, then at least one of them is repeated.

So, now that we know there is at least one repetition of a variable in the longest branch of the parse tree. Call this variable R. The portion of the parse tree between the first and second appearance of R can be removed (pumping down) or repeated (pumping up) any number of times. This will result in different strings without invalidating the parse tree. This is easier to visualize with the following figure:

S is the start varaible

The original tree: (The colons represent an unknown number of intermediate steps in the branch)

     
          S
         /:\
        / : \
       /  :  \
      /   R   \
     /   /:\   \
    /   / : \   \
   /   /  :  \   \    
  /   /   R   \   \ 
 /   /   / \   \   \
/___/___/___\___\___\
  u   v   x   y   z

The pumped down tree:

     
          S
         /:\
        / : \
       /  :  \
      /   R   \
     /   / \   \
    /   /___\   \
   /   /  x  \   \    
  /   /       \   \ 
 /   /         \   \
/___/           \___\
  u               z

The pumped up tree:

     
          S
         /:\
        / : \
       /  :  \
      /   R   \
     /   /:\   \
    /   / : \   \
   /   /  :  \   \    
  /   /   R   \   \ 
 /   /   /:\   \   \
/___/___/ : \___\___\
  u   v/  :  \y   z
      /   R   \
     /   / \   \
    /___/___\___\
      v   x   y

Now that it is reasonably obvious that the concept is sound, we show how to calculate the pumping length and prove the three conditions.

Calculating the Pumping Length

Given G, call b the maximum number of symbols on the RHS of a rule in G. We are only concerned with grammars with b ≥ 2.3 Since a node in the parse tree can have at most b children, a parse tree of height h can have at most bh leaf nodes. Since all leaf nodes are terminals, this is the maximum length of the string generated by a parse tree of height h.

Recall that |V| is the number of variables in G. If we set p to be b|V|+2, then we know the height of the tree is at least |V| + 2, therefore at least one variable repeats, and the string can be pumped.

Condition 1

Given that s is a string in A of length ≥ p, we will show how to pump s. Let τ be a parse tree for s. Since |s| ≥ p, we have already shown that τ has height ≥ |V| + 2, and therefore at least one variable is repeated in the longest branch of τ. Call this repeated variable R. Furthermore, choose R to be a variable that is repeated among the last |V| + 1 variables on the branch.

Divide s into uvxyz according to the first figure. The two instances of R each have a subtree which generates part of s. The upper instance generates vxy, and the lower instance generates just x. Since both subtrees are generated by the same variable, we can substitute one for the other, and still have a valid parse tree. Replacing the upper subtree with the lower subtree (as in the second figure) generates the string uxz. Replacing the lower subtree with the upper subtree (as in the third figure) generates uvvxyyz. This can be done repeatedly to generate all strings of the form uvixyiz for i > 1. We know that all these strings are in the language A because they are generated by G.

Condition 2

If both v and y were the empty string, then the lemma would hold true for any p and any string. This is true since the strings ab, aεb, and aεε...εb are all equivalent. Any number of empty strings can be inserted anywhere in any string without changing it. We include this condition because the pumping lemma is useful for proving that languages are not context-free. To do that one must show that the pumping lemma fails.

Condition 3

We must show that |vxy| ≤ p. In the parse tree for s (the first figure), the upper instance of R generates vxy. Since we chose R to be a variable which is repeated among the last |V| + 1 variables on the path, we know that the subtree which generates vxy has height ≤ |V| + 2. As we previously proved, the length of the longest string that could be generated by a tree of this height is b|V|+2, which is the value we chose for p.


1. "Pumping" refers to changing the value of i in the expression uvixyiz. Pumping up means making i greater than 1. Pumping down means making i 0.
2. By the definition of CFLs, such a grammar is guaranteed to exist.
3. Grammars with b < 2 can not generate strings of more than 1 symbol.