The inverse function theorem is the foundation stone of calculus on manifolds, that is, of multivariable calculus done properly. It says that if f: R^n → R^n is continuously differentiable, and the derivative Df(x) at a point x is an invertible matrix, then f itself is actually invertible near x, and the inverse is also continuously differentiable. Succinctly put, when a function is smooth enough, infinitesimal invertibility implies local invertibility. The chain rule then forces the derivative of f^-1 to be the right thing, that is, D(f^-1)(f(x)) = Df(x)^-1. You may remember from one-variable calculus a rule of the form (f^-1)′(y) = 1 / f′(f^-1(y)). The inverse function theorem is the correct generalization of that rule to several variables.
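
(A quick numerical aside, if you like seeing such rules checked by machine: here is a small Python sketch of the one-variable rule. The function f(x) = x^3 + x and the helper f_inv are my own arbitrary choices --- f is strictly increasing, so it has a global inverse, which we compute by bisection. Nothing later depends on this.)

    # f(x) = x^3 + x is strictly increasing, hence globally invertible.
    f  = lambda x: x**3 + x
    df = lambda x: 3 * x**2 + 1

    def f_inv(y, lo=-10.0, hi=10.0, tol=1e-12):
        # bisection, valid because f is increasing and f(lo) < y < f(hi)
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if f(mid) < y else (lo, mid)
        return 0.5 * (lo + hi)

    y = 2.0
    x = f_inv(y)                                         # x = 1, since f(1) = 2
    slope = (f_inv(y + 1e-6) - f_inv(y - 1e-6)) / 2e-6   # numerical (f^-1)'(y)
    print(slope, 1.0 / df(x))                            # both print 0.25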

Here is a setup for a formal statement and proof of the inverse function theorem. You don't need to understand every word to get the proof, but (except for Banach spaces) the following notions should at least be familiar.

  • X and Y are Banach spaces. (If you don't know about Banach spaces, just read both X and Y as R^n. The inverse function theorem holds for maps of Banach spaces using exactly the same proof as for R^n, so we might as well use that generality.)
  • U is an open neighborhood of x0 ∈ X; f: U → Y is a function, and y0 = f(x0).
  • The derivative Df(x0) of f at x0, if it exists, is a member of the Banach space L(X, Y) of continuous linear operators from X to Y. If there exists T ∈ L(X, Y) such that

    ||f(x) - f(x0) - T(x - x0)||Y = o(||x - x0||X) --- that is, ||f(x) - f(x0) - T(x - x0)||Y / ||x - x0||X → 0 as x → x0 ---

    for x in a neighborhood of x0, then we say that f is differentiable at x0 and T is the derivative Df(x0). (In the case X = Y = R^n, every linear map from X to Y is continuous, and L(X, Y) is just the space of all n-by-n matrices.) A numerical check of this defining limit appears just after this list.

  • Since the derivative Df takes values in a Banach space, we can ask whether it is continuous. If Df: U → L(X, Y) is continuous, we say that f is continuously differentiable, or C1 for short.
  • Of course, Df may also be differentiable, and we may get a continuous D2f: U → L(X, L(X, Y)), in which case f is said to be C2. (Actually, D2f always lies in the subspace S of L(X, X; Y) := L(X, L(X, Y)) consisting of bilinear maps which are symmetric in their two arguments; you may know this fact as the equality of mixed partial derivatives.) In general the kth derivative of f at a point is a symmetric k-multilinear map on X with values in Y. We say that f is C∞ or smooth if it is Ck for every natural number k. (There are a few contexts in which "smooth" means only C1 rather than C∞.)
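
Here is the promised numerical sketch of the defining limit. The quadratic map f(x, y) = (x^2 - y, xy) on R^2 is an arbitrary choice of mine, with Jacobian Df written out by hand; the printed ratios ||f(x0 + h) - f(x0) - T h|| / ||h|| shrink linearly in ||h||, which for a quadratic map is exactly what the definition predicts.

    import numpy as np

    def f(v):
        x, y = v
        return np.array([x**2 - y, x * y])

    def Df(v):                          # Jacobian of f, computed by hand
        x, y = v
        return np.array([[2 * x, -1.0],
                         [y,      x  ]])

    x0 = np.array([1.0, 2.0])
    T = Df(x0)
    rng = np.random.default_rng(0)
    for r in (1e-1, 1e-2, 1e-3, 1e-4):
        h = r * rng.standard_normal(2)
        print(np.linalg.norm(f(x0 + h) - f(x0) - T @ h) / np.linalg.norm(h))
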
Now we can give the statement:

Inverse function theorem. Suppose f: U → Y is C1. Say that g: V → X is a local inverse for f at x0 if

  • V is an open neighborhood of y0 = f(x0), and g is C1 on V;
  • there is a smaller neighborhood x0 ∈ U' ⊂ U so that f(U') ⊂ V and (g o f)|U' is the identity map 1U' (g is a left inverse of f near x0);
  • there is a smaller neighborhood y0 ∈ V' ⊂ V so that g(V') ⊂ U and (f o g)|V' is the identity map 1V' (g is a right inverse of f near y0).
Then for such a local inverse g to exist, it is necessary and sufficient that the derivative Df(x0) ∈ L(X, Y) be bijective (a linear homeomorphism); and in this case g is unique.

A pedant might insist that g is only unique in the "sheaf-theoretic" sense that any two choices g1 and g2 coincide when restricted to the intersection of their domains --- since f winds up having a local inverse over any sufficiently small neighborhood of x0 and pedantically speaking two functions with different domains are unequal. This is strictly true but it's morally not the point. If you don't understand the significance of this remark, ignore it.

In fact the two conditions can be separated: f has a local left inverse at x0 iff Df(x0) has a left inverse A ∈ L(Y, X) (that is, A Df(x0) = 1X), and f has a local right inverse at x0 iff Df(x0) has a right inverse B ∈ L(Y, X) (Df(x0) B = 1Y). (In case X is finite dimensional --- and only in this case --- A exists iff Df(x0) is injective, and B exists iff Df(x0) is surjective.) However uniqueness no longer holds in the one-sided case.
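
For the finite-dimensional case this is easy to illustrate in Python: np.linalg.pinv (the Moore-Penrose pseudoinverse) produces a left inverse of any injective, i.e. full-column-rank, matrix, and a right inverse of any surjective, i.e. full-row-rank, one. The matrices below are arbitrary examples of mine, and pinv is just one choice among the infinitely many one-sided inverses --- matching the failure of uniqueness noted above.

    import numpy as np

    M = np.array([[1.0, 0.0],      # injective 3x2 matrix (full column rank)
                  [0.0, 1.0],
                  [1.0, 1.0]])
    A = np.linalg.pinv(M)          # one left inverse: A = (M^T M)^-1 M^T
    print(np.allclose(A @ M, np.eye(2)))    # True:  A M = 1

    N = M.T                        # surjective 2x3 matrix (full row rank)
    B = np.linalg.pinv(N)          # one right inverse: B = N^T (N N^T)^-1
    print(np.allclose(N @ B, np.eye(2)))    # True:  N B = 1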

Proof of the theorem.

There are an awful lot of words in this proof because I'm trying to explain the motivation for what we do. If you want the concise and elegant version, read the reference I'm expanding on for this writeup, that is Theorem 1.1.7 of The analysis of linear partial differential operators by Lars Hörmander.

1.   Necessity is obvious from the chain rule: If g is a local inverse for f at x0, then the equations

(g o f)|U' = 1U'   and   (f o g)|V' = 1V'

imply (taking derivatives) that

Dg(y0) Df(x0) = 1X   and   Df(x0) Dg(y0) = 1Y

and this says exactly that Df(x0) is invertible in L(X, Y), with inverse Dg(y0). --- The other direction is the meat of the theorem:
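
If you want to see the necessity computation in action, here is a Python sketch using the polar-coordinates map f(r, θ) = (r cos θ, r sin θ) --- my choice of example, not anything from the text --- whose local inverse near a point with r > 0 is g(x, y) = (sqrt(x^2 + y^2), atan2(y, x)). The finite-difference Jacobians (the helper jac is mine) multiply out to the identity, as the chain rule demands.

    import numpy as np

    f = lambda p: np.array([p[0] * np.cos(p[1]), p[0] * np.sin(p[1])])
    g = lambda q: np.array([np.hypot(q[0], q[1]), np.arctan2(q[1], q[0])])

    def jac(F, p, eps=1e-7):
        # forward-difference Jacobian of F at p
        cols = [(F(p + eps * e) - F(p)) / eps for e in np.eye(len(p))]
        return np.column_stack(cols)

    x0 = np.array([2.0, 0.5])      # r > 0, so Df(x0) is invertible
    y0 = f(x0)
    print(jac(g, y0) @ jac(f, x0)) # approximately the 2x2 identity: Dg Df = 1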

2.   (If you get lost skip to 3 below.) First let's simplify the problem a bit. Notice that if gL is a local left inverse and gR is a local right inverse for f at x0, then for y in the intersection of their domains,

gL(y) = (gL o f o gR)(y) = gR(y);

hence gL = gR on a smaller neighborhood of y0, and this function is a local two-sided inverse for f. Thus it's enough to prove separately that each local one-sided inverse exists.

Next, observe that if A is a left inverse for Df(x0) and we set F = A o f, then by the chain rule

DF(x0) = DA(f(x0)) Df(x0) = A Df(x0) = 1X

since the derivative of a continuous linear map is itself. Now if F has a local left inverse G near x0 (F and G are both maps X → X) then G o F = G o A o f = 1 in a neighborhood of x0; thus defining g = G o A gives a local left inverse for f itself. Similarly, if B is a right inverse for Df(x0), we must shift coordinates slightly (B y0 need not be x0, so f o B itself may not even be defined near y0): put

F(y) = f(x0 + B(y - y0));   F(y0) = y0,   DF(y0) = Df(x0) B = 1Y;

and if G is a local right inverse for F near y0 (now F and G are maps Y → Y) then F o G = 1, that is f(x0 + B(G(y) - y0)) = y near y0, shows that g(y) = x0 + B(G(y) - y0) is a local right inverse for f. What we have done is reduce the problem of constructing a local left or right inverse for f, to that of constructing a local left or right inverse for a map F whose derivative is known to be the identity (on either X or Y, it works the same).
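
To see the left-inverse half of this reduction concretely, here is a Python sketch. The map f: R^2 → R^3 is an arbitrary choice of mine with injective derivative at x0, the pseudoinverse serves as the left inverse A, and jac is the same finite-difference helper as before. Composing with A flattens the derivative to the identity, which is all the reduction claims.

    import numpy as np

    def f(v):                      # an arbitrary map R^2 -> R^3 whose
        x, y = v                   # derivative at x0 is injective
        return np.array([x + y**2, y, x * y])

    def jac(F, p, eps=1e-7):
        cols = [(F(p + eps * e) - F(p)) / eps for e in np.eye(len(p))]
        return np.column_stack(cols)

    x0 = np.array([1.0, 2.0])
    A = np.linalg.pinv(jac(f, x0))       # a left inverse of Df(x0)
    F = lambda v: A @ f(v)               # F = A o f : R^2 -> R^2
    print(jac(F, x0))                    # approximately the 2x2 identity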

3.   Now let's adjust our notation a little bit to the simplified situation: we have a C1 function F: Z → Z, with F(x1) = y1 ∈ Z, and DF(x1) = 1Z. Here Z is either X or Y; x1 = x0 and y1 = Ay0 if we chose F = A o f to get a local left inverse for f, while x1 = y1 = y0 if we chose F(y) = f(x0 + B(y - y0)) to get a local right inverse for f. By the last paragraph, we are reduced to proving that in this case F has a local two-sided inverse at x1. To keep the notation light, for the rest of the proof we rename F to f (the original f will not be needed again). Any norm ||  || without a subscript is the norm on Z, ||  ||Z.

To get a local inverse we first need f to be locally injective near x1. Because Df(x1) = 1Z, and Df is continuous (f is C1), there must be a small neighborhood of x1 where Df is almost 1Z: choose δ > 0 such that

||Df(x) - 1Z||L(Z; Z) < 1/2     when   ||x - x1|| ≤ δ.

Suppose x and y are two points in this ball B(x1; δ). Then applying the mean value theorem to the function φ(x) = f(x) - x (φ rather than g, which we are saving for the inverse) gives

||f(y) - f(x) - (y - x)|| ≤ ||y - x|| sup_{0 < t < 1} ||Dφ(x + t(y - x))||L(Z; Z).

Since Dφ(x) = Df(x) - 1Z, and we just said that ||Df(x) - 1Z||L(Z; Z) < 1/2 for every point in B(x1; δ), what this says is that

||f(y) - f(x) - (y - x)|| ≤ ||y - x|| / 2,   i.e.,   ||f(y) - f(x)|| ≥ ||y - x|| / 2.

In particular, for x, y ∈ B(x1; δ), if x ≠ y then f(x) ≠ f(y). That is, f is locally injective near x1. This pattern of argument may seem complicated but is quite fundamental.
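
Numerically, the lower bound ||f(y) - f(x)|| ≥ ||y - x|| / 2 is easy to watch. In the Python sketch below I take Z = R^2 and f = identity plus a perturbation whose derivative has norm at most 0.2 < 1/2 everywhere (a toy example of my own); the worst observed ratio stays near 0.8, comfortably above 1/2.

    import numpy as np

    # identity plus a small perturbation: ||Df(x) - 1|| <= 0.2 < 1/2 for all x
    f = lambda v: v + 0.2 * np.array([np.sin(v[1]), np.cos(v[0])])

    rng = np.random.default_rng(1)
    worst = np.inf
    for _ in range(10000):
        x, y = rng.standard_normal(2), rng.standard_normal(2)
        worst = min(worst, np.linalg.norm(f(y) - f(x)) / np.linalg.norm(y - x))
    print(worst)    # about 0.8; never below 1/2, witnessing injectivity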

Now we can attempt to solve the equation f(x) = y for x, given y near y1; the local injectivity of f tells us that if we find one solution for x it's the only solution. We do this by iterative approximation. Fix y ∈ B(y1; δ/2), and define x2, x3, ... ∈ B(x1; δ) by

xk+1 = xk + y - f(xk).

We show by induction that ||xk+1 - xk|| < 2^-k δ, and consequently (by the triangle inequality) xk+1 ∈ B(x1; δ) for each k. First of all

||x2 - x1||Z = ||y - y1|| < 2^-1 δ,

and then by the mean value theorem inequality above,

||xk+1 - xk|| = ||(xk - f(xk)) - (xk-1 - f(xk-1))|| ≤ ||xk - xk-1|| / 2 < 2^-k δ.

But this tells us that {xk} is a Cauchy sequence, and since Z is complete there is a limit x ∈ B(x1; δ). By continuity of f, x is a fixed point of our iteration:

x = x + y - f(x),   i.e.,   f(x) = y.

So we have constructed a function g(y) = x, defined for ||y - y1|| < δ/2, which is a local inverse for f.

If you have recently studied metric spaces you may recognize that I have essentially repeated the proof of the contraction mapping theorem. The construction, not the theorem per se, is what's important.
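
The construction is entirely practical, too. Here is a Python sketch, reusing the perturbed-identity map from the previous aside (so Df stays within 0.2 of the identity everywhere, not just on a small ball); each step at least halves the error --- for this particular map it shrinks it by a factor of 0.2 --- and after a few dozen steps f(x) = y holds to machine precision. The helper local_inverse is my own name for the iteration.

    import numpy as np

    f = lambda v: v + 0.2 * np.array([np.sin(v[1]), np.cos(v[0])])

    def local_inverse(y, x1, steps=60):
        # solve f(x) = y by the iteration x_{k+1} = x_k + y - f(x_k)
        x = x1
        for _ in range(steps):
            x = x + y - f(x)
        return x

    x1 = np.array([0.0, 0.0])
    y  = f(x1) + np.array([0.05, -0.03])   # a target near y1 = f(x1)
    x  = local_inverse(y, x1)
    print(f(x) - y)                        # zero to machine precision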

4.   It remains to prove that g is actually C1 near y = y1 (it would be no good if our smooth function had a rough inverse). Choose two points y, y + k ∈ B(y1; δ/2), and write g(y) = x, g(y + k) = x + h. We know that f is differentiable at x, so that

k = f(x + h) - f(x) = Df(x)h + o(||h||).

What we really want is the reverse, where Dg(y) ought to be Df(x)^-1:

h = g(y + k) - g(y) = Df(x)^-1 k + o(||k||).

The first equation is equivalent to

h = Df(x)^-1 k - Df(x)^-1 o(||h||),   where   ||Df(x)^-1 o(||h||)|| ≤ ||Df(x)^-1||L(Z; Z) · o(||h||);

since we know ||Df(x)^-1||L(Z; Z) < 2 for every x ∈ B(x1; δ) (because ||Df(x) - 1Z||L(Z; Z) < 1/2, the operator Df(x) is invertible by the Neumann series, with ||Df(x)^-1|| ≤ 1/(1 - ||Df(x) - 1Z||) < 2), it suffices to prove that a function which is o(||h||) is also o(||k||). But again our mean value theorem relation gives

||k - h|| ≤ ||h|| / 2,   hence   ||h|| / 2 ≤ ||k|| ≤ 3||h|| / 2

which shows that h and k have the same asymptotic order near zero, thus that

g(y + k) - g(y) = Df(x)^-1 k + o(||k||),   i.e.,   Dg(y) = Df(g(y))^-1.

Since Df(g(y))^-1 is continuous in y (f is C1, g is continuous --- indeed Lipschitz, by the inequality ||h|| ≤ 2||k|| just proved --- and the inversion map T ↦ T^-1 is smooth), this shows that g is C1 near y1, which completes the proof.     ///
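
As a final numerical check (same toy map and finite-difference helper as in the earlier asides, all my own choices), one can difference the constructed inverse g and verify the formula Dg(y) = Df(g(y))^-1 directly:

    import numpy as np

    f = lambda v: v + 0.2 * np.array([np.sin(v[1]), np.cos(v[0])])

    def g(y, steps=80):            # the inverse produced by the iteration
        x = np.zeros(2)
        for _ in range(steps):
            x = x + y - f(x)
        return x

    def jac(F, p, eps=1e-6):
        cols = [(F(p + eps * e) - F(p)) / eps for e in np.eye(len(p))]
        return np.column_stack(cols)

    y = np.array([0.05, 0.15])
    print(jac(g, y) @ jac(f, g(y)))    # approximately the identity: Dg = Df^-1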


Reference: Lars Hörmander, The Analysis of Linear Partial Differential Operators, volume 1, Theorem 1.1.7. Springer-Verlag, 1983; second edition 1990.