Data flow analysis is a method of approximating a program's runtime behaviour at compile time, usually used by the compiler to check for programming errors or to find potential optimizations. The analysis is based on the control flow graph of the program (or function); the idea is to recursively annotate each node with elements of a lattice, representing (inexact) facts about the program's execution that can be determined statically. In a forwards analysis, we define a function combining the states of the predecessors of a node, and another function to find the state at a node in terms of the program statement it represents and the combined states of its predecessors. We also need to define the state at the entry node of the graph. In a backwards analysis, we do pretty much the same thing, but in the other direction, starting from the control flow graph's exit node, and combining each node's successors' states.

Suppose, for a minute, that you're a compiler. You have a pretty tedious life, and want to do some analysis to spice things up. What kind of facts might you be interested in? Well, you might want to keep track of the possible values a variable may take; if you see that it turns out to be constant, you might want to inline its value wherever it is used. Or, you might want to perform a backwards analysis keeping track of live variables, which will certainly be read before they are next updated. If you discover that a variable is not live in a particular block of code, you need not keep memory hanging around for it.

The combining function is written as JOIN(v), where v is a node of the control flow graph, corresponding to a statement of the program in question. The state function at a particular node v is written 〚v〛, but not all fonts have those double-square-bracket characters, so I'll write [[v]] for the purposes of this node.

JOIN(v) is going to be either the meet () or join () of the lattice you're working in, applied to the states of v's predecessors (or successors, in a backwards analysis; some people refer to this function as SPLIT in that case). It's usually pretty obvious which one to use. If a given property must hold at any predecessor to hold at this node, then use ∧; conversely, in a may analysis a property need only hold at one predecessor to propagate to this node, and ∨ is your poison.

A straightforward example

I mentioned earlier analysing the possible values of variables, but let's consider a slightly simpler case: keeping track of which variables have definitely been initialized. This is a forwards analysis, since the fact at each node depends on the facts of the previous nodes in the program. The lattice we're working is the power set of the set of all variables in the program; that is, the fact at each node is a subset of the program's variables. The necessary definitions pretty much write themselves. At the start of the program, no variables have been initialized:

[[ entry ]] := {}

This is a must analysis, since a variable must be initialized on all paths to a given statement to be guaranteed to be initialized at that statement. So, we need to take the intersection (the meet of this lattice) of the sets of variables known to be defined at a node's predecessors:

JOIN(v) := ⋂{ [[w]] | wpred(v) }

At an assignment to a variable, we add that variable to the set of identifiers. So if v corresponds to a statement id = E; for some identifier id and some expression E, we have:

[[ v ]] := JOIN(v) ∪ { id }

At any other statement, we just propagate the set of initialized variables forward:

[[ v ]] := JOIN(v)

And we're done! Let's take a look at a ridiculous example program, and its beautiful ASCII-art control flow graph:

int x;
int y = 17;
int z = 0;
while (y > 0)
{
    z = z + (y * x);
    x = 3;
    y = y - 1;
}

                               +-------+
                               | entry |
                               +-------+
                                   |
                                  \|/
+-----------+  +------------+  +-------+
| int z = 0 |<-| int y = 17 |<-| int x |
+-----------+  +------------+  +-------+
  |    +--------------------------------------------+
 \|/  \|/                                           |
+--------+  +-----------------+  +-------+  +-----------+
|  y > 0 |->| z = z + (y * x) |->| x = 3 |->| y = y - 1 |
+--------+  +-----------------+  +-------+  +-----------+
    |
   \|/
+------+
| exit |
+------+

The naïve method of finding the fixed point of this analysis is to initialize every node's data to {}, then repeatedly evaluate [[ ]] at each node until nothing changes. Whatever technique is used, we converge to:

int x;                  // {}
int y = 17;             // {y}
int z = 0;              // {y, z}
while (y > 0)           // {y, z}
{
    z = z + (y * x);    // {y, z}
    x = 3;              // {x, y, z}
    y = y - 1;          // {x, y, z}
}

(Pretty much the only part of this analysis worth pointing out is that, because we're taking an intersection in JOIN, [[ y > 0 ]] does not contain x, despite x being in the set of variables certainly defined at one of its predecessors. This is correct: we are interested in whether a variable must be defined at each node, not whether it may be defined.) With these annotations, your compiler can now raise a warning that x is used uninitialized in the loop's body, and you can fix the bug!