abstract syntax - Everything2.com

Building a compiler is a lot of hard work, with many design decisions and tradeoffs. One problem in particular is as follows: plain text in a source code file is not very easy for a computer to manipulate at all. For one thing, it's got lots of stuff like comments and whitespace that mean nothing to a computer. For another, source code is generally written in such a way that it's easy for the programmer to understand, not the computer. For a comparison of terms, what you actually type into your editor and save as source code is called the concrete syntax.

One step in compiling a program is breaking up the source code into tokens, a process variously called lexing or tokenizing. But all this does is take the textual representation of symbols, numbers, and strings and convert them into some sort of internal representation. For example, the string "x + 5" might get turned into the internal representation ID(x) PLUS NUMBER(5). We've converted it into another form, but it's still just the same basic thing. The computer will have just as hard a time dealing with that special representation as it would with the string you typed up.

Enter the parser. It takes all those tokens and makes sure that the order they come in obeys the rules of the programming language. You're probably wondering what all this has to do with the title of the node - abstract syntax. Well, as a side effect of the parser checking the token stream against the language's rules, it creates what is known as an abstract syntax tree. This tree is very special. It's yet another intermediate representation of the actual source code, but it doesn't have any of those handy things programmers like (comments, white space, parentheses). It means the same thing, but it's been boiled down to just the essence of the original text.

Why? Well, abstract syntax is easier for the computer to handle. It's not a big text string. It's not a big text string represented as a bunch of tokens. The abstract syntax is a tree structure. The tree represents the structure of the program in such a way that later stages of the compiler can understand and easily manipulate it, but still keeps the same meaning. See, we've got a lot of complicated things to do later on in the compiler. It's helpful to have the source code represented in a nice tree structure. Programmers like trees.

Now you wouldn't want to program directly in abstract syntax. That'd be weird. The effect is that you end up with a program that has a very high amount of symbols explaining the structure compared to the amount of actual program text. For example, what if we were to use a bunch of parentheses to show the structure of our program? If this sounds familiar to you, then you've programmed in LISP, Scheme, or similar languages. The reason they've got all those parentheses everywhere is because you write directly in the abstract syntax.

Allow me to show an example of what a very short program may look like in its abstract syntax form. My compilers class is developing a compiler in ML for the Tiger language, which was designed by Andrew Appel. It's sort of a C/Pascal/ML hybrid syntax. Here's a very short Tiger program (it looks stupid because it's a test case for order of operations):

let
   var a := 47
in
   a + 15 + a - 1
end

And once we've run that through the first few stages of the compiler, we end up with the abstract syntax tree. Now, the abstract syntax tree doesn't really look like anything. However, if we walked the tree and generated a little bit of text at each node for debugging purposes, it might look something like this:

LetExp([
 VarDec(a,true,NONE,
  IntExp(47))],
 OpExp(MinusOp,
  OpExp(PlusOp,
   OpExp(PlusOp,
    VarExp(
     SimpleVar(a)),
    IntExp(15)),
   VarExp(
    SimpleVar(a))),
  IntExp(1)))

Like I said, it looks weird and wouldn't be any fun to program directly in. However, that means something to the compiler. You can see the tree structure by the levels of indentation. The top level LetExp is the root node of our tree, and it has a VarDec as one of its children and an OpExp as another child. These in turn are composed of other small parts. There are many, many other ways to generate textual representations of an abstract syntax tree. This just happens to be the one chosen for my project.

Software is free speech	Syntactic ambiguity	Lex	compiler
side effect	ASN.1	Kubla Khan	parser
X.509	NASM	Syntax	tiger
source code