Tagged integers are a technique (or the product of a technique) often used by implementers of interpreted languages to defray some of the high run-time cost of dynamic typing. It only helps with integers, but that can be a heck of a big help.

In a statically typed language, type information is bound to variables rather than to values. If p is declared as a pointer to FOO, that's all there is to the "FOO pointerness" of p; the actual value of p is thirty-two (or however many) zeroes and ones, and you can't be sure just by looking at the value itself whether it's your shoe size as an integer, the first four bytes of your shoe size expressed as a double-precision floating point number, the address of your shoe size expressed in some particular way (or a different one), or the address of the first character in your mother's maiden name spelled backwards in Unicode (or was that EBCDIC?).

In dynamically typed languages1, it's just the opposite: Variables have no type; it's their values that have type.

How do you implement such a type system? Your first thought is something like this: A variable is a pointer to a struct. The struct has a value and a type:

struct value_t {
    unsigned char type;

    //  Maybe a reference count? No, let's not worry  
    //  about garbage collection in this writeup.

    union {
        int i;
        double d;
        const char * str;
        struct hash_implementation * hash;
        struct array_implementation * ary;
        //  ...and so on.
    } v;
};

In C++, you might define an abstract base class with virtual functions and implement types as subclasses thereof, or you might do a COM-style "interface" thing so you can pass pointers to data objects into extensions in dynamically-linked libraries2 — and escape the one-base-class-to-rule-them-all straitjacket.

This is great in theory. It's simple and orthogonal. Trouble is, it's not very practical. When you add 2 + 3, you're dynamically allocating three objects on the heap, initializing them, and handing them over to your garbage collector, which will eventually chuck them out in yet another flurry of function calls. That's a lot of overhead, both cycles and memory. Dynamic typing is the Right Thing for some languages, even taking the cost into account, but you'd still like to do it as cheaply as you can.

This brings us, at last, to tagged integers.

First, let's consider the properties of a pointer p to an object no less than four bytes in size. That object will on any normal architecture be aligned on a four byte boundary. What that means is that p mod 4 equals zero, for any valid value of p.

Therefore: If your data objects are always at least four bytes long (which is easy to guarantee, since you're the one defining them), you can on most platforms safely assume that the two lowest-order bits of any valid data-object pointer will be zero3. Those two bits are, effectively, "out of band". They're up for grabs.

So here's what you do: Define a type which is just large enough to hold either a pointer to a garbage-collected data object, or a thirty-one bit integer left-shifted by one bit, with the lowest-order bit set to 1. That lowest-order bit is the "tag" in "tagged integer". Write some macros:

typedef unsigned long DATA_REFERENCE;

#define IS_DR_INT( dr )     (((DATA_REFERENCE)(dr)) & 1)
#define IS_DR_PTR( dr )     ( (dr) && ! IS_DR_INT( dr ) )

#define MAKE_DR_INT( n )    ( 1 | ( (n) << 1 ) )

/* Assume that GetGCObjIntValue( dr ) is a function which converts 
   arbitrary garbage-collected objects to integers. */

/* Note that we cast the tagged integer to int before right-shifting 
   it. This is because if you right-shift a signed int in C, you get 
   an arithmetic shift: The most significant bit is preserved as it 
   was, and that's the sign bit. When you right-shift an unsigned 
   int, that's a logical shift: A zero is shifted in on the 
   left and your sign bit bites the dust (I'm talking about 
   two's-complement integers here; I haven't thought about 
   one's complement). 

   Consider how the exact same bits can have two different meanings 
   depending on what we tell the compiler to pretend they are: C is 
   a statically-typed language. 
 */
#define DR_TO_INT( dr )     ( IS_DR_INT( dr ) \
                                ? ( ((int)dr) >> 1 ) \
                                : GetGCObjIntValue( (gc_object *)dr ) )

The above example is pretty simple. We only used one of our two spare bits for a tag, because we'd like to keep our integers as big as possible, but you do remember Huffman coding, don't you? Things don't all have to be the same size. We can do more with what we've got.

Let's stick with the same meaning for the zeroth bit: If it's nonzero, the other 31 bits are a 31-bit signed integer. However, if the zeroth bit is zero and the next higher bit is not zero, then it can't be an integer, and it still can't be a pointer. We've invented a third class of matter. You could use that for unique constants like NaN, boolean true and false, undefined (which in some dynamically typed languages is distinct from null), and so on. You could even rope off the lowest six or so bits of what's left as a secondary tag, and invent a whole range of different twenty-or-so-bit integer types: Relative addresses within your byte code, maybe, or whatever else floats your boat. A very specialized language might not altogether inconceivably find some use for an efficient 24-bit fixed point data type devoted to representing shoe sizes of dwarves.

Before you go too far overboard, you should consider just how deeply nested the ternaries are going to be in all those macros, but you'll have a tough time wasting all the cycles you've saved by keeping integers out of the garbage collector.

No, this is not an elaborate joke. People really do stuff like this. Forth and Smalltalk implementations often do it, GNU Guile does it4, Mozilla's JavaScript VM does it. Do a quick Google search for JSVAL_TAGBITS, if you're curious. Indeed, it appears 5 that Sun's SPARC processors have special instructions for addition and subtraction of tagged integers.






1 Not all languages fit neatly into this little taxonomy (as a general rule, very few interesting things fit neatly into any taxonomy; taxonomies are like that). C++ has optional run-time type information for class instances, but not for primitive types like int or double. In addition, C++ virtual functions offer limited late binding of functions (again, no primitive types need apply, because in C++ they don't have member functions).

Perl has some static typing: Mandatory Hungarian Line Noise forces a distinction between array variables, hash variables, and scalar variables at compile time (yeah, Perl 5 is compiled to byte code, and Perl 6 will be too). However, a scalar variable can hold a reference to a value of any type, and that type can be examined at run time. Scalars can also "hold" string values and a couple of different kinds of numeric valuesa. From there, things begin to get complicated, and the documentation is widely available inscribed on clay tablets in Linear B. An hour contemplating the Perl type system will send you howling back to the simple, comforting lucidity of Koenig lookup as fast as you can crawl.

a Just the usual int-or-double-where-appropriate thing. As a user, you can't generally tell them apart.

2 See name mangling and other issues. Many C++ features are implemented differently by different compilers. If you want to link binaries generated by compilers from different vendors, you'll have to stick to a small and rigidly-defined subset of C++'s full feature set. It turns out that vtables are common enough to be considered part of this subset for most practical purposes. Not coincidentally, it also turns out that Microsoft's so-called "Component Object Model", considered apart from its Byzantine support API, pretty much is that binary-compatible subset of C++. Least coincidentally of all, Mozilla's XPCOM is the same thing all over again, though rendered incompatibly different from COM by its own API stuff (pick a standard, any standard!) That very same subset happens to be perfectly compatible with old-fashioned C code, too. Where all this leads is that not everything about COM necessarily or completely sucks.

3 Platforms that don't enforce that kind of alignment can be dealt with in two ways: First, you can sullenly refuse to port to that platform, or just start talking about baseball when anybody mentions it. Second, you can write a custom allocator which gives you the alignment you want.

4 They call these things "immediates", for obvious reasons. They seem to use the bottom two or three bits for a tag (mod, like, eight, y'know): Anything ending in ...10 is a 30-bit int, anything ending in ...100 is a character or "immediate code" (?!), anything ending in ...000 is a pointer to a Scheme object on the heap. The following was posted to the Guile development mailing list on Thu, 16 May 2002 09:49:48 +0200 by somebody called Martin Grabmueller (http://mail.gnu.org/archive/html/guile-devel/2002-05/msg00047.html):

Immediates can be determined by checking the bits #1 and #2. If one of them is 1, it is an immediate, otherwise it is a heap pointer.

...

Immediate integers are 30-bit two's-complement integers (aka fixnums), encoded as immediates for performance reasons. Access to them is very fast, and creating them does not involve allocation of heap storage.

5 http://compilers.iecc.com/comparch/article/91-04-088
http://www.internet-encyclopedia.org/wiki.phtml?title=SPARC






Thanks to ariels for hassling me about the bit-shifting/sign-preservation/logical shift vs. arithmetic shift quagmire, which I had semi-wrong the first time around. If you want to know about shifting one's complement integers, I'm betting he's the guy to ask.