In many computer languages, character strings and data buffers can theoretically be infinitely long (in practice the length is limited by the computer's available storage). This unlimited-length property is accomplished by using null termination - a special symbol, typically the single byte ASCII 0 (NUL), which marks the end of the string.
Integer values, on the other hand, are rather strictly limited to specific fixed lengths of 8, 16, 32, and on some platforms 64 bits. This sets 2^64 - 1 = 18446744073709551615 (about 18 billion billion) as the largest value a built-in integer can represent. This is arguably a very large number, but for certain applications, cryptography being a perfect example, it is still inadequate. Ideally one would like a built-in, possibly even hardware-supported datatype which could, with the same efficiency as null-terminated strings, contain any size number from 0 to infinity (limited, of course, by available storage in practice).
Compound Number
One possible extensible representation might use a concatenation of two numbers: the first, A, would be a traditional fixed-length datatype indicating the number of bits (or bytes, or other length metric) spanned by the variable-length second, B, which contains the actual value being represented. Methods similar to this are in common use today. However, even when the prefix uses the maximum available 64-bit built-in representation, B still faces a hard limit at 2^(2^64). This huge number would occupy the entire addressable memory of most modern computers, but the null-terminated strings still win the race with infinity. Even very modest increases in the prefix size increase the limiting value tremendously, but the fact of that limit remains unless the prefix itself is unbounded.
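To make the compound design concrete, here is a minimal Python sketch (the function names and the one-byte default prefix width are my own illustrative choices, not an established format):

```python
# A minimal sketch of the compound representation, assuming a one-byte
# length prefix A that counts value bytes; all names are illustrative.
# With a 1-byte prefix, B tops out at 255 bytes, i.e. values below
# 2^2040 - enormous, but still a hard limit.

def encode_compound(value: int, prefix_bytes: int = 1) -> bytes:
    """Encode a non-negative int as <length prefix><little-endian value>."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    body = value.to_bytes((value.bit_length() + 7) // 8 or 1, "little")
    if len(body) >= 2 ** (8 * prefix_bytes):
        raise OverflowError("value too large for the chosen prefix width")
    return len(body).to_bytes(prefix_bytes, "little") + body

def decode_compound(data: bytes, prefix_bytes: int = 1) -> int:
    length = int.from_bytes(data[:prefix_bytes], "little")
    return int.from_bytes(data[prefix_bytes:prefix_bytes + length], "little")

assert decode_compound(encode_compound(2**100)) == 2**100
```

Round-tripping 2^100 works fine, but any value needing more than 255 body bytes trips the OverflowError - exactly the kind of hard limit the null-terminated approach below is meant to remove.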
Let us not make the mistake of assuming that some arbitrarily huge number of today will be sufficient for the exponentially more powerful computers of tomorrow. Instead, let's adapt the null-terminated approach directly to numbers, and guarantee ourselves the ability to represent numbers of unlimited size, should the need arise.
Null-Terminated Number
Character strings are made up of a series of single-byte characters; each character is represented by a numeric value between 0 and 255. Binary numbers can similarly be broken down into chunks of equal size. These chunks can be as small as one bit, or they can be arbitrarily large. A chunk n bits wide can take on values between 0 and 2^n - 1; one of these values must be used as the terminating symbol, T. We could choose any value we like for T, but let's use binary zero, for simplicity, clarity, and consistency with the existing NUL=0 convention.
- If n is 1, the set of possible symbols is simply 0 and 1, and T is 0.
- If n is 2, the set of possible symbols is 00, 01, 10, and 11, and T is 00.
- If n is 3, the set is 000, 001, 010, 011, 100, 101, 110, and 111, and T is 000.
- ... And so on.
Let's use an example 16-bit binary number x = 1011 0101 1100 0001. To maintain the parallels with string values and addressing, I will be using the convention that individual bits are read from and written to a continuous, bitwise-addressable memory, ordered left to right in increasing magnitude. The leftmost bit is address 0 and has a value of 1, and the rightmost bit is address 15 and has a value of 32768 (in other words, little-endian byte order and bit order - the way it SHOULD be! but I digress...). Our first stab at a null-terminated representation of x, which we will call y, will be naively obtained by simply appending our chosen T to the binary representation of x. Here's the result for various choices of n:
- n = 1, y = 1011010111000001 0
- n = 2, y = 1011010111000001 00
- n = 3, y = 1011010111000001 000
- n = 4, y = 1011010111000001 0000
We immediately observe that there is a problem with this approach - the binary representations of many valid numbers contain sequences of repeating zeroes, so a reader scanning y cannot tell whether a run of zeroes is part of the value or the terminator. Choosing a different terminating symbol won't help; no matter what symbol we assign to T, there will be valid numbers which contain that bit pattern.
Null-terminated data buffers experience the same problem. Perfectly valid data contained in such a buffer might include many instances of NUL. The solution is to 'escape' these occurrences - a method wherein additional symbols are inserted to differentiate between a data 0 and a terminating NUL. We can use a similar strategy here.
Every time a chunk of x matches T, we will need to mark it as a data chunk in y. The real terminating symbol will be marked differently. Since there are only two possibilities, we can append a single bit to the ambiguous chunk to accomplish this - we'll use 1 to indicate a data chunk, and 0 to indicate the terminator. Here is the revised strategy, with chunks separated for readability.
- n = 1
  x = 1 0 1 1 0 1 0 1 1 1 0 0 0 0 0 1
  y = 1 01 1 1 01 1 01 1 1 1 01 01 01 01 01 1 00
- n = 2
  x = 10 11 01 01 11 00 00 01
  y = 10 11 01 01 11 001 001 01 000
- n = 3
  x = 101 101 011 100 000 100 (padding out the irregular final chunk with zeros)
  y = 101 101 011 100 0001 100 0000
- n = 4
  x = 1011 0101 1100 0001
  y = 1011 0101 1100 0001 00000
This strategy is unambiguous, compatible with two's complement signed values using sign extension, and satisfies our requirement for representing unlimited values.
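For concreteness, here is a small Python sketch of the encoding just described (the bit-string representation and function names are mine, chosen to mirror the worked examples; this is one possible implementation, not the only one):

```python
# A sketch of the null-terminated number encoding. Bit strings follow
# the article's convention: the leftmost character is the lowest-order
# bit. A data chunk equal to T (all zeros) is followed by marker bit 1;
# the true terminator is T followed by marker bit 0.

def to_bits_le(x: int, n: int) -> str:
    """Little-endian bit string of x, padded to a multiple of n bits."""
    bits = ""
    while x:
        bits += str(x & 1)
        x >>= 1
    while len(bits) % n:          # pad out an irregular final chunk
        bits += "0"
    return bits

def encode(x: int, n: int) -> str:
    bits = to_bits_le(x, n)
    out = []
    for i in range(0, len(bits), n):
        chunk = bits[i:i + n]
        out.append(chunk)
        if chunk == "0" * n:      # escape a data chunk that matches T
            out.append("1")
    out.append("0" * n + "0")     # the real terminator: T, then marker 0
    return "".join(out)

def decode(y: str, n: int) -> int:
    x, pos, mag = 0, 0, 0
    while True:
        chunk, pos = y[pos:pos + n], pos + n
        if chunk == "0" * n:
            marker, pos = y[pos], pos + 1
            if marker == "0":                 # terminator reached
                return x
        else:
            x += int(chunk[::-1], 2) << mag   # chunk bits are little-endian
        mag += n

x = 0b1000001110101101  # the example value (33709), written MSB-first here
assert encode(x, 4) == "1011" "0101" "1100" "0001" "00000"
for n in (1, 2, 3, 4, 8):
    assert decode(encode(x, n), n) == x
```

The escape costs one extra bit per zero-valued chunk, and the terminator costs n + 1 bits, so a larger n trades per-zero overhead for a longer terminator.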
Practical Considerations
The examples above are very simple for the sake of clarity. A practical implementation would probably enforce byte alignment by using a full 8-bit chunk size, to take advantage of available hardware optimizations. It would also probably use full-byte control codes following an inline T - this would prevent single-bit offsets from spoiling the byte alignment, and provide the flexibility to define a sophisticated system of escaped control codes ('chaining' a single value across non-contiguous memory areas is one application for advanced control codes that comes to mind; I am sure there are others). Because of the overhead implicit in checking control codes, a null-terminated datatype would be best suited for use as, for example, the relatively small prefix A in the compound number design described earlier. The much larger value B would keep the ordinary flat binary representation, which can be processed with far less overhead.
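As a rough illustration of the byte-aligned idea, here is one possible Python variant, assuming 8-bit chunks and a full control byte after each inline NUL. The control values are invented for this sketch - 0x01 for "the preceding NUL was data" and 0x00 for "end of number" - leaving 254 spare codes free for extensions like the chaining mentioned above:

```python
# Hypothetical byte-aligned variant: 8-bit chunks, with a full control
# byte (instead of a single marker bit) after every inline 0x00.
DATA, END = 0x01, 0x00            # illustrative control codes

def encode_bytes(x: int) -> bytes:
    body = x.to_bytes((x.bit_length() + 7) // 8 or 1, "little")
    out = bytearray()
    for b in body:
        out.append(b)
        if b == 0x00:             # escape an inline NUL as a data byte
            out.append(DATA)
    out += bytes([0x00, END])     # terminator: NUL plus END control code
    return bytes(out)

def decode_bytes(data: bytes) -> int:
    body, i = bytearray(), 0
    while True:
        b, i = data[i], i + 1
        if b == 0x00:
            code, i = data[i], i + 1
            if code == END:       # real terminator
                return int.from_bytes(body, "little")
            # otherwise (code == DATA) the NUL was an ordinary value byte
        body.append(b)

assert decode_bytes(encode_bytes(2**64)) == 2**64
```

Spending a whole byte on each escape costs more than the single-bit marker, but everything stays byte-aligned, and the unused control values are exactly where a more sophisticated escape system would live.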
The only question remaining is: is it actually useful for anything? One possible advantage to such a datatype, assuming all the requisite arithmetic operations were implemented for it, is that it could never overflow. The same concepts could very easily be applied to create an infinite-precision floating-point datatype that would similarly never underflow. Lastly, such a datatype would be crucial in navigating an infinitely large address space. You are invited to suggest your own uses.