
In many computer languages, character strings and data buffers can theoretically be infinitely long (in practice the length is limited by the computer's available storage). This unlimited-length property is accomplished by using null termination - a special symbol, typically the single byte ASCII 0 (NUL), which marks the end of the string.

Integer values, on the other hand, are rather strictly limited to specific fixed lengths of 8, 16, 32, and on some platforms 64 bits. This sets 2^64 - 1 = 18446744073709551615 (about 18 billion billion) as the maximum possible built-in integer value. This is arguably a very large number, but for certain applications - cryptography being a perfect example - it is still inadequate. Ideally one would like a built-in, possibly even hardware-supported datatype which could, with the same efficiency as null-terminated strings, contain any size number from 0 to infinity (limited, of course, by available storage in practice).

Compound Number

One possible extensible representation might use a concatenation of two numbers: the first, A, would be a traditional fixed-length datatype indicating the number of bits (or bytes, or other length metric) spanned by the variable-length second, B, which contains the actual value being represented. Methods similar to this are in common use today. However, even when the prefix uses the maximum available 64-bit built-in representation, B still faces a hard limit at 2^(2^64). A number of that size would occupy the entire addressable memory of most modern computers, but null-terminated strings still win the race with infinity. Even very modest increases in the prefix size raise the limiting value tremendously, but the limit itself remains unless the prefix is unbounded.
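The compound scheme can be sketched in a few lines of Python. This is purely illustrative (the function names, the choice of a byte count for the prefix A, and little-endian byte order for B are my own assumptions, not part of the design above):

```python
import struct

def encode_compound(x):
    """Encode x as prefix A (a fixed 64-bit byte count) followed by value B."""
    b = x.to_bytes((x.bit_length() + 7) // 8 or 1, 'little')  # B: the value itself
    return struct.pack('<Q', len(b)) + b                      # A: how many bytes B spans

def decode_compound(buf):
    """Read back (value, bytes_consumed) from a compound encoding."""
    (length,) = struct.unpack_from('<Q', buf)                 # A tells us how long B is
    return int.from_bytes(buf[8:8 + length], 'little'), 8 + length
```

Note that A itself is the bottleneck: B can grow, but never beyond what a 64-bit count can describe.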

Let us not make the mistake of assuming that some arbitrarily huge number of today will be sufficient for the exponentially more powerful computers of tomorrow. Instead, let's adapt the null-terminated approach directly to numbers, and guarantee ourselves the ability to represent numbers of unlimited size, should the need arise.

Null-Terminated Number

Character strings are made up of a series of single-byte characters; each character is represented by a numeric value between 0 and 255. Binary numbers can similarly be broken down into chunks of equal size. These chunks can be as small as one bit, or they can be arbitrarily large. A chunk n bits wide can take on values between 0 and 2^n - 1; one of these values must be used as the terminating symbol, T. We could choose any value we like for T, but let's use binary zero, for simplicity, clarity, and consistency with the existing NUL=0 convention.

  • If n is 1, the set of possible symbols is simply 0 and 1, and T is 0.
  • If n is 2, the set of possible symbols is 00, 01, 10, and 11, and T is 00.
  • If n is 3, the set is 000, 001, 010, 011, 100, 101, 110, and 111, and T is 000.
  • ... And so on.

Let's use an example 16-bit binary number x = 1011 0101 1100 0001. To maintain the parallels with string values and addressing, I will be using the convention that individual bits are read from and written to a continuous, bitwise-addressable memory, ordered left to right in increasing magnitude. The leftmost bit is address 0 and has a value of 1, and the rightmost bit is address 15 and has a value of 32768 (in other words, little-endian byte order and bit order - the way it SHOULD be! but I digress...). Our first stab at a null-terminated representation of x, which we will call y, will be obtained naively by simply appending our chosen T to the binary representation of x. Here's the result for various choices of n:

  • n = 1, y = 1011010111000001 0
  • n = 2, y = 1011010111000001 00
  • n = 3, y = 1011010111000001 000
  • n = 4, y = 1011010111000001 0000

We immediately observe that there is a problem with this approach - the binary representations of many valid numbers contain sequences of repeating zeroes. Choosing a different terminating symbol won't help; no matter what symbol we assign to T, there will be valid numbers which contain that bit pattern.

Null-terminated data buffers experience the same problem. Perfectly valid data contained in such a buffer might include many instances of NUL. The solution is to 'escape' these occurrences - a method wherein additional symbols are inserted to differentiate between a data 0 and a terminating NUL. We can use a similar strategy here.

Every time a chunk of x matches T, we will need to mark it as a data chunk in y. The real terminating symbol will be marked differently. Since there are only two possibilities, we can append a single bit to the ambiguous chunk to accomplish this - we'll use 1 to indicate a data chunk, and 0 to indicate the terminator. Here is the revised strategy, with chunks separated for readability.

  • n = 1
    x = 1 0 1 1 0 1 0 1 1 1 0 0 0 0 0 1
    y = 1 01 1 1 01 1 01 1 1 1 01 01 01 01 01 1   00
  • n = 2
    x = 10 11 01 01 11 00 00 01
    y = 10 11 01 01 11 001 001 01   000
  • n = 3
    x = 101 101 011 100 000 100 (padding out irregular chunk with zeros)
    y = 101 101 011 100 0001 100   0000
  • n = 4
    x = 1011 0101 1100 0001
    y = 1011 0101 1100 0001   00000
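The escaping strategy above can be sketched as a small encoder. This is a hedged illustration (the function name and space-separated output format are my own); it follows the article's convention that the leftmost bit of each chunk is the least significant, so the article's x = 1011 0101 1100 0001 is 33709 in conventional notation:

```python
def encode_nullterm(x, n):
    """Encode non-negative integer x with chunk size n.

    Each chunk is written least-significant-bit first.  An all-zero
    data chunk is escaped with a trailing 1; the terminator is an
    all-zero chunk followed by a 0.
    """
    chunks = []
    while x > 0:
        c = x & ((1 << n) - 1)                         # low n bits of x
        bits = format(c, '0%db' % n)[::-1]             # reversed: LSB leftmost
        chunks.append(bits + '1' if c == 0 else bits)  # escape all-zero data chunks
        x >>= n
    chunks.append('0' * (n + 1))                       # terminator: n zeros plus a 0
    return ' '.join(chunks)
```

For example, encode_nullterm(33709, 2) reproduces the n = 2 line above: '10 11 01 01 11 001 001 01 000'. The shift-based loop also handles the padding of the irregular final chunk automatically.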

This strategy is unambiguous, is compatible with two's complement signed values using sign extension, and satisfies our requirement for representing unlimited values.

Practical Considerations

The examples above are very simple for the sake of clarity. A practical implementation would probably enforce byte-aligned chunks by using a larger 8-tuple chunk size, to take advantage of available hardware optimizations. Also, it would probably use full-byte control codes following an inline T - this would prevent single-bit offsets from spoiling the byte alignment, and provide the flexibility to define a sophisticated system of escaped control codes ('chaining' a single value across non-contiguous memory areas is one application for advanced control codes that comes to mind; I am sure there are others). Because of the overhead implicit in checking control codes, a null-terminated datatype would be best suited for use as, for example, the relatively small prefix A in the compound number design described earlier. The much larger value B would have the ordinary flat binary representation which could be processed with far less overhead.

The only question remaining is: is it actually useful for anything? One possible advantage to such a datatype, assuming all the requisite arithmetic operations were implemented for it, is that it could never overflow. The same concepts could very easily be applied to create an infinite-precision floating-point datatype that would similarly never underflow. Lastly, such a datatype would be crucial in navigating an infinitely large address space. You are invited to suggest your own uses.

I feel compelled to give a little explanation of the mechanism for reading a null-terminated number. Since I had to think for a moment about whiteire's writeup before it made sense (and it does), I think other people can benefit from my reflections.

We are given two pieces of information: n, which is the size of a chunk, and a long sequence of zeroes and ones. Our job is to determine what the numbers in this long sequence are - the whole point being that we want to be able to tell when one number ends and another begins. All we can do with the stream of zeros and ones is read one bit at a time; once we have read it, we proceed to the next. Remember that we are reading our bits as part of a number, from least significant bit to most significant bit, all the time keeping a sharp lookout for the terminating symbol.

The simple algorithm for reading a single number from the binary sequence proceeds as follows:

  1. Read n bits. If the n bits are all 0, then proceed to step 2. Otherwise, record the n bits as part of the desired number. Repeat step 1.
  2. Read one more bit. If this bit is 1, then record n zero bits. Return to step 1. Otherwise, this bit is 0, and we have found the end of this number. Stop, and output the number.
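The two steps above can be sketched directly in Python (the function name is my own; the stream is any sequence of '0'/'1' characters, with the leftmost bit least significant, as throughout this writeup):

```python
def read_nullterm(stream, n):
    """Read one null-terminated number; return (value, bits_consumed)."""
    bits = iter(stream)
    value, shift, consumed = 0, 0, 0
    while True:
        chunk = ''.join(next(bits) for _ in range(n))  # step 1: read n bits
        consumed += n
        if chunk != '0' * n:
            value += int(chunk[::-1], 2) << shift      # record the data chunk
            shift += n
            continue
        consumed += 1                                  # step 2: read one more bit
        if next(bits) == '1':
            shift += n                                 # escaped: n zero data bits
        else:
            return value, consumed                     # terminator found; stop
```

Running this on the 23-bit stream from the trace below, read_nullterm('11001101010010001010110', 2), yields the value 2387 after consuming 16 bits, leaving the trailing bits untouched for whatever comes next.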

Now, example! Everyone loves examples. Let us suppose that n = 2 and that we are reading the following stream:

  A B C D E F G H I J K L M N O P Q R S T U V W
  1 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0

I have labelled the bits with consecutive letters of the Roman alphabet for clarity. Bit K is 0 and bit H is 1. Let's begin. Remember, our chunks consist of two bits. I have written below in all its ASCII glory what the algorithm would do.

Our number so far    | Bits still to be read | Next operation:
                     |ABCDEFGHIJKLMNOPQRSTUVW| Read two bits.
AB                   |CDEFGHIJKLMNOPQRSTUVW  | No 00 found.
11                   |001101010010001010110  | Read two bits.
AB CD                |EFGHIJKLMNOPQRSTUVW    | 00 found.
11 00                |1101010010001010110    | Read next bit: E
AB CD                |FGHIJKLMNOPQRSTUVW     | Bit E was 1.  Continue
11 00                |101010010001010110     | Read two bits
AB CD FG             |HIJKLMNOPQRSTUVW       | No 00 found.
11 00 10             |1010010001010110       | Read two bits.
AB CD FG HI          |JKLMNOPQRSTUVW         | No 00 found.
11 00 10 10          |10010001010110         | Read two bits.
AB CD FG HI JK       |LMNOPQRSTUVW           | No 00 found.
11 00 10 10 10       |010001010110           | Read two bits.
AB CD FG HI JK LM    |NOPQRSTUVW             | No 00 found.
11 00 10 10 10 01    |0001010110             | Read two bits.
AB CD FG HI JK LM NO |PQRSTUVW               | 00 found.
11 00 10 10 10 01 00 |01010110               | Read next bit: P
AB CD FG HI JK LM NO |QRSTUVW                | Bit P was 0.  STOP
11 00 10 10 10 01 00 |1010110                |

The algorithm has finished. The null-terminated number we read from the sequence in binary is 11001010100100, which read from left to right in decimal is


            A       B       F       H       J       M  
            1*2^0 + 1*2^1 + 1*2^4 + 1*2^6 + 1*2^8 + 1*2^11

or 1 + 2 + 16 + 64 + 256 + 2048 = 2387. Observe that bits Q through W are still in our sequence. They may form part of another number, or they may be some other sort of data altogether. We don't care. We have already found the number we were searching for, thanks to the null-terminated encoding of our numbers.
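The arithmetic above can be double-checked mechanically: since the leftmost bit is least significant, reversing the bit string gives the conventional binary representation. A quick Python sanity check:

```python
bits = '11001010100100'     # the data chunks, leftmost bit least significant
value = int(bits[::-1], 2)  # reverse, then parse as conventional binary
print(value)                # 2387
```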

This procedure is meant for humans who want to write down the number (in binary) when all they have is a long sequence of zeros and ones being shot at them, one bit at a time. As far as the practical applications that whiteire mentions go, this algorithm as stated is probably useless for computers; machines would have to build their arithmetic and hardware around the specification of null-terminated numbers. The whole point is that it can be done, very efficiently, and with no theoretical limit, beyond hardware constraints, on the size of numbers representable this way.

The only source of redundant information that could be avoided by traditional methods or the compound numbers that whiteire mentions is one extra bit each time n zeros appear in a single chunk. Since this only happens, on average, once every 2^n chunks we read, a random number made up of c chunks ends up wasting about c/2^n bits of space. For large n (n = 8 sounds about right) and not too many chunks, the amount of wasted space with this method is not much.
