unicity distance

In cryptography, we assume that an intruder knows what method has been used. So the only remaining unknown is the key. The usual way of preventing this key from being found is to create an encryption system which has such a large number of keys that it becomes impossible to try them all out.

But suppose it were possible to brute force through any number of keys. What would happen then? Consider the Julius Caesar cipher, which you get by shifting each letter by k places (so if k is 5, then a becomes f and y becomes d). In this cipher, we might encode a word and end up with ixm. By trying out the 26 possible keys, and assuming that the original plaintext is in English, we can eliminate the various kzo's and jym's, to be left with three possible original words: ape, pet and lap. So in this instance, we were not able to find the key and break the cipher. If we take a longer text, however, such as jgnnq, the only one of the possible plaintexts which makes sense is hello, so k was 2 and the code has been broken.

Now, take a slightly more complicated cipher, such as monoalphabetic substitution (each letter is replaced by another letter of the alphabet). In this case, ixm has around 500 meaningful decryptions (almost all English three-letter words). If we take a longer fragment, such as yjb bfd xb ubb we are reduced to only two reasonably likely decryptions: the end we see and men not in ann. Clearly, the longer the fragment, the less it is likely that there be more than one meaningful decryption. Once a cryptogram reaches that length, a brute force mechanism will be able to crack the code, given time. This length is the cipher's unicity distance.

How can we estimate this length?

First, we need a measure of the number of meaningful decryptions of an n-letter long ciphertext. This is the number of n-letter sequences in the target language which are grammatically correct. As we will see later, it is useful to express this in terms of entropy. If a language has a given entropy per character of H_lang, then the number of legal n-letter strings is 2^nH_lang. Furthermore, the number of possible strings (whether legal or not) is |C|ⁿ, where |C| is the size of the character set. Now, suppose we try a decryption of a ciphertext with some random key. Because we could end up with any of the |C|ⁿ strings of length n and because the number of those strings which are meaningful is 2^nH_lang, the probability of getting a meaningful decoding is:

2^nH_lang      2^nH_lang
------- = --------- = 2^{n(H_lang - log₂|C|)}
 |C|ⁿ     2^{n log₂|C|}

If we have a keyspace of entropy H_K, the number of keys is 2^H_K. For each of these keys, we know the probability of getting a meaningful decryption, so we can compute the expected number of meaningful decryptions as

2^H_K2^{n(H_lang - log₂|C|)} = 2^{n(H_lang - log₂|C|+1/nH_K)}

For this value to be approximately 1 (ie to have only one meaningful decryption and break the code), we need to have

n(H_lang - log₂|C| + 1/n H_K)= 0.

We can solve this for n, which will be the unicity distance of the cipher:

         H_K 
n = ------------
    log₂|C| - H_lang

A few examples

So, assuming that the entropy of English is about 1.5, and using a set of 26 characters, the lower term of the fraction is about 3.2 for any cipher. Then, using a Caesar cipher, which has 26 keys, we see that the unicity distance is about two characters. For polyalphabetic substitution, the keyspace entropy is about 88 (there are ~2⁸⁸ keys), yielding a unicity distance of 28 characters. Any ciphertext which is that length or longer probably only has one meaningful decryption.

This notion allows us to determine ciphers which are theoretically unbreakable, such as the one time pad. Simply put, in the above equation, the size of the keyspace was finite and independent of n. If, for a message of length n, we take a key of length n, we see that the number of possible keys is infinite; H_K then become infinitely large and so does the unicity distance. A brute force method will yield all strings of length n, including all legal strings.

This notion is really purely theoretical. It deals with the possibility of a ciphertext-only attack. In practice, such an attack is computationally very difficult and there are therefore several much more promising channels to be followed in order to compromise a cryptosystem, such as trying to discover the key either by intercepting it or by finding out where it is physically stored. While, in theory, infinite unicity distance guarantees unbreakability of the code, in practice, it does not allow for human error and, what is more, is not needed to provide a cryptosystem which is computationally unbreakable.

How to dispose of a Bible	Everything2 Crypto Project	Cryptography	entropy
One Time Pad	Caesar's Cipher	information theory	Eating a live sea urchin while it's still squirming
Claude Shannon	Driving in the Mountains	Cryptomnesia	getting fit
side-channel attack	keyspace	Caesar shift cipher	monoalphabetic substitution cipher
substitution cipher	Log	imprison	cipher
Purgatory	expectation	DES

How can we estimate this length?

A few examples

Recommended Reading

About Everything2

User Picks

Editor Picks

New Writeups

Login
Password

unicity distance

How can we estimate this length?

A few examples

Sign In

Recommended Reading

About Everything2

User Picks

Editor Picks

New Writeups