display | more...

Let's see why.

Suppose we (Bob) want to send to Alice over the Atlantic the super secret PHP Manual which is 4.675.494 bytes long. We have the freeware gpg which we will use for encryption. It implements most of the most known encryption algorithms including the new AES (Rijndael). It is obvious that files that are human readable have a great amount of redundancy, since human language is highly redundant (that's why you can sometimes hear half a sentence and fully understand it). Redundancy is easy to find out, using a compression tool, like bzip2. The higher the compression factor, the higher the redundancy. Let's see the manual's byte-frequency analysis:

dogganos@postmortem:~/temp/testbed$analyse_file PHP_Manual.html 1
data size is 4675494 bytes.

9 -> 0.025 %
10 -> 6.309 % ######
32 -> 10.766 % ##########
34 -> 2.801 % ##
35 -> 0.622 %
36 -> 0.168 %
37 -> 0.050 %
38 -> 0.755 %
39 -> 0.074 %
40 -> 0.414 %
41 -> 0.414 %
42 -> 0.020 %
43 -> 0.012 %
44 -> 0.399 %
45 -> 0.467 %
46 -> 0.829 %
47 -> 2.158 % ##
48 -> 0.477 %
49 -> 0.592 %
50 -> 0.431 %
51 -> 0.344 %
52 -> 0.176 %
53 -> 0.076 %
54 -> 0.231 %
55 -> 0.053 %
56 -> 0.108 %
57 -> 0.055 %
58 -> 0.079 %
59 -> 0.876 %
60 -> 4.077 % ####
61 -> 1.422 % #
62 -> 4.077 % ####
63 -> 0.025 %
65 -> 2.003 % ##
66 -> 0.760 %
67 -> 1.081 % #
68 -> 1.412 % #
69 -> 1.323 % #
70 -> 0.434 %
71 -> 0.255 %
72 -> 0.605 %
73 -> 1.079 % #
75 -> 0.024 %
76 -> 1.347 % #
77 -> 0.315 %
78 -> 0.580 %
79 -> 0.404 %
80 -> 1.280 % #
81 -> 0.034 %
82 -> 0.659 %
83 -> 1.494 % #
84 -> 1.543 % #
85 -> 0.130 %
86 -> 0.473 %
87 -> 0.091 %
88 -> 0.046 %
89 -> 0.026 %
91 -> 0.038 %
92 -> 0.030 %
93 -> 0.039 %
95 -> 0.562 %
97 -> 2.814 % ##
98 -> 0.966 %
99 -> 1.924 % #
100 -> 1.268 % #
101 -> 4.817 % ####
102 -> 1.528 % #
103 -> 0.728 %
104 -> 1.158 % #
105 -> 3.058 % ###
106 -> 0.052 %
107 -> 0.168 %
108 -> 1.647 % #
109 -> 1.106 % #
110 -> 3.887 % ###
111 -> 2.875 % ##
112 -> 1.530 % #
113 -> 0.119 %
114 -> 2.790 % ##
115 -> 3.034 % ###
116 -> 3.860 % ###
117 -> 1.481 % #
118 -> 0.351 %
119 -> 0.446 %
120 -> 0.247 %
121 -> 0.571 %
122 -> 0.054 %
123 -> 0.020 %
125 -> 0.020 %

Standard deviation of byte frequencies is 0.010914

dogganos@postmortem:~/temp/testbed$


From the 256 bytes, just 89 are used in the manual. That's because in a human readable document, only the human readable characters are used, and that's the alphabet, the numbers, space, carriage return, etc. All the rest of the bytes have a zero frequency. That's why the frequencies have a relatively (as we will see below), high standard deviation, because they differ much from each other.

Now, let's encrypt the manual as is, uncompressed:

dogganos@postmortem:~/temp/testbed$gpg -z 0 -c PHP_Manual.html
dogganos@postmortem:~/temp/testbed$l
total 9516
-rwx------ 1 dogganos dogganos 4675494 Oct 4 2001 PHP_Manual.html
-rw-r--r-- 1 dogganos dogganos 4676722 Dec 21 21:48 PHP_Manual.html.gpg
dogganos@postmortem:~/temp/testbed$

The "-z 0" switch instructs gpg not to use compression. The resulting file PHP_Manual.html.gpg is thus slightly bigger. But let's analyse it:

dogganos@postmortem:~/temp/testbed$analyse_file PHP_Manual.html.gpg
data size is 4676722 bytes.

0 -> 0.389 %
1 -> 0.392 %
2 -> 0.390 %
3 -> 0.390 %
4 -> 0.391 %
5 -> 0.389 %
6 -> 0.394 %
7 -> 0.395 %
8 -> 0.389 %
9 -> 0.390 %
10 -> 0.393 %
11 -> 0.388 %
12 -> 0.391 %
13 -> 0.387 %
14 -> 0.389 %
15 -> 0.392 %
16 -> 0.390 %
17 -> 0.389 %
18 -> 0.391 %
19 -> 0.389 %
20 -> 0.391 %
21 -> 0.386 %
22 -> 0.384 %
23 -> 0.395 %
24 -> 0.389 %
25 -> 0.393 %
26 -> 0.389 %
27 -> 0.393 %
28 -> 0.393 %
29 -> 0.392 %
30 -> 0.395 %
31 -> 0.395 %
32 -> 0.391 %
33 -> 0.389 %
34 -> 0.390 %
35 -> 0.395 %
36 -> 0.394 %
37 -> 0.390 %
38 -> 0.389 %
39 -> 0.385 %
40 -> 0.393 %
41 -> 0.390 %
42 -> 0.393 %
43 -> 0.391 %
44 -> 0.389 %
45 -> 0.392 %
46 -> 0.395 %
47 -> 0.389 %
48 -> 0.388 %
49 -> 0.389 %
50 -> 0.392 %
51 -> 0.389 %
52 -> 0.391 %
53 -> 0.399 %
54 -> 0.401 %
55 -> 0.391 %
56 -> 0.387 %
57 -> 0.394 %
58 -> 0.388 %
59 -> 0.390 %
60 -> 0.391 %
61 -> 0.392 %
62 -> 0.398 %
63 -> 0.387 %
64 -> 0.393 %
65 -> 0.388 %
66 -> 0.391 %
67 -> 0.391 %
68 -> 0.384 %
69 -> 0.384 %
70 -> 0.385 %
71 -> 0.390 %
72 -> 0.387 %
73 -> 0.395 %
74 -> 0.390 %
75 -> 0.391 %
76 -> 0.393 %
77 -> 0.391 %
78 -> 0.395 %
79 -> 0.392 %
80 -> 0.389 %
81 -> 0.393 %
82 -> 0.389 %
83 -> 0.388 %
84 -> 0.390 %
85 -> 0.385 %
86 -> 0.395 %
87 -> 0.389 %
88 -> 0.387 %
89 -> 0.387 %
90 -> 0.386 %
91 -> 0.390 %
92 -> 0.391 %
93 -> 0.382 %
94 -> 0.388 %
95 -> 0.390 %
96 -> 0.389 %
97 -> 0.393 %
98 -> 0.389 %
99 -> 0.386 %
100 -> 0.386 %
101 -> 0.394 %
102 -> 0.388 %
103 -> 0.386 %
104 -> 0.393 %
105 -> 0.389 %
106 -> 0.395 %
107 -> 0.390 %
108 -> 0.392 %
109 -> 0.391 %
110 -> 0.393 %
111 -> 0.392 %
112 -> 0.392 %
113 -> 0.391 %
114 -> 0.385 %
115 -> 0.398 %
116 -> 0.388 %
117 -> 0.387 %
118 -> 0.389 %
119 -> 0.392 %
120 -> 0.395 %
121 -> 0.392 %
122 -> 0.395 %
123 -> 0.392 %
124 -> 0.390 %
125 -> 0.389 %
126 -> 0.388 %
127 -> 0.387 %
128 -> 0.390 %
129 -> 0.394 %
130 -> 0.391 %
131 -> 0.395 %
132 -> 0.386 %
133 -> 0.397 %
134 -> 0.390 %
135 -> 0.389 %
136 -> 0.391 %
137 -> 0.395 %
138 -> 0.393 %
139 -> 0.396 %
140 -> 0.389 %
141 -> 0.393 %
142 -> 0.388 %
143 -> 0.391 %
144 -> 0.389 %
145 -> 0.386 %
146 -> 0.390 %
147 -> 0.390 %
148 -> 0.390 %
149 -> 0.386 %
150 -> 0.386 %
151 -> 0.391 %
152 -> 0.391 %
153 -> 0.391 %
154 -> 0.391 %
155 -> 0.396 %
156 -> 0.395 %
157 -> 0.396 %
158 -> 0.390 %
159 -> 0.390 %
160 -> 0.390 %
161 -> 0.397 %
162 -> 0.389 %
163 -> 0.388 %
164 -> 0.390 %
165 -> 0.394 %
166 -> 0.393 %
167 -> 0.391 %
168 -> 0.390 %
169 -> 0.389 %
170 -> 0.390 %
171 -> 0.391 %
172 -> 0.386 %
173 -> 0.389 %
174 -> 0.386 %
175 -> 0.396 %
176 -> 0.387 %
177 -> 0.391 %
178 -> 0.393 %
179 -> 0.396 %
180 -> 0.392 %
181 -> 0.393 %
182 -> 0.385 %
183 -> 0.389 %
184 -> 0.391 %
185 -> 0.387 %
186 -> 0.391 %
187 -> 0.390 %
188 -> 0.391 %
189 -> 0.387 %
190 -> 0.390 %
191 -> 0.393 %
192 -> 0.389 %
193 -> 0.394 %
194 -> 0.387 %
195 -> 0.389 %
196 -> 0.389 %
197 -> 0.393 %
198 -> 0.388 %
199 -> 0.394 %
200 -> 0.390 %
201 -> 0.388 %
202 -> 0.392 %
203 -> 0.396 %
204 -> 0.389 %
205 -> 0.388 %
206 -> 0.396 %
207 -> 0.385 %
208 -> 0.392 %
209 -> 0.392 %
210 -> 0.392 %
211 -> 0.390 %
212 -> 0.386 %
213 -> 0.396 %
214 -> 0.393 %
215 -> 0.388 %
216 -> 0.390 %
217 -> 0.392 %
218 -> 0.393 %
219 -> 0.389 %
220 -> 0.392 %
221 -> 0.391 %
222 -> 0.392 %
223 -> 0.388 %
224 -> 0.389 %
225 -> 0.391 %
226 -> 0.389 %
227 -> 0.383 %
228 -> 0.385 %
229 -> 0.391 %
230 -> 0.387 %
231 -> 0.395 %
232 -> 0.388 %
233 -> 0.389 %
234 -> 0.393 %
235 -> 0.394 %
236 -> 0.389 %
237 -> 0.405 %
238 -> 0.391 %
239 -> 0.395 %
240 -> 0.397 %
241 -> 0.392 %
242 -> 0.391 %
243 -> 0.393 %
244 -> 0.393 %
245 -> 0.394 %
246 -> 0.384 %
247 -> 0.391 %
248 -> 0.389 %
249 -> 0.389 %
250 -> 0.390 %
251 -> 0.392 %
252 -> 0.391 %
253 -> 0.388 %
254 -> 0.388 %
255 -> 0.389 %

Standard deviation of byte frequencies is 0.000032

dogganos@postmortem:~/temp/testbed$


We now see that all bytes are encountered in the encrypted file, at about the same frequency: 0.39 %, and in addition, the standard deviation is now minimal. That is because a good encryption algorithm produces a byte stream where each byte has a probability of 1/256 (~= 0.0039). This makes sense, because cryptanalysis exactly exploits redundancies in the plaintext (or data, or whatever). That the file looks like random data can also be verified by trying to compress it:

dogganos@postmortem:~/temp/testbed$bzip2 -k PHP_Manual.html.gpg
dogganos@postmortem:~/temp/testbed$l
total 14120
-rwx------ 1 dogganos dogganos 4675494 Oct 4 2001 PHP_Manual.html
-rw-r--r-- 1 dogganos dogganos 4676722 Dec 21 21:48 PHP_Manual.html.gpg
-rw-r--r-- 1 dogganos dogganos 4698154 Dec 21 21:48 PHP_Manual.html.gpg.bz2
dogganos@postmortem:~/temp/testbed$

The resulting "compressed" file now became even bigger! That is good for the encryption algorithm, because it means that the data it encrypts have no redundancy. As Bruce Schneier says in his Applied Cryptography, "This makes a reasonable test of an encryption algorithm; if the ciphertext can be compressed, then the algorithm probably isn't very good".

Let's try now the opposite. First compress and then encrypt.

dogganos@postmortem:~/temp/testbed$bzip2 -k PHP_Manual.html
dogganos@postmortem:~/temp/testbed$l
total 14692
-rwx------ 1 dogganos dogganos 4675494 Oct 4 2001 PHP_Manual.html
-rwx------ 1 dogganos dogganos 581281 Oct 4 2001 PHP_Manual.html.bz2
-rw-r--r-- 1 dogganos dogganos 4676722 Dec 21 21:48 PHP_Manual.html.gpg
-rw-r--r-- 1 dogganos dogganos 4698154 Dec 21 21:48 PHP_Manual.html.gpg.bz2
dogganos@postmortem:~/temp/testbed$gpg -c -z 0 PHP_Manual.html.bz2
dogganos@postmortem:~/temp/testbed$l
total 15264
-rwx------ 1 dogganos dogganos 4675494 Oct 4 2001 PHP_Manual.html
-rwx------ 1 dogganos dogganos 581281 Oct 4 2001 PHP_Manual.html.bz2
-rw-r--r-- 1 dogganos dogganos 581515 Dec 23 21:28 PHP_Manual.html.bz2.gpg
-rw-r--r-- 1 dogganos dogganos 4676722 Dec 21 21:48 PHP_Manual.html.gpg
-rw-r--r-- 1 dogganos dogganos 4698154 Dec 21 21:48 PHP_Manual.html.gpg.bz2
dogganos@postmortem:~/temp/testbed$

Aha! The compression really squeezed it! Now, the resulting encrypted file PHP_Manual.html.bz2.gpg gives a wanna-be cryptologist far less stuff to work with, and also saves us time/space, as it is far less time consuming to transfer/store 500 KB, than 4 MB.


The non-compressibility of random data (like ciphertext) can also be used in the following way: to detect encryption. Suppose that we snoop all packets that Eve sends to Bob, but since they have frequent correspondence, we want only to catch what is really interesting. So we suppose that they encrypt everything that they send to each other and is really interesting (say, 'We will meet at 00:45 behind the old train-station'). In that case, what we can do, is try to compress messages that we snoop, and gather the messages that do not compress significantly. They are probably encrypted*, thus interesting. We break the encryption(...) et voila!

Although, says Schneier again, it is always possible to specifically produce compressible ciphertext**... Ahhh, these mathematicians...

* Actually, these messages that do not compress significantly may also be already compressed. So, before trying to cryptanalyse them, we should try to decompress them with a variety of known compression algorithms (bzip2, zip, rar etc).

** randombit noted that an easy way to create compressible ciphertext is to take some redundant data, decrypt it with a key, and then use it as the plaintext (with the same key).

Log in or register to write something here or to contact authors.