Text written in English can be considered, to a certain extent, a random signal composed of a finite number of symbols (mainly the letters of the alphabet). Applying statistical techniques related to information theory, it is possible to compute an estimate of the entropy rate of English, which makes it possible to compress English text optimally or even to "simulate" it. Speech recognition software is also based on those techniques.
The goal here is to build a stochastic model of English using the empirical probability distribution of the letters, computed from text samples. We know that the frequency of letters in English is not uniform: the letter E (the most common) has a frequency of about 13%, while Q and Z each have a frequency of only about 0.1%. One can go further and build a second-order model in which the frequencies of pairs of letters are compiled. For example, the letter Q is almost always followed by U. The most frequent pair in English is TH, with a frequency of about 3.7%. The problem with higher-order models of English is that the number of possible triplets, quadruplets, etc. grows exponentially with the order (with a 27-symbol alphabet of 26 letters plus the space, there are already 27^4 = 531,441 possible quadruplets), so that to compile meaningful statistics about their frequencies, texts containing billions of letters would have to be processed.
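The letter and pair frequencies described above can be compiled from any text sample. The following is a minimal sketch, not the method used by the original authors: the names `normalize` and `ngram_frequencies`, the 27-symbol alphabet (26 letters plus the space), and the tiny sample sentence are all assumptions made for illustration. A sample this small will not reproduce the 13% or 3.7% figures.

```python
from collections import Counter

# Assumed alphabet: 26 uppercase letters plus the space.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

def normalize(text):
    """Uppercase the text and collapse everything outside A-Z into single spaces."""
    cleaned = "".join(ch if ch in ALPHABET else " " for ch in text.upper())
    return " ".join(cleaned.split())

def ngram_frequencies(text, n):
    """Return the relative frequency of every n-letter sequence in the text."""
    symbols = normalize(text)
    counts = Counter(symbols[i:i + n] for i in range(len(symbols) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# Hypothetical sample; a real estimate needs a much larger corpus.
sample = "The quick brown fox jumps over the lazy dog, then the other dog."
letter_freq = ngram_frequencies(sample, 1)  # first-order statistics
pair_freq = ngram_frequencies(sample, 2)    # second-order statistics

# Show the three most frequent letters and pairs in the sample.
print(sorted(letter_freq.items(), key=lambda kv: -kv[1])[:3])
print(sorted(pair_freq.items(), key=lambda kv: -kv[1])[:3])
```

Calling `ngram_frequencies(sample, 3)` or higher works the same way, but the table of counts becomes sparse very quickly, which is the difficulty with higher-order models.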
Here are some examples of what "simulated" English text looks like...
Zeroth-order approximation: the symbols are independent and equiprobable.
XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGXYD QPAAMKBZAACIBZLHJQD
First-order approximation: the symbols are independent, but the frequency of letters matches English text.
OCRO HLI RGWR NMIELWIS EU LL NBNESBEYA TH EEI ALHENHTTPA OOBTTVA NAH BRL
Second-order approximation: the frequency of pairs of letters matches English text.
ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE
Third-order approximation: the frequency of triplets of letters matches English text.
IN NO IST LAT WHEY CRATICT FROURE BERS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE
Fourth-order approximation: the frequency of quadruplets of letters matches English text.
THE GENERATED JOB PROVIDUAL BETTER TRAND THE DISPLAYED CODE ABOVERY UPONDULTS WELL THE CODERST IN THESTICAL IT DO HOCK BOTHE MERG INSTATES CONS ERATION NEVER ANY OF PUBLE AND TO THEORY EVENTIAL CALLEGAND TO ELAST BENERATED IN WITH PIES AS IS WITH THE
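Approximations of this kind can be generated by drawing each letter at random given the letters that precede it. The sketch below is one possible way to do it, under stated assumptions: the function name `simulate` is hypothetical, "order n" is taken to mean that each letter depends on the n-1 letters before it (so the second-order model uses pair statistics), the corpus is assumed to be already reduced to uppercase letters and spaces, and the corpus is treated as circular so that every context has at least one successor.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

def simulate(corpus, order, length, seed=0):
    """Generate `length` symbols from an order-`order` letter model of `corpus`.

    order 0: letters are uniform over ALPHABET.
    order 1: letters are independent, with the corpus frequencies.
    order n >= 2: each letter is drawn given the n-1 letters before it.
    """
    rng = random.Random(seed)
    if order == 0:
        return "".join(rng.choice(ALPHABET) for _ in range(length))

    context_len = order - 1
    # Wrap the corpus around so the final context also has a successor.
    wrapped = corpus + corpus[:context_len]
    # successors[context] counts the letters seen right after that context.
    successors = defaultdict(Counter)
    for i in range(context_len, len(wrapped)):
        successors[wrapped[i - context_len:i]][wrapped[i]] += 1

    out = corpus[:context_len]  # start from the first context of the corpus
    while len(out) < length:
        context = out[len(out) - context_len:]
        options = successors[context]
        letters = list(options)
        weights = list(options.values())
        out += rng.choices(letters, weights=weights)[0]
    return out[:length]

# Hypothetical toy corpus; real experiments need far more text.
corpus = "THE THEORY OF THE THING IS THAT THE TEXT THICKENS "
for order in range(4):
    print(order, simulate(corpus, order, 60, seed=1))
```

With a toy corpus like this one, the higher-order output mostly replays fragments of the corpus itself; the more varied text shown in the examples requires a large sample.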
Here is the same exercise but with "word models" instead of "letter models":
First-order word model: words are chosen independently but with frequencies as in English.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE
Second-order word model: word transition probabilities match English text.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
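The second-order word model works like the second-order letter model, with whole words as the symbols. The following is a small sketch under assumptions: the name `simulate_words` is hypothetical, words are split on whitespace only, and the last word of the corpus is linked back to the first so that every word has a follower.

```python
import random
from collections import defaultdict

def simulate_words(corpus, length, seed=0):
    """Generate `length` words; each word is drawn given the previous word."""
    rng = random.Random(seed)
    words = corpus.upper().split()
    # followers[w] lists every word seen right after w, with repeats, so a
    # uniform choice from the list follows the empirical transition frequencies.
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:] + words[:1]):
        followers[current].append(nxt)
    out = [words[0]]
    while len(out) < length:
        out.append(rng.choice(followers[out[-1]]))
    return " ".join(out[:length])

# Hypothetical toy corpus for illustration.
corpus = "the cat sat on the mat and the dog sat on the cat"
print(simulate_words(corpus, 15, seed=2))
```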
For the interested reader, the entropy of English has been estimated at about 4.03 bits per letter using the first-order letter model, but only 2.8 bits per letter using the fourth-order one.
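An entropy figure for a given model order can be estimated directly from counts. The sketch below assumes that the entropy of the order-n model means the conditional entropy of a letter given the n-1 letters before it, which for n = 1 reduces to the usual sum of -p log2 p over the letters. The function name `entropy_per_letter` is hypothetical, and the estimate simply plugs in empirical frequencies, so on short samples it tends to be too low for higher orders.

```python
import math
from collections import Counter

def entropy_per_letter(text, order):
    """Plug-in estimate, in bits per letter, for an order-`order` letter model.

    Computes the empirical conditional entropy of a letter given the
    order-1 letters that precede it.
    """
    grams = Counter(text[i:i + order] for i in range(len(text) - order + 1))
    # Count each context (the n-gram without its last letter).
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count
    total = sum(grams.values())
    # Sum over n-grams of p(gram) * log2(1 / p(last letter | context)).
    return sum(
        count / total * math.log2(contexts[gram[:-1]] / count)
        for gram, count in grams.items()
    )

# Hypothetical toy sample; a real estimate needs a very large corpus.
sample = "THE THEORY OF THE THING IS THAT THE TEXT THICKENS "
for order in (1, 2, 3, 4):
    print(order, round(entropy_per_letter(sample, order), 3))
```

Conditioning on more context can only lower this estimate or leave it unchanged, which matches the drop from the first-order to the fourth-order figure.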
This text is adapted from Elements of Information Theory by Cover and Thomas (Wiley); the examples are quoted from it.