Even knowing what a Markov chain is, I have a very hard time decoding the terminology and notation above. Hopefully this writeup will help those who don't yet have a Ph.D. in Mathematics understand what a Markov chain is and how it works.


Named after the Russian mathematician Andrey Markov, a Markov chain is basically a structure in which future states depend only on the current state - past states are not considered when deciding what to do next. These future states are chosen through probability.

While Markov chains have uses in physics, statistics, economics, biology, gambling, sports, and more, this writeup will deal with the more fun aspect - generating (semi-)meaningful/coherent sentences from existing text. In any case, the concepts are the same: Forget where you were. You are here now. Decide where to go next.

One of the most famous implementations of a Markov chain used to generate random text is Mark V. Shaney (say it out loud and run the words together, and it sounds similar to "Markov Chainy"). Mark V. Shaney was a Usenet "user" whose posts were generated through an algorithm. Because no one on Usenet had ever seen anything like this, many people were led to believe that these submissions were posted by a real person with an unusual writing style.

How do they work?

The basic algorithm for generating text with Markov chains is actually quite simple - you can even do it on paper, no computers required:

  1. Start with some text input. It can be a single sentence, a nursery rhyme, a book, the entire works of Shakespeare - pretty much anything.
  2. Read in each word, and keep track of what word comes after it. Since words can be repeated, each word may have a list of words that follow it. For example, if your input was "the boy went to the store" you would have:
    • the: boy, store (both "boy" and "store" can follow the word "the")
    • boy: went
    • went: to
    • to: the
  3. After you are finished making your list of words, you can output your own generated text. Pick a starting word, and randomly pick one of the words from the list that follows it. Now with your randomly chosen word, pick a word from the list that follows it. Repeat until you feel like quitting.
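
If you would rather let the computer do the work, here is a minimal sketch of the steps above in Python (the function names are my own, just for illustration):

  import random

  def build_chain(text):
      # Step 2: map each word to the list of words that follow it.
      # Duplicates are kept on purpose - a word that appears after "the"
      # three times in the input is three times as likely to be chosen.
      words = text.lower().split()
      chain = {}
      for current, following in zip(words, words[1:]):
          chain.setdefault(current, []).append(following)
      return chain

  def generate(chain, start, length=20):
      # Step 3: from the current word, randomly pick a valid next word.
      word = start
      output = [word]
      for _ in range(length - 1):
          choices = chain.get(word)
          if not choices:   # dead end: this word only appeared at the end
              break
          word = random.choice(choices)
          output.append(word)
      return " ".join(output)

  print(generate(build_chain("the boy went to the store"), "the"))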

Using the terminology from above, the word you are on is the current "state", and you are randomly choosing the future state by picking a valid word from the list. This simplistic algorithm can be fun to play with, but its grammar will be rather poor. To generate better output, instead of keeping track of just one word for the current state, you keep track of two words. In other words, instead of making a list of all the individual words in your input, you keep track of all the word pairs, and all the words that can follow each pair. This is called an n-tier algorithm (where in this case, n is 2). You can keep track of as many tiers as you want, but the higher the number, the more likely your generated output will be exactly the same as your input. For this reason, 2 or 3 tiers is typically the highest a text generator will go.
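
In code, the n-tier version is only a small change to the sketch above: the "state" becomes a tuple of n words instead of a single word. One way to generalize it (again, a sketch with names of my own choosing):

  import random

  def build_chain_n(text, n=2):
      # Map each n-word sequence to the list of words that follow it.
      words = text.lower().split()
      chain = {}
      for i in range(len(words) - n):
          state = tuple(words[i:i + n])
          chain.setdefault(state, []).append(words[i + n])
      return chain

  def generate_n(chain, start, length=20):
      # start is an n-word tuple, e.g. ("the", "boy") for n=2.
      state = start
      output = list(state)
      for _ in range(length - len(state)):
          choices = chain.get(state)
          if not choices:
              break
          word = random.choice(choices)
          output.append(word)
          state = state[1:] + (word,)   # slide the window forward one word
      return " ".join(output)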

Okay, I think I understand, now give me an example

(or, I don't understand this at all, give me an example)

1-tier

Alright. Let's start with a 1-tier example using Genesis 1:1-5. For our word list, we are going to ignore all punctuation.

In the beginning God created the heavens and the earth.
And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
And God said, Let there be light: and there was light.
And God saw the light, that it was good: and God divided the light from the darkness.
And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.

In the list below, on the left are all the words used in the input, and on the right are all the words that follow each of them. Notice that we include duplicate words - since the input uses these words multiple times, they should have a better chance of occurring in the output. I have also alphabetized the list, but that is not necessary for the algorithm (it just makes it easier to keep track of your words when you are doing this on paper instead of with a computer).

and: the, the, void, darkness, the, God, there, God, God, God, the, the, the
be: light
beginning: God
called: the, night
created: the
darkness: was, and, he
day: and
deep: and
divided: the
earth: and, was
evening: and
face: of, of
first: day
form: and
from: the
God: created, moved, said, saw, divided, called
good: and
he: called
heavens: and
in: the
it: was
let: there
light: and, and, that, from, day
of: the, God, the
morning: were
moved: upon
night: and
said: let
saw: the
spirit: of
that: it
the: beginning, heavens, earth, earth, face, deep, spirit, face, waters, light, light, darkness, light, darkness, evening, morning, first
there: be, was
upon: the, the
void: and
was: without, upon, light, good
waters: and
were: the
without: form

Now let's pick our words to generate a sentence. Since the input starts with "In", that's what I will start with. When using a computer, you would use a random number generator for choosing the next word, but I'm just going to use a counter to pick what word comes next. I'm going to start at 1, then increase by one after I pick each word. In other words, I'll look at the word "In" and pick the first word in the list (the); then I'll look at "the" and pick the second word in the list (heavens), then I'll pick the 3rd word for the next one, and so on. If there aren't enough words, I'll loop back to the first word and keep going.
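
On a computer, "loop back to the first word" is just the modulo operator. A tiny sketch of the counter trick, using the follower list for "darkness" from the table above:

  def pick(choices, counter):
      # Pick the counter-th word, counting from 1, wrapping around the list.
      return choices[(counter - 1) % len(choices)]

  pick(["was", "and", "he"], 5)   # counter 5 wraps around to the 2nd word: "and"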

This gives the output:

0  1   2       3   4        5   6   7       8   9
In the heavens and darkness and God created the waters

10  11  12       13  14   15  16
and the darkness was upon the morning

Not too bad, but it doesn't make much sense. Let's try 2-tier...

2-tier

Using the same input, but tracking pairs of words instead of single words, we get:
and darkness: was
and God: said, saw, divided, called
and the: earth, earth, spirit, darkness, evening, morning
and there: was
and void: and
be light: and
beginning God: created
called night: and
called the: light
created the: heavens
darkness and: God
darkness he: called
darkness was: upon
day and: the
deep and: the
divided the: light
earth and: the
earth was: without
evening and: the
face of: the, the
form and: void
from the: darkness
God called: the
God created: the
God divided: the
God moved: upon
God said: let
God saw: the
good and: God
he called: night
heavens and: the
in the: beginning
it was: good
let there: be
light and: there, God
light day: and
light from: the
light that: it
morning were: the
moved upon: the
night and: the
of God: moved
of the: deep, waters
said let: there
saw the: light
spirit of: God
that it: was
the beginning: God
the darkness: and, he
the deep: and
the earth: and, was
the evening: and
the face: of, of
the first: day
the heavens: and
the light: that, from, day
the morning: were
the spirit: of
the waters: and
there be: light
there was: light
upon the: face, face
void and: darkness
was good: and
was light: and
was upon: the
was without: form
waters and: God
were the: first
without form: and

In this case, we are going to start with "in the", and follow the same word-choosing process as above:

0  1   2         3   4       5   6       7   8   9
In the beginning God created the heavens and the spirit

10 11  12    13   14  15   16 17  18
of God moved upon the face of the deep

That's a bit better!

Punctuation

In the examples above we ignored punctuation; however, to generate meaningful output that is more than one sentence long, it is a necessity. Believe it or not, we don't have to change the algorithm in any way to get decent punctuation handling - we just need to stop ignoring punctuation and consider it part of the word. In other words, in the example above, the first sentence ends in "earth." (with the period). We simply handle that as a separate word from "earth" (without the period) in the second verse. As I said, this gives you "decent" punctuation handling - not great. Doing better makes the algorithm much more complicated - especially punctuation that has to be opened and closed, like parentheses, quotation marks, and brackets. Basically, you need to keep track of what punctuation you have already seen and use that information to help you decide whether the words in your word list are a good choice. If not, you need to decrease their likelihood of being chosen (typically by temporarily removing them from the list of choices).
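
In code, "considering punctuation part of the word" can be as simple as splitting on whitespace and leaving everything else alone. A sketch, using Python's plain split:

  verse = "God created the heavens and the earth. And the earth was without form"
  print(verse.split())
  # ['God', 'created', 'the', 'heavens', 'and', 'the', 'earth.',
  #  'And', 'the', 'earth', 'was', 'without', 'form']

Here "earth." (with the period) and "earth" (without it) become two different states, so a period-ended word can only be followed by words that actually started a sentence somewhere in the input.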

A Real World Example: E2D2

E2D2 uses 2-tier Markov chains extensively for generating things to say. While he does a lot to handle punctuation, he is especially strict (though still not perfect) with the use of square brackets to handle hard links. He uses two forms of input: an archive of the chatterbox conversation, and noder writeups.

When you say: e2d2 tell me about everything2, he will find all the chatterbox messages that mention "everything2", combine them all together, and generate a Markov chain based on all these messages. Sometimes the output will include the word "everything2", and sometimes it won't, simply because of how the words are chosen.

When you say: e2d2 tell me about [everything2], he will download all the writeups in the Everything2 node, remove all the HTML (and also remove the text from certain tags that tend to cause problems), combine them all together, and generate a Markov chain based on the writeups.

Similarly, when you say: e2d2 process my writeups, he will download all of your writeups, remove all the HTML, combine them all together, and generate a Markov chain based on the writeups.

Some Markov chain programs keep a database of their word lists; however, with a properly coded algorithm, this is completely unnecessary. E2D2 does all his generating in memory. If you ignore the time it takes to download the user search and individual writeups, E2D2 can typically generate 10 random paragraphs from someone's writeups in less than 1 second. Even The Custodian, who has over 1600 writeups containing over 1 million words, can be processed in about 2 seconds. Usually the delay you see when you ask him something is caused either by having to download the content, or simply by the fact that he only refreshes the catbox messages every 10 seconds.

Additional real-world examples include: