Markov chains are directed graphs whose nodes represent possible states of the world and whose arrows represent possible transitions between states; each transition carries a probability value, the probability of that transition occurring when the chain is in that state.

The purpose: given a Markov chain you can compute the probability of the world being in any given state. (Assuming the validity of your state diagram model, of course.)

See also: finite state machines/finite state automata, Petri nets.

These chains can be used as a database of n-word combinations and the words that follow them, which can then be used to try to create "speech". This process generally produces nonsense sentences unless it is very finely tweaked and backed by a plethora of words and statistics in the database.

A markov chain (Markov is given the ultimate mathematical dignity of having his name lowercased!) is a random process on some state space, where Xn+1 (the state at time n+1) depends only on Xn, and Pr(Xn+1=y|Xn=x) = px,y is a constant. The matrix P=(px,y)x,y is called the transition matrix of the process. It is a stochastic matrix (the sum of each row is 1), and completely determines the process.

The two important conditions on a markov chain are that it be irreducible and aperiodic (see my writeups there for details). If your markov chain is irreducible and aperiodic (this makes it an ergodic process), then there exists a unique stable (stationary) distribution S for the chain. In that case, there is also convergence to that stable distribution: pick an initial state according to any distribution D (for instance, you could always pick some fixed node), then run the chain; the distribution of the nth state of the chain converges to the stable distribution as n tends to infinity.
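(A minimal sketch of that convergence, in Python with numpy - not part of the argument above, just an illustration. The 2x2 matrix is made up, though it happens to match the rain example further down this page.)

import numpy as np

# A stochastic matrix: each row sums to 1.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

d = np.array([1.0, 0.0])   # start in state 0 with certainty
for _ in range(50):
    d = d @ P              # distribution of the next state

print(d)                   # ~[0.5714, 0.4286] = (4/7, 3/7), the stable distribution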

Computer science makes up new names for everything. In this case, it uses the name "markov chain" in a slightly different setting. You can condition on the last k states (rather than just the last state) and pretend it's a new type of process. In fact, it's the same thing, if you just enlarge your state space appropriately. Details are left as an easy exercise for the interested reader.

(Just to discuss a Markov chain, here is a bunch of Markov chain-related information piped through a Markov chain-based ramble generator I hacked up a few days ago, with a slight change to randomly link phrases.)

Also, the rest of break-even probabilities for a Project Scoresheet inputter for this number is again starting from singles, but the batter reaches first is again starting from these lines, it seems reasonable to affect the games.

There is probably too small increase for this matrix the efforts of this analysis and attempts, 5) stolen bases. The first problem, is to try and then there may not at stolen base tries. That being able to the scoresheet.

One caveat: it makes getting the stable distribution: pick some of strategies can greatly simplify the world being said, the comparison between ballparks, teams, playing surfaces.

There are added to each batting order being analyzed. One important point in general, the offensive performance of increasing the chain.

Once the necessary refinement is not new, dating back to interpret the data, 2) obtaining sufficient data limitations the entire analytical process. A spreadsheet program, which all plays and with their own conclusions from other state. (Assuming the project director. These expected or not change, then there would really go through the Markov chain, though strictly speaking not nearly large of this task. The author input 37 Baltimore home team, Baltimore data files. All the AL, the fundamental ideas of six different types of this point, you can compute from that case, there exists a sizeable decrease for this point, you may have failed to do not.

In sudden excitement over the wonder of a computer producing almost intelligible text, I went and wrote a markov-chainer of my own. After a couple of failed attempts at database management, I went and learned SQL, and installed MySQL. Then I downloaded the complete works of Shakespeare and various other old literature, and scanned all the literature I had in computerized form as fodder for this simple markov-chain rant generator. The generator itself was moronically simple, operating on the following rule: "count all triplets of consecutive words in the source text. Pick two random words, and from then on, each following word is a random word, with relative probability equal to its frequency in the source text following the last two generated words".
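Should anyone want to reproduce it, the rule above translates into something like the following Python sketch - my reconstruction for illustration, not the original SQL/MySQL code, and all the names are made up. It picks a random adjacent pair as the "two random words":

import random
from collections import defaultdict

def build_table(text):
    # map each pair of consecutive words to every word seen following it
    words = text.split()
    table = defaultdict(list)
    for a, b, c in zip(words, words[1:], words[2:]):
        table[(a, b)].append(c)        # duplicates provide the frequency weighting
    return table

def ramble(table, length=50):
    pair = random.choice(list(table))  # "pick two random words" (a random adjacent pair here)
    out = list(pair)
    for _ in range(length):
        followers = table.get(pair)
        if not followers:              # this pair only ever ended the source text
            break
        word = random.choice(followers)
        out.append(word)
        pair = (pair[1], word)
    return " ".join(out)

Feed build_table() a big enough text file and ramble() produces output much like the samples below.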

The first experimental versions produced text that people could, and would, interpret as poetry. The very first version was hardly even intelligible, yet a wannabe-poet friend interpreted it in interesting ways:

and most cool froodiness to come and sage defender sky and being despised by grave and grew and suppose if
galahad oh it's against my friend and buggering up like this is your time flies over then guard oh yes
and ready at all people love will to feel touch her mind like we can't talk to take her nature
slowing down side his impending doom when she was his leg was definitely recognizable for any confidence truth suddenly his

The second production I improved a bit, mostly by removing misplaced words and adding maybe a total of two words plus all the punctuation, then turned it in as my English assignment, whose deadline was the next day. Got the second-best grade for it, too, with the comment "insightful, as usual". It's located at Night Brawler.

And supposing you're not yet totally asleep, here are pieces of poetry/incoherent stories (the addition of automatic punctuation seems to have made the output relatively more coherent but equally less poetic), fresh & unedited (except for links) from the machine, with different word libraries used:

  1. Literature from 19th and 20th centuries; Can Such Things Be, Crime and Punishment, Joys of Being Woman, Through the Looking Glass and The Time Machine:

    carries with her,
    and his odd,
    and another a quiet enough spot the fork of the road? i sat, and alice knew which was not the first of the retail relation that 50 occasions the peculiar blight incurred by the spirit hath powers of the people of san francisco the next course.

    the piece most con venient to his capture and conviction. of his attention by a box of cast off became uneasy,

    uncomfortable,
    at last, in full confidence of renewed day it is interesting to examine some results to the loqua cious barber of the cabin.
    if possibly there was the first person we meet again!
    do let's pretend the glass in here and there, and sat down again,
    it had been in foreign countries and had a fire that was true!
    i'm afraid i don't care about going on long, shock headed fellow,

    he did so, long she got back to your post, as we,

    for was not possible simultaneously to beautify my brain as in regard to all its manifestations and suggestions. he was far too

  2. Neon Genesis Evangelion fan fictions: Neon Genesis Evangelion: Revelations and Neon Exodus Evangelion:

    crime to visit shinji?
    shinji asked, looking from them do to people's sensibility 12 days ago. since dr. akagi have both been told,
    said berard, who, asuka finished for him.
    hadn't even made any impatient noises,

    so all that crap about them.

    neither sea nor sky seemed to have a moment,
    then four tones,

    then looked up as the train pulled up just as the administration had agreed that keeping the quaver out of a mouth exposed by the day's events; instead he'd got that jbs2 you used to combat the seventh child?
    did i hit him so a number of switches on her feet,

    until he caught a launch of the evangelion project, replied truss as he had planted himself in the valley below,
    in the single shot echoed for several minutes of reactor failure, then turned around, each of the hatch, fast!
    observation posts alpha through epsilon report visual sightings of ground forces have just a single lonesome figure sat on the offer and her cheeks.

    rei! do anything,

    i suggest you use

  3. Shakespeare's collected works: too many to list here (I think my netscape is stretching its limits; I'm sure it'll crash any second now, along with this writeup.)

    digression by some other time. antony. yet him for my sake.

    thurio. ay,

    on mine so that other mine, my boy, tell true.

    troilus.

    i will. if this be borne to that end am i come to you so? tyb.

    thou hast prevail'd;
    and let me see. come,

    let him come in, until our city did.

    anon, anon becomes a man as he is not careful what they weigh not every man prophetically do forethink thy fall as those should do some good news, while greasy joan doth keel the pot of good comfort bring i to her.

    she got the tune of light.

    i am no traitor's letters.

    hot.

    zounds, he says, and nobody sees me go, your hand to hold a sceptre, or none or little, and blush of modesty thy skipping spirit; for if the other half in all men's pride i boast her off and on the traitor cade surpris'd?

    or will you lie,

    your empery,

    your father's in some sort,

    these

The code producing these, should anyone care, is a dreadful and incredibly slow mess that, if exposed to sunlight, would probably grow teeth and hands and sprout the head of a Microsoft marketing drone. It is integrated with ViaVoice TTS, though, meaning that it will read the poetry it generates aloud. The effect of this, combined with an "infinite poetry mode" and some optimizations that let it create poetry in realtime, makes for a nice bedtime story teller. For some reason I become pretty nauseous fairly soon while listening to it, though.

Consider this: your local weather channel predicts a 25% chance of rain today. If it rains today, there is a 40% chance it will not rain tomorrow and a 60% chance it will. If it does not rain today, there is a 70% chance it will not rain tomorrow and a 30% chance it will. Predict the chance of showers tomorrow.
Easy.

                             /\
                            /  \
                           /    \
                          N      R
                         /        \
                        /          \
                      .75         .25
                      /  \        /  \
                     N    R      N    R
                    /      \    /      \
                   *.7    *.3  *.4    *.6
                    |      |    |      |
                  .525   .225 .100   .150
5/8 chance of no rain tomorrow,
3/8 chance of rain tomorrow.

What about day 2, or 3, or 8? Trees quickly become unwieldy - exponentially so. For this we employ matrices, and the novelty of Markov chains is realized.

From day to day, our probability of rain transitions through a transition matrix, T. Each row of the matrix corresponds to the rain state on the current day, and each column to the rain state on the next day, so each entry is the probability of moving from the row's state to the column's state. Note that each row adds up to 1, every entry is between 0 and 1 inclusive, and the matrix is a square matrix.
         N    R
       _        _
    N | 0.7  0.3 |
T =   |          |
    R |_0.4  0.6_|
For the tree above, we can express D[0] as
          _ N   R _
D[0] =   |_.75 .25_|
and multiply D[0] by T to get D[1], or the probability of rain tomorrow.
D[1] = D[0]*T
        _ N      R _
D[1] = |_.625  .375_|
We can then get the probability of rain in 2 days by multiplying D[1] by T
D[2] = D[1]*T
        _  N       R  _
D[2] = |_.5875   .4125_|
Because matrix multiplication is associative, we can say
D[1] = D[0]*T
D[2] = D[1]*T
D[2] = (D[0]*T)*(T)
D[2] = (D[0])*(T*T)
D[2] = (D[0])*(T^2)
We can get the probability of rain on any given day X by raising our transition matrix to the power X and multiplying our initial distribution by the result.
D[X] = (D[0])*(T^X)
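The same calculation as a short code sketch, assuming Python with numpy (the writeup above doesn't commit to any particular language):

import numpy as np

T  = np.array([[0.7, 0.3],        # rows: today N, R; columns: tomorrow N, R
               [0.4, 0.6]])
D0 = np.array([0.75, 0.25])       # today: 75% no rain, 25% rain

D1 = D0 @ T                               # [0.625, 0.375]  -> 3/8 chance of rain tomorrow
D2 = D0 @ np.linalg.matrix_power(T, 2)    # [0.5875, 0.4125]
D8 = D0 @ np.linalg.matrix_power(T, 8)    # already very close to the stable (4/7, 3/7) split
print(D1, D2, D8)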
A markov chain is a sequence of random variables X0, X1, ... Xn, ..., with the following property:

P{Xn = j | X0 = i0, X1 = i1,..., Xm-1 = im-1, Xm = i} = P{Xn = j | Xm = i}

for all integer times n > m and for states i0, i1,..., im-1, i, j belonging to the state space S (where S is a finite or countable state space, and P denotes probability).

This means that, given the present state of the process, knowledge of its past is irrelevant when calculating the probability distribution of its future values.

I know what a Markov chain is, and I have a very hard time decoding the terminology and notation above. Hopefully this writeup will help those who don't yet have a Ph.D. in mathematics understand what a Markov chain is and how it works.


Named after the Russian mathematician Andrey Markov, a Markov chain is basically a structure in which future states depend only on the current state - past states are not considered when deciding what to do next. These future states are determined through probability.

While Markov chains have uses in physics, statistics, economics, biology, gambling, sports, and more, this writeup will deal with the more fun aspect - generating (semi-) meaningful/coherent sentences from existing text. In any case, the concepts are the same: Forget where you were. You are here now. Decide where to go next.

One of the most famous implementations of a Markov chain used to generate random text is Mark V. Shaney (say it out loud, and run the words together, and it sounds similar to "Markov Chainy"). Mark V. Shaney was a Usenet "user" whose posts were generated through an algorithm. Because no one on Usenet had ever seen anything like this, many people were led to believe that these submissions were posted by a real person with an unusual writing style.

How do they work?

The basic algorithm for generating text with Markov chains is actually quite simple; you can even do it on paper - no computers required:

  1. Start with some text input. It can be a single sentence, a nursery rhyme, a book, the entire works of Shakespeare - pretty much anything.
  2. Read in each word, and keep track of what word comes after it. Since words can be repeated, each word may have a list of words that follow it. For example, if your input was "the boy went to the store" you would have:
    • the: boy, store (both "boy" and "store" can follow the word "the")
    • boy: went
    • went: to
    • to: the
  3. After you are finished making your list of words, you can output your own generated text. Pick a starting word, and randomly pick one of the words from the list that follows it. Now, with your randomly chosen word, pick a word from the list that follows it. Repeat until you feel like quitting.

Using the terminology from above, the word you are on is the current "state", and you are randomly choosing the future state by picking a valid word from the list. This simplistic algorithm can be fun to play with, but its grammar will be rather poor. To generate better output, instead of just keeping track of one word for the current state, you keep track of two words. In other words, instead of making a list of all the individual words in your input, you keep track of all the word pairs, and all the words that can follow this word pair. This is called an n-tier algorithm (where in this case, n is 2). You can keep track of as many tiers as you want, but the higher the number, the more likely your generated output will be exactly the same as your input. For this reason, 2- or 3-tiers are typically the highest a text generator will go.
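Here is a minimal 1-tier sketch of steps 1-3 in Python (my own illustration, not anyone's official implementation), using the "the boy went to the store" input from step 2. Keying the table on word pairs instead of single words turns it into the 2-tier version just described:

import random
from collections import defaultdict

text = "the boy went to the store"        # the input from step 2

follows = defaultdict(list)
words = text.split()
for cur, nxt in zip(words, words[1:]):
    follows[cur].append(nxt)              # e.g. follows["the"] == ["boy", "store"]

state = "the"                             # pick a starting word
output = [state]
for _ in range(6):
    choices = follows.get(state)
    if not choices:                       # no recorded follower: stop
        break
    state = random.choice(choices)        # randomly choose the next state
    output.append(state)

print(" ".join(output))                   # e.g. "the store" or "the boy went to the store"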

Okay, I think I understand, now give me an example

(or, I don't understand this at all, give me an example)

1-tier

Alright. Let's start with a 1-tier example using Genesis 1: 1-5. For our word list we are going to ignore all punctuation.

In the beginning God created the heavens and the earth.
And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
And God said, Let there be light: and there was light.
And God saw the light, that it was good: and God divided the light from the darkness.
And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.

On the left are all the words used in the input, and on the right are all the words that follow each of the words. Notice that we include duplicate words - since the input uses these words multiple times, they should have a better chance of occurring in the output. I have also alphabetized the list, but that is not necessary for the algorithm (it just makes it easier to keep track of your words when you are doing this on paper instead of with a computer).

and:the, the, void, darkness, the, God, there, God, God, God, the, the, the
be:light
beginning:God
called:the, night
created:the
darkness:was, and, he
day:and
deep:and
divided:the
earth:and, was
evening:and
face:of, of
first:day
form:and
from:the
God:created, moved, said, saw, divided, called
good:and
he:called
heavens:and
in:the
it:was
let:there
light:and, and, that, from, day
of:the, God, the
morning:were
moved:upon
night:and
said:let
saw:the
spirit:of
that:it
the:beginning, heavens, earth, earth, face, deep, spirit, face, waters, light, light, darkness, light, darkness, evening, morning, first
there:be, was
upon:the, the
void:and
was:without, upon, light, good
waters:and
were:the
without:form

Now let's pick our words to generate a sentence. Since the input starts with "In", that's what I will start with. When using a computer, you would use a random number generator for choosing the next word, but I'm just going to use a counter to pick what word comes next. I'm going to start at 1, then increase by one after I pick each word. In other words, I'll look at the word "In" and pick the first word in its list (the); then I'll look at "the" and pick the second word in its list (heavens); then I'll pick the 3rd word for the next one, and so on. If there aren't enough words, I'll loop back to the first word and keep going.

This gives the output:

0  1   2       3   4        5   6   7       8   9
In the heavens and darkness and God created the waters

10  11  12       13  14   15  16
and the darkness was upon the morning

Not too bad, but it doesn't make much sense. Let's try 2-tier...

2-tier
Using the same input, but tracking pairs of words instead of single words, we get:
and darkness: was
and God: said, saw, divided, called
and the: earth, earth, spirit, darkness, evening, morning
and there: was
and void: and
be light: and
beginning God: created
called night: and
called the: light
created the: heavens
darkness and: God
darkness he: called
darkness was: upon
day and: the
deep and: the
divided the: light
earth and: the
earth was: without
evening and: the
face of: the, the
form and: void
from the: darkness
God called: the
God created: the
God divided: the
God moved: upon
God said: let
God saw: the
good and: God
he called: night
heavens and: the
in the: beginning
it was: good
let there: be
light and: there, God
light day: and
light from: the
light that: it
morning were: the
moved upon: the
night and: the
of God: moved
of the: deep, waters
said let: there
saw the: light
spirit of: God
that it: was
the beginning: God
the darkness: and, he
the deep: and
the earth: and, was
the evening: and
the face: of, of
the first: day
the heavens: and
the light: that, from, day
the morning: were
the spirit: of
the waters: and
there be: light
there was: light
upon the: face, face
void and: darkness
was good: and
was light: and
was upon: the
was without: form
waters and: God
were the: first
without form: and

In this case, we are going to start with "in the" and follow the same word-choosing process as above:

0  1   2         3   4       5   6       7   8   9
In the beginning God created the heavens and the spirit

10 11  12    13   14  15   16 17  18
of God moved upon the face of the deep

That's a bit better!

Punctuation

In the examples above, we have ignored punctuation; however, to generate meaningful output that is more than one sentence long, it is a necessity. Believe it or not, we don't have to change the algorithm in any way to get decent punctuation handling (other than NOT ignoring punctuation) - we just need to consider punctuation as part of the word. In the example above, the first sentence ends in "earth." (with the period); we simply handle that as a separate word from "earth" (without the period) in the second verse. As I said, this gives you "decent" punctuation handling - not great. If you try to do better, the algorithm gets much more complicated - especially for punctuation that has to be opened and closed, like parentheses, quotation marks, and brackets. Basically, you need to keep track of what punctuation you have already seen and use that information to help you decide whether the words in your word list are a good choice. If not, you need to decrease their likelihood of being chosen (typically by temporarily removing them from the list of choices).
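(A tiny sketch of the "punctuation is part of the word" idea, assuming Python; the verse fragment is from the Genesis example above. Splitting on whitespace alone keeps "earth." and "earth" as two distinct entries in the word list.)

verse = "God created the heavens and the earth. And the earth was without form"

with_punct    = verse.split()                              # tokens keep their punctuation
without_punct = verse.replace(".", "").replace(",", "").split()

print("earth." in with_punct, "earth" in with_punct)       # True True  (two distinct "words")
print("earth." in without_punct)                           # False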

A Real World Example: E2D2

E2D2 uses 2-tier Markov chains extensively for generating things to say. While he does a lot to handle punctuation, he is especially strict (though still not perfect) with the use of square brackets to handle hard links. He uses two forms of input: an archive of the chatterbox conversation, and noder writeups.

When you say: e2d2 tell me about everything2, he will find all the chatterbox messages that mention "everything2", combine them all together, and generate a Markov chain based on all these messages. Sometimes the output will include the word "everything2", and sometimes it won't simply due to how the words are chosen.

When you say: e2d2 tell me about [everything2], he will download all the writeups in the Everything2 node, remove all the HTML (and also remove the text from certain tags that tend to cause problems), combine them all together, and generate a Markov chain based on the writeups.

Similarly, when you say: e2d2 process my writeups, he will download all of your writeups, remove all the HTML, combine them all together, and generate a Markov chain based on the writeups.

Some Markov chain programs keep a database of their word lists; however, with a properly coded algorithm this is completely unnecessary. E2D2 does all his generating in memory. If you ignore the time it takes to download the user search and individual writeups, E2D2 can typically generate 10 random paragraphs from someone's writeups in less than 1 second. Even The Custodian, who has over 1600 writeups containing over 1 million words, can be processed in about 2 seconds. Usually the delay you see when you ask him something is caused either by having to download the content or by the fact that he only refreshes the catbox messages every 10 seconds.

