A technique that is the basis of modern speech recognition and handwriting recognition systems. Hidden Markov models (HMMs) were developed by Leonard Baum and colleagues and popularised for speech recognition by Lawrence Rabiner at AT&T Bell Labs, and they superseded dynamic time warping. Dynamic time warping worked well for isolated word recognition, but did not generalise well to continuous speech recognition. An HMM consists of a few bits and pieces (note the similarity to Finite State Machines):
  • Some symbols that it emits. For now think of a symbol as a letter.
  • Some states. An HMM might have 3 states, though it can have as few as one or as many as a few hundred. At every time step a state emits a symbol.
  • Some probabilities. Firstly, the probability that in a given state a particular symbol will be emitted; secondly, the probability that the HMM will transition (i.e. jump) from any state to any other state at the next time step. Many of these will be zero.
  • A starting state and a final state.
So the HMM starts in a start state, emits a letter, jumps to another state (or possibly stays in the same state), emits a letter and so on until it hits a final state.
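To make that concrete, here is a minimal sketch in Python of what those bits and pieces might look like for a made-up 3-state model that emits the letters a, b and c. Every number is invented purely for illustration, and a fixed sequence length stands in for the final state just to keep the sketch short:

```python
import numpy as np

symbols = ["a", "b", "c"]

# Transition probabilities: row = current state, column = next state.
A = np.array([[0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6],
              [0.5, 0.3, 0.2]])

# Emission probabilities: row = state, column = symbol (a, b, c).
B = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

def generate(A, B, start_state=0, length=6, seed=1):
    """Run the model: emit a symbol from the current state, then jump to the next state."""
    rng = np.random.default_rng(seed)
    state, out = start_state, []
    for _ in range(length):
        out.append(symbols[rng.choice(len(symbols), p=B[state])])
        state = rng.choice(A.shape[0], p=A[state])
    return out

print(generate(A, B))   # something like ['a', 'b', 'b', 'a', 'c', 'b']
```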

What this now allows you to do is ask a question like: Given a particular HMM, if I start in state 1, emit "a", go to state 2, emit "b", go to state 3, emit an "a", go back to state 1 and emit a "c", what is the probability of all of that happening for this HMM?
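Continuing the toy A and B defined above (with states 1, 2, 3 stored as indices 0, 1, 2), that question asks for the probability of one specific chain of events:

```python
p = (B[0, 0]      # emit "a" in state 1
     * A[0, 1]    # jump from state 1 to state 2
     * B[1, 1]    # emit "b" in state 2
     * A[1, 2]    # jump from state 2 to state 3
     * B[2, 0]    # emit "a" in state 3
     * A[2, 0]    # jump from state 3 back to state 1
     * B[0, 2])   # emit "c" in state 1
print(p)          # one small number for this particular path and these emissions
```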

This question's fairly easy to answer ... you just multiply all the individual probabilities of the emissions and transitions together, exactly as in the sketch above. But what if I ask the more difficult question, where you don't know what states it jumped through, such as: What is the probability that this HMM generated the sequence a, b, a, c?

This is trickier. Several other sequences of transitions (e.g. rather than 1,2,3,1 you might have had 1,3,2,1 or even 1,1,1,1) could have produced that sequence. Not knowing what states the system went through is why they are called "hidden" Markov models ... the state at any time is unknown. If you knew the states they would just be Markov models.
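Answering that harder question means summing the path-and-emissions product over every state sequence that could have produced a, b, a, c. Doing that naively blows up, but the standard forward algorithm gets the same answer in one left-to-right sweep. A sketch, reusing the toy A and B from above and assuming the model definitely starts in state 1:

```python
import numpy as np

def forward(obs, A, B, pi):
    """Total probability of an observation sequence, summed over all hidden state paths."""
    alpha = pi * B[:, obs[0]]            # probability of being in each state after the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # take one step and emit the next symbol
    return alpha.sum()

pi = np.array([1.0, 0.0, 0.0])           # start in state 1 for certain
obs = [0, 1, 0, 2]                       # the indices of a, b, a, c
print(forward(obs, A, B, pi))            # P(a, b, a, c | this HMM)
```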

You can work out that probability exactly by summing over all possible state paths (which is what the forward computation above does), or you can use the Viterbi algorithm, which finds the single most likely state path; its probability is often used as an approximation to the full sum.
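A sketch of Viterbi in the same toy setting: it keeps, for each state, only the best-scoring way of reaching it, then traces those choices backwards to recover the most likely path.

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely hidden state path for an observation sequence, and its probability."""
    delta = pi * B[:, obs[0]]                 # best score ending in each state so far
    back = []                                 # backpointers for recovering the path
    for o in obs[1:]:
        scores = delta[:, None] * A           # score of every (previous state, next state) pair
        back.append(scores.argmax(axis=0))    # best previous state for each next state
        delta = scores.max(axis=0) * B[:, o]
    path = [int(delta.argmax())]              # best final state ...
    for bp in reversed(back):                 # ... then walk the backpointers in reverse
        path.append(int(bp[path[-1]]))
    return list(reversed(path)), float(delta.max())

path, p = viterbi([0, 1, 0, 2], A, B, pi)     # a, b, a, c with the toy model above
print(path, p)                                # the single most probable state sequence and its score
```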

Fine, I hear you say ... what use is that for speech recognition? Well, imagine now that I have an HMM for every word. And rather than the symbols being letters, let them be the sounds that you made in a short chunk of time -- a few tens of milliseconds -- that we've processed in some way (usually using Fourier transforms and cepstral coefficients) to boil down to a small set of sound symbols. I record what you say, chop it into little chunks, then I work out the probability that the sounds were generated by this particular HMM. I do that for each HMM (i.e. for every word), and then I choose the word associated with the HMM that had the highest probability. Hey presto! I have a word that should correspond to the word that you spoke.
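Boiled down to code, the recognition step is just a loop over the word models. The forward() sketch from above does the scoring; the random_word_model() helper below is only a placeholder for models that would really come from training:

```python
import numpy as np

def random_word_model(n_states=3, n_symbols=3, seed=0):
    """Stand-in for a trained word model; real A, B and pi come from training, not a RNG."""
    rng = np.random.default_rng(seed)
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(n_states, 1.0 / n_states)
    return A, B, pi

word_models = {"yes": random_word_model(seed=1), "no": random_word_model(seed=2)}
obs = [0, 1, 0, 2]   # symbol indices extracted from the chopped-up, processed audio

# Score the observed sounds against every word's HMM and keep the winner.
best_word = max(word_models, key=lambda w: forward(obs, *word_models[w]))
print(best_word)
```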

Why do HMMs work so well? Because of the way they're designed, you can stretch words here and there or speed up the way that you say them and the basic idea will still work; you can say them slightly differently and it will still work; you can change the pitch of your voice and it will still work. Well, not always, but close enough to it to be useful. Basically, it's robust because it's a good stochastic model.

Ok, ok, but there are all these darned probabilities that need to be set. Where are they going to find the drone that has to set the buggers? For every word there might be thousands of probabilities that need to be set.

Well, they don't need one. There is a thing called the Baum-Welch re-estimation algorithm that can work out the probabilities from data. In other words, you buy thousands of hours of recordings of people talking (called a corpus) and you train your HMMs on them. This is an example of machine learning, and it's a nifty idea, because it means that the speech recognition software can start off with generic settings for everyone, but as time goes on, it tunes itself to understand your voice better.
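For the curious, here is one way Baum-Welch re-estimation can be written down: an expectation-maximisation loop that repeatedly asks "given the current A, B and pi, how much time did the model probably spend in each state, and which jumps did it probably make?" and then resets the probabilities to match those expectations. This is a didactic sketch, assuming a single short symbol sequence; a real trainer works on many sequences from a large corpus with much more numerical care.

```python
import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iter=50, seed=0):
    """Estimate A, B and pi from one observation sequence by expectation-maximisation."""
    obs = np.asarray(obs)
    T = len(obs)
    rng = np.random.default_rng(seed)
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(n_states, 1.0 / n_states)
    for _ in range(n_iter):
        # Forward pass, rescaled at each step to avoid numerical underflow.
        alpha = np.zeros((T, n_states)); scale = np.zeros(T)
        alpha[0] = pi * B[:, obs[0]]; scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
        # Backward pass with the same kind of scaling.
        beta = np.zeros((T, n_states)); beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
        # Expected state occupancies (gamma) and expected transitions (xi).
        gamma = alpha * beta
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = np.zeros((n_states, n_states))
        for t in range(T - 1):
            x = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
            xi += x / x.sum()
        # Re-estimate the parameters from those expectations.
        pi = gamma[0]
        A = xi / gamma[:-1].sum(axis=0)[:, None]
        B = np.vstack([gamma[obs == k].sum(axis=0) for k in range(n_symbols)]).T
        B /= gamma.sum(axis=0)[:, None]
    return A, B, pi

# Train on a toy sequence of symbol indices (in practice: hours of processed speech).
A_hat, B_hat, pi_hat = baum_welch([0, 1, 0, 2, 1, 0, 0, 1, 2], n_states=3, n_symbols=3)
```

The loop above is the whole idea; in practice you would reach for an existing toolkit rather than rolling your own.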

Hidden Markov Models are also used in genomics. In much the same way as they are used in speech recognition, they can be trained using large amounts of sequenced DNA. These models can then be used to quickly make comparisons between an unknown sequence and the large store of already sequenced data.

A small technical clarification about the definition of HMMs: in addition to the matrices defining the transition probabilities (usually called "A") and the symbol emission probabilities (similarly, matrix "B"), a third matrix is often provided that gives the probability of each state being the initial one (instead of just giving one initial state). That third matrix (a vector, actually) is usually referred to as pi.
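In the sketches above that generalisation costs nothing, because pi was already a vector, just a one-hot one:

```python
import numpy as np

pi_fixed_start = np.array([1.0, 0.0, 0.0])   # "a starting state": always begin in state 1
pi_general     = np.array([0.6, 0.3, 0.1])   # the general definition: begin anywhere, with these odds

# forward() and viterbi() above take pi as an argument, so they handle both cases unchanged,
# e.g. forward([0, 1, 0, 2], A, B, pi_general).
```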

Hidden Markov Models are also used quite a lot in computational linguistics, for example to resolve the ambiguities when tagging a piece of text (that is, saying "this is a noun, this is a verb", etc.): when encountering the word "green", you can't tell whether it's a noun (the colour) or an adjective -- so you associate it with the symbol Noun|Adjective, which can be emitted either by the state Noun or the state Adjective. For that particular application, people sometimes refer to the symbols as "ambiguity classes". Apart from that, the principle is the same as usual with HMMs: finding the path in the network that gives the best score for a given sequence of symbols, and keeping the states (i.e. the tags) found on that path. With proper training and maybe a few additional heuristics to help the tagger, it's quite easy to reach an accuracy of about 95% -- which is an excellent result when you deal with natural languages.
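As a toy illustration of that set-up, here is a three-tag model where the states are tags and the symbols are ambiguity classes, decoded with the viterbi() sketch from earlier. Every number is invented purely for the example:

```python
import numpy as np

tags = ["Det", "Adj", "Noun"]
classes = ["Det", "Noun|Adjective", "Noun"]     # "green" maps to the "Noun|Adjective" class

A = np.array([[0.0, 0.5, 0.5],                  # a determiner is followed by an adjective or a noun
              [0.0, 0.3, 0.7],                  # an adjective by another adjective or a noun
              [0.4, 0.2, 0.4]])
B = np.array([[1.0, 0.0, 0.0],                  # the Det state only emits the Det class
              [0.0, 1.0, 0.0],                  # the Adj state emits the ambiguous class
              [0.0, 0.6, 0.4]])                 # the Noun state emits either class
pi = np.array([0.7, 0.1, 0.2])

obs = [0, 1, 2]                                 # "the green door" -> Det, Noun|Adjective, Noun
path, _ = viterbi(obs, A, B, pi)
print([tags[s] for s in path])                  # ['Det', 'Adj', 'Noun'] -- "green" resolved to Adj
```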
