A technique that was used early on for speech recognition and is still used for some simple applications such as isolated word recognition, as well as in other problems such as robot sensor processing. Unfortunately, the technique is not nearly as sexy as the name would suggest. It has since been pretty much superseded by hidden Markov models.
To understand how it works, you need to know a little about what makes speech recognition difficult. What makes it difficult is that every time you say a word, you never say it in exactly the same way, with the same timing, as any other time you say it. You might say it faster or slower overall, but you might also say it faster at the beginning of the word and slower at the end.
It also helps to understand how it is used in practice. For every word you have a template, essentially a really good example of that particular word. The basic idea is to measure how close the word you heard is to each of the templates, then choose the word associated with the template that was most similar.
But how do you compare the sounds you heard against the template when the times don't match and the template might even be longer or shorter than the sound?
This is where dynamic time warping comes in. Let's call the thing we're trying to classify the instance. Both the instance and the template consist of a sequence of measurements, one for each chunk of time. To be exact, each measurement is actually a vector of Fourier and cepstral coefficients, but that doesn't really matter. All that matters is that I can measure the distance between two of these vectors. Now we need a diagram.
Time in the  ^
instance     |
             |
      end    +            _____----  best match map
             |       ____/
             |      /
             |  ___|
             | /
             |/
     start   +-----------------+--->
            start             end      Time in the
                                         template
So what we try to do is find a "map" between the time in the template and the time in the instance. That's what the "best match map" in the above diagram represents. But which mapping do we choose? Obviously the start of the template has to map to the start of the instance, and the same goes for the ends, but what about the points in between? If we were just doing a "linear warp" it would be easy: we'd just have a straight line from (start, start) to (end, end). But because we stretch words in irregular ways (that is to say, we don't simply speed up or slow down, but we might slow down the beginning and then speed up the end), that won't work.
The heuristic we choose is to find the path that minimises the total distance along the whole path; in other words, find the map that gives the best possible fit between the instance and the template. The total distance is the sum of the distances between the vectors we talked about above, one at each time step. We can use dynamic programming to speed up this calculation too.
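To make that concrete, here is a rough sketch of the dynamic programming calculation in Python. The Euclidean frame distance and the function names are just assumptions for illustration; a real recogniser would use whatever distance suits the coefficient vectors mentioned above.

    import numpy as np

    def frame_distance(a, b):
        # Distance between two feature vectors (one per chunk of time).
        # Plain Euclidean distance is an assumption here; any metric would do.
        return np.linalg.norm(np.asarray(a) - np.asarray(b))

    def dtw_distance(instance, template):
        # instance and template are sequences of feature vectors, possibly of
        # different lengths.  D[i, j] holds the smallest total distance of any
        # map that pairs the first i instance frames with the first j template
        # frames.
        n, m = len(instance), len(template)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = frame_distance(instance[i - 1], template[j - 1])
                # The map may advance in the instance, the template, or both,
                # which is what lets one word stretch relative to the other.
                D[i, j] = cost + min(D[i - 1, j],       # stretch the template
                                     D[i, j - 1],       # stretch the instance
                                     D[i - 1, j - 1])   # advance both together
        return D[n, m]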
We use this "smallest possible distance" between the two as a measure of how well the instance fits the template. Then we choose the template that has the smallest of these smallest possible distances and that becomes our guess at the word!
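In code, that final decision is just a minimum over the templates, something like this (again only a sketch, reusing the dtw_distance function from the sketch above; the templates dictionary is a made-up example):

    def classify(instance, templates):
        # templates: a dict mapping each word to its template, i.e. the
        # sequence of feature vectors for a really good example of that word.
        return min(templates, key=lambda word: dtw_distance(instance, templates[word]))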
There's still the issue of how you come up with the templates in the first place, but we'll leave that for another node ...