Introduction

Speech recognition is the task of converting speech signals into text. Automatic Speech Recognition, or ASR, is the technology that allows a computer to perform the conversion process. Although human beings can generally perform the task from an early age, producing a reliable ASR system is one of the most daunting tasks in data processing.

Below I describe the basic structure of modern ASR systems, the problems they face and how they try to overcome them, finishing by characterising the state of the art and outlining current areas of research. I have tried to avoid equations and to assume as little mathematical knowledge as possible, but speech recognition is a very difficult problem and consequently uses some very advanced techniques. Trying to describe it E2-style, without pictures or complex typesetting, is even harder.

Overview

A modern ASR system takes as its input a list of numbers representing the amplitude of a speech signal sampled at regular time intervals. PCM audio is the most common representation of uncompressed audio data used in computer systems, and an ASR system will typically accept PCM data either from a microphone and analog-to-digital converter (ADC) or from digitized speech already stored in a computer system. At a sampling rate of 16 kHz, 16000 values are generated per second. The task of ASR is to map a sequence of PCM values onto a sequence of words.
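
As a concrete illustration, here is a minimal Python sketch that reads such PCM data from a (hypothetical) 16-bit mono WAV file using only the standard library; the file name is an assumption for the example.

    import wave
    import struct

    # Assumed example file: 16 kHz, 16-bit, mono PCM speech recording.
    wav = wave.open("utterance.wav", "rb")
    sample_rate = wav.getframerate()      # e.g. 16000 samples per second
    n_samples = wav.getnframes()
    raw_bytes = wav.readframes(n_samples)
    wav.close()

    # Unpack the bytes into a list of signed 16-bit amplitude values.
    samples = struct.unpack("<%dh" % n_samples, raw_bytes)

    print("Sampling rate:", sample_rate)
    print("Duration (s):", n_samples / float(sample_rate))
    print("First ten amplitude values:", samples[:10])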

The task is complicated because the same word may be represented by many different sounds. It is not possible just to record a sample of a word and compare it against the speech to be recognized: in general there are three sorts of variations in the way a spoken word sounds.

Inter-speaker variations are due to the fact that different people speak differently: they may have different accents, speak at different speeds, and the different shapes of their vocal tracts will affect the acoustic qualities of their speech.

Variations also occur because the same person will pronounce the same word in different ways on different occasions. This may be due to differing stresses or intonation, to the word's place in a sentence or the words surrounding it (as sounds from the end of the previous word mix into the start of the next), to hesitations and irregular pauses, to mood and the emotion they are trying to convey, or to physiological factors such as tiredness or a stuffy nose.

Thirdly, there are variations from external sources. There may be environmental noise from traffic, air conditioning, and many other factors. The characteristics of the speech signal will also depend on the microphone and other aspects of the communication channel; this will tend to be constant for a given system (or at least a fixed location), but needs to be compensated for.

Speech recognition which can take these things into account is therefore an enormously complex task of pattern recognition. In practice, the task is often simplified by imposing restrictions. A speaker-dependent system is tailored to one individual's voice, and is not required to recognise other people, making the recognition far simpler.

Another way to simplify the task is to break the speech up into individual words, which means that each word is pronounced more clearly and the boundaries between words are easy to detect (something that is otherwise difficult, since continuous speech frequently has no inter-word pauses): isolated word recognition is therefore less error-prone than continuous speech recognition.

However the task is restricted, all speech recognizers contain a few basic components. First there is a feature extraction module, the front end which takes in raw audio data (such as from an ADC) and attempts to remove noise, extraneous material and redundancy in the data. The data is then passed into the recognizer proper, which computes which series of words is most likely to correspond to the audio data. To do this, the recognizer also makes reference to a language model, which attempts to represent the valid ways in which words can be combined. The output of the recognizer will be text, which hopefully matches what a human listener would perceive.

One additional thing to bear in mind is that all speech recognition systems must be trained before they are used. Whether this is done by the software engineer during development or by the user on site may vary, but it is always necessary to present the recognizer with a large amount of sample data and to tell the recognizer what that data actually means, in order for the recognizer to build up a database to check new speech against.

Feature Extraction

Purpose and Method
A raw audio signal may contain 16000 values per second. In contrast, human speech typically contains fewer than 10 phonemes (the basic sounds of speech) per second, and perhaps 3 or 4 words per second. There is clearly an enormous amount of redundancy in the input signal, and the task of feature extraction is to remove all data which is due to the individual speaker, the environment, and the particular circumstances of an individual utterance, leaving only those parameters which uniquely identify a speech sound and hence a word.

It is helpful to know a bit about the acoustic properties of speech. As with all acoustic signals, speech can be considered as being made up of sinusoids of different pitches and amplitudes. On the large scale, these pitches and amplitudes will vary, but over a short period of time (10-25 ms) a speech signal is approximately constant. For an adult male, speech energy is concentrated in the frequencies between 50 Hz and 5 kHz; women's speech is slightly higher, and children's may be significantly higher. Within this range, certain frequencies are more important: most of the information is found in the lower bands, and therefore greater weight is given to these bands in calculation.

Speech production can be considered in digital signal processing (DSP) terms to have two components. Firstly, there are the vocal cords, which produce a series of pulses. Secondly, there is the vocal tract, which can be modified by moving the lips, tongue, velum, teeth, and jaw. The vocal tract acts like a filter to modulate the pulses from the vocal cords, producing the speech sounds we hear from the mouth and nose.

Speech is made up of vowels and consonants, and sounds may be voiced (produced using the vocal cords, generally giving the sound a clear pitch) or unvoiced (more like white noise). Vowels can be identified by their formants, the characteristic pattern of frequencies which go to make up the sound; each vowel has its own formant pattern. Vowels carry most of the energy of a speech signal, while consonants are weaker and harder for an ASR system to recognise.

Feature extraction is performed by extracting the different frequencies that make up the sound. This is done in one of two ways, in each case acting on a block of data of perhaps 10 or 20 ms duration. (Those not familiar with maths or DSP can skip the rest of the paragraph.) The first is PLP (perceptual linear prediction), which uses linear prediction coefficients produced by computing the autocorrelation of the time-domain signal; these values describe a filter which is similar to the filtering effect of the vocal tract. The second is to use Mel frequency cepstral coefficients (MFCC), which are computed by taking the time-domain audio data, computing the discrete Fourier transform (DFT), combining the output into bands to reduce the number of coefficients, taking the logarithm (which turns the channel's impulse response, convolved with the speech in the time domain and therefore multiplied with it in the frequency domain, into a simple additive term), and then performing an inverse transform (an inverse DFT or a discrete cosine transform). Typically 12 MFCC coefficients are used, and these are combined with a measure of the signal energy; first and second order derivatives of these values are also computed to represent changes in the signal.
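
For those comfortable with a little code, the following is a minimal sketch of the MFCC pipeline in Python with NumPy. The frame length (25 ms), step (10 ms), number of mel filters and other parameters are common illustrative choices rather than the settings of any particular recognizer, and the energy term and derivatives mentioned above are omitted.

    import numpy as np

    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters, n_fft, sample_rate):
        # Triangular filters spaced evenly on the mel scale up to the Nyquist frequency.
        mel_points = np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
        fbank = np.zeros((n_filters, n_fft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, centre, right = bins[i - 1], bins[i], bins[i + 1]
            for k in range(left, centre):
                fbank[i - 1, k] = (k - left) / max(centre - left, 1)
            for k in range(centre, right):
                fbank[i - 1, k] = (right - k) / max(right - centre, 1)
        return fbank

    def mfcc(signal, sample_rate=16000, frame_len=400, frame_step=160,
             n_fft=512, n_filters=26, n_ceps=12):
        n_frames = 1 + max(0, (len(signal) - frame_len) // frame_step)
        window = np.hamming(frame_len)
        fbank = mel_filterbank(n_filters, n_fft, sample_rate)
        # Type-II DCT matrix used as the inverse transform of the log mel energies.
        n = np.arange(n_filters)
        dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), 2 * n + 1)
                     / (2.0 * n_filters))
        ceps = np.zeros((n_frames, n_ceps))
        for t in range(n_frames):
            frame = signal[t * frame_step : t * frame_step + frame_len] * window
            power = np.abs(np.fft.rfft(frame, n_fft)) ** 2   # power spectrum of the frame
            logmel = np.log(fbank.dot(power) + 1e-10)        # log energy in each mel band
            ceps[t] = dct.dot(logmel)                        # cepstral coefficients 1..12
        return ceps

    # Example run on one second of random noise, just to show the output shape.
    features = mfcc(np.random.randn(16000))
    print(features.shape)    # about 100 frames of 12 coefficients each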

The result of feature extraction (FE) is a small number of coefficients, which can then be passed to the recognizer. The feature extraction module generates a vector of these coefficients at regular intervals, and each vector represents the relevant properties of the audio signal at that point.

Improvements to Feature Extraction
There are a number of techniques which can be used to improve the quality of the data before it is passed to the recognizer. Some operate on the audio data before feature extraction, others are used during or after FE.

Noise reduction techniques can be used to remove background noise. One method of doing this is to assume that the frequency spectrum of noise is constant, to measure this noise before the user starts to speak, and to subtract the noise from the signal once the user is speaking. Smoothing can be applied (e.g. using a low-pass filter) to frequency or cepstral coefficients to remove short-term changes caused by noise spikes, although this will also remove information about sudden changes in speech.
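
As an illustration of the first method, here is a much simplified spectral subtraction sketch in Python with NumPy. It processes fixed non-overlapping frames and assumes that the first few frames of the recording contain only background noise, which a real system would have to verify; the frame size and flooring constant are illustrative choices.

    import numpy as np

    def spectral_subtraction(signal, frame_len=400, noise_frames=10, floor=0.01):
        n_frames = len(signal) // frame_len
        frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
        spectra = np.fft.rfft(frames, axis=1)
        magnitude, phase = np.abs(spectra), np.angle(spectra)

        # Estimate the noise spectrum from the leading frames (assumed speech-free).
        noise_mag = magnitude[:noise_frames].mean(axis=0)

        # Subtract the noise estimate, flooring to avoid negative magnitudes.
        cleaned_mag = np.maximum(magnitude - noise_mag, floor * noise_mag)

        # Rebuild the time-domain signal using the original phase.
        cleaned = np.fft.irfft(cleaned_mag * np.exp(1j * phase), n=frame_len, axis=1)
        return cleaned.reshape(-1)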

A voice activity detector (VAD) can tell the difference between speech and other noise, and this can be used to turn the recognizer on and off. Techniques for this range from the simple (measuring the signal energy), through the less obvious (combining energy measures with the zero crossing rate of the signal), to very complex comparisons of the energy of different frequency bands.
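
A toy version of the simpler techniques might look like the following; the thresholds are invented for illustration, and a practical VAD would tune or adapt them to the environment.

    import numpy as np

    def is_speech(frame, energy_threshold=1e6, zcr_threshold=0.25):
        frame = np.asarray(frame, dtype=float)
        energy = np.sum(frame ** 2)                       # frame energy
        # Zero crossing rate: fraction of adjacent samples with differing sign.
        zcr = np.mean(np.sign(frame[:-1]) != np.sign(frame[1:]))
        # Voiced speech tends to have high energy; unvoiced sounds (e.g. fricatives)
        # have modest energy but a high zero crossing rate.
        return energy > energy_threshold or (energy > 0.1 * energy_threshold
                                             and zcr > zcr_threshold)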

Vocal tract length normalization (VTLN) can be used to compensate for some of the differences between speakers, by raising or lowering the frequency of the signal to match that of the reference. The frequency difference between a speaker and the reference can be estimated in a number of ways: using pattern recognition devices like neural nets and hidden Markov models; by a trial-and-error search, varying the frequency and comparing the accuracy of the recognizer; or by estimating the average frequencies of the speech formants.

Mean and variance normalization of the logarithms of the frequency coefficients can be used to reduce the effects of channel noise, based on the assumptions that channel noise is constant, and is convolved with the signal. The mean and variance need to be calculated, and in an interactive system this may be done using partial sample averages or Bayesian methods.
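
A minimal sketch of this kind of normalization over a whole utterance (often called cepstral mean and variance normalization) might look like this; an interactive system would instead maintain running estimates of the statistics as described above.

    import numpy as np

    def cmvn(features):
        # features: array of shape (n_frames, n_coefficients), e.g. MFCCs.
        mean = features.mean(axis=0)
        std = features.std(axis=0) + 1e-10     # avoid division by zero
        return (features - mean) / std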

Feature transformations are used on the FE output to reduce the redundancy in the output coefficients, and thereby to reduce the total number of coefficients. The techniques of linear and non-linear discriminant analysis (LDA and NLDA) reduce the number of coefficients by multiplication by a pre-computed matrix, or by processing with multilayer perceptrons (MLPs) or similar networks.

Recognition

Recognition using Hidden Markov Models
There are two techniques used for recognition, hidden Markov models (HMMs) and artificial neural nets (ANNs); the latter can be used in conjunction with HMMs to create hybrid systems. Both are pattern-matching systems. The basic problem they face in recognition is that of taking a sequence of input feature vectors and working out which of the training sequences it most closely resembles. One factor that particularly complicates this is time-stretching: the same word may be pronounced at different speeds, and, even more problematically, within a word different components will vary in length and varying pauses will be inserted.

Here I will describe the procedure for recognizing isolated whole words using HMMs. Similar techniques can be used for recognising parts of words, which may then be combined into whole words or into sentences of continuous speech. To recognize isolated whole words, each word can be represented using a different hidden Markov model.

An HMM models a word as made up of a series of states (like a finite state machine) through which the speaker passes as they pronounce the word. Each state is therefore associated with a specific sound, so certain feature vectors are more common in that state. For a whole-word recognizer, four states are commonly used to represent the different stages in the word, and it is assumed that the states are moved through in sequence from first to last. Modeling each word in this way makes it easier to stretch and shrink the components to get the best match between input and reference.

A state transition is a movement from one state to another, and one is assumed to happen with every new feature vector. A transition may mean remaining in the same state or advancing to a new state. Each transition has a fixed probability determined during training. This has consequences when you consider how an utterance maps onto a sequence of states: if the states are numbered 1 to 4, then in uttering a word the states measured against time may be 111222223344444 or 1122222233344444444 or anything similar. The model records that certain patterns (e.g. 111222222333444) are more likely, while others (such as 1234444444444444) are less likely; these probabilities are determined during training.
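
As a concrete illustration, a left-to-right whole-word HMM with four states might have a transition matrix like the one below; the probabilities are invented for the example.

    import numpy as np

    # Rows are the current state, columns the next state.  Each row sums to 1.
    # A left-to-right model only allows staying put or advancing one state.
    A = np.array([
        [0.6, 0.4, 0.0, 0.0],
        [0.0, 0.7, 0.3, 0.0],
        [0.0, 0.0, 0.7, 0.3],
        [0.0, 0.0, 0.0, 1.0],
    ])

    # Probability of the state sequence 1112234, given that we start in state 1:
    # multiply the transition probabilities along the path.
    path = [0, 0, 0, 1, 1, 2, 3]                   # zero-based state indices
    p = np.prod([A[s, t] for s, t in zip(path, path[1:])])
    print(p)   # 0.6 * 0.6 * 0.4 * 0.7 * 0.3 * 0.3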

Considering speech production this way allows you to map a received speech signal representing a word onto a reference word even if the two signals have different timings of their subcomponents, their phonemes. Assigning probabilities to routes means that if you cannot tell whether the state sequence is 111222233333444 or 11123444444444, you can pick the more probable one. However, this must be combined with an analysis of the actual speech sound, because the most probable sequence will not occur every time.

To calculate the probability that input audio matches a reference model of the word, it is not enough just to have a set of state transitions: it is also necessary to associate states with speech sounds in some way. This is done by specifying the probability of a given state producing different feature vectors. By multiplying the probability of arriving in a state by the probability of a state producing a given set of feature vectors, you can balance the likelihood of a path against its fit with the speech signal, and compute the most likely path.

In general, the probability of a given sound being associated with a given state is calculated using a pdf (probability density function), which maps a feature vector to a probability. Each state of each HMM has its own pdf, and calculating these probabilities for every state and every feature vector is the most time-consuming part of recognition (the actual calculations are done by determining the mean and variance of each coefficient of the vector during training, and using these to produce Gaussian probability distributions).
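
A sketch of such a pdf for a single state, assuming a diagonal-covariance Gaussian (i.e. independent coefficients) and working in log probabilities for numerical safety, might look like this; the means and variances would come from training and the feature vector from the front end.

    import numpy as np

    def log_emission_prob(feature_vector, means, variances):
        # log N(x; mu, sigma^2), summed over the (assumed independent) coefficients.
        return -0.5 * np.sum(
            np.log(2.0 * np.pi * variances)
            + (feature_vector - means) ** 2 / variances
        )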

You now have a method of calculating the probability of a segment of speech coming from a given state, and the probability of each sequence of states. From this, you can calculate the probability that a given word is represented by a given HMM. This seems to require a huge number of calculations, but there are search procedures which simplify them and make speech recognition possible on a desktop PC or even an embedded processor in a PDA or mobile phone. The algorithms for doing this are the Forward and Backward algorithms, used for calculating probabilities, and the Viterbi search, used for finding the most probable path.
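
The following is a minimal Viterbi search in log probabilities for a single word model; the transition matrix, emission probabilities and initial-state probabilities are assumed to have come from training, and a real recognizer would add pruning and other optimisations.

    import numpy as np

    def viterbi(log_pi, log_A, log_B):
        # log_pi: log initial-state probabilities, shape (n_states,)
        # log_A:  log transition matrix, shape (n_states, n_states)
        # log_B:  log_B[t, s] = log emission probability of frame t in state s
        n_frames, n_states = log_B.shape
        delta = np.zeros((n_frames, n_states))     # best score ending in each state
        back = np.zeros((n_frames, n_states), dtype=int)
        delta[0] = log_pi + log_B[0]
        for t in range(1, n_frames):
            scores = delta[t - 1][:, None] + log_A           # all previous-state options
            back[t] = np.argmax(scores, axis=0)
            delta[t] = scores[back[t], np.arange(n_states)] + log_B[t]
        # Trace back the most probable state sequence.
        path = [int(np.argmax(delta[-1]))]
        for t in range(n_frames - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        path.reverse()
        return path, float(np.max(delta[-1]))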

So you are able to input your feature vector into your HMM, find the most probable path, and find the probability that the speech sound you are processing corresponds to the word represented by the HMM (there is one HMM per word). Now you do the same for every other HMM, and find which word is most likely to have been spoken.

You can see that for a small vocabulary this is just about possible, but with a large vocabulary, it is necessary to work on a different level, by finding the most likely phoneme for each section of speech. To do this, you must work out where phonemes start and end: this can be done by including the start and end in your HMM as you include the other stages of the phoneme. Commonly each phoneme will be represented by one HMM with three states for beginning, middle and end, plus a dummy start and finish state to detect the edges of the phoneme. This technique can be extended to work on diphones or triphones (combinations of 2 or 3 phonemes), with greater accuracy, but far more models.

Using Neural Nets
Multilayer perceptrons (MLPs) provide the most common architecture for using neural nets. Nets can be used for isolated word recognition. Alternatively they may be combined with HMMs to produce a hybrid system: neural nets can be used to calculate probabilities, or to perform feature extraction.

HMMs have the advantage of a strong theoretical background, and are generally easier to train. ANNs are more flexible and potentially faster and smaller; they also have the advantage that they place fewer restrictions on inputs, unlike HMMs which require the coefficients of the input vector to be statistically uncorrelated to each other.

Language models

In determining the current word being recognised, it is useful to know the probability of different words occurring in speech. In a simple system, you may be able to restrict the order of words e.g. to basic verb-noun commands. You can also use previous words to estimate the probability of the word which will follow, to privilege common words over rare words, or to disambiguate similar-sounding or identical-sounding words. This is the province of language modelling.

A language model is essentially an array of values, each value being the probability of a specific word following a given previous word. You may have a simple system such as a phone voice-dialer, where you say a pre-programmed name or a sequence of digits, but are not allowed to say "0 1 5 6 Colin 8 3". Here you may have a probability of 0 for Colin following a digit, but a higher probability of Colin being said with no preceding words. This means that following a digit you only have 10 possible words to consider (zero, one, two, three, etc) rather than 50 possible words (zero, one, two, three, ..., Colin, James, Jack, Bruce, and everyone else you know): the result is greater speed and accuracy.
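
A toy language model for this voice-dialer example might be written as follows; the vocabulary and probabilities are invented purely to illustrate the idea of a table of word-to-word probabilities.

    # Simple bigram model: the probability of a word given the previous word.
    DIGITS = ["zero", "one", "two", "three", "four",
              "five", "six", "seven", "eight", "nine"]
    NAMES = ["colin", "james", "jack", "bruce"]

    def bigram_prob(previous_word, word):
        if previous_word is None:                       # start of the utterance
            return 1.0 / (len(DIGITS) + len(NAMES))     # any word may start
        if previous_word in DIGITS:
            return 0.1 if word in DIGITS else 0.0       # a name never follows a digit
        return 0.0                                      # nothing follows a name

    print(bigram_prob("five", "colin"))   # 0.0 -- "0 1 5 6 Colin ..." is ruled out
    print(bigram_prob("five", "six"))     # 0.1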

More complex systems may use more complex rules to encode subtler features of language. But in each case, the output of the language model is the probability of each word given the previous words in the sequence. You can multiply this by the per-word probabilities given by the HMMs, and find the largest product to determine the most probable word.

In a more complex system that works on phonemes, diphones or triphones rather than words, it is also necessary to consider the probabilities of each sequence of phonemes forming a word. In such a case, the sequence of phonemes can also be modelled using an HMM and the most likely sequence which matches a word in the vocabulary is found.

Training

Before an ASR system is used, it is necessary to expose it to many samples of speech. For a speaker-independent system, these should cover a wide range of speakers, in terms of age, sex, and regional accent. You also want multiple instances of each word in the system's vocabulary.

Training data is generally tagged, with word boundaries being marked. Individual words are also labelled, and in a system that works on the level of phonemes, a phonetic transcription must be provided, including the boundaries of the individual phonemes. A number of organizations and corporations sell corpora of training data, often collected from a large number of speakers, and with specific sorts of background noise (e.g. traffic, airplane, tank).

To perform training, the training data is passed through the feature extraction module so that it arrives at the recognition module in the same format as the data the recognizer will later have to recognise. This training data is used to compute the probabilities used in the HMMs and/or neural nets, although it may also be used for other parts of the system (e.g. the feature transformation parameters in the feature extraction block). For HMMs the techniques used are Maximum Likelihood Estimation (MLE) or Maximum a Posteriori (MAP) estimation, both of which make use of the Baum-Welch re-estimation algorithm.
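
Baum-Welch re-estimation is too involved to show here, but if the training frames have already been aligned to HMM states (as in a simple supervised bootstrap), maximum likelihood estimation of each state's Gaussian parameters reduces to computing per-state means and variances, roughly as follows.

    import numpy as np

    def estimate_state_gaussians(frames, state_labels, n_states):
        # frames: (n_frames, n_coefficients) feature vectors from the front end
        # state_labels: the state each frame is aligned to (0 .. n_states - 1)
        frames = np.asarray(frames)
        state_labels = np.asarray(state_labels)
        means, variances = [], []
        for s in range(n_states):
            state_frames = frames[state_labels == s]
            means.append(state_frames.mean(axis=0))
            variances.append(state_frames.var(axis=0) + 1e-6)   # floor the variance
        return np.array(means), np.array(variances)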

An alternative to training a system once at the beginning and then keeping the HMM or ANN parameters fixed is to use adaptive training, which modifies the models in the speech recognition system while speech is being recognized. This technique, known as speaker adaptation, offers an improvement over conventional speaker-independent systems. However, it still requires initial training to begin the process and to provide a reference for the less structured data used for adaptation.

The state of the art and the future

Performance of speech recognition is measured by the word error rate, obtained by adding the number of substitutions, insertions and deletions, and dividing this total number of mistakes by the number of words in the reference transcript. This is usually expressed as a percentage.

However, when sentences are being recognised, an error in any word can change the meaning of the sentence (unlike in typing where errors generally produce nonsense words). Another measure, "meaning accuracy", can be used to take this into account - if you assume that one error affects the meaning of 5 words (e.g. a clause or small sentence), you multiply the word error by 5 to get the meaning error.
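
A minimal word error rate calculation aligns the recognizer output against the reference transcript by edit distance and divides the number of substitutions, insertions and deletions by the length of the reference; the example sentences below are invented.

    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = minimum edits to turn the first i reference words
        # into the first j hypothesis words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution or match
        return d[len(ref)][len(hyp)] / float(len(ref))

    print(word_error_rate("call colin on his mobile", "call column on mobile"))  # 0.4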

The word error rate of a system depends on the task. Spoken digit recognition can be performed with an error rate of 0.3%. Slightly more complex multi-user systems with a constrained grammar (e.g. verb-noun pairs) have an error rate of below 5%: the Air Traffic Information Service, which has 2000 words and a bigram language model, has an error rate of 3%. (Cole 1.2.2)

The performance of systems on unconstrained speech from multiple speakers is notably worse. For read speech (e.g. from newspaper articles), a rate of 7-8% is possible; however, for spontaneous speech (e.g. transcribing voice mail) error rates can be as high as 30% (Padmanabhan p.1). Dictation systems for home computers, which are trained for a single user, claim very low error rates, but in practical use the true rate is 5% to 10% (Koester; Codewell), or a meaning error of 20% to 40%; the main products for dictation are IBM's ViaVoice and Dragon Systems' NaturallySpeaking.

The principal applications for speech recognition today are dictation/voice control systems for home computers and the assistive technology market; voice dialing and similar applications for mobile phones; and telephone menu systems.

With the basic technology of HMMs and neural nets being well established, and accuracy for many constrained systems plateauing (e.g. speaker-dependent dictation, small vocabulary recognition), future research is concentrating on improving performance and easing implementation in a few areas.

  • Robustness in high noise environments, much of which is being driven by military applications, but which is also necessary for automotive applications.
  • Improving performance on spontaneous speech, which is far more irregular and full of hesitation and false starts.
  • Capturing prosody (intonation, pitch and rhythm) as well as words, which has less practical value, but is an interesting problem in its own right.
  • Constructing efficient and accurate language models, and handling out-of-vocabulary words.
  • Categorizing speakers, e.g. by sex, and constructing separate models or adapting the models to a class of speakers.
  • Confidence measures, allowing assessment of how likely an individual output word is to be correct.
  • Combining different recognizers and selecting the most likely output from all of them, which can improve system performance significantly at the cost of increased processing requirements.

Another major development is the Aurora project of ETSI, aimed at specifying a standard for speech recognition by a remote server. Processing and memory requirements are still a major issue in ASR, and it is likely that substantial improvements in performance will depend on increased processing power.

The development of ASR has paralleled that of many other artificial intelligence tasks, where initial claims have never been quite fulfilled, but useful applications on a smaller scale have emerged. ASR systems are now good enough for many tasks, where they can replace numeric keypads or human operators in restricted domains; however, the day when a computer understands you as fully as a human being is still far off, and it is unclear to what extent the gap can ever be closed.


Main sources
  • Claudio Becchetti and Lucio Prina Ricotti. Speech Recognition: Theory and C++ Implementation. Wiley. New York. 1999.
  • Codewell. "Speech Recognition in the Classroom". Typewell. 1999-2003. http://www.typewell.com/speechrecog.html (July 23, 2003).
  • Ronald Cole et al. "Survey of the State of the Art in Human Language Technology". Center for Spoken Language Understanding, Oregon Graduate Institute, USA. 1996. http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html
  • Heidi Horstmann Koester and Simon P. Levine. "User Performance With Continuous Speech Recognition Systems". 2000. http://umrerc.engin.umich.edu/jobdatabase/articles/resna00symp/KoesterLevine2.rtf
  • M Padmanabhan et al. "Automatic Speech Recognition Performance on A Voicemail Transcription Task". IBM T J Watson Research Center. IBM Technical Report RC-22172. 2001. http://domino.watson.ibm.com/library/cyberdig.nsf/0/c3b4d86c12c3611c85256acb0052de98?OpenDocument&Highlight=0,speech
  • Steve Young et al. The HTK Book (HTK Version 3.2). Cambridge University Engineering Department. 2002. http://htk.eng.cam.ac.uk/docs/docs.shtml