Digital speech synthesis is fairly different from the sound modeling approach taken by analog speech synthesis, and dramatically different from the physical modeling done in mechanical speech synthesis. It is, instead, an entirely new beast wholly made possible by the ability of computers to deal with a signal in much-faster-than-human time frames. Indeed, that's why the synthesis is "digital": it occurs entirely within the realm of discrete, unrelated digits, rather than continuous, related states.

Closely related to digital synthesis is simple digital sampling, by which a sound is recorded and played back with a digital device. That is, a microphone is connected to an analog-to-digital converter, which samples the microphone's output amplitude thousands of times a second and sends those readings on to be recorded. At this point the recorded sound is entirely digital, no longer having any continuous sections. From there, the readings can be stored indefinitely, and eventually sent to a digital-to-analog converter attached to a loudspeaker. The DAC converts the numbers back into electrical signals, which drive the loudspeaker's voice coil and reproduce the sound. While this is a valid approach for having a computer produce speech, it is not synthesis as such, since the sound is recorded at an earlier time rather than created in real time.
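The record-and-play-back loop can be sketched in a few lines of Python. The sample rate, bit depth, and sine-wave "microphone" below are illustrative stand-ins, not the specs of any particular device:

```python
import math

SAMPLE_RATE = 8000   # samples per second; a common early low-fidelity rate
BIT_DEPTH = 8        # bits per stored sample

def record(signal, seconds):
    """Simulate the ADC: read the continuous signal at discrete intervals
    and quantize each reading to one of 2**BIT_DEPTH integer levels."""
    levels = 2 ** BIT_DEPTH
    samples = []
    for n in range(int(SAMPLE_RATE * seconds)):
        t = n / SAMPLE_RATE
        value = signal(t)                                  # amplitude in [-1, 1]
        samples.append(round((value + 1) / 2 * (levels - 1)))
    return samples

def play(samples):
    """Simulate the DAC: map each stored integer back to a signal level."""
    levels = 2 ** BIT_DEPTH
    return [(s / (levels - 1)) * 2 - 1 for s in samples]

# A 440 Hz tone standing in for the microphone's output.
tone = lambda t: math.sin(2 * math.pi * 440 * t)
stored = record(tone, 0.01)        # 80 integers: the sound is now entirely digital
reconstructed = play(stored)
```

Note that the stored list contains nothing but integers; the continuity of the original signal is gone, which is the point the paragraph above makes.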

One step up from that, and into what can really be considered synthesis, is phoneme-level synthesis. English words can be broken down into between thirty and sixty distinct phonemes; the usual estimate is roughly forty. All other spoken human languages work the same way, though some have dramatically lower or higher phoneme counts. Obviously, then, if all of these phonemes are recorded as above, one need only trigger them in the right order to produce speech. Old low-end speech synthesis systems, like the Intellivoice module or the Speak & Spell, did exactly this. Combining the crude nature of the phonemic approach with the low sampling rates of the time, this difficult-to-listen-to technology never really took off for business or industry use.
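The trigger-them-in-order idea is almost trivial in code. This is a minimal sketch with a hypothetical phoneme bank and placeholder sample values; real devices of the era stored compressed speech data rather than raw samples:

```python
# Hypothetical phoneme bank: each phoneme name maps to its recorded samples.
# The names and values here are made up for illustration.
PHONEME_BANK = {
    "AE": [0.1, 0.3, 0.2],
    "P":  [0.0, 0.5],
    "T":  [0.0, 0.4],
}

def synthesize(phonemes):
    """Phoneme-level synthesis: play the stored sounds back to back."""
    out = []
    for p in phonemes:
        out.extend(PHONEME_BANK[p])
    return out

samples = synthesize(["AE", "P", "T"])   # a crude rendering of "apt"
```

The crudeness the paragraph describes comes precisely from this blunt concatenation: nothing smooths the joins between one phoneme and the next.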

Even before any phonemic devices were constructed, engineers understood that speech built from so few sounds would never sound good enough to be of much use. So, while the consumer sector got these devices, academia and industry were working on the much more intelligible allophone synthesis. Instead of producing the phonemes themselves, this method used the transitions between them, numbering around 400. Where a phoneme system would pronounce the word apt by generating the sounds aah + p' + t', an allophone system would generate aap + p't'. By normalizing the starting and ending frequencies of the phoneme-halves at the ends of allophones with the corresponding phoneme-halves at the beginnings of others, voicing could be made quite consistent, if also quite monotone. By using higher sampling frequencies and resolutions (which necessitated larger, more expensive memory stores), synthesized allophonic speech could be crystal-clear.
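The boundary-matching idea can be approximated with a short crossfade at each join; this is a stand-in for the frequency normalization described above, and the allophone sample values are invented for illustration:

```python
def crossfade(a, b, overlap=2):
    """Join two allophone sample lists, blending `overlap` samples so the
    end of one unit flows into the start of the next instead of jumping."""
    if overlap == 0 or not a or not b:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    lead = b[:overlap]
    blended = [
        tail[i] * (1 - (i + 1) / (overlap + 1)) + lead[i] * ((i + 1) / (overlap + 1))
        for i in range(overlap)
    ]
    return head + blended + b[overlap:]

# Hypothetical allophone units for "apt": the aa-to-p transition, then p-to-t.
AAP = [0.2, 0.6, 0.5, 0.1]
PT  = [0.1, 0.3, 0.0, 0.0]
word = crossfade(AAP, PT)
```

Because each unit already contains a transition, the join falls in the middle of a sound rather than at its edge, which is why this method sounds so much smoother than plain phoneme concatenation.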

Both of these synthesis methods needed driver chips and software to take an input and string the phonemes or allophones together in the correct order to make speech. Early versions, at both the consumer and industrial level, were only capable of a limited "dictionary" of words, chosen in advance by the device or software manufacturer. English Speak & Spell devices had a dictionary of a couple hundred words and letters, which is why they begin to repeat after a while. A Texas Instruments industry-level driver chip of the same vintage knew even fewer words, but could be interrupted mid-word with a command to start another word; in this way it was possible to hack the chipset into saying words outside its vocabulary, for instance interrupting computer with pound to produce the word compound.
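The interrupt trick can be modeled with a toy driver whose two-word vocabulary and phoneme strings are entirely hypothetical, not the actual contents of any TI chip:

```python
# Hypothetical fixed vocabulary: word -> phoneme sequence.
VOCABULARY = {
    "computer": ["K", "AH", "M", "P", "Y", "UW", "T", "ER"],
    "pound":    ["P", "AW", "N", "D"],
}

class Driver:
    """Toy model of a dictionary-limited driver chip with interruptible playback."""

    def __init__(self):
        self.queue = []   # phonemes waiting to be spoken

    def say(self, word):
        """Queue a vocabulary word for playback."""
        self.queue.extend(VOCABULARY[word])

    def interrupt(self, after, word):
        """Cut the queued word off after `after` phonemes and start another,
        mimicking the computer + pound -> compound hack."""
        self.queue = self.queue[:after]
        self.say(word)

d = Driver()
d.say("computer")
d.interrupt(3, "pound")   # keep "com-", then splice in "pound"
```

After the interrupt the queue holds the phonemes of com followed by pound, a word the driver's vocabulary never contained.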

For the past ten or fifteen years the drive in digital speech synthesis has been to improve the devices' prosody and intonation, making them sound more realistic. Computing power has increased enough that a relatively comprehensive dictionary can be included in software, so this naturalness is all that's left to improve. Text-to-speech (TTS) conversion is the hottest area in speech synthesis today, and work is being done on making all the words in a sentence coherent within its overall prosody. Even though the technique still isn't perfect, it's good enough to replace many word-speaking jobs, taking the place of telephone operators, weather reporters, readers for the blind, and so forth.
