Concatenative speech synthesis is the newest technique in text-to-speech software development. Unlike previous speech engines, which were built either from pre-recorded words and phrases or from completely synthesized sounds, this approach breaks a speaker’s voice into "the smallest number of units possible," according to Dr. Juergen Schroeter of AT&T Labs, an expert in speech synthesis.
AT&T Labs’ custom speech synthesis software, called Natural Voices, begins with a ten- to forty-hour recording session in which the client reads texts ranging from newspaper stories to nonsense syllables. These recordings are then cut into tiny fragments of sound and sorted into a database. To speak entirely new text, the software retrieves the required sounds from the database and reassembles them.
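The retrieve-and-reassemble step can be pictured with a toy sketch in Python. It is an illustration only, not AT&T's Natural Voices software: the names `FRAGMENTS`, `build_database` and `synthesize` are hypothetical, text labels stand in for speech units, short lists of numbers stand in for audio samples, and the greedy longest-match lookup is an assumption chosen for simplicity.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical labeled fragments cut from a recording session:
# (unit label, audio samples). Real systems use phonetic units and waveforms.
FRAGMENTS: List[Tuple[str, List[int]]] = [
    ("he", [1, 2]),
    ("l", [3]),
    ("lo", [4, 5]),
    ("h", [9]),
]


def build_database(fragments: Iterable[Tuple[str, List[int]]]) -> Dict[str, List[int]]:
    """Sort fragments into a lookup table, keeping the first take of each unit."""
    db: Dict[str, List[int]] = {}
    for label, samples in fragments:
        db.setdefault(label, list(samples))
    return db


def synthesize(text: str, db: Dict[str, List[int]]) -> List[int]:
    """Cover `text` with stored units (longest match first) and concatenate them."""
    longest = max((len(label) for label in db), default=0)
    output: List[int] = []
    i = 0
    while i < len(text):
        # Try the longest candidate unit first, then progressively shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            unit = text[i:i + size]
            if unit in db:
                output.extend(db[unit])
                i += size
                break
        else:
            # No stored unit starts at this position.
            raise KeyError(f"no recorded unit covers {text[i:]!r}")
    return output


db = build_database(FRAGMENTS)
print(synthesize("hello", db))  # [1, 2, 3, 4, 5]
```

Preferring longer units means fewer joins between fragments. A production system would also have to smooth those joins, which this sketch ignores.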
The technique is not yet perfect; there are still isolated instances of robotic-sounding and unnatural inflections, but the creators of the technology believe that, finally, the possibility of "voice cloning" is at hand.
"If ABC wanted to use Regis Philbin’s voice for all of its automated customer-service calls, it could," says Lawrence R. Rabiner, vice president for AT&T Labs Research.
By the same token, it is conceivable that the voices of other actors, both living and dead, might similarly be encoded and, as it were, recycled. The question of who owns the concatenative speech synthesis database of a celebrity should keep lawyers busy for years. The use of the technology to commit fraud is also not far over the horizon. For example, how would you know that the person asking you over the phone to wire him money was really your ne’er-do-well brother, in yet another financial jam?
Interim possibilities for the technology abound. Analysts at McKinsey & Company, industry consultants, predict that the market for text-to-speech software will exceed $1 billion in the next five years. Customer call centers, automated voice systems, video games, books on tape, auto manufacturers and computer front-end applications will all benefit from the process.
So far as true "speech cloning" is concerned, skeptics abound, of course. P.S. Gopalakrishnan, manager of the pervasive speech technologies group at I.B.M. Research, one of AT&T’s competitors, says that "the methods and algorithms that we know of, they still need a lot more work." But the fact is, AT&T has already hired three actors to record generic voices, an aspect of its business separate from the "custom voice" option that the company expects to market most lucratively.
One of the actors, an African American from New Jersey, says that the experience of being a "voice donor" was both stimulating and unsettling.
"It’s been for me exciting because I know there is an end product that will have my voice carried on forever," he says. But at the same time, he admits, "I have a lot of dread, or at least concern, of whether I’m contributing to the demise of the live actor."
The resurrection of the no-longer-living actor is, perhaps, equally worth considering. John Belushi. John Candy. James Dean and Humphrey Bogart. Together again. Naturally.
The New York Times, July 31, 2001.