Probably the most difficult challenge in the area of Natural Language Processing is machine recognition of Sign Language. Whereas machine translation, speech recognition, and the interpretation of context-sensitive passages are all certainly difficult, they are all based on relatively well known and well understood forms of expression -- written and spoken language. One can simply input the words directly, or analyze an audio file, and begin processing. In contrast, the processing and interpretation of sign language is far more difficult: not only does it require the integration of all the fields mentioned above, but sign language is a multi-modal language which differs immensely from spoken language in terms of its grammar, phonemic structure, and tense and aspect markings.

Why is Sign Language Recognition Hard?

Image Processing

The image processing that goes into interpreting sign language is not trivial. A two-camera setup must be used in order to obtain a stereoscopic image of the speaker, which allows the computer to model the motion of the speaker's hands in space. The task is made more difficult by the fact that the orientation of the fingers must be computed as well. In addition to the limbs, the face is a very important part of signed speech (to be discussed later on), so the speaker's facial expressions must also be modeled. This means creating a mathematical description of a series of deformable shapes that act in unison with each other. To make this even trickier, portions of the face may at times be blocked by the hands, making some amount of interpolation necessary.
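To make the stereo step concrete, here is a minimal sketch of linear (DLT) triangulation, which recovers the 3-D position of a single tracked point (say, a fingertip) from its pixel coordinates in two calibrated cameras. The function name and the assumption of known projection matrices are illustrative, not taken from any particular system discussed here.

import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3-D point from two calibrated views.

    P1, P2 : 3x4 camera projection matrices, assumed known from calibration.
    x1, x2 : (u, v) pixel coordinates of the same point (e.g. a fingertip)
             as seen by each camera.
    Returns the estimated 3-D point in world coordinates.
    """
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Solve A X = 0 in the least-squares sense via SVD; X is homogeneous.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

Running something like this on every tracked point in every frame yields the 3-D hand trajectories that the later stages work from.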

From Gesture to Language

Once the visual data has been processed, we need to take all that information about spatial relationships and translate it into signs. Unfortunately, unlike speech recognition, which can match audio data to a word by making guesses about which phonemes were used and then taking the closest match that makes sense in context, there is no easy process for matching a particular motion to a particular sign. For one thing, there are no phonemes (as we understand them) that can guide us*. Secondly, it is difficult for the computer to discern what is a sign and what is merely movement made between signs. If a signer makes the sign for father (thumb against the forehead, fingers outstretched) and then makes the sign for nice (chest level, both hands wipe past each other), the hands must move at some point from the forehead to the chest area. Is this part of the sign? What about the hand that wasn't making a sign -- must that be tracked as well? Additionally, there are more than 20 regions of activity on the speaker's body in which the motions may take place; all of these will have to be mapped to the individual speaker and tracked through the course of the dialogue.
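As a crude illustration of this segmentation problem, the sketch below labels stretches of a hand trajectory as candidate signs simply by looking for frames where the hand slows down, treating fast movement as transition. The threshold value and the idea of thresholding away transitional movement are purely illustrative; real systems have to model these transitions explicitly rather than discard them.

import numpy as np

def segment_candidate_signs(hand_positions, fps=30.0, speed_threshold=0.6):
    """Naively split a stream of 3-D hand positions into candidate sign segments.

    hand_positions  : (T, 3) array of dominant-hand positions over time.
    speed_threshold : speed (in illustrative units per second) below which a
                      frame is treated as part of a sign rather than as
                      transitional movement between signs.
    Returns a list of (start_frame, end_frame) index pairs.
    """
    velocities = np.diff(hand_positions, axis=0) * fps     # (T-1, 3)
    speeds = np.linalg.norm(velocities, axis=1)            # (T-1,)
    is_slow = speeds < speed_threshold

    segments, start = [], None
    for t, slow in enumerate(is_slow):
        if slow and start is None:
            start = t                      # a candidate sign begins
        elif not slow and start is not None:
            segments.append((start, t))    # fast movement ends it
            start = None
    if start is not None:
        segments.append((start, len(is_slow)))
    return segments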

Another difficulty encountered when trying to match movements to signs is the fact that there can be great variation in the position of the fingers and hands. In some signs the thumb can be either extended or curled; in others, its position is critical. To confound all of this, it is rather difficult to create statistical models of the movement because there is so much natural variation. Even in signs that have repeated motion (the sign for signing itself is made by circling the index fingers two or three times), one cycle can vary significantly from another, appearing to the computer as a different sign.
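One standard way to compare movements that vary in speed and timing is dynamic time warping (DTW), which aligns two trajectories non-linearly in time before measuring how far apart they are. The sketch below is a minimal, generic version of the technique and is not taken from any particular sign recognition system.

import numpy as np

def dtw_distance(traj_a, traj_b):
    """Dynamic time warping distance between two hand trajectories.

    traj_a : (N, 3) array of 3-D hand positions for one performance of a sign.
    traj_b : (M, 3) array for another performance (or a stored template).
    A small distance means the two movements trace out similar paths, even
    if they were performed at different speeds.
    """
    n, m = len(traj_a), len(traj_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # stretch traj_b
                                 cost[i, j - 1],       # stretch traj_a
                                 cost[i - 1, j - 1])   # advance both
    return cost[n, m]

A recogniser built this way would compare an observed segment against a stored template for every known sign and pick whichever gives the smallest distance, though this still does not solve the problem of signs whose repetitions genuinely differ in count or form.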

The Language Itself

Once we have gotten to the point where we have a list of all the signs that have been produced, it is necessary to parse it against a grammar, disambiguate meanings, and carry out all the other processes involved in NLP. New grammatical models will have to be constructed because ASL grammar is different from English grammar, using different word order and markers. One noticeable difference in the grammar is the use of the face to augment the linguistic data. The face is used to demarcate the topic of the sentence, as well as to mark specific kinds of questions: yes/no questions are marked differently from who/what/where types, and so on.
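To illustrate how a manual sign stream and a facial (non-manual) stream might be combined, here is a deliberately simplified sketch; the gloss spellings, marker names, and classification rules are invented for illustration and fall far short of a real ASL grammar.

def classify_utterance(glosses, nonmanual_markers):
    """Guess a rough sentence type from parallel manual and facial streams.

    glosses           : list of recognised signs, e.g. ["FATHER", "NICE"].
    nonmanual_markers : set of facial markers active over the clause,
                        e.g. {"brow-raise"} or {"brow-furrow"}.
    """
    wh_signs = {"WHO", "WHAT", "WHERE", "WHEN", "WHY", "HOW"}

    if "brow-furrow" in nonmanual_markers and set(glosses) & wh_signs:
        return "wh-question"       # e.g. FATHER WHERE with furrowed brows
    if "brow-raise" in nonmanual_markers:
        # A brow raise can mark a yes/no question or a topicalised phrase;
        # a real parser would look at which signs the raise co-occurs with.
        return "yes/no-question"
    return "statement"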

The Future

The outlook for this field can only improve. That being said, it will be a long, hard process to integrate the latest technology, mathematical techniques, and linguistic theory into a cohesive, effective system. More and more people are taking on this task, and one of the most inspiring researchers is Christian Vogler, a third-year PhD student at the University of Pennsylvania who is deaf himself. His work, in collaboration with Dimitris Metaxas, has made much progress using Hidden Markov Models and new techniques in three-dimensional computer vision. Hopefully their work and the work of others will allow us to create a workable system for sign language recognition, making communication between deaf and hearing people many times easier.
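To give a flavour of the Hidden Markov Model approach (without claiming to reproduce Vogler and Metaxas's actual models), here is a minimal sketch of the forward algorithm, which scores how likely an observed sequence of quantised motion features is under a given sign's HMM. The discrete-feature assumption and all parameters are illustrative.

import numpy as np

def forward_log_likelihood(obs, start_p, trans_p, emit_p):
    """Forward algorithm: log-likelihood of a discrete observation sequence.

    obs     : sequence of observation symbol indices (length T), e.g.
              quantised hand motion features.
    start_p : (S,)   initial state probabilities.
    trans_p : (S, S) state transition probabilities.
    emit_p  : (S, K) emission probabilities over K observation symbols.

    In a recogniser, each sign in the vocabulary gets its own HMM, and the
    sign whose model assigns the observations the highest likelihood wins.
    (A production version would rescale alpha at each step to avoid
    numerical underflow on long sequences.)
    """
    alpha = start_p * emit_p[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans_p) * emit_p[:, o]
    return float(np.log(alpha.sum()))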


*Depending on how you look at it, there are either 0 "phonemes" in sign language or an infinite number. Either way it doesn't help us very much.
