First, a story: A while ago my brother was sitting at his desk and hit refresh on his my.yahoo.com window. He noticed the major markets were all down rather sharply, and the business section was leading with a story that markets had opened lower due to the release of several key numbers (housing starts, unemployment, etc). He thought to himself that maybe as an investor he should be better aware of when these numbers are published.
So off he goes to Ask Jeeves to find out, since he has had some success there before. He brings it up, and types: "When do the financial numbers come out in the morning?"
Ask Jeeves' best prepared response to this was:
Where can I find advice about how to confess my homosexuality to my parents?
He was so stunned by the non sequitur that it took him a while to realize what had happened. Obviously, the search engine was picking up the keyword pairing "come out" and assuming that's what he probably meant. I tried the same search just now, and while the results are less dramatically wrong, they are essentially just as bad.
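To make the guess above concrete, here is a minimal sketch of how a trigger-phrase matcher could produce that result. I have no idea how Ask Jeeves is actually built; the `CANNED` table, its trigger phrases, the second canned question, and the function name `best_canned_question` are all made up for illustration.

```python
import re

# Hypothetical table: each canned question is filed under hand-picked
# trigger phrases. Both the phrases and the second entry are invented.
CANNED = {
    "Where can I find advice about how to confess my homosexuality to my parents?":
        ["come out", "coming out"],
    "Where can I find a calendar of economic data releases?":
        ["economic calendar", "data releases"],
}

def best_canned_question(query):
    """Return the canned question whose trigger phrases occur most often
    in the query, or None if no trigger phrase occurs at all."""
    # Normalize to lowercase words separated by single spaces.
    text = " ".join(re.findall(r"[a-z]+", query.lower()))
    scores = {
        question: sum(phrase in text for phrase in triggers)
        for question, triggers in CANNED.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(best_canned_question("When do the financial numbers come out in the morning?"))
```

Nothing here looks at what "come out" is attached to ("numbers" rather than a person), which is the whole problem: the phrase alone decides the answer.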
So my question is: Why design a search engine around users asking questions in natural language if you are just going to do statistically ranked keyword searching like every other search engine? Granted, it's well known that natural language processing (or NLP) is hard to do, and a good implementation is more or less equivalent to weak AI. But if I were Ask Jeeves I would be tempted to, well, make an attempt. I tried for about ten minutes to get Ask Jeeves to demonstrate that there was any NLP going on behind the scenes, and found no evidence.
I've been thinking of a list of questions that could be answered well with NLP but poorly otherwise. So far the best one is: "Where is the island of Java?" The only reason Ask Jeeves gets even one response right for that query is that it partners with Britannica, which has an entry for the place.
In Jeeves' defense, almost everything on the web is just text and pictures, with no metadata for NLP to work from. But they are in the unique position of preparing standard, canned answers to most of their frequently asked questions, and they have complete control over that content. Even a simple database in which lexical tokens were tagged with parts of speech would get you on your way. I'm not asking for the computer on the Enterprise; I just want it to realize I'm probably talking about, e.g., Java the island, not the language, and, since I started with "Where is...", tell me that it's in the Indonesian archipelago.
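As a rough sketch of the kind of thing I mean, assume a tiny hand-built lexicon in which each token lists its possible senses. Note that this tags senses with a coarse "kind" (place, technology) rather than true parts of speech, which is a simplification on my part. The names `LEXICON`, `QUESTION_EXPECTS`, and `answer` are hypothetical.

```python
import re

# Hypothetical lexicon: each token maps to its possible senses.
# "kind" is a coarse semantic tag, not a real part-of-speech label.
LEXICON = {
    "java": [
        {"sense": "island", "kind": "place",
         "location": "the Indonesian archipelago"},
        {"sense": "programming language", "kind": "technology"},
    ],
}

# Which kind of thing each leading question word is asking about.
QUESTION_EXPECTS = {"where": "place"}

def answer(question):
    """Use the leading question word to choose among a token's senses.
    Returns a canned sentence, or None if nothing in the lexicon fits."""
    words = re.findall(r"[a-z]+", question.lower())
    if not words:
        return None
    expected = QUESTION_EXPECTS.get(words[0])
    if expected is None:
        return None
    for word in words[1:]:
        for entry in LEXICON.get(word, []):
            if entry["kind"] == expected:
                return (f"{word.capitalize()} (the {entry['sense']}) "
                        f"is in {entry['location']}.")
    return None

print(answer("Where is the island of Java?"))
```

The only "understanding" here is that "Where" asks for a place, so the place sense of "Java" wins over the language. That is nowhere near real NLP, but it is more than matching keywords.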
It's not exactly my area of CS, but it occurs to me that the structure of the ideal representation of objects for a good NLP engine would be not unlike E2, except with all kinds of extra formalisms and rules imposed that would make it much harder to enter a node. Still, I wonder if the code could be adapted...hmm...