University of Illinois at Urbana-Champaign

Ask naturally and you should receive


Dan Roth

Creating a reliable question answering machine that accepts natural language queries is the problem that computer scientists like Professor Dan Roth are chipping away at. The title of a recent article by James Fallows of the New York Times sums up the current situation succinctly: "Enough Keyword Searches. Just Answer My Question." Charitably, he did not include exclamation points. Ray von Dran, dean of the Information School at Syracuse University, has been quoted as saying, jokingly (or not?), that the school was planning to offer a master's degree in Google, which currently has an index of more than 10 billion pages.

"We know how to deal with databases," said Roth. "If the information we want is in databases with a known structure, we know how to query it and get information. The problem is that most of our information today is in text form, so the goal is to be able to access this unstructured information (free-form text) as if it is a database. The only thing we can do today is keyword search. This isn't enough. The goal is to move up and access information at a semantic level."

"I've always wanted to do learning and reasoning," Roth said, explaining the integrated approach to his research. "Traditionally in artificial intelligence, these two research areas are separated. My idea was to put them together, and natural language processing provides the key example for how these work together." Natural language is what we humans speak and write to each other, and natural language processing, a subfield of artificial intelligence, involves converting natural language into something a computer can "understand." With the availability of so much text online, more emphasis is put on the ability to process natural language, and question answering is one of the applications that drives its development.

Roth's Cognitive Computation Group develops theories and systems that pertain to intelligent behavior using a unified methodology. At its heart is the idea that learning plays a central role in intelligence. Machine learning is the research area that studies algorithms that improve a machine's behavior based on its previous experience. In other words, machines, like people, can learn from experience. "Natural language processing is currently dominated by machine learning due to the variability inherent in natural language and the fact that almost all the decisions made in the process of understanding language are context sensitive," he said. "It is hard to program, explicitly, a program that determines if a given word is a noun or a verb in the context of a specific sentence. (Notice that in the previous sentence the word 'program' was used twice, once as a noun and once as a verb.) The current technology therefore is to develop learning techniques to look at a lot of data and learn from experience. This technology is being used to resolve many natural language ambiguities at multiple levels of the text (what is the part of speech tag, the sense of the word, the subject of the sentence, etc.). It is hoped that this technology will lead to better natural language question answering." At least ten commercial question answering systems can be found on the Web, like AskJeeves and BrainBoost, as well as experimental systems within research groups like Roth's.

The first practical learning algorithm Roth designed was a Context Sensitive Spelling Corrector. It addresses the problem of determining, in a sentence like "I'd like a peace of cake," that the writer probably meant to write "piece" rather than "peace." This capability rests on a machine learning algorithm that is trained to recognize the more likely context of "piece" versus that of "peace" (or of "weather" versus "whether," "know" versus "now," "principle" versus "principal," etc.). A demonstration of this program, written in 1996, is available from the Web page of Roth's Cognitive Computation Group. When he wrote his context sensitive spelling corrector, the notion of using machine learning within natural language processing was still in its infancy. "People have started to realize the power of taking statistics over a lot of data," he said, "but have not used sophisticated learning algorithms. Over the last ten years, the Web gave us more data and illustrated the need. At the same time, there have been huge developments over the last twenty years in the foundations of machine learning. We now have a better understanding of learning theory and what it means to generalize from past examples and extend to examples you have never seen before. In particular, we have a better understanding of the algorithmic and theoretical issues involved in using machine learning techniques for natural language processing. The field has evolved from simple statistics to real learning." The basic capabilities of the context sensitive spelling correction program are still in use and have been further developed in Roth's effort to create programs that can identify the semantics of the text and thus enable access to free form text as if it were stored in a database with a known schema. That is, he attempts to achieve some level of language comprehension that will support intelligent access to unstructured information.
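
The idea of learning which contexts favor "piece" over "peace" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm behind Roth's corrector: it swaps in a simple naive Bayes count over nearby words, and the training sentences, the function names (train, score, correct), and the two-word context window are all invented for the example.

```python
import math
from collections import Counter

# Toy training sentences, invented for illustration and assumed error free.
TRAINING = {
    "piece": [
        "she ate a piece of cake",
        "he cut a piece of pie",
        "a piece of paper fell on the floor",
    ],
    "peace": [
        "the two nations signed a peace treaty",
        "they hoped for world peace",
        "peace talks resumed after the war",
    ],
}


def context_words(tokens, index, window=2):
    """Return the words within `window` positions of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def train(training):
    """Count how often each context word appears near each candidate word."""
    counts = {word: Counter() for word in training}
    for word, sentences in training.items():
        for sentence in sentences:
            tokens = sentence.lower().split()
            for i, token in enumerate(tokens):
                if token == word:
                    counts[word].update(context_words(tokens, i))
    return counts


def score(word, context, counts, vocab_size):
    """Log-likelihood of the context given the word, with add-one smoothing."""
    total = sum(counts[word].values())
    return sum(
        math.log((counts[word][c] + 1) / (total + vocab_size))
        for c in context
    )


def correct(sentence, counts):
    """Replace each confusable word with the candidate that best fits its context."""
    vocab_size = len(set().union(*counts.values()))
    tokens = sentence.lower().split()
    result = list(tokens)
    for i, token in enumerate(tokens):
        if token in counts:
            context = context_words(tokens, i)
            result[i] = max(
                counts, key=lambda w: score(w, context, counts, vocab_size)
            )
    return " ".join(result)


COUNTS = train(TRAINING)
print(correct("I'd like a peace of cake", COUNTS))
```

With these toy counts the sketch rewrites "peace of cake" as "piece of cake" (the output is lowercased) and leaves "peace treaty" alone. A real corrector would learn from far more text and use richer features than neighboring words.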

One example of a challenging information access problem is that of identifying and tracking entities in documents. When looking for information about John F. Kennedy, whom do you ask for: John Kennedy, John Fitzgerald Kennedy, Kennedy, President Kennedy, JFK? Similarly, when the word Kennedy occurs, does it refer to the president, one of his brothers, the baseball player, the expressway, the performing arts center, the airport? Roth's group takes on both types of ambiguity with a learning and inference algorithm that identifies phrases in text that represent names of people, locations, and other categories. The algorithm traces occurrences of these entities across documents, identifies their typical contexts, and learns the variability in name representation. A search engine with these capabilities, called I-Track, is also available from Roth's Web page.

One of the key technologies underlying the I-Track search engine is Named Entity Recognition, another machine learning based tool developed by Roth's group, which identifies different entities and categories in text. Consider this input sentence: "Washington State was named after George Washington, the first president of the United States." Here is a case in which Washington refers to both a location and a person. The tagged output of the program identifies each Washington correctly, as well as United States as a location. "Although the tagged output is not always correct, because it was trained on a relatively small collection of news articles," said Roth, "its output is around ninety percent correct. This is state-of-the-art performance, which is sufficient for most applications."

"Lots of stuff we're doing is theoretical," said Roth. "The work that underlies the semantic parsing piece is based on a new way to perform inference over a large number of learned classifiers, via an optimization algorithm that is based on linear programming relaxation. The application of this algorithm allows us to analyze a sentence at the level of 'who did what to whom, when, where and why?'" This program, which won first place in a semantic parsing software competition in summer 2005, allows Roth's group to push forward the level of comprehension of natural language text. For instance, you can use it to conclude that the following sentences mean the same thing: "John met George." "John and George met." "John met with George." "John and George had a meeting." Moreover, this allows Roth's group to study the problem of textual entailment, which refers to the directional relationship between text expressions in which one expression can be inferred from the other. For example, "to buy" usually entails "to own," but not necessarily the other way around (i.e., you can own something without having bought it). As another example, after reading text that describes "the acquisition of Overture by Yahoo," we would like to be able to determine that it is true that "Yahoo bought Overture."

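To give a feel for what "inference over learned classifiers" means, here is a small Python sketch for the sentence "John met George yesterday." It is a toy under stated assumptions, not the group's system: the role labels (WHO, WHOM, WHEN, NONE), the classifier scores, and the single constraint are invented for illustration, and exhaustive search stands in for the linear programming relaxation Roth describes, which is what makes the approach practical on real sentences.

```python
from itertools import product

LABELS = ["WHO", "WHOM", "WHEN", "NONE"]

# Hypothetical classifier scores for each phrase and role (made-up numbers).
SCORES = {
    "John":      {"WHO": 0.60, "WHOM": 0.30, "WHEN": 0.00, "NONE": 0.10},
    "George":    {"WHO": 0.50, "WHOM": 0.40, "WHEN": 0.00, "NONE": 0.10},
    "yesterday": {"WHO": 0.00, "WHOM": 0.05, "WHEN": 0.90, "NONE": 0.05},
}


def independent_labels(scores):
    """Pick the best-scoring role for each phrase, ignoring the other phrases."""
    return {phrase: max(s, key=s.get) for phrase, s in scores.items()}


def satisfies_constraints(assignment):
    """Assumed constraint: WHO and WHOM may each be filled by at most one phrase."""
    labels = list(assignment.values())
    return labels.count("WHO") <= 1 and labels.count("WHOM") <= 1


def constrained_labels(scores):
    """Search all joint assignments and keep the best one that is consistent."""
    phrases = list(scores)
    best, best_total = None, float("-inf")
    for combo in product(LABELS, repeat=len(phrases)):
        assignment = dict(zip(phrases, combo))
        if not satisfies_constraints(assignment):
            continue
        total = sum(scores[p][label] for p, label in assignment.items())
        if total > best_total:
            best, best_total = assignment, total
    return best


print(independent_labels(SCORES))
print(constrained_labels(SCORES))
```

Labeling each phrase on its own marks both John and George as WHO, which cannot be right. Choosing the labels jointly under the constraint yields John as WHO, George as WHOM, and yesterday as WHEN, the "who did what to whom, when" reading of the sentence.
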
Roth predicts that the next ten years will see great strides in applying the fundamental technologies he and other researchers are developing to revolutionizing search and other ways to access information, as well as to how humans interact with computers. "A lot of stuff we're doing already can be used," he pointed out, "but for some reason it's not." One such application ready for prime time is Roth's Context Sensitive Spelling Correction tool, which boasts an accuracy rate greater than 95 percent. "It takes a lot of time before these developments are adopted in actual applications," he continued. "Using machine learning techniques, this corrector learns the proper usage of words from large bodies of text that are assumed to be error free. Errors in phrases containing valid words, such as 'a peace of cake,' usually get past standard spell checkers. Our tool not only catches these mistakes but suggests corrections as well." Microsoft, take note.

Written by Judy Tolliver, March 3, 2006


--
Last Modified August 07 2006 08:57:34.
