University of Illinois at Urbana-Champaign

Deepening Search: Exploiting structured data on the Deep Web


Kevin Chang

Unlike the traditional scenario, in which data is confined to isolated databases, much data is now online and serves people everywhere. There is a large amount of data, from many sources. Professor Kevin Chang's research revolves around large-scale information access: finding information across these many sources and large volumes of data. He draws his inspiration from everyday real-world experience, like finding real estate or making travel plans.

MetaQuerier project: Probing multiple online databases

Suppose you can cull real estate information from housing databases on the Web. Unlike ordinary Web pages, which can be accessed directly, the content of these databases is normally reachable only by querying. In the real estate example, one such source would be Realtor.com, which provides access to the Multiple Listing Service data. Other sources provide listings collected directly from realtors. These dynamic databases, found in many domains such as real estate, jobs, and travel, are referred to as the Deep Web. To query multiple sites is to metaquery.

"There are lots of databases on the Web, but users are not making good use of them," said Chang. In 2004 his team surveyed the Web by random sampling and estimated 1.2 million query interfaces on this Deep Web. "Often you don't know where the information is or if the sources even exist. We want to build a MetaQuerier to point users to the right sources by building a database of online databases: that is, to collect data sources from the Web, and model their key parameters."

A good use for such a metaquery agent would be to find which online databases provide flight information and to query them for specific itineraries. This agent would have to know where each database is and speak its language to navigate the target site. This is hard because the agent will not know in advance what the user is going to look for. Chang has built a tool called MetaQuerier, a new kind of search engine for matching the user to the right sources and querying those sources. He has been using a 100-node cluster that has been running for one year doing just that. In a related project, he is working with the National Center for Supercomputing Applications (NCSA) to find job sites on the Web and to extract information from them. He has also been transferring the technology by licensing it to vertical search engine companies in several domains.
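
The idea of a "database of online databases" can be illustrated with a small sketch. It is not MetaQuerier's actual design, which the article does not describe. It simply assumes that each source is recorded with its domain and the fields its query interface accepts, and that sources are matched to a user by counting shared fields. The source names, fields, and the function select_sources are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str              # hypothetical label for an online database
    domain: str            # e.g. "flights" or "real estate"
    attributes: frozenset  # fields its query interface accepts


# Hypothetical catalog entries, invented for illustration only.
CATALOG = [
    Source("air-site-a", "flights", frozenset({"departure", "arrival", "date"})),
    Source("air-site-b", "flights", frozenset({"coming from", "going to", "date", "airline"})),
    Source("homes-site-a", "real estate", frozenset({"zip code", "price", "bedrooms"})),
]


def select_sources(catalog, domain, wanted):
    """Return names of sources in `domain`, best match first.

    A source matches if its interface accepts at least one wanted field;
    sources are ordered by how many wanted fields they accept.
    """
    wanted = set(wanted)
    scored = []
    for source in catalog:
        if source.domain != domain:
            continue
        overlap = len(wanted & source.attributes)
        if overlap:
            scored.append((overlap, source.name))
    # Highest overlap first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored]


if __name__ == "__main__":
    print(select_sources(CATALOG, "flights", ["departure", "arrival", "date"]))
```

Note that the second flight source scores poorly here only because it says "coming from" and "going to" instead of "departure" and "arrival." Exact string matching misses that the fields mean the same thing, which is the semantic problem the project has to solve.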

A huge technical challenge for search integration is to understand the semantics of each site on the fly. "The feeling is that sources on the Web are not arbitrarily complex," Chang explained. "They seem to be converging to some small number of patterns. If you wanted to open a bookstore on the Web, you would first see how Amazon does it. Are searches conducted by author, by title, or what? You would assume that users are accustomed to Amazon's way of doing business, so that if Amazon creates a new way of interacting with its users, other sites will follow."

Chang referred to this phenomenon as the "Amazon effect," and it has been the key insight that drives the project. In other words, the Amazon effect says that if you've seen some fields, you've seen them all, especially among data sources of a similar nature. Just as common sense tells you where the elevators are located in a building, it tells you what to expect from user interface design patterns. Instead of being overwhelmed by data on a large scale, Chang considers it "an opportunity, a blessing really, because when things appear in a large quantity, patterns will emerge that will allow you to do some mining in a statistical way."

For example, schema matching needs to know that "first name" and "last name" refer to the same thing as "author" in two different bookstore sites. How can the metaquery agent figure this out? When you look at many such sites (e.g., Amazon, Barnesandnoble, Borders), you will notice that the sites speak in similar languages, and the vocabulary is small. You will notice patterns such as: If a last name is mentioned, then you can expect a first name. And if these fields are used, then you will not see another field "author." Or on a travel site, if you see "departure," then you will also see "arrival"; if you see "coming from," then you will also see "going to."

"This 'hidden regularity' turns large-scale data into a novel opportunity for information integration: we can now leverage a large number of sources to statistically discover their semantics by analyzing their regularity," said Chang. "There are many sites but few patterns, and our challenge is to develop precise data mining techniques to analyze such patterns, to enable metaquerying across sources."

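The co-occurrence patterns described above can be made concrete with a toy sketch. It is a deliberate simplification, not the project's actual mining technique: it assumes each site's query interface has already been reduced to a set of field names, treats fields that always appear together as a group, and treats frequent fields that never appear together as candidate synonyms. The five bookstore schemas and all function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical query-interface schemas, one set of field names per site.
SCHEMAS = [
    {"author", "title", "isbn"},
    {"first name", "last name", "title"},
    {"author", "title", "subject"},
    {"first name", "last name", "title", "isbn"},
    {"author", "isbn"},
]


def cooccurrence_stats(schemas):
    """Count how often each field, and each pair of fields, appears."""
    freq = Counter()
    pair = Counter()
    for schema in schemas:
        freq.update(schema)
        pair.update(frozenset(p) for p in combinations(sorted(schema), 2))
    return freq, pair


def candidate_groups(schemas, min_support=2):
    """Pairs that always appear together ("if last name, then first name")."""
    freq, pair = cooccurrence_stats(schemas)
    return sorted(
        tuple(sorted(p))
        for p, n in pair.items()
        if n >= min_support and all(freq[a] == n for a in p)
    )


def candidate_synonyms(schemas, min_support=2):
    """Pairs of frequent fields that never appear on the same interface."""
    freq, pair = cooccurrence_stats(schemas)
    frequent = sorted(a for a, n in freq.items() if n >= min_support)
    return [
        (a, b)
        for a, b in combinations(frequent, 2)
        if pair[frozenset((a, b))] == 0
    ]


if __name__ == "__main__":
    print(candidate_groups(SCHEMAS))
    print(candidate_synonyms(SCHEMAS))
```

On this toy data the sketch groups "first name" with "last name" and flags each of them as an alternative to "author." A real system would need far more sources and a proper statistical measure, since mere absence of co-occurrence in a handful of schemas proves little.
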
AIM project matches data to soft conditions

Traditionally, structured data is searched exclusively with Boolean queries. Everything is answered by yes or no, with no gray area available. If conditions are relaxed too much, the result will be too many matches. This type of querying cannot be applied effectively to new online scenarios, say in e-commerce, where users are searching data with some soft preferences instead of hard constraints. Can we search a structured database to find the best results first, in the order of their matching quality, the way we use Google? What is needed is a fuzzy, preference-based query tool that gives ranked results. "We aim to add ranking support to true/false queries," Chang said. "We are building a system with a machine learning front end to help users specify queries and to evaluate the query." Both functions, an interface for specifying ranking queries and an engine for processing them, are lacking in current database systems. By learning from a user's interactive querying, a ranking function can be generated and the query can be optimized.

For example, Realtor.com lists many houses from which you can generate a short list based on number of bedrooms, price, zip code, and other terms suitable for Boolean queries to a structured database. But suppose you want to live close to campus. That condition is impossible to specify within Realtor.com because location is only defined by zip code. The AIM project matches data by "soft" conditions such as similarity, relevance, or preference: "What about houses on the next block? What about a newer house? What about a house a few square feet larger or smaller than what I originally specified?"
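
The contrast between hard and soft conditions can be sketched in a few lines. This is only an illustration under stated assumptions, not the AIM system itself: the listings, the linear closeness function, the weights, and the function names are all hypothetical, and the ranking function here is written by hand rather than learned from the user.

```python
import heapq

# Hypothetical listings, invented for illustration.
HOUSES = [
    {"id": "h1", "price": 180_000, "sqft": 1500, "miles_to_campus": 0.5},
    {"id": "h2", "price": 150_000, "sqft": 1450, "miles_to_campus": 3.0},
    {"id": "h3", "price": 210_000, "sqft": 1600, "miles_to_campus": 0.2},
    {"id": "h4", "price": 175_000, "sqft": 900, "miles_to_campus": 1.0},
]


def boolean_filter(houses, max_price, min_sqft):
    """Hard constraints: a house either qualifies or it does not."""
    return [h for h in houses if h["price"] <= max_price and h["sqft"] >= min_sqft]


def closeness(value, target, tolerance):
    """1.0 at the target, falling linearly to 0.0 at `tolerance` away."""
    return max(0.0, 1.0 - abs(value - target) / tolerance)


def score(house, prefs):
    """Weighted sum of per-field closeness; prefs maps field -> (target, tolerance, weight)."""
    return sum(
        weight * closeness(house[field], target, tolerance)
        for field, (target, tolerance, weight) in prefs.items()
    )


def top_k(houses, prefs, k):
    """Return the k best-scoring houses, best first."""
    return heapq.nlargest(k, houses, key=lambda h: score(h, prefs))


# Soft preferences: about $175,000, about 1,500 sq ft, and near campus,
# with nearness weighted twice as heavily as the other two.
PREFS = {
    "price": (175_000, 50_000, 1.0),
    "sqft": (1500, 500, 1.0),
    "miles_to_campus": (0.0, 2.0, 2.0),
}

if __name__ == "__main__":
    print(boolean_filter(HOUSES, max_price=175_000, min_sqft=1500))
    print([h["id"] for h in top_k(HOUSES, PREFS, 2)])
```

With these made-up numbers, the Boolean filter returns nothing at all, because every house misses one hard limit, while the ranked query still surfaces the houses that come closest to what the buyer wants.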

Written by Judy Tolliver, June 12, 2006


--
Last Modified August 07 2006 09:00:54.


Department of Computer Science, Thomas M. Siebel Center for Computer Science, 201 N Goodwin Ave,
Urbana, IL 61801-2302. The Department is part of the College of Engineering at the University of Illinois at Urbana-Champaign. Contact academic@cs.uiuc.edu with academic questions
or webmaster@cs.uiuc.edu with questions or comments on this page.