Deepening Search: Exploiting structured data on the Deep Web

Kevin Chang
Unlike the traditional scenario, in which data is confined
to isolated databases, we now have much data online that serves
people everywhere. There is a large amount of data, from many
sources. Professor Kevin Chang's research revolves around large-scale information access: finding information across these many sources and large volumes of data. He draws his inspiration from everyday real-world experience, like finding real estate or making travel plans.
MetaQuerier project: Probing multiple online databases
Suppose you could cull real estate information from housing databases on the Web, whose content is normally reachable only by querying, unlike ordinary Web pages that can be accessed directly. In the real estate example, one such source would be Realtor.com, which provides access to the Multiple Listing Service data. Other sources provide listings collected directly from realtors. These dynamic databases, found in many domains such as real estate, jobs, and travel, are referred to as the Deep Web. To query multiple such sites is to metaquery.
"There are lots of databases on the Web, but users are not making good use of them," said Chang. In 2004 his team surveyed the Web by random sampling and estimated 1.2 million query interfaces on this Deep Web. "Often you don't know where the information is or if the sources even exist. We want to build a MetaQuerier to point users to the right sources by building a database of online databases, that is, to collect data sources from the Web, and model their key parameters."
A good use for such a metaquery agent would be to find which online databases provide flight information and to query them for specific itineraries. The agent would have to know where each database is and speak its language to navigate the target site. This is hard because the agent does not know in advance what the user is going to look for. Chang has built a tool called MetaQuerier, a new kind of search engine that matches the user to the right sources and queries those sources. He has been using a 100-node cluster that has been running for one year doing just that. In a related project, he is working with the National Center for Supercomputing Applications (NCSA) to find job sites on the Web and to extract information from them. He has also been transferring the technology by licensing it to vertical search engine companies in several domains.
A huge technical challenge for search integration is to
understand the semantics of each site on the fly. "The feeling is
that sources on the Web are not arbitrarily complex," Chang
explained. "They seem to be converging to some small number of
patterns. If you wanted to open a bookstore on the Web, you
would first see how Amazon does it. Are searches conducted by
author, by title, or what? You would assume that users are
accustomed to Amazon's way of doing business, so that if Amazon
creates a new way of interacting with its users, other sites will
follow." Chang referred to this phenomenon as the "Amazon
effect," and it has been the key insight that drives the
project. In other words, the Amazon effect says that if you've
seen some fields, you've seen them all, especially among data
sources of a similar nature. Just as common sense would tell you
where elevators are located in a building, the same can be said
of user interface design patterns. Instead of being overwhelmed
by data on a large scale, Chang considers it "an opportunity, a
blessing really, because when things appear in a large quantity,
patterns will emerge that will allow you to do some mining in a
statistical way." For example, a schema matcher needs to recognize
that "first name" and "last name" on one bookstore site together
refer to the same thing as "author" on another. How can the
metaquery agent figure this out? When you look at many such sites (e.g., Amazon,
Barnesandnoble, Borders), you will notice that the sites speak in
similar languages, and the vocabulary is small. You will notice
patterns such as: If a last name is mentioned, then you can
expect a first name. And if these fields are used, then you will
not see another field "author." Or on a travel site, if you see
"departure," then you will also see "arrival"; if you see "coming
from," then you will also see "going to." "This 'hidden
regularity' turns large-scale data into a novel opportunity for
information integration: we can now leverage a large number
of sources to statistically discover their semantics by analyzing
their regularity," said Chang. "There are many sites but few
patterns, and our challenge is to develop precise data mining
techniques to analyze such patterns, to enable metaquerying across
sources."
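To make the idea of "hidden regularity" concrete, here is a minimal sketch in Python. It is not the MetaQuerier algorithm; it only counts how often attribute names appear together across a handful of made-up bookstore query forms. The sample schemas, the function names, and the min_support threshold are all assumptions chosen for illustration. Attributes that always appear together (like "first name" and "last name") are reported as a group, and frequent attributes that never appear together (like "author" and "last name") are reported as candidates for meaning the same thing.

```python
from itertools import combinations

# Hypothetical attribute sets from bookstore query forms (illustrative data only).
schemas = [
    {"author", "title", "isbn"},
    {"first name", "last name", "title"},
    {"author", "title", "subject"},
    {"first name", "last name", "title", "isbn"},
    {"author", "title"},
]

def support(schemas, *attrs):
    """Count the schemas that contain every attribute in attrs."""
    return sum(1 for s in schemas if all(a in s for a in attrs))

def classify_pairs(schemas, min_support=2):
    """Split frequent attribute pairs into always-together and never-together."""
    attrs = sorted(set().union(*schemas))
    together, exclusive = [], []
    for a, b in combinations(attrs, 2):
        count_a, count_b = support(schemas, a), support(schemas, b)
        if min(count_a, count_b) < min_support:
            continue  # too rare to draw any conclusion
        both = support(schemas, a, b)
        if both == count_a == count_b:
            together.append((a, b))    # always appear as a group
        elif both == 0:
            exclusive.append((a, b))   # never co-occur: synonym candidates
    return together, exclusive

together, exclusive = classify_pairs(schemas)
print("grouped:", together)
print("synonym candidates:", exclusive)
```

With only five toy forms the counts are trivially small; the point of the quoted argument is that such statistics become meaningful once many real sources are collected.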
AIM project matches data to soft conditions
Structured data has traditionally been searched exclusively with Boolean queries. Every condition is answered yes or no, with no gray area. If the conditions are relaxed too much, the result is too many matches. This type of querying cannot be applied effectively to new online scenarios, say in e-commerce, where users search data with soft preferences instead of hard constraints. Can we search a structured database and get the best results first, in order of matching quality, the way we use Google? What is needed is a fuzzy, preference-based query tool that gives ranked results. "We aim to add ranking support to true/false queries," Chang said. "We are building a system with a machine learning front end to help users specify queries and to evaluate the query." Both functions, an interface for specifying ranking queries and an engine for processing them, are lacking in current database systems. By learning from a user's interactive querying, the system can generate a ranking function and optimize the query.
For example, Realtor.com lists many houses from which you
can generate a short list based on number of bedrooms, price, zip
code, and other terms suitable for Boolean queries to a structured
database. But suppose you want to live close to campus. That condition
is impossible to specify within Realtor.com because location is
defined only by zip code. The AIM project matches data by "soft"
conditions such as similarity, relevance, or preference: "What
about houses on the next block? What about a newer house? What
about a house a few square feet larger or smaller than what I
originally specified?"
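The difference between a hard Boolean filter and a ranked "soft" query can be sketched in a few lines of Python. This is only an illustration, not the AIM system: the House fields, the sample listings, the linear closeness function, and the 0.6/0.4 weights are all assumptions made up for the example. Instead of rejecting a house that misses a condition, each condition contributes a score between 0 and 1, and the listings are returned best match first.

```python
from dataclasses import dataclass

@dataclass
class House:
    address: str             # hypothetical listing data
    sqft: int
    miles_to_campus: float

def closeness(value, target, tolerance):
    """Score in [0, 1]: 1.0 at the target, falling linearly to 0 at `tolerance` away."""
    return max(0.0, 1.0 - abs(value - target) / tolerance)

def score(house, want_sqft):
    """Combine two soft conditions into one ranking score (weights are arbitrary)."""
    near = closeness(house.miles_to_campus, 0.0, 5.0)   # "close to campus"
    size = closeness(house.sqft, want_sqft, 1000)       # "about this size"
    return 0.6 * near + 0.4 * size

houses = [
    House("90 Elm Ave", 1480, 4.0),
    House("7 Pine Ct", 2100, 1.0),
    House("12 Oak St", 1500, 0.5),
]

# Rank all listings, best match first, instead of filtering them yes/no.
ranked = sorted(houses, key=lambda h: score(h, want_sqft=1500), reverse=True)
for h in ranked:
    print(f"{score(h, 1500):.2f}  {h.address}")
```

In this sketch the weights are fixed by hand; the article describes learning the ranking function from the user's interactive querying instead.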
Written by Judy Tolliver,
June 12, 2006
--
Last Modified August 07 2006 09:00:54.