Exploiting geometry to make a hard problem simpler

Sariel Har-Peled
"How much information is needed to perform a computational
task," theorist Sariel Har-Peled asked philosophically. "You want the best answer," he said. "But like life, being the best is hard. Being good is much easier. It's the same with computing."
In addition, anything involving distances and points, like ad hoc networks, can be approached geometrically, which really helps in the process of producing useful knowledge from difficult data. "This underlies a huge part of computer science," said Har-Peled. "You have data that appears to be very high-dimensional, but really it's not. There is a lot of hidden structure that can reduce the dimension of a set, making it easier to work with. For instance, a human can easily divide a set of points on a plane into two sets visually. High dimensional problems are much harder for humans to solve. Computers can capture the hidden structure of high-dimensional data so that it seems low-dimensional and geometric, which leads to better algorithms."
Extract and conquer: the coreset paradigm
Before working with a large set of data, it is important to identify which items of information are critical to the task at hand. This subset of the data, called a coreset, approximates the original data and is ideally suited for geometric problems. The solution to a problem performed on the coreset can be translated to an approximate solution on the entire set. Har-Peled is interested in ways to find the smallest possible coreset while guaranteeing a certain accuracy threshold. The difficulty in coming up with a good coreset is that when items are thrown away, accuracy is potentially lost. Once this coreset is found, you must then figure out how to perform your task on it, and how the corset is chosen depends on the task you want to perform.
One good example of an application of learning is, say, in banking. For example, if customers apply for loans at a bank, the bank must learn which are high risk (bad set) and which are low risk (good set) based on previous examples. In this case, the input data (loan applicants) are not classified or labeled. Using a human expert is too expensive. By machine, the traditional way to separate good from bad is to use a learning classifier, which places items in groups according to their characteristics. The classifier is based on a training set of previously labeled items. The classifier would put someone who had defaulted on a loan into the bad group, for instance. It turns out, Har-Peled explained, that the best learning function uses a large-margin classifier. The problem, however, is that even with a good algorithm, the function is computationally intensive when performed on lots of data. The coreset approach can be used to automatically dig for information, identify which parts are critical, and then use this information to separate the good from the bad. The minimum number of items (loan applicant attributes) needed to produce the classifier is the coreset. Another related approach is to learn with interactive data. "Start with a small sample, pick the worst misclassified example, and then add that to your learning set," said Har-Peled. "Your algorithm will be more efficient because you're working on a very small set. You will also have a theoretical guarantee that is very strong provided you go back and look at where the classifier made its biggest mistake."
"We're in this vicious loop," lamented Har-Peled. "The better computers we have, the more data we have, and the more data we have, the better the computers have to be. For example, NASA has terabytes of information on galaxies, and you'd like to cluster these. But even just reading the data takes so much time. Can the data be clustered without being read before hand? Can it be clustered in only one pass? One possible way to achieve answer to these questions is with the coreset approach. "My work is to try and explain the real world but not necessarily by looking at it," he concluded, "because I'm doing theory."
Written by Judy Tolliver,
May 17, 2006
--
Last Modified August 07 2006 09:01:59.