| |
Our research focuses on the development of algorithms for solving problems in molecular biology.
We worked on probabilistic algorithms to detect
cis-regulatory modules (CRMs) in whole-genome scans. We have
developed an HMM (Hidden Markov Model) -based approach that
looks for clusters of motifs, represented by PWMs position weight
matrices. The states of the HMM correspond to different motifs,
and probabilistically output binding sites with varying degrees of
similarity to the input PWM's. This algorithm, called Stubb,
also looks at multiple species data, making phylogenetic
comparisons between them. The simple premise is that what is
functional is more likely to be conserved in evolution, hence the
signal encoded in the genome can be amplified by looking for
evolutionary conservation. Moreover, binding sites of certain pairs
of transcription factors have a biological
preference to occur near each other. Such positional
correlations are captured in Stubb by special restrictions on the
transitions of the underlying Markov chain.
An implementation of this algorithm, the Stubb software suite,
may be downloaded
here and
may be used via our web server here.
Recently, there has been growing
interest in a new kind of motif finding application - to find
transcription factor binding sites when we are given the promoters or
enhancers of a set of potentially co-regulated genes as well as
some or all of their orthologs from other species. The interesting
problem here is to detect regulatory motifs in such heterogeneous
data by evaluating the motifs' overrepresentation while accounting
for known evolutionary relationships among the orthologs. I have been
working on an Expectation-Maximization approach to solving this
problem. The following papers present some of these ideas.
Cross species comparison of genomic
sequences has mostly focused on what is conserved, rather than what
is not, at least from a bioinformatics perspective. However, I believe
bioinformatic approaches stand to benefit considerably from an
understanding of how regulatory elements evolve, and how changes in
regulatory sequences translate to changes in phenotype, even the birth
of new morphologies. We have screened a large number of cis-regulatory
modules for changes at the level of binding sites (between the two fly
genomes) using the Stubb software. The modules that showed the most
interesting changes were then experimentally analyzed and demonstrated
to drive differing expression patterns in the two species. This
project has involved extensive experimental validation of
our bioinformatic predictions, and its results will be published in the
coming months.
While working on the problem of regulatory
evolution, we asked the question : "How are new regulatory sequences created ?"
Once again, the three fruitfly genomes (D. melanogaster, D. yakuba and D.
pseudoobscura) and comparative genomics provide an opportunity to
address this question on a genome-wide scale. We have examined
sequence changes in regulatory regions and found a different pattern
of change than in neutral regions. The findings, which are
somewhat surprising, partly answer the question of how regulatory
sequence may be created, at a crude quantitative level. We report this
study in :
Most of my earlier research focuses on the detection of short motifs that represent transcription factor binding sites in the promoter regions of genes. My doctoral thesis ( PS.gz, PS) addresses the motif-finding problem by looking for fuzzy, but statistically overrepresented words in randomly generated text. Based on prior biological knowledge, we first define a generic characterization of these motifs, which guides the search. One approach we have taken to the motif-finding problem is that of statistical overrepresentation: a feature that has the expected characteristics of a motif, and occurs surprisingly often in relevant genomic sequence, is likely to be the motif we are looking for. Our algorithm defines an objective criterion for measuring overrepresentation and searches the space of all candidate motifs for the highest scoring one. This was joint work with my advisor, Martin Tompa.
Often, the motifs reported by a computational method as being statistically significant include several slightly different copies of the same motif, only one of which may be truly significant, the others' overrepresentation being a statistical side-effect of this real motif. My colleague Mathieu Blanchette and I have proposed a statistically sound method for finding the real motif among a set of close variants, and demonstrated that it can recover significant motifs with a very low false positive rate.
In another project, I took a very different approach to the motif-finding problem, formulating it as a task of feature selection for classification between sequences that are believed to have the motif and sequences that are not. This enabled me to build on existing techniques of machine learning, resulting in a powerful probabilistic framework for finding a large class of motifs, including combinations of motifs.
As part of the build-up to my thesis, I did a survey of the literature
on combinations of motifs, called composite motifs or higher order
motifs, their biological significance, and relevant computational
methods.
Thus, in these various projects, I combined ideas from statistics and machine learning in proposing different solutions to motif finding and its related problems. It is this mix of various techniques, applied successfully to real-world problems, that excites me most about my work.
While the prime focus of my research is in computational biology, I have also enjoyed working in other domains, such as software analysis and protection. In collaboration with Ramarathnam Venkatesan of Microsoft Research, I worked on analyzing program binaries through their control flow graph abstraction.
We proposed a scheme for software watermarking, where a random subgraph was embedded in the control flow structure of the program, in a way that would be provably hard to recover, using reasonable adversarial models.
We also came up with a scheme to verify the untampered execution of code by a scheme that hashes code execution on the fly, without necessarily reading the code segment.
Another exciting project that I worked on was that of comparing the control flow graphs of two related programs, using a local subgraph hashing technique, and computing the minimum size of a patch that would convert one program to the other.
We also developed schemes for software obfuscation and tamper resistance, both of which exploited the graph underlying the program. All these projects are greatly influenced by my passion for applying or adapting techniques of algorithm design to new domains that have an immediate practical significance; a passion that I believe will drive my research in the future years.