Computer Science Department
University of Illinois at Urbana-Champaign

Pre-2005 Bioinformatics Research - Saurabh Sinha

MOTIVATION

Our research focuses on the development of algorithms for solving problems in molecular biology.

POST-DOCTORAL RESEARCH

We worked on probabilistic algorithms to detect cis-regulatory modules (CRMs) in whole-genome scans. We have developed an HMM (Hidden Markov Model) -based approach that looks for clusters of motifs, represented by PWMs position weight matrices. The states of the HMM correspond to different motifs, and probabilistically output binding sites with varying degrees of similarity to the input PWM's. This algorithm, called Stubb, also looks at multiple species data, making phylogenetic comparisons between them. The simple premise is that what is functional is more likely to be conserved in evolution, hence the signal encoded in the genome can be amplified by looking for evolutionary conservation. Moreover, binding sites of certain pairs of transcription factors have a biological preference to occur near each other. Such positional correlations are captured in Stubb by special restrictions on the transitions of the underlying Markov chain. An implementation of this algorithm, the Stubb software suite, may be downloaded here and may be used via our web server here.

We have used our Stubb software to demonstrate that the conventional wisdom about comparative genomics is indeed true -- it really helps to have multiple species. We showed in the following paper that cross species comparison between the two sequenced fruitfly species improves the genome-wide detection of cis-regulatory modules.


Recently, there has been growing interest in a new kind of motif finding application - to find transcription factor binding sites when we are given the promoters or enhancers of a set of potentially co-regulated genes as well as some or all of their orthologs from other species. The interesting problem here is to detect regulatory motifs in such heterogeneous data by evaluating the motifs' overrepresentation while accounting for known evolutionary relationships among the orthologs. I have been working on an Expectation-Maximization approach to solving this problem. The following papers present some of these ideas.

Cross species comparison of genomic sequences has mostly focused on what is conserved, rather than what is not, at least from a bioinformatics perspective. However, I believe bioinformatic approaches stand to benefit considerably from an understanding of how regulatory elements evolve, and how changes in regulatory sequences translate to changes in phenotype, even the birth of new morphologies. We have screened a large number of cis-regulatory modules for changes at the level of binding sites (between the two fly genomes) using the Stubb software. The modules that showed the most interesting changes were then experimentally analyzed and demonstrated to drive differing expression patterns in the two species. This project has involved extensive experimental validation of our bioinformatic predictions, and its results will be published in the coming months.

While working on the problem of regulatory evolution, we asked the question : "How are new regulatory sequences created ?" Once again, the three fruitfly genomes (D. melanogaster, D. yakuba and D. pseudoobscura) and comparative genomics provide an opportunity to address this question on a genome-wide scale. We have examined sequence changes in regulatory regions and found a different pattern of change than in neutral regions. The findings, which are somewhat surprising, partly answer the question of how regulatory sequence may be created, at a crude quantitative level. We report this study in :

With the completion of many more fruitfly genome sequences, the prospect of improving and expanding the above analysis is overwhelming. Such studies will not only improve our fundamental understanding of regulatory evolution, but also nicely complement bioinformatic algorithms that try to detect the regulatory regions in the genome.

DOCTORAL RESEARCH

Most of my earlier research focuses on the detection of short motifs that represent transcription factor binding sites in the promoter regions of genes. My doctoral thesis ( PS.gz, PS) addresses the motif-finding problem by looking for fuzzy, but statistically overrepresented words in randomly generated text. Based on prior biological knowledge, we first define a generic characterization of these motifs, which guides the search. One approach we have taken to the motif-finding problem is that of statistical overrepresentation: a feature that has the expected characteristics of a motif, and occurs surprisingly often in relevant genomic sequence, is likely to be the motif we are looking for. Our algorithm defines an objective criterion for measuring overrepresentation and searches the space of all candidate motifs for the highest scoring one. This was joint work with my advisor, Martin Tompa.


We applied our algorithm on several co-regulated genes and co-functional genes from yeast, and several interesting motifs (both known and novel) came out of the analysis.

Our motif finding program is called "YMF" (Yeast Motif Finder) and we have implemented a Web interface for this software. This server can be accessed here. The following document describes this web interface.

We also performed a rigorous study of our algorithm's performance (on synthetic and biological data in yeast), and evaluated its strengths and weaknesses when compared to two other popular motif detection algorithms - MEME and AlignACE. Our enumerative approach to the motif search problem gives it the edge over the local search heuristics deployed by these programs. Also, our motif model turned out to be more suited for transcription factor binding sites in yeast, contributing to the better performance of our algorithm.

Often, the motifs reported by a computational method as being statistically significant include several slightly different copies of the same motif, only one of which may be truly significant, the others' overrepresentation being a statistical side-effect of this real motif. My colleague Mathieu Blanchette and I have proposed a statistically sound method for finding the real motif among a set of close variants, and demonstrated that it can recover significant motifs with a very low false positive rate.

In another project, I took a very different approach to the motif-finding problem, formulating it as a task of feature selection for classification between sequences that are believed to have the motif and sequences that are not. This enabled me to build on existing techniques of machine learning, resulting in a powerful probabilistic framework for finding a large class of motifs, including combinations of motifs.


A more complete version of the above report appears in:

As part of the build-up to my thesis, I did a survey of the literature on combinations of motifs, called composite motifs or higher order motifs, their biological significance, and relevant computational methods.

Thus, in these various projects, I combined ideas from statistics and machine learning in proposing different solutions to motif finding and its related problems. It is this mix of various techniques, applied successfully to real-world problems, that excites me most about my work.




My Stint with Software Analysis and Protection

While the prime focus of my research is in computational biology, I have also enjoyed working in other domains, such as software analysis and protection. In collaboration with Ramarathnam Venkatesan of Microsoft Research, I worked on analyzing program binaries through their control flow graph abstraction.

We proposed a scheme for software watermarking, where a random subgraph was embedded in the control flow structure of the program, in a way that would be provably hard to recover, using reasonable adversarial models.



We also came up with a scheme to verify the untampered execution of code by a scheme that hashes code execution on the fly, without necessarily reading the code segment.



Another exciting project that I worked on was that of comparing the control flow graphs of two related programs, using a local subgraph hashing technique, and computing the minimum size of a patch that would convert one program to the other.



We also developed schemes for software obfuscation and tamper resistance, both of which exploited the graph underlying the program. All these projects are greatly influenced by my passion for applying or adapting techniques of algorithm design to new domains that have an immediate practical significance; a passion that I believe will drive my research in the future years.