Computer Science Department
University of Illinois at Urbana-Champaign

The simplest unit of the regulatory network is the binding site, where proteins called transcription factors bind the DNA, thereby switching a gene on or off. A fundamental problem of biological sequence analysis is to discover what the binding sites of a particular transcription factor look like, and where they are located in the genome. The common sequence pattern among the binding sites of a transcription factor is called a "motif". Our computational goals include:

  • Motif-scanning: to find matches to known motifs by scanning the genome

  • Motif-finding: to find motifs ab initio from sequences, without any prior knowledge

  • Module-finding: to find clusters of matches to several different motifs; such clusters are called "cis-regulatory modules", or CRMs.
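To make the motif-scanning goal concrete, here is a minimal sketch of scanning a sequence with a position weight matrix (PWM) and reporting windows whose log-odds score exceeds a threshold. The matrix values, the uniform background, the threshold, and the function names are all illustrative assumptions; this is not the scoring scheme of any particular program described on this page.

```python
import math

# Hypothetical 4-position PWM: one dict of base probabilities per motif
# position. The values are made up for illustration.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Assumed uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds(window, pwm=PWM, background=BACKGROUND):
    """Log2 ratio of the window's probability under the PWM vs. background."""
    return sum(math.log2(col[base] / background[base])
               for col, base in zip(pwm, window))


def scan(sequence, pwm=PWM, threshold=4.0):
    """Return (start, score) for every window scoring at least `threshold`."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = log_odds(sequence[start:start + width], pwm)
        if score >= threshold:
            hits.append((start, score))
    return hits


hits = scan("TTAGCTTT")
```

A hard threshold like this reports only strong matches; the probabilistic treatment described under "Details" is meant to account for weak sites as well.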

Details

An emerging theme of our approach is the probabilistic treatment of cis-regulatory sequences (CRMs or promoters), which harbor binding sites. We apply a probabilistic model, a hidden Markov model (HMM), to such sequences. The model allows all possible combinations of binding sites, of weak as well as strong affinity for their transcription factors, to occur in the sequence. We then use likelihood scores under this model to assess the overall degree of motif presence in a sequence.
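The idea of summing over all possible placements of binding sites can be sketched with a forward recursion. The example below is a deliberately simplified, single-motif model, assumed for illustration: at each step the sequence emits either one background base (probability 1 - p_motif) or a whole site drawn from the PWM (probability p_motif). The PWM values, the background frequencies, and the value of p_motif are hypothetical, and the code is not the implementation used in our software.

```python
import math

# Hypothetical PWM and uniform background, for illustration only.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_likelihood_ratio(seq, pwm=PWM, background=BACKGROUND, p_motif=0.01):
    """Log-likelihood of `seq` summed over all site placements, minus the
    log-likelihood under a background-only model."""
    width, n = len(pwm), len(seq)
    log_p, log_q = math.log(p_motif), math.log(1.0 - p_motif)
    # forward[i] = log-probability of generating the first i bases.
    forward = [-math.inf] * (n + 1)
    forward[0] = 0.0
    for i in range(1, n + 1):
        # Case 1: base i was emitted by the background.
        total = forward[i - 1] + log_q + math.log(background[seq[i - 1]])
        # Case 2: a binding site ends at base i.
        if i >= width:
            site = sum(math.log(col[base])
                       for col, base in zip(pwm, seq[i - width:i]))
            total = log_sum_exp(total, forward[i - width] + log_p + site)
        forward[i] = total
    log_background = sum(math.log(background[base]) for base in seq)
    return forward[n] - log_background
```

Because every parse contributes to the sum, several weak sites can raise the score even when none of them would pass a per-site threshold on its own.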

MOTIF SCANNING

We have used this approach to solve the motif-scanning problem in a variety of organisms, including honey bees, flatworms, mice, and humans. The Stubb program (which implements the HMM) is used to predict the set of genes that contain matches to a particular motif. The resulting motif target set is then tested for statistical enrichment of certain annotations, such as Gene Ontology (GO) annotations. This enables us to associate certain motifs (and their transcription factors) with specific cellular functions.
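The enrichment step can be illustrated with a one-sided hypergeometric test, which is one common choice for this kind of question; the text above does not specify which test is used, so treat the test itself, the function name, and the example numbers as assumptions.

```python
from math import comb


def enrichment_pvalue(n_genes, n_annotated, n_targets, n_overlap):
    """P(overlap >= n_overlap) when n_targets genes are drawn at random
    from n_genes, of which n_annotated carry the annotation."""
    total = comb(n_genes, n_targets)
    upper = min(n_annotated, n_targets)
    tail = sum(
        comb(n_annotated, k) * comb(n_genes - n_annotated, n_targets - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up numbers: 10 genes in total, 3 with the annotation, 3 motif
# targets, all 3 of which carry the annotation.
p = enrichment_pvalue(n_genes=10, n_annotated=3, n_targets=3, n_overlap=3)
```

In practice many motif and annotation pairs are tested at once, so the resulting p-values would need a multiple-testing correction.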

Social behavior in honey bees: We have used the above methodology to investigate the molecular basis of social behavior in honey bees. In particular, we examined the genes implicated by whole-brain microarray studies in mediating socially regulated behavior in honey bees, and found that transcription factors that perform nervous system-related functions in Drosophila are likely to regulate these behavior-related genes.


  • "Genome scan for cis-regulatory DNA motifs associated with social behavior in honey bees" - S. Sinha, X. Ling, C.W. Whitfield, C. Zhai, and G. E. Robinson
    PNAS, 103(44), Oct. 2006, pages 16352-16357. In the news
Aging reversal in mouse: We used the above computational approach to predict NF-kB as a potential regulator of the aging process in mice, a hypothesis that was subsequently confirmed experimentally by our collaborator Prof. Howard Chang (Stanford University) and his colleagues. They showed that blocking this transcription factor led to a reversal of the aging process in mouse skin cells.


  • "Motif module map reveals enforcement of aging by continual NF-kB activity." - A. Adler, S. Sinha, E. Segal, H. Chang.
    Genes & Development, 2007. 21:3244-3257. Cover article. Recommended by Faculty of 1000. In the news
Cancer regulation in human: The motif map method was applied to the promoters of all human genes to uncover key associations between transcription factors and many different forms of cancer. We also predicted the role of evolutionarily conserved motifs (identified by the Kellis lab, Xie et al., Nature 2005) in cell cycle progression, and these roles were then validated experimentally.


  • "Systematic Functional Characterization of cis-Regulatory Motifs in Human Core Promoters" - S. Sinha, A.S. Adler, Y. Field, H. Y. Chang, E. Segal.
    Genome Research, 2007. In press.

MOTIF FINDING

Phylogenetic motif finding: The hidden Markov model for analyzing regulatory sequences can be extended to multiple species related by a given phylogenetic tree. This is implemented in our "PhyME" software, which has been used to discover motifs ab initio in yeast, fruit flies, and vertebrates.


  • "PhyME: A Probabilistic Algorithm for Finding Motifs in Sets of Orthologous Sequences." - S. Sinha, M. Blanchette, M. Tompa. BMC Bioinformatics, 2004. 5:170. Marked by journal as Research Highlight
Discriminative motif finding: The probabilistic model for CRM sequences also allows us to "count" motif matches in a sequence, in the presence of other known motifs. This method of counting motif matches was described in the following paper, and used to solve the discriminative motif finding problem, i.e., finding motifs that are present in one set of sequences but not in another:


  • "On counting PWM matches in a sequence, with application to discriminative motif finding" - S. Sinha.
    Bioinformatics 22(S1), 2006. (Special issue on ISMB'06, Brazil.)
The PWM search algorithm called "DIPS", developed in the above paper, can be extended to optimize several different objective functions of PWM counts. We are therefore building upon the DIPS framework to solve the motif-finding problem using additional sources of information, such as gene expression data and transcription factor concentration data.
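As a rough illustration of an objective function built from PWM counts, the sketch below computes a "soft" match count for a sequence and scores a PWM by the difference in average count between a positive and a negative set. This is a simplified stand-in, not the DIPS algorithm or the counting method of the paper: the per-window posterior, the prior value, the example PWM, and all function names are assumptions made for this example.

```python
# Hypothetical PWM and uniform background, for illustration only.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def soft_match_count(seq, pwm=PWM, background=BACKGROUND, prior=0.01):
    """Sum over windows of the posterior probability that the window is a
    site, under a two-component (site vs. background) mixture."""
    width = len(pwm)
    total = 0.0
    for start in range(len(seq) - width + 1):
        p_site, p_bg = prior, 1.0 - prior
        for col, base in zip(pwm, seq[start:start + width]):
            p_site *= col[base]
            p_bg *= background[base]
        total += p_site / (p_site + p_bg)
    return total


def discriminative_score(positives, negatives, pwm=PWM):
    """Average soft count in the positive set minus that in the negative set."""
    pos = sum(soft_match_count(s, pwm) for s in positives) / len(positives)
    neg = sum(soft_match_count(s, pwm) for s in negatives) / len(negatives)
    return pos - neg


score = discriminative_score(["TTAGCTTT"], ["TTTTTTTT"])
```

A search procedure would then vary the PWM to maximize such a score; other objective functions could be swapped in by replacing discriminative_score.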

Over-representation based motif finding: The PI's doctoral dissertation proposed novel methods for motif finding (and its discriminative version) within a hypothesis testing framework. For more details, read here.

MODULE FINDING

With known motifs: We have in the past proposed new and accurate algorithms for predicting the locations of "modules" (CRMs) genome-wide, along with experimental validation. Those approaches, and indeed all current module-finding approaches, begin with prior knowledge of the relevant motifs and find statistically significant clusters of matches to those motifs. Read here for more details.

Without known motifs: In many scenarios today, there is very little knowledge of the important motifs. We are therefore developing algorithms to solve the problem ab initio, without relying on well-characterized motifs.

  • "Computational discovery of cis-regulatory modules in Drosophila, without prior knowledge of motifs" - A. Ivan, M. S. Halfon, S. Sinha.
    In review.

This work relied on our previously published study of cis-regulatory modules (CRMs) in Drosophila, which exploited the REDfly database. In that study we found that distinct classes of CRMs possess different sequence-level properties. A study of these properties, especially the presence of short words in the sequences, led to the module-prediction method mentioned above.

  • "Large-scale analysis of transcriptional cis-regulatory modules reveals both common features and distinct subclasses." - L. Li, Q. Zhu, X. He, S. Sinha, M.S. Halfon. Genome Biology, 2007. 8, R101. Marked by journal as highly accessed
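The "short words" property mentioned above can be illustrated by a k-mer frequency profile of a sequence. This is only a sketch of the kind of feature involved; the word length k and the function name are assumptions, and the published method may use different statistics.

```python
from collections import Counter


def kmer_profile(sequence, k=3):
    """Relative frequency of each length-k word occurring in `sequence`."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


profile = kmer_profile("ACGTACGT", k=2)
```

Profiles like this can be compared between known CRMs and background sequence to find words that are over-represented in a given class of CRMs.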