The simplest unit of the regulatory network is the binding site, where proteins called transcription factors bind the DNA, thereby switching a gene on or off.
A fundamental problem of biological sequence analysis is to discover what
the binding sites of a particular transcription factor look like, and where they are located
in the genome. The sequence pattern shared by the binding sites of a transcription factor is called
a "motif". Our computational goals include motif scanning, motif finding, and module finding.
Details

An emerging theme of our approach is the probabilistic treatment of cis-regulatory sequences (CRMs or promoters), which harbor binding sites. We apply a probabilistic model, a Hidden Markov Model (HMM), to such sequences. The model allows all possible combinations of binding sites, of weak as well as strong affinity for their transcription factors, to occur in the sequence. We then use likelihood scores under this model to assess the overall degree of motif presence in a sequence.

MOTIF SCANNING

We have used this approach to solve the motif-scanning problem in a variety of organisms, including honey bees, flatworms, mice, and humans. The Stubb program (which implements the HMM) is used to predict the set of genes that contain matches to a particular motif. The resulting motif target set is then tested for statistical enrichment for certain annotations, such as Gene Ontology (GO) annotations. This enables us to associate certain motifs (and their transcription factors) with specific cellular functions.

Social behavior in honey bees: We have used the above methodology to elucidate aspects of the molecular basis of social behavior in honey bees. In particular, we examined the genes that whole-brain microarray studies implicate in socially regulated behavior, and found that transcription factors that perform nervous system-related functions in Drosophila are likely to regulate these behavior-related genes.
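The likelihood scoring described above can be illustrated with a small sketch. It is not the Stubb implementation: it assumes a single motif given as a position weight matrix, a zeroth-order background, and a fixed per-position probability of starting a site. The function name `motif_presence_score`, the parameter `p_motif`, and the example matrix are all hypothetical. The score sums over every possible placement of sites, so weak matches contribute alongside strong ones.

```python
import math


def motif_presence_score(seq, pwm, background, p_motif=0.01):
    """Log-likelihood ratio of `seq` under a motif-or-background model
    versus a background-only model.

    pwm        -- list of dicts, one per motif column, base -> probability
    background -- dict, base -> background probability
    p_motif    -- assumed probability of starting a site at any step
    """
    w = len(pwm)
    # ratio[i] = P(seq[:i] | motif model) / P(seq[:i] | background only),
    # summed over every possible placement of sites within seq[:i].
    ratio = [1.0] + [0.0] * len(seq)
    for i in range(1, len(seq) + 1):
        # Case 1: the last base was emitted by the background state.
        total = (1.0 - p_motif) * ratio[i - 1]
        # Case 2: a binding site of width w ends at the last base.
        if i >= w:
            odds = 1.0
            for column, base in zip(pwm, seq[i - w:i]):
                odds *= column[base] / background[base]
            total += p_motif * odds * ratio[i - w]
        ratio[i] = total
    return math.log(ratio[-1])


def _column(preferred):
    """Hypothetical motif column that strongly prefers one base."""
    return {b: (0.85 if b == preferred else 0.05) for b in "ACGT"}


# Hypothetical 4-column motif preferring the word ACGT.
EXAMPLE_PWM = [_column(b) for b in "ACGT"]
UNIFORM_BACKGROUND = {b: 0.25 for b in "ACGT"}

score_with_site = motif_presence_score(
    "TTTTACGTTTTT", EXAMPLE_PWM, UNIFORM_BACKGROUND)
score_without_site = motif_presence_score(
    "TTTTTTTTTTTT", EXAMPLE_PWM, UNIFORM_BACKGROUND)
```

Working with the ratio to the background likelihood, rather than the raw likelihood, keeps the numbers near 1 for short sequences. A production version would still need log-space arithmetic or scaling for long sequences.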
MOTIF FINDING

Phylogenetic motif finding: The Hidden Markov Model for analyzing regulatory sequences can be extended to multiple species related by a given phylogenetic tree. This is implemented in our "PhyME" software, which has been used to discover motifs ab initio in yeast, fruit flies, and vertebrates.
Over-representation based motif finding: The PI's doctoral dissertation proposed novel methods for motif finding (and its discriminative version) within a hypothesis testing framework. For more details, read here.

MODULE FINDING

With known motifs: We have previously proposed new and accurate algorithms for predicting the locations of "modules" (CRMs) genome-wide, along with experimental validation. Those approaches, like all current module-finding approaches, begin with prior knowledge of the relevant motifs and find statistically significant clusters of matches to those motifs. Read here for more details.
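The idea of finding significant clusters of motif matches can be sketched as follows. This is a simplified illustration, not the published algorithms: it assumes match positions have already been computed, models match counts per window as Poisson with a rate estimated from the whole sequence, and applies no multiple-testing correction. The names `find_match_clusters`, `window`, and `alpha` are hypothetical.

```python
import math


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** j / math.factorial(j)
                for j in range(k))
    return 1.0 - below


def find_match_clusters(match_positions, seq_length, window=500, alpha=1e-3):
    """Return merged (start, end) intervals where motif matches are
    denser than expected by chance.

    match_positions -- coordinates of motif matches (any order)
    seq_length      -- length of the scanned sequence
    window          -- assumed maximum cluster width, in bases
    alpha           -- assumed per-window significance cutoff
    """
    positions = sorted(match_positions)
    # Expected number of matches in one window if matches were
    # spread uniformly over the sequence.
    lam = len(positions) * window / seq_length
    clusters = []
    start_idx = 0
    for end_idx, pos in enumerate(positions):
        # Shrink the window from the left until it spans < `window` bases.
        while pos - positions[start_idx] >= window:
            start_idx += 1
        count = end_idx - start_idx + 1
        if poisson_tail(count, lam) < alpha:
            start = positions[start_idx]
            if clusters and start <= clusters[-1][1]:
                # Overlaps the previous cluster: extend it.
                clusters[-1] = (clusters[-1][0], pos)
            else:
                clusters.append((start, pos))
    return clusters


# Hypothetical match coordinates: five close matches plus scattered ones.
example_matches = [100, 5000, 5050, 5100, 5150, 5200,
                   20000, 40000, 60000, 80000]
example_clusters = find_match_clusters(example_matches, seq_length=100000)
```

A count threshold on a sliding window is the simplest stand-in for the significance test; a realistic method would also weight matches by their strength.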
Without known motifs: In many scenarios today, there is very little knowledge of the important motifs. We are therefore developing algorithms to solve the problem ab initio, without relying on well-characterized motifs.
This work builds on our previously published study of cis-regulatory modules (CRMs) in Drosophila, which drew on the REDfly database. That study found that distinct classes of CRMs possess different sequence-level properties. Analysis of these properties, especially the presence of short words in the sequences, led to the module-prediction method mentioned above.
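One way to examine short words in a set of sequences is to count k-mers and compare their frequencies against a background set. The sketch below only illustrates that idea and is not the method used in the study. The function names, the choice of log-ratio with a pseudocount, and the toy sequences are assumptions.

```python
import math
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping word of length k across the sequences,
    skipping words that contain characters other than A, C, G, T."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts


def word_log_ratios(crm_seqs, background_seqs, k=6, pseudocount=1.0):
    """Log ratio of each word's frequency in CRM sequences to its
    frequency in background sequences. Positive values mark words
    that are relatively more common in the CRM set."""
    fg = kmer_counts(crm_seqs, k)
    bg = kmer_counts(background_seqs, k)
    # The pseudocount is added to all 4**k possible words so that
    # unseen words do not produce a division by zero.
    fg_total = sum(fg.values()) + pseudocount * 4 ** k
    bg_total = sum(bg.values()) + pseudocount * 4 ** k
    ratios = {}
    for word in set(fg) | set(bg):
        fg_freq = (fg[word] + pseudocount) / fg_total
        bg_freq = (bg[word] + pseudocount) / bg_total
        ratios[word] = math.log(fg_freq / bg_freq)
    return ratios


# Toy sequences, for illustration only.
example_ratios = word_log_ratios(["GATAAGATAA"], ["CCCCCCCCCC"], k=4)
```

Words with large positive ratios could then serve as features for distinguishing one class of sequences from another.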