Traditionally, genetic studies have focused on the correlation between genetic variation and organism-level phenotypes. The emerging field of systems genetics studies how DNA variation affects molecular phenotypes, such as gene expression, metabolites, and DNA methylation, which provide a bridge from genotypic changes to organism-level phenotypes. A large and growing number of systems genetics studies have been performed, often focusing on gene expression traits (called eQTL studies). I built a statistical framework, named Sherlock, to jointly analyze eQTL data and data from genome-wide association studies (GWAS). This method allows us to effectively combine many weak signals in GWAS to identify disease susceptibility genes. Because many such signals are linked to the expression of a gene in trans, Sherlock is able to detect completely new genes from GWAS, and we made promising discoveries in a range of diseases.
Each person inherits mutations from their parents, some of which may predispose the person to certain diseases. In addition, new mutations may occur spontaneously during the reproductive process, and if they disrupt key genes, these de novo mutations may increase the risk of disease, especially neurodevelopmental disease. We developed a likelihood model that effectively combines data from multiple sources on the same genes: de novo mutations, inherited variants identified in families, and standing variants in the population (identified with case-control studies). We use a hierarchical Bayes strategy to borrow information across genes and improve statistical inference. This highly integrative approach greatly increases the power of gene discovery, and predicts promising genes for autism.
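To illustrate how evidence from several data sources can be combined for one gene, here is a minimal sketch in Python. It is not our likelihood model: it assumes that each source has already been summarized as a Bayes factor, that the sources are independent given the gene's status, and that the prior fraction of risk genes is known. The function name and the numbers are hypothetical. In a hierarchical Bayes setting the prior would be estimated from all genes rather than fixed.

```python
from math import prod


def posterior_risk_probability(bayes_factors, prior):
    """Posterior probability that a gene is a risk gene.

    bayes_factors: one Bayes factor per data source (hypothetical inputs,
        e.g. de novo, inherited, and case-control evidence).
    prior: assumed prior fraction of risk genes, strictly between 0 and 1.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    # Under the independence assumption, evidence multiplies.
    combined = prod(bayes_factors)
    return prior * combined / (prior * combined + (1.0 - prior))


# Two moderately supportive sources; neither is convincing on its own.
print(posterior_risk_probability([4.0, 5.0], prior=0.05))
```

The point of the sketch is that two weak pieces of evidence for the same gene can together shift the posterior substantially, which is why pooling data types helps gene discovery.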
Regulatory DNA sequences drive gene expression patterns by integrating information about the environment in the form of the activities of transcription factors. The rules by which regulatory sequences read this information, however, are unclear. I developed quantitative models based on physicochemical principles that directly map regulatory sequences to the expression profiles they generate. These models incorporate mechanistic features that attempt to capture how activating and repressing factors work together. By evaluating the importance of these features in the fruit fly segmentation system, we were able to gain insight into the quantitative regulatory rules, including the way repressors prevent transcriptional activation and the role of cooperative interactions. A simpler model was also applied to ChIP-seq data for transcription factors important in embryonic stem cells, and was shown to be significantly more predictive of DNA binding affinities than existing methods.
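The role of cooperative interactions can be illustrated with a generic statistical-thermodynamic calculation for two binding sites. This is a textbook-style sketch, not the model described above: the function name is hypothetical, q1 and q2 stand for the statistical weights of each site being bound (roughly, factor concentration times binding affinity), and omega is an assumed cooperativity factor, with omega = 1 meaning independent binding.

```python
def two_site_occupancy(q1, q2, omega=1.0):
    """Equilibrium occupancy of each of two binding sites.

    The four states are: both sites empty (weight 1), only site 1 bound (q1),
    only site 2 bound (q2), and both bound (omega * q1 * q2).
    """
    both = omega * q1 * q2
    z = 1.0 + q1 + q2 + both  # partition function over the four states
    occupancy_1 = (q1 + both) / z
    occupancy_2 = (q2 + both) / z
    return occupancy_1, occupancy_2


# Same sites, without and with cooperativity.
print(two_site_occupancy(1.0, 1.0, omega=1.0))
print(two_site_occupancy(1.0, 1.0, omega=10.0))
```

With omega greater than 1, each site is occupied more often than it would be alone, which is one way a cluster of weak sites can act as a strong regulatory input.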
Computational prediction of regulatory sequences may rely on sequence content: whether a sequence contains binding sites that match the specificity of transcription factors (TFs). Cross-species genome comparison can further improve prediction because true functional sites tend to be conserved during evolution. We developed computational methods to implement this idea. These methods are built on stochastic models that describe both the sequence content and the evolution of regulatory sequences. In particular, these models capture binding site gain and loss events during evolution. This feature allows our methods to predict partially conserved binding sites, which are often important in multi-species comparisons or in comparisons of relatively divergent pairs. In a different approach, we built a model to integrate data from in vitro TF binding (protein binding microarrays), chromatin accessibility (DNase-seq), and evolutionary conservation. The result is a comprehensive map of the binding sites of nearly 200 TFs across 50 different tissues.
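As a concrete picture of what "matching the specificity of a TF" means, the sketch below scores every window of a sequence against a position weight matrix using log-odds relative to a uniform background. It covers only the sequence-content part and none of the evolutionary modeling described above. The matrix, the sequence, and the function names are made up for illustration, and a uniform background of 0.25 per base is an assumption.

```python
import math


def pwm_log_odds(site, pwm, background=0.25):
    """Log2-odds score of one candidate site under a position weight matrix."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(site))


def best_site(sequence, pwm):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    windows = range(len(sequence) - width + 1)
    return max((pwm_log_odds(sequence[i:i + width], pwm), i) for i in windows)


# Hypothetical 3-bp motif that prefers "TAG"; each column sums to 1.
toy_pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
score, start = best_site("CCTAGCC", toy_pwm)
print(start, round(score, 3))
```

A conservation-aware method would go further and ask whether the high-scoring window is retained, lost, or newly gained in the aligned sequences of other species.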
Understanding the conservation and change of regulatory sequences is critical to our knowledge of both the unity and the diversity of animal development and phenotypes. We tested key hypotheses about cis-regulatory evolution using sequence data from more than 50 developmental enhancers across 12 Drosophila species. We made several interesting findings: for example, there are substantial epistatic interactions among different positions of a transcription factor binding site; the loss of functional binding sites roughly follows a molecular clock; and the evolutionary fate of a binding site often depends on its sequence context. In another study, I used both theoretical analysis and simulations to demonstrate how redundancy (homotypic clustering of transcription factor binding sites) is built into regulatory sequences by evolution, even though redundancy is never directly selected for.
During evolution, the order and relative proximity of genes in genomes are generally not well conserved because of rapid genome rearrangement events. On the other hand, functionally related genes may be constrained by natural selection to remain close to each other. Thus, identifying these so-called conserved gene clusters is one way of finding functional gene groups, and can be used to reveal the forces underlying the evolution of genome organization. However, substantial genome rearrangements pose unique computational challenges. I developed a combinatorial algorithm to detect these gene clusters in pairwise genome comparisons, allowing the genes in a cluster to appear in arbitrary order. Later, I helped my colleague, Xu Ling, to improve the efficiency of the algorithm and extend the analysis to a large number of genomes. By combining the algorithmic approach with our newly developed statistical method, we analyzed more than one hundred bacterial genomes and predicted many novel functional gene groups.
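The idea of a cluster whose genes may appear in arbitrary order can be made concrete with a small brute-force sketch. This is not the algorithm described above: it is a simple "common intervals" search under the strong, simplifying assumption that both genomes contain exactly the same genes, each once, with no insertions or gaps. The function name and the toy genomes are hypothetical.

```python
def common_intervals(genome_a, genome_b, min_size=2):
    """Find gene sets that are contiguous in both genomes, in any order.

    Assumes genome_a and genome_b are permutations of the same gene names.
    Runs in quadratic time, which is fine for an illustration only.
    """
    position_in_b = {gene: i for i, gene in enumerate(genome_b)}
    clusters = []
    for start in range(len(genome_a)):
        lo = hi = position_in_b[genome_a[start]]
        for end in range(start + 1, len(genome_a)):
            p = position_in_b[genome_a[end]]
            lo, hi = min(lo, p), max(hi, p)
            size = end - start + 1
            # The window in A is also a window in B exactly when the genes
            # span `size` consecutive positions in B.
            if hi - lo + 1 == size and size >= min_size:
                clusters.append(frozenset(genome_a[start:end + 1]))
    return clusters


genome_a = ["a", "b", "c", "d", "e"]
genome_b = ["c", "a", "b", "e", "d"]
for cluster in common_intervals(genome_a, genome_b):
    print(sorted(cluster))
```

In this toy case the genes d and e form a cluster even though their order is reversed in the second genome, while b and c do not, because a lies between them there. Real genomes also require handling of missing genes, gaps, and statistical significance, none of which this sketch attempts.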
Text mining aims to extract information automatically from the vast biological literature. In my RA work with BeeSpace, I was involved in several projects that developed novel text mining methods and practical systems. In one project, we developed the BeeSpace question answering (BSQA) system, which performs integrated text mining for insect biology. BSQA recognizes a number of entities and relations, from gene interactions to insect behavior, in Medline documents. For any text query, BSQA is able to automatically identify important concepts associated with the query, arranged in different categories. By utilizing the extracted relations, BSQA is also able to answer many biologically motivated questions, from simple ones, such as where a gene is expressed, to more complex ones involving multiple types of relations. In another project, I proposed a new statistical method that mines the biological literature to find important concepts characterizing sets of genes.