Our long-term goal is to develop computational methods that address a fundamental challenge in biology and biomedicine: understanding how changes in genome sequence give rise to differences in phenotype (at both the cellular and organism levels). Such insights will shed new light on disease mechanisms. We develop algorithms (especially machine learning methods) to explore the human genome, identify different types of genomic changes, and study their impact on genome function, chromatin and nuclear genome organization, gene regulation, and various molecular interaction networks. We also develop systems biology approaches to identify key genetic variants in cancer development and progression. More recently, we have started to explore new algorithms for integrating data from multiple modalities in the context of single-cell biology.
The chromosomes of the human genome are organized in three dimensions within a compartmentalized cell nucleus. In addition, different loci on the chromosomes frequently interact with each other. This spatial organization of chromosomes in the nucleus is known to be tightly controlled. However, the principles of such complex organization and its functional impact are poorly understood. Leveraging new genome-wide mapping technologies, we are developing new machine learning algorithms to probe 3D genome organization. (1) Revealing spatial genome organization relative to nuclear compartments. (2) Understanding the principles of spatial genome organization and its impact on gene regulation.
Ultimately, human biology must be understood in the context of evolution. Advances in next-generation sequencing technologies have tremendously reduced sequencing costs, which has dramatically expanded the reach of genetic studies to many more non-model organisms. The whole-genome sequences of these newly sequenced species provide unprecedented opportunities to elucidate the trajectory of genome evolution and the variation in gene regulation that results in phenotypic diversity. We are working on several key problems in comparative genomics that will allow us to address important evolutionary and biomedical questions. (1) Identifying large-scale chromosomal changes to understand the evolutionary history of genome structure and evolutionary breakpoint regions. (2) Discovering gene regulatory elements and their functional roles in the human genome. (3) Understanding the evolutionary history of transcriptional regulation in different tissue types in mammals.
Cancer is a genetic disease. All cancers share a common pathogenesis: they are the outcome of a process of Darwinian evolution occurring among cell populations within the microenvironments provided by certain tissues. This evolutionary process can favor cells carrying advantageous mutations that allow them to proliferate and survive more effectively; such cells may then invade tissues, cause cancer, and eventually metastasize. Indeed, the genomes of somatic cells undergo dramatic changes that promote cancer development and progression, including point mutations, structural variants, copy number alterations, and epigenetic aberrations. We are developing new methodologies to determine important alterations in tumor cells. (1) Identifying driver genomic alterations in cancer. (2) Understanding the role of structural variants and copy number alterations in cancer genomes. (3) Discovering key perturbations in transcriptional regulation in cancer.
We have a keen interest in developing novel algorithms and tools for more efficient and more accurate analysis of genomic data, especially high-throughput next-generation sequencing reads. We draw on techniques from machine learning, combinatorial optimization, information theory, statistics, and high-performance computing. We are designing new algorithms for assembling large genomes, for error correction, and for data compression. We also work on methods for handling sequencing reads from RNA-seq, ChIP-seq, and whole-genome sequencing, in order to extract biologically significant information from high-throughput datasets more effectively. In addition, we continue to work on fundamental problems in computational genomics, e.g., multiple sequence alignment and whole-genome comparison.