My primary research projects during my PhD study in 2016-2019

Phylo-HMGP: Continuous-trait Probabilistic Model for Comparing Multi-Species Functional Genomic Data

Multi-species functional genomic data can help us gain a better understanding of the molecular mechanisms underlying phenotypic diversity across species. Prior studies mostly compared genomic features across species based on discrete properties, which can lose information when the original data are continuous. We developed a new probabilistic model, phylogenetic hidden Markov Gaussian processes (Phylo-HMGP), to simultaneously infer heterogeneous evolutionary states of continuous-trait functional genomic features genome-wide. The method combines the Ornstein-Uhlenbeck process with a hidden Markov model to jointly model the temporal dependencies between the genomic features of different species and the spatial dependencies between genomic loci. We applied Phylo-HMGP to a new cross-species DNA replication timing (RT) dataset from the same cell type in five primate species (human, chimpanzee, orangutan, gibbon, and green monkey). The results show that Phylo-HMGP enables the discovery of genomic regions with distinct evolutionary patterns of RT. The method provides a generic framework for comparative analysis of multi-species continuous functional genomic signals to help reveal regions with conserved or lineage-specific regulatory roles.
[pdf] [source code]
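
As a rough illustration of the continuous-trait component, the sketch below computes the standard conditional distribution of an Ornstein-Uhlenbeck process along a single branch: given a trait value x0 at the ancestor, the value after time t is Gaussian with a mean pulled toward an optimum and a variance that saturates over time. The function names and the parameter values (alpha, theta, sigma) are hypothetical choices for this example and are not taken from the Phylo-HMGP source code, which fits such parameters per hidden state over the whole phylogeny.

```python
import math


def ou_transition(x0, t, alpha, theta, sigma):
    """Mean and variance of X(t) given X(0) = x0 under an OU process.

    alpha: selection strength (> 0), theta: optimal value,
    sigma: Brownian motion intensity. All names are illustrative.
    """
    decay = math.exp(-alpha * t)
    mean = theta + (x0 - theta) * decay
    var = sigma ** 2 / (2.0 * alpha) * (1.0 - math.exp(-2.0 * alpha * t))
    return mean, var


def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian, usable as a branch likelihood."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


# Hypothetical example: ancestor value 1.0, branch length 0.5.
mean, var = ou_transition(x0=1.0, t=0.5, alpha=2.0, theta=0.0, sigma=1.0)
log_lik = gaussian_logpdf(0.3, mean, var)
```

For long branches the mean approaches theta and the variance approaches sigma^2 / (2 * alpha), which is what lets different hidden states encode different evolutionary optima.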

Phylo-HMRF: Comparing Multi-Species 3D Genome Organization Using Hi-C Data

In humans and other eukaryotes, chromosomes are organized and folded in three-dimensional (3D) space in the cell nucleus. 3D genome organization is closely related to vital genome functions such as DNA replication timing (RT) and transcription. Recent whole-genome chromatin interaction mapping technologies have offered new insights into 3D genome organization. However, our knowledge of the principles underlying 3D genome organization, and of how the 3D genome evolves in mammals, remains limited. Computational methods for comparing 3D genomes across species are under-explored. We developed a new probabilistic model, phylogenetic hidden Markov random field (Phylo-HMRF), to identify evolutionary patterns of the 3D genome from multi-species Hi-C data by jointly utilizing spatial constraints among genomic loci and continuous-trait evolutionary models. We used Phylo-HMRF to uncover cross-species 3D genome patterns based on Hi-C data from the same cell type in four primate species (human, chimpanzee, bonobo, and gorilla). The identified evolutionary patterns of the 3D genome correlate with features of genome structure and function. This work provides a new framework for analyzing multi-species continuous genomic features with spatial constraints in 3D space and has the potential to help reveal the evolutionary principles of 3D genome organization.
[pdf] [source code]
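
To give a feel for what a spatial constraint in a hidden Markov random field can look like, here is a minimal sketch of a Potts-style smoothness energy on a 2D grid of hidden state labels, where each cell stands for a pair of genomic bins in a contact map. This is a generic textbook formulation and an assumption on my part, not the exact potential used in Phylo-HMRF; the function name, the 4-neighbour structure, and the weight beta are all hypothetical.

```python
def potts_energy(labels, beta=1.0):
    """Add beta for every pair of 4-neighbour cells with different labels.

    labels: rectangular list of lists of hidden state indices.
    Lower energy means a spatially smoother labelling.
    """
    n_rows = len(labels)
    n_cols = len(labels[0])
    energy = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Compare with the cell below and the cell to the right,
            # so that each neighbouring pair is counted exactly once.
            if i + 1 < n_rows and labels[i][j] != labels[i + 1][j]:
                energy += beta
            if j + 1 < n_cols and labels[i][j] != labels[i][j + 1]:
                energy += beta
    return energy


# Hypothetical labelings of a 3 x 3 patch of a contact map.
smooth = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]
noisy = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
smooth_energy = potts_energy(smooth)
noisy_energy = potts_energy(noisy)
```

In a full model, a term like this would be combined with a per-cell emission likelihood (for example, one from a continuous-trait evolutionary model), so that neighbouring cells are encouraged, but not forced, to share a state.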

Predicting Long-Range Enhancer-Promoter Interactions with Genomic Sequence Features

We developed an ensemble learning based framework (PEP) for predicting enhancer-promoter interactions (EPIs) using solely features derived from DNA sequence. PEP has two modules, PEP-Motif and PEP-Word, which use different feature extraction approaches. In PEP-Motif, we search for patterns of known transcription factor binding site (TFBS) motifs in the sequences of EPIs. In PEP-Word, we use a word embedding model from natural language processing to embed the sequences directly into a new feature space, yielding an informative, continuous, distributed feature representation of DNA sequences.
[pdf] [source code]
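
Before a word embedding model can be applied to DNA, the sequence has to be turned into "words". A common way to do this, sketched below, is to slide a window over the sequence and treat each k-mer as a token. The function name, the choice of k, and the stride are assumptions made for this illustration and are not taken from the PEP source code; the embedding model itself is not shown.

```python
from collections import Counter


def kmer_words(sequence, k=6, stride=1):
    """Split a DNA sequence into overlapping k-mer tokens ("words")."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Hypothetical toy sequence; real enhancer or promoter regions are far longer.
words = kmer_words("acgtacgt", k=3)
word_counts = Counter(words)
```

The resulting token list plays the role of a sentence: it can be passed to a word embedding model, and the per-token vectors can then be aggregated into a fixed-length feature vector for each sequence.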