
Current Research:

Homology identification for multi-domain proteins

Background: Homology identification is the first step in many genome-scale computational analyses, including construction of comparative maps, analysis of whole-genome duplications, and functional annotation of genes. Sequence comparison methods are currently the most widely used way to identify homologous genes. However, two multi-domain proteins may show significant sequence similarity because they share a domain, even though they have distinct evolutionary histories. To address this problem, I designed several methods.

Projects in progress:

  • Domain architecture comparison: I developed several schemes to explicitly compare domain architectures and investigated their effectiveness in predicting homology. The results demonstrate the importance of both weighting critical domains and compensating for proteins with large numbers of domains. (Adviser: Dannie Durand)

  • Neighborhood Correlation: I designed a novel approach that exploits the structure of the sequence similarity network. The neighborhood of a sequence is the set of vertices adjacent to it in this network. The results show that the neighborhood structure of non-homologous pairs is characteristically different from that of homologous pairs. The Neighborhood Correlation method is accurate, reliable and efficient in homology identification. (Adviser: Dannie Durand)

  • Protein family classification: I applied various unsupervised, supervised, and semi-supervised machine learning methods to classify proteins and compared their performance in homology identification. The results indicate that semi-supervised classification performs better than the other methods. (Adviser: John Lafferty)
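To make the neighborhood idea in the Neighborhood Correlation project concrete, here is a minimal sketch. It rests on an assumption the summary above does not state: that two neighborhoods are compared by taking the Pearson correlation of the two sequences' similarity scores against every vertex in the network. The function names, the toy network, and the scores are all hypothetical and for illustration only. This is not the actual implementation.

```python
from math import sqrt

def score(scores, a, b):
    """Look up a symmetric similarity score; absent pairs count as 0."""
    return scores.get((a, b), scores.get((b, a), 0.0))

def neighborhood_correlation(scores, x, y, vertices):
    """Pearson correlation between the similarity profiles of x and y.

    Assumption: a profile is the vector of similarity scores from one
    sequence to every vertex in the network.
    """
    px = [score(scores, x, v) for v in vertices]
    py = [score(scores, y, v) for v in vertices]
    n = len(vertices)
    mx, my = sum(px) / n, sum(py) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(px, py))
    vx = sum((a - mx) ** 2 for a in px)
    vy = sum((b - my) ** 2 for b in py)
    if vx == 0 or vy == 0:
        return 0.0  # a flat profile carries no neighborhood signal
    return cov / sqrt(vx * vy)

# Hypothetical toy network: A, B and D resemble one another throughout,
# while C matches A only through one shared domain and otherwise
# resembles E.
vertices = ["A", "B", "C", "D", "E"]
scores = {(v, v): 100.0 for v in vertices}  # self-similarity
scores.update({
    ("A", "B"): 80.0, ("A", "D"): 70.0, ("B", "D"): 75.0,
    ("A", "C"): 60.0,   # similarity caused by a shared domain only
    ("C", "E"): 90.0,
})

nc_ab = neighborhood_correlation(scores, "A", "B", vertices)
nc_ac = neighborhood_correlation(scores, "A", "C", vertices)
print("A vs B higher than A vs C:", nc_ab > nc_ac)
```

In this toy network the pair A and C has a sizable direct similarity score, yet their neighborhoods barely overlap, so the correlation separates them from the pair A and B.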


Previous Research

  • Oncology: My research focused on identifying genetic susceptibility factors for lung cancer and the mechanisms underlying this predisposition. I experimentally determined polymorphisms of the genes CYP1A1, CYP2E1, GSTs and hOGG1 in a large-scale case-control study. Statistical analysis suggests that the genotypes of these genes affect cancer predisposition. I also investigated the mechanisms of cancer predisposition by analyzing DNA adducts and protein expression. (1997-2000, Cancer Institute, Peking Union Medical College, Adviser: Dongxin Lin)

  • Computational docking: I evaluated the scoring function for protein-ligand docking in the protein modeling software SLIDE. I analyzed correctly versus incorrectly docked ligands for thrombin. The results indicate that RMSD is not a smoothly varying variable for comparing scores as a function of docking correctness. I further analyzed how well the hydrophobic and hydrogen-bonding components of the scoring function tracked the quality of docking. (Spring 2001, Michigan State University, Supervisor: Leslie Kuhn)

  • Automated interpretation of cell images: I simulated time-series images of live cells, especially processes that involve objects merging and splitting. I further investigated the features that can effectively track objects in the images. (Fall 2001, Carnegie Mellon University, Supervisor: Robert Murphy)