
Yifeng Tao

Ph.D. Candidate
  MI 654F, CMU, PA 15213


Hello, I am a final-year Computer Science Ph.D. student in the Computational Biology Department at Carnegie Mellon University. My research interests lie in machine learning for cancer genomics, tumor phylogenetics, and computational healthcare.

I have been working with Prof. Russell S. Schwartz. In my first two years at CMU, I collaborated with Dr. William W. Cohen and Prof. Xinghua Lu. Prior to that, I worked with Prof. Jianyang Zeng as an undergraduate. I hold a master's degree in machine learning from CMU and a bachelor's degree in automation (with a double major in economics) from Tsinghua University. You can find my CV here. I am fortunate to have been supported by the CMLH Fellowship in Digital Health in 2019-2020.



  • 2021/02   Our paper on assessing the contribution of tumor mutational phenotypes to cancer progression risk has been provisionally accepted by PLOS Computational Biology.
  • 2020/08   Our paper on neural network deconvolution of the bulk tumor transcriptome has been accepted by Frontiers in Physiology.
  • 2020/07   Our paper on drug response prediction through attention-based collaborative filtering has been accepted by MLHC 2020.
  • 2020/06   Started my summer internship as a data scientist at Illumina, working with Dr. Kimberly Gietzen.
  • 2020/03   Our paper on robust and accurate deconvolution of tumor populations has been conditionally accepted by ISMB 2020.
  • 2020/03   Our paper on accurate reconstruction of tumor heterogeneity using FISH data has been released.
  • 2019/12   Proposed my Ph.D. thesis: Genome-driven personalized medicine of cancer via machine learning and phylogenetic models.

Research: Machine Learning in Cancer Genomics

Cancer arises from the accumulation of genomic alterations and develops into heterogeneous cell populations through an evolutionary process. The prognoses of cancer patients, such as survival, metastasis, and drug response, are therefore encoded in large-volume genomic data. Our research focuses on personalized cancer medicine using machine learning and phylogenetic models:
  • Reliable phenotype inference of cancer through well-designed interpretable machine learning models. By leveraging large-scale genomic data and external biomedical knowledge bases, we have been working on deep learning models for the accurate inference of cancer phenotypes, including transcriptome expression levels (Genomic Impact Transformer; GIT), transcription factor activities, and drug resistance (Contextual Attention-based Drug REsponse; CADRE). We addressed model interpretability through techniques such as attention mechanisms, which help identify driver mutations and critical biomarkers.
  • Revealing intra- and inter-tumor heterogeneity and mechanisms of tumor progression via robust deconvolution and phylogenetic algorithms. We formulated the deconvolution of bulk tumor molecular data as a biologically inspired matrix factorization problem, and proposed a neural network (Neural Network Deconvolution; NND) and then an improved hybrid optimizer (Robust and Accurate Deconvolution; RAD) to solve it robustly and accurately. We developed and applied a Minimum Elastic Potential (MEP) algorithm to reconstruct the evolutionary trajectory from the unmixed clones. Our ongoing projects focus on integrating single-cell data for finer-resolution clone deconvolution and phylogeny inference (FISH-Deconv).
  • Improving prognostic prediction of cancer by incorporating machine learning and evolutionary methods. Clinicians have traditionally focused on pathological features and driver-level genomic profiles to guide treatment. However, it is possible that critical clones, rather than the bulk tumor as a whole, affect the prognoses. We explored this question by integrating evolutionary mutational features, driver-level features, and clinical features to improve the prognostic prediction of cancer. We developed an L0-regularized Cox regression model (Phylo-Risk), and found that the evolutionary features account for roughly 1/3 of all the available features, depending on cancer types and sequencing techniques.
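
As a rough illustration of the matrix-factorization view of deconvolution described above, the sketch below factors a nonnegative bulk matrix B (genes x samples) into clone profiles C and clone fractions F, with each sample's fractions summing to 1. This is not the NND or RAD implementation: the exact objective, the use of standard multiplicative nonnegative matrix factorization updates, the column-rescaling heuristic, the function names, and the toy data are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def squared_error(b, c, f):
    """Squared Frobenius norm of B - C F."""
    approx = matmul(c, f)
    return sum((x - y) ** 2 for rb, ra in zip(b, approx) for x, y in zip(rb, ra))


def deconvolve(b, k, iters=300, seed=0, eps=1e-9):
    """Factor a nonnegative bulk matrix B (genes x samples) as C F.

    C (genes x k) holds hypothetical per-clone profiles; F (k x samples)
    holds clone fractions, with every column rescaled to sum to 1.
    """
    rng = random.Random(seed)
    m, n = len(b), len(b[0])
    c = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(m)]
    f = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        # Multiplicative update of F: F <- F * (C^T B) / (C^T C F).
        ct = transpose(c)
        num = matmul(ct, b)
        den = matmul(matmul(ct, c), f)
        f = [[f[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # Heuristic projection: rescale each sample's fractions to sum to 1.
        sums = [sum(f[i][j] for i in range(k)) for j in range(n)]
        f = [[f[i][j] / (sums[j] + eps) for j in range(n)] for i in range(k)]
        # Multiplicative update of C: C <- C * (B F^T) / (C F F^T).
        ft = transpose(f)
        num = matmul(b, ft)
        den = matmul(c, matmul(f, ft))
        c = [[c[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return c, f


# Toy bulk data (made-up numbers): 3 genes x 4 samples, mixtures of 2 clones.
bulk = [[5.0, 4.0, 2.0, 1.0],
        [1.0, 2.0, 4.0, 5.0],
        [3.0, 3.0, 3.0, 3.0]]
clones, fractions = deconvolve(bulk, k=2)
print("residual:", squared_error(bulk, clones, fractions))
```

The multiplicative updates keep both factors nonnegative, which is the simplest way to respect the constraint that expression levels and mixture fractions cannot be negative. Rescaling the columns of F after each update is only a heuristic way to impose the sum-to-one constraint.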


Note: * indicates equal contribution, indicates co-corresponding author.

Assessing the Contribution of Tumor Mutational Phenotypes to Cancer Progression Risk
PLOS Computational Biology 17(3):e1008777. 2021. Impact Factor=4.4
Neural Network Deconvolution Method for Resolving Pathway-Level Progression of Tumor Clonal Expression Programs with Application to Breast Cancer Brain Metastases
Frontiers in Physiology 11:1055. 2020. Impact Factor=4.1
Predicting Drug Sensitivity of Cancer Cell Lines via Collaborative Filtering with Contextual Attention
Proceedings of the Machine Learning for Healthcare Conference (MLHC). 2020.
Proceedings of Machine Learning Research (PMLR). 126:660-684. 2020.
Robust and Accurate Deconvolution of Tumor Populations Uncovers Evolutionary Mechanisms of Breast Cancer Metastasis
Proceedings of the Conference on Intelligent Systems for Molecular Biology (ISMB). 2020. Oral
Bioinformatics 36:i407-i416. 2020. Impact Factor=5.6
Tumor Heterogeneity Assessed by Sequencing and Fluorescence in situ Hybridization (FISH) Data
bioRxiv 2020.02.29.970392. 2020.
From Genome to Phenome: Predicting Multiple Cancer Phenotypes based on Somatic Genomic Alterations via the Genomic Impact Transformer
Proceedings of the Pacific Symposium on Biocomputing (PSB) 25:79-90. 2020. Oral
Improving Personalized Prediction of Cancer Prognoses with Clonal Evolution Models
bioRxiv 761510. 2019.
Phylogenies Derived from Matched Transcriptome Reveal the Evolution of Cell Populations and Temporal Order of Perturbed Pathways in Breast Cancer Brain Metastases
Proceedings of the International Symposium on Mathematical and Computational Oncology (ISMCO) 3-28. 2019. Oral
Effective Feature Representation for Clinical Text Concept Extraction
Proceedings of the Clinical Natural Language Processing Workshop (NAACL-ClinicalNLP) 1-14. 2019. Oral
Automatic Human-like Mining and Constructing Reliable Genetic Association Database with Deep Reinforcement Learning
Proceedings of the Pacific Symposium on Biocomputing (PSB) 24:112-123. 2019.


Teaching Assistant