Computational Biology Thesis Proposal

  • Remote Access - Zoom
  • Virtual Presentation
  • Ph.D. Student
  • Joint CMU-Pitt Ph.D. Program in Computational Biology
  • Computational Biology Department, Carnegie Mellon University

Algorithms for Transcriptome Analysis

Studying the transcriptome is crucial to understanding functional elements of the genome and elucidating biological pathways associated with disease. High-throughput sequencing technologies such as RNA-seq have become powerful tools for transcriptome analysis. Because read lengths are limited, identifying full-length transcripts from short reads remains challenging. With the emergence of third-generation sequencing technologies, single-molecule long reads have been used to improve mRNA isoform identification. However, not all single-molecule long reads represent full transcripts, owing to incomplete cDNA synthesis and limits on sequencing length. This creates a need for long-read transcript assembly. Although the number of RNA-seq samples in large sequence databases is growing enormously, most RNA-seq analysis tools are evaluated on only a limited number of samples. This creates a need to select a representative subset of the RNA-seq samples in large databases that effectively summarizes the original collection. As transcriptomic strategies gain momentum in biomarker discovery and disease diagnosis and prognosis, gene expression data have been used in discriminative models to distinguish disease subtypes and predict survival. The high dimensionality of gene expression data from RNA-seq has led to various feature selection methods. However, the gene markers identified by gene-based feature selection are often unstable, which suggests a need to include higher-level features such as pathways. We will develop algorithmic methods for (1) transcript assembly from single-molecule RNA-seq long reads, with the aim of discovering more novel isoforms, (2) representative set selection of RNA-seq samples from large databases, so that RNA-seq analysis tools can be effectively evaluated on a representative subset of samples, and (3) hierarchical feature selection over gene expression and pathways, with the aim of identifying more biologically meaningful signature genes.

Thesis Committee:
Carl Kingsford (CMU, Chair)
Christopher Langmead (CMU)
James Faeder (Pitt)
Rob Patro (University of Maryland)

Zoom Participation. See announcement.

For More Information, Please Contact: