Computational Genomics MSCBIO2070/02-710/10-810, (Spring 2015)

Course Project

Updated: 2-24-2015

Regarding printing the posters:

SCS Computing Facilities has instituted a new procedure for printing posters, intended to make poster printing faster and easier for the SCS community. You no longer need to call Operations in order to print a poster; you can now submit posters via email. Simply follow the printing procedures documented on the SCS Help pages, and Operations will print the poster and notify you when it is ready for pickup. Please contact SCS Operations at x8-2608 with any questions or concerns. Also, the poster boards we use are 32" x 40". Non-SCS students will need to contact their departments about resources for printing posters.


Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set.  Projects can be done by you as an individual, or in teams of two students.   Your project will be worth 25% of your final class grade, and will have two final deliverables:

  • a writeup (8 pages maximum), due TBD, worth 80% of the project grade, and
  • a poster presentation of your work, on April 29, worth 20% of the project grade.

In addition, you must turn in a project proposal (1-2 pages) by March 18.

Project Proposal:

You must turn in a brief project proposal (1-2 pages maximum) by March 18.

You are encouraged to come up with a topic directly related to your own current research project, but the proposed work must be new and should not be copied from your previous published or unpublished work.

You may use the list of available datasets provided below and pick a “less adventurous” project from the following list of potential project ideas.  These data sets have been successfully used in prior publications, and you can compare your results with those reported in the literature. Of course, you can also choose to work on a new problem beyond our list.

Project proposal format:  Proposals should be one page maximum.  Include the following information:

  • Project title
  • Team members (including Andrew IDs)
  • Project idea.  This should be approximately two paragraphs.
  • Software you will need to write.
  • Papers to read.  Include 1-3 relevant papers.  You will probably want to read at least one of them before submitting your proposal.


Project suggestions:

Ideally, you will want to pick a problem in a domain of your interest, e.g., DNA sequence analysis, genetic polymorphisms, regulatory networks, etc., and formulate your problem using a statistical machine learning formalism. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis.

You can also find some project ideas below.

Project A: Haplotype blocking and genetic demographic inference

Genetic polymorphisms such as SNPs and microsatellites carry important information about human evolution and disease propensity. One of the interesting problems in this area is to infer the haplotypes of a long sequence of ambiguous genotypes based on haplotypes of small overlapping regions. In this project we want to build a haplotype assembler using a partition-ligation scheme and/or a tiling scheme to stitch together short haplotypes inferred by an off-the-shelf haplotype inference algorithm; and then, after determining long haplotypes of a long stretch of markers, find the best block structure using dynamic programming and information-theoretic scoring. The resulting blocks will provide essential markers for mapping disease genes and for inferring the evolutionary history of given populations.
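
As a rough illustration of the block-finding step, the sketch below partitions markers into contiguous blocks by dynamic programming. The scoring function is an assumption made for illustration only: a simplified two-part description length (bits to list each block's distinct haplotype patterns, plus bits to encode each chromosome's pattern given that list). It is not the criterion used in any particular paper, and the names block_cost, best_blocks, and bits_per_allele are hypothetical.

```python
import math
from collections import Counter


def block_cost(haplotypes, start, end, bits_per_allele=1.0):
    """Toy description length (in bits) of markers [start, end) as one block.

    Assumed two-part score: a dictionary of the distinct patterns seen in
    the block, plus n * entropy bits to say which pattern each chromosome
    carries.
    """
    n = len(haplotypes)
    patterns = Counter(h[start:end] for h in haplotypes)
    entropy = -sum(c / n * math.log2(c / n) for c in patterns.values())
    dictionary_bits = len(patterns) * (end - start) * bits_per_allele
    return n * entropy + dictionary_bits


def best_blocks(haplotypes, bits_per_allele=1.0):
    """Return (blocks, total_cost) for the cheapest contiguous partition."""
    m = len(haplotypes[0])
    best = [0.0] + [math.inf] * m   # best[e]: cheapest cost of markers [0, e)
    back = [0] * (m + 1)            # back[e]: start of the last block ending at e
    for end in range(1, m + 1):
        for start in range(end):
            cost = best[start] + block_cost(haplotypes, start, end, bits_per_allele)
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Walk the back-pointers to recover the block boundaries.
    blocks = []
    end = m
    while end > 0:
        blocks.append((back[end], end))
        end = back[end]
    return blocks[::-1], best[m]


# Toy data: markers 0-1 are perfectly linked, markers 2-3 are perfectly
# linked, and the two pairs vary independently of each other.
toy_haplotypes = ["0000", "0011", "1100", "1111"]
blocks, total = best_blocks(toy_haplotypes)
print(blocks, total)  # [(0, 2), (2, 4)] 16.0
```

The double loop considers every possible last block for each prefix, so the search is quadratic in the number of markers; a real project would likely cap the maximum block length and use a better-motivated score.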


Niu et al. Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms, Am J Hum Genet. 2006 Jan;78(1):174

Anderson EC, Novembre J: Finding Haplotype Block Boundaries by Using the Minimum-Description-Length Principle. American Journal of Human Genetics 2003, 73:336-354.

Project B: Discovering network motifs and recurring subgraphs from sequences of biological networks  

Network motifs refer to recurring subgraphs and connectivity patterns in a single or multiple networks. They usually represent certain pathway components and bio-regulatory mechanisms, and their occurrence profiles are often unique to different networks and imply intrinsic functionalities of the biological networks. Early research in this area focused on searching for small motifs in a single network. In this project we want to develop algorithms for searching large and possibly overlapping subgraphs recurring over multiple graphs. We will explore algorithms for constructing multiple networks, and graph-theoretic approaches to mine such networks for motifs.


Hu H, Yan X, Huang Y, Han J, Zhou XJ (2005) Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics (ISMB 2005), Vol. 21 Suppl. 1 2005, pages 213-221.  Supplementary Material/Software

Zhou XJ, Kao MJ, Huang H, Wong A, Nunez-Iglesias J, Primig M, Aparicio OM, Finch CE, Morgan TE, Wong WH (2005) Functional annotation and network reconstruction through cross-platform integration of microarray data.  Nature Biotechnology 2005 Feb;23(2):238-43.          

Project C: Protein function prediction from interaction network using graph theoretic and statistical latent-space modeling approaches   

Local and global connectivities of an element in a network are often indicative of its functions. Predictions based on connectivity go beyond traditional approaches that rely on the physical and sequence properties of a biological element: they combine such properties with the element's interaction context in biological processes, as reflected in the network, and they can often be context-specific. In this project you will explore algorithms to infer biological functions of proteins from protein-protein interaction networks and other protein attributes.

E. Airoldi, D. Blei, E.P. Xing and S. Fienberg, A Latent Mixed Membership Model for Relational Data. Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD-2005).


Project D: (please contact Ziv for more details): Dynamic Bayesian networks from time series datasets.

Time series expression data measure the expression levels of genes following a specific treatment. For example, following pathogen infection such data can provide insight into the set of genes that respond to the infection and into the immune response system. Using time series data we would like to learn a graphical model that represents the set of interactions employed as part of the response. In this project you will explore ways to use time series datasets for determining the structure and parameters of the regulatory network underlying the observed responses.

Project E: (please contact Ziv for more details): Classification using time series expression data.

It has been shown that the type of cancer, and in some cases the right treatment option, can be determined by looking at the expression profile of a patient. Many well-known classification methods have been suggested for this task, including SVMs, Naïve Bayes and statistical tests. More recently, measurements that follow patients over time are becoming available. This project will explore ways to develop classifiers that are appropriate for time series data.

Project F: (please contact Ziv for more details): Protein interaction networks

Recent experiments have identified many new protein-protein interactions. While the quality of this data is not great, it does serve as a useful source for integration with other available datasets. In this project you will explore the relationship between the interacting proteins and other types of high-throughput data (such as expression or binding). Specifically, it is interesting to see if aspects that cannot be inferred from the current interaction data (such as pathways) can be determined by using these complementary data sources.

Project G : Inferring the Drosophila developmental network based on microarray expression profile time series.

We use probabilistic graphical models (e.g., Bayesian Networks), information theoretic approaches (e.g., mutual information minimization) and graph theoretic methods (i.e., path finding) to infer such networks from microarray expression profile time series.
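
To make the information-theoretic ingredient concrete, here is a minimal sketch of an empirical mutual information estimate between two expression time series. It assumes each profile is first binarized around its own median, which is only one of many possible discretizations; the helper names discretize and mutual_information and the toy profiles are hypothetical.

```python
import math
import statistics
from collections import Counter


def discretize(profile):
    """Binarize a profile: 1 if a value is above the profile median, else 0."""
    med = statistics.median(profile)
    return [1 if v > med else 0 for v in profile]


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi


# Toy profiles over four time points (made-up numbers, not real data).
gene_a = [0.1, 0.2, 0.9, 1.1]
gene_b = [1.0, 1.2, 3.0, 2.5]   # rises together with gene_a
gene_c = [0.1, 0.9, 0.2, 1.1]   # unrelated to gene_a after binarization

print(mutual_information(discretize(gene_a), discretize(gene_b)))  # 1.0
print(mutual_information(discretize(gene_a), discretize(gene_c)))  # 0.0
```

With so few time points the plug-in estimate is heavily biased, so pairwise scores like these would need permutation tests or a similar check before being used to propose network edges.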

Project H: Assembly / inference in next gen sequencing data [contact Marcel Schultz]


Project I : Analysis and transcriptome assembly using next-generation sequencing. [contact Bino]

In this project you will test various next-generation transcriptome assembly software packages by using them to assemble transcripts and comparing their performance. You will be given the dataset.


Project J : Test the hypothesis that next-generation sequencing artifacts are sequence dependent and can be modeled using sequence-specific models (e.g., regression) [contact Bino]

Project K : Develop a hidden Markov model to predict the DNA sequence of a given (human) protein, given the protein sequence. Compare your results against the actual cDNA sequences [contact Bino]

Project L : Use human codon frequencies and a model (e.g., HMM) to predict whether a given sequence is coding or not. Compare the performance of your method to other methods for different size ranges of exons. [contact Bino]
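
As a possible starting point that is much simpler than an HMM, the sketch below scores a sequence by the log-odds of its codons under estimated codon frequencies versus a uniform background. Everything here is an assumption for illustration: the frequencies are estimated from a made-up training sequence rather than real human codon usage, only a single reading frame is scored, and the names codon_frequencies and coding_log_odds are hypothetical.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # all 64 codons


def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate in-frame codon frequencies from known coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_log_odds(seq, freqs):
    """Sum of log2(coding frequency / uniform background) over frame-0 codons."""
    background = 1.0 / len(CODONS)
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in freqs:  # skip codons with ambiguous bases such as N
            score += math.log2(freqs[codon] / background)
    return score


# Made-up training set: 20 copies of one codon, purely to exercise the code.
freqs = codon_frequencies(["GCC" * 20])
print(coding_log_odds("GCCGCC", freqs))       # 8.0
print(coding_log_odds("TTTTTT", freqs) < 0)   # True
```

A positive score suggests coding and a negative score non-coding under this toy model; a fuller version would estimate the background from real non-coding sequence and score all six reading frames.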

Project M : Use human microRNA sequence/structure compositions and a model (e.g., HMM) to predict whether a given sequence is a microRNA or not [contact Bino]