Advanced Algorithms and Models for Computational Biology, 10-810, Spring 2006
Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set. Projects can be done individually or in teams of two to three students. Each project will also be assigned a course instructor as a project consultant/mentor. They will consult with you on your ideas, but the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 30% of your final class grade, and will have two final deliverables:
1. a writeup in the form of an IEEE paper (8 pages maximum in IEEE format, including references), due May 10, worth 60% of the project grade, and
2. an oral presentation of your work at a special class session at the end of the semester, on May 10, worth 20% of the project grade.
In addition, you must turn in a midway progress report (5 pages maximum in IEEE format, including references) describing the results of your first experiments by Mar 29, worth 20% of the project grade. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.
You must turn in a brief project proposal (1-page maximum) by Feb 20th.
You are encouraged to come up with a topic directly related to your own current research project, or a research topic related to graphical models that interests you, provided it has a non-trivial technical component (either theoretical or application-oriented). The proposed work must be new and must not be copied from your previous published or unpublished work. For example, research on graphical models that you did this summer does not count as a class project.
You may use the list of available datasets provided below and pick a “less adventurous” project from the following list of potential project ideas. These data sets have been successfully used for machine learning in the past, and you can compare your results with those reported in the literature. Of course, you can also choose to work on a new problem beyond our list using the provided datasets.
Project proposal format: Proposals should be one page maximum. Include the following information:
· Project title
· Project idea. This should be approximately two paragraphs.
· Software you will need to write.
· Papers to read. Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal.
· Teammate(s): will you have teammate(s)? If so, whom? Maximum team size is three students.
· Mar 29 milestone: What will you complete by Mar 29? Experimental results of some kind are expected here.
Ideally, you will want to pick a problem in a domain of your interest, e.g., DNA sequence analysis, genetic polymorphisms, regulatory networks, etc., and formulate your problem using a statistical machine learning formalism. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis. You can also find some project ideas below.
Array CGH data are sequences of fluorescence measurements reflecting DNA copy number along the chromosome. The measurements are continuous and can be heavily distorted by noise in a complex, non-uniform fashion. Jane Fridlyand proposed an HMM for estimating CGH copy number in “Hidden Markov Models Approach to the Analysis of Array CGH Data”, but this model is quite restrictive.
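For orientation, a plain HMM segmentation of the kind referred to above can be sketched as follows. This is a minimal illustration, not the published method: it assumes Gaussian emissions with a shared standard deviation, a single "stay" probability for all states, and hand-picked state means. The function names and all parameter values are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def viterbi_segment(ratios, state_means, std=0.2, stay_prob=0.95):
    """Most likely hidden state index for each probe (Viterbi decoding).

    ratios      : non-empty list of log2 ratios ordered along the chromosome
    state_means : assumed mean log2 ratio of each state (at least two states)
    std         : assumed emission standard deviation, shared by all states
    stay_prob   : assumed probability of staying in the current state
    """
    n_states = len(state_means)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n_states - 1))

    # Initialise with a uniform prior over states.
    score = [-math.log(n_states) + gaussian_logpdf(ratios[0], m, std)
             for m in state_means]
    backpointers = []

    for x in ratios[1:]:
        new_score, pointers = [], []
        for j, mean_j in enumerate(state_means):
            candidates = [score[i] + (log_stay if i == j else log_move)
                          for i in range(n_states)]
            best_i = max(range(n_states), key=candidates.__getitem__)
            pointers.append(best_i)
            new_score.append(candidates[best_i]
                             + gaussian_logpdf(x, mean_j, std))
        score = new_score
        backpointers.append(pointers)

    # Trace the best path back from the best final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy example: states 0, 1, 2 stand for loss, neutral, and gain.
toy_ratios = [0.0, 0.05, -0.1, 0.0, 0.6, 0.55, 0.7, 0.5, 0.0, 0.1]
toy_path = viterbi_segment(toy_ratios, state_means=[-0.6, 0.0, 0.58])
```

In a real project the state means, variances, and transition probabilities would be estimated from the data (for example with EM) rather than fixed by hand.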
A switching Hidden Process Model (HPM) assumes that the hybridization process on each chromosomal region with uniform copy number would ideally follow a standard copy-number-specific linear dynamic model (LDM) [West and Harrison, 1999]. To accommodate outliers and alternative hybridization and signaling dynamics, a mixture of LDMs can be used to model a hidden process that generates fluorescence signals from a chromosomal region with a specific copy number. For a chromosome with stochastic regional amplifications and deletions, a switching HPM assumes that another discrete hidden process is responsible for selecting the corresponding copy-number-specific HPM at each region to generate the signals. The switching HPM is essentially a special dynamic Bayesian network that allows one to infer the temporally/spatially specific hidden dynamics underlying an observation stream and the ensuing segmentation of the stream. It is a generalization of Ghahramani's switching state-space model (SSSM), which can be understood as modeling each hidden process using a plain Kalman filter (KF). In this project you are asked to formulate this model and implement a variational algorithm for inference in such a model.
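As a building block for the model described above, the sketch below runs a scalar Kalman filter for a single copy-number-specific LDM and returns the log-likelihood of a segment. It is not the switching HPM or its variational inference algorithm. The model form (an AR(1) state reverting to a copy-number-specific level, observed with Gaussian noise), the function name, and all parameter values are assumptions made for illustration.

```python
import math


def kalman_filter(ys, level, a=0.9, q=0.01, r=0.04):
    """Scalar Kalman filter for one assumed copy-number-specific LDM.

    Assumed model:
        x_t = level + a * (x_{t-1} - level) + w_t,   w_t ~ N(0, q)
        y_t = x_t + v_t,                             v_t ~ N(0, r)
    with |a| < 1 and x_1 drawn from the stationary distribution.

    Returns (filtered state means, log-likelihood of ys).
    """
    # Predicted mean and variance for the first observation.
    mean, var = level, q / (1.0 - a * a)
    filtered, loglik = [], 0.0
    for t, y in enumerate(ys):
        if t > 0:
            # Prediction step.
            mean = level + a * (mean - level)
            var = a * a * var + q
        # Innovation and its variance.
        s = var + r
        innovation = y - mean
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + innovation ** 2 / s)
        # Update step.
        gain = var / s
        mean = mean + gain * innovation
        var = (1.0 - gain) * var
        filtered.append(mean)
    return filtered, loglik


# Compare how well two hypothetical copy-number levels explain one segment.
segment = [0.6, 0.55, 0.7, 0.5, 0.62]
_, loglik_gain = kalman_filter(segment, level=0.58)
_, loglik_neutral = kalman_filter(segment, level=0.0)
```

A switching model would combine such per-LDM likelihoods with a discrete hidden process that chooses which LDM is active in each region. Because exact inference in that combined model is intractable, an approximation such as the variational algorithm requested here is needed.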
In the dataset (log2.ratio.ex), there are two columns of numbers, corresponding to two sources. Please read the original paper to get a more detailed description of the data. You can choose the number of states you feel is appropriate after inspecting the plots of the points.
Time series expression data measure the expression levels of genes following a specific treatment. For example, following pathogen infection, such data can provide insight into the set of genes that respond to the infection and into the immune response system. Using time series data, we would like to learn a graphical model that represents the set of interactions employed as part of the response. In this project you will explore ways to use time series datasets for determining the structure and parameters of the regulatory network underlying the observed responses.
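One very simple baseline for getting a first look at such data is to score each candidate directed edge by the correlation between a regulator at time t and a target at time t + 1. The sketch below does only that; it is a heuristic, not a structure-learning algorithm for a graphical model, and it assumes evenly spaced time points. The function names and the toy gene names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def lagged_edge_scores(profiles, lag=1):
    """Score every directed pair (regulator, target) by lagged correlation.

    profiles : dict mapping gene name -> list of expression values over time
    lag      : number of time steps between regulator and target (>= 1)

    Returns (regulator, target, score) tuples sorted by |score|, largest first.
    """
    edges = []
    for regulator, reg_series in profiles.items():
        for target, tgt_series in profiles.items():
            if regulator == target:
                continue
            # Align regulator at time t with target at time t + lag.
            score = pearson(reg_series[:-lag], tgt_series[lag:])
            edges.append((regulator, target, score))
    edges.sort(key=lambda edge: abs(edge[2]), reverse=True)
    return edges


# Toy example: geneB copies geneA with a one-step delay.
toy_profiles = {
    "geneA": [0, 1, 0, 2, 1, 3, 0, 2],
    "geneB": [1, 0, 1, 0, 2, 1, 3, 0],
}
toy_edges = lagged_edge_scores(toy_profiles)
```

Pairwise scores like these ignore indirect effects and confounding, which is one reason to move on to a proper graphical model such as a dynamic Bayesian network.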
It has been shown that the type of cancer, and in some cases the right treatment option, can be determined by looking at the expression profile of a patient. Many well-known classification algorithms have been suggested for this task, including SVMs, Naïve Bayes, and statistical tests. More recently, measurements that follow patients over time are becoming available. This project will explore ways to develop classifiers that are appropriate for time series data.
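As one possible starting point, the sketch below classifies a patient's time series by its nearest training neighbor under a dynamic time warping (DTW) distance, which tolerates responses that unfold at different rates. It assumes a single expression value per time point (one gene or one summary score per patient); the function names and class labels are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def nearest_neighbor_label(query, training):
    """Label of the training series closest to `query` under DTW.

    training : list of (series, label) pairs
    """
    _, label = min(training, key=lambda item: dtw_distance(query, item[0]))
    return label


# Toy example: the query rises one step earlier than the training responder.
toy_training = [
    ([0, 0, 1, 2, 3], "responder"),
    ([0, 0, 0, 0, 0], "non-responder"),
]
toy_label = nearest_neighbor_label([0, 1, 2, 3, 3], toy_training)
```

Real patient data would involve many genes per time point, unevenly spaced samples, and few patients, so the distance and the classifier would both need to be adapted.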
Recent experiments have identified many new protein-protein interactions. While the quality of these data is not great, they do serve as a useful source for integration with other available datasets. In this project you will explore the relationship between the interacting proteins and other types of high-throughput data (such as expression or binding). Specifically, it is interesting to see whether aspects that cannot be inferred from the current interaction data (such as pathways) can be determined by using these complementary data sources.
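A simple first check of the relationship between interaction and expression data is to compare the average co-expression of interacting pairs with that of all other pairs. The sketch below does this with Pearson correlation. It assumes every protein has an expression profile of the same length; the function names and protein names are hypothetical, and the toy numbers are not real measurements.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def coexpression_by_interaction(expression, interactions):
    """Mean expression correlation for interacting vs. other protein pairs.

    expression   : dict mapping protein name -> expression profile (list)
    interactions : iterable of (protein, protein) pairs, treated as undirected

    Returns (mean over interacting pairs, mean over all other pairs);
    a mean is NaN when its group is empty.
    """
    interacting = {frozenset(pair) for pair in interactions}
    inside, outside = [], []
    for a, b in combinations(sorted(expression), 2):
        r = pearson(expression[a], expression[b])
        if frozenset((a, b)) in interacting:
            inside.append(r)
        else:
            outside.append(r)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(inside), mean(outside)


# Toy example with made-up profiles.
toy_expression = {
    "P1": [1, 2, 3, 4],
    "P2": [2, 4, 6, 8],
    "P3": [4, 3, 2, 1],
    "P4": [1, 3, 2, 4],
}
toy_inside, toy_outside = coexpression_by_interaction(
    toy_expression, interactions=[("P1", "P2")])
```

A large gap between the two means would suggest that expression data carry information about interactions, which is a prerequisite for using them to recover higher-level structure such as pathways.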