Course Project

Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set.  Projects should be done in teams of three students. Under specific circumstances, we may allow teams of fewer than three members.  Each project will also be assigned a 701 instructor as a project consultant/mentor. Instructors and TAs will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 20% of your final class grade, and will have 4 deliverables:

  1. Proposal: 1 page (10%), due 10/1 at the beginning of class
  2. Midway Report: 3-4 pages (20%), due 11/05 at the beginning of class. Submit with graded project proposal attached.
  3. Final Report: 8 pages (40%), due Wednesday December 5th, 4:30pm 11:59pm (email PDF to instructor mailing list 10701-instructors at cs dot cmu dot edu)
  4. Poster Presentation: (30%), Monday December 3rd, 10am to 2pm in GHC 6115 and 6121.

Note that all write-ups should be in the form of a NIPS paper. The page limits are strict! Papers over the limit will not be considered.

Project Proposal

You must turn in a brief project proposal (1-page maximum).  Read the list of available data sets and potential project ideas below.  We highly recommend that you use one of these data sets, because we know that they have been successfully used for machine learning in the past. If you have another data set you want to work on, you can discuss it with us. However, we will not allow projects on data that has not been collected, so you have to work on existing data sets. It is also possible to propose a project on some theoretical aspects of machine learning. If you want to do this, please discuss it with us. Note that even though you can use data sets you have used before, you cannot use as your class project something that you started before the class.

Project proposal format:  Proposals should be one page maximum.  Include the following information:

Midway Report

This should be a short report of 3-4 pages, and it serves as a checkpoint. It should consist of the same sections as your final report (introduction, related work, method, experiment, conclusion), with a few sections 'under construction'. Specifically, the introduction and related work sections should be in their final form; the section on the proposed method should be almost finished; the sections on the experiments and conclusions will have whatever results you have obtained, as well as 'placeholders' for the results you plan/hope to obtain.

Grading scheme for the project report:

Final Report

Your final report is expected to be an 8-page report. You should submit both an electronic and a hardcopy version of your final report. It should roughly have the following format:

Poster Presentation

All project teams will present a poster. At least one project member should be present during the poster hours. The session will be open to everybody.

Poster .ppt template (This is only a rough outline. Change the layout and section titles as appropriate for your project.)

Here are some details on the poster format.

If you are a student outside SCS, you will need to check with your department to see if there are printing facilities for big posters (we're not sure what is offered outside SCS), or print a set of regular sized pages.

Project Suggestions:

 
Ideally, you will want to pick a problem in a domain of your interest, e.g., natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using machine learning techniques. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis. You can also find some project ideas below.


Educational Data Mining on Predicting Student Performance

Data:

Register at the KDD Cup 2010: Educational Data Mining Challenge website, and click on "Get Data".

There are two types of data sets available, development data sets and challenge data sets. Development data sets differ from challenge sets in that the actual student performance values for the prediction column, "Correct First Attempt", are provided for all steps.

The data takes the form of records of interactions between students and computer-aided-tutoring systems. The students solve problems in the tutor and each interaction between the student and computer is logged as a transaction. Four key terms form the building blocks of the data. These are problem, step, knowledge component, and opportunity.

Project idea: How generally or narrowly do students learn? How quickly or slowly? Will the rate of improvement vary between students? What does it mean for one problem to be similar to another? It might depend on whether the knowledge required for one problem is the same as the knowledge required for another. But is it possible to infer the knowledge requirements of problems directly from student performance data, without human analysis of the tasks?

We would like to ask you to predict whether a student is likely to be correct or not on each step, based on previous log data. The problem can be formalized as a classification problem. You could also build a model of students' learning behavior and predict the probability of making an error. The challenge here is to select the classifier/model that best represents the data. Moreover, not all of the given features may be informative, and models that are overly complicated may overfit. How to find the relevant features and make good use of them are interesting topics.
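As a point of reference for the classification formulation above, here is a minimal baseline sketch. It assumes each log record has been reduced to a hypothetical (student, step, correct) tuple, where correct is the "Correct First Attempt" value; the function names and the smoothing parameters are illustrative choices, not part of the KDD Cup data format.

```python
from collections import defaultdict

def fit_rates(records, prior=0.5, strength=2.0):
    """records: iterable of (student, step, correct), with correct in {0, 1}.

    Returns two dicts of smoothed correct-first-attempt rates,
    one keyed by student and one keyed by step.
    """
    by_student = defaultdict(lambda: [0, 0])  # key -> [num correct, num attempts]
    by_step = defaultdict(lambda: [0, 0])
    for student, step, correct in records:
        for table, key in ((by_student, student), (by_step, step)):
            table[key][0] += correct
            table[key][1] += 1

    def smooth(table):
        # Shrink each rate toward the prior so rarely seen keys are not extreme.
        return {k: (c + prior * strength) / (n + strength)
                for k, (c, n) in table.items()}

    return smooth(by_student), smooth(by_step)

def predict(student_rate, step_rate, student, step, prior=0.5):
    """Estimated probability of a correct first attempt (simple average of the two rates)."""
    return (student_rate.get(student, prior) + step_rate.get(step, prior)) / 2.0
```

A baseline like this ignores knowledge components and opportunity counts entirely, so it mainly serves as a number that a more careful model should beat.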

References:

Feature Engineering and Classifier Ensemble for KDD Cup 2010, Yu et al., 2010
Using HMMs and bagged decision trees to leverage rich features of user and skill from an intelligent tutoring system dataset, Pardos and Heffernan, 2010
Collaborative Filtering Applied to Educational Data Mining, Toscher and Jahrer, 2010



Inferring Networks of Diffusion and Influence

Data:

Download the data at http://snap.stanford.edu/netinf/#data.

The data contains information about the connectivity of the who-copies-from-whom or who-repeats-after-whom network of news media sites and blogs, as inferred by NETINF, an algorithm for inferring such networks.

The dataset used by NETINF is called MemeTracker. It can be downloaded from here.

MemeTracker contains two datasets. The first is the phrase cluster data. For each phrase cluster, the data contains all the phrases in the cluster and a list of URLs where the phrases appeared. The second is the raw MemeTracker phrase data, which contains phrases and hyperlinks extracted from each article/blog post.

Project idea: Information diffusion and virus propagation are fundamental processes taking place in networks. In many applications, the underlying network over which the diffusions and propagations spread is hard to find. Finding such an underlying network using MemeTracker data would be an interesting and challenging project. Gomez-Rodriguez et al. (2010) have recently published a paper on this topic, and made their code publicly accessible. It would be interesting to replicate their result and further improve the proposed algorithm by making use of more informative features (e.g., textual content of postings, etc.).
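To make the inference task concrete, here is a naive baseline sketch. It is not the NETINF algorithm; it simply counts how often one site mentions a phrase shortly before another. The cascade representation (one dict per phrase cluster, mapping a site to the time of its first mention) and the max_lag parameter are assumptions made for illustration.

```python
from collections import Counter

def score_edges(cascades, max_lag=1.0):
    """Score candidate edges (u, v) by how often u precedes v within max_lag.

    cascades: list of dicts, each mapping site -> time of first mention
    of one phrase cluster (time units are arbitrary but consistent).
    """
    scores = Counter()
    for cascade in cascades:
        hits = sorted(cascade.items(), key=lambda kv: kv[1])
        for i, (u, tu) in enumerate(hits):
            for v, tv in hits[i + 1:]:
                if 0 < tv - tu <= max_lag:
                    scores[(u, v)] += 1
    return scores
```

Calling score_edges(cascades).most_common(k) then gives the k most frequently observed "u before v" pairs, which can be compared against the edges recovered by the published code.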

References:

Inferring Networks of Diffusion and Influence, Gomez-Rodriguez et al., 2010



Apply NetInf to Other Domains

Data:

Download the data at http://snap.stanford.edu/netinf/#data.

The data contains information about the connectivity of the who-copies-from-whom or who-repeats-after-whom network of news media sites and blogs, as inferred by NETINF, an algorithm for inferring such networks.

The dataset used by NETINF is called MemeTracker. It can be downloaded from here.

MemeTracker contains two datasets. The first is the phrase cluster data. For each phrase cluster, the data contains all the phrases in the cluster and a list of URLs where the phrases appeared. The second is the raw MemeTracker phrase data, which contains phrases and hyperlinks extracted from each article/blog post.

Project idea: In their 2010 paper, Gomez-Rodriguez et al. applied NetInf to MemeTracker and found that clusters of sites related to similar topics emerge (politics, gossip, technology, etc.), and that a few sites with social capital interconnect these clusters, allowing a potential diffusion of information among sites in different clusters. It would be interesting to see how the proposed algorithm could be used in other networks, and what knowledge we could get from those networks. For example, can we discover users that share similar interests in a social network? Network datasets from different domains can be found here. Different networks may take different forms, and thus the algorithm may not be directly applicable. How can the existing algorithm be modified to support other networks?

References:

Inferring Networks of Diffusion and Influence, Gomez-Rodriguez et al., 2010



Dynamically Inferring Networks of Diffusion and Influence

Data:

Download the data at http://snap.stanford.edu/netinf/#data.

The data contains information about the connectivity of the who-copies-from-whom or who-repeats-after-whom network of news media sites and blogs, as inferred by NETINF, an algorithm for inferring such networks.

The dataset used by NETINF is called MemeTracker. It can be downloaded from here.

MemeTracker contains two datasets. The first is the phrase cluster data. For each phrase cluster, the data contains all the phrases in the cluster and a list of URLs where the phrases appeared. The second is the raw MemeTracker phrase data, which contains phrases and hyperlinks extracted from each article/blog post.

Project idea: In Gomez-Rodriguez et al.'s (2010) paper, the proposed algorithm currently considers static propagation networks. But real influence networks are dynamic. Is it possible to detect such networks?

References:

Inferring Networks of Diffusion and Influence, Gomez-Rodriguez et al., 2010



Relational Information Retrieval

Data:

2010, yeast2 updated yeast data with extra information about Mesh heading, chemicals and affiliations etc. (321K entities and 6.1M links)

2010, fly a biological literature graph with 770K entities and 3.5M links

2010, yeast a biological literature graph with 164K entities and 2.8M links

All these datasets are relational graph based datasets. Nodes in the graph are of different types (e.g. author, paper, gene, protein, title word, journal, year). Edges between nodes describe relations between two nodes (e.g. AuthorOf, Cites, Mentions).

Project idea: Scientific literature with rich metadata can be represented as a labeled directed graph. Given this graph, can we suggest related work to authors? Can we retrieve relevant papers given some keywords? All of these tasks can be formulated as relational retrieval tasks in the graph. How can we efficiently retrieve items in the graph given some specific nodes as queries? Random walk with restart (RWR) has been used to model these tasks. Potential projects include implementing different versions of RWR from the related work, and further improving them to achieve better retrieval quality.
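For readers new to RWR, the following is a minimal power-iteration sketch on an untyped graph. It ignores edge labels and path constraints, so it is only the basic random walk with restart and not the path-constrained variant of the reference below. The adjacency-dict representation, the restart probability of 0.15, and the handling of nodes without outgoing edges are assumptions.

```python
def random_walk_with_restart(graph, query, restart=0.15, iters=50):
    """Approximate RWR scores by power iteration.

    graph: dict mapping every node to a list of its out-neighbors
           (each neighbor must also be a key of graph).
    query: non-empty set of query nodes the walk restarts from.
    """
    nodes = list(graph)
    start = {n: (1.0 / len(query) if n in query else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(iters):
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                # Assumption: a node with no out-edges returns its mass to the query nodes.
                for q in query:
                    nxt[q] += (1 - restart) * score[n] / len(query)
                continue
            share = (1 - restart) * score[n] / len(out)
            for m in out:
                nxt[m] += share
        score = nxt
    return score
```

Sorting the nodes of a target type (for example, papers) by score gives a ranked retrieval list for the query.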

References:

Ni Lao, William W. Cohen, Relational retrieval using a combination of path-constrained random walks, Machine Learning, 2010, Volume 81, Number 1, Pages 53-67 (ECML, 2010)



Noun Phrases/Relation classification

Data:

There are two types of data available:

SVO (subject-verb-object) triples data constructed by parsing 50m Web documents from ClueWeb09 (890m sentences, 16B tokens) using the MALT dependency parser, and then extracting SVO triples from these parsed sentences. The SVO triples are aggregated and their frequencies counted using Hadoop. This yields a dataset with 114m (Subject-Verb-Object) triples and their frequency counts.

Two versions of the SVO data are here (verb non-stemmed and stemmed):
http://rtw.ml.cmu.edu/ppt/v+prep_svo-triples_stemmed.txt.gz (220,015,169 triples)
http://rtw.ml.cmu.edu/ppt/v+prep_svo-triples.txt.gz (220,462,606 triples)

Each line has four fields (tab separated): Subject Verb[+Preposition] Object Count

All-pairs dataset constructed by extracting noun phrases and the contexts they appear in from ClueWeb09, and counting their frequencies. The data can be downloaded from:
http://rtw.ml.cmu.edu/wk/all-pairs-OC-2011-12-31-big2-gz/

There are smaller versions that are more heavily sampled that might be faster and easier to work with, for instance:
http://rtw.ml.cmu.edu/wk/all-pairs-OC-2010-12-01-small200-gz/
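For the SVO files described above, each line can be read with a few lines of code. This is a sketch that assumes the four tab-separated fields appear exactly as stated (Subject, Verb[+Preposition], Object, Count) and that the files are UTF-8 encoded; the encoding and the function names are assumptions.

```python
import gzip

def parse_svo_line(line):
    """Split one tab-separated line into (subject, verb, object, count)."""
    subject, verb, obj, count = line.rstrip("\n").split("\t")
    return subject, verb, obj, int(count)

def read_svo(path):
    """Stream triples from a gzipped SVO file without loading it into memory."""
    # Assumption: the file is UTF-8; undecodable bytes are replaced rather than raising.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield parse_svo_line(line)
```

Streaming matters here because the files hold hundreds of millions of triples; a line with a different number of fields raises a ValueError, which makes malformed input easy to spot.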

Project idea:  (from Tom Mitchell) The NELL project has produced some corpus statistics describing subject-verb-object (SVO) triples collected by dependency parsing hundreds of millions of sentences. A typical entry is subject=horses, verb=eat, object=hay, frequency=412. Separately, NELL's knowledge base contains instances of hundreds of relations, from riverFlowsThroughCity(river, city), to animalEatsFood(animal, food). (e.g., here is a list of NELL's beliefs about animalEatFood(animal, food): http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:animaleatfood).

In this project, the key idea is to train a "bag of verbs" classifier for each relation in NELL, basing the prediction on the SVO data, and using NELL's existing knowledge base as labeled data. Given a pair of noun phrases such as <cat, fish>, or <cat, house>, your classifier should predict whether the noun phrase pair satisfies the relation "animalEatFood". NELL does not have such a classifier, and if yours works well, we will incorporate it into NELL as a new component, potentially leading to a publication as well. There are several approaches one can take to this problem. For example, a baseline approach could train the relation classifiers independently for each relation, using the raw SVO data. You can refine this by thinking of ways to preprocess the data (e.g., use stemming so that the triples <horse eats hay> and <horses eat hay> are merged), and perhaps you can think of ways to couple the training of multiple relations (e.g., how would you couple the training of the relations CEOofCompany(person, company) and WorksAtCompany(person, company)?).
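One possible reading of the independent baseline described above is sketched here: each noun phrase pair is represented by the counts of verbs that connect it, and a multinomial Naive Bayes model is trained per relation. The choice of Naive Bayes, the function names, and the label format (a dict from pair to True/False, as might be derived from a knowledge base) are assumptions, not NELL's method.

```python
import math
from collections import Counter, defaultdict

def bag_of_verbs(triples):
    """triples: iterable of (subject, verb, object, count) -> {(subject, object): Counter of verbs}."""
    bags = defaultdict(Counter)
    for subj, verb, obj, count in triples:
        bags[(subj, obj)][verb] += count
    return bags

def train_nb(bags, labels):
    """labels: dict mapping (subject, object) -> True/False for one relation."""
    verb_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for pair, label in labels.items():
        class_counts[label] += 1
        verb_counts[label].update(bags.get(pair, {}))
    vocab = set(verb_counts[True]) | set(verb_counts[False])
    return verb_counts, class_counts, vocab

def predict_nb(model, bag):
    """Return True if the bag of verbs is more likely under the positive class."""
    verb_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_prob = {}
    for label in (True, False):
        denom = sum(verb_counts[label].values()) + len(vocab)  # add-one smoothing
        lp = math.log((class_counts[label] + 1) / (total + 2))
        for verb, n in bag.items():
            if verb in vocab:  # verbs never seen in training are ignored
                lp += n * math.log((verb_counts[label][verb] + 1) / denom)
        log_prob[label] = lp
    return log_prob[True] > log_prob[False]
```

Because raw frequencies can be very large, it may be worth experimenting with log-scaled or binary verb counts before comparing against stronger classifiers.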

References:
Verbocean: Mining the web for fine-grained semantic verb relations, EMNLP 2004
Identifying Relations for Open Information Extraction, EMNLP 2011
Learning 5000 relational extractors, ACL 2010
Knowledge-based weak supervision for information extraction of overlapping relations, ACL 2011
NELL Project publications
Past projects from "Machine Learning with Large Datasets (10-605) course in Spring 2012"



Image Categorization

Project idea:  Image categorization/object recognition has been one of the most important research problems in the computer vision community. Researchers have developed a wide spectrum of different local descriptors, feature coding schemes, and classification methods.

In this project, you will implement your own object recognition system. You could use any code from the web for computing image features, such as SIFT, HoG, etc.
For computing SIFT features, you could use http://www.vlfeat.org/~vedaldi/code/sift.html.
Following is a list of data sets you could use.

A list of datasets:

[1] Caltech 101/256: http://www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html
[2] The PASCAL Object Recognition Database Collection: http://pascallin.ecs.soton.ac.uk/challenges/VOC/databases.html
[3] LabelMe: http://labelme.csail.mit.edu/
[4] CMU face databases: http://vasc.ri.cmu.edu/idb/html/face/
[5] Face in the wild: http://vis-www.cs.umass.edu/lfw/
[6] ImageNet: http://www.image-net.org/index
[7] TinyImage: http://groups.csail.mit.edu/vision/TinyImages/



Human Action Recognition

Project idea: Applications such as surveillance, video retrieval and human-computer interaction require methods for recognizing human actions in various scenarios.
In this project, you will implement your own human action recognition system. You could use any code from the web for computing spatio-temporal features. One good example is the spatio-temporal interest point proposed by Piotr Dollar. Source code available at http://vision.ucsd.edu/~pdollar/research/research.html.
Following is a list of data sets you could use.

A list of datasets:

[1] KTH: http://www.nada.kth.se/cvap/actions/
[2] Weizmann: http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html
[3] Hollywood Human Actions dataset: http://www.irisa.fr/vista/Equipe/People/Laptev/download.html
[4] VIRAT Video Dataset: http://www.viratdata.org/



Single Class Object Detector

Project idea:  Given a random photo of a book cover, movie poster, wine bottle, etc., which may vary in lighting, scale, and angle, find the standard image of the poster, logo, etc., in a database that matches the query. This is a real application for an iPhone user who wants to identify what they see. For example, if I see a movie poster, a book cover, or a wine bottle, I can take a picture, hit search, and find the original image online along with other relevant information about the movie, book, or wine.
One intuition here lies in the fact that there is a limited number of books in the world. If we had a database containing all book covers in the world, the recognition problem would reduce to a duplicate detection problem, which is much simpler to solve than general-purpose object recognition.

In this project, you are encouraged to design an object detector for a single image class using duplicate detection. For example, in the book cover case, you could crawl all pages about books from Amazon.com and store the images as your database of book covers.

Following is a list of possible image classes you could consider in this fashion:
[1] Book cover
[2] Landmark (e.g., Eiffel Tower, Great Wall, White House, etc)
[3] Movie Posters (e.g., crawl images from http://www.movieposter.com)
[4] Wine/beer bottle labels
[5] Logos
[6] Art pieces (e.g., painting, sculpture)

Once you have the database, recognition/detection could be solved using near-duplicate image detection.
You could use any algorithm or source code on the Internet, e.g., http://www.mit.edu/~andoni/LSH/.
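To illustrate how hashing turns duplicate detection into a table lookup, here is a small sketch of random-hyperplane (sign of random projection) hashing. This is a different LSH family from the one implemented in the package linked above, and it assumes each image has already been converted to a fixed-length feature vector; the function names, the number of bits, and the seed are illustrative.

```python
import random

def make_hasher(dim, n_bits=16, seed=0):
    """Return a function mapping a dim-dimensional vector to an n_bits signature."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def signature(vec):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes)

    return signature

def build_index(signature, database):
    """database: dict name -> feature vector. Groups names by signature."""
    index = {}
    for name, vec in database.items():
        index.setdefault(signature(vec), []).append(name)
    return index

def query(signature, index, vec):
    """Return the names whose signature matches that of vec exactly."""
    return index.get(signature(vec), [])
```

Vectors separated by a small angle agree on most bits, so in practice one would use several hash tables or also probe signatures that differ in a few bits, and then verify candidates with an exact distance.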



Exploring the image world

Project idea:  One important aspect of machine learning and computer vision research is to collect proper data sets. For example, ImageNet (http://www.image-net.org/index) is one of the most promising data sets in the image categorization research.

Flickr has about 3.6 billion photos. Interested in crawling billions of images and building your own image collection? For this project, you are encouraged to crawl images from websites such as Flickr, Twitter, and Google Image. We will provide disk space for storage if this becomes necessary.

There has been a long debate in the computer vision community about which is more important: a better algorithm or more data. That is to say, should we focus on developing more and more sophisticated algorithms, or use simple classification methods, such as a Nearest Neighbor classifier, on billions of training images? In this project, you will gather as many images as possible, and deploy simple classification methods on the data set to see if the latter philosophy works.
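The "simple method, lots of data" side of this debate can be tried with very little code. Below is a minimal 1-nearest-neighbor sketch using squared Euclidean distance; it assumes each image has already been reduced to a fixed-length numeric feature vector (for example, a downsampled pixel array), and the function names are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbor(train, query):
    """train: list of (feature_vector, label). Returns the label of the closest example."""
    return min(train, key=lambda item: sq_dist(item[0], query))[1]

def accuracy(train, test):
    """Fraction of (feature_vector, label) test pairs classified correctly."""
    hits = sum(nearest_neighbor(train, x) == y for x, y in test)
    return hits / len(test)
```

Plotting accuracy against the size of the training set is one direct way to examine how far more data alone can take a simple classifier. The linear scan shown here costs time proportional to the training set size per query, so large collections would need an approximate index.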

For crawling images from Flickr, you could refer to http://graphics.cs.cmu.edu/projects/im2gps/flickr_code.html as a starting point.

References:

[1] 80 million tiny images (http://groups.csail.mit.edu/vision/TinyImages/).



Object based action recognition

Project idea: This is a more advanced topic for students interested in cutting-edge research in computer vision. Most actions are associated with objects. For instance, if someone is kicking, holding, or eating, they are doing it to something. Can we recognize actions through objects and vice-versa?

References:

[1] Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities, Bangpeng Yao and Li Fei-Fei, CVPR 2010.


Data Mining for Social Media

Project idea:  In this project, we encourage students to infer the underlying relations between different modalities of information on the Web. Here are some examples.

(1) Given a photo of movie poster (image), can we retrieve related trailers (video) or latest news articles (documents) of the movie?
To make the project simpler, we recommend focusing on fewer than five movies (e.g., 'Rise of the planet of the apes' and 'The smurfs'). You first download posters and trailers from well-organized sites such as imdb.com or itunes.com. They will be used as training data for your classifiers. Your job is then to gather raw data from youtube.com or Flickr and classify it. In this project, we encourage you to explore the possibility of building classifiers that are learned from one information modality (e.g., images) and are applicable to other modalities (e.g., trailer videos).

(2) Given a beer label (image), can we find the frames of a given video clip in which the logo or bottle appears?
Suppose that you are a big fan of Guinness beer. You can easily download clean Guinness logo or cup images through Google image search. These images can be used to train your detector, which can then find the frames of video clips in which the logo appears. For testing, you can download some video clips from youtube.com.

The above examples are just two possible candidates, and any new ideas or problem definitions are welcome.
For this purpose, one may take advantage of source code available on the Web as building blocks (e.g., near-duplicate image detection, object recognition, action recognition in video).
Another interesting direction is to improve the current state-of-the-art methods by considering more practical scenarios.

Related Papers and Software:

- A good example of how a machine learning technique is successfully applied to real systems (e.g., Google news recommendation).
[1] Das, Datar, Garg, Rajaram. Google news personalization: scalable online collaborative filtering. WWW 2007.
- One of the most popular approaches to near-duplicate image detection is the LSH family of methods.
[2] http://www.mit.edu/~andoni/LSH/ (This webpage links to several introductory articles and source code).
- Various hashing techniques in computer vision (papers and source code).
[3] Spectral Hashing (http://www.cs.huji.ac.il/~yweiss/SpectralHashing/)
[4] Kernelized LSH (http://www.eecs.berkeley.edu/~kulis/klsh/klsh.htm)
- Recognition in video
[5] Naming of Characters in Video (http://www.robots.ox.ac.uk/~vgg/data/nface/index.html)
[6] Action recognition in Video (http://www.robots.ox.ac.uk/~vgg/data/stickmen/index.html)
- Recognition in images
[7] Human pose detection (Poselet) (http://www.eecs.berkeley.edu/~lbourdev/poselets/)
[8] General object detection (http://people.cs.uchicago.edu/~pff/latent/)


Object Recognition, Scene Understanding, and More on Twitter

Project idea:  Currently, Twitter does not provide photo-sharing functionality, which is instead supported by several third-party services such as twitpic, yfrog, lockerz, and instagram. (See the current market share of these services at http://techcrunch.com/2011/06/02/a-snapshot-of-photo-sharing-market-share-on-twitter/). The main goal of this project is to recognize objects or scenes in user photos by using their contextual information such as author, time taken, and associated tweets. Students may gather data by using Twipho or the built-in search engines of the services (e.g., http://web1.twitpic.com/search/).
In practice, it is extremely difficult to completely understand the photos on Twitter. Hence, we encourage students to come up with good problem definitions so that they are not only solvable as course projects but also usable in real applications. Here are some examples.

(1) The photos that are retrieved by querying 'superman' in the http://web1.twitpic.com/search/ are highly variable. But, given an image, you can build a classifier to tell whether 'superman' logos appear on the images or not.
(2) Let's download the photos queried by 'beach'. Observing the images, you may identify what objects are usually shown. Choose some of them as our target objects such as human faces, sand, sea, and sky, and learn your classifier for each object category. Then, your goal is to tell what objects appear where in a twitter image.

Related Papers and Software:

- Some object recognition competition sites will be very helpful.
[1] PASCAL VOC (http://pascallin.ecs.soton.ac.uk/challenges/VOC/)
[2] ImageNet (http://www.image-net.org/challenges/LSVRC/2011/)
[3] MIRFLICKR (http://press.liacs.nl/mirflickr/)
[4] SUN database (http://groups.csail.mit.edu/vision/SUN/)
- Some object detection source code is available.
[5] Most popular object detection (http://people.cs.uchicago.edu/~pff/latent/)
[6] Object recognition short course pLSA and Boosting (http://people.csail.mit.edu/torralba/shortCourseRLOC/index.html)
[7] Human pose detection (Poselet) (http://www.eecs.berkeley.edu/~lbourdev/poselets/)



Identifying ancestry-informative markers

Ancestry-informative markers are polymorphisms that differ in frequency across populations. They can be used to differentiate between geographical populations and identify their ancestry. This has an important application in a technique called admixture mapping, which can be used to identify polymorphisms that contribute to disease risk in populations. Many collections of ancestry-informative markers have previously been identified using statistical methods for resolving ancestry at the continental and national levels. In this project, you will use the techniques learned in class to identify sets of ancestry-informative markers and compare your results to existing methods.
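Since ancestry-informative markers are defined by frequency differences across populations, a simple starting point is to rank SNPs by the absolute difference in allele frequency between two populations. The sketch below assumes genotypes are coded as 0, 1, or 2 copies of a chosen allele and stored per SNP in plain dicts; the data layout and function names are assumptions, and this is one basic criterion rather than any of the published selection methods.

```python
def allele_frequency(genotypes):
    """genotypes: list of 0/1/2 allele counts for one SNP in one population."""
    return sum(genotypes) / (2.0 * len(genotypes))

def rank_markers(pop_a, pop_b):
    """Rank SNPs by absolute allele-frequency difference between two populations.

    pop_a, pop_b: dict mapping SNP id -> list of genotypes (0, 1, or 2).
    Returns a list of (snp_id, difference), most informative first.
    """
    delta = {
        snp: abs(allele_frequency(pop_a[snp]) - allele_frequency(pop_b[snp]))
        for snp in pop_a if snp in pop_b
    }
    return sorted(delta.items(), key=lambda kv: kv[1], reverse=True)
```

A ranking like this treats each SNP independently; a learned model could instead select a small set of markers that are jointly informative.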

Data:

  1. The HapMap project (http://hapmap.ncbi.nlm.nih.gov/) - The International HapMap Project is analyzing DNA from populations with African, Asian, and European ancestry. Together, these DNA samples should enable HapMap researchers to identify most of the common haplotypes that exist in populations worldwide. The DNA samples in the HapMap project come from 1,301 samples from 11 African, European and Asian populations. The data in phase 3 contains the genotype of the individuals at about 1.5 million SNPs. This data can be used for various population genetic analyses.
  2. The Human Genome Diversity Project (http://www.cephb.fr/en/hgdp/diversity.php) - This project has genetic data from 1050 individuals in 52 world populations. To date, the DNAs have been typed genome wide with almost 1 million SNPs, 843 microsatellites, and 51 small indel loci. Approximately 10,000 CNV (Copy Number Variations) calls from two different laboratories are included in the database.

References:


Genotype imputation

Genotype datasets available today include genotype information at hundreds of thousands or even millions of polymorphisms. However, due to noise or choice of genotyping density, some genotype information in the dataset may be missing. This can cause problems when using the data for further analysis such as association studies, where polymorphisms contributing to disease risk can be identified using the genetic data. The problem of identifying the genotype at these missing positions is called genotype imputation. To accomplish this, various statistical methods are used. In this project, you will use machine learning techniques to identify missing genotype data.
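As a baseline to compare learned models against, missing genotypes can be copied from the most similar fully typed individual. The sketch below is a naive nearest-neighbor heuristic, not one of the statistical imputation methods from the references; it assumes genotypes are coded 0/1/2 with None marking missing positions, and that a reference panel of complete genotype vectors is available.

```python
def impute(target, reference):
    """Fill missing genotypes in target from the closest reference individual.

    target: list of genotypes (0, 1, 2) with None at missing positions.
    reference: non-empty list of complete genotype lists of the same length.
    """
    observed = [i for i, g in enumerate(target) if g is not None]

    def mismatches(ref):
        # Number of observed positions where the reference individual disagrees.
        return sum(ref[i] != target[i] for i in observed)

    best = min(reference, key=mismatches)
    return [best[i] if g is None else g for i, g in enumerate(target)]
```

Hiding known genotypes at random and measuring how often they are recovered gives a simple way to evaluate this baseline against more sophisticated approaches.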

Data:

  1. The HapMap project (http://hapmap.ncbi.nlm.nih.gov/) - The International HapMap Project is analyzing DNA from populations with African, Asian, and European ancestry. Together, these DNA samples should enable HapMap researchers to identify most of the common haplotypes that exist in populations worldwide. The DNA samples in the HapMap project come from 1,301 samples from 11 African, European and Asian populations. The data in phase 3 contains the genotype of the individuals at about 1.5 million SNPs. This data can be used for various population genetic analyses.
  2. The Human Genome Diversity Project (http://www.cephb.fr/en/hgdp/diversity.php) - This project has genetic data from 1050 individuals in 52 world populations. To date, the DNAs have been typed genome wide with almost 1 million SNPs, 843 microsatellites, and 51 small indel loci. Approximately 10,000 CNV (Copy Number Variations) calls from two different laboratories are included in the database.

References:

  • Zheng, J., Li, Y., Abecasis, G. R. and Scheet, P. (2011), A comparison of approaches to account for uncertainty in analysis of imputed genotypes. Genetic Epidemiology, 35: 102-110. doi: 10.1002/gepi.20552
  • Li Y, Willer CJ, Sanna S, Abecasis GR. Genotype imputation. Annual Review of Genomics and Human Genetics 10: 387-406.

Association analysis

Project Idea: The goal of population association studies is to identify patterns of polymorphisms that vary systematically between individuals with different disease states and could therefore represent the effects of risk-enhancing or protective alleles. These studies make use of the statistical correlation between the polymorphism and the trait of interest (usually the presence or absence of disease) to identify these patterns. This project will make use of data from the Personal Genome Project. It contains information about many traits of the individuals from whom the genetic data was obtained. Using techniques such as statistical tests, sparsity-based methods, and eigenanalysis, you can try to find the genetic polymorphisms that are likely to be responsible for a particular trait.
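The simplest of the statistical tests mentioned above is a chi-square test on a 2x2 table of allele counts in cases versus controls. The sketch below assumes a binary trait and that allele counts have already been tallied for a single polymorphism; the function and argument names are illustrative, and every row and column total is assumed to be nonzero.

```python
import math

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele count table.

    Returns (chi2 statistic, p-value).
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

When many polymorphisms are tested, the resulting p-values need a multiple-testing correction, and population stratification can inflate the statistic, which is the issue addressed by the principal components approach in the references below.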

Data:

References:

  • D.J.Balding, (2006) A tutorial on statistical methods for population association studies, Nature Reviews Genetics.
  • Stephens M, Balding DJ. (2009) Bayesian statistical methods for genetic association studies, Nature Reviews Genetics.
  • Alkes L Price, Nick J Patterson, Robert M Plenge, Michael E Weinblatt, Nancy A Shadick and David Reich, (2006), Principal components analysis corrects for stratification in genome-wide association studies, Nature Genetics.

Using text and network data to predict and to understand

Description:

Many data sets are heterogeneous, comprising feature vectors, textual data (bag of words), network links, image data, and more. For example, Wikipedia pages contain text, links and images. The challenge is figuring out how to use all these types of data for some machine learning task: data exploration, prediction, etc. A typical machine learning approach to this problem is "multi-view learning", in which the different data types are assumed to be multiple "views" of the entities of interest (webpages in the case of Wikipedia).

Multi-view learning opens up applications not normally available with single-view datasets. For example, consider a citation recommendation service for academics, which suggests papers you should cite based on the text of your paper draft. Such a service would be trained on a corpus of academic papers, learning how the citations relate to the paper texts. Another example would be interest prediction and advertising in social networks: given a user's friend list, determine what things that user is interested in.

In this project, you will focus on datasets with text and network data, such as (but not limited to) citation networks. As our examples suggest, your primary goal is to design a machine learning algorithm that trains on a subset of the text and network data, and, given text (or network links) from test entities, outputs network link (or text) predictions for them. Alternatively, you could design an algorithm that converts text and network data into "latent space" feature vectors suitable for data visualization (similar to methods such as the Latent Dirichlet Allocation and the Mixed-Membership Stochastic Blockmodel). Note that these goals are not exclusive; your proposed method could even do both. The key challenge in this project is figuring out how to learn from text and network data jointly, even though both data types are fundamentally different.
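As a simple baseline for the text-to-link prediction goal described above, one can suggest the citations of the training documents whose text is most similar to the test document. The sketch below uses bag-of-words cosine similarity with whitespace tokenization; it is not a joint model of text and links, and the data layout (dicts keyed by document id) and function names are assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def predict_citations(test_text, train_texts, train_links, k=1, top=3):
    """Suggest citations by pooling those of the k most similar training documents.

    train_texts: dict doc_id -> text. train_links: dict doc_id -> list of cited ids.
    """
    query = Counter(test_text.lower().split())
    sims = sorted(
        ((cosine(query, Counter(text.lower().split())), doc)
         for doc, text in train_texts.items()),
        reverse=True,
    )
    votes = Counter()
    for sim, doc in sims[:k]:
        for cited in train_links.get(doc, []):
            votes[cited] += sim  # weight each vote by text similarity
    return [doc for doc, _ in votes.most_common(top)]
```

A joint model should be able to beat this baseline precisely because the baseline uses the network only at prediction time and never lets link structure inform the text representation.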

Recommended reading:

  • Joint Latent Topic Models for Text and Citations (Nallapati, Ahmed, Xing, Cohen, 2008)
  • Multi-view learning over structured and non-identical outputs (Ganchev, Graca, Blitzer, Taskar, 2008)

Suggested Datasets :

  • ACL Anthology citation network
  • arXiv High-Energy Physics citation network (from KDD cup 2003)

Efficient methods for understanding large networks

Description:

Network data is usually presented as a list of edges, each connecting two network actors. These edges represent binary relationships: for example, in a citation network, a directed edge from paper i to paper j indicates that i cited j. One problem with networks is that the binary relationships are inherently difficult to visualize; a network with thousands of edges is difficult to draw without messy edge overlaps.

In order to visualize network data better, statistical models such as the Mixed-Membership Stochastic Blockmodel (closely related to the Latent Dirichlet Allocation model used in NLP) take binary network relationships, and from them learn individual feature vectors for each actor. These individual feature vectors can be interpreted as "communities" or "roles" in the network, and are often easier to cluster or visualize than the original network edges. Furthermore, these feature vectors naturally serve as input to other machine learning methods (such as logistic regression or Naive Bayes), making them highly useful for projects that require learning from network data as well as "conventional" feature vectors. Unfortunately, there is one big drawback to network models such as MMSB: existing learning algorithms require O(N^2) runtime (where N is the number of actors), making them impractical for larger networks with more than 10,000 actors.

In this project, you are to design a network learning algorithm that (1) turns binary network relationships into individual feature vectors for actors, such that (2) the learning algorithm is practical for networks with 10,000 or more actors (i.e. runtime should grow more slowly than O(N^2)). The underlying network model could be anything; you could use MMSB as a starting point, or even build a new model from non-statistical methods such as SVMs. Ideally, the learnt feature vectors should give insight into the network and its actors, or they should be useful for clustering or prediction tasks.
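
To illustrate what sub-quadratic can look like, the sketch below computes a one-dimensional feature per actor (the leading eigenvector of the symmetrized adjacency matrix) by power iteration over the edge list. This is not MMSB and is only a rough illustration under simple assumptions: the graph is treated as undirected and connected, and the function name and toy network are hypothetical. Each iteration touches every edge once, so the cost is O(E) per iteration rather than O(N^2).

```python
import math

def leading_eigenvector(edges, num_actors, iterations=100):
    """Power iteration on (A + I), where A is the symmetrized adjacency matrix.

    A + I has the same eigenvectors as A; adding the identity avoids the
    oscillation that plain power iteration shows on bipartite graphs.
    """
    x = [1.0 / math.sqrt(num_actors)] * num_actors
    for _ in range(iterations):
        y = list(x)  # the identity term
        for i, j in edges:  # one pass over the edge list: O(E)
            y[i] += x[j]
            y[j] += x[i]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x

# Hypothetical toy network: actor 0 is connected to four other actors.
star_edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
scores = leading_eigenvector(star_edges, num_actors=5)
print(scores)
```

Extending this to several dimensions per actor would require orthogonalizing multiple vectors at each iteration, which is one possible starting point among many.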

Recommended Reading:

These papers introduce the idea of a "latent space", in which the individual actor feature vectors lie. The feature vectors in the latent space "generate" the observed network edges, much as the feature vectors in Naive Bayes generate the observed data. You could use the latent spaces described here as the foundation for your algorithm, or come up with your own:

  • Mixed Membership Stochastic Blockmodels (Airoldi, Blei, Fienberg, Xing, 2008)
  • Theoretical Justification of Popular Link Prediction Heuristics (Sarkar, Chakrabarti, Moore, 2010)

Suggested Datasets :

  • ACL Anthology citation network
  • arXiv High-Energy Physics citation network (from KDD cup 2003)

Project A1: Cognitive State Classification with Magnetoencephalography Data (MEG)

Data:

A zip file containing some example preprocessing of the data into features along with some text file descriptions: LanguageFiles.zip
The raw time data (12 GB) for two subjects (DP/RG_mats) and the FFT data (DP/RG_avgPSD) is located at:
/afs/cs.cmu.edu/project/theo-23/meg_pilot
You should access this directly through AFS space

This data set contains a time series of images of brain activation, measured using MEG. Human subjects viewed 60 different objects divided into 12 categories (tools, foods, animals, etc.). There are 8 presentations of each object, and each presentation lasts 3-4 seconds. Each second has hundreds of measurements from 300 sensors. The data is currently available for 2 different human subjects.

Project A: Building a cognitive state classifier
Project idea: We would like to build classifiers to distinguish between the different categories of objects (e.g. tools vs. foods) or even the objects themselves if possible (e.g. bear vs. cat). The exciting thing is that no one really knows how well this will work (or if it's even possible). This is because the data was only gathered a few weeks ago (Aug-Sept 08). One of the main challenges is figuring out how to make good features from the raw data. Should the raw data just be used? Or maybe it should first be passed through a low-pass filter? Perhaps an FFT should convert the time series to the frequency domain first? Should the features represent absolute sensor values or should they represent changes from some baseline? If so, what baseline? Another challenge is discovering what features are useful for what tasks. For example, the features that distinguish foods from animals may be different from those that distinguish tools from buildings. What are good ways to discover these features?
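
One of the feature options mentioned above (baseline correction followed by an FFT) can be sketched as follows. This is a minimal pure-Python illustration, not the preprocessing used for the provided data: the baseline is assumed to be the mean of the first few samples of the trace, the signal length is assumed to be a power of two, and the synthetic sinusoid stands in for a single sensor's time series.

```python
import cmath
import math

def fft(signal):
    """Recursive radix-2 Cooley-Tukey FFT; len(signal) must be a power of two."""
    n = len(signal)
    if n == 1:
        return [complex(signal[0])]
    even = fft(signal[0::2])
    odd = fft(signal[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def power_spectrum(signal, baseline_len):
    """Subtract the mean of the first baseline_len samples, then return power per frequency bin."""
    baseline = sum(signal[:baseline_len]) / baseline_len
    centered = [s - baseline for s in signal]
    spectrum = fft(centered)
    # Keep only the non-redundant half of the spectrum of a real signal.
    return [abs(c) ** 2 for c in spectrum[: len(signal) // 2 + 1]]

# Hypothetical single-sensor trace: 64 samples containing 8 full cycles.
trace = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
features = power_spectrum(trace, baseline_len=8)
peak_bin = max(range(len(features)), key=features.__getitem__)
print(peak_bin)
```

For the real recordings, a library FFT would be far faster; the point here is only the shape of the feature pipeline, which yields one power value per frequency bin per sensor.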

This project is more challenging and risky than the others because it is not known what the results will be. But this is also good because no one else knows either, meaning that a good result could lead to a possible publication.
Papers to read:
Relevant but in the fMRI domain:
Learning to Decode Cognitive States from Brain Images, Mitchell et al., 2004,
Predicting Human Brain Activity Associated with the Meanings of Nouns, Mitchell et al., 2008
MEG paper:
Predicting the recognition of natural scenes from single trial MEG recordings of brain activity, Rieger et al. 2008 (access from CMU domain)



Project A2: Brain imaging data (fMRI)

This data is available here

This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects.

Available software: we can provide Matlab software for reading the data, manipulating and visualizing it, and for training some types of classifiers (Gaussian Naive Bayes, SVM).

Project A: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent) by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include (5000 voxels times 8 seconds of images is a lot of features) for classifier input, whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
Papers to read: "Learning to Decode Cognitive States from Brain Images", Mitchell et al., 2004; "Bayesian Network Classifiers", Friedman et al., 1997.
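
For orientation, the Gaussian Naive Bayes baseline that this project aims to improve on can be sketched in a few lines. This is an illustrative pure-Python version, not the Matlab software mentioned above; the function names, the variance floor and the two-feature toy data are assumptions made for the example. The per-feature sum in the log posterior is exactly where the conditional independence assumption enters, and it is the term a TAN model would replace with terms conditioned on a parent feature.

```python
import math
from collections import defaultdict

def fit_gnb(X, y, var_floor=1e-6):
    """Estimate class log-priors and per-feature Gaussian means and variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Maximum-likelihood variances, floored to avoid division by zero.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_gnb(model, row):
    """Return the class with the highest log posterior."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )
    return max(model, key=log_posterior)

# Hypothetical toy data: two "voxel" features per example.
X = [[1.0, 2.1], [0.9, 1.9], [1.1, 2.0], [3.0, 0.1], [3.2, -0.1], [2.9, 0.0]]
y = ["sentence"] * 3 + ["picture"] * 3
model = fit_gnb(X, y)
print(predict_gnb(model, [1.0, 2.0]), predict_gnb(model, [3.1, 0.0]))
```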



Project A3: Genetic Sequence Analysis

We don't currently have a specific dataset in mind for this project, but if you're interested we'll help you find one (ask Field).

One of the most interesting, and controversial, areas of modern science is using people's genetic code to predict things like their likelihood of getting heart disease, their athletic prowess, and even their personality and intelligence. The movie Gattaca shows some of the downsides of this technology, but it can also be immensely helpful if a person takes preventive measures. Also, many drugs work better for people with certain genes. Insurance problems notwithstanding, genetic screening will play a huge role in medicine in the coming decades.

This project is not as well-defined as many others, but the idea is to get ahold of genetic data from patients, along with some kind of phenotype marker (like whether they got a disease), and try to find patterns within the genetic code which predict the trait. This area is very exciting because in many cases, people have literally no idea what causal links exist between genes and traits, but finding these links can be a huge boost to both medicine and pure science (by telling scientists which particular gene combinations to examine).




Project A4: Hierarchical Bayes Topic Models

Statistical topic models have recently gained much popularity for managing large collections of text documents. These models make the fundamental assumption that a document is a mixture of topics (as opposed to clustering, in which we assume that a document is generated from a single topic), where the mixture proportions are document-specific and signify how important each topic is to the document. Moreover, each topic is a multinomial distribution over a given vocabulary, which in turn dictates how important each word is for a topic. The document-specific mixture proportions provide a low-dimensional representation of the document in the topic space. This representation captures the latent semantics of the collection and can then be used for tasks like classification and clustering, or merely as a tool to structurally browse the otherwise unstructured collection. The most famous such model is Latent Dirichlet Allocation, or LDA (Blei et al., 2003). LDA has been the basis for many extensions in text, vision, bioinformatics, and social networks. These extensions incorporate more dependency structure into the generative process, such as modeling author-topic dependencies, or implement more sophisticated ways of representing inter-topic relationships.

Potential projects include
  • Implement one of the models listed below or propose a new latent topic model that suits a data set in your area of interest
  • Implement and compare approximate inference algorithms for LDA, including variational inference (Blei et al., 2003), collapsed Gibbs sampling (Griffiths et al., 2004) and (optionally) collapsed variational inference (Teh et al., 2006). You should compare them on simulated data by varying the corpus generation parameters (number of topics, size of vocabulary, document length, etc.), in addition to comparing them on several real-world datasets.
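
For the simulated-data comparison, a corpus can be drawn from the LDA generative process itself. The sketch below is one way to do that in pure Python, assuming symmetric Dirichlet priors and a fixed document length; the function names and default hyperparameter values are illustrative choices, not values prescribed by the papers.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(num_docs, num_topics, vocab_size, doc_length,
                    alpha=0.5, beta=0.1, seed=0):
    """Generate documents as lists of word ids, following the LDA generative process."""
    rng = random.Random(seed)
    # One multinomial over the vocabulary per topic.
    topics = [sample_dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        theta = sample_dirichlet(alpha, num_topics, rng)  # document-specific topic proportions
        doc = []
        for _ in range(doc_length):
            z = rng.choices(range(num_topics), weights=theta)[0]      # topic assignment
            w = rng.choices(range(vocab_size), weights=topics[z])[0]  # word drawn from that topic
            doc.append(w)
        corpus.append(doc)
    return corpus, topics

corpus, topics = generate_corpus(num_docs=5, num_topics=3, vocab_size=20, doc_length=30)
print(corpus[0])
```

Because the true topics are returned alongside the corpus, each inference algorithm can be scored on how well it recovers them as the generation parameters vary.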

Papers:

Inference:
  • D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
    [pdf]
  • Griffiths, T., Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228-5235.
    [pdf]
  • Y.W. Teh, D. Newman and M. Welling. A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation. In NIPS 2006.
    [pdf]

Expressive Models:
  • Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. The Author-Topic Model for authors and documents. In UAI 2004.
    [pdf]
  • Jun Zhu, Amr Ahmed and Eric Xing. MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification. International Conference on Machine Learning (ICML 2009).
    [pdf]
  • D. Blei, J. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems 21, 2007
    [pdf]
  • Wei Li and Andrew McCallum. Pachinko Allocation: Scalable Mixture Models of Topic Correlations. Submitted to the Journal of Machine Learning Research, (JMLR), 2008
    [pdf]
Application in Vision:
  • L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. IEEE Comp. Vis. Patt. Recog. 2005. [PDF]
  • L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent object segmentation and classification . IEEE Intern. Conf. in Computer Vision (ICCV). 2007 [PDF]
Application in Social Networks/relational data:
  • Ramesh Nallapati, Amr Ahmed, Eric P. Xing, and William W. Cohen, Joint Latent Topic Models for Text and Citations. Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. (KDD 2008) [PDF]
  • Erosheva, Elena A., Fienberg, Stephen E., and Lafferty, John (2004). Mixed-membership models of scientific publications, Proceedings of the National Academy of Sciences, 97, No. 22, 11885-11892. [PDF]
  • E. Airoldi, D. Blei, E.P. Xing and S. Fienberg, Mixed Membership Model for Relational Data. JMLR 2008. [PDF]
  • Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. Technical Report UM-CS-2004-096, 2004. [PDF]
  • E.P. Xing, W. Fu, and L. Song, A State-Space Mixed Membership Blockmodel for Dynamic Network Tomography, Annals of Applied Statistics, 2009. [PDF]
Application in Biology/Biological Text:
  • S. Shringarpure and E. P. Xing, mStruct: A New Admixture Model for Inference of Population Structure in Light of Both Genetic Admixing and Allele Mutations, Proceedings of the 25th International Conference on Machine Learning (ICML 2008). [PDF]
  • Amr Ahmed, Eric P. Xing, William W. Cohen, Robert F. Murphy. Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature. Proceedings of The Fifteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (KDD 2009) [PDF]

Project B: Image Segmentation Dataset


The goal is to segment images in a meaningful way.  Berkeley collected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations).  Two hundred of these images are training images, and the remaining 100 are test images.  The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions.  It also includes code for a segmentation example.  This dataset is new and the problem unsolved, so there is a chance that you could come up with the leading algorithm for your project.
http://www.cs.berkeley.edu/projects/vision/grouping/segbench/

Project ideas:
Project B: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture.  The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions.  One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments.  Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.
Papers to read: Some segmentation papers from Berkeley are available here



Project C: Twenty Newsgroups text data

This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles.  For documentation and download, see this website.  This data is useful for a variety of text classification and/or clustering projects.  The "label" of each article is which of the 20 newsgroups it belongs to.  The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").

Available software: The same website provides an implementation of a Naive Bayes classifier for this text data.  The code is quite robust, and some documentation is available, but it is difficult code to modify.

Project ideas:
 

EM text classification in the case where you have labels for some documents, but not for others  (see McCallum et al, and come up with your own suggestions)
 


Project D: Handwriting Recognition (Lisa Anthony http://www.cs.cmu.edu/~lanthony/)

A general overview of our data: we have approximately 16,000 labeled character samples from 39 middle and high school students, consisting of x-coord, y-coord, and time per point in each stroke. They are grouped into sets of 45 equations that each student copied. The symbols in our dataset are: 0-9, x, y, a, b, c, +, -, _ (fraction bar), =, (, ).

There are 3 main ideas for projects:

1. HOW MUCH DATA: All our data is currently hand-labeled, and we have lots of it. One question might be, if the data wasn't labeled, what would be the added value of additional data? That is, what would be the optimal or minimal dataset? This could be defined along several axes: the number of users, the number of samples per character, or the number of samples per symbol per user. We have done a few preliminary experiments where it is clear that there is a leveling-off point for test accuracy -- likely caused by the increase in variability of adding new samples (especially by new users with differing handwriting styles), which harms the classification algorithm (see #3). For future studies and domains it might be useful to get a general sense of "data saturation" -- a recommended canonical corpus size.

2. HOW MUCH LABELED DATA AND/OR AUTOMATIC LABELING: Hand-labeling all our data took quite a bit of time. What possibilities exist for an automated, semi-supervised labeling algorithm that could tell us how much data we need to label in advance and how much human verification is needed on the automatically labeled stuff? A side note is that the collection of this data (for the sake of the users) was in the form of one equation at a time rather than one character at a time, so the characters needed to be segmented at the time of labeling since the strokes all ran together in the logs. An automated segmenting approach would be very helpful to us in the future!

3. MULTIPLE CLASSIFIERS: Finally, there is quite a bit of variance between users in that their handwriting styles differ and the particular means of executing a style differs across users. We hypothesize that multiple classifiers trained per user would have higher walk-up-and-use accuracy on a set of independent users than one classifier that has to generalize across all user styles. So this could also be an interesting area to explore.


Project E: Character recognition (digits) data

Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)

http://ai.stanford.edu/~btaskar/ocr/

Project suggestion:

  • Use an HMM to exploit correlations between neighboring letters in the general OCR case to improve accuracy. (Since ZIP codes don't have such constraints between neighboring digits, HMMs will probably not help in the digit case.)
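
A minimal sketch of the decoding step for such an HMM is given below, using the Viterbi algorithm in log space. Hidden states are the true letters and observations are the (possibly wrong) per-letter guesses of an image classifier. The three-letter alphabet and all probabilities are made up for illustration; in the project they would be estimated from the training words.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely hidden state sequence under a first-order HMM (log-space)."""
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def logs(table):
    """Convert a table of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Hypothetical parameters: "q" is almost always followed by "u",
# and the image classifier often confuses "u" with "v".
states = ["q", "u", "v"]
log_start = logs({"q": 0.4, "u": 0.3, "v": 0.3})
log_trans = {
    "q": logs({"q": 0.05, "u": 0.9, "v": 0.05}),
    "u": logs({"q": 1 / 3, "u": 1 / 3, "v": 1 / 3}),
    "v": logs({"q": 1 / 3, "u": 1 / 3, "v": 1 / 3}),
}
log_emit = {
    "q": logs({"q": 0.8, "u": 0.1, "v": 0.1}),
    "u": logs({"q": 0.1, "u": 0.5, "v": 0.4}),
    "v": logs({"q": 0.1, "u": 0.4, "v": 0.5}),
}
print(viterbi(["q", "v"], states, log_start, log_trans, log_emit))
```

In this toy setting the per-letter classifier reads the second letter as "v", while the transition model pulls the joint decoding toward "qu", which is the kind of correction the project suggestion has in mind.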

Project F: NBA statistics data

This download contains 2004-2005 NBA and ABA stats for:

-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - NBA coaching records by season
-coaches_career.txt - NBA career coaching records

Currently all of the regular season

Project idea:

  • outlier detection on the players; find out who are the outstanding players.
  • predict the game outcome.
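
As a starting point for the outlier-detection idea, a simple z-score rule is sketched below. The player names and per-game scoring numbers are invented for illustration, and the threshold of 2 standard deviations is an arbitrary choice; a real project would likely combine several statistics per player.

```python
import statistics

def outstanding(stats, threshold=2.0):
    """Return the players whose statistic lies more than `threshold` standard deviations above the mean."""
    values = list(stats.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []
    return sorted(name for name, v in stats.items() if (v - mean) / sd > threshold)

# Hypothetical points-per-game values, not taken from the download.
points_per_game = {
    "player_a": 8, "player_b": 9, "player_c": 10, "player_d": 11,
    "player_e": 10, "player_f": 9, "player_g": 12, "player_h": 30,
}
print(outstanding(points_per_game))
```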

Project G: Precipitation data

This dataset includes 45 years of daily precipitation data from the Northwest of the US:

http://www.jisao.washington.edu/data_sets/widmann/

Project ideas:

Weather prediction: Learn a probabilistic model to predict rain levels

Sensor selection: Where should you place sensors to best predict rain?


Project H: WebKB

This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.

http://www-2.cs.cmu.edu/~webkb/
 

Project ideas:

  • Learning classifiers to predict the type of webpage from the text
  • Can you improve accuracy by exploiting correlations between pages that point to each other using graphical models?

Papers:


Project I: Deduplication


The datasets provided below consist of lists of records, and the goal is to identify, for any dataset, which records refer to the same entity, so that each unique entity is listed once. This problem is known by the varied names of Deduplication, Identity Uncertainty and Record Linkage.

http://www.cs.utexas.edu/users/ml/riddle/data.html

Project Ideas:
  • One common approach is to cast the deduplication problem as a classification problem. Consider the set of record-pairs, and classify them as either "unique" or "not-unique".
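
The pairwise formulation can be sketched as follows. Here a fixed threshold on token-level Jaccard similarity stands in for a trained classifier, and the records are invented for illustration rather than taken from the datasets above. The label "not-unique" is read as "the two records probably refer to the same entity".

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between the token sets of two records."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def classify_pairs(records, threshold=0.5):
    """Label every record pair as "not-unique" (likely duplicates) or "unique"."""
    labels = {}
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        labels[(i, j)] = "not-unique" if jaccard(a, b) >= threshold else "unique"
    return labels

# Hypothetical records for illustration only.
records = [
    "john smith 12 main st",
    "john smith 12 main street",
    "mary jones 4 oak ave",
]
print(classify_pairs(records))
```

Note that enumerating all pairs is quadratic in the number of records, so larger datasets usually need some way of restricting which pairs are compared. A learned classifier would replace the threshold with a decision over several similarity features per pair.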

Papers:

Project J: Email Annotation


The datasets provided below are sets of emails. The goal is to identify which parts of the email refer to a person name. This task is an example of the general problem area of Information Extraction.

http://www.cs.cmu.edu/~einat/datasets.html

Project Ideas:
  •  Model the task as a Sequential Labeling problem, where each email is a sequence of tokens, and each token can have either a label of "person-name" or "not-a-person-name".

Papers: http://www.cs.cmu.edu/~einat/email.pdf


Project K: Netflix Prize Dataset

The Netflix Prize data set gives 100 million records of the form "user X rated movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize

Project idea:

  • Can you predict the rating a user will give to a movie from the movies that user has rated in the past, as well as the ratings similar users have given similar movies?

  • Can you discover clusters of similar movies or users?

  • Can you predict which users rated which movies in 2006? In other words, your task is to predict the probability that each pair was rated in 2006. Note that the actual rating is irrelevant, and we just want whether the movie was rated by that user sometime in 2006. The date in 2006 when the rating was given is also irrelevant. The test data can be found at this website
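
For the rating-prediction idea, a useful first baseline predicts the global mean rating plus a per-movie and a per-user offset. The sketch below assumes ratings arrive as (user, movie, rating) triples on the 1-5 scale; the names and the tiny data set are hypothetical, and no regularization is applied.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """Fit the global mean, per-movie offsets, and per-user offsets from (user, movie, rating) triples."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    movie_residuals = defaultdict(list)
    for _, movie, r in ratings:
        movie_residuals[movie].append(r - mu)
    movie_bias = {m: sum(v) / len(v) for m, v in movie_residuals.items()}
    user_residuals = defaultdict(list)
    for user, movie, r in ratings:
        user_residuals[user].append(r - mu - movie_bias[movie])
    user_bias = {u: sum(v) / len(v) for u, v in user_residuals.items()}
    return mu, user_bias, movie_bias

def predict(model, user, movie):
    """Predict a rating, falling back to a zero offset for unseen users or movies."""
    mu, user_bias, movie_bias = model
    raw = mu + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)
    return min(5.0, max(1.0, raw))  # clamp to the rating scale

# Hypothetical toy ratings.
ratings = [
    ("u1", "m1", 5), ("u1", "m2", 3),
    ("u2", "m1", 4), ("u2", "m2", 2),
    ("u3", "m1", 5),
]
baseline = fit_baseline(ratings)
print(predict(baseline, "u3", "m2"))
```

Neighborhood or latent-factor methods can then be trained on the residuals that this baseline leaves unexplained.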

Project L: Physiological Data Modeling (bodymedia)

Physiological data offers many challenges to the machine learning community including dealing with large amounts of data, sequential data, issues of sensor fusion, and a rich domain complete with noise, hidden variables, and significant effects of context.


1. Which sensors correspond to each column?

characteristic1 age
characteristic2 handedness
sensor1 gsr_low_average
sensor2 heat_flux_high_average
sensor3 near_body_temp_average
sensor4 pedometer
sensor5 skin_temp_average
sensor6 longitudinal_accelerometer_SAD
sensor7 longitudinal_accelerometer_average
sensor8 transverse_accelerometer_SAD
sensor9 transverse_accelerometer_average


2. What are the activities behind each annotation?

The annotations for the contest were:
5102 = sleep
3104 = watching TV

Datasets can be downloaded from http://www.cs.utexas.edu/users/sherstov/pdmc/

 

Project idea:

  • behavior classification; to classify the person based on the sensor measurements 

Project M: Object Recognition

The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, varying lighting conditions, and with different backgrounds.
http://www.vision.caltech.edu/Image_Datasets/Caltech256/

Project ideas:

  • You can try to create an object recognition system which can identify which object category is the best match for a given test image.
  • Apply clustering to learn object categories without supervision

Project N: Learning POMDP structure so as to maximize utility


Hoey & Little (CVPR 04) show how to learn the state space, and parameters, of a POMDP so as to maximize utility in a visual face gesture recognition task. (This is similar to the concept of "utile distinctions" developed in Andrew McCallum's PhD thesis.) The goal of this project is to reproduce Hoey's work in a simpler (non-visual) domain, such as McCallum's driving task.


Project O: Learning partially observed MRFs: the Langevin algorithm


In the recently proposed exponential family harmonium model (Welling et al., Xing et al.), a contrastive divergence (CD) algorithm was used to learn the parameters of the model (essentially a partially observed, two-layer MRF). In Xing et al., a comparison to variational learning was performed. CD is essentially a gradient ascent algorithm in which the gradient is approximated by a few samples. The Langevin method adds a random perturbation to the gradient and can often help the learning process escape local optima. In this project you will implement the Langevin learning algorithm for Xing's dual-wing harmonium model, and test your algorithm on the data in my UAI paper. See Zoubin Ghahramani's paper on Bayesian learning of MRFs for reference.
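
The form of the Langevin update (a gradient step plus Gaussian noise) can be seen on a toy problem. The sketch below is not the harmonium learning algorithm: it applies unadjusted Langevin updates to a one-dimensional target whose log density is -theta^2 / 2, an assumption chosen so the result is easy to check. In the project, theta would be the model parameters and the gradient would itself be a sample-based approximation, as in CD.

```python
import math
import random

def langevin_samples(grad_log_p, theta0, step=0.1, iterations=20000, seed=0):
    """Unadjusted Langevin updates: theta <- theta + step * gradient + sqrt(2 * step) * noise."""
    rng = random.Random(seed)
    theta, samples = theta0, []
    for _ in range(iterations):
        noise = rng.gauss(0.0, 1.0)
        theta = theta + step * grad_log_p(theta) + math.sqrt(2.0 * step) * noise
        samples.append(theta)
    return samples

# Toy target (an assumption for illustration): log p(theta) = -theta^2 / 2,
# whose gradient is -theta.
samples = langevin_samples(lambda theta: -theta, theta0=3.0)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance)
```

With a small step size the iterates hover around the mode instead of converging to it, and their spread roughly matches the target distribution; that injected randomness is what lets the method leave shallow local optima.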


Project P: Context-specific independence

We learned in class that CSI can speed up inference. In this project, you can explore this further. For example, implement the recursive conditioning approach of Adnan Darwiche, and compare it to variable elimination and clique trees. When is recursive conditioning faster? Can you find practical BNs where the speed-up is considerable? Can you learn such BNs from data?

Project Q: Enron E-mail Dataset

The Enron E-mail data set contains about 500,000 e-mails from about 150 users. The data set is available here: Enron Data


Project ideas:

  • Can you classify the text of an e-mail message to decide who sent it? 


Project R: More data

There are many other datasets out there. UC Irvine has a repository that could be useful for your project:

http://www.ics.uci.edu/~mlearn/MLRepository.html

Sam Roweis also has a link to several datasets out there:

http://www.cs.toronto.edu/~roweis/data.html


© 2012 Eric Xing @ School of Computer Science, Carnegie Mellon University