Datasets and project
suggestions: Below are descriptions of several data sets,
and some suggested projects. The first few are spelled out in greater
detail. You are encouraged to select and flesh out one of these
projects, or to make up your own well-specified project using these datasets.
Work on alternative datasets must be approved by the instructors.
A: fMRI data
This data set contains a time series of images of brain
activation, measured using fMRI, with one image every 500 msec. During this
time, human subjects performed 40 trials of a sentence-picture comparison
task (reading a sentence, observing a picture, and determining whether the
sentence correctly described the picture). Each of the 40 trials lasts
approximately 30 seconds. Each image contains approximately 5,000 voxels (3D
pixels), across a large portion of the brain. Data is available for 12
different human subjects.
Available software: we can provide Matlab software for reading, manipulating, and visualizing the data, and for training some types of classifiers (Gaussian Naive Bayes, SVM).
Project A1: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent) by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include as classifier input (5,000 voxels times 8 seconds of images is a lot of features), whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
Midpoint milestone: by Nov 8 you should have run at least one classification algorithm on this data and measured its accuracy using a cross-validation test. This will put you in a good position to explore refinements of the algorithm, alternative feature encodings for the data, or competing algorithms by the end of the semester.
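As a concrete starting point for the milestone, here is a minimal sketch (in Python/NumPy rather than the provided Matlab code) of a Gaussian Naive Bayes classifier plus a k-fold cross-validation harness. Data loading is omitted; the feature matrix X stands in for whatever voxel/window encoding you choose:

```python
import numpy as np

def fit_gnb(X, y):
    """Fit Gaussian Naive Bayes: per-class mean/variance per feature, plus prior."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-6, len(Xc) / len(X))
    return params

def predict_gnb(params, X):
    """Predict by maximizing class-conditional log-likelihood plus log-prior."""
    order = sorted(params)
    scores = []
    for c in order:
        mu, var, prior = params[c]
        ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var, axis=1)
        scores.append(ll + np.log(prior))
    return np.array(order)[np.argmax(np.array(scores), axis=0)]

def cv_accuracy(X, y, k=5, seed=0):
    """k-fold cross-validated accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        params = fit_gnb(X[train], y[train])
        accs.append(np.mean(predict_gnb(params, X[test]) == y[test]))
    return float(np.mean(accs))
```

Swapping fit_gnb/predict_gnb for a TAN learner while keeping the same cross-validation harness is one way to organize the project.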
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004, "Bayesian Network Classifiers" Friedman et al., 1997.
Project A2: Dimensionality reduction for fMRI data
Project idea: Explore the use of dimensionality-reduction methods to improve classification accuracy with this data. Given the extremely high dimension of the classifier input (5,000 voxels times 8 images), it is sensible to explore methods for reducing this to a small number of dimensions. For example, consider PCA, hidden layers of neural nets, or other relevant dimensionality-reduction methods. PCA is an example of a method that finds lower-dimensional representations that minimize error in reconstructing the data. In contrast, neural network hidden layers are lower-dimensional representations of the inputs that minimize classification error (but only find a local minimum). Does one of these work better? Does it depend on parameters such as the number of training examples?
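For the PCA branch of this comparison, a minimal NumPy implementation via the singular value decomposition; the input here is any n_examples x n_features matrix (in the project it would be your voxel encoding):

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components.
    Returns the k-dimensional representation Z and the reconstruction X_hat."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # center each feature
    # economy SVD: rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                               # reduced representation
    X_hat = Z @ Vt[:k] + mean                       # reconstruction from k components
    return Z, X_hat
```

The reconstruction error ||X - X_hat|| is exactly the quantity PCA minimizes for a given k, which makes it a natural diagnostic to report alongside classification accuracy.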
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004, papers and textbook on PCA, neural nets, or whatever you propose to try.
Project A3: Feature selection/feature invention for fMRI classification.
Project idea: As in many high dimensional data sets, automatic selection of a subset of features can have a strong positive impact on classifier accuracy. We have found that selecting features by the difference in their activity when the subject performs the task, relative to their activity while the subject is resting, is one useful strategy [Mitchell et al., 2004]. In this project you could suggest, implement, and test alternative feature selection strategies (e.g., consider the incremental value of adding a new feature to the current feature set, instead of scoring each feature independently of the other features being selected), and see whether you can obtain higher classification accuracies. Alternatively, you could consider methods for synthesizing new features (e.g., define the 'smoothed value' of a voxel in terms of a spatial Gaussian kernel function applied to it and its neighbors, or define features by averaging voxels whose time series are highly correlated).
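A minimal sketch of the task-versus-rest selection strategy described above, assuming you have separate (timepoints x voxels) arrays for task and rest periods. The statistic is a simple two-sample t-like score, which is one plausible reading of the contrast, not necessarily the exact computation in [Mitchell et al., 2004]:

```python
import numpy as np

def select_voxels_by_contrast(task, rest, k):
    """Rank voxels by a two-sample t-like statistic comparing task activity
    to rest activity, and return the indices of the top-k voxels.
    task, rest: arrays of shape (n_timepoints, n_voxels)."""
    diff = task.mean(axis=0) - rest.mean(axis=0)
    pooled = np.sqrt(task.var(axis=0) / len(task)
                     + rest.var(axis=0) / len(rest)) + 1e-12
    score = np.abs(diff / pooled)
    return np.argsort(score)[::-1][:k]
```

An incremental (greedy, set-aware) selection strategy would replace the independent per-voxel scoring here with a score for adding each candidate voxel to the already-selected set.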
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004, papers on feature selection
B: Movie rating data
This data set consists of two main parts:
IMDB data: a movie database consisting of many different attributes about movies, for example movie title, genre, actors/actresses, directors, company, year, etc.
User rating data: user ratings on different movies
Project B1: Classification: predict the user rating based on movie information. This involves feature selection and classification.
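A common first step is to turn categorical movie attributes into binary indicator features that a standard classifier can consume. A minimal sketch, where the field names ("genre", "company") are illustrative and not the actual IMDB schema:

```python
import numpy as np

def one_hot_encode(movies, fields):
    """Encode categorical movie attributes as binary indicator features.
    `movies` is a list of dicts mapping field names to values."""
    vocab = {}   # (field, value) -> column index
    for m in movies:
        for f in fields:
            key = (f, m[f])
            if key not in vocab:
                vocab[key] = len(vocab)
    X = np.zeros((len(movies), len(vocab)))
    for i, m in enumerate(movies):
        for f in fields:
            X[i, vocab[(f, m[f])]] = 1.0
    return X, vocab
```

Multi-valued fields such as the actor list would need one indicator per (field, actor) pair rather than one per full list, which is a straightforward extension of the same idea.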
Project B2: Clustering evolution: much of the study of social networks tries to identify the community structure in the relations among people. The most common approach is to cluster people based on their interactions. An interesting study would be to model how the clusters change over time.
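One simple way to model cluster change is to cluster each time window separately and then match clusters across adjacent windows by set overlap; clusters with no good match are treated as emerging or dissolving communities. A sketch, assuming clusters are represented as sets of person ids:

```python
def jaccard(a, b):
    """Overlap between two clusters, each a set of member ids."""
    return len(a & b) / len(a | b)

def match_clusters(old, new, threshold=0.3):
    """Greedily match each cluster in `new` to its best-overlapping cluster
    in `old`. Unmatched new clusters (match is None) are newly emerged
    communities; the threshold is an arbitrary illustrative choice."""
    matches = {}
    for j, c_new in enumerate(new):
        best_i, best_s = None, threshold
        for i, c_old in enumerate(old):
            s = jaccard(c_old, c_new)
            if s > best_s:
                best_i, best_s = i, s
        matches[j] = best_i
    return matches
```

Tracking these matches over many consecutive windows gives a lineage for each community, from which birth, death, merge, and split events can be read off.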
C: DBLP bibliography data
This data set consists of computer science bibliography data. The data is well-organized and can be downloaded at http://dblp.uni-trier.de/xml/. An interesting browser for viewing this dataset is also available.
Project C1: Clustering evolution: much of the study of social networks tries to identify the community structure in the relations among people. The most common approach is to cluster people based on their interactions. An interesting study would be to model how the clusters change over time.
Project C2: Distance measure study: a good distance function is crucial to the success of any learning algorithm. This is especially true for a heterogeneous dataset like this one, where a naive distance function such as Euclidean distance is ill-defined.
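One simple candidate is a Gower-style distance that handles numeric and categorical attributes differently and averages the per-attribute contributions. A sketch, where the attribute names and ranges are illustrative rather than taken from the actual DBLP schema:

```python
def mixed_distance(a, b, numeric_keys, categorical_keys, ranges):
    """Gower-style distance for records with mixed attribute types:
    numeric attributes contribute a range-normalized absolute difference,
    categorical attributes contribute a 0/1 mismatch; the result is the
    average contribution over all attributes, so it lies in [0, 1]."""
    total = 0.0
    for k in numeric_keys:
        total += abs(a[k] - b[k]) / ranges[k]
    for k in categorical_keys:
        total += 0.0 if a[k] == b[k] else 1.0
    return total / (len(numeric_keys) + len(categorical_keys))
```

A more ambitious version of the project would learn per-attribute weights instead of the uniform average used here.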
This data set contains 1000 text articles posted to each of
20 online newsgroups, for a total of 20,000 articles. Documentation and
download are available at this website. This data is useful for a variety of text classification
and/or clustering projects. The "label" of each article is which of the 20
newsgroups it belongs to. The newsgroups (labels) are hierarchically
organized (e.g., "sports", "hockey").
Available software: The same website provides an implementation of a Naive Bayes classifier for this text data. The code is quite robust, and some documentation is available, but it is difficult code to modify.
EM for text classification in the case where you have
labels for some documents, but not for others (see McCallum et al.,
and come up with your own suggestions)
Make up your own text learning problem/approach
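For the EM suggestion above, the semi-supervised Naive Bayes scheme in the style of McCallum and colleagues alternates between estimating class parameters from (soft) labels and re-estimating class posteriors for the unlabeled documents. A minimal NumPy sketch over word-count vectors (the tiny vocabulary in the test is synthetic; real data would use the newsgroup articles):

```python
import numpy as np

def nb_fit(counts, resp, alpha=1.0):
    """M-step: estimate class priors and word probabilities from soft labels.
    counts: (n_docs, n_words) word-count matrix; resp: (n_docs, n_classes)
    class responsibilities (one-hot rows for labeled documents)."""
    priors = resp.sum(axis=0) / resp.sum()
    word_counts = resp.T @ counts + alpha          # Laplace smoothing
    word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
    return np.log(priors), np.log(word_probs)

def nb_posterior(counts, log_priors, log_word_probs):
    """E-step: posterior class responsibilities for each document."""
    log_post = counts @ log_word_probs.T + log_priors
    log_post -= log_post.max(axis=1, keepdims=True)   # stabilize exp
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def em_naive_bayes(counts, labels, n_classes, n_iters=10):
    """Semi-supervised multinomial Naive Bayes via EM:
    labels[i] is a class index for labeled docs and -1 for unlabeled docs."""
    resp = np.full((len(labels), n_classes), 1.0 / n_classes)
    labeled = labels >= 0
    resp[labeled] = np.eye(n_classes)[labels[labeled]]
    for _ in range(n_iters):
        lp, lw = nb_fit(counts, resp)
        resp = nb_posterior(counts, lp, lw)
        resp[labeled] = np.eye(n_classes)[labels[labeled]]  # clamp labeled docs
    return lp, lw
```

The interesting experimental axis is how accuracy varies as the fraction of labeled documents shrinks, compared to a Naive Bayes trained on the labeled subset alone.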
F: Data mining competition data
This dataset comes from a recent real data mining competition (http://mill.ucsd.edu/).
When a company is
evaluating whether an individual is a 'good' or 'bad' customer, it uses
historical information from that customer's account. For example, a credit
card company might be interested in identifying customers that are likely to
go bankrupt. The company will use past transaction information to predict
future bankruptcy. Once a potentially 'bad' account is predicted, the
company will take additional steps to verify the actual nature of the account.
Our goal is to develop a system that uses historical data and accurately predicts which accounts are likely to be 'bad.'
There is a cost for the company if we inaccurately
predict an account to be a 'good' account. For example, the credit
card company will have to pay for a customer's bankruptcy. There is
also a cost if we inaccurately predict an account to be 'bad'. In
our example, the company might launch a costly investigation or
prematurely cut off a good customer's account. Also, we are
interested in detecting 'bad' behavior as soon as possible. For
example, suppose a customer unknowingly has her identity stolen. We
want to take action, such as calling the customer, as soon as possible.
For each customer, there is a time series of between
1 and 10,000 records.
Each record contains 41 pieces of information:
The first value is the account id.
The second value is the record id. These are consecutive in time, but not sampled at any regular intervals.
Values 3 through 41 are data (boolean, real, integer) associated with each record.
The training data also has a 42nd value which is the record label.
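A parsing sketch for this record format; the comma delimiter and the exact field types are assumptions, so check them against the actual competition files:

```python
def parse_record(line):
    """Parse one record line into (account_id, record_id, features, label).
    Values 3 through 41 become the 39-element feature list; the label is the
    42nd value in training data and None for test data."""
    parts = line.strip().split(",")
    account_id, record_id = parts[0], int(parts[1])
    if len(parts) == 42:                          # training data has a label
        return account_id, record_id, [float(v) for v in parts[2:41]], int(parts[41])
    return account_id, record_id, [float(v) for v in parts[2:]], None
```

Grouping parsed records by account_id and sorting by record_id recovers the per-customer time series described above.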
You don't need any specific domain knowledge about
values 3 through 41 to solve the problem. This is where machine
learning is useful.
Each record has a binary record label. This label is '1' for 'bad' and '0' for 'good'. The first record of an account can either start out labeled as 'good' or 'bad', but once there is a 'bad' record, all the following records for the account will also have a 'bad' label. A 'bad' account is one which has at least one record with a 'bad' label.
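Because of this labeling rule, both the account-level label and the "when does it turn bad" quantity follow directly from the time-ordered record labels of one account:

```python
def account_summary(record_labels):
    """Given the time-ordered 0/1 record labels of one account, return
    (account_label, first_bad_index): the account is 'bad' (1) iff any
    record is bad, and first_bad_index is the position of the first bad
    record (None if the account never goes bad)."""
    for i, lab in enumerate(record_labels):
        if lab == 1:
            return 1, i
    return 0, None
```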
There are two separate competitions:
Project F1: Classification
You are given the account information for a number of customers and must predict who the 'bad' customers are (i.e. customers that have accounts with at least one 'bad' record label).
Project F2: Time Series
You are given account information for a number of customers and must determine when the customer becomes 'bad' (i.e. when the first 'bad' record occurs).
Note that these tasks are not independent of one another.
Datasets can be downloaded at http://mill.ucsd.edu/index.php?page=Datasets&subpage=AllData
G: Physiological data
Physiological data offers many challenges to the machine learning community, including dealing with large amounts of data, sequential data, issues of sensor fusion, and a rich domain complete with noise, hidden variables, and significant effects of context.
Datasets can be downloaded from http://www.cs.utexas.edu/users/sherstov/pdmc/
Project G1: Behavior classification: classify the person based on the sensor measurements.
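A common first step for sensor streams is to summarize them with sliding-window statistics before handing them to a classifier. A minimal sketch over a single 1-D signal (the window width and step are arbitrary illustrative choices):

```python
import numpy as np

def window_features(signal, width, step):
    """Summarize a 1-D sensor stream as per-window (mean, std) pairs,
    a common feature representation for physiological time series."""
    feats = []
    for start in range(0, len(signal) - width + 1, step):
        w = signal[start:start + width]
        feats.append((float(np.mean(w)), float(np.std(w))))
    return feats
```

With several sensors, the per-window features from each stream would simply be concatenated into one feature vector per window.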
H: NBA statistics data
This download contains 2004-2005 NBA and ABA stats for:
-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - nba coaching records by season
-coaches_career.txt - nba career coaching records
Currently all of the regular season
Project H1: Outlier detection on players: find out who the outstanding players are.
Project H2: Predict the game outcome.
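As a baseline for outcome prediction, one could fit a logistic regression on differences between the two teams' season statistics. A NumPy sketch with gradient descent; the features in the test are synthetic, and mapping the actual stat files into a feature matrix is the real work of the project:

```python
import numpy as np

def train_logistic(X, y, lr=0.1, n_iters=2000):
    """Fit logistic regression by batch gradient descent.
    X: rows of features (e.g. stat differences between the two teams);
    y: 1.0 if the first team won, else 0.0."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted win probability
        grad = p - y                             # gradient of the log-loss
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def predict_win(w, b, x):
    """True if the first team is predicted to win."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b))) > 0.5
```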
I: Web->KB data