Datasets and project suggestions: Below are descriptions of several datasets and some suggested projects. The first few are spelled out in greater detail. You are encouraged to select and flesh out one of these projects, or to make up your own well-specified project using these datasets. Working on an alternative dataset must be approved by the instructors.


time series: A:fMRI F:bodymedia G:NBA

relational data: B:IMDB C:DBLP E:creditCard

text data: D:Newsgroup H:WebKB

A: Brain imaging data (fMRI)

This data is available here

This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects. 
Available software: we can provide Matlab software for reading, manipulating, and visualizing the data, and for training some types of classifiers (Gaussian Naive Bayes, SVM).

Project A1: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent) by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include as classifier input (5000 voxels times 8 seconds of images is a lot of features), whether to train subject-specific or subject-independent classifiers, and a number of issues about efficient computation with this fairly large data set. Midpoint milestone: By Nov 8 you should have run at least one classification algorithm on this data and measured its accuracy using a cross-validation test. This will put you in a good position to explore refinements of the algorithm, alternative feature encodings for the data, or competing algorithms by the end of the semester.
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004; "Bayesian Network Classifiers," Friedman et al., 1997.
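For reference, the Gaussian Naive Bayes baseline is short enough to sketch from scratch. The code below is a toy illustration only (function names and data layout are made up, not the provided Matlab software): each class is modeled by independent per-feature Gaussians, and prediction picks the class with the highest log-posterior.

```python
import math
from collections import defaultdict

def train_gnb(X, y):
    """Fit per-class, per-feature Gaussian parameters (mean, variance)."""
    params = {}
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    n = len(X)
    for c, rows in by_class.items():
        means = [sum(col) / len(rows) for col in zip(*rows)]
        # small variance floor avoids division by zero for constant features
        vars_ = [max(sum((v - m) ** 2 for v in col) / len(rows), 1e-6)
                 for col, m in zip(zip(*rows), means)]
        params[c] = (math.log(len(rows) / n), means, vars_)
    return params

def predict_gnb(params, x):
    """Return the class with the highest log-posterior under
    the conditional-independence assumption."""
    best, best_lp = None, -math.inf
    for c, (log_prior, means, vars_) in params.items():
        lp = log_prior
        for v, m, s2 in zip(x, means, vars_):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

A TAN classifier would extend this by adding a tree of dependencies among the features instead of assuming full conditional independence.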

Project A2: Dimensionality reduction for fMRI data
Project idea:  Explore the use of dimensionality-reduction methods to improve classification accuracy with this data.  Given the extremely high dimension of the classifier's input (5000 voxels times 8 images), it is sensible to explore methods for reducing it to a small number of dimensions. For example, consider PCA, hidden layers of neural nets, or other relevant dimensionality-reduction methods.  PCA is an example of a method that finds lower-dimensional representations that minimize error in reconstructing the data.  In contrast, neural network hidden layers are lower-dimensional representations of the inputs that minimize classification error (but only find a local minimum).  Does one of these work better?  Does it depend on parameters such as the number of training examples?
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004, papers and textbook on PCA, neural nets, or whatever you propose to try.
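As a starting point, the reconstruction view of PCA can be illustrated without any library support. The sketch below is toy code not tied to the fMRI data: it finds the direction of maximum variance by power iteration on the sample covariance matrix; projecting onto the top few such directions gives the lower-dimensional representation.

```python
import math

def top_principal_component(X, iters=200):
    """Power iteration on the sample covariance matrix: returns the
    unit vector along the direction of maximum variance."""
    n, d = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    centered = [[x - m for x, m in zip(row, means)] for row in X]
    # covariance matrix C = (1/n) * Xc^T Xc
    C = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
         for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

Repeating this on the data after removing (deflating) each found component yields the next principal component, which is the basic recipe behind a from-scratch PCA.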

Project A3: Feature selection/feature invention for fMRI classification.
Project idea:  As in many high-dimensional data sets, automatic selection of a subset of features can have a strong positive impact on classifier accuracy.  We have found that selecting features by the difference in their activity when the subject performs the task, relative to their activity while the subject is resting, is one useful strategy [Mitchell et al., 2004].  In this project you could suggest, implement, and test alternative feature selection strategies (e.g., consider the incremental value of adding a new feature to the current feature set, instead of scoring each feature independently of the others being selected), and see whether you can obtain higher classification accuracies.   Alternatively, you could consider methods for synthesizing new features (e.g., define the 'smoothed value' of a voxel in terms of a spatial Gaussian kernel function applied to it and its neighbors, or define features by averaging voxels whose time series are highly correlated).
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004, papers on feature selection
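One plausible reading of the task-versus-rest strategy is a t-statistic-like score per voxel. The sketch below is illustrative only (the data layout and function names are assumptions, not the actual fMRI file format): score each voxel by its mean activity difference between task and rest, normalized by the pooled standard deviation, then keep the top k.

```python
import math

def activity_score(task_vals, rest_vals):
    """Score a voxel by |mean difference| / pooled std between
    task-period and rest-period activations."""
    def mean(v): return sum(v) / len(v)
    def var(v, m): return sum((x - m) ** 2 for x in v) / max(len(v) - 1, 1)
    mt, mr = mean(task_vals), mean(rest_vals)
    pooled = math.sqrt((var(task_vals, mt) + var(rest_vals, mr)) / 2) or 1e-9
    return abs(mt - mr) / pooled

def select_top_voxels(task, rest, k):
    """task/rest: dict voxel_id -> list of activations.
    Returns the k highest-scoring voxel ids."""
    scores = {vid: activity_score(task[vid], rest[vid]) for vid in task}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

An incremental (greedy, set-aware) strategy would instead re-score candidate voxels given the ones already selected, which is exactly where this independent scoring can be improved.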


B: IMDB Movie database and user rating data

This data set consists of two main parts:

Project ideas:

Project B1: Classification: predict the user rating from movie information. This involves feature selection and classification.

Project B2: Clustering evolution: much of the work on social networks tries to identify community structure in the relations among people. The most common approach is to cluster people based on their interactions. An interesting study would be to model how these clusters change over time.
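A minimal way to quantify cluster change over time is to compare cluster memberships across two snapshots. The sketch below uses a hypothetical data layout (cluster name -> set of member ids): each old cluster is matched to the new cluster with the highest Jaccard overlap, and a low best overlap signals that the community has changed.

```python
def cluster_overlap(old_clusters, new_clusters):
    """For each old cluster, find the new cluster with the highest
    Jaccard overlap of members. Returns dict:
    old_name -> (best_new_name, jaccard)."""
    matches = {}
    for name, members in old_clusters.items():
        best_name, best_j = None, 0.0
        for new_name, new_members in new_clusters.items():
            inter = len(members & new_members)
            union = len(members | new_members)
            j = inter / union if union else 0.0
            if j > best_j:
                best_name, best_j = new_name, j
        matches[name] = (best_name, best_j)
    return matches
```

Tracking these matches across many consecutive snapshots gives a simple picture of communities growing, shrinking, merging, or dissolving.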

C: DBLP Research publication database

This data set consists of computer science bibliography data. The data is well-organized and available for download; an interesting browser for viewing the dataset is also available.

Project ideas:

Project C1: Clustering evolution: much of the work on social networks tries to identify community structure in the relations among people. The most common approach is to cluster people based on their interactions. An interesting study would be to model how these clusters change over time.

Project C2: Distance measure study: a good distance function is crucial for the success of any learning algorithm. This is especially true for a heterogeneous dataset like this one, where naive distance functions such as Euclidean distance are undefined.
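To make the problem concrete, one common workaround is a weighted sum of per-field distances, each scaled to [0, 1]. The sketch below uses a hypothetical author record with coauthor and venue sets plus a year field; the field names and weights are invented for illustration, not part of the DBLP format.

```python
def author_distance(a, b, w_coauth=0.5, w_venue=0.3, w_year=0.2):
    """Toy distance between two author records with heterogeneous fields.
    a, b: dicts with 'coauthors' (set), 'venues' (set), 'active_year' (int).
    Each component is scaled to [0, 1] before weighting."""
    def set_dist(s, t):
        # Jaccard distance for set-valued fields
        union = s | t
        return 1.0 - (len(s & t) / len(union) if union else 1.0)
    d_co = set_dist(a['coauthors'], b['coauthors'])
    d_ve = set_dist(a['venues'], b['venues'])
    d_yr = min(abs(a['active_year'] - b['active_year']) / 50.0, 1.0)
    return w_coauth * d_co + w_venue * d_ve + w_year * d_yr
```

Choosing the weights (or learning them from data) is precisely the distance-measure study this project proposes.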

D: Twenty Newsgroups text data

This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles.  For documentation and download, see this website.  This data is useful for a variety of text classification and/or clustering projects.  The "label" of each article is the newsgroup it belongs to.  The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").

Available software: The same website provides an implementation of a Naive Bayes classifier for this text data.  The code is quite robust, and some documentation is available, but it is difficult code to modify.
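Since the provided code is hard to modify, it may help to know that a multinomial Naive Bayes text classifier is short enough to write from scratch. The sketch below is a toy stand-in for that implementation, not the actual code: count words per class with Laplace smoothing, then classify by highest log-posterior.

```python
import math
from collections import Counter, defaultdict

def train_text_nb(docs, labels):
    """docs: list of token lists. Returns per-class
    (log prior, per-word log likelihood, unseen-word log likelihood)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model, n = {}, len(docs)
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Laplace (add-one) smoothing over the shared vocabulary
        loglik = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                  for w in vocab}
        unk = math.log(1 / (total + len(vocab)))
        model[c] = (math.log(class_counts[c] / n), loglik, unk)
    return model

def classify_text_nb(model, tokens):
    best, best_lp = None, -math.inf
    for c, (log_prior, loglik, unk) in model.items():
        lp = log_prior + sum(loglik.get(w, unk) for w in tokens)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

For the real dataset, the tokens would come from tokenizing each article's body, and the 20 newsgroup names serve as the labels.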

Project ideas:

E: Credit card data

This dataset comes from a recent real data mining competition.

When a company is evaluating whether an individual is a 'good' or 'bad' customer, it uses historical information from that customer's account.  For example, a credit card company might be interested in identifying customers that are likely to go bankrupt.  The company will use past transaction information to predict future bankruptcy.  Once a potentially 'bad' account is predicted, the company will take additional steps to verify the actual nature of the account.
Our goal is to develop a system that uses historical data and accurately predicts which accounts are likely to be 'bad.'

Costs for Wrong Predictions

There is a cost for the company if we inaccurately predict an account to be a 'good' account. For example, the credit card company will have to pay for a customer's bankruptcy. There is also a cost if we inaccurately predict an account to be 'bad'. In our example, the company might launch a costly investigation or prematurely cut off a good customer's account. Also, we are interested in detecting 'bad' behavior as soon as possible. For example, suppose a customer unknowingly has her identity stolen.  We want to take action, such as calling the customer, as soon as possible.
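The asymmetric costs above can be made concrete with a simple expected-cost decision rule; the probabilities and cost figures in the example below are hypothetical, not from the competition.

```python
def decide(p_bad, cost_false_good, cost_false_bad):
    """Given the predicted probability that an account is 'bad' and the
    two error costs, choose the label with the lower expected cost.
    cost_false_good: cost of labeling a truly bad account 'good'
                     (e.g., paying for a bankruptcy)
    cost_false_bad:  cost of labeling a truly good account 'bad'
                     (e.g., a needless investigation)"""
    expected_cost_if_good = p_bad * cost_false_good
    expected_cost_if_bad = (1 - p_bad) * cost_false_bad
    return 'bad' if expected_cost_if_bad < expected_cost_if_good else 'good'
```

Note how a large bankruptcy cost pushes the decision threshold well below 0.5: even a fairly unlikely 'bad' account may be worth investigating.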

Technical Details

For each customer, there is a time series of between 1 and 10,000 records.
Each record contains 41 pieces of information:

  • The first value is the account id.

  • The second value is the record id.  These are consecutive in time, but not sampled at any regular intervals.

  • Values 3 through 41 are data (boolean, real, integer) associated with each record.

  • The training data also has a 42nd value which is the record label.

You don't need any specific domain knowledge about values 3 through 41 to solve the problem. This is where machine learning is useful.

Each record has a binary record label.  This label is '1' for 'bad' and '0' for 'good'. The first record of an account can either start out labeled as 'good' or 'bad', but once there is a 'bad' record, all the following records for the account will also have a 'bad' label.  A 'bad' account is one which has at least one record with a 'bad' label.

Project ideas:

There are two separate competitions:

Project E1: Classification Task
You are given the account information for a number of customers and must predict who are the 'bad' customers (i.e. customers that have accounts with at least one 'bad' record label).

Project E2: Time Series Analysis Task
You are given account information for a number of customers and must determine when the customer becomes 'bad' (i.e. when the first 'bad' record occurs).

Note that these tasks are not independent of one another.
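The labeling convention above can be summarized in code. The sketch below (assumed record layout, not the actual file format) derives exactly the two quantities the tasks ask for: whether an account is 'bad', and when it first went 'bad'.

```python
def account_summary(records):
    """records: list of (record_id, label) tuples for one account,
    ordered by record id; label is 1 ('bad') or 0 ('good').
    Once a record is 'bad', all later records are 'bad', so the first
    1-label determines both the account label and the onset time.
    Returns (account_is_bad, first_bad_record_id or None)."""
    for record_id, label in records:
        if label == 1:
            return True, record_id
    return False, None
```

This also makes the dependence between the two tasks explicit: a correct onset time immediately yields the account-level label, but not vice versa.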

Datasets can be downloaded at

F: Physiological Data Modeling (bodymedia)

Physiological data offers many challenges to the machine learning community including dealing with large amounts of data, sequential data, issues of sensor fusion, and a rich domain complete with noise, hidden variables, and significant effects of context.

1. Which sensors correspond to each column?

characteristic1 age
characteristic2 handedness
sensor1 gsr_low_average
sensor2 heat_flux_high_average
sensor3 near_body_temp_average
sensor4 pedometer
sensor5 skin_temp_average
sensor6 longitudinal_accelerometer_SAD
sensor7 longitudinal_accelerometer_average
sensor8 transverse_accelerometer_SAD
sensor9 transverse_accelerometer_average

2. What are the activities behind each annotation?

The annotations for the contest were:
5102 = sleep
3104 = watching TV

Datasets can be downloaded from

Project idea:

Project F1: behavior classification: classify the activity (e.g., sleep vs. watching TV) from the sensor measurements.
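As a baseline for this task, a nearest-centroid classifier is about the simplest thing that could work. In the sketch below the sensor values are invented and the labels reuse the annotation codes listed above (5102 = sleep, 3104 = watching TV): average the training rows per activity, then assign a new reading to the nearest centroid.

```python
import math

def train_centroids(rows, labels):
    """rows: lists of sensor readings; returns per-label mean vector."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def classify(centroids, row):
    """Assign the reading to the label with the closest centroid."""
    return min(centroids,
               key=lambda lab: math.dist(row, centroids[lab]))
```

In practice the sensor columns have very different scales (e.g., pedometer counts vs. skin temperature), so normalizing each column before computing distances is usually necessary.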

G: NBA statistics data

This download contains 2004-2005 NBA and ABA stats for:

-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - nba coaching records by season
-coaches_career.txt - nba career coaching records

Currently all of the regular season

Project ideas:

Project G1: outlier detection on players: find out which players are outstanding.
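A simple starting point for outlier detection is a z-score rule; the statistic and player names in the example are made up, but the same idea applies to any per-player column such as points per game.

```python
import statistics

def outliers(stats_by_player, z_threshold=2.0):
    """stats_by_player: dict name -> scalar stat (e.g., points per game).
    Flags players whose stat is more than z_threshold standard
    deviations above the mean."""
    values = list(stats_by_player.values())
    mu = statistics.mean(values)
    sd = statistics.pstdev(values) or 1e-9
    return [name for name, v in stats_by_player.items()
            if (v - mu) / sd > z_threshold]
```

A more interesting variant scores players in the multivariate stat space (e.g., Mahalanobis distance), which can surface players who are unusual in a combination of stats rather than in any single one.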

Project G2: predict the game outcome.

H: Web->KB data


The goal is to develop a probabilistic, symbolic knowledge base that mirrors the content of the world wide web. If successful, this will make text information on the web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.


The first experiments consisted of extracting knowledge about computer science departments. We have assembled two data sets for this task: