Following are potential course projects for 15-781, Machine Learning. If you would like to pursue one of these ideas, contact the corresponding faculty by email. You may also propose your own project, subject to approval by the instructor.

In either case, please turn in a one-page project proposal to Tom.Mitchell@cmu.edu by the beginning of class on Oct 29. Your proposal should explain (1) the problem you will look at, (2) the approach and algorithm(s) you will use, and (3) how you will evaluate the results.

============================================================================

TITLE: Text Classification with Bayesian Methods

PROJECT SUPERVISOR: Andrew McCallum (andrew.mccallum@cs.cmu.edu)

DESCRIPTION: Given the growing volume of online text, automatic document classification is of great practical value, and an increasingly important area for research. Naive Bayes has been applied to this problem with considerable success; however, naive Bayes makes many assumptions about data distribution that are clearly not true of real-world text. This project will aim to improve upon naive Bayes by selectively removing some of these assumptions. I imagine beginning the project by removing the assumption that document length is independent of class---thus designing a new version of naive Bayes that uses document length to help classify more accurately. If we finish this, we'll move on to other assumptions, such as the word independence assumption, and experiment with methods that capture some dependencies between words. The paper at http://www.cs.cmu.edu/~mccallum/papers/multinomial-aaai98w.ps is a good place to start some reading. You should be highly proficient in C programming, since you'll be modifying rainbow (http://www.cs.cmu.edu/~mccallum/bow/rainbow).
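To make the first step concrete, here is a minimal sketch (in Python, rather than the C you would use inside rainbow) of multinomial naive Bayes extended with a per-class document-length term. The Poisson length model is my own illustrative assumption; other length distributions are equally reasonable starting points.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels):
    """Train multinomial naive Bayes plus a per-class Poisson length model.
    docs: list of token lists; labels: parallel list of class names."""
    classes = set(labels)
    word_counts = {c: Counter() for c in classes}
    lengths = defaultdict(list)
    priors = Counter(labels)
    vocab = set()
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
        lengths[c].append(len(doc))
        vocab.update(doc)
    model, n = {}, len(docs)
    for c in classes:
        total = sum(word_counts[c].values())
        model[c] = {
            "log_prior": math.log(priors[c] / n),
            # Laplace-smoothed word probabilities
            "log_word": {w: math.log((word_counts[c][w] + 1) /
                                     (total + len(vocab))) for w in vocab},
            "mean_len": sum(lengths[c]) / len(lengths[c]),  # Poisson rate
        }
    return model

def classify(model, doc):
    """Score each class; unlike standard naive Bayes, the score includes
    log P(length | class) under the Poisson assumption."""
    def score(c):
        m = model[c]
        s = m["log_prior"]
        lam, k = m["mean_len"], len(doc)
        s += k * math.log(lam) - lam - math.lgamma(k + 1)  # Poisson log-pmf
        for w in doc:
            if w in m["log_word"]:
                s += m["log_word"][w]
        return s
    return max(model, key=score)
```

Dropping the length term recovers ordinary multinomial naive Bayes, which makes before/after comparisons straightforward.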

============================================================================

TITLE: Support vector machines for face recognition

PROJECT SUPERVISOR: Rahul Sukthankar (rahuls@cs) Adjunct Faculty, Robotics Institute

DESCRIPTION: Face recognition is a learning problem that has recently received a lot of attention. One standard approach involves reducing the dimensionality of the problem using PCA and then selecting the nearest class (eigenfaces). Support Vector Machines (SVMs) are becoming very popular in the machine learning community as a technique for tackling high-dimensional problems. No one has yet (to my knowledge) applied SVMs to face recognition. Can SVMs outperform standard face recognition algorithms?

Issues that the student should address:
- How best to apply SVM to the n-class problem of face recognition;
- Figure out training and/or image preprocessing strategies (wavelets?);
- Determine how SVMs compare to other techniques (see notes)
Notes:
- A good implementation of SVMs is available (Thorsten's SVMlight);
- We can give the student access to two datasets used widely in the community, ORL and FERET, for training & testing;
- We have results for eigenfaces, fisherfaces and JPRC's face recognition system on these datasets, as well as implementations, so comparing SVM to other algorithms will be straightforward.
- I can recommend tutorials and papers on SVMs to supplement what was covered in class, if needed.
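For orientation, the eigenfaces baseline mentioned above can be sketched in a few lines; this is an illustrative outline only, not the implementations we provide with the datasets.

```python
import numpy as np

def eigenfaces_fit(X, k=20):
    """X: (n_images, n_pixels) matrix of flattened face images.
    Returns the mean face and the top-k principal components."""
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the eigenfaces
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def project(mean, components, X):
    """Project images into the k-dimensional eigenface space."""
    return (X - mean) @ components.T

def nearest_class(train_proj, train_labels, query_proj):
    """Classify a projected query by its nearest projected training image."""
    d = np.linalg.norm(train_proj - query_proj, axis=1)
    return train_labels[int(np.argmin(d))]
```

An SVM-based system would replace the nearest-neighbor step (and possibly the PCA step) with trained margin classifiers, one per class or per pair of classes.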

============================================================================

TITLE: Predictive Exponential Models

PROJECT SUPERVISOR: Roni Rosenfeld

DESCRIPTION: A new predictive model recently introduced by Chen & Rosenfeld can incorporate arbitrary features using exponential distributions and sampling (see www.cs.cmu.edu/~roni/wsme.ps). Although the model was originally developed for language modeling, it can be used for prediction or classification in any domain. In this project you will be expected to read and understand this paper, and to apply the model to a Machine Learning problem of your choice. For example, you could choose one of the ML problem cases used in the course and try to improve on the existing "baseline" solution.
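As a rough illustration of the model family (not the Chen & Rosenfeld training procedure, which involves sampling), here is a minimal conditional exponential model, P(c | x) proportional to exp(w_c . f(x)), fit by plain gradient ascent on the log-likelihood; the feature representation and learning rate are placeholder choices.

```python
import numpy as np

def train_maxent(F, y, n_classes, steps=500, lr=0.1):
    """F: (n_examples, n_feats) feature matrix; y: integer class labels.
    Fits one weight vector per class by gradient ascent."""
    n, d = F.shape
    W = np.zeros((n_classes, d))
    Y = np.eye(n_classes)[y]                         # one-hot targets
    for _ in range(steps):
        scores = F @ W.T
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)
        # gradient: empirical feature counts minus expected feature counts
        W += lr * (Y - P).T @ F / n
    return W

def predict(W, f):
    """Return the most probable class for feature vector f."""
    return int(np.argmax(W @ f))
```

The appeal of the exponential form is that f(x) can contain arbitrary, overlapping features; the hard part in practice, as the paper discusses, is choosing them.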

COMMENT: This project is open to more than one student. Each student could work on their own ML problem, or (subject to Tom's approval) we can choose a larger problem for joint work.

============================================================================

TITLE: Natural Language Feature Selection for Exponential Models

PROJECT SUPERVISOR: Roni Rosenfeld

DESCRIPTION: A new predictive model recently introduced by Chen & Rosenfeld can incorporate arbitrary features using exponential distributions and sampling (see www.cs.cmu.edu/~roni/wsme.ps). The model was originally developed for modeling of natural language, and has highlighted feature selection as the main challenge in that domain. In this project you will be expected to read and understand this paper. Then, you will be given two corpora. The first one consists of transcribed over-the-phone conversations. The second corpus is artificial, and was generated from the best existing language model (which was trained on the first corpus). Your job is to use machine learning and statistical methods of your choice (and other methods if you wish) to find systematic differences between the two corpora. These differences translate directly into new features, which will be added to the model in an attempt to improve on it (an improvement in language modeling can increase the quality of language technologies such as speech recognition, machine translation, text classification, spell checking, etc.)
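One simple way to start hunting for systematic differences is a smoothed log-odds comparison of word frequencies between the two corpora; this sketch is only a first step (the interesting differences may well involve longer-range structure), and the add-one smoothing is my own choice.

```python
import math
from collections import Counter

def distinctive_words(corpus_a, corpus_b, top=10):
    """corpus_a, corpus_b: lists of tokens. Ranks words by smoothed
    log-odds of appearing in corpus A versus corpus B; a large absolute
    score flags a systematic difference worth turning into a feature."""
    ca, cb = Counter(corpus_a), Counter(corpus_b)
    na, nb = len(corpus_a), len(corpus_b)
    vocab = set(ca) | set(cb)
    scores = {w: math.log((ca[w] + 1) / (na + len(vocab))) -
                 math.log((cb[w] + 1) / (nb + len(vocab)))
              for w in vocab}
    return sorted(scores, key=lambda w: abs(scores[w]), reverse=True)[:top]
```

Words the real conversations overuse relative to the model-generated corpus (or vice versa) point directly at what the existing language model fails to capture.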

COMMENT: This project is open to several students, who would be working separately.

============================================================================

TITLE: Learning of strategies for energy trading

PROJECT SUPERVISOR: Prof. Sarosh Talukdar, ECE (talukdar@cmu.edu)

DESCRIPTION: In my design course, 39-405, students write software agents for energy trading. Each agent is given an energy quota and money to spend to fill this quota, for each of about 50 consecutive periods. There are penalties for not filling the quota, including death (elimination) if the quotas are not filled in 5 consecutive periods. Agents obtain energy through a double auction: they submit bids, and the highest bids win. After each round, all the bids are made public, so each agent knows what its competitors did in the past. The purpose is to spend as little money as possible and still meet one's quotas, that is, to anticipate what the other agents will bid in the next period, and then bid just enough to get the energy one needs.

The students in 39-405 know little or nothing about automatic learning. They rely strictly on their intuition to devise their bidding algorithms. It would be interesting to see if some of your students could use automatic learning techniques to build winning agents. (We provide an auction simulator with a very easy-to-use interface--it is simple to connect agents written in a variety of languages to the simulator.) If necessary, we can hold practice sessions to give learning-capable agents enough data to do their learning.
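By way of illustration only (the course simulator's actual interface differs, and a winning agent would need something far more sophisticated), a minimal learning bidder might track the clearing price with an exponential moving average and bid a small margin above its forecast; alpha, margin, and the default bid are arbitrary illustrative parameters.

```python
class LearningBidder:
    """Toy agent: forecasts the next clearing price from the public bid
    history with an exponential moving average, then bids just above it."""
    def __init__(self, alpha=0.3, margin=1.05):
        self.alpha, self.margin = alpha, margin
        self.estimate = None

    def observe(self, clearing_price):
        """Update the forecast after each round's bids are made public."""
        if self.estimate is None:
            self.estimate = clearing_price
        else:
            self.estimate += self.alpha * (clearing_price - self.estimate)

    def bid(self, default=10.0):
        """Bid a margin above the forecast (a default before any data)."""
        return default if self.estimate is None else self.margin * self.estimate
```

A more serious agent would also model individual competitors from the published bids and weigh the cost of overpaying against the elimination penalty.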

Contact: Sarosh Talukdar, ECE Dept., Carnegie Mellon University, Pittsburgh, PA 15213; tel: 412 268 8778; fax: 412 268 2860; e-mail: talukdar@cmu.edu; www.ece.cmu.edu/~talukdar/

============================================================================

TITLE: Learning from labeled and unlabeled data

PROJECT SUPERVISOR: Prof. Tom Mitchell, tom.mitchell@cs.cmu.edu

DESCRIPTION: The recent paper by Blum & Mitchell on co-training proposes an algorithm for learning from unlabeled as well as labeled data in certain problem settings (see www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/colt98_final.ps). In this project you will be expected to read and understand this paper, and to extend the experimental results in this paper. In particular, I have some ideas for creating synthetic data sets that test the robustness of the algorithm to changes in the problem setting discussed in the paper.
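For reference, the co-training loop can be sketched roughly as follows. The train/predict/confidence interface is a placeholder I am assuming (the memorizing classifier is just a stand-in to make the sketch runnable), and details such as how many examples each classifier labels per round follow the paper only loosely.

```python
class MemorizeClassifier:
    """Placeholder learner satisfying the interface co_train assumes:
    it memorizes training pairs and is confident only on seen inputs."""
    def __init__(self, examples):
        self.table = {x: y for x, y in examples}
    def predict(self, x):
        return self.table.get(x, 0)
    def confidence(self, x):
        return 1.0 if x in self.table else 0.0

def co_train(L, U, train, rounds=10, per_round=1):
    """Sketch of the co-training loop. L: list of ((view1, view2), label);
    U: list of unlabeled (view1, view2) pairs; train: callable mapping
    labeled examples to a classifier. Each classifier sees one view only."""
    U = list(U)
    for _ in range(rounds):
        if not U:
            break
        h1 = train([(x1, y) for (x1, _), y in L])
        h2 = train([(x2, y) for (_, x2), y in L])
        # each classifier labels the unlabeled examples it is most sure of,
        # growing the labeled set the *other* view learns from next round
        for h, view in ((h1, 0), (h2, 1)):
            U.sort(key=lambda x: h.confidence(x[view]), reverse=True)
            for x in U[:per_round]:
                L.append((x, h.predict(x[view])))
            U = U[per_round:]
    return (train([(x1, y) for (x1, _), y in L]),
            train([(x2, y) for (_, x2), y in L]))
```

Synthetic data lets you control exactly how far the two views are from the conditional-independence assumption the paper's analysis relies on.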

============================================================================

TITLE: Similarity matching in high-dimensional space on discrete data

PROJECT SUPERVISOR: Prof. Latanya Sweeney, Heinz School, latanya.sweeney@cmu.edu

DESCRIPTION: Given a database with hundreds of attributes (or fields) and thousands of tuples (or records), finding similar tuples (records) is very difficult and we do not have any efficient algorithms to accomplish this task. I have some ideas for new algorithms that may prove to be effective. In this project, you will implement these algorithms and explore variants to determine their effectiveness.
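These are not the new algorithms I have in mind, but as a baseline for comparison, one simple approach counts attribute-value agreements through an inverted index, so that candidate neighbors are generated without comparing every pair of tuples.

```python
from collections import defaultdict

def build_index(records):
    """records: list of tuples of discrete attribute values. Builds an
    inverted index from (attribute position, value) to record ids."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        for pos, val in enumerate(rec):
            index[(pos, val)].add(rid)
    return index

def most_similar(records, index, query):
    """Count shared (position, value) pairs via the index and return the
    record agreeing with the query on the most attributes, plus the count.
    Only records sharing at least one value are ever touched."""
    overlap = defaultdict(int)
    for pos, val in enumerate(query):
        for rid in index.get((pos, val), ()):
            overlap[rid] += 1
    best = max(overlap, key=overlap.get)
    return records[best], overlap[best]
```

With hundreds of attributes and thousands of tuples, the interesting questions are how this degrades as similarity becomes approximate rather than exact-match counting, which is where better algorithms are needed.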

============================================================================

TITLE: Using a repository of old text to answer new questions

PROJECT SUPERVISOR: Latanya Sweeney

DESCRIPTION: Consider a repository of email messages in which discussions center around living with a disease, such as celiac disease, heart disease, or diabetes. Newly diagnosed people frequently join the list, resulting in a good number of questions being asked repeatedly. Unfortunately, messages do not adhere to a restricted vocabulary, so traditional web-based keyword searching is often ineffective. In this project, you will use and evaluate algorithms to generate responses to new email messages based on the repository of old email messages. You can begin with a Bayesian text classifier [as discussed in class: Lewis, 1991; Lang, 1995; Joachims, 1996] and a semantic generalization algorithm I have constructed; based on your analysis, you will explore interesting variants to determine the effectiveness of this new approach.
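Alongside the Bayesian classifier, a simple retrieval baseline worth evaluating against is tf-idf cosine matching of a new message against the old ones; this sketch is my own suggestion of a comparison point, not the semantic generalization algorithm mentioned above.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (the old messages). Returns per-document
    tf-idf weight Counters and the idf table for scoring new queries."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) for w in df}
    vecs = [Counter({w: c * idf[w] for w, c in Counter(d).items()})
            for d in docs]
    return vecs, idf

def best_match(vecs, idf, query):
    """Return the index of the old message most similar (by cosine of
    tf-idf vectors) to a new query; unseen words get zero weight."""
    q = Counter({w: c * idf.get(w, 0.0) for w, c in Counter(query).items()})
    def cos(a, b):
        num = sum(a[w] * b[w] for w in a)
        den = (math.sqrt(sum(v * v for v in a.values())) *
               math.sqrt(sum(v * v for v in b.values()))) or 1.0
        return num / den
    return max(range(len(vecs)), key=lambda i: cos(q, vecs[i]))
```

Because list messages use varied vocabulary, the expected gain from the semantic generalization step is precisely where this raw lexical baseline fails.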

============================================================================

TITLE: Datamining of consumer purchase data

PROJECT SUPERVISOR: Prof. Kannan Srinivasan, GSIA (kannans@smtp2.andrew.cmu.edu)

DESCRIPTION: Any student interested in looking at consumer purchase data (frequently bought or frequently transacted items) should see me. I can explain my research work in this area; I have several ideas and expect that at least some of them will be of interest. Students who would like to see me as a group are welcome to do so.