Course Project Guidelines
 
Your class project is an opportunity for you to explore an interesting machine
learning problem of your choice in the context of a real-world data set. Below,
you will find some project ideas, but the best idea would be to combine machine
learning with problems in your own research area. Your class project must be
about new things you have done this semester; you can't use results you have
developed in previous semesters.

Projects can be done individually, or in teams of two students. For a two-person
group, group members are responsible for dividing up the work equally and making
sure that each member contributes.

Your project will be worth 25% of your final class grade, and will have 4
deliverables:

- Proposal, March 24: 1 page (10%)
- Midway Report, April 12: 4-5 pages (25%)
- Poster Presentation, May 3, 2-5 pm, Gates 6115 (20%)
- Final Report, May 9: 8 pages (45%)

Note that all write-ups must be in the form of a NIPS paper. The page limits are
strict! Papers over the limit will not be considered.

Each project will be assigned a 10701 TA as a project consultant/mentor. If you
are having trouble writing a proposal, feel free to consult with a TA or the
instructor. Once you submit your proposal, we will give you feedback. Of course
the final responsibility to define and execute an interesting piece of work is
yours. You are strongly urged to consult a TA or the instructor early on if your
project will rely purely on simulated data or if you intend to do a learning
theory related project.

Each deliverable of your project will be evaluated based on several factors:

- The novelty of the project ideas and applications. Groups are encouraged to
  come up with original ideas and novel applications. A project with new ideas
  (algorithms, methods, theory) on ML, or a new, interesting application of
  existing algorithms, is scored higher than a project without much new idea or
  application.
- The extensiveness of the study and experiments. A project that produces a more
  intelligent system by combining several ML techniques, or that involves
  well-designed experiments and thorough analysis of the experimental results, or
  that nicely incorporates various real-world applications, is scored higher.
- The writing style and the clarity of the written paper.
Project Proposal (Due Date: March 24)
 
A list of suggested projects and data sets is posted below. Read the
        list carefully. You are encouraged to use one of the suggested data sets, because
        we know that they have been successfully used for machine learning in the past.
        If you prefer to use a different data set, we will consider your proposal, but
        you must have access to this data already, and present a clear proposal for what
        you would do with it.
 
 Page limit: Proposals should be one page maximum.
 
 Include the following information:
 
- Project title
- Data set
- Project idea. This should be approximately two paragraphs.
- Software you will need to write.
- Papers to read. Include 1-3 relevant papers. You will probably want to read at
  least one of them before submitting your proposal.
- Teammate (if any) and work division. We expect projects done in a group to be
  more substantial than projects done individually.
- Midterm milestone: What will you complete by April 12? Experimental results of
  some kind are expected here.
Midway Report (Due Date: April 12)
 
This should be a short 4-5 page report in the form of
a NIPS
paper, and it serves as a check-point. It should consist of the
        same sections as your final report (background, method, experiment,
        conclusion and references), with a few sections `under
        construction'. Specifically, the introduction and related work
        sections should be in their almost final form; the section on the
        proposed method should be almost finished; the sections on the
        experiments and conclusions will have whatever results you have
        obtained, as well as `place-holders' for additional results you
        plan/hope to obtain.
 
Page limit: The midway report should be 4-5 pages (5 pages maximum).
 
 Grading scheme for the project report:
 
 
- 70% for the proposed method (which should be almost finished) and the experiments done so far
- 25% for the design of upcoming experiments
- 5% for the plan of activities (in an appendix, please show the old plan and the revised one, along with the activities of each group member)
Poster Presentation (Date: May 3, 2-5 pm, Gates 6115)
 
 
 All project members should be present during the poster hours. The session will
        be open to everybody.
 
 Here are some details on the poster format.
 
 
- If you are a student outside SCS, you will need to check with your department
  to see if there are printing facilities for big posters (we're not sure what is
  offered outside SCS), or just print a collection of regular-sized pages and
  attach them to the poster board.
- The two most common ways to make your poster are:
  - Create one huge slide (in PowerPoint, using Beamer, etc.) and print it out on
    one of the available poster printers (for SCS, details below). You'd want the
    shorter dimension to be less than 36'' since the special poster printers use
    paper rolls of width 36''.
  - Create a bunch of "normal" presentation slides, print out each one on a piece
    of (letter-sized) paper, and put them all together on a poster board. This is
    somewhat messier (and not as pretty), but you don't need to wait for the
    special poster printer.
- We will provide you with foam boards onto which you can tack your poster (or
  your slides for the poster). The foam boards are 32'' by 40''.
- Printing: Unfortunately, we don't have a budget to pay for printing. If you are
  an SCS student, SCS has a poster printer you can use which prints on a
  36''-wide roll of paper. Call Operations and request to be added to the queue;
  they will call you back and tell you when you are able to print. It takes a
  while to print, especially if there are people ahead of you already printing,
  so you can't wait until the last hour (or last day). They are open 24/7, but
  the printer sometimes runs out of paper, so you should really start early! The
  number for Operations is 8-2607, or you can call the help desk (8-4231).
Project Suggestions
 
 
Here are some project ideas and datasets:
 Vector Space Models for Natural Language Processing
 
 Many tasks in natural language processing can be performed with only very shallow
      understanding of text. The vector space model is one example of a useful, but
      shallow, data representation that has been successfully used for many tasks,
      including detecting synonyms, finding analogies, and learning properties of noun
phrases. The vector space model represents the meaning of a noun phrase (NP) (e.g.
      "the New York Yankees" or "house") as a vector of co-occurrence counts with
      contexts. A context is a short snippet of text like "alex rodriguez plays for _" or
      "_ on the street". The model is essentially a (very) large matrix A, whose rows
      represent noun phrases and whose columns represent contexts. The value of entry
      A_{i,j} is the number of times noun phrase i occurred with context j in a large
      corpus of documents (e.g., the Web). Intuitively, this model contains useful
      information because some contexts only occur with certain types of noun phrases;
      for example, the context "athletes, such as _" only occurs with athletes.
 
 Download Dataset and
      Related Tools
 
 Project ideas:
 In this project, we will provide you with an NP-context co-occurrence matrix and
      ask you to do something interesting with it. Possible applications include finding
      synonyms, finding members of categories (i.e., "is this noun phrase an athlete?"),
      or clustering noun phrases to automatically induce categories. For more ideas, we
      recommend reading (1).
 
 Another interesting set of projects could use this data as a case study for
learning in high dimensions. These projects could examine how dimensionality
      reduction or regularization affect performance on one of the above tasks.
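
As a concrete starting point, here is a minimal sketch of finding nearest neighbors of a noun phrase by cosine similarity, before and after dimensionality reduction. The file names, and the assumption that the matrix loads as a SciPy sparse matrix with a parallel list of noun-phrase strings, are hypothetical; adapt them to the format of the provided data.

    import numpy as np
    from scipy import sparse
    from sklearn.preprocessing import normalize
    from sklearn.decomposition import TruncatedSVD

    # Hypothetical file names: a sparse NP-by-context count matrix and the list of
    # noun-phrase strings corresponding to its rows (adapt to the provided format).
    A = sparse.load_npz("np_context_counts.npz")          # shape (n_phrases, n_contexts)
    phrases = open("noun_phrases.txt").read().splitlines()

    def nearest_phrases(rows, query_idx, k=10):
        """Return the k noun phrases whose row vectors are most cosine-similar to the query."""
        unit = normalize(rows)                             # L2-normalize every row
        sims = unit @ unit[query_idx].T                    # cosine similarity to the query row
        sims = np.ravel(sims.toarray()) if sparse.issparse(sims) else np.ravel(sims)
        best = np.argsort(-sims)[1:k + 1]                  # skip the query itself
        return [(phrases[i], float(sims[i])) for i in best]

    # Neighbors under raw co-occurrence counts (very high-dimensional and sparse) ...
    print(nearest_phrases(A, phrases.index("house")))

    # ... and again after reducing to a few hundred latent dimensions with truncated SVD,
    # one way to study how dimensionality reduction affects these tasks.
    A_svd = TruncatedSVD(n_components=200).fit_transform(A)
    print(nearest_phrases(A_svd, phrases.index("house")))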
 
 Vector space models are also used for many other tasks, including document
      classification and topic modeling. The 20 Newsgroups data is a classic data set for
      these tasks. You can obtain this dataset from
      here
 
 (1) Peter D. Turney and Patrick Pantel (2010). From Frequency to Meaning: Vector
      Space Models of Semantics. Journal of Artificial Intelligence Research 37, pp.
      141-188.
 
 
Semi-Supervised Learning
 In many applications, it is easy to obtain a large amount of unlabeled data, but
      difficult or costly to label this data. Semi-supervised learning studies algorithms
      which learn from a small amount of labeled data and a large pool of unlabeled data.
      Interestingly, semi-supervised learning is not always successful, and unlabeled
      data points do not always improve performance. Semi-supervised learning algorithms
      typically make an assumption about the data distribution which enables learning --
      for example, several algorithms assume that the decision boundary should not pass
      through regions with high data density. When this assumption is satisfied, the
      algorithms perform better than supervised learning.
 
 The goal of this project is to experiment with semi-supervised learning algorithms
      on a data set of your choice. Some algorithms you can consider using are:
co-training, self-training, transductive SVMs (S3VMs), or one of the many
      graph-based algorithms. (We recommend reading (1) for a survey of the many
      approaches to semi-supervised learning.) You may compare several semi-supervised
      and supervised algorithms on your data set, and perhaps draw some general
      conclusions about semi-supervised learning.
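
For one possible baseline, here is a minimal self-training sketch. The digits data, the 5% labeled fraction, and the 0.95 confidence threshold are arbitrary illustration choices; any scikit-learn classifier with predict_proba could be substituted.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Pretend only 5% of the training labels are available; the rest are "unlabeled".
    rng = np.random.RandomState(0)
    labeled = rng.rand(len(y_train)) < 0.05
    X_lab, y_lab = X_train[labeled], y_train[labeled]
    X_unlab = X_train[~labeled]

    clf = LogisticRegression(max_iter=1000)
    for _ in range(5):                                    # a few self-training rounds
        clf.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) > 0.95              # pseudo-label only confident points
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, clf.classes_[proba[confident].argmax(axis=1)]])
        X_unlab = X_unlab[~confident]

    print("test accuracy after self-training:", clf.score(X_test, y_test))

Comparing this against the same classifier trained on the labeled points alone is the kind of supervised-vs-semi-supervised comparison the project asks for.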
 
 This project can use essentially any data set. For some ideas, we recommend
      consulting the UC Irvine Machine Learning
      Repository.
 
 (1) Xiaojin Zhu. Semi-supervised Learning Literature Survey. Available Here.
 
 
Privacy-Related Project Ideas
 A recent paper shows how decision trees may be approximated in the "secure
      multiparty" setting, by using tools of cryptography. This allows several parties to
      compute a decision tree on the union of their data without requiring them to share
      the data itself with each other. See here for the algorithm and
      
      here for a general discussion.
 
An alternative notion of privacy which has received much attention lately is
so-called "Differential Privacy" (see, e.g., here). So far, approaches have been
studied for many machine learning methods under this model of privacy (such as
logistic regression, see here, and SVMs, see here). This notion of privacy is
fundamentally different from the usual cryptographic one.
 
 The datasets for this project can be found at the UCI machine learning archive
      (Please consult Rob Hall for more details about the datasets.).
 
 Project Idea 1: Differentially Private Decision Trees
 See whether it is possible to implement a decision tree learner in a
      differentially-private way. This would entail creating a randomized algorithm which
      outputs a decision tree. Furthermore, when one of the elements of the input data is
changed, the distribution over the outputs should not change by much (cf. the
      definition of differential privacy). Analyze under what conditions the approach
      will work and analyze the error rate relative to that of the non-private decision
      tree.
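
One common building block, shown here only as an illustrative sketch, is the Laplace mechanism: the split-selection statistics a decision-tree learner needs are counts, a count changes by at most 1 when one record changes, so adding Laplace(1/epsilon) noise to each count releases it with epsilon-differential privacy. How to divide a total privacy budget across the levels of the tree is exactly the kind of issue this project would need to analyze.

    import numpy as np

    rng = np.random.default_rng(0)

    def private_count(count, epsilon):
        """Release a count with Laplace(1/epsilon) noise; one record moves a count by at most 1."""
        return count + rng.laplace(scale=1.0 / epsilon)

    def private_class_counts(labels, classes, epsilon):
        """Noisy per-class label counts for one candidate tree node."""
        return {c: private_count(int(np.sum(labels == c)), epsilon) for c in classes}

    # Example: a node holding 40 positive and 60 negative points, released with epsilon = 0.5.
    labels = np.array([1] * 40 + [0] * 60)
    print(private_class_counts(labels, classes=[0, 1], epsilon=0.5))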
 
 Project Idea 2: Secure Multiparty EM
 Implement EM for some kind of mixture model in a secure multiparty way (i.e.,
      making use of cryptography to protect the intermediate quantities). This has been
      studied already in the context of imputation (see here). Although
      in principle, existing cryptographic primitives may be used to compute any
function, in order to get a reasonably efficient algorithm you will have to come up
      with approximations which are less expensive to compute. Analyze the effects of any
      such approximations you make.
 
 Project Idea 3: Differentially Private Sparse Covariance Estimation
The estimation of Gaussian covariance matrices has received ample attention lately
(see, e.g.,
here). When the covariance matrix is sparse (i.e., the off-diagonal elements
are mostly 0), there is a nice correspondence between the multivariate
Gaussian and a certain type of undirected graphical model. See whether it is
possible to estimate sparse covariance matrices in a way which satisfies the
definition of differential privacy given above. Analyze how much error the
algorithm incurs compared with the non-private one.
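
As a non-private starting point (the private variant is the project itself), scikit-learn's graphical lasso estimates the sparse precision matrix that defines the Gaussian graphical model. The synthetic chain-structured data below is only for illustration.

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    # Build a small sparse precision matrix by hand (a chain graph), sample from the
    # corresponding Gaussian, and check whether the graphical lasso recovers the sparsity.
    d = 8
    precision = np.eye(d) * 2.0
    for i in range(d - 1):                 # chain: only neighboring variables interact
        precision[i, i + 1] = precision[i + 1, i] = 0.6
    cov = np.linalg.inv(precision)

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(np.zeros(d), cov, size=1000)

    model = GraphicalLasso(alpha=0.05).fit(X)
    est = model.precision_
    edges = [(i, j) for i in range(d) for j in range(i + 1, d) if abs(est[i, j]) > 1e-3]
    print("recovered edges:", edges)       # ideally the chain (0,1), (1,2), ...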
 
 
KDD Cup 2010 (provided by DataShop at the Pittsburgh Science of Learning Center)
 The KDD Cup is the annual Data Mining and Knowledge Discovery competition in which
some of the best data mining teams in the world compete to solve a practical data
mining problem of some importance. The 2010 challenge asked how students learn: How
generally or narrowly do students learn? How quickly or slowly? Will the rate of
improvement vary between students? What does it mean for one problem to be similar
      to another? It might depend on whether the knowledge required for one problem is
      the same as the knowledge required for another. But is it possible to infer the
      knowledge requirements of problems directly from student performance data, without
      human analysis of the tasks? This year's challenge asks you to predict student
      performance on mathematical problems from logs of student interaction with
      Intelligent Tutoring Systems. This task presents interesting technical challenges,
      has practical importance, and is scientifically interesting.
 
 [Download the datasets:] Click here to download the
      data.
 
More information about the datasets can be found here.
 
You will apply various machine learning techniques to the KDD datasets to
improve the accuracy of predicting the "Correct First Attempt" values for the test
portion of the data. Please report the Root Mean Squared Error (RMSE).
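
As a sanity check on the evaluation, here is a minimal sketch of the metric plus a trivial baseline. The file name is hypothetical and the column name is our recollection of the KDD Cup training format, so verify both against the data you download; predicting the global mean correct-first-attempt rate is only a baseline to beat.

    import numpy as np
    import pandas as pd

    def rmse(y_true, y_pred):
        """Root Mean Squared Error, the official KDD Cup 2010 metric."""
        return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

    # Hypothetical file name; the real training files are tab-separated.
    train = pd.read_csv("algebra_2008_2009_train.txt", sep="\t")
    y = train["Correct First Attempt"]

    # Trivial baseline: predict the global rate of correct first attempts for every row.
    baseline = np.full(len(y), y.mean())
    print("baseline RMSE on training data:", rmse(y, baseline))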
 
 
30-40 Datasets for Education Data Mining
[Dataset Webpage:] Go to the webpage https://pslcdatashop.web.cmu.edu/ (you may
need to log in through WebISO) and click "Public Datasets".

There are about 30-40 public datasets available from the webpage (only select the
datasets whose status is labeled "complete"). If you click on each dataset, you
will find a general description and the related publications. To look at a
dataset, click the "Export" link.
 
 Project Idea 1:
For each dataset, you can compare various machine learning techniques (at least
five to seven different ML methods) on predicting the "Correct First Attempt"
values (generally listed in the column "Outcome"). Please report the Root Mean
Squared Error (RMSE).
 
 Project Idea 2:
Across datasets, you can compare several machine learning techniques (at least two
to three different ML methods) on predicting the "Correct First Attempt" values
(generally listed in the column "Outcome"). Please report the Root Mean Squared
Error (RMSE) on the test data. The hypothesis here is that there may not be an
absolute winner; different machine learning techniques may be effective on
different task domains. For example, you can split the datasets into science
(physics & math) vs. second language learning (Chinese, French).
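
A minimal sketch of the kind of comparison either idea calls for. The preprocessed file and its feature columns are placeholders (the real DataShop exports need per-dataset feature engineering); RMSE is computed from the cross-validated predicted probability of a correct first attempt.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict

    # Hypothetical preprocessed export: numeric features plus a 0/1 "Outcome" column.
    data = pd.read_csv("datashop_export_preprocessed.csv")
    X = data.drop(columns=["Outcome"]).to_numpy()
    y = data["Outcome"].to_numpy()

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "naive Bayes": GaussianNB(),
        "decision tree": DecisionTreeClassifier(max_depth=8),
        "random forest": RandomForestClassifier(n_estimators=200),
    }

    for name, model in models.items():
        # Cross-validated probability of a correct first attempt, scored with RMSE.
        proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
        rmse = np.sqrt(np.mean((y - proba) ** 2))
        print(f"{name}: RMSE = {rmse:.3f}")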
 
 
 Image Segmentation Dataset
 The goal is to segment images in a meaningful way. Berkeley collected three hundred
      images and paid students to hand-segment each one (usually each image has multiple
      hand-segmentations). Two-hundred of these images are training images, and the
      remaining 100 are test images. The dataset includes code for reading the images and
      ground-truth labels, computing the benchmark scores, and some other utility
      functions. It also includes code for a segmentation example. This dataset is new
      and the problem unsolved, so there is a chance that you could come up with the
      leading algorithm for your project.
 
 Download
      Dataset
 
 Project idea: Region-Based Segmentation
 Most segmentation algorithms have focused on segmentation based on edges or based
      on discontinuity of color and texture. The ground-truth in this dataset, however,
      allows supervised learning algorithms to segment the images based on statistics
      calculated over regions. One way to do this is to "oversegment" the image into
      superpixels (Felzenszwalb 2004, code available) and merge the superpixels into
      larger segments. Graphical models can be used to represent smoothness in clusters,
      by adding appropriate potentials between neighboring pixels. In this project, you
      can address, for example, learning of such potentials, and inference in models with
      very large tree-width.
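
For the oversegmentation step, here is a minimal sketch using the scikit-image implementation of the Felzenszwalb (2004) algorithm; the Berkeley dataset ships its own MATLAB utilities, so this Python route is just one option, the image path is hypothetical, and the parameter values are illustrative.

    import numpy as np
    from skimage import io, segmentation

    # Hypothetical path to one of the Berkeley training images.
    image = io.imread("BSDS300/images/train/12003.jpg")

    # Felzenszwalb (2004) oversegmentation: every pixel gets an integer superpixel label.
    labels = segmentation.felzenszwalb(image, scale=100, sigma=0.8, min_size=50)
    print("number of superpixels:", labels.max() + 1)

    # Per-superpixel statistics (here just mean color) are the kind of region features
    # that a learned merging step or a graphical model over superpixels could use.
    mean_colors = np.array([image[labels == k].mean(axis=0) for k in range(labels.max() + 1)])
    print("region feature matrix shape:", mean_colors.shape)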
 
 
Character recognition (digits) data

Optical character recognition, and the simpler digit recognition task, have been the
focus of much ML research. We have two datasets on this topic. The first tackles
the more general OCR task, on a small vocabulary of words. (Note that the first
letter of each word was removed, since these were capital letters that would make
the task harder for you.)
 
 Download dataset.
 
 Project suggestion:
 * Use an HMM to exploit correlations between neighboring letters in the general OCR
      case to improve accuracy. (Since ZIP codes don't have such constraints between
      neighboring digits, HMMs will probably not help in the digit case.)
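
A minimal sketch of the idea: a Viterbi decoder that combines per-position character scores from some classifier you have already trained with letter-bigram transition probabilities estimated from an English word list. All the numbers below are made up purely for illustration.

    import numpy as np

    def viterbi(emission_log_probs, transition_log_probs):
        """Most likely letter sequence given per-position scores and bigram transitions.

        emission_log_probs: (word_length, 26) log P(image_t | letter)
        transition_log_probs: (26, 26) log P(letter_t | letter_{t-1})
        """
        T, K = emission_log_probs.shape
        score = np.full((T, K), -np.inf)
        backptr = np.zeros((T, K), dtype=int)
        score[0] = emission_log_probs[0]                         # uniform prior over first letters
        for t in range(1, T):
            cand = score[t - 1][:, None] + transition_log_probs  # (prev_letter, letter)
            backptr[t] = cand.argmax(axis=0)
            score[t] = cand.max(axis=0) + emission_log_probs[t]
        path = [int(score[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(backptr[t][path[-1]]))
        return [chr(ord('a') + k) for k in reversed(path)]

    # Toy example with made-up scores for a 3-letter word.
    rng = np.random.default_rng(0)
    emissions = np.log(rng.dirichlet(np.ones(26), size=3))
    transitions = np.log(rng.dirichlet(np.ones(26), size=26))
    print("".join(viterbi(emissions, transitions)))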
 
 
 
NBA statistics data

This download contains 2004-2005 NBA and ABA stats for:
 
 -Player regular season stats
 -Player regular season career totals
 -Player playoff stats
 -Player playoff career totals
 -Player all-star game stats
 -Team regular season stats
 -Complete draft history
 -coaches_season.txt - nba coaching records by season
 -coaches_career.txt - nba career coaching records
 Currently all of the regular season
 
 Project idea:
* Outlier detection on the players: find out who the outstanding players are (a small sketch follows below).
* Predict the game outcome.
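
For the outlier-detection idea, here is a minimal sketch; the file name and the assumption that the per-player file has numeric stat columns are guesses at the download's layout, and IsolationForest is just one reasonable choice of detector.

    import pandas as pd
    from sklearn.ensemble import IsolationForest
    from sklearn.preprocessing import StandardScaler

    # Hypothetical file/column layout: adjust the name and columns to the actual download.
    stats = pd.read_csv("player_regular_season.txt")
    numeric = stats.select_dtypes("number").fillna(0)

    # Standardize the stat lines and score each player; lower scores = more unusual.
    X = StandardScaler().fit_transform(numeric)
    detector = IsolationForest(n_estimators=200, random_state=0).fit(X)
    stats["outlier_score"] = detector.score_samples(X)

    print(stats.sort_values("outlier_score").head(10))    # ten most "outstanding" stat lines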
 
 
Precipitation data

This dataset includes 45 years of daily precipitation data from the Northwest
of the US:
 
 Download
      Dataset
 
 Project ideas:
 Weather prediction: Learn a probabilistic model to predict rain levels.
Sensor selection: Where should you place sensors to best predict rain?
 
 
WebKB

This dataset contains webpages from 4 universities, labeled with whether they are
professor, student, project, or other pages.
 Download Dataset.
 
 Project ideas:
* Learning classifiers to predict the type of webpage from the text (a small baseline sketch follows below).
* Can you improve accuracy by exploiting correlations between pages that point to
each other using graphical models?
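
A minimal sketch for the first idea. It assumes you have already written a loader that reads each page's text into a list of strings with parallel labels (the loader below is a placeholder, not provided); the TF-IDF plus linear-classifier pipeline is a standard baseline, not the only option.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Hypothetical: page texts and their labels ("student", "faculty", "project", "other", ...)
    # loaded from the WebKB files by code you write.
    texts, labels = load_webkb_pages()   # placeholder loader, not provided here

    model = make_pipeline(
        TfidfVectorizer(stop_words="english", min_df=2),
        LogisticRegression(max_iter=1000),
    )
    # A fairer evaluation would hold out one university entirely; plain CV shown for brevity.
    print(cross_val_score(model, texts, labels, cv=5).mean())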
 
 Papers:
 * http://www-2.cs.cmu.edu/~webkb/.
 
 
 
Email Annotation

The datasets provided below are sets of emails. The goal is to identify which parts
of the email refer to a person name. This task is an example of the general problem
area of Information Extraction.
 
 Download Dataset
 
 Project Ideas:
 * Model the task as a Sequential Labeling problem, where each email is a sequence
      of tokens, and each token can have either a label of "person-name" or
      "not-a-person-name".
 
 Papers: http://www.cs.cmu.edu/~einat/email.pdf
 
 
Netflix Prize Dataset

The Netflix Prize data set gives 100 million records of the form "user X rated
movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize.
 
 Project idea:
 
 
- Can you predict the rating a user will give on a movie from the movies that
  user has rated in the past, as well as the ratings similar users have given
  similar movies? (A small matrix-factorization sketch follows this list.)
- Can you discover clusters of similar movies or users?
- Can you predict which users rated which movies in 2006? In other words, your
  task is to predict the probability that each pair was rated in 2006. Note that
  the actual rating is irrelevant, and we just want whether the movie was rated
  by that user sometime in 2006. The date in 2006 when the rating was given is
  also irrelevant. The test data can be found at this website.
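
For the first idea, here is a minimal matrix-factorization sketch trained by stochastic gradient descent. The triples below are toy stand-ins for the real (user, movie, rating) records; regularization, user/movie biases, and the date field are all omitted for brevity.

    import numpy as np

    # Toy stand-in for (user, movie, rating) triples parsed from the Netflix files.
    ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
    n_users, n_movies, k = 3, 3, 4

    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, k))     # user latent factors
    V = rng.normal(scale=0.1, size=(n_movies, k))    # movie latent factors

    lr = 0.05
    for epoch in range(200):
        for u, m, r in ratings:
            err = r - U[u] @ V[m]                    # prediction error on this rating
            U_u = U[u].copy()                        # keep the old user vector for the movie update
            U[u] += lr * err * V[m]                  # gradient step on the user factors
            V[m] += lr * err * U_u                   # gradient step on the movie factors

    print("predicted rating for user 0, movie 2:", float(U[0] @ V[2]))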
 
Physiological Data Modeling (bodymedia)

Physiological data offers many challenges to the machine learning community,
including dealing with large amounts of data, sequential data, issues of sensor
fusion, and a rich domain complete with noise, hidden variables, and significant
effects of context.
 
 1. Which sensors correspond to each column?
 characteristic1 age
 characteristic2 handedness
 sensor1 gsr_low_average
 sensor2 heat_flux_high_average
 sensor3 near_body_temp_average
 sensor4 pedometer
 sensor5 skin_temp_average
 sensor6 longitudinal_accelerometer_SAD
 sensor7 longitudinal_accelerometer_average
 sensor8 transverse_accelerometer_SAD
 sensor9 transverse_accelerometer_average
 
 
 2. What are the activities behind each annotation?
 The annotations for the contest were:
 5102 = sleep
 3104 = watching TV
 
 Datasets can be downloaded from here.
 
 Project idea:
* Behavior classification: classify the person based on the sensor measurements.
 
 
Object Recognition

The Caltech 256 dataset contains images of 256 object categories taken at varying
orientations, varying lighting conditions, and with different backgrounds.
 
 Download
      Dataset.
 
 Project ideas:
 * You can try to create an object recognition system which can identify which
      object category is the best match for a given test image.
 * Apply clustering to learn object categories without supervision.
 
 
 
Enron E-mail Dataset

The Enron E-mail data set contains about 500,000 e-mails from about 150 users.
 
 The data set is available here
 
 Project ideas:
 * Can you classify the text of an e-mail message to decide who sent it?
 
 
More data

There are many other datasets out there. UC Irvine has a repository that could be
useful for your project:
 http://www.ics.uci.edu/~mlearn/MLRepository.html.
 
 Sam Roweis also has a link to several datasets out there:
 http://www.cs.toronto.edu/~roweis/data.html.
 
Dr. Jan Wiebe's MPQA opinion-annotated corpus:
 http://www.cs.pitt.edu/mpqa/
 
 
 
 