Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Fall 2013 - C. Faloutsos

List of suggested projects

The projects are grouped according to their general theme. We also list the available data and software, so that you can build on existing resources. More links and resources may be added in the future. Reminders:

SUGGESTED TOPICS

Students taking the class as part of a master's degree are strongly encouraged to choose one of the two default projects, with the first one being the most recommended. Both are well defined, involve a lot of implementation, and have rather predictable outcomes.

The rest of the projects are more open-ended and better suited to people who want to do research in data mining.

1. DEFAULT PROJECTS - for people in M.Sc. programs.

1.1  Default project #1: UCR insect dataset

Given a large collection of labeled insect sound clips, design a good distance function to distinguish malaria-carrying mosquitoes from other insects. See the full description of the Insect Mining project here, in pdf.
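To make "distance function" concrete, here is a minimal sketch of one possible candidate: dynamic time warping (DTW) combined with a 1-nearest-neighbor classifier. It is only an illustration, not the method the project prescribes. The function names `dtw_distance` and `nearest_label`, the toy sequences, and the two labels are all assumptions made for this example; real clips would first need to be turned into numeric feature sequences.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def nearest_label(query, labeled_clips, dist=dtw_distance):
    """1-nearest-neighbor: return the label of the closest labeled clip."""
    return min(labeled_clips, key=lambda pair: dist(query, pair[0]))[1]


# Toy, made-up sequences standing in for per-clip features (hypothetical data)
train = [([0, 1, 2, 1, 0], "mosquito"), ([0, 3, 0, 3, 0], "other")]
print(nearest_label([0, 0, 1, 2, 1, 0], train))
```

Passing the distance as the `dist` argument makes it easy to swap in other candidates (for example, a distance on frequency spectra) and compare them with the same classifier.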

1.2 Default project #2: Graph mining using RDBMS

Given about 100 real graphs, do we see common trends? Do they all have a small diameter ('six degrees')? If not, which ones deviate, and why? Answer all these questions using traditional SQL, which, as it turns out, is powerful enough to answer a long list of graph-mining queries (with query optimization coming for free!). Implement PageRank, diameter, connected components, etc., in SQL, and apply your code to a long list of graph datasets, to spot general patterns and deviations. See the full description of the Graph Mining project here, in pdf.
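As an illustration of how a graph-mining task can be phrased in plain SQL, the sketch below computes connected components by repeatedly propagating the smallest node id along edges. It is a minimal example under stated assumptions, not the project's reference solution: the schema (`edges(src, dst)`, `comp(node, label)`) and the function name `connected_components` are hypothetical, and SQLite, driven from Python's `sqlite3` module, stands in for whichever RDBMS you use.

```python
import sqlite3


def connected_components(pairs):
    """Return {node: component_label} for an undirected edge list."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
    # Store each undirected edge in both directions
    both = list(pairs) + [(d, s) for s, d in pairs]
    cur.executemany("INSERT INTO edges VALUES (?, ?)", both)
    # Initially every node is labeled with its own id
    cur.execute("CREATE TABLE comp AS "
                "SELECT DISTINCT src AS node, src AS label FROM edges")
    while True:
        # Each node takes the smallest label among itself and its neighbors
        cur.execute("""
            CREATE TABLE new_comp AS
            SELECT node, MIN(label) AS label FROM (
                SELECT node, label FROM comp
                UNION ALL
                SELECT e.src AS node, c.label AS label
                FROM edges e JOIN comp c ON c.node = e.dst
            ) AS t
            GROUP BY node
        """)
        changed = cur.execute("""
            SELECT COUNT(*) FROM comp c
            JOIN new_comp n ON n.node = c.node
            WHERE n.label < c.label
        """).fetchone()[0]
        cur.execute("DROP TABLE comp")
        cur.execute("ALTER TABLE new_comp RENAME TO comp")
        if changed == 0:  # labels are stable: every component is found
            break
    result = dict(cur.execute("SELECT node, label FROM comp").fetchall())
    conn.close()
    return result


# Toy graph with two components: {1, 2, 3} and {4, 5}
print(connected_components([(1, 2), (2, 3), (4, 5)]))
```

The loop runs for roughly as many rounds as the largest component's diameter, so the same iteration also gives a rough feel for how the diameter affects running time. Isolated nodes do not appear in an edge list and would need a separate node table.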




2. OPEN-ENDED PROJECTS - GRAPH MINING

2.1 Spam Detection for Review Data

2.2 Is modern spam detection research actually working? (Bipartite core detection)

2.3 Adversarial Spam Injection

2.4 Outliers: Scalable low-rank plus sparse matrix decompositions using Hadoop

 



3. OPEN-ENDED PROJECTS - STREAM MINING

3.1 Change Detection for Product Ratings

3.2 Guess the next flu spike: Co-evolving time series mining

 

4. OPEN-ENDED PROJECTS - TENSORS

4.1 Tensors on Hadoop - 'sparse-3'

4.2 Tensor decomposition using RDBMS

5. OPEN-ENDED PROJECT - BIO-INFORMATICS


DATASETS

Unless explicitly stated otherwise, the datasets are either 'public' or 'owned' by the instructor; for the rest, we need to discuss non-disclosure agreements (NDAs).

Time sequences

Spatial data

Graph data - need NDA

Graph Data - public

Miscellaneous:


SOFTWARE

Note on the software: Before you modify any code, please contact the instructor - ideally, we would like to use these packages as black boxes.


BIBLIOGRAPHICAL RESOURCES:


Last modified Sept. 16, 2013, by Christos Faloutsos.