Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2008 - C. Faloutsos

List of suggested projects

The projects are grouped according to their general theme. We also list the data and software available, to leverage your effort. More links and resources may be added in the future. Reminders:

SUGGESTED TOPICS

0. HADOOP AND PARALLELISM

The projects below are designed mainly for a traditional, single-machine architecture. However, 'hadoop', an open-source implementation of Google's map-reduce system [Dean + Ghemawat, OSDI'04], allows relatively easy parallel execution. We can provide some lab machines for you to install 'hadoop', or we can give you access to a 50-node hadoop cluster at INTEL-Pittsburgh, and maybe access to the 'M45' of Yahoo (1000 machines, 4 cores each, 1 TB total RAM and over 3 PB storage - see the press release at Yahoo, CMU, Scientific American, etc.). You are welcome to try any of the projects below on a hadoop cluster.
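For readers new to the map-reduce style, the following is a minimal single-process sketch of the idea, not the hadoop API. It computes node degrees from an edge list, since several projects below involve graphs. The function names (`map_edge`, `reduce_degree`, `run_map_reduce`) and the toy edge list are hypothetical, chosen only for illustration; on a real cluster the grouping step would be done by the framework across machines.

```python
from collections import defaultdict


def map_edge(edge):
    """Map step: emit a (node, 1) pair for each endpoint of an edge."""
    src, dst = edge
    yield src, 1
    yield dst, 1


def reduce_degree(node, counts):
    """Reduce step: sum all the counts emitted for one node."""
    return node, sum(counts)


def run_map_reduce(records, mapper, reducer):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in groups.items())


# Toy undirected graph given as an edge list (illustrative data only).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
degrees = run_map_reduce(edges, map_edge, reduce_degree)
print(degrees)
```

Because the mapper looks at one edge at a time and the reducer looks at one node at a time, both steps can in principle run in parallel on separate machines, which is what makes this style a natural fit for large graphs.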

1. SPATIO/TEMPORAL AND STREAM MINING

1.1. [*] Disk access traffic patterns, and the Self-* project

2. GRAPHS - LARGE GRAPH MINING

2.1. Large/parallel graph mining, possibly using 'hadoop'

2.2. [**] Large Graph Visualization


2.3. [*] Best-Effort Pattern Match

2.4. [*] Relational databases as graphs, 'fuzzy queries' and Center-Piece Subgraphs (CePS)

2.5. [*] Cross-Association/Co-clustering


3. GRAPHS - INFLUENCE PROPAGATION, GENERATORS, MODELS

3.1. [*] Propagation of Influence/Information in Networks and weblogs ('blogs')

3.2. Generation of Realistic Labeled Graphs

4. MULTIMEDIA - BIOLOGICAL IMAGES

4.1. [**] Feature Extraction for analyzing Drosophila Embryo Images

4.2. Fast implementations of RWR (for gCap)

5. MISCELLANEOUS

5.1. Fraud detection in on-line auctions - hijacked accounts

5.2. Auction fraud - detecting networks of 1-cent auctions


DATASETS

Unless explicitly mentioned otherwise, the datasets are either 'public' or 'owned' by the instructor; for the rest, we need to discuss 'Non-disclosure agreements' (NDAs).

Time sequences

Spatial data

Images/video

Graph-like data

Miscellaneous

SOFTWARE

Notes for the software: Before you modify any code, please contact the instructor - ideally, we would like to use these packages as black boxes.

BIBLIOGRAPHICAL RESOURCES


Last modified Jan. 20, 2008, by Christos Faloutsos.