Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2010 - C. Faloutsos

List of suggested projects

The projects are grouped according to their general theme. We also list the data and software available, so that you can build on existing work. More links and resources may be added in the future. Reminders:

SUGGESTED TOPICS

0. HADOOP AND PARALLELISM

The projects below are mainly designed for a traditional, single-machine architecture. However, 'hadoop' allows relatively easy parallel execution; it implements the map-reduce system of Google [Dean + Ghemawat, OSDI'04]. 'Hadoop' is open source. We have a small cluster where we can give you an account, or we can give you access to a 50-node hadoop cluster at INTEL-Pittsburgh, and maybe access to the 'M45' of Yahoo (1000 machines, 4 cores each, 1 TB total RAM and over 3 PB storage; see the press release at Yahoo, Scientific American, etc). You are welcome to try any of the projects below on a hadoop cluster.
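As a rough illustration of the map-reduce style of computation mentioned above, here is a single-machine Python sketch that counts node degrees in a toy graph. It does not use hadoop itself; the function names (`map_edge`, `shuffle`, `reduce_degree`, `node_degrees`) and the toy edge list are hypothetical, chosen only for illustration. On a real cluster, the framework would run the map and reduce steps in parallel and handle the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_edge(edge):
    """Map step: emit (node, 1) for each endpoint of an undirected edge."""
    u, v = edge
    return [(u, 1), (v, 1)]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_degree(node, counts):
    """Reduce step: sum the counts for one node to obtain its degree."""
    return node, sum(counts)


def node_degrees(edges):
    """Run map, shuffle and reduce in sequence on a list of edges."""
    mapped = chain.from_iterable(map_edge(e) for e in edges)
    grouped = shuffle(mapped)
    return dict(reduce_degree(node, counts) for node, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical toy graph, used only for illustration.
    toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
    print(node_degrees(toy_edges))
```

Because each edge is mapped independently and each node is reduced independently, both steps can in principle be spread over many machines, which is what makes this style attractive for the large-graph projects below.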


1. SPATIO/TEMPORAL AND STREAM MINING

1.1. [*] Automating BGP-anomaly detection


1.2. Disk access traffic patterns, and the Self-* project


1.3. Astrophysics data mining


2. HADOOP AND LARGE GRAPH MINING

2.1. [*] Large/parallel graph mining, possibly using 'hadoop'

 


3. GRAPHS - PATTERNS, OUTLIERS AND GENERATORS

3.1. [*] Anomaly detection in weighted graphs


3.2. [*] Patterns and ``laws'' in weighted graphs


3.3. [*] Model fitting (for Kronecker and RTG)



3.4. `PaC' model for graph generation


4. BLOGS AND INFLUENCE PROPAGATION

4.1. [*] Cascades and Network Topology


5. GRAPH ANALYSIS TOOLS AND VISUALIZATION

5.1. [*] Large Graph Visualization


5.2. [*] Fast implementations of RWR (for gCap)


5.3. 'NetFlix' competition: Collaborative Filtering and link prediction with side information


5.4. Graph similarity, summarization and approximation.


6. MULTIMEDIA - BIOLOGICAL AND MEDICAL IMAGES

6.1. [*+] Visualization, Summarization and Mining of Drosophila Embryo Images


6.2. [+] Multimodel tensor analysis for fMRI brain scans


DATASETS

Unless explicitly mentioned otherwise, the datasets are either 'public' or 'owned' by the instructor; for the rest, we need to discuss 'Non-disclosure agreements' (NDAs).

Time sequences

Spatial data

Images/video

Graph data

Miscellaneous:


SOFTWARE

Notes for the software: Before you modify any code, please contact the instructor - ideally, we would like to use these packages as black boxes.


BIBLIOGRAPHICAL RESOURCES:


Last modified Feb. 9, 2010, by Christos Faloutsos.