Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2010 - C. Faloutsos
List of suggested projects
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.S10/CMU-ONLY/projlist.html
- Feel free to propose projects outside this list, as long
as they have to do with mining and indexing large datasets.
In that case, contact the instructor as early as possible.
- An asterisk [*] in the project title signifies that this
project is related to the PhD dissertation of the contact person. A
cross [+] means that this
is a group project, with several potential collaborators. Feel free
to consider non-asterisked projects, too, if they are related to
your interests or your dissertation.
- Please form groups of 3-4
people.
- Please check the 'blackboard' system, where we
will create one thread for each of the projects below. Please
indicate your interest, by posting in the appropriate thread(s), so
that you can find partners.
SUGGESTED TOPICS
0. HADOOP AND PARALLELISM
The projects below are mainly designed for a traditional,
single-machine architecture. However, 'hadoop' allows relatively
easy parallel execution, implementing the map-reduce
system of Google [Dean + Ghemawat, OSDI'04]. 'Hadoop' is
open source; we have a small cluster where we can give you an
account, or we can give you access to a 50-node hadoop cluster at
INTEL-Pittsburgh, and maybe access to the 'M45' of Yahoo (1000
machines, 4 cores each, 1TB total RAM and over 3PB of storage - see
the press release at Yahoo,
Scientific American, etc). You are welcome to try any of these
projects below, on a hadoop cluster.
1. SPATIO/TEMPORAL AND STREAM MINING
1.1 [*] Automating BGP-anomaly detection
- Problem: Find interesting patterns and/or anomalies
given a 2-year archive of BGP (Border Gateway Protocol) update
messages between routers. Provide a monitoring tool, that we could
deploy. The findings would be relevant to the network
administrators as well. Finding such anomalies will go a long way
in automating monitoring of routers and helping catch major
problems. We have developed a tool, 'BGP-lens', in MATLAB
at CMU, which can find 'clotheslines' (IPs sending a persistent,
near-constant number of updates over a long period of time) and
'prolonged spikes' (IPs sending a short high-burst of updates -
probably relating to some malfunction/event). For this we use an
aggregated form of the update data - number of updates per 600s
etc. Note the data has millions of updates - so straightforward
methods don't work. The project can be sub-divided into many
interesting paths:
- Studying the effect of parameters and thresholds on the discovery
of clotheslines and prolonged spikes. For example, clothesline
discovery relies on moving-window median filtering; how does
the window size affect the algorithm?
- We want to deploy such
a tool for the admins to use, but we would need an online,
incremental version of the algorithms, so that the tool can quickly
process incoming update data. Also, this should be done in a
scripting language like Perl/Python/Ruby, since they are lightweight
and not everyone can afford MATLAB :).
- Also, a BGP-lens with a GUI would be more easily adopted. A nice
project would be to develop a visualization package for it. Note that
you would have to deal with representing very large time series, and
the GUI should provide sensitivity knobs (suitable parameters in the
tool's algorithms) for BGP-lens, so that events at different time
scales can be identified.
- The algorithms used in BGP-lens are more general. Hence,
one can study where else such methods can be employed
(specifically, on other datasets). Can the methods be used as-is,
or do we need to tweak/change the algorithms?
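To make the window-size question concrete, here is a minimal, pure-Python sketch of clothesline-style detection via moving-window median filtering; the window size, flatness tolerance and minimum run length below are illustrative parameters, not the actual BGP-lens settings.

```python
# Sketch: moving-window median filtering over an aggregated update-count
# series, then a search for long near-constant runs -- the 'clothesline'
# signature. All thresholds are illustrative.

def median_filter(series, window):
    """Return the moving-window median of a list of numbers."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sorted(series[lo:hi])[(hi - lo) // 2])
    return out

def flat_runs(series, window=5, tolerance=1.0, min_length=10):
    """Find long runs where the filtered series stays near-constant."""
    smooth = median_filter(series, window)
    runs, start = [], 0
    for i in range(1, len(smooth) + 1):
        if i == len(smooth) or abs(smooth[i] - smooth[start]) > tolerance:
            if i - start >= min_length:
                runs.append((start, i))
            start = i
    return runs

# Toy series: a spike, then a persistent near-constant level of updates.
counts = [1, 50, 2, 1, 3] + [10] * 20 + [2, 1]
print(flat_runs(counts))  # -> [(5, 26)]
```

Varying `window`, `tolerance` and `min_length` on real aggregated update counts is exactly the sensitivity study suggested above.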
- Data: We shall use BGP
router data from the Abilene network, a research network, over a
period of 2 years. It can be seen at the Datapository project. Check out
the BGP-Monitor there: you can run some queries too. A snippet can
be downloaded from (CMU only) here,
it contains raw and aggregated updates.
- Introductory paper(s):
[Prakash+,
KDD'09] (or here, for an
earlier, more detailed version)
- Comments: Very high
practical interest, with good problems from both the algorithmic as
well as the system side. Also nice visualization challenges. There
is a lot of room ~ 3-4 people.
- Contact person:
B. Aditya
Prakash
1.2. Disk access
traffic patterns, and the Self-* project
- Problem: Given traces
from real workstations (tuples of the form <disk-id, track-id,
R/W-flag, timestamp>), find patterns; do predictions; use them
to design better buffering and prefetching algorithms. Try 'blind
signal separation'/ICA, to distinguish between 'reads' and
'writes', or between interactive and database accesses. A related
problem is to forecast the response time for a given disk request,
given a training set with their response times. The main problem is
to extract good features. In fact, this is a small part of the
Self-* project,
which also has numerous co-evolving time sequences from a prototype
data center with multiple 'intelligent' storage units: cpu
utilizations, network traffic measurements, room temperature sensor
measurements, humidity measurements, etc. The goal is to find
patterns, correlations, lag-correlations, anomalies, to help the
data center self-organize, self-detect upcoming (or existing)
failures and attacks, to self-optimize its performance.
- Data: See the
'Disk Access Traces' below.
Also, the web site of the Self-* project, with a lot
of measurement data, that we already have. We also have traces from
an MS SQL Server.
- Introductory
paper(s): the 'PQRS' model [Wang et
al, PEVA 2001]; see also the use of CART [Wang
et al SIGMETRICS 2004] and the follow-up work [Mesnier+,
'05]. Check the SPIRIT
project,
and the corresponding publications (VLDB06,
OSR06) on
Jimeng's page under
'InteMon'. Also the lag-correlation paper [Sakurai+
SIGMOD'05], and the DynaMMo method (Kalman filters for
missing values [ Li+ KDD'09
]).
- Comments: The general
case is hard, and in fact, is the topic of dissertations
(Dr. Mengzhi
Wang,
Dr. Jimeng Sun).
However, there are a lot of initial ideas that you could try within
a semester, and a lot of
industrial interest in the topic. One idea is to use
multi-resolution analysis, like the AWSOM paper [Papadimitriou+,
VLDB 2003]
- Contact person: Lei Li
1.3. Astrophysics data mining
- Problem: Develop
algorithms like 'friends-of-friends', for Tb astrophysics data. We
have galaxy data as (x,y,z) triplets and we want to extract
statistics, like number of pairs of neighbors within epsilon;
characteristic lengths (eg., average diameter of galaxy clusters),
etc. The main idea is to use hadoop, to process such large amounts
of data.
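As a starting point, the pairs-within-epsilon statistic can be sketched with a uniform grid of cell side epsilon, which avoids the all-pairs comparison; on hadoop, the grid cells would be sharded across mappers. The toy points below are illustrative.

```python
# Sketch: counting galaxy pairs within distance epsilon using a uniform
# grid -- the building block of 2-point correlation estimates. Each point
# is only compared against its own cell and the 26 neighboring cells.
from collections import defaultdict
from itertools import product

def count_pairs(points, eps):
    """Count unordered pairs of (x,y,z) points at Euclidean distance <= eps."""
    grid = defaultdict(list)
    for p in points:
        cell = tuple(int(c // eps) for c in p)
        grid[cell].append(p)
    eps2 = eps * eps
    pairs = 0
    for cell, members in grid.items():
        # To avoid double counting, only look at neighbor cells that are
        # lexicographically >= the current cell.
        for d in product((-1, 0, 1), repeat=3):
            ncell = tuple(c + o for c, o in zip(cell, d))
            if ncell < cell or ncell not in grid:
                continue
            others = grid[ncell]
            for i, p in enumerate(members):
                start = i + 1 if ncell == cell else 0
                for q in others[start:]:
                    if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps2:
                        pairs += 1
    return pairs

pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 3.0, 3.0), (3.2, 3.0, 3.0)]
print(count_pairs(pts, 1.0))  # -> 2 (two close pairs)
```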
- Data: Sloan Digital
Sky survey (SDSS); synthetic data (200MB compressed; here is a snippet)
- Introductory papers:
Fractal dimension estimations [Belussi+ VLDB'95], spatial join
estimations [
SIGMOD00 ]; 2-point and n-point correlation functions [Gray+Moore,
NIPS00]
- Comments: A lot of
interest recently, with the McWilliams Center for
Cosmology at CMU. The goal is to try several cosmology
theories, generate through simulation a 'universe' according to
each theory, and reject theories whose 'universe' does not match
the statistical properties of the real universe. Our challenges are
(a) to compute the statistics that astrophysicists prefer, quickly,
on billions and trillions of particles (galaxies/stars) and (b) to
propose additional statistical measures.
- Contact person: Bin Fu; Robson Cordeiro
2. HADOOP AND LARGE GRAPH MINING
2.1. [*] Large/parallel graph mining, possibly using
'hadoop'
- Problem: Given a large
graph with billions of edges and tens of billions of nodes, and
several share-nothing machines, parallelize the typical graph
mining algorithms, to be as fast as you can. We want to
compute the in- and out-degree distributions, the diameter of the
graph, the first several eigenvalues, the 'network value' of each
node, the 'clustering coefficient', the node- and edge-betweenness.
The diameter and the connected components have been done by Mr. U
Kang (contact person), but even there, there is room for
optimizations.
- The first step is to do timing of several possible
architectures: with, or without a relational DBMS; with, or without
replication of the data; using the PIG system;
using 'hbase'
- Also, what is the best way to store the data? E.g., as
<from,to> pairs in a flat file; as an adjacency list, hashed
on the 'from' node-id; or as something else.
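For intuition, the degree-distribution computation over <from,to> pairs boils down to two grouping passes, which map directly onto two map-reduce rounds; a single-machine sketch on a toy edge list:

```python
# Sketch: out-degree distribution from <from,to> edge pairs. The two
# grouping steps mirror two map-reduce rounds: (1) count edges per
# source node, (2) count nodes per degree value.
from collections import Counter

edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 1)]

out_degree = Counter(src for src, dst in edges)          # round 1
degree_distribution = Counter(out_degree.values())       # round 2

print(sorted(degree_distribution.items()))  # -> [(1, 3), (3, 1)]
```

On hadoop, each `Counter` becomes a map (emit key) plus reduce (sum) pass over the flat edge file.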
- Data: We shall start
with synthetic data, using an existing generator [Leskovec+,
PAKDD'05]. Then, DBLP, IMDB etc. We could also get data on real
CMU IP traffic (will need NDA). Finally, we also have a
who-talks-to-whom social network with 270 million nodes and 8
billion edges (60Gb of data)
- Introductory paper(s):
The generator above; the Gamma database machine papers [Dewitt+,
IEEE TKDE'90]; papers on hash-joins [Kitsuregawa+,
VLDB'90]; the RMAT paper [Chakrabarti+
SIAM-DM'04], the connection sub-graph paper [Faloutsos+,
KDD'04]. If you plan to use 'hadoop', get the map-reduce paper
[Dean + Ghemawat, OSDI'04] and the documentation about the
add-ons to hadoop, PIG and hbase.
- Comments: Very high
practical interest, with hard problems from both the algorithmic as
well as the system side. There is a lot of room, even for 4 or
more people.
- Contact persons:
U Kang
3. GRAPHS - PATTERNS, OUTLIERS AND GENERATORS
3.1. [*] Anomaly detection in weighted graphs
- Problem: Given
a graph data set that grows over time, with weights on edges, how
can we find anomalous/ interesting/extreme nodes/edges at a given
time snapshot? What kind of features would be the most informative
for unusual-behavior detection? How can we generalize this idea to
time-evolving graphs, to track anomalous nodes/edges?
- Data: Enron
emails, FEC campaigns, DBLP, and any weighted data set with
possible interesting nodes you might have.
- Introductory papers:
- Caleb C. Noble and Diane J. Cook. Graph-based anomaly
detection. In KDD, pages 631–636, 2003.
- William Eberle and Lawrence B. Holder. Discovering structural
anomalies in graph-based data. In ICDM Workshops, pages
393–398, 2007
-
Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, Christos Faloutsos.
Neighborhood Formation and Anomaly Detection
in Bipartite Graphs,
Proc. ICDM, pp. 418-425, Houston, Texas, Nov 27-30, 2005
- William Eberle and Lawrence B. Holder. Detecting anomalies in
cargo shipments using graph properties. In ISI, pages
728–730, 2006.
- J. Shetty and J. Adibi. Discovering important nodes through graph
entropy: the case of the Enron email database. In Proceedings of
the 3rd International Workshop on Link Discovery (at KDD), pages
74-81, 2005.
- Anomaly detection in graphs: Oddball paper, by Leman Akoglu et al (to appear, PAKDD'10)
- Comments: One
challenge is that some graphs are so large that they do not fit in
main memory. What pre-processing/filtering/sampling techniques can
be helpful to reduce feature-extraction time, so that not all the
nodes/edges need to be processed?
- Contact Person:
Leman
Akoglu
3.2. [*] Patterns and ``laws'' in weighted graphs
- Problem: How
can we model weighted graphs (for example, with network packets
flowing between nodes) for future prediction? Is there any pattern
concerning the weights? What kind of patterns would we expect? How are
the weights distributed on the incident edges of a given node? For
a given edge, do weight arrivals show any interesting behavior
besides being bursty? How can we do a "microscopic" analysis, so
as to model a given weighted graph over time?
- Data: Network
traffic, Campaign donations
- Introductory
papers:
- Mary McGlohon, Leman Akoglu, and Christos Faloutsos.
Weighted graphs and disconnected components: Patterns and a
model. In ACM SIG-KDD, Las Vegas, Nev., USA, August 2008.
- Leman Akoglu, Mary McGlohon, and Christos Faloutsos. RTM: Laws
and a recursive generator for weighted time-evolving graphs. In
ICDM: International Conference on Data Mining, Pisa, Italy,
December 2008.
- Microscopic
Evolution of Social Networks Jure Leskovec, Lars
Backstrom, Ravi Kumar, Andrew Tomkins. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (ACM KDD), 2008.
- For a good survey on link prediction, see Liben-Nowell
and Kleinberg 2003
- [please keep CONFIDENTIAL] Phonecall patterns - reciprocity : Leman's submission on reciprocity (if I call you 50 times, how many times you called me back?)
- [please keep CONFIDENTIAL] Phonecall patterns - duration: Pedro's submission (if a person did 20 phonecalls, what can you say about their durations?)
- Comments: Ideas
can be borrowed from many graphs generators in the literature. The
major part is to work with weights. Contact the instructor or
Leman, for an idea about a new graph generator, using 'monkeys on a
typewriter' approach.
- Contact Person:
Leman
Akoglu
3.3. [*] Model fitting (for Kronecker and RTG)
- Problem: Given a real
graph G, how can we generate another graph that looks like it?
The Kronecker graph model is such a generator. The existing approach to
fitting Kronecker graphs relies on Maximum Likelihood; it is
successful, but slow. Can we find an algorithm that generates such
graphs with the same success, but faster? One idea is to use the SVD,
or the spectra of graphs, to come up with such an algorithm.
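For intuition about the model itself, a deterministic Kronecker graph is simply the k-th Kronecker power of a small seed adjacency matrix; a minimal sketch, where the 2x2 seed is illustrative (fitting would choose the seed so that the result matches the given graph G):

```python
# Sketch: deterministic Kronecker graph construction, as the k-th
# Kronecker power of a small seed adjacency matrix (0/1 entries).

def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m]
             for j in range(n * m)] for i in range(n * m)]

def kronecker_power(seed, k):
    G = seed
    for _ in range(k - 1):
        G = kron(G, seed)
    return G

seed = [[1, 1], [1, 0]]           # illustrative 2x2 seed with 3 ones
G = kronecker_power(seed, 3)      # 8x8 adjacency matrix
print(len(G), sum(map(sum, G)))   # -> 8 27  (2^3 nodes, 3^3 ones)
```

The edge count multiplies at each power (here 3^k ones), which is why small seeds already yield realistic heavy-tailed graphs.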
- Data: Several real
graphs (Epinions, Oregon AS, Flickr etc)
- Introductory
papers:
- Comments: Requires good
knowledge of linear algebra.
- Contact
persons: Leman
Akoglu
3.4 `PaC' model for graph generation
- Problem: Augment the
'pay and call' model [Du+ KDD09], to make it match more patterns in
real networks
- Data: The usual graph
data; also, confidential who-calls-whom data (but needs NDA)
- Introductory papers:
[Du+
KDD09]
- Comments: The next
goal is to make the inter-arrival times of 'phonecalls' more
realistic
- Contact person:
Instructor
4. BLOGS AND INFLUENCE PROPAGATION
4.1. [*] Cascades and Network Topology
- Problem: How does the popularity of a 'fad' grow
and drop over time? Exponentially, or like a power law? Does its
shape depend on the topology of the network? If yes, does it depend
on the average degree / diameter / or something else? We understand
that information and "fads" seem to follow an epidemic-type
pattern, and that the shape of the "cascades" follow patterns.
Using snapshots of real graphs, simulate information ('fads'/
viruses) traveling across a network using SIS/SIR infection
models and note what "cascades" are formed. Then, run experiments
on the same graph with some edges or nodes removed randomly
(simulating immunization or quarantine). How do the cascades change
(in size, shape, etc)? What if instead of removing nodes or edges
randomly, we remove them according to some rule (nodes with highest
degree, edges with highest weight, edges with weight less than
k...)?
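A minimal sketch of one such simulation, assuming an SIR model with illustrative infection/recovery probabilities beta and delta; re-running it after deleting nodes or edges lets you compare cascade sizes under different 'immunization' rules.

```python
# Sketch: one SIR cascade on an undirected graph given as an adjacency
# dict. beta (infection prob. per contact) and delta (recovery prob.
# per step) are illustrative parameters.
import random

def sir_cascade(adj, seed, beta=0.5, delta=0.3, rng=None):
    """Simulate SIR from one seed node; return the set of ever-infected nodes."""
    rng = rng or random.Random(0)
    infected, recovered = {seed}, set()
    ever = {seed}
    while infected:
        new_inf = set()
        for u in infected:
            for v in adj.get(u, ()):
                if v not in ever and rng.random() < beta:
                    new_inf.add(v)
                    ever.add(v)
        recovered |= {u for u in infected if rng.random() < delta}
        infected = (infected | new_inf) - recovered
    return ever

# Toy chain: removing the middle node would quarantine the right half.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(len(sir_cascade(adj, seed=0)))
```

Averaging the cascade size over many random seeds, before and after node/edge removal, gives the comparison the project asks for.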
- Data: Any graph data, ideally weighted (for certain
experiments), see "Graph-like data" below. Check the memetracker web site at Cornell,
and download its dataset.
- Introductory papers: Hethcote
(for SIS/SIR), Leskovec+PAKDD06
(cascade algs), Leskovec+SDM07
and McGlohon+ICWSM07(cascades
in blogs). Also, the memetracker paper
[KDD'09]
- Comments: Alternatively, instead of modifying real
graphs, pick a number of different real (blogs, academic citations,
network traffic) and synthetic (preferential attachment,
Erdos-Renyi, small-world) networks and compare the cascades formed.
How do the cascades vary, and what graph properties yield what
cascade shapes?
- Contact: Aditya Prakash
5. GRAPH ANALYSIS TOOLS AND VISUALIZATION
5.1. [*] Large Graph Visualization
- Problem: Given a huge graph (say, millions of nodes),
help visualize it. Start with the ICML03 paper below; then, try to
extend it for huge graphs: try some partitioning/grouping method,
and/or some fish-eye ideas.
- Data: epinions.com; DBLP citation information,
etc
- Introductory paper(s): Check the paper on GMine. Also [Takeshi
Yamada, Kazumi Saito, Naonori Ueda:
Cross-Entropy Directed Embedding of Network Data.
ICML'03]. Also, the visualization tools
from CAIDA, and specifically 'walrus'.
Check the graphdrawing
organization and the corresponding sequence of conferences
on Graph Drawing (GD) from there. Our goal here is different,
though: we want to visualize large graphs that don't fit in memory,
nor on the screen, nor in the human mind, unless we summarize them
somehow.
- Comments: Open-ended problem, but very useful
- could lead to publication. The first step could be to
implement the method by [Yamada et al].
- Contact person:
Polo Chau
5.2. [*] Fast implementations of RWR (for gCap)
- Problem: In the 'gCap'
paper (see below), and in the Drosophila Embryo project below, we
need to compute the steady-state probability of each node in a
random walk with restarts, when the restarting node is unknown.
Thus, naively, we need n^2 steady-state
probabilities. How can we do better than that? How can we save
computation, if, for a given node i, we only want the top 10 closest
nodes and their scores?
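For intuition, here is a minimal power-iteration sketch of RWR from a single restart node, plus the top-k ranking; the restart probability c and the toy graph are illustrative, and the fast algorithms cited in the papers avoid exactly this per-node iteration.

```python
# Sketch: random walk with restart (RWR) from one node via power
# iteration. adj maps each node to its list of neighbors; every node
# is assumed to have at least one neighbor.

def rwr(adj, restart, c=0.15, iters=100):
    """Return the steady-state probability of each node."""
    nodes = list(adj)
    p = {u: 0.0 for u in nodes}
    p[restart] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            share = (1 - c) * p[u] / len(adj[u])  # spread mass to neighbors
            for v in adj[u]:
                nxt[v] += share
        nxt[restart] += c                          # restart mass
        p = nxt
    return p

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
scores = rwr(adj, restart=0)
top = sorted(scores, key=scores.get, reverse=True)
print(top)  # restart node itself ranks first
```

Repeating this for every restart node is the naive n^2 computation mentioned above; keeping only the top-10 scores per node is what the project aims to speed up.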
- Data: DBLP, Corel
image data, and more.
- Introductory paper(s): Probably the fastest algorithm so
far is by Hanghang Tong, in ICDM'06.
The goal is to make it even faster, and/or to implement it in
C/C++, or 'hadoop', so that it can run on huge graphs
(Gb-size). Related papers: gCap [Pan
et al, KDD'04]; topic sensitive PageRank [Haveliwala WWW'02] ; fast
algorithms for topic sensitive PageRank [Haveliwala+ '03]; graph partitioning
[Sun+,
'05].
- Comments: Mainly,
implementation - but it has room for innovation. Closely related to
the dissertation of Hanghang Tong, who will help along. Also, it
could be used immediately by the Drosophila Embryo project
below.
- Contact person:
Hanghang Tong.
5.3. 'NetFlix' competition: Collaborative Filtering and link
prediction with side information
- Problem: Given a user-movie rating matrix with
many missing entries, how can we predict the missing ratings? Here, we
want to investigate how to incorporate side-information (such
as user/movie attributes, e.g., job title or movie genre) to improve
the prediction accuracy.
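As a baseline (before any side-information), the latent-factor approach behind the leading Netflix entries can be sketched as plain matrix factorization trained by stochastic gradient descent; all hyper-parameters and the toy ratings below are illustrative.

```python
# Sketch: rating prediction via latent-factor matrix factorization,
# trained with SGD. Side-information would enter as extra bias terms
# or via collective factorization (see the papers below).
import random

def factorize(ratings, n_users, n_items, k=2, steps=500, lr=0.02, reg=0.02):
    """ratings: list of (user, item, rating). Returns factor matrices U, V."""
    rng = random.Random(0)
    U = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)  # gradient step on U
                V[i][f] += lr * (err * uf - reg * vf)  # gradient step on V
    return U, V

# Toy 3-user x 3-movie matrix; the (user 2, movie 2) entry is missing.
ratings = [(0, 0, 5), (0, 1, 3), (0, 2, 4), (1, 0, 5), (1, 1, 3),
           (2, 0, 1), (2, 1, 5)]
U, V = factorize(ratings, 3, 3)
missing = sum(U[2][f] * V[2][f] for f in range(2))
print(round(missing, 2))  # predicted rating for the missing cell
```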
- Data sets: Netflix data set, Movie-lens data
set
- Introductory papers: the papers by the leading
team of Netflix competition [Koren et
al KDD’08, Bell et
al KDD’07]; the collective matrix factorization paper
(one possible way to deal with the side-information) [Singh et
al KDD’08].
- Comments: This is
the well publicized, $1M Netflix Prize competition -
visit that site for papers, data etc.
- Contact person:
Hanghang
Tong
5.4. Graph similarity, summarization and
approximation.
- Problem: Given a large graph, how can we find
patterns (e.g., communities and anomalies) in an intuitive and
efficient way? How can we track a pattern of interest if the graph
is evolving over time? The problem of graph similarity is subtly
related: given two graphs, how similar are they? The number of
edges in which they differ is not necessarily a good measure. One way
to attack the approximation problem is through example-based
low-rank approximation of the adjacency matrix of the graph.
- Data set: DBLP data set; Network Traffic Data
set.
- Introductory papers: CUR paper [Drineas
et al SIAM’05]; Colibri paper [Tong et al,
KDD’08]; Non-negative CUR paper [Hyvonen et al
KDD’08]
- Comments: The problem of graph similarity seems
easy but it is very subtle.
- Contact person: Hanghang
Tong
6. MULTIMEDIA - BIOLOGICAL AND MEDICAL IMAGES
6.1. [*+] Visualization, Summarization and Mining of Drosophila
Embryo Images
- Problem: Given a set
of annotated 2D images of Drosophila embryos (352x160 gray scale),
one of the problems is to help biologists clean up the dataset (some
images are very noisy). What you can do is (a) implement
`multi-dimensional scaling' (MDS), to help plot our images on the
screen and hopefully spot outliers, and (b) design a system to help
us summarize a collection of images (say, by finding clusters and
reporting the 'typical' image in each cluster). The ultimate,
50-year-horizon goal is to find how genes affect each other in the
early stages of life in Drosophila (and help us extrapolate about
human genes).
- Data: Check the
BDGP site
here ; we also have preprocessed data available in which low
quality images were removed and fly embryos were already scaled and
aligned.
- Introductory paper(s):
FEMine (Our previous work published in KDD'06, details on data
preprocessing; baseline for feature extraction algorithm design);
Zhou and Peng, 2007 , and Peng
et al, 2007 (Recent work on automatic fly embryo image
analysis); also
Tomancak et al, 2002 (Papers by the BDGP group; Good references
if you want to know more about the dataset)
- Comments: The dataset
contains more than 10k images; you may start from a smaller subset,
and write your own algorithm to do further data cleaning.
- Contact Persons:
Fan Guo and Lei Li
6.2. [+] Multimodal tensor analysis for fMRI brain scans
- Problem: Getting
the best of both worlds (tensors and wavelets) seems to be a
promising way to handle multidimensional time series. In this
problem, we want to perform a multimodal analysis in fMRI scans
from eleven (11) subjects that perform four (4) different tasks.
Therefore we have an 11x4xXxYxZxT tensor where the last dimension
is the time aspect. The goal is to find patterns in such a dataset
(like, eg., `left-handed people have more activation in their right
part of their brain')
- Data: fMRI brain scans
from Temple University
- Introductory
papers:
- Comments: In
collaboration with Temple University (Michael Barnathan, Prof.
Vasilis Megalooikonomou)
- Contact person:
Instructor
DATASETS
Unless explicitly mentioned, the datasets are either
'public' or 'owned' by the instructor; for the rest, we need to
discuss 'non-disclosure agreements' (NDAs).
Time sequences
- Time series
repository at UCR.
- KURSK
dataset of multiple time sequences: time series from
seismological sensors by the explosion site of the 'Kursk'
submarine.
- Truck traffic data, from our Civil Engineering
Department. Number of trucks, weight etc per day per highway-lane.
Find patterns, outliers; do data cleansing.
- River-level / hydrology data: multiple,
correlated time series. Do data cleansing; find correlations
between these series. Excellent project for people that like
canoeing!
- Sunspots:
number of sunspots per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
- Time sequences from the
Santa Fe Institute forecasting competition (financial data,
laser-beam oscillation data, patients' apnea data etc)
- Disk access
traces, from HP Labs (we have local copies at CMU).
For each disk access, we have the timestamp, the block-id, and the
type ('read'/'write'). Here is a
snippet of the data, aggregated per 30 minutes.
- Network traffic data from datapository.net at CMU
- Motion-capture data from CMU mocap.cmu.edu
Spatial data
- Astrophysics data - thousands of galaxies, with
coordinates, red-shift, spectra, photographs.
Small snippet of the data. More data are in the 'skyserver' web
site, where you can ask SQL queries and
get data in html or csv format
- Synthetic astrophysics data: 1K of (x,y,z, weight) tuples, from Prof. Rupert Croft (CMU). The full dataset is 200Mb compressed - contact instructor.
- Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc).
Snippet of data (roads from California, from TIGER).
Images/video
-
Biological data: images of proteins, with ~50 attributes
each.
- 'Owner': Prof. Bob Murphy.
- Video/image/sound data, from Informedia. 2Tb of video,
segmented; 1M images with features; 10^4 faces. Extract features;
design good similarity functions; do the named-entity
analysis.
Graph data
- Web-log and click-stream data (NDA: needed).
- Snapshots of 2 anonymized (and anonymous) social networks
(NDA)
- Visit patterns for a large web site: for 300 pages, and
thousands of users, we record how many times a user visited a
specific site. Find patterns, clusters, fractal dimensions,
regularities in the SVD etc.
- Netflix competition
dataset (users, movies and ratings) - needs easy registration - we
have a cleaned-up version available, locally.
- Enron email
dataset (400 MB compressed)
- Movie-actor data from imdb.com (we have a cleaned-up snapshot
of it)
- DBLP author-paper-conference data from the DBLP site of Mike
Ley (records in XML,
and their
DTD). For 'ego-surfing', try this java app or the
java applet
at U. Alberta.
- Graph
datasets at U.Mass (Amherst), by Prof. Dave Jensen.
- More graph datasets
from Mark
Newman (U. Michigan) - including popular test-beds like the
Zachary's karate club social network etc.
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.
- Readily available:
- ACCESS METHODS
-
DR-tree : R-tree code; searches for range and nearest-neighbor
queries. In C.
-
kd-tree code
- OMNI
trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle
numbers, too). In C.
- SVD AND TENSORS:
- FRACTALS
- GRAPHS
- the
NetMine network topology analysis package
- GMine:
interactive graph visualization package and graph manipulation
library (by Junio (Jose Fernandez Rodrigues Junior) and Jure
Leskovec)
- the '
crossAssociation' package for graph partitioning.
- the PEGASUS
package for graph mining on hadoop.
- Outside CMU:
- GiST package from
Hellerstein at UC Berkeley: A general spatial access method, which
is easy to customize. It is already customized to yield
R-trees.
- hadoop, PIG and hbase
- pajek,
jung, graphviz, guess, for (small)
graph visualization
BIBLIOGRAPHICAL RESOURCES:
Last modified Feb. 9, 2010, by Christos Faloutsos.