project list

Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2009 - C. Faloutsos

List of suggested projects

The projects are grouped according to their general theme. We also list the data and software available, to leverage your effort. More links and resources may be added in the future. Reminders:

URL for this very page (internal to CMU - please treat it 'confidentially'): www.cs.cmu.edu/~christos/courses/826.S09/CMU-ONLY/projlist.html
Feel free to propose projects outside this list, as long as they have to do with mining and indexing large datasets. In that case, contact the instructor as early as possible.
An asterisk [*] in the project title signifies that the project is related to the PhD dissertation of the contact person. A cross [+] means that this is a group project, with several potential collaborators. But feel free to consider non-asterisked projects, too, if they are related to your interests or your dissertation.
Please form groups of 3 people.
Please check the 'blackboard' system, where there is one thread for each of the projects below. Please indicate your interest by posting in the appropriate thread(s), so that you can find partners.

SUGGESTED TOPICS

0. HADOOP AND PARALLELISM

The projects below are mainly designed for a traditional, single-machine architecture. However, 'hadoop' allows relatively easy parallel execution, implementing the map-reduce system of Google [Dean + Ghemawat, OSDI'04]. 'Hadoop' is open source; we have a small cluster where we can give you an account, or we can give you access to a 50-node hadoop cluster at INTEL-Pittsburgh, and maybe access to the 'M45' of Yahoo (1000 machines, 4 cores each, 1Tb total RAM and over 3Pb storage - see the press release at Yahoo, Scientific American, etc). You are welcome to try any of these projects below, on a hadoop cluster.

1. SPATIO/TEMPORAL AND STREAM MINING

1.1 [*] Automating BGP-anomaly detection

Problem: Find interesting patterns and/or anomalies given a 2-year archive of BGP (Border Gateway Protocol) update messages between routers. The findings should be relevant to network administrators as well. Finding such patterns will go a long way toward automating the monitoring of routers and helping catch major problems. We have developed a tool, 'BGP-lens', in MATLAB at CMU, which is able to find 'clotheslines' (IPs sending a persistent, near-constant number of updates over a long period of time) and 'prolonged spikes' (IPs sending a short, high burst of updates, probably related to some malfunction/event). For this we use an aggregated form of the update data (e.g., number of updates per 600s). Note that the data has millions of updates, so straightforward methods don't work. The project can be sub-divided into many interesting paths:
- Studying the effect of parameters and thresholds on the discovery of clotheslines and prolonged spikes. For example, clothesline discovery relies, among other things, on moving-window median filtering. How does the window size affect the algorithm? And similar questions.
- We want to deploy such a tool so that admins can use it, but we would need an online, incremental version of the algorithms so that the tool can quickly process incoming update data. Also, this should be done in a non-MATLAB scripting language like perl/python/ruby, as they are lightweight and not everyone can afford MATLAB :).
- A BGP-lens with a GUI would be easier to use. A nice contribution would be to develop a visualization package for it. Note that you would have to deal with representing really large time series, and the GUI should provide sensitivity knobs (suitable parameters of the tool's algorithms) so that events at different time scales etc. can be identified.
- We believe that the algorithms used in BGP-lens are more general. Hence, one can study where else such methods can be employed (specifically, on which other datasets). Can the methods be used as is, or do we need to tweak/change the algorithms?
Data: We shall use BGP router data from the Abilene network, a research network, over a period of 2 years. It can be seen at the Datapository project. Check out the BGP-Monitor there: you can run some queries too. A snippet can be downloaded from (CMU only) here, it contains raw and aggregated updates.
Introductory paper(s): Draft paper . (under submission - internal to CMU - please do not disseminate)
Comments: Very high practical interest, with good problems from both the algorithmic and the systems side. Also nice visualization challenges. There is a lot of room, for about 3-4 people.
Contact person: B. Aditya Prakash
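
As a starting point for the window-size question in project 1.1, here is a minimal Python sketch of moving-window median filtering used to flag bursts. It is not the BGP-lens algorithm itself (see the draft paper for that); the function names, the threshold rule, and the sample counts are illustrative assumptions.

```python
from statistics import median

def moving_median(series, window):
    """Median of a sliding window centered on each point (truncated at the edges)."""
    half = window // 2
    return [median(series[max(0, i - half): i + half + 1])
            for i in range(len(series))]

def find_spikes(series, window=5, threshold=10):
    """Indices whose value exceeds the local median by more than `threshold`."""
    baseline = moving_median(series, window)
    return [i for i, (x, b) in enumerate(zip(series, baseline))
            if x - b > threshold]

# Hypothetical update counts per 600 s bin for one IP (made-up numbers).
counts = [3, 4, 3, 5, 4, 60, 4, 3, 4, 5, 3]
spikes = find_spikes(counts, window=5, threshold=10)
print(spikes)
```

Re-running `find_spikes` with different `window` and `threshold` values is one simple way to explore how sensitive the detections are to these parameters.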

**1.2. Disk access traffic patterns, and the Self-* project**

Problem: Given traces from real workstations (tuples of the form <disk-id, track-id, R/W-flag, timestamp>), find patterns; do predictions; use them to design better buffering and prefetching algorithms. Try 'blind signal separation'/ICA, to distinguish between 'reads' and 'writes', or between interactive and database accesses. A related problem is to forecast the response time for a given disk request, given a training set of requests with their response times. The main problem is to extract good features. In fact, this is a small part of the Self-* project, which also has numerous co-evolving time sequences from a prototype data center with multiple 'intelligent' storage units: CPU utilizations, network traffic measurements, room temperature sensor measurements, humidity measurements, etc. The goal is to find patterns, correlations, lag-correlations, and anomalies, to help the data center self-organize, self-detect upcoming (or existing) failures and attacks, and self-optimize its performance.
Data: See the 'Disk Access Traces' below. Also, the web site of the Self-* project, with a lot of measurement data, that we already have. We also have traces from an MS SQL Server.
Introductory paper(s): the 'PQRS' model [Wang et al, PEVA 2001]; see also the use of CART [Wang et al, SIGMETRICS 2004] and the follow-up work [Mesnier+, '05]. Check the SPIRIT project, or the live InteMon system, and the corresponding publications (VLDB06, OSR06) on Jimeng's page under 'InteMon'. Also the lag-correlation paper [Sakurai+, SIGMOD'05].
Comments: The general case is hard, and in fact, is the topic of dissertations (Dr. Mengzhi Wang, Dr. Mike Mesnier, Dr. Jimeng Sun). However, there are a lot of initial ideas that you could try within a semester, and a lot of industrial interest in the topic. One idea is to use multi-resolution analysis, like the AWSOM paper [Papadimitriou+, VLDB 2003].
Contact person: Lei Li
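
To make the 'lag-correlation' goal of project 1.2 concrete, here is a brute-force Python sketch that tries every lag up to a limit and keeps the one with the highest Pearson correlation. This is only a baseline, not the method of the lag-correlation paper; the series names and values are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def best_lag(x, y, max_lag):
    """Return (lag, correlation) for the lag at which y best follows x."""
    scores = {lag: pearson(x[:len(x) - lag], y[lag:])
              for lag in range(max_lag + 1)}
    lag = max(scores, key=scores.get)
    return lag, scores[lag]

# Hypothetical measurements: `net` repeats `cpu` two time-ticks later.
cpu = [1, 3, 2, 5, 4, 6, 2, 1, 3, 5]
net = [0, 0] + cpu[:-2]
lag, corr = best_lag(cpu, net, max_lag=4)
print(lag, round(corr, 3))
```

The cost is proportional to the series length times `max_lag` for each pair of sequences, which is exactly why faster methods matter when there are many co-evolving sequences.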

2. HADOOP AND LARGE GRAPH MINING

2.1. [*] Large/parallel graph mining, possibly using 'hadoop'

Problem: Given a large graph with billions of edges and tens of billions of nodes, and several shared-nothing machines, parallelize the typical graph mining algorithms, to be as fast as you can. We want to compute the in- and out-degree distributions, the diameter of the graph, the first several eigenvalues, the 'network value' of each node, the 'clustering coefficient', the node- and edge-betweenness. The diameter and the connected components have been done by Mr. U Kang (contact person), but even there, there is room for optimizations. The first step is to do timing of several possible architectures: with, or without a relational DBMS; with, or without replication of the data. Also, what is the best way to store the data (e.g., as <from,to> pairs in a flat file; as an adjacency list, hashed on the 'from' node-id; or as something else)?
Data: We shall start with synthetic data, using an existing generator [Leskovec+, PAKDD'05]. Then, DBLP, IMDB etc. We could also get data on real CMU IP traffic (will need NDA). Finally, we also have a who-talks-to-whom social network with 270 million nodes and 8 billion edges (60Gb of data).
Introductory paper(s): The generator above; the Gamma database machine papers [Dewitt+, IEEE TKDE'90]; papers on hash-joins [Kitsuregawa+, VLDB'90]; the RMAT paper [Chakrabarti+, SIAM-DM'04]; the connection sub-graph paper [Faloutsos+, KDD'04]. If you plan to use 'hadoop', get the map-reduce paper [Dean + Ghemawat, OSDI'04].
Comments: Very high practical interest, with hard problems from both the algorithmic as well as the system side. There is a lot of room, even for 4 or more people.
Contact persons: Charalampos (Babis) Tsourakakis, U Kang
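
For the degree distributions mentioned in project 2.1, the map-reduce formulation can be prototyped on one machine before moving to 'hadoop'. The Python sketch below assumes the graph is stored as <from,to> pairs; the function names are hypothetical and no hadoop API is used.

```python
from collections import Counter

def map_edge(edge):
    """Map step: emit one count for the source's out-degree and the target's in-degree."""
    src, dst = edge
    yield ("out", src), 1
    yield ("in", dst), 1

def reduce_counts(pairs):
    """Reduce step: sum the values that share a key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

def degree_distribution(edges, kind):
    """Second pass: count how many nodes have each in- or out-degree."""
    degrees = reduce_counts(kv for e in edges for kv in map_edge(e))
    return Counter(d for (k, _node), d in degrees.items() if k == kind)

edges = [(1, 2), (1, 3), (2, 3), (4, 3)]   # tiny made-up edge list
out_dist = degree_distribution(edges, "out")
in_dist = degree_distribution(edges, "in")
print(out_dist, in_dist)
```

Note that nodes with degree zero never appear as a key, so they are missing from the distribution unless the node list is supplied separately.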

2.2. [*] Eigenvalues in Hadoop

Problem: Spectra are very informative in many real world problems. Latent Semantic Indexing, Spectral Clustering and Spectral Cuts using the Cheeger Inequality rely on the eigendecomposition of the underlying matrix. In this project we aim to develop an EigenSolver for real, symmetric matrices (for example undirected graphs have this matrix representation) that computes the top-k eigenvalues.
Data: Abundant at smaller sizes; for large-scale data, the Yahoo WEB GRAPH (120G).
Introductory papers
- Numerical Methods for Eigenvalue problems
- the documentation of the PIG system
Comments: Hard, but the contact person will provide significant help on both the theoretical and the practical side. We'd like to contribute open-source code to HaMa (Hadoop Matrix Algebra).
Contact Persons: Charalampos (Babis) Tsourakakis
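
As a single-machine reference point for project 2.2, here is a small Python sketch of power iteration with deflation, which returns the k eigenvalues of largest absolute value of a real symmetric matrix. This is the simplest textbook approach and an assumption on our part, not the solver the project must build; a serious implementation would more likely use a Lanczos-type method, and power iteration converges slowly (or not at all) when two eigenvalues have nearly equal magnitude.

```python
import random
from math import sqrt

def matvec(A, v):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, iters=500, seed=0):
    """Approximate the dominant eigenpair of a symmetric matrix A."""
    rng = random.Random(seed)
    v = [rng.random() for _ in A]
    for _ in range(iters):
        w = matvec(A, v)
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    norm_v = sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    lam = sum(x * y for x, y in zip(v, matvec(A, v)))  # Rayleigh quotient
    return lam, v

def top_k_eigenvalues(A, k):
    """Deflation: subtract lam * v v^T after each eigenpair is found."""
    A = [row[:] for row in A]
    found = []
    for _ in range(k):
        lam, v = power_iteration(A)
        found.append(lam)
        for i in range(len(A)):
            for j in range(len(A)):
                A[i][j] -= lam * v[i] * v[j]
    return found

# Small symmetric test matrix with eigenvalues 2 + sqrt(2), 2, 2 - sqrt(2).
M = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
vals = top_k_eigenvalues(M, 2)
print(vals)
```

The only operation that touches the whole matrix is `matvec`, which is the part one would express as a map-reduce job on a large sparse matrix.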

3. GRAPHS - PATTERNS, OUTLIERS AND GENERATORS

**3.1. [*] Anomaly detection in weighted graphs**

Problem: Given a graph data set that grows over time, with weights on edges, how can we find anomalous/ interesting/extreme nodes/edges at a given time snapshot? What kind of features would be the most informative for unusual behavior detection? How can we generalize this idea to time-evolving graphs to track for anomalous nodes/edges?
Data: Enron emails, FEC campaigns, DBLP, and any weighted data set with possible interesting nodes you might have.
Introductory papers:
- Caleb C. Noble and Diane J. Cook. Graph-based anomaly detection. In KDD, pages 631–636, 2003.
- William Eberle and Lawrence B. Holder. Discovering structural anomalies in graph-based data. In ICDM Workshops, pages 393–398, 2007
- Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, Christos Faloutsos. Neighborhood Formation and Anomaly Detection in Bipartite Graphs, Proc. ICDM, pp. 418-425, Houston, Texas, Nov 27-30, 2005
- William Eberle and Lawrence B. Holder. Detecting anomalies in cargo shipments using graph properties. In ISI, pages 728–730, 2006.
- J. Adibi J. Shetty. Discovering important nodes through graph entropy: The case of enron email database. In KDD, Proceedings of the 3rd International Workshop on Link Discovery, pages 74–81, 2005.
Comments: One challenge is that some graphs are so large that they do not fit in main memory. What pre-processing/filtering/sampling techniques can help reduce feature extraction time, so that not all the nodes/edges have to be processed in the end?
Contact Person: Leman Akoglu

**3.2. [*] Patterns and 'laws' in weighted graphs**

Problem: How can we model weighted graphs (for example, with network packets flowing between nodes) for future prediction? Is there any pattern concerning weights? What kind of patterns would we expect? How are the weights distributed on the incident edges of a given node? For a given edge, do weight arrivals show any interesting behavior besides being bursty? How can we do a "microscopic" analysis so as to model a given weighted graph over time?
Data: Network traffic, Campaign donations
Introductory papers:
- Mary McGlohon, Leman Akoglu, and Christos Faloutsos. Weighted graphs and disconnected components: Patterns and a model. In ACM SIG-KDD, Las Vegas, Nev., USA, August 2008.
- Leman Akoglu, Mary McGlohon, and Christos Faloutsos. RTM: Laws and a recursive generator for weighted time-evolving graphs. In ICDM: International Conference on Data Mining, Pisa, Italy, December 2008.
- Microscopic Evolution of Social Networks Jure Leskovec, Lars Backstrom, Ravi Kumar, Andrew Tomkins. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM KDD), 2008.
- For a good survey on link prediction, see Liben-Nowell and Kleinberg 2003
Comments: Ideas can be borrowed from many graph generators in the literature. The major part is to work with weights. Contact the instructor or Leman, for an idea about a new graph generator, using a 'monkeys on a typewriter' approach.
Contact Person: Leman Akoglu

3.3. Fast KronFIT

Problem: Given a real graph G, how can we generate another graph that looks like it? Kronecker graphs are one such generator. The existing approach to fitting Kronecker graphs relies on Maximum Likelihood. This approach is successful, but slow. Can we find an algorithm that generates such graphs with the same success, but faster? One idea is to use the SVD or the spectra of graphs to come up with such an algorithm.
Data: Several real graphs (Epinions, Oregon AS, Flickr etc)
Introductory papers:
Comments: Requires good knowledge of linear algebra. A faster method would be very useful, because the current KronFit method is often slow.
Contact persons: Charalampos (Babis) Tsourakakis
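
For readers new to Kronecker graphs (project 3.3), the Python sketch below builds a deterministic Kronecker graph by taking repeated Kronecker products of a small initiator adjacency matrix. It shows only the generator, not the fitting step; the 2x2 initiator is an arbitrary example, and the stochastic variant would use edge probabilities instead of 0/1 entries.

```python
def kronecker(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def kronecker_power(initiator, k):
    """k-th Kronecker power of the initiator (k >= 1)."""
    result = initiator
    for _ in range(k - 1):
        result = kronecker(result, initiator)
    return result

initiator = [[1, 1],
             [1, 0]]          # arbitrary example initiator
G2 = kronecker_power(initiator, 2)
num_edges = sum(map(sum, G2))
for row in G2:
    print(row)
```

With an initiator of n nodes and e nonzero entries, the k-th power has n^k nodes and e^k nonzero entries, so the graph grows quickly with k.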

4. BLOGS AND INFLUENCE PROPAGATION

4.1. [*] Propagation of Influence/Information in Networks and weblogs ('blogs')

Problem: We want to find patterns of propagation of information (or viruses, influence, etc.) in a network. We can start by limiting ourselves to trees. For example, in a web-log influence tree, what is the most typical form of influence: a 'star' topology? a 'string' topology? something in-between? How to generate such realistic patterns, from first principles? Also, we want to model the temporal aspects: how often do bloggers post messages? are the posts uniformly distributed over time? (probably not, probably bursty). How can we spot abnormal/surprising patterns?
Data: social networks, citation networks, weblog influence data.
Introductory paper(s): [McGlohon+07], [Leskovec et al, PAKDD 2006] and (internal to CMU) [Leskovec+ SDM07 - full version].
Contact persons: Mary McGlohon

4.2. [*] Cascades and Network Topology

Problem: We understand that information and "fads" seem to follow an epidemic-type pattern, and that the shape of the "cascades" follows patterns. How do the cascades change when we modify a real graph? What graph properties are critical to cascade size? Can we reverse-engineer the topology of a graph, if we are given information about cascades (e.g., size distribution, shape information)? Using snapshots of real graphs, simulate information traveling across a network using SIS/SIR infection models and note what "cascades" are formed. Then, run experiments on the same graph with some edges or nodes removed randomly. How do the cascades change (in size, shape, etc)? What if instead of removing nodes or edges randomly, we remove them according to some rule (nodes with highest degree, edges with highest weight, edges with weight less than k...)?
Data: Any graph data, ideally weighted (for certain experiments), see "Graph-like data" below
Introductory papers: Hethcote (for SIS/SIR), Leskovec+PAKDD06 (cascade algs), Leskovec+SDM07 and McGlohon+ICWSM07 (cascades in blogs).
Comments: Alternatively, instead of modifying real graphs, pick a number of different real (blogs, academic citations, network traffic) and synthetic (preferential attachment, Erdos-Renyi, small-world) networks and compare the cascades formed. How do the cascades vary, and what graph properties yield what cascade shapes?
Contact: Mary McGlohon
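
The experiment described in project 4.2 can be prototyped in a few lines. The Python sketch below uses a simplified discrete-time SIR model, in which each newly infected node gets one chance to infect each susceptible neighbor with probability `beta` and then recovers; this simplification, the function names, and the toy star graph are assumptions for illustration, not a prescribed setup.

```python
import random

def sir_cascade(adj, seed_node, beta, rng):
    """Run one simplified SIR cascade; return the set of nodes ever infected."""
    infected = {seed_node}
    frontier = [seed_node]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in infected and rng.random() < beta:
                    infected.add(v)
                    next_frontier.append(v)
        frontier = next_frontier   # nodes in the old frontier have recovered
    return infected

def remove_nodes(adj, removed):
    """Copy of the graph without the given nodes and their incident edges."""
    return {u: [v for v in nbrs if v not in removed]
            for u, nbrs in adj.items() if u not in removed}

# Toy star graph: hub 0 connected to leaves 1..4.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
rng = random.Random(42)
full = sir_cascade(star, 1, beta=1.0, rng=rng)
pruned = sir_cascade(remove_nodes(star, {0}), 1, beta=1.0, rng=rng)
print(len(full), len(pruned))
```

With `beta` below 1 the cascade size becomes random, so in practice one would repeat each run many times and compare the resulting size distributions before and after the removal rule.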

5. GRAPH ANALYSIS TOOLS AND VISUALIZATION

5.1. [*] Large Graph Visualization

Problem: Given a huge graph (say, millions of nodes), help visualize it. Start with the ICML03 paper below; then, try to extend it for huge graphs: try some partitioning/grouping method, and/or some fish-eye ideas.
Data: epinions.com; DBLP citation information, etc
Introductory paper(s): Check the paper on GMine. Also [Takeshi Yamada, Kazumi Saito, Naonori Ueda: Cross-Entropy Directed Embedding of Network Data. ICML'03]. Also, the visualization tools from CAIDA, and specifically 'walrus'. Check the graphdrawing organization and the corresponding sequence of conferences on Graph Drawing (GD) from there. Our goal here is different, though: we want to visualize large graphs that don't fit in memory, nor on the screen, nor in the human mind, unless we summarize them somehow.
Comments: Open-ended problem, but very useful - could lead to publication. The first step is to implement the method by [Yamada et al].
Contact person: Polo Chau (also Mary McGlohon)

5.2. [*] Fast implementations of RWR (for gCap)

Problem: In the 'gCap' paper (see below), and in the Drosophila Embryo project (6.1), we need to compute the steady-state probability of each node in a random walk with restarts, when the restarting node is unknown. Thus, naively, we need n² steady-state probabilities. How can we do better than that? How can we save computation, if, for a given node i, we only want the top 10 closest nodes and their scores?
Data: DBLP, Corel image data, and more.
Introductory paper(s): Probably the fastest algorithm so far is by Hanghang Tong, in ICDM'06. The goal is to make it even faster, and/or to implement it in C/C++, or 'hadoop', so that it can run on huge graphs (Gb-size). Related papers: gCap [Pan et al, KDD'04]; topic sensitive PageRank [Haveliwala WWW'02]; fast algorithms for topic sensitive PageRank [Haveliwala+ '03]; graph partitioning [Sun+, '05].
Comments: Mainly, implementation - but it has room for innovation. Closely related to the dissertation of Hanghang Tong, who will help along the way. Also, it could be used immediately by the Drosophila Embryo project (6.1).
Contact person: Hanghang Tong.
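
For reference, the naive baseline that project 5.2 wants to beat can be sketched as follows: a plain iterative computation of random walk with restart (RWR) scores for one restart node. This Python sketch assumes an unweighted graph in which every node has at least one neighbor; the restart probability of 0.15 and the function names are illustrative choices, not taken from the papers.

```python
def rwr(adj, start, restart=0.15, iters=100):
    """Steady-state visiting probabilities of a random walk that jumps back
    to `start` with probability `restart` at every step."""
    scores = {u: 0.0 for u in adj}
    scores[start] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        for u, nbrs in adj.items():
            share = (1.0 - restart) * scores[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        nxt[start] += restart
        scores = nxt
    return scores

def top_k(scores, k, exclude):
    """The k highest-scoring nodes other than `exclude`."""
    ranked = sorted((u for u in scores if u != exclude),
                    key=scores.get, reverse=True)
    return ranked[:k]

# Toy path graph 0 - 1 - 2 (made-up example).
path = {0: [1], 1: [0, 2], 2: [1]}
scores = rwr(path, start=0)
print(scores, top_k(scores, 1, exclude=0))
```

Calling `rwr` once per node gives the n² scores mentioned in the problem statement, at a cost of roughly `iters` passes over the edges per node; that is the computation to be avoided or approximated.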

5.3. 'NetFlix' competition: Collaborative Filtering and link prediction with side information

Problem: Given a user-movie rating matrix with many missing entries, how can we predict the missing ratings? Here, we want to investigate how to incorporate side information (user/movie attributes, such as job title or movie genre) to improve the prediction accuracy.
Data sets: Netflix data set, Movie-lens data set
Introductory papers: the papers by the leading team of Netflix competition [Koren et al KDD’08, Bell et al KDD’07]; the collective matrix factorization paper (one possible way to deal with the side-information) [Singh et al KDD’08].
Comments: This is the Netflix competition. Might win $1M!
Contact person: Hanghang Tong

**5.4. [*] Proximity Tracking on Graphs**

Problem: Given an author-conference network that evolves over time, which are the conferences that a given author is most closely related with, and how do they change over time?
Data set: DBLP data set
Introductory papers: gCap [Pan et al, KDD'04]; topic sensitive PageRank [Haveliwala WWW'02]; pTrack paper [Tong et al, SDM’08]
Comments: there are some possible generalizations to the current methods. Might lead to publications.
Contact person: Hanghang Tong

5.5. Graph summarization and approximation.

Problem: Given a large graph, how can we find patterns (e.g., communities and anomalies) in an intuitive and efficient way? How can we track a pattern of interest if the graph is evolving over time? One powerful way to attack this problem is through example-based low-rank approximation of the adjacency matrix of the graph.
Data set: DBLP data set; Network Traffic Data set.
Introductory papers: CUR paper [Drineas et al SIAM’05]; Colibri paper [Tong et al, KDD’08]; Non-negative CUR paper [Hyvonen et al KDD’08]
Comments: there are some possible generalizations to the current methods. Might lead to publications.
Contact person: Hanghang Tong

6. MULTIMEDIA - BIOLOGICAL AND MEDICAL IMAGES

6.1. [*+] Feature Extraction for analyzing Drosophila Embryo Images

Problem: Given a set of annotated 2D images of Drosophila embryos (352*160 gray scale), how can we extract good numerical features that capture the characteristics of each image? Is there a proper distance function that combines local features and global features to determine the "closeness" between two images? Good feature extraction algorithms would help to improve the performance of a number of mining tasks such as automatic captioning and multi-modal querying.
Data: Check the BDGP site here; we also have preprocessed data available in which low quality images were removed and fly embryos were already scaled and aligned.
Introductory paper(s): FEMine (Our previous work published in KDD'06, details on data preprocessing; baseline for feature extraction algorithm design); Zhou and Peng, 2007, and Peng et al, 2007 (Recent work on automatic fly embryo image analysis); also Tomancak et al, 2002 (Papers by the BDGP group; Good references if you want to know more about the dataset)
Comments: The dataset contains more than 10k images; you may start from a smaller subset, and write your own algorithm to do further data cleaning.
Contact Persons: Fan Guo and Lei Li

6.2. Multimodel tensor analysis for fMRI brain scans

Problem: Getting the best of both worlds (tensors and wavelets) seems to be a promising way to handle multidimensional time series. In this problem, we want to perform a multimodel analysis of fMRI scans from eleven (11) subjects that perform four (4) different tasks. Therefore we have an 11x4xXxYxZxT tensor, where the last dimension is the time aspect.
Data: fMRI brain scans from Temple University
Introductory papers:
Comments: In collaboration with Temple University (Michael Barnathan, Prof. Vasilis Megalooikonomou)
Contact person : Charalampos (Babis) Tsourakakis

DATASETS

Unless explicitly mentioned, the datasets are either 'public' or 'owned' by the instructor; for the rest, we need to discuss 'Non-disclosure agreements' (NDAs).

Time sequences

Time series repository at UCR.
KURSK dataset of multiple time sequences: time series from seismological sensors by the explosion site of the 'Kursk' submarine.
Truck traffic data, from our Civil Engineering Department. Number of trucks, weight etc. per day per highway-lane. Find patterns, outliers; do data cleansing.
River-level / hydrology data: multiple, correlated time series. Do data cleansing; find correlations between these series. Excellent project for people that like canoeing!
Sunspots: number of sunspots per unit time. Some data are here. Sunspots seem to have an 11-year periodicity, with high spikes.
Time sequences from the Santa Fe Institute forecasting competition (financial data, laser-beam oscillation data, patients' apnea data etc)
Disk access traces, from HP Labs (we have local copies at CMU). For each disk access, we have the timestamp, the block-id, and the type ('read'/'write'). Here is a snippet of the data, aggregated per 30'.

Spatial data

Astrophysics data - thousands of galaxies, with coordinates, red-shift, spectra, photographs. Small snippet of the data. More data are in the 'skyserver' web site, where you can ask SQL queries and get data in html or csv format
Road segments: several datasets with line segments (roads of U.S. counties, Montgomery MD, Long Beach CA, x-y coordinates of stars in the sky from NASA, etc). Snippet of data (roads from California, from TIGER).

Images/video

Biological data: images of proteins, with ~50 attributes each.
- 'Owner': Prof. Bob Murphy.
Video/image/sound data, from Informedia. 2Tb of video, segmented; 1M images with features; 10^4 faces. Extract features; design good similarity functions; do the named-entity analysis.

Graph data

Web-log and click-stream data (NDA: needed).
Snapshots of 2 anonymized (and anonymous) social networks (NDA)
Visit patterns for a large web site: for 300 pages, and thousands of users, we record how many times a user visited a specific site. Find patterns, clusters, fractal dimensions, regularities in the SVD etc.
Netflix competition dataset (users, movies and ratings) - needs easy registration - we have a cleaned-up version available, locally.
Enron email dataset (400 MB compressed)
Movie-actor data from imdb.com (we have a cleaned-up snapshot of it)
DBLP author-paper-conference data from the DBLP site of Mike Ley (records in XML, and their DTD). For 'ego-surfing', try this java app or the java applet at U. Alberta.
Graph datasets at U.Mass (Amherst), by Prof. Dave Jensen.

Miscellaneous:

Several collections of training data from the UC-Irvine repository (check the larger ones) and from KDD-nuggets for machine learning algorithms.
Demographic data from the U.S. Bureau of Census

SOFTWARE

Notes for the software: Before you modify any code, please contact the instructor - ideally, we would like to use these packages as black boxes.

Readily available:
- ACCESS METHODS
  - DR-tree : R-tree code; searches for range and nearest-neighbor queries. In C.
  - kd-tree code
  - OMNI trees - a faster version of metric trees.
  - B-tree code, for text (should be easily changed to handle numbers, too). In C.
- SVD AND TENSORS:
  - Code for SVD in 'mathematica'.
  - Code for SPIRIT (incremental SVD on streams)
  - Tensor toolkit from Tamara Kolda
- FRACTALS
  - Code for computing the fractal dimension (simplified version in Perl; more elaborate, in Perl and C, by Leejay Wu)
  - Barnsley's algorithm for Iterated Function Systems in 'C'.
- GRAPHS
  - the NetMine network topology analysis package
  - GMine: interactive graph visualization package and graph manipulation library (by Junio (Jose Fernandez Rodrigues Junior) and Jure Leskovec)
  - the 'crossAssociation' package for graph partitioning.
Outside CMU:
- GiST package from Hellerstein at UC Berkeley: A general spatial access method, which is easy to customize. It is already customized to yield R-trees.

BIBLIOGRAPHICAL RESOURCES:

Last modified Jan. 19, 2009, by Christos Faloutsos.
