Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2009, C. Faloutsos
List of suggested projects
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
 URL for this very page (internal to CMU; please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.S09/CMUONLY/projlist.html
 Feel free to propose projects outside this list, as long
as they have to do with mining and indexing large datasets.
In that case, contact the instructor as early as possible.
 An asterisk [*] in the project title signifies that the
project is related to the Ph.D. dissertation of the contact person. A
cross [+] means that this
is a group project, with several potential collaborators. But feel
free to consider non-asterisked projects, too, if they are related
to your interests or your dissertation.
 Please form groups of 3 people.
 Please check the 'blackboard' system, where there is one
thread for each of the
projects below. Please indicate your interest, by posting in the
appropriate thread(s), so that you can find partners.
SUGGESTED TOPICS
0. HADOOP AND PARALLELISM
The projects below are mainly designed for a traditional,
single-machine architecture. However, 'hadoop' allows relatively
easy parallel execution, implementing the map-reduce
system of Google [Dean + Ghemawat, OSDI'04]. 'Hadoop' is
open source; we have a small cluster where we can give you an
account, or we can give you access to a 50-node hadoop cluster at
Intel-Pittsburgh, and maybe access to the 'M45' of Yahoo (1000
machines, 4 cores each, 1TB total RAM and over 3PB of storage; see
the press release at Yahoo,
Scientific American, etc.). You are welcome to try any of the
projects below on a hadoop cluster.
1. SPATIO-TEMPORAL AND STREAM MINING
1.1 [*] Automating BGP-anomaly detection
 Problem: Find interesting patterns and/or anomalies
given a 2-year archive of BGP (Border Gateway Protocol) update
messages between routers. The findings should be relevant to
network administrators as well; finding such things will go a long
way toward automating the monitoring of routers and helping catch major
problems. We have developed a tool, 'BGP-lens', in MATLAB
at CMU, which is able to find 'clotheslines' (IPs sending a persistent,
near-constant number of updates over a long period of time) and
'prolonged spikes' (IPs sending a short, high burst of updates,
probably relating to some malfunction/event). For this we use an
aggregated form of the update data (number of updates per 600s,
etc.). Note that the data has millions of updates, so straightforward
methods don't work. The project can be subdivided into many
interesting paths:
 Studying the effect of parameters and thresholds on the discovery
of clotheslines and prolonged spikes. For example, clothesline
discovery relies on moving-window median filtering; how does
the window size affect the algorithm? And similar questions.
 We want to deploy such a tool so that the admins use it, but
we would need online and incremental versions of the algorithms,
so that the tool can quickly work on incoming update data. Also,
this should be done in a non-MATLAB scripting language like
Perl/Python/Ruby, as they are lightweight and not everyone can
afford MATLAB :).
 Also, a BGP-lens with a GUI would be more easily used; a nice
project would be to develop a visualization package for it. Note that you
would have to deal with representing really large time series, and
the GUI should provide sensitivity knobs (suitable parameters in the
tool's algorithms) so that events at different time
scales can be identified.
 We believe that the algorithms used in BGP-lens are more
general. Hence, one can study where else such methods can be
employed (specifically, on which other datasets). Can the methods be
used as-is, or do we need to tweak/change the algorithms?
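As a starting point for the parameter-sensitivity question, here is a minimal sketch of clothesline-style detection: smooth the per-window update counts with a moving median and flag series that stay near-constant for a long stretch. The function names, window size, and thresholds below are illustrative assumptions, not the actual BGP-lens algorithm.

```python
from statistics import median

def moving_median(xs, w):
    """Moving-window median with window size w; edges are truncated."""
    h = w // 2
    return [median(xs[max(0, i - h):i + h + 1]) for i in range(len(xs))]

def is_clothesline(counts, w=5, tol=0.1, min_len=8):
    """Flag a series whose median-smoothed counts stay near-constant
    (within +/- tol of their overall median) for at least min_len windows."""
    sm = moving_median(counts, w)
    m = median(sm)
    if m == 0:
        return False
    run = best = 0
    for v in sm:
        run = run + 1 if abs(v - m) <= tol * m else 0
        best = max(best, run)
    return best >= min_len

# A persistent, near-constant updater vs. a single prolonged spike:
steady = [100, 98, 101, 99, 100, 102, 97, 100, 101, 99]
spiky  = [1, 1, 2, 1, 500, 480, 1, 2, 1, 1]
print(is_clothesline(steady), is_clothesline(spiky))  # True False
```

Varying `w`, `tol`, and `min_len` on real aggregated update counts is exactly the kind of sensitivity study the first path above asks for.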
 Data: We shall use BGP
router data from the Abilene network, a research network, over a
period of 2 years. It can be seen at the Datapository project. Check out
the BGPMonitor there: you can run some queries too. A snippet can
be downloaded from (CMU only) here,
it contains raw and aggregated updates.
 Introductory
paper(s): Draft
paper (under submission; internal to CMU; please do not
disseminate).
 Comments: Very high
practical interest, with good problems from both the algorithmic and
the system side. Also nice visualization challenges. There
is a lot of room: ~3-4 people.
 Contact person:
B. Aditya
Prakash
1.2. Disk access
traffic patterns, and the Self* project
 Problem: Given traces
from real workstations (tuples of the form <disk-id, track-id,
R/W-flag, timestamp>), find patterns; do predictions; use them
to design better buffering and prefetching algorithms. Try 'blind
signal separation'/ICA, to distinguish between 'reads' and
'writes', or between interactive and database accesses. A related
problem is to forecast the response time for a given disk request,
given a training set with response times. The main problem is
to extract good features. In fact, this is a small part of the
Self* project,
which also has numerous co-evolving time sequences from a prototype
data center with multiple 'intelligent' storage units: CPU
utilizations, network traffic measurements, room temperature sensor
measurements, humidity measurements, etc. The goal is to find
patterns, correlations, lag-correlations, and anomalies, to help the
data center self-organize, self-detect upcoming (or existing)
failures and attacks, and self-optimize its performance.
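To make the lag-correlation idea concrete, here is a minimal pure-Python sketch that finds the lag at which two co-evolving series correlate best. The brute-force search and function names are illustrative assumptions; the Sakurai+ SIGMOD'05 paper does this far more efficiently.

```python
import math

def corr(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def best_lag(x, y, max_lag):
    """Shift y by l = 0..max_lag and return the lag with the highest
    correlation against x (i.e., y follows x with that delay)."""
    return max(range(max_lag + 1),
               key=lambda l: corr(x[:len(x) - l], y[l:]))

# Two co-evolving series: y is x delayed by 4 time-ticks.
base = [math.sin(t / 3.0) for t in range(70)]
x, y = base[4:64], base[0:60]
print(best_lag(x, y, max_lag=10))  # 4
```

On data-center sensor streams (e.g., CPU load leading temperature), such discovered lags are exactly the lag-correlations mentioned above.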
 Data: See the
'Disk Access Traces' below.
Also, the web site of the Self* project, with a lot
of measurement data, that we already have. We also have traces from
an MS SQL Server.
 Introductory
paper(s): the 'PQRS' model [Wang et
al, PEVA 2001]; see also the use of CART [
Wang et al SIGMETRICS 2004] and the follow-up work [Mesnier+,
'05]. Check the SPIRIT
project, or the live InteMon
system, and the corresponding publications (VLDB'06,
OSR'06) on
Jimeng's page under
'InteMon'. Also the lag-correlation paper [
Sakurai+ SIGMOD'05].
 Comments: The general
case is hard, and in fact is the topic of dissertations
(Dr. Mengzhi
Wang, Dr. Mike
Mesnier, Dr. Jimeng
Sun). However, there are a lot of initial ideas that you could
try within a semester, and a
lot of industrial interest in the topic. One idea is to use
multi-resolution analysis, like the AWSOM paper [Papadimitriou+,
VLDB 2003].
 Contact
person: Lei Li
2. HADOOP AND LARGE GRAPH MINING
2.1. [*] Large/parallel graph mining, possibly using
'hadoop'
 Problem: Given a large
graph with billions of edges and tens of billions of nodes, and
several share-nothing machines, parallelize the typical graph
mining algorithms, to be as fast as you can. We want to
compute the in- and out-degree distributions, the diameter of the
graph, the first several eigenvalues, the 'network value' of each
node, the 'clustering coefficient', and the node- and edge-betweenness.
The diameter and the connected components have been done by Mr. U
Kang (contact person), but even there, there is room for
optimizations. The first step is to time several possible
architectures: with or without a relational DBMS; with or without
replication of the data. Also, what is the best way to store the
data (e.g., as <from,to> pairs in a flat file; as an
adjacency list, hashed on the 'from' node-id; or as something
else)?
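To show what the map-reduce formulation looks like for the simplest of these tasks, here is a toy, single-machine sketch of the two rounds needed for an out-degree distribution over <from,to> pairs. The `map_reduce` helper is a stand-in assumption for what hadoop actually provides (shuffle and reduce over a cluster).

```python
from collections import defaultdict

# Edge file as <from,to> pairs; here an in-memory stand-in.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 1)]

def map_reduce(records, mapper, reducer):
    """Toy single-machine map-reduce: the 'shuffle' groups mapper
    outputs by key, then each group is reduced."""
    groups = defaultdict(list)
    for rec in records:
        for k, v in mapper(rec):
            groups[k].append(v)
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Round 1: out-degree of every node.
out_deg = map_reduce(edges,
                     mapper=lambda e: [(e[0], 1)],
                     reducer=lambda k, vs: sum(vs))

# Round 2: degree distribution, i.e. how many nodes have each degree.
dist = map_reduce(out_deg.items(),
                  mapper=lambda kv: [(kv[1], 1)],
                  reducer=lambda k, vs: sum(vs))

print(out_deg)  # {1: 2, 2: 1, 3: 1, 4: 1}
print(dist)     # {2: 1, 1: 3}
```

Note how the storage question above matters here: with <from,to> pairs the mapper is trivial, while an adjacency list hashed on the 'from' node-id would make round 1 a pure local count.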
 Data: We shall start
with synthetic data, using an existing generator [Leskovec+,
PAKDD'05]. Then, DBLP, IMDB, etc. We could also get data on real
CMU IP traffic (will need an NDA). Finally, we also have a
who-talks-to-whom social network with 270 million nodes and 8
billion edges (60GB of data).
 Introductory paper(s):
The generator above; the Gamma database machine papers [Dewitt+,
IEEE TKDE'90]; papers on hash-joins [Kitsuregawa+,
VLDB'90]; the R-MAT paper [Chakrabarti+
SIAM DM'04]; the connection subgraph paper [
Faloutsos+, KDD'04]. If you plan to use 'hadoop', get the map-reduce paper
[Dean + Ghemawat, OSDI'04].
 Comments: Very high
practical interest, with hard problems from both the algorithmic and
the system side. There is a lot of room, even for 4 or
more people.
 Contact persons:
Charalampos (Babis)
Tsourakakis, U
Kang
2.2. [*] Eigenvalues in Hadoop
 Problem:
Spectra are very informative in many real-world problems: Latent
Semantic Indexing, Spectral Clustering, and Spectral Cuts using the
Cheeger Inequality all rely on the eigendecomposition of the underlying
matrix. In this project we aim to develop an eigensolver for real,
symmetric matrices (for example, undirected graphs have this matrix
representation) that computes the top-k eigenvalues.
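As a small-scale starting point, here is a pure-Python sketch of the classical power-iteration-with-deflation approach for the top-k eigenvalues of a symmetric matrix; in a hadoop setting, the matrix-vector product is the part that would be distributed. The test matrix and iteration count are illustrative assumptions.

```python
def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k_eigen(A, k, iters=500):
    """Power iteration with deflation for a real symmetric matrix A.
    Returns approximations to the k largest-magnitude eigenvalues."""
    A = [row[:] for row in A]
    n = len(A)
    vals = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # generic start vector
        for _ in range(iters):
            w = matvec(A, v)
            norm = dot(w, w) ** 0.5
            v = [x / norm for x in w]
        lam = dot(v, matvec(A, v))  # Rayleigh quotient
        vals.append(lam)
        # Deflate: A <- A - lam * v v^T removes the found eigenpair.
        for i in range(n):
            for j in range(n):
                A[i][j] -= lam * v[i] * v[j]
    return vals

A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
print(top_k_eigen(A, 2))  # approx [3.414, 2.0]
```

The serious candidates (Lanczos-style methods) converge much faster, but they share the same core primitive: repeated matrix-vector products, which is what makes the hadoop formulation natural.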
 Data: Abundant
for smaller sizes; for large-scale data, the Yahoo web graph (120GB).
 Introductory
papers

 Comments: Hard,
but the contact person will provide significant help, from both the
theoretical and practical points of view. We'd like to contribute
open source to
HaMa (Hadoop
Matrix Algebra).
 Contact Persons:
Charalampos (Babis)
Tsourakakis
3. GRAPHS - PATTERNS, OUTLIERS AND GENERATORS
3.1. [*] Anomaly detection in weighted graphs
 Problem: Given
a graph data set that grows over time, with weights on edges, how
can we find anomalous/interesting/extreme nodes/edges at a given
time snapshot? What kind of features would be the most informative
for unusual-behavior detection? How can we generalize this idea to
time-evolving graphs, to track anomalous nodes/edges?
 Data: Enron
emails, FEC campaigns, DBLP, and any weighted data set with
potentially interesting nodes that you might have.
 Introductory papers:
 Caleb C. Noble and Diane J. Cook. Graph-based anomaly
detection. In KDD, pages 631-636, 2003.
 William Eberle and Lawrence B. Holder. Discovering structural
anomalies in graph-based data. In ICDM Workshops, pages
393-398, 2007.
 Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, Christos Faloutsos.
Neighborhood Formation and Anomaly Detection
in Bipartite Graphs.
In Proc. ICDM, pages 418-425, Houston, Texas, Nov. 27-30, 2005.
 William Eberle and Lawrence B. Holder. Detecting anomalies in
cargo shipments using graph properties. In ISI, pages
728-730, 2006.
 J. Shetty and J. Adibi. Discovering important nodes through graph
entropy: the case of the Enron email database. In Proceedings of
the 3rd International Workshop on Link Discovery (at KDD), pages
74-81, 2005.
 Comments: One
challenge is that some graphs are so large that they do not fit in
main memory. What preprocessing/filtering/sampling techniques can
help reduce feature-extraction time, so that not every
node/edge has to be processed in the end?
 Contact Person:
Leman
Akoglu
3.2. [*] Patterns and ``laws'' in weighted graphs
 Problem: How
can we model weighted graphs (for example, with network packets
flowing between nodes) for future prediction? Is there any pattern
concerning weights? What kind of patterns would we expect? How are
the weights distributed on the incident edges of a given node? For
a given edge, do weight arrivals show any interesting behavior
besides being bursty? How can we do a "microscopic" analysis, so
as to model a given weighted graph over time?
 Data: Network
traffic, Campaign donations
 Introductory
papers:

 Mary McGlohon, Leman Akoglu, and Christos Faloutsos.
Weighted graphs and disconnected components: patterns and a
model. In ACM SIGKDD, Las Vegas, Nev., USA, August 2008.
 Leman Akoglu, Mary McGlohon, and Christos Faloutsos. RTM:
laws and a recursive generator for weighted time-evolving
graphs. In ICDM: International Conference on Data Mining, Pisa,
Italy, December 2008.
 Jure Leskovec, Lars Backstrom, Ravi Kumar, Andrew Tomkins.
Microscopic Evolution of Social Networks. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (ACM KDD), 2008.
 For a good survey on link prediction, see Liben-Nowell
and Kleinberg 2003.
 Comments: Ideas
can be borrowed from the many graph generators in the literature. The
major part is to work with the weights. Contact the instructor or
Leman for an idea about a new graph generator, using a 'monkeys on a
typewriter' approach.
 Contact Person:
Leman
Akoglu
3.3. Fast KronFIT
 Problem: Given a real
graph G, how can we generate another graph that looks like it?
Kronecker graphs are such a generator. The existing approach to
fitting Kronecker graphs relies on the Maximum Likelihood approach;
this approach, though successful, is slow. Can we find an
algorithm that generates such graphs with the same success, but
faster? One idea is to use the SVD, or the spectra of graphs, to
come up with such an algorithm.
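For intuition on what is being fitted, here is a minimal sketch of the deterministic Kronecker construction: repeated Kronecker powers of a small seed (initiator) matrix. KronFit works in the other direction, estimating the seed from a real graph; this only shows the forward generation, and the seed below is an illustrative choice.

```python
def kron(A, B):
    """Kronecker product of two square 0/1 adjacency matrices."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m]
             for j in range(n * m)] for i in range(n * m)]

def kronecker_graph(seed, k):
    """k-th Kronecker power of a seed adjacency matrix."""
    G = seed
    for _ in range(k - 1):
        G = kron(G, seed)
    return G

seed = [[1, 1],
        [1, 0]]                  # 2x2 initiator matrix
G = kronecker_graph(seed, 3)     # 8x8 adjacency matrix
print(len(G), sum(map(sum, G)))  # 8 27: edges grow as (edges of seed)^k
```

The multiplicative structure is what gives Kronecker graphs their self-similar degree distributions, and it also hints at why spectral shortcuts are plausible: the eigenvalues of a Kronecker power are products of the seed's eigenvalues.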
 Data: Several real
graphs (Epinions, Oregon AS, Flickr etc)
 Introductory
papers:
 Comments: Requires good
knowledge of linear algebra. It will be very useful, because the
current KronFit method is often slow.
 Contact persons:
Charalampos (Babis)
Tsourakakis
4. BLOGS AND INFLUENCE PROPAGATION
4.1. [*] Propagation of Influence/Information in Networks and
weblogs ('blogs')
 Problem: We want to
find patterns of propagation of information (or viruses, influence,
etc.) in a network. We can start by limiting ourselves to trees. For
example, in a weblog influence tree, what is the most typical
form of influence: a 'star' topology? a 'string' topology?
something in-between? How can we generate such realistic patterns, from
first principles? Also, we want to model the temporal aspects: how
often do bloggers post messages? Are the posts uniformly
distributed over time? (Probably not; probably bursty.) How can
we spot abnormal/surprising patterns?
 Data: social networks,
citation networks, weblog influence data.
 Introductory paper(s):
[McGlohon+07],
[Leskovec
et al, PAKDD 2006] and (internal to CMU) [
Leskovec+ SDM'07, full version].
 Contact persons:
Mary
McGlohon
4.2. [*] Cascades and Network Topology
 Problem: We understand that information and "fads" seem
to follow an epidemic-type pattern, and that the shapes of the
resulting "cascades" follow patterns. How do the cascades change when we
modify a real graph? Which graph properties are critical to cascade
size? Can we reverse-engineer the topology of a graph, if we are
given information about cascades (e.g., size distribution, shape
information)? Using snapshots of real graphs, simulate information
traveling across a network using SIS/SIR infection models and note
what "cascades" are formed. Then, run experiments on the same graph
with some edges or nodes removed randomly. How do the cascades
change (in size, shape, etc.)? What if, instead of removing nodes or
edges randomly, we remove them according to some rule (nodes with
highest degree, edges with highest weight, edges with weight less
than k, ...)?
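The simulation part can start very simply. Here is a minimal SIR-style sketch (each infected node gets one chance to infect each susceptible neighbor, then recovers); the graph, infection probability, and function names are illustrative assumptions, and the SIS variant would let recovered nodes become susceptible again.

```python
import random

def sir_cascade(adj, seed_node, beta=0.5, rng=None):
    """One SIR cascade on a graph given as {node: [neighbors]}.
    Each infected node tries once to infect each susceptible neighbor
    with probability beta, then recovers permanently.
    Returns the set of nodes that were ever infected (the cascade)."""
    rng = rng or random.Random(0)
    infected, recovered = {seed_node}, set()
    frontier = [seed_node]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj.get(u, []):
                if v not in infected and rng.random() < beta:
                    infected.add(v)
                    nxt.append(v)
            recovered.add(u)
        frontier = nxt
    return infected

# Toy chain graph 0-1-2-3; with beta=1.0 the cascade covers everything.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(sorted(sir_cascade(chain, 0, beta=1.0)))  # [0, 1, 2, 3]
```

Re-running this over many random seeds, before and after deleting nodes or edges, gives exactly the cascade size/shape distributions the experiments above call for.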
 Data: Any graph data, ideally weighted (for certain
experiments); see "Graph data" below.
 Introductory papers: Hethcote
(for SIS/SIR), Leskovec+ PAKDD'06
(cascade algorithms), Leskovec+ SDM'07
and McGlohon+ ICWSM'07 (cascades
in blogs).
 Comments: Alternatively, instead of modifying real
graphs, pick a number of different real (blogs, academic citations,
network traffic) and synthetic (preferential attachment,
Erdos-Renyi, small-world) networks and compare the cascades formed.
How do the cascades vary, and which graph properties yield which
cascade shapes?
 Contact: Mary McGlohon
5. GRAPH ANALYSIS TOOLS AND VISUALIZATION
5.1. [*] Large Graph Visualization
 Problem: Given a huge graph (say, millions of nodes),
help visualize it. Start with the ICML'03 paper below; then, try to
extend it to huge graphs: try some partitioning/grouping method,
and/or some fisheye ideas.
 Data: epinions.com; DBLP citation information,
etc
 Introductory paper(s): Check the paper on GMine. Also [Takeshi
Yamada, Kazumi Saito, Naonori Ueda:
Cross-Entropy Directed Embedding of Network Data.
ICML'03]. Also, the visualization tools
from CAIDA, and specifically 'walrus'.
Check the graph-drawing
organization, and the corresponding sequence of conferences
on Graph Drawing (GD) from there. Our goal here is different,
though: we want to visualize large graphs that don't fit in memory,
nor on the screen, nor in the human mind, unless we summarize them
somehow.
 Comments: Open-ended problem, but very useful;
it could lead to a publication. The first step is to implement
the method by [Yamada et al].
 Contact person:
Polo Chau (also
Mary McGlohon)
5.2. [*] Fast implementations of RWR (for gCap)
 Problem: In the 'gCap'
paper (see below), and in the Drosophila embryo project (topic 6.1),
we need to compute the steady-state probability of each node in a
random walk with restarts, when the restarting node is unknown.
Thus, naively, we need n^2 steady-state
probabilities. How can we do better than that? How can we save
computation if, for a given node i, we only want the top 10 closest
nodes and their scores?
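For concreteness, here is the naive power-iteration form of RWR from a single restart node on a tiny unweighted graph; the fast methods in the papers below avoid exactly this per-restart-node iteration. The restart probability `c` and the function names are illustrative assumptions.

```python
def rwr_scores(adj, restart, c=0.15, iters=100):
    """Random walk with restart from node `restart`, on an unweighted
    graph {node: [neighbors]}; returns steady-state probabilities.
    Each step: with prob. c jump back to `restart`, else follow a
    uniformly random outgoing edge."""
    nodes = list(adj)
    p = {u: (1.0 if u == restart else 0.0) for u in nodes}
    for _ in range(iters):
        q = {u: (c if u == restart else 0.0) for u in nodes}
        for u in nodes:
            share = (1.0 - c) * p[u] / len(adj[u])
            for v in adj[u]:
                q[v] += share
        p = q
    return p

star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
p = rwr_scores(star, restart=0)
print(max(p, key=p.get))  # 0: the restart node is 'closest' to itself
```

Repeating this for all n restart nodes is the n^2 cost mentioned above; precomputation (e.g., low-rank approximation of the walk matrix) is what the fast algorithms exploit.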
 Data: DBLP, Corel
image data, and more.
 Introductory paper(s): Probably the fastest algorithm so
far is by Hanghang Tong, in ICDM'06.
The goal is to make it even faster, and/or to implement it in
C/C++ or 'hadoop', so that it can run on huge graphs
(GB-size). Related papers: gCap [Pan
et al, KDD'04]; topic-sensitive PageRank [Haveliwala WWW'02]; fast
algorithms for topic-sensitive PageRank [Haveliwala+ '03]; graph partitioning
[Sun+,
'05].
 Comments: Mainly
implementation, but it has room for innovation. Closely related to
the dissertation of Hanghang Tong, who will help along. Also, it
could be used immediately by the Drosophila embryo project
(topic 6.1).
 Contact person:
Hanghang Tong.
5.3. 'NetFlix' competition:
Collaborative Filtering and link prediction with side
information
 Problem: Given a user-movie rating matrix with
many missing entries, how can we predict the missing ratings? Here, we
want to investigate how to incorporate side-information (such
as user/movie attributes: job title, movie genre) to improve
the prediction accuracy.
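As a baseline before adding any side-information, here is a minimal SGD matrix-factorization sketch over (user, movie, rating) triples; the collective matrix factorization paper below factorizes the attribute matrices jointly with this one. All hyperparameters and the toy ratings are illustrative assumptions.

```python
import random

def factorize(ratings, n_users, n_items, k=2, steps=500, lr=0.02, reg=0.05):
    """Plain SGD factorization: rating(u, i) ~ U[u] . V[i],
    with L2 regularization on both factor matrices."""
    rng = random.Random(0)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for f in range(k):
                U[u][f], V[i][f] = (
                    U[u][f] + lr * (err * V[i][f] - reg * U[u][f]),
                    V[i][f] + lr * (err * U[u][f] - reg * V[i][f]))
    return U, V

# (user, movie, rating) triples; (user 2, movie 0) is missing.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 2), (2, 1, 1)]
U, V = factorize(ratings, n_users=3, n_items=2)
pred = sum(a * b for a, b in zip(U[2], V[0]))
print(round(pred, 1))  # predicted rating for the unseen pair
```

Side-information would enter as additional factorized matrices (e.g., a movie-genre matrix sharing V), which is the direction the project asks you to explore.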
 Data sets: Netflix data set, Movielens data
set
 Introductory papers: the papers by the leading
team of the Netflix competition [Koren et
al KDD'08, Bell et
al KDD'07]; the collective matrix factorization paper
(one possible way to deal with the side-information) [Singh et
al KDD'08].
 Comments: This is
the Netflix competition; you might win the $1M prize!
 Contact person:
Hanghang
Tong
5.4. [*] Proximity Tracking
on Graphs
 Problem: Given an author-conference network
that evolves over time, which are the conferences that a given
author is most closely related to, and how do they change over
time?
 Data set: DBLP data set
 Introductory papers: gCap [Pan
et al, KDD'04]; topic-sensitive PageRank [Haveliwala WWW'02]; pTrack
paper [Tong et al,
SDM'08]
 Comments: there are some possible
generalizations to the current methods. Might lead to
publications.
 Contact person: Hanghang
Tong
5.5. Graph summarization and approximation.
 Problem: Given a large graph, how can we find
patterns (e.g., communities and anomalies) in an intuitive and
efficient way? How can we track a pattern of interest if the graph is
evolving over time? One powerful way to attack this problem is
through example-based low-rank approximation of the adjacency
matrix of the graph.
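To see why example-based (CUR-style) approximation is appealing, here is a tiny sketch: for a rank-2 matrix, keeping 2 actual columns and 2 actual rows, glued together by the inverse of their intersection, reconstructs the matrix exactly. Real CUR methods choose the rows/columns by sampling and use a pseudo-inverse; the hard-coded 2x2 inverse and the toy matrix below are only for illustration.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def cur_2(A, I, J):
    """Example-based low-rank approximation: keep the actual rows I and
    columns J of A, glued by the inverse of their 2x2 intersection."""
    C = [[A[i][j] for j in J] for i in range(len(A))]   # kept columns
    R = [A[i][:] for i in I]                            # kept rows
    (a, b), (c, d) = [[A[i][j] for j in J] for i in I]  # intersection
    det = a * d - b * c
    Uinv = [[d / det, -b / det], [-c / det, a / det]]
    return matmul(matmul(C, Uinv), R)

A = [[1, 0, 2, 1],
     [4, 1, 4, 3],
     [5, 1, 6, 4],
     [8, 2, 8, 6]]   # rank 2, so 2 rows + 2 columns suffice exactly
Ahat = cur_2(A, I=[0, 1], J=[0, 1])
print(Ahat == A)  # True
```

Because C and R are actual columns/rows of the adjacency matrix, they stay sparse and interpretable (they are real nodes), which is the advantage over SVD that the Colibri/CUR papers exploit.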
 Data set: DBLP data set; Network Traffic Data
set.
 Introductory papers: CUR paper [Drineas
et al SIAM'05]; Colibri paper [Tong et al,
KDD'08]; non-negative CUR paper [Hyvonen et al
KDD'08]
 Comments: there are some possible
generalizations to the current methods. Might lead to
publications.
 Contact person: Hanghang
Tong
6. MULTIMEDIA - BIOLOGICAL AND MEDICAL IMAGES
6.1. [*+] Feature Extraction for analyzing Drosophila Embryo
Images
 Problem: Given a set
of annotated 2-D images of Drosophila embryos (352x160, grayscale),
how can we extract good numerical features that capture the
characteristics of each image? Is there a proper distance function
that combines local and global features to determine the
"closeness" between two images? Good feature-extraction algorithms
would help improve the performance of a number of mining tasks,
such as automatic captioning and multimodal querying.
 Data: Check the
BDGP site
here; we also have preprocessed data available, in which
low-quality images were removed and the fly embryos were already
scaled and aligned.
 Introductory paper(s):
FEMine (our previous work, published in KDD'06; details on data
preprocessing; a baseline for feature-extraction algorithm design);
Zhou and Peng, 2007, and Peng
et al, 2007 (recent work on automatic fly embryo image
analysis); also
Tomancak et al, 2002 (papers by the BDGP group; good references
if you want to know more about the dataset)
 Comments: The dataset
contains more than 10k images; you may start from a smaller subset,
and write your own algorithm to do further data cleaning.
 Contact Persons:
Fan Guo and
Lei Li
6.2. Multi-modal tensor analysis for fMRI brain scans
 Problem: Getting
the best of both worlds (tensors and wavelets) seems to be a
promising way to handle multi-dimensional time series. In this
problem, we want to perform a multi-modal analysis of fMRI scans
from eleven (11) subjects that perform four (4) different tasks.
Therefore, we have an 11x4xXxYxZxT tensor, where the last dimension
is the time aspect.
 Data: fMRI brain scans
from Temple University
 Introductory
papers:
 Comments: In
collaboration with Temple University (Michael Barnathan, Prof.
Vasilis Megalooikonomou)
 Contact person:
Charalampos (Babis)
Tsourakakis
DATASETS
Unless explicitly mentioned, the datasets are either
'public' or 'owned' by the instructor; for the rest, we need to
discuss 'non-disclosure agreements' (NDAs).
Time sequences
 Time series
repository at UCR.
 KURSK
dataset of multiple time sequences: time series from
seismological sensors by the explosion site of the 'Kursk'
submarine.
 Truck traffic data, from our Civil Engineering
Department: number of trucks, weight, etc., per day per highway lane.
Find patterns and outliers; do data cleansing.
 River-level / hydrology data: multiple,
correlated time series. Do data cleansing; find correlations
between these series. Excellent project for people who like
canoeing!
 Sunspots: number
of sunspots per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
 Time sequences from the
Santa Fe Institute forecasting competition (financial data,
laser-beam oscillation data, patients' apnea data, etc.)
 Disk access
traces, from HP Labs (we have local copies at CMU).
For each disk access, we have the timestamp, the block-id, and the
type ('read'/'write'). Here is a
snippet of the data, aggregated per 30 minutes.
Spatial data
 Astrophysics data: thousands of galaxies, with
coordinates, redshift, spectra, and photographs.
Small snippet of the data. More data are on the 'skyserver' web
site, where you can ask SQL queries and
get data in HTML or CSV format.
 Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc.).
Snippet of data (roads from California, from TIGER).
Images/video

Biological data: images of proteins, with ~50 attributes
each.
 'Owner': Prof. Bob Murphy.
 Video/image/sound data, from Informedia: 2TB of video,
segmented; 1M images with features; 10^4 faces. Extract features;
design good similarity functions; do named-entity
analysis.
Graph data
 Weblog and clickstream data (NDA needed).
 Snapshots of 2 anonymized (and anonymous) social networks
(NDA)
 Visit patterns for a large web site: for 300 pages and
thousands of users, we record how many times each user visited a
specific page. Find patterns, clusters, fractal dimensions,
regularities in the SVD, etc.
 Netflix competition
dataset (users, movies and ratings); needs easy registration; we
have a cleaned-up version available, locally.
 Enron email
dataset (400 MB compressed)
 Movie-actor data from imdb.com (we have a cleaned-up snapshot
of it)
 DBLP author-paper-conference data from the DBLP site of Mike
Ley (records in XML,
and their
DTD). For 'ego-surfing', try this Java app or the
Java applet
at U. Alberta.
 Graph
datasets at U.Mass (Amherst), by Prof. Dave Jensen.
Miscellaneous:
SOFTWARE
Notes for the software: before you modify any code,
please contact the instructor; ideally, we would like to use these
packages as black boxes.
 Readily available:
 ACCESS METHODS

DR-tree: R-tree code; searches for range and nearest-neighbor
queries. In C.
 kd-tree code.
 OMNI
trees: a faster version of metric trees.
 B-tree code, for text (should be easily changed to handle
numbers, too). In C.
 SVD AND TENSORS:
 FRACTALS
 GRAPHS
 the
NetMine network topology analysis package
 GMine:
interactive graph visualization package and graph manipulation
library (by Junio (Jose Fernandez Rodrigues Junior) and Jure
Leskovec)
 the
'cross-association' package for graph partitioning.
 Outside CMU:
 GiST package from
Hellerstein at UC Berkeley: a general spatial access method, which
is easy to customize. It is already customized to yield
R-trees.
BIBLIOGRAPHICAL RESOURCES:
Last modified Jan. 19, 2009, by Christos Faloutsos.