Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2010 - C. Faloutsos
List of suggested projects
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.S10/CMU-ONLY/projlist.html
- Feel free to propose projects outside this list, as long
as they have to do with mining and indexing large datasets.
In that case, contact the instructor as early as possible.
- An asterisk [*] in the project title signifies that this
project is related to the PhD dissertation of the contact person. A
cross [+] means that this
is a group project, with several potential collaborators. Feel free
to consider non-asterisked projects, too, if they are related to
your interests or your dissertation.
- Please form groups of 3-4
people.
- Please check the 'blackboard' system, where we
will create one thread for each of the projects below. Please
indicate your interest, by posting in the appropriate thread(s), so
that you can find partners.
SUGGESTED TOPICS
0. HADOOP AND PARALLELISM
The projects below are mainly designed for a traditional,
single-machine architecture. However, 'hadoop' allows relatively
easy parallel execution, implementing the map-reduce
system of Google [Dean + Ghemawat, OSDI'04]. 'Hadoop' is
open source; we have a small cluster where we can give you an
account, or we can give you access to a 50-node hadoop cluster at
INTEL-Pittsburgh, and maybe access to the 'M45' of Yahoo (1000
machines, 4 cores each, 1TB total RAM and over 3PB of storage - see
the press release at Yahoo,
Scientific American, etc). You are welcome to try any of these
projects below, on a hadoop cluster.
1. SPATIO/TEMPORAL AND STREAM MINING
1.1 [*] Automating BGP-anomaly detection
- Problem: Find interesting patterns and/or anomalies
given a 2-year archive of BGP (Border Gateway Protocol) update
messages between routers. Provide a monitoring tool, that we could
deploy. The findings would be relevant to the network
administrators as well. Finding such anomalies will go a long way
in automating monitoring of routers and helping catch major
problems. We have developed a tool, 'BGP-lens', in MATLAB
at CMU, which can find 'clotheslines' (IPs sending a persistent,
near-constant number of updates over a long period of time) and
'prolonged spikes' (IPs sending a short high-burst of updates -
probably relating to some malfunction/event). For this we use an
aggregated form of the update data - number of updates per 600s
etc. Note the data has millions of updates - so straightforward
methods don't work. The project can be sub-divided into many
interesting paths:
- Studying the effect of parameters and thresholds on the discovery
of clotheslines and prolonged spikes. For example, clothesline
discovery relies on moving-window median filtering; how does
the window size affect the algorithm?
- We want to deploy such
a tool for the admins to use, but we would need an online,
incremental version of the algorithms, so that the tool can quickly
process incoming update data. Also, this should be done in a
scripting language like Perl/Python/Ruby, since they are lightweight
and not everyone can afford MATLAB :).
- Also, a BGP-lens with a GUI would be more easily adopted. A nice
project would be to develop a visualization package for it. Note that
you would have to deal with representing very large time series, and
the GUI should provide sensitivity knobs (suitable parameters in the
tool's algorithms) for BGP-lens, so that events at different time
scales can be identified.
- The algorithms used in BGP-lens are more general. Hence,
one can study where else such methods can be employed
(specifically, on other datasets). Can the methods be used as-is,
or do we need to tweak/change the algorithms?
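To make the window-size question concrete, here is a minimal, pure-Python sketch of clothesline-style detection via moving-window median filtering; the window size, flatness tolerance and minimum run length below are illustrative parameters, not the actual BGP-lens settings.

```python
# Sketch: moving-window median filtering over an aggregated update-count
# series, then a search for long near-constant runs -- the 'clothesline'
# signature. All thresholds are illustrative.

def median_filter(series, window):
    """Return the moving-window median of a list of numbers."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sorted(series[lo:hi])[(hi - lo) // 2])
    return out

def flat_runs(series, window=5, tolerance=1.0, min_length=10):
    """Find long runs where the filtered series stays near-constant."""
    smooth = median_filter(series, window)
    runs, start = [], 0
    for i in range(1, len(smooth) + 1):
        if i == len(smooth) or abs(smooth[i] - smooth[start]) > tolerance:
            if i - start >= min_length:
                runs.append((start, i))
            start = i
    return runs

# Toy series: a spike, then a persistent near-constant level of updates.
counts = [1, 50, 2, 1, 3] + [10] * 20 + [2, 1]
print(flat_runs(counts))  # -> [(5, 26)]
```

Varying `window`, `tolerance` and `min_length` on real aggregated update counts is exactly the sensitivity study suggested above.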
- Data: We shall use BGP
router data from the Abilene network, a research network, over a
period of 2 years. It can be seen at the Datapository project. Check out
the BGP-Monitor there: you can run some queries too. A snippet can
be downloaded from (CMU only) here,
it contains raw and aggregated updates.
- Introductory paper(s):
[Prakash+,
KDD'09] (or here, for an
earlier, more detailed version)
- Comments: Very high
practical interest, with good problems from both the algorithmic as
well as the system side. Also nice visualization challenges. There
is a lot of room ~ 3-4 people.
- Contact person:
B. Aditya
Prakash
1.2. Disk access
traffic patterns, and the Self-* project
- Problem: Given traces
from real workstations (tuples of the form <disk-id, track-id,
R/W-flag, timestamp>), find patterns; do predictions; use them
to design better buffering and prefetching algorithms. Try 'blind
signal separation'/ICA, to distinguish between 'reads' and
'writes', or between interactive and database accesses. A related
problem is to forecast the response time for a given disk request,
given a training set with their response times. The main problem is
to extract good features. In fact, this is a small part of the
Self-* project,
which also has numerous co-evolving time sequences from a prototype
data center with multiple 'intelligent' storage units: cpu
utilizations, network traffic measurements, room temperature sensor
measurements, humidity measurements, etc. The goal is to find
patterns, correlations, lag-correlations, anomalies, to help the
data center self-organize, self-detect upcoming (or existing)
failures and attacks, to self-optimize its performance.
- Data: See the
'Disk Access Traces' below.
Also, the web site of the Self-* project, with a lot
of measurement data, that we already have. We also have traces from
an MS SQL Server.
- Introductory
paper(s): the 'PQRS' model [Wang et
al, PEVA 2001]; see also the use of CART [Wang
et al SIGMETRICS 2004] and the follow-up work [Mesnier+,
'05]. Check the SPIRIT
project,
and the corresponding publications (VLDB06,
OSR06) on
Jimeng's page under
'InteMon'. Also the lag-correlation paper [Sakurai+
SIGMOD'05], and the DynaMMo method (Kalman filters for
missing values [ Li+ KDD'09
]).
- Comments: The general
case is hard, and in fact, is the topic of dissertations
(Dr. Mengzhi
Wang,
Dr. Jimeng Sun).
However, there are a lot of initial ideas that you could try within
a semester, and a lot of
industrial interest in the topic. One idea is to use
multi-resolution analysis, like the AWSOM paper [Papadimitriou+,
VLDB 2003]
- Contact person: Lei Li
1.3. Astrophysics data mining
- Problem: Develop
algorithms like 'friends-of-friends', for Tb astrophysics data. We
have galaxy data as (x,y,z) triplets and we want to extract
statistics, like number of pairs of neighbors within epsilon;
characteristic lengths (eg., average diameter of galaxy clusters),
etc. The main idea is to use hadoop, to process such large amounts
of data.
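As a starting point, the pairs-within-epsilon statistic can be sketched with a uniform grid of cell side epsilon, which avoids the all-pairs comparison; on hadoop, the grid cells would be sharded across mappers. The toy points below are illustrative.

```python
# Sketch: counting galaxy pairs within distance epsilon using a uniform
# grid -- the building block of 2-point correlation estimates. Each point
# is only compared against its own cell and the 26 neighboring cells.
from collections import defaultdict
from itertools import product

def count_pairs(points, eps):
    """Count unordered pairs of (x,y,z) points at Euclidean distance <= eps."""
    grid = defaultdict(list)
    for p in points:
        cell = tuple(int(c // eps) for c in p)
        grid[cell].append(p)
    eps2 = eps * eps
    pairs = 0
    for cell, members in grid.items():
        # To avoid double counting, only look at neighbor cells that are
        # lexicographically >= the current cell.
        for d in product((-1, 0, 1), repeat=3):
            ncell = tuple(c + o for c, o in zip(cell, d))
            if ncell < cell or ncell not in grid:
                continue
            others = grid[ncell]
            for i, p in enumerate(members):
                start = i + 1 if ncell == cell else 0
                for q in others[start:]:
                    if sum((a - b) ** 2 for a, b in zip(p, q)) <= eps2:
                        pairs += 1
    return pairs

pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (3.0, 3.0, 3.0), (3.2, 3.0, 3.0)]
print(count_pairs(pts, 1.0))  # -> 2 (two close pairs)
```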
- Data: Sloan Digital
Sky survey (SDSS); synthetic data (200MB compressed; here is a snippet)
- Introductory papers:
Fractal dimension estimations [Belussi+ VLDB'95], spatial join
estimations [
SIGMOD00 ]; 2-point and n-point correlation functions [Gray+Moore,
NIPS00]
- Comments: A lot of
interest recently, with the McWilliams Center for
Cosmology at CMU. The goal is to try several cosmology
theories, generate through simulation a 'universe' according to
each theory, and reject theories whose 'universe' does not match
the statistical properties of the real universe. Our challenges are
(a) to compute the statistics that astrophysicists prefer, quickly,
on billions and trillions of particles (galaxies/stars) and (b) to
propose additional statistical measures.
- Contact person: Bin Fu; Robson Cordeiro
2. HADOOP AND LARGE GRAPH MINING
2.1. [*] Large/parallel graph mining, possibly using
'hadoop'
- Problem: Given a large
graph with billions of edges and tens of billions of nodes, and
several share-nothing machines, parallelize the typical graph
mining algorithms, to be as fast as you can. We want to
compute the in- and out-degree distributions, the diameter of the
graph, the first several eigenvalues, the 'network value' of each
node, the 'clustering coefficient', the node- and edge-betweenness.
The diameter and the connected components have been done by Mr. U
Kang (contact person), but even there, there is room for
optimizations.
- The first step is to do timing of several possible
architectures: with, or without a relational DBMS; with, or without
replication of the data; using the PIG system;
using 'hbase'
- Also, what is the best way to store the data? E.g., as
<from,to> pairs in a flat file; as an adjacency list, hashed
on the 'from' node-id; or as something else.
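For intuition, the degree-distribution computation over <from,to> pairs boils down to two grouping passes, which map directly onto two map-reduce rounds; a single-machine sketch on a toy edge list:

```python
# Sketch: out-degree distribution from <from,to> edge pairs. The two
# grouping steps mirror two map-reduce rounds: (1) count edges per
# source node, (2) count nodes per degree value.
from collections import Counter

edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 1)]

out_degree = Counter(src for src, dst in edges)          # round 1
degree_distribution = Counter(out_degree.values())       # round 2

print(sorted(degree_distribution.items()))  # -> [(1, 3), (3, 1)]
```

On hadoop, each `Counter` becomes a map (emit key) plus reduce (sum) pass over the flat edge file.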
- Data: We shall start
with synthetic data, using an existing generator [Leskovec+,
PAKDD'05]. Then, DBLP, IMDB etc. We could also get data on real
CMU IP traffic (will need NDA). Finally, we also have a
who-talks-to-whom social network with 270 million nodes and 8
billion edges (60Gb of data)
- Introductory paper(s):
The generator above; the Gamma database machine papers [Dewitt+,
IEEE TKDE'90]; papers on hash-joins [Kitsuregawa+,
VLDB'90]; the RMAT paper [Chakrabarti+
SIAM-DM'04], the connection sub-graph paper [Faloutsos+,
KDD'04]. If you plan to use 'hadoop', get the map-reduce paper
[Dean + Ghemawat, OSDI'04] and the documentation about the
add-ons to hadoop, PIG and hbase.
- Comments: Very high
practical interest, with hard problems from both the algorithmic as
well as the system side. There is a lot of room, even for 4 or
more people.
- Contact persons:
U Kang
3. GRAPHS - PATTERNS, OUTLIERS AND GENERATORS
3.1. [*] Anomaly detection in weighted graphs
- Problem: Given
a graph data set that grows over time, with weights on edges, how
can we find anomalous/ interesting/extreme nodes/edges at a given
time snapshot? What kind of features would be the most informative
for unusual-behavior detection? How can we generalize this idea to
time-evolving graphs, to track anomalous nodes/edges?
- Data: Enron
emails, FEC campaigns, DBLP, and any weighted data set with
possible interesting nodes you might have.
- Introductory papers:
- Caleb C. Noble and Diane J. Cook. Graph-based anomaly
detection. In KDD, pages 631–636, 2003.
- William Eberle and Lawrence B. Holder. Discovering structural
anomalies in graph-based data. In ICDM Workshops, pages
393–398, 2007
-
Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, Christos Faloutsos.
Neighborhood Formation and Anomaly Detection
in Bipartite Graphs,
Proc. ICDM, pp. 418-425, Houston, Texas, Nov 27-30, 2005
- William Eberle and Lawrence B. Holder. Detecting anomalies in
cargo shipments using graph properties. In ISI, pages
728–730, 2006.
- J. Shetty and J. Adibi. Discovering important nodes through graph
entropy: the case of the Enron email database. In Proceedings of
the 3rd International Workshop on Link Discovery (at KDD), pages
74-81, 2005.
- Anomaly detection in graphs: Oddball paper, by Leman Akoglu et al (to appear, PAKDD'10)
- Comments: One
challenge is that some graphs are so large that they do not fit in
main memory. What pre-processing/filtering/sampling techniques can
be helpful to reduce feature-extraction time, so that not all the
nodes/edges need to be processed?
- Contact Person:
Leman
Akoglu
3.2. [*] Patterns and ``laws'' in weighted graphs
- Problem: How
can we model weighted graphs (for example, with network packets
flowing between nodes) for future prediction? Is there any pattern
concerning the weights? What kind of patterns would we expect? How are
the weights distributed on the incident edges of a given node? For
a given edge, do weight arrivals show any interesting behavior
besides being bursty? How can we do a "microscopic" analysis, so
as to model a given weighted graph over time?
- Data: Network
traffic, Campaign donations
- Introductory
papers:
- Mary McGlohon, Leman Akoglu, and Christos Faloutsos.
Weighted graphs and disconnected components: Patterns and a
model. In ACM SIG-KDD, Las Vegas, Nev., USA, August 2008.
- Leman Akoglu, Mary McGlohon, and Christos Faloutsos. RTM: Laws
and a recursive generator for weighted time-evolving graphs. In
ICDM: International Conference on Data Mining, Pisa, Italy,
December 2008.
- Microscopic
Evolution of Social Networks Jure Leskovec, Lars
Backstrom, Ravi Kumar, Andrew Tomkins. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (ACM KDD), 2008.
- For a good survey on link prediction, see Liben-Nowell
and Kleinberg 2003
- [please keep CONFIDENTIAL] Phonecall patterns - reciprocity : Leman's submission on reciprocity (if I call you 50 times, how many times you called me back?)
- [please keep CONFIDENTIAL] Phonecall patterns - duration: Pedro's submission (if a person did 20 phonecalls, what can you say about their durations?)
- Comments: Ideas
can be borrowed from many graphs generators in the literature. The
major part is to work with weights. Contact the instructor or
Leman, for an idea about a new graph generator, using 'monkeys on a
typewriter' approach.
- Contact Person:
Leman
Akoglu
3.3. [*] Model fitting (for Kronecker and RTG)
- Problem: Given a real
graph G, how can we generate another graph that looks like it?
The Kronecker graph model is such a generator. The existing approach to
fitting Kronecker graphs relies on Maximum Likelihood; it is
successful, but slow. Can we find an algorithm that generates such
graphs with the same success, but faster? One idea is to use the SVD,
or the spectra of graphs, to come up with such an algorithm.
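For intuition about the model itself, a deterministic Kronecker graph is simply the k-th Kronecker power of a small seed adjacency matrix; a minimal sketch, where the 2x2 seed is illustrative (fitting would choose the seed so that the result matches the given graph G):

```python
# Sketch: deterministic Kronecker graph construction, as the k-th
# Kronecker power of a small seed adjacency matrix (0/1 entries).

def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m]
             for j in range(n * m)] for i in range(n * m)]

def kronecker_power(seed, k):
    G = seed
    for _ in range(k - 1):
        G = kron(G, seed)
    return G

seed = [[1, 1], [1, 0]]           # illustrative 2x2 seed with 3 ones
G = kronecker_power(seed, 3)      # 8x8 adjacency matrix
print(len(G), sum(map(sum, G)))   # -> 8 27  (2^3 nodes, 3^3 ones)
```

The edge count multiplies at each power (here 3^k ones), which is why small seeds already yield realistic heavy-tailed graphs.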
- Data: Several real
graphs (Epinions, Oregon AS, Flickr etc)
- Introductory
papers:
- Comments: Requires good
knowledge of linear algebra.
- Contact
persons: Leman
Akoglu
3.4 `PaC' model for graph generation
- Problem: Augment the
'pay and call' model [Du+ KDD09], to make it match more patterns in
real networks
- Data: The usual graph
data; also, confidential who-calls-whom data (but needs NDA)
- Introductory papers:
[Du+
KDD09]
- Comments: The next
goal is to make the inter-arrival times of 'phonecalls' more
realistic
- Contact person:
Instructor
4. BLOGS AND INFLUENCE PROPAGATION
4.1. [*] Cascades and Network Topology
- Problem: How does the popularity of a 'fad' grow
and drop over time? Exponentially, or like a power law? Does its
shape depend on the topology of the network? If yes, does it depend
on the average degree / diameter / or something else? We understand
that information and "fads" seem to follow an epidemic-type
pattern, and that the shape of the "cascades" follow patterns.
Using snapshots of real graphs, simulate information ('fads'/
viruses) traveling across a network using SIS/SIR infection
models and note what "cascades" are formed. Then, run experiments
on the same graph with some edges or nodes removed randomly
(simulating immunization or quarantine). How do the cascades change
(in size, shape, etc)? What if instead of removing nodes or edges
randomly, we remove them according to some rule (nodes with highest
degree, edges with highest weight, edges with weight less than
k...)?
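A minimal sketch of one such simulation, assuming an SIR model with illustrative infection/recovery probabilities beta and delta; re-running it after deleting nodes or edges lets you compare cascade sizes under different 'immunization' rules.

```python
# Sketch: one SIR cascade on an undirected graph given as an adjacency
# dict. beta (infection prob. per contact) and delta (recovery prob.
# per step) are illustrative parameters.
import random

def sir_cascade(adj, seed, beta=0.5, delta=0.3, rng=None):
    """Simulate SIR from one seed node; return the set of ever-infected nodes."""
    rng = rng or random.Random(0)
    infected, recovered = {seed}, set()
    ever = {seed}
    while infected:
        new_inf = set()
        for u in infected:
            for v in adj.get(u, ()):
                if v not in ever and rng.random() < beta:
                    new_inf.add(v)
                    ever.add(v)
        recovered |= {u for u in infected if rng.random() < delta}
        infected = (infected | new_inf) - recovered
    return ever

# Toy chain: removing the middle node would quarantine the right half.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(len(sir_cascade(adj, seed=0)))
```

Averaging the cascade size over many random seeds, before and after node/edge removal, gives the comparison the project asks for.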
- Data: Any graph data, ideally weighted (for certain
experiments), see "Graph-like data" below. Check the memetracker web site at Cornell,
and download its dataset.
- Introductory papers: Hethcote
(for SIS/SIR), Leskovec+PAKDD06
(cascade algs), Leskovec+SDM07
and McGlohon+ICWSM07(cascades
in blogs). Also, the memetracker paper
[KDD'09]
- Comments: Alternatively, instead of modifying real
graphs, pick a number of different real (blogs, academic citations,
network traffic) and synthetic (preferential attachment,
Erdos-Renyi, small-world) networks and compare the cascades formed.
How do the cascades vary, and what graph properties yield what
cascade shapes?
- Contact: Aditya Prakash
5. GRAPH ANALYSIS TOOLS AND VISUALIZATION
5.1. [*] Large Graph Visualization
- Problem: Given a huge graph (say, millions of nodes),
help visualize it. Start with the ICML03 paper below; then, try to
extend it for huge graphs: try some partitioning/grouping method,
and/or some fish-eye ideas.
- Data: epinions.com; DBLP citation information,
etc
- Introductory paper(s): Check the paper on GMine. Also [Takeshi
Yamada, Kazumi Saito, Naonori Ueda:
Cross-Entropy Directed Embedding of Network Data.
ICML'03]. Also, the visualization tools
from CAIDA, and specifically 'walrus'.
Check the graphdrawing
organization and the corresponding sequence of conferences
on Graph Drawing (GD) from there. Our goal here is different,
though: we want to visualize large graphs that don't fit in memory,
nor on the screen, nor in the human mind, unless we summarize them
somehow.
- Comments: Open-ended problem, but very useful
- could lead to publication. The first step could be to
implement the method by [Yamada et al].
- Contact person:
Polo Chau
5.2. [*] Fast implementations of RWR (for gCap)
- Problem: In the 'gCap'
paper (see below), and in the Drosophila Embryo project below, we
need to compute the steady-state probability of each node in a
random walk with restarts, when the restarting node is unknown.
Thus, naively, we need n^2 steady-state
probabilities. How can we do better than that? How can we save
computation, if, for a given node i, we only want the top 10 closest
nodes and their scores?
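For intuition, here is a minimal power-iteration sketch of RWR from a single restart node, plus the top-k ranking; the restart probability c and the toy graph are illustrative, and the fast algorithms cited in the papers avoid exactly this per-node iteration.

```python
# Sketch: random walk with restart (RWR) from one node via power
# iteration. adj maps each node to its list of neighbors; every node
# is assumed to have at least one neighbor.

def rwr(adj, restart, c=0.15, iters=100):
    """Return the steady-state probability of each node."""
    nodes = list(adj)
    p = {u: 0.0 for u in nodes}
    p[restart] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            share = (1 - c) * p[u] / len(adj[u])  # spread mass to neighbors
            for v in adj[u]:
                nxt[v] += share
        nxt[restart] += c                          # restart mass
        p = nxt
    return p

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
scores = rwr(adj, restart=0)
top = sorted(scores, key=scores.get, reverse=True)
print(top)  # restart node itself ranks first
```

Repeating this for every restart node is the naive n^2 computation mentioned above; keeping only the top-10 scores per node is what the project aims to speed up.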
- Data: DBLP, Corel
image data, and more.
- Introductory paper(s): Probably the fastest algorithm so
far is by Hanghang Tong, in ICDM'06.
The goal is to make it even faster, and/or to implement it in
C/C++, or 'hadoop', so that it can run on huge graphs
(Gb-size). Related papers: gCap [Pan
et al, KDD'04]; topic sensitive PageRank [Haveliwala WWW'02] ; fast
algorithms for topic sensitive PageRank [Haveliwala+ '03]; graph partitioning
[Sun+,
'05].
- Comments: Mainly,
implementation - but it has room for innovation. Closely related to
the dissertation of Hanghang Tong, who will help along. Also, it
could be used immediately by the Drosophila Embryo project
below.
- Contact person:
Hanghang Tong.
5.3. 'NetFlix' competition: Collaborative Filtering and link
prediction with side information
- Problem: Given a user-movie rating matrix with
many missing entries, how can we predict the missing ratings? Here, we
want to investigate how to incorporate side-information (such
as user/movie attributes, e.g., job title or movie genre) to improve
the prediction accuracy.
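As a baseline (before any side-information), the latent-factor approach behind the leading Netflix entries can be sketched as plain matrix factorization trained by stochastic gradient descent; all hyper-parameters and the toy ratings below are illustrative.

```python
# Sketch: rating prediction via latent-factor matrix factorization,
# trained with SGD. Side-information would enter as extra bias terms
# or via collective factorization (see the papers below).
import random

def factorize(ratings, n_users, n_items, k=2, steps=500, lr=0.02, reg=0.02):
    """ratings: list of (user, item, rating). Returns factor matrices U, V."""
    rng = random.Random(0)
    U = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)  # gradient step on U
                V[i][f] += lr * (err * uf - reg * vf)  # gradient step on V
    return U, V

# Toy 3-user x 3-movie matrix; the (user 2, movie 2) entry is missing.
ratings = [(0, 0, 5), (0, 1, 3), (0, 2, 4), (1, 0, 5), (1, 1, 3),
           (2, 0, 1), (2, 1, 5)]
U, V = factorize(ratings, 3, 3)
missing = sum(U[2][f] * V[2][f] for f in range(2))
print(round(missing, 2))  # predicted rating for the missing cell
```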
- Data sets: Netflix data set, Movie-lens data
set
- Introductory papers: the papers by the leading
team of Netflix competition [Koren et
al KDD’08, Bell et
al KDD’07]; the collective matrix factorization paper
(one possible way to deal with the side-information) [Singh et
al KDD’08].
- Comments: This is
the well publicized, $1M Netflix Prize competition -
visit that site for papers, data etc.
- Contact person:
Hanghang
Tong
5.4. Graph similarity, summarization and
approximation.
- Problem: Given a large graph, how can we find
patterns (e.g., communities and anomalies) in an intuitive and
efficient way? How can we track a pattern of interest if the graph
is evolving over time? The problem of graph similarity is subtly
related: given two graphs, how similar are they? The number of
edges in which they differ is not necessarily a good measure. One way
to attack the approximation problem is through example-based
low-rank approximation of the adjacency matrix of the graph.
- Data set: DBLP data set; Network Traffic Data
set.
- Introductory papers: CUR paper [Drineas
et al SIAM’05]; Colibri paper [Tong et al,
KDD’08]; Non-negative CUR paper [Hyvonen et al
KDD’08]
- Comments: The problem of graph similarity seems
easy but it is very subtle.
- Contact person: Hanghang
Tong
6. MULTIMEDIA - BIOLOGICAL AND MEDICAL IMAGES
6.1. [*+] Visualization, Summarization and Mining of Drosophila
Embryo Images
- Problem: Given a set
of annotated 2D images of Drosophila embryos (352x160 gray scale),
one of the problems is to help biologists clean up the dataset (some
images are very noisy). What you can do is (a) implement
`multi-dimensional scaling' (MDS), to help plot our images on the
screen and hopefully spot outliers, and (b) design a system to help
us summarize a collection of images (say, by finding clusters and
reporting the 'typical' image in each cluster). The ultimate,
50-year-horizon goal is to find how genes affect each other in the
early stages of life in Drosophila (and help us extrapolate about
human genes).
- Data: Check the
BDGP site
here ; we also have preprocessed data available in which low
quality images were removed and fly embryos were already scaled and
aligned.
- Introductory paper(s):
FEMine (Our previous work published in KDD'06, details on data
preprocessing; baseline for feature extraction algorithm design);
Zhou and Peng, 2007 , and Peng
et al, 2007 (Recent work on automatic fly embryo image
analysis); also
Tomancak et al, 2002 (Papers by the BDGP group; Good references
if you want to know more about the dataset)
- Comments: The dataset
contains more than 10k images; you may start from a smaller subset,
and write your own algorithm to do further data cleaning.
- Contact Persons:
Fan Guo and Lei Li
6.2. [+] Multimodal tensor analysis for fMRI brain scans
- Problem: Getting
the best of both worlds (tensors and wavelets) seems to be a
promising way to handle multidimensional time series. In this
problem, we want to perform a multimodal analysis in fMRI scans
from eleven (11) subjects that perform four (4) different tasks.
Therefore we have an 11x4xXxYxZxT tensor where the last dimension
is the time aspect. The goal is to find patterns in such a dataset
(like, eg., `left-handed people have more activation in their right
part of their brain')
- Data: fMRI brain scans
from Temple University
- Introductory
papers:
- Comments: In
collaboration with Temple University (Michael Barnathan, Prof.
Vasilis Megalooikonomou)
- Contact person:
Instructor
DATASETS
Unless explicitly mentioned, the datasets are either
'public' or 'owned' by the instructor; for the rest, we need to
discuss 'non-disclosure agreements' (NDAs).
Time sequences
- Time series
repository at UCR.
- KURSK
dataset of multiple time sequences: time series from
seismological sensors by the explosion site of the 'Kursk'
submarine.
- Truck traffic data, from our Civil Engineering
Department. Number of trucks, weight etc per day per highway-lane.
Find patterns, outliers; do data cleansing.
- River-level / hydrology data: multiple,
correlated time series. Do data cleansing; find correlations
between these series. Excellent project for people that like
canoeing!
- Sunspots:
number of sunspots per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
- Time sequences from the
Santa Fe Institute forecasting competition (financial data,
laser-beam oscillation data, patients' apnea data etc)
- Disk access
traces, from HP Labs (we have local copies at CMU).
For each disk access, we have the timestamp, the block-id, and the
type ('read'/'write'). Here is a
snippet of the data, aggregated per 30 minutes.
- Network traffic data from datapository.net at CMU
- Motion-capture data from CMU mocap.cmu.edu
Spatial data
- Astrophysics data - thousands of galaxies, with
coordinates, red-shift, spectra, photographs.
Small snippet of the data. More data are in the 'skyserver' web
site, where you can ask SQL queries and
get data in html or csv format
- Synthetic astrophysics data: 1K of (x,y,z, weight) tuples, from Prof. Rupert Croft (CMU). The full dataset is 200Mb compressed - contact instructor.
- Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc).
Snippet of data (roads from California, from TIGER).
Images/video
-
Biological data: images of proteins, with ~50 attributes
each.
- 'Owner': Prof. Bob Murphy.
- Video/image/sound data, from Informedia. 2Tb of video,
segmented; 1M images with features; 10^4 faces. Extract features;
design good similarity functions; do the named-entity
analysis.
Graph data
- Web-log and click-stream data (NDA: needed).
- Snapshots of 2 anonymized (and anonymous) social networks
(NDA)
- Visit patterns for a large web site: for 300 pages, and
thousands of users, we record how many times a user visited a
specific site. Find patterns, clusters, fractal dimensions,
regularities in the SVD etc.
- Netflix competition
dataset (users, movies and ratings) - needs easy registration - we
have a cleaned-up version available, locally.
- Enron email
dataset (400 MB compressed)
- Movie-actor data from imdb.com (we have a cleaned-up snapshot
of it)
- DBLP author-paper-conference data from the DBLP site of Mike
Ley (records in XML,
and their
DTD). For 'ego-surfing', try this java app or the
java applet
at U. Alberta.
- Graph
datasets at U.Mass (Amherst), by Prof. Dave Jensen.
- More graph datasets
from Mark
Newman (U. Michigan) - including popular test-beds like the
Zachary's karate club social network etc.
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.
- Readily available:
- ACCESS METHODS
-
DR-tree : R-tree code; searches for range and nearest-neighbor
queries. In C.
-
kd-tree code
- OMNI
trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle
numbers, too). In C.
- SVD AND TENSORS:
- FRACTALS
- GRAPHS
- the
NetMine network topology analysis package
- GMine:
interactive graph visualization package and graph manipulation
library (by Junio (Jose Fernandez Rodrigues Junior) and Jure
Leskovec)
- the '
crossAssociation' package for graph partitioning.
- the PEGASUS
package for graph mining on hadoop.
- Outside CMU:
- GiST package from
Hellerstein at UC Berkeley: A general spatial access method, which
is easy to customize. It is already customized to yield
R-trees.
- hadoop, PIG and hbase
- pajek,
jung, graphviz, guess, for (small)
graph visualization
BIBLIOGRAPHICAL RESOURCES:
Last modified Feb. 9, 2010, by Christos Faloutsos.