Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Fall 2011 - C. Faloutsos
List of suggested projects
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.F11/CMU-ONLY/projlist.html
- Feel free to propose projects outside this list, as long
as they have to do with mining and indexing large datasets.
In that case, contact the instructor as early as possible.
- A [P] in the project title signifies that this project is
related to the PhD dissertation of the contact person.
- Please form groups of 3-4 people.
- Please check the 'blackboard' system, where we
will create one thread for each of the projects below. Please
indicate your interest, by posting in the appropriate thread(s), so
that you can find partners.
SUGGESTED TOPICS
0. HADOOP AND PARALLELISM
The projects below are mainly designed for a traditional,
single-machine architecture. However, 'hadoop' allows relatively
easy parallel execution, implementing the map-reduce
system of Google [Dean + Ghemawat, OSDI'04]. 'Hadoop' is
open source; we have a small cluster where we can give you an
account, or make some other arrangement.
1. HADOOP AND LARGE GRAPH MINING
1.1. [P] Large/parallel graph mining, possibly using
'hadoop'
- Problem: Given a large
graph with billions of edges and tens of billions of nodes, and
several shared-nothing machines, parallelize the typical graph
mining algorithms, to run as fast as possible. Our 'pegasus' system
already computes the in- and out-degree distributions, the diameter
of the graph, and the first several eigenvalues, and runs on top of
hadoop.
- The first step is to time several possible architectures: with
or without a relational DBMS; with or without replication of the
data; using the PIG system; using 'hbase'.
- Also, what is the best way to store the data (e.g., as
<from,to> pairs in a flat file; as an adjacency list, hashed
on the 'from' node-id; or as something else)?
- Data: We shall start
with synthetic data, using an existing generator [Leskovec+,
PAKDD'05], and then move to DBLP, IMDB etc. We could also get data
on real CMU IP traffic (will need NDA). Finally, we also have a
who-talks-to-whom social network with 270 million nodes and 8
billion edges (60Gb of data).
- Introductory paper(s):
The generator above; the Gamma database machine papers [Dewitt+,
IEEE TKDE'90]; papers on hash-joins [Kitsuregawa+,
vldb'90]; the RMAT paper [Chakrabarti+
SIAM-DM'04]; the connection sub-graph paper [Faloutsos+,
KDD'04]. If you plan to use 'hadoop', get the map-reduce paper
[Dean + Ghemawat, OSDI'04] and the documentation about the
add-ons to hadoop, PIG and hbase.
- Comments: Very high
practical interest, with hard problems on both the algorithmic and
the systems side. There is a lot of room, even for 4 or
more people.
- Contact person: U Kang
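The two storage layouts named above (flat <from,to> pairs versus an
adjacency list keyed on the 'from' node-id) can be contrasted with a
small single-machine sketch. The function names and the toy edge list
are hypothetical and are not part of 'pegasus' or hadoop; the sketch
only illustrates how degree distributions can be read off each layout.

```python
from collections import Counter, defaultdict

def to_adjacency(edge_pairs):
    """Group flat (from, to) pairs into {from: [to, ...]}, keyed on the 'from' node-id."""
    adjacency = defaultdict(list)
    for src, dst in edge_pairs:
        adjacency[src].append(dst)
    return dict(adjacency)

def out_degree_distribution(adjacency):
    """Return {out_degree: number of nodes}; nodes with out-degree 0 are not listed."""
    return dict(Counter(len(dsts) for dsts in adjacency.values()))

def in_degree_distribution(edge_pairs):
    """Return {in_degree: number of nodes}, computed directly from the flat pairs."""
    per_node = Counter(dst for _, dst in edge_pairs)
    return dict(Counter(per_node.values()))

# Toy edge list, made up for illustration.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 1)]
adjacency = to_adjacency(edges)
print(adjacency)
print(out_degree_distribution(adjacency))
print(in_degree_distribution(edges))
```

On a cluster, the grouping step would correspond roughly to the
shuffle of a map-reduce job keyed on the 'from' node-id; whether that
is the fastest layout is exactly what the timing experiments should
decide.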
1.2. [P] Anomaly detection in weighted and/or attributed
graphs
- Problem: Given
a graph data set, with weights on edges, how can we find anomalous/
interesting/extreme nodes/edges? What kind of features would be the
most informative for unusual behavior detection? How can we
generalize this idea to time-evolving graphs, to track anomalous
nodes/edges? The same questions are of interest when the nodes
have attributes (e.g., Facebook users, with gender, age,
political-leaning, etc).
- Data: Enron
emails, FEC campaigns, DBLP, and any weighted data set you might
have that is likely to contain interesting nodes.
- Introductory papers:
- Caleb C. Noble and Diane J. Cook. Graph-based anomaly
detection. In KDD, pages 631–636, 2003.
- William Eberle and Lawrence B. Holder. Discovering structural
anomalies in graph-based data. In ICDM Workshops, pages
393–398, 2007
- Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, Christos Faloutsos.
Neighborhood Formation and Anomaly Detection in Bipartite Graphs.
Proc. ICDM, pp. 418-425, Houston, Texas, Nov 27-30, 2005.
- William Eberle and Lawrence B. Holder. Detecting anomalies in
cargo shipments using graph properties. In ISI, pages
728–730, 2006.
- J. Shetty and J. Adibi. Discovering important nodes through graph
entropy: The case of Enron email database. In KDD, Proceedings of
the 3rd International Workshop on Link Discovery, pages
74–81, 2005.
- Anomaly detection in graphs: Oddball
paper, by Leman Akoglu et al (PAKDD'10)
- Comments: One
challenge is that some graphs are so large that they do not fit in
main memory. What pre-processing/filtering/sampling techniques can
help reduce feature extraction time, so that not all the
nodes/edges need to be processed in the end?
- Contact person: Leman Akoglu, instructor.
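As one possible starting point for the feature question above, the
sketch below computes a few per-node 'egonet' features (number of
neighbors, number of edges among the node and its neighbors, total
weight of those edges) for an undirected weighted graph. The choice of
features, the function name and the toy graph are assumptions made for
illustration, not a prescribed method; the loop is quadratic and only
meant for small graphs.

```python
from collections import defaultdict

def egonet_features(weighted_edges):
    """For each node return (num_neighbors, num_egonet_edges, egonet_weight).

    The egonet of a node is the node, its neighbors, and all edges among them.
    weighted_edges: iterable of (u, v, weight) for an undirected simple graph.
    """
    neighbors = defaultdict(set)
    weight = {}
    for u, v, w in weighted_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
        weight[frozenset((u, v))] = w
    features = {}
    for node, nbrs in neighbors.items():
        members = nbrs | {node}
        # An edge belongs to the egonet if both endpoints are members.
        ego_edges = [e for e in weight if e <= members]
        features[node] = (len(nbrs), len(ego_edges),
                          sum(weight[e] for e in ego_edges))
    return features

# Toy graph, made up for illustration: a triangle a-b-c plus one heavy edge a-d.
toy_edges = [("a", "b", 1.0), ("a", "c", 2.0), ("b", "c", 1.0), ("a", "d", 10.0)]
toy_features = egonet_features(toy_edges)
for node, feats in sorted(toy_features.items()):
    print(node, feats)
```

Nodes whose feature tuples sit far from the bulk (here, node 'd' has a
single edge carrying most of its egonet weight) are candidates for a
closer look.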
1.3. [P] Weighted graphs over time
- Problem: How
can we model weighted graphs - for example, with network packets
flowing between nodes - for future prediction? Is there any pattern
with respect to the weights? What kind of patterns would we expect?
How are the weights distributed on the incident edges of a given
node? For a given edge, do weight arrivals show any interesting
behavior besides being bursty? How can we do a "microscopic"
analysis, so as to model a given weighted graph over time?
- Data: call-graph (needs NDA), campaign donations
- Introductory papers:
- Mary McGlohon, Leman Akoglu, and Christos Faloutsos.
Weighted graphs and disconnected components: Patterns and a
model. In ACM SIG-KDD, Las Vegas, Nev., USA, August 2008.
- Leman Akoglu, Mary McGlohon, and Christos Faloutsos. RTM: Laws
and a recursive generator for weighted time-evolving graphs. In
ICDM: International Conference on Data Mining, Pisa, Italy,
December 2008.
- Jure Leskovec, Lars Backstrom, Ravi Kumar, Andrew Tomkins.
Microscopic Evolution of Social Networks. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (ACM KDD), 2008.
- For a good survey on link prediction, see Liben-Nowell
and Kleinberg 2003
- [please keep CONFIDENTIAL] Phonecall patterns - reciprocity:
Leman's submission on reciprocity (if I call you 50 times, how many
times will you call me back?)
- Phonecall patterns - duration: In [PKDD2010]
the duration seemed to be log-logistic. What can we say about
correlations: if I make a 30-minute phonecall now, what can you say
about the duration of my next one?
- Comments: There
are several generators in the literature, but the point is to
model the weights.
- Contact person: Leman Akoglu, instructor.
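The reciprocity question above (if I call you 50 times, how many times
will you call me back?) can be explored with a sketch like the
following, which pairs up the two directions of each edge. The input
format {(caller, callee): count}, the function name and the toy data
are assumptions made for illustration.

```python
def reciprocity_pairs(call_counts):
    """Given {(caller, callee): n_calls}, return one (larger, smaller) count
    tuple per unordered pair of people, sorted from heaviest to lightest."""
    seen = set()
    pairs = []
    for (a, b), n_ab in call_counts.items():
        key = frozenset((a, b))
        if a == b or key in seen:
            continue  # skip self-calls and pairs already handled
        seen.add(key)
        n_ba = call_counts.get((b, a), 0)  # 0 if the calls were never returned
        pairs.append((max(n_ab, n_ba), min(n_ab, n_ba)))
    return sorted(pairs, reverse=True)

# Toy call counts, made up for illustration.
calls = {
    ("ann", "bob"): 50, ("bob", "ann"): 12,
    ("ann", "cat"): 3,
    ("dan", "ann"): 7, ("ann", "dan"): 7,
}
for larger, smaller in reciprocity_pairs(calls):
    print(larger, smaller)
```

Plotting the smaller count against the larger one, over all pairs, is
one way to look for a pattern in the weights.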
1.4. Graph similarity, summarization and
approximation.
- Problem: Given a large graph, how can we find
patterns (e.g., communities and anomalies) in an intuitive and
efficient way? How can we track the pattern of interest if the graph
is evolving over time? The problem of graph similarity is also subtly
related: given two graphs, how similar are they? The number of
edges in which they differ is not necessarily a good measure. One way
to attack the problem of approximation is through example-based
low-rank approximation for the adjacency matrix of the graph.
- Data set: DBLP data set; Network Traffic Data set.
- Introductory papers:
- Comments: The problem of graph similarity seems
easy, but it is very subtle. Start from the squared differences of
the eigenvalues; consider the generalized eigenvalue version (see
instructor).
- Contact person: Instructor
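To see why the number of differing edges can mislead, the sketch below
compares a graph with two variants that each lack one edge: one variant
loses an edge inside a triangle, the other loses the only bridge
between two triangles. Both are at edge distance 1 from the original,
yet only the second splits the graph in two. The helper names and the
toy graph are made up for illustration.

```python
def edge_difference(edges_a, edges_b):
    """Number of undirected edges present in exactly one of the two graphs."""
    set_a = {frozenset(e) for e in edges_a}
    set_b = {frozenset(e) for e in edges_b}
    return len(set_a ^ set_b)

def count_components(nodes, edges):
    """Count connected components with a simple union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})

# Two triangles (0-1-2 and 3-4-5) joined by the bridge (2, 3).
nodes = list(range(6))
original = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
no_triangle_edge = [e for e in original if e != (0, 1)]
no_bridge = [e for e in original if e != (2, 3)]

for name, variant in [("no_triangle_edge", no_triangle_edge), ("no_bridge", no_bridge)]:
    print(name, edge_difference(original, variant), count_components(nodes, variant))
```

A similarity measure that is sensitive to connectivity, such as one
based on eigenvalues, would be expected to separate these two cases.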
1.5. [P] Attention Routing
- Problem: Once we
detect an anomaly (say, too many web-pages/nodes have in-degree '128',
indicating a link-farm), what
should we show to the user? In general, given a large graph
(possibly, with weighted edges and/or node-attributes), where
should we route the attention of the user?
- Data set: The YahooWeb
dataset; DBLP; IMDB.
- Introductory papers:
The `apolo' paper [Chau+,
CHI'11]; all the papers on anomaly detection listed earlier (see
project 1.2).
- Comments: The work may require user studies, or at least
'friends and family' studies. Ideally, it should (a) spot the top,
say 5, outliers, and (b) seed the 'apolo' system with those 5
nodes.
- Contact person: Polo Chau
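The in-degree example above suggests a simple first filter: build the
in-degree histogram and flag degree values whose node count is far
above that of the neighboring degree values. The sketch below does
this; the threshold `factor`, the comparison rule, the function name
and the toy graph are assumptions made for illustration, and none of
it is part of 'apolo'.

```python
from collections import Counter

def in_degree_spikes(edges, factor=5.0):
    """Return in-degree values d whose node count exceeds `factor` times the
    average node count at d-1 and d+1 (that average is floored at 1)."""
    in_degree = Counter(dst for _, dst in edges)
    histogram = Counter(in_degree.values())
    spikes = []
    for d, count in sorted(histogram.items()):
        neighbor_avg = (histogram.get(d - 1, 0) + histogram.get(d + 1, 0)) / 2
        if count > factor * max(neighbor_avg, 1.0):
            spikes.append(d)
    return spikes

# Toy graph, made up for illustration: 20 pages that all have in-degree 4,
# plus one page with in-degree 3 and one with in-degree 5.
toy_edges = []
for t in range(20):
    toy_edges += [(f"s{k}", f"farm{t}") for k in range(4)]
toy_edges += [(f"s{k}", "x") for k in range(3)]
toy_edges += [(f"s{k}", "y") for k in range(5)]

print(in_degree_spikes(toy_edges))
```

The nodes behind a flagged degree value could then serve as candidate
seeds to show to the user.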
2. GRAPH GENERATORS
2.1. [P] Model fitting for the 'RTG'
- Problem: Given a real
graph G, how can we generate another graph that looks like it? The
Random Typing Generator (RTG) is such a generator. RTG has several
parameters - which values should we set them to, to mimic a given,
real, graph? One major challenge is to express the slopes of the
degree/eigenvalue/etc power-laws, as a function of the model
parameters.
- Data: Several real
graphs (Epinions, Oregon AS, Flickr etc)
- Introductory papers:
- Comments: Requires a good background in linear algebra.
- Contact person: Leman Akoglu, instructor.
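A common first step when comparing a generated graph to a real one is
to estimate the slope of a power-law on log-log axes. The sketch below
fits a least-squares line to (log x, log y) points. It is a generic
estimator, not the fitting procedure of RTG, and least-squares fits on
log-log plots are only a rough guide. The function name and the
synthetic data are assumptions made for illustration.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope and intercept of log(y) versus log(x).

    All xs and ys must be strictly positive.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic degree distribution following count = 1000 * degree^(-2).
degrees = [1, 2, 4, 8, 16, 32]
counts = [1000.0 * d ** -2.0 for d in degrees]
slope, intercept = loglog_slope(degrees, counts)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Running the same estimator on the real graph and on graphs generated
with different parameter settings gives one way to compare slopes.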
2.2. `PaC' model for graph generation
- Problem: Augment the
'pay and call' model [Du+ KDD09], to make it match more patterns in
real networks; or, augment it with node-attributes, possibly
assuming 'homophily'.
- Data: The usual graph
data; also, confidential who-calls-whom data (but needs NDA)
- Introductory papers: [Du+ KDD09]
- Comments: There are
several possible extensions: one is to make the inter-arrival times
of 'phonecalls', more realistic. Another, orthogonal direction, is
to add attributes to the nodes, and try to make sure that 'similar'
people tend to form groups. A third direction is to mimic the
reciprocity: if I call you 50 times, how often do you call me
back?
- Contact person: Instructor
3. VIRUS, TWEET, AND INFLUENCE PROPAGATION
3.1. [P] Shape, and timing of cascades - 'rise and fall'
- Problem: How does the popularity of a 'fad' grow
and drop over time? Exponentially? Or like a power-law? Does its
shape depend on the topology of the network? If yes, does it depend
on the average degree / diameter / or something else? We understand
that information and "fads" seem to follow an epidemic-type
pattern, and that the shape of the "cascades" follows
patterns.
- Data: Any graph data, ideally weighted (for certain
experiments), see "Graph data" below. Check the memetracker web site
at Cornell, and download its dataset.
- Introductory papers: Start from [Crane and Sornette,'08], which
claims that rise- and fall-times follow power-laws. Also, for
epidemiology surveys, see Hethcote (for SIS/SIR); for shape of
cascades, see Leskovec+PAKDD06 (cascade algs), Leskovec+SDM07
and McGlohon+ICWSM07 (cascades in blogs). Also, the memetracker
paper [KDD'09].
- Comments: Examine real data (e.g., from meme-tracker
above), or crawls from Twitter etc, to check whether
network-related activity follows exponential, or power-law (or some
other shape), in its growth phase, and in its decline phase. A
second direction is to study the shape of cascades (say, re-tweets,
in a crawl of Twitter - are they mainly 'stars'? 'chains'?
in-between?)
- Contact: Aditya Prakash, instructor.
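One simple way to ask 'exponential or power-law?' is to compare how
well a straight line fits the data on semi-log axes (log y versus t,
exponential) and on log-log axes (log y versus log t, power-law). The
sketch below does this, using R^2 as a rough score. The function
names, the synthetic series and the use of R^2 are assumptions made
for illustration; real data would need more careful model selection.

```python
import math

def r_squared(xs, ys):
    """Squared Pearson correlation between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

def decay_shape(times, values):
    """Label a positive series as 'exponential' or 'power-law', whichever
    gives the straighter line; times and values must be strictly positive."""
    log_values = [math.log(v) for v in values]
    r2_exponential = r_squared(times, log_values)
    r2_power_law = r_squared([math.log(t) for t in times], log_values)
    label = "exponential" if r2_exponential > r2_power_law else "power-law"
    return label, r2_exponential, r2_power_law

# Synthetic 'fall' phases, made up for illustration.
times = list(range(1, 51))
power_law_decay = [100.0 * t ** -1.5 for t in times]
exponential_decay = [100.0 * math.exp(-0.2 * t) for t in times]

print(decay_shape(times, power_law_decay)[0])
print(decay_shape(times, exponential_decay)[0])
```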
3.2. [P] Competing viruses
- Problem: Given two
competing products (or viruses, or ideas), like, e.g., iphone and
android, find out what will happen in the steady state (if a
steady-state exists at all): will one product completely take over
and extinguish the other? or will they capture different
market-shares, and reach an equilibrium? or something
else?
- Data: We might be able
to get epidemiology data, but the problem is mainly
theoretical.
- Introductory papers:
The standard texts in epidemiology (the survey
by Hethcote and the book by
Anderson and May).
- Comments: The case of
a clique topology (all nodes are connected to all), with perfect
interaction among the viruses, seems solved: winner takes all. One
direction is to study arbitrary topologies; another is to model the
case where the two strains confer partial immunity (e.g., once I
buy an iphone, I may still buy an android, but with much lower
probability). The project will probably involve many simulation runs
(we recommend the 'condor' system, free at CMU/CS).
- Contact: Aditya Prakash, instructor.
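For the clique case mentioned above, a deterministic mean-field
approximation can be integrated in a few lines. The sketch below
assumes an SIS-style model in which every node is either susceptible,
infected by virus 1, or infected by virus 2, with full mutual
exclusion. The model form, the parameter names and the parameter
values are assumptions made for illustration and are not taken from
the references; plain Euler steps are used.

```python
def simulate_competition(beta1, delta1, beta2, delta2,
                         i1=0.01, i2=0.01, dt=0.05, steps=20000):
    """Euler-integrate the fractions of nodes infected by each virus.

    Assumed mean-field equations on a clique, with s = 1 - i1 - i2:
        di1/dt = beta1 * i1 * s - delta1 * i1
        di2/dt = beta2 * i2 * s - delta2 * i2
    """
    for _ in range(steps):
        s = 1.0 - i1 - i2
        change_1 = beta1 * i1 * s - delta1 * i1
        change_2 = beta2 * i2 * s - delta2 * i2
        i1 += dt * change_1
        i2 += dt * change_2
    return i1, i2

# Hypothetical parameters: virus 1 has the larger beta/delta ratio.
final_1, final_2 = simulate_competition(beta1=0.5, delta1=0.1,
                                        beta2=0.4, delta2=0.1)
print(f"virus 1: {final_1:.4f}, virus 2: {final_2:.6f}")
```

Under these assumptions the virus with the larger beta/delta ratio
settles at a positive level while the other decays toward zero, which
matches the 'winner takes all' remark; arbitrary topologies and
partial immunity would need a different model or stochastic
simulation.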
4. SPATIO/TEMPORAL AND STREAM MINING
4.1. Co-evolving time series mining
- Problem: Given time series of patients (blood pressure
over time, etc), and class labels ('healthy', 'unhealthy'), extract
features and do classification. Or, given a set of sequences of,
say, BGP updates, find correlations and anomalies (BGP = Border
Gateway Protocol, in computer networks). In yet another scenario,
consider monitoring a data center (like the Self-* system or the
Data Center Observatory, both at CMU/PDL). Another application is
monitoring environmental data, to spot, say, global warming,
deforestation, etc - see the web page of Prof. Vipin Kumar.
- Data: From the physionet.org collection; for BGP, check
the Datapository project. For environmental data:
- Introductory paper(s): For BGP, check [Prakash+,
KDD'09] (or here, for a more detailed version). For data center
monitoring, check the SPIRIT project, and the corresponding
publication OSR06. Also the lag-correlation paper [Sakurai+
SIGMOD'05], and the DynaMMo method (Kalman filters for
missing values [Li+ KDD'09]).
- Comments: Start with Fourier and wavelets, for features.
Check the 'DynaMMo' and 'PLiF' methods. For the physionet data, one
challenge is how to handle the many erroneous recordings (e.g.,
blood pressure ~ 0). Depending on the composition of the team, the
project could focus on any of the above settings (environment only;
datacenter only; etc).
- Contact person: Lei Li (until mid October'11); instructor.
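As a concrete starting point for Fourier features, the sketch below
computes the magnitudes of the first few discrete Fourier transform
coefficients of a series with a direct sum (no FFT library). The
function name, the number of coefficients kept and the synthetic
series are assumptions made for illustration.

```python
import cmath
import math

def dft_magnitudes(series, num_coeffs):
    """Magnitudes |X_k| / n of the DFT, for k = 0 .. num_coeffs - 1."""
    n = len(series)
    magnitudes = []
    for k in range(num_coeffs):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(series))
        magnitudes.append(abs(coeff) / n)
    return magnitudes

# Synthetic series: a sine wave completing 3 cycles over 64 samples.
n = 64
series = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
features = dft_magnitudes(series, 8)
print([round(m, 3) for m in features])
```

The resulting short vector can be used as a feature vector for each
time series; the energy of this synthetic series concentrates in
coefficient k = 3.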
DATASETS
Unless explicitly mentioned otherwise, the datasets are either
'public' or 'owned' by the instructor; for the rest, we need to
discuss 'non-disclosure agreements' (NDAs).
Time sequences
- Time series repository at UCR.
- KURSK dataset of multiple time sequences: time series from
seismological sensors by the explosion site of the 'Kursk'
submarine.
- Truck traffic data, from our Civil Engineering
Department. Number of trucks, weight etc per day per highway-lane.
Find patterns, outliers; do data cleansing.
- River-level / hydrology data: multiple,
correlated time series. Do data cleansing; find correlations
between these series. Excellent project for people that like
canoeing!
- Sunspots:
number of sunspots per unit time. Some data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
- Time sequences from the
Santa Fe Institute forecasting competition (financial data,
laser-beam oscillation data, patients' apnea data etc)
- Disk access
traces, from HP Labs (we have local copies at CMU).
For each disk access, we have the timestamp, the block-id, and the
type ('read'/'write'). Here is a
snippet of the data, aggregated per 30'.
- Network traffic data from datapository.net at CMU
- Motion-capture data from CMU mocap.cmu.edu
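For trace-like data such as the disk accesses above (timestamp,
block-id, type), a usual first step is to aggregate events into fixed
time windows. The sketch below counts reads and writes per window. It
assumes that timestamps are in seconds and that "per 30'" means
30-minute windows; the record layout, the function name and the toy
trace are also assumptions made for illustration.

```python
from collections import Counter

def aggregate_accesses(records, window_seconds=30 * 60):
    """Count accesses of each type per fixed-size time window.

    records: iterable of (timestamp_seconds, block_id, access_type).
    Returns {window_index: {access_type: count}}, ordered by window.
    """
    counts = {}
    for timestamp, _block_id, access_type in records:
        window = int(timestamp // window_seconds)
        counts.setdefault(window, Counter())[access_type] += 1
    return {w: dict(c) for w, c in sorted(counts.items())}

# Toy trace, made up for illustration.
trace = [
    (10.0, 7, "read"), (900.5, 7, "write"), (1799.9, 8, "read"),
    (1800.0, 9, "read"), (5400.0, 7, "write"),
]
for window, per_type in aggregate_accesses(trace).items():
    print(window, per_type)
```

Windows with no accesses are simply absent from the result; filling
them with zeros may be needed before applying time series methods.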
Spatial data
- Astrophysics data - thousands of galaxies, with
coordinates, red-shift, spectra, photographs.
Small snippet of the data. More data are in the 'skyserver' web
site, where you can ask SQL queries and
get data in html or csv format
- Synthetic astrophysics
data: 1K of (x,y,z, weight) tuples, from Prof. Rupert Croft
(CMU). The full dataset is 200Mb compressed - contact
instructor.
- Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc).
Snippet of data (roads from California, from TIGER).
Graph data
- YahooWeb crawl (120Gb, 1B nodes, 6B edges). Needs mild NDA.
- Web-log and click-stream data (NDA: needed).
- Call-graphs: snapshots of anonymized (and anonymous)
who-calls-whom graphs (NDA)
- Enron email dataset (400 MB compressed)
- Movie-actor data from imdb.com (we have a cleaned-up snapshot
of it)
- DBLP author-paper-conference data from the DBLP site of Mike
Ley (records in XML, and their DTD). For 'ego-surfing', try this
java app or the java applet at U. Alberta.
- Graph datasets at U.Mass (Amherst), by Prof. Dave Jensen.
- More graph datasets from Mark Newman (U. Michigan) - including
popular test-beds like Zachary's karate club social network.
- patent information, from googlebooks
(mirroring the U.S. Patent Office). Contact instructor for a
who-cites-whom file.
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.
- Readily available:
- ACCESS METHODS
- DR-tree: R-tree code; searches for range and nearest-neighbor
queries. In C.
- kd-tree code
- OMNI trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle
numbers, too). In C.
- SVD AND TENSORS:
- FRACTALS
- GRAPHS
- the PEGASUS package for graph mining on hadoop.
- the NetMine network topology analysis package
- GMine: interactive graph visualization package and graph
manipulation library (by Junio (Jose Fernandez Rodrigues Junior) and
Jure Leskovec)
- the 'crossAssociation' package for graph partitioning.
- Outside CMU:
- GiST package from
Hellerstein at UC Berkeley: A general spatial access method, which
is easy to customize. It is already customized to yield
R-trees.
- hadoop, PIG and hbase
- pajek, jung, graphviz, guess, cytoscape, for (small) graph
visualization
- METIS, for graph partitioning
BIBLIOGRAPHICAL RESOURCES:
Last modified Sept. 20, 2011, by Christos Faloutsos.