Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2016 - C. Faloutsos
List of suggested non-default projects, for PhD students
PRELIMINARIES
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.S16/CMU-ONLY/projlist.html
- Please form groups of 2 people.
- Please check the 'blackboard' system, where we
will create one thread for each of the projects below. Please
indicate your interest, by posting in the appropriate thread(s), so
that you can find partners.
SUGGESTED TOPICS
You may negotiate with the instructor, and propose a project
outside of this list.
1. GRAPH / TENSOR MINING
1.1. Spam Detection for Review Data
- Problem: Review data
provides valuable information about products and services. Review
data is ubiquitous on websites such as Amazon, Yelp or Tripadvisor, and
is being frequently used by customers to choose among competing
products or services. Since reviews highly affect the buying
behaviour of customers, spammers try to mislead the users by
writing fake reviews. The goal of this project is to develop
methods to detect users showing spamming behaviour. We want to
start with a feature based detection of spammers: What are the
characteristics of a spammer? Which features can be used to
discriminate between spammers and non-spammers? Are these features
useful for all users or only for a subset of users? Based on this
feature representation, automatic methods to classify/rank the
users regarding their spamming behaviour should be developed
exploiting, e.g., the principles of subspace
clustering/co-clustering or low rank matrix factorization.
- Data: The participants
can test their methods on multiple review datasets such as Amazon
(6M reviews) and Yelp (300K reviews).
- Introductory material:
- Paper on review spam: Arjun Mukherjee, Abhinav Kumar, Bing Liu,
Junhui Wang, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh.
Spotting Opinion Spammers using Behavioral Footprints. SIGKDD
International Conference on Knowledge Discovery and Data Mining
(KDD-2013), August 11-14 2013 in Chicago, USA.
- Overview of subspace clustering techniques: Hans-Peter Kriegel,
Peer Kroeger, Arthur Zimek: Clustering high-dimensional data: A
survey on subspace clustering, pattern-based clustering, and
correlation clustering. TKDD 3(1) (2009)
- Contact Persons:
Instructor; Mr. Neil Shah; and Mr. Alex Beutel.
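As a starting point for the feature-based step above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `(user, product, rating, day)` review schema, the three features (review count, mean deviation from the product's average rating, maximum reviews per day), and the z-score ranking. The features are only in the spirit of behavioral features; they are not taken from the paper listed above, and the ranking is a naive stand-in for the subspace clustering or matrix factorization methods the project should actually explore.

```python
from collections import defaultdict
from statistics import mean, pstdev


def user_features(reviews):
    """reviews: list of (user, product, rating, day) tuples (hypothetical schema).

    Returns {user: (n_reviews, mean_abs_rating_deviation, max_reviews_per_day)}.
    """
    # Average rating of each product, over all its reviews.
    by_product = defaultdict(list)
    for _user, product, rating, _day in reviews:
        by_product[product].append(rating)
    product_avg = {p: mean(r) for p, r in by_product.items()}

    deviations = defaultdict(list)                  # user -> |rating - product avg|
    per_day = defaultdict(lambda: defaultdict(int))  # user -> day -> review count
    for user, product, rating, day in reviews:
        deviations[user].append(abs(rating - product_avg[product]))
        per_day[user][day] += 1

    return {
        u: (len(devs), mean(devs), max(per_day[u].values()))
        for u, devs in deviations.items()
    }


def rank_by_suspicion(features):
    """Rank users by the sum of their per-feature z-scores, highest first."""
    users = list(features)
    n_feats = len(next(iter(features.values())))
    scores = dict.fromkeys(users, 0.0)
    for f in range(n_feats):
        col = [features[u][f] for u in users]
        mu, sigma = mean(col), pstdev(col)
        if sigma == 0:
            continue  # a constant feature carries no ranking information
        for u in users:
            scores[u] += (features[u][f] - mu) / sigma
    return sorted(users, key=lambda u: scores[u], reverse=True)
```

Calling `rank_by_suspicion(user_features(reviews))` returns the user names ordered from most to least suspicious under these toy features. Whether the same features are useful for all users, or only for a subset, is exactly the open question of the project.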
1.2. Weighted graphs over time
- Problem: How
can we model weighted graphs (for example, with network packets
flowing between nodes) for future prediction? Is there any pattern
with respect to the weights? What kind of patterns would we expect?
How are the weights distributed on the incident edges of a given
node? For a given edge, do weight arrivals show any interesting
behavior besides being bursty? How can we do a "microscopic"
analysis, so as to model a given weighted graph over time?
- Data:
call-graph (needs NDA), Campaign donations
- Introductory papers:
- Mary McGlohon, Leman Akoglu, and Christos Faloutsos.
Weighted graphs and disconnected components: Patterns and a
model. In ACM SIG-KDD, Las Vegas, Nev., USA, August 2008.
- Leman Akoglu, Mary McGlohon, and Christos Faloutsos. RTM: Laws
and a recursive generator for weighted time-evolving graphs. In
ICDM: International Conference on Data Mining, Pisa, Italy,
December 2008.
- Jure Leskovec, Lars Backstrom, Ravi Kumar, and Andrew Tomkins.
Microscopic Evolution of Social Networks. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (ACM KDD), 2008.
- For a good survey on link prediction, see Liben-Nowell
and Kleinberg 2003
- Phonecall patterns - duration: In [PKDD2010]
the duration seemed to be log-logistic. What can we say about
correlations: if I make a 30-minute phone call now, what can you say
about the duration of my next one?
- Comments: There are
several generators in the literature, but the point is to model
the weights.
- Contact Person: instructor; Mr. Daniel Chino chinodyt AT gmail
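One simple way to start on the correlation question for phonecall durations is to measure how the duration of one call relates to the duration of the next. The sketch below computes the Pearson correlation between consecutive log-durations for a single caller. Taking logs is our own choice here, motivated by the heavy-tailed (log-logistic) shape mentioned above, and the function name is hypothetical.

```python
import math


def lag1_correlation(durations):
    """Pearson correlation between log-duration of call t and of call t+1.

    durations: one caller's call durations in time order; non-positive
    values are dropped before taking logs.
    """
    logs = [math.log(d) for d in durations if d > 0]
    x, y = logs[:-1], logs[1:]
    n = len(x)
    if n < 2:
        raise ValueError("need at least three positive durations")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("durations are constant; correlation is undefined")
    return cov / math.sqrt(vx * vy)
```

A value near zero would suggest that the next duration is roughly independent of the current one. A clearly positive or negative value would be a pattern worth modeling.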
1.3. Tensor decomposition using RDBMS
- Problem: Let's take
the 2nd default project a step further. Can SQL be used to
manipulate time-evolving graphs? We are particularly interested
in applying SQL to the tensor decomposition problem: given a 3-way
tensor (for instance, indicating if person i contacted
person j on day k) we want to find heavy blocks in
the tensor. Using the previous example, we are looking for a set of
people that called a set of other people on a set of days (the
output would be a set of these 3 vectors). There are many
algorithms that can be applied to solve this problem, but can any
of them be implemented in SQL (and thus be easily
parallelizable)?
- Data: Any temporal
graph will do; we have phone network, computer communication
network and email network data available.
- Introductory material:
- The Pegasus paper with GIM-V
is a good starting point to understand how common matrix operations
can be applied in SQL.
- Navasca's
presentation is a simple introduction to CP decomposition and
the ALS method.
- Tamara Kolda and Brett Bader's survey is a
more detailed alternative to understand all the notation and the
most common algorithms.
- Comments: This project
combines a fair amount of implementation and mathematical problems
and can definitely lead to a publication.
- Contact Persons:
Instructor; Mr. Vagelis
Papalexakis.
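To make the SQL angle concrete, here is a hedged sketch of one building block, not a full decomposition. With the sparse tensor stored as a table of `(i, j, k, val)` rows, the unnormalized rank-1 update of the first factor, a_i = sum over (j, k) of X[i,j,k] * b[j] * c[k], is a join followed by a GROUP BY. The table names, the function name, and the use of Python's built-in `sqlite3` as the engine are all assumptions made for illustration.

```python
import sqlite3


def rank1_mode1_update(entries, b, c):
    """entries: list of (i, j, k, val) nonzeros; b, c: dicts mapping index -> value.

    Returns {i: sum over (j, k) of X[i,j,k] * b[j] * c[k]}, computed in SQL.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tensor (i INTEGER, j INTEGER, k INTEGER, val REAL)")
    con.execute("CREATE TABLE vec_b (j INTEGER PRIMARY KEY, val REAL)")
    con.execute("CREATE TABLE vec_c (k INTEGER PRIMARY KEY, val REAL)")
    con.executemany("INSERT INTO tensor VALUES (?, ?, ?, ?)", entries)
    con.executemany("INSERT INTO vec_b VALUES (?, ?)", b.items())
    con.executemany("INSERT INTO vec_c VALUES (?, ?)", c.items())
    # Join each nonzero with the matching entries of b and c, then sum per i.
    rows = con.execute(
        """
        SELECT t.i, SUM(t.val * vec_b.val * vec_c.val)
        FROM tensor AS t
        JOIN vec_b ON vec_b.j = t.j
        JOIN vec_c ON vec_c.k = t.k
        GROUP BY t.i
        """
    ).fetchall()
    con.close()
    return dict(rows)
```

The updates for the other two modes follow by symmetry (group by `j` or `k` instead). A rank-R version would add a component column to the factor tables. Normalization and convergence checks are left out of this sketch.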
2. MODELING
2.1 'Brain in a box'
- Problem: Can you
design a neural network to mimic the level of energy activity
of a real brain, when it is performing some tasks? Start with a
survey on "recurrent neural networks", and the pointers on
"system identification" in the introductory paper below. Design a
GUI, so that we can add/delete/modify neurons, and see the
reactions of the resulting "brain".
- Data: From
Vagelis.
- Introductory Material:
the paper below, and its citations
- Evangelos E. Papalexakis, Alona Fyshe, Nicholas
Sidiropoulos, Partha Pratim Talukdar, Tom Mitchell, Christos
Faloutsos.
Good-Enough Brain Model: Challenges, Algorithms and Discoveries in
Multi-Subject Experiments. ACM SIGKDD 2014, New York
City, USA
- Comments: Hard problem
in general, but the GUI should be do-able within a
semester.
- Contact Persons:
Instructor; Mr. Vagelis
Papalexakis.
3. TIME SERIES
3.1 Guess the next flu spike: Co-evolving time series
mining
- Problem: Given time series of patients (blood pressure
over time, etc.), and class labels ('healthy', 'unhealthy'), extract
features and do classification. Or, given a set of sequences of,
say, BGP updates, find correlations and anomalies (BGP = Border
Gateway Protocol, in computer networks). In yet another scenario,
consider monitoring a data center (like the Self-* system or the
Data Center Observatory, both at CMU/PDL). Another application is
monitoring environmental data, to spot, say, global warming,
deforestation, etc.; see the web page of Prof. Vipin Kumar.
- Data:
- Very interesting dataset: from the tycho project - epidemiology time
series, with # of infected people per unit time per US city per
disease.
- Other data: from the physionet.org collection.
- Introductory paper(s): For spikes in epidemiology data,
check the 'spikeM' model [KDD'12]. For BGP, check [Prakash+,
KDD'09] (or here, for a more detailed version). For data center
monitoring, check the SPIRIT project and the corresponding
publication OSR06. Also the lag-correlation paper [Sakurai+
SIGMOD'05], and the DynaMMo method (Kalman filters for
missing values) [Li+ KDD'09].
- Comments: Start with Fourier and wavelets, for features.
For the 'tycho' data, try the 'spikeM' method. Check the 'DynaMMo'
and 'PLiF' methods. For the physionet data, one challenge is how to
handle the many erroneous recordings (e.g., blood pressure ~ 0).
Depending on the composition of the team, the project could focus
on any of the above settings (environment only; datacenter only;
etc.). There is a lot of code on the web site of Prof. Yasuko
Matsubara.
- Contact person: instructor; Mr. Bryan Hooi
bhooi@andrew
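For the suggestion to start with Fourier features, here is a minimal sketch using only the Python standard library. It returns the magnitudes of the first few non-DC discrete Fourier transform coefficients of a mean-centered series, which can then be fed to any classifier. The function name and the choice of four coefficients are assumptions made for illustration. For long series, an FFT routine from a numerical library would be the practical choice, since this direct sum costs O(n) per coefficient.

```python
import cmath
import math


def dft_features(series, n_coeffs=4):
    """Magnitudes of the first n_coeffs non-DC DFT coefficients.

    The series is mean-centered first, and magnitudes are divided by the
    series length so that features are comparable across lengths.
    """
    n = len(series)
    mu = sum(series) / n
    centered = [x - mu for x in series]
    feats = []
    for f in range(1, n_coeffs + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * f * t / n)
            for t, x in enumerate(centered)
        )
        feats.append(abs(coeff) / n)
    return feats
```

A series dominated by a single periodic component shows up as one large feature at the matching frequency index. Erroneous recordings (such as the near-zero blood pressure values mentioned above) should be cleaned or masked before this step.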
DATASETS
Unless explicitly mentioned, the datasets are either
'public' or 'owned' by the instructor; for the rest, we need to
discuss 'Non-disclosure agreements' (NDAs).
Time sequences
- Time series
repository at UCR.
- KURSK
dataset of multiple time sequences: time series from
seismological sensors near the explosion site of the 'Kursk'
submarine.
- Truck traffic data, from our Civil Engineering
Department. Number of trucks, weight etc. per day per highway lane.
Find patterns, outliers; do data cleansing.
- River-level / hydrology data: multiple,
correlated time series. Do data cleansing; find correlations
between these series. Excellent project for people that like
canoeing!
- Sunspots: number of sunspots per unit time. Some
data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
- Time sequences from the
Santa Fe Institute forecasting competition (financial data,
laser-beam oscillation data, patients' apnea data, etc.)
- Disk access
traces, from HP Labs (we have local copies at CMU).
For each disk access, we have the timestamp, the block-id, and the
type ('read'/'write'). Here is a
snippet of the data, aggregated per 30'.
- Network traffic data from datapository.net at CMU
- Motion-capture data from CMU mocap.cmu.edu
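For the disk access traces in the list above, a per-interval aggregation like the one in the snippet can be sketched as follows. We assume that each record is a `(timestamp, block-id, type)` tuple with the timestamp in seconds, and that 30' means 30 minutes; the actual trace format may differ, and the function name is hypothetical.

```python
from collections import Counter

BIN_SECONDS = 30 * 60  # assumed bin width: 30 minutes


def aggregate_accesses(trace, bin_seconds=BIN_SECONDS):
    """trace: iterable of (timestamp_seconds, block_id, kind) records,
    where kind is 'read' or 'write'.

    Returns {bin_index: Counter mapping kind -> number of accesses}.
    """
    bins = {}
    for ts, _block, kind in trace:
        b = int(ts // bin_seconds)  # index of the bin containing ts
        bins.setdefault(b, Counter())[kind] += 1
    return bins
```

The resulting per-bin read and write counts form two co-evolving time series, suitable for the pattern and outlier analyses suggested above.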
Spatial data
- Astrophysics data - thousands of galaxies, with
coordinates, red-shift, spectra, photographs.
Small snippet of the data. More data are in the 'skyserver' web
site, where you can ask SQL queries and
get data in html or csv format
- Synthetic astrophysics
data: 1K of (x,y,z, weight) tuples, from Prof. Rupert Croft
(CMU). The full dataset is 200Mb compressed - contact
instructor.
- Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc).
Snippet of data (roads from California, from TIGER).
Graph data - need NDA
- YahooWeb crawl
(120Gb, 1B nodes, 6B edges). Needs mild NDA
- Web-log and click-stream data (NDA: needed).
- call-graphs: snapshots of anonymized (and anonymous)
who-calls-whom graphs (NDA)
Graph Data - public
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.
- Readily available:
- ACCESS METHODS
- DR-tree : R-tree code; searches for range and nearest-neighbor
queries. In C.
- kd-tree code
- OMNI
trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle
numbers, too). In C.
- SVD AND TENSORS:
- FRACTALS
- GRAPHS
- the PEGASUS
package for graph mining on hadoop.
- the NetMine network topology analysis package
- GMine:
interactive graph visualization package and graph manipulation
library (by Junio (Jose Fernandez Rodrigues Junior) and Jure
Leskovec)
- the 'crossAssociation' package for graph partitioning.
- Outside CMU:
- GiST package from
Hellerstein at UC Berkeley: A general spatial access method, which
is easy to customize. It is already customized to yield
R-trees.
- hadoop, PIG and hbase
- pajek, jung, graphviz, guess, cytoscape, for (small) graph
visualization
- METIS, for graph
partitioning
BIBLIOGRAPHICAL RESOURCES:
Last modified Jan. 18, 2016, by Christos Faloutsos.