Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2017 - C. Faloutsos
List of suggested non-default projects, for PhD students
PRELIMINARIES
The projects are grouped according to their general theme. We also
list the data and software available, to leverage your effort. More
links and resources may be added in the future. Reminders:
- URL for this very page (internal to CMU - please treat it
'confidentially'):
www.cs.cmu.edu/~christos/courses/826.S17/CMU-ONLY/projlist.html
- Please form groups of 2 people.
- Please check the 'blackboard' system, where we
will create one thread for each of the projects below. Please
indicate your interest, by posting in the appropriate thread(s), so
that you can find partners.
SUGGESTED TOPICS
You may negotiate with the instructor, and propose a project
outside of this list.
1. GRAPH / TENSOR MINING
1.1. Spam Detection for Review Data
- Problem: Review data
provides valuable information about products and services. Review
data is ubiquitous on websites such as Amazon, Yelp or Tripadvisor, and
is frequently used by customers to choose among competing
products or services. Since reviews highly affect the buying
behaviour of customers, spammers try to mislead the users by
writing fake reviews. The goal of this project is to develop
methods to detect users showing spamming behaviour. We want to
start with a feature-based detection of spammers: What are the
characteristics of a spammer? Which features can be used to
discriminate between spammers and non-spammers? Are these features
useful for all users or only for a subset of users? Based on this
feature representation, automatic methods to classify/rank the
users regarding their spamming behaviour should be developed
exploiting, e.g., the principles of subspace
clustering/co-clustering or low rank matrix factorization.
- Data: The participants
can test their methods on multiple review datasets such as Amazon
(6M reviews) and Yelp (300K reviews).
- Introductory material:
- Paper on review spam: Arjun Mukherjee, Abhinav Kumar, Bing Liu,
Junhui Wang, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh.
Spotting Opinion Spammers using Behavioral Footprints. SIGKDD
International Conference on Knowledge Discovery and Data Mining
(KDD-2013), August 11-14 2013 in Chicago, USA.
- Overview of subspace clustering techniques: Hans-Peter Kriegel,
Peer Kroeger, Arthur Zimek: Clustering high-dimensional data: A
survey on subspace clustering, pattern-based clustering, and
correlation clustering. TKDD 3(1) (2009)
- Contact Person:
Instructor;
Mr. Neil Shah
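As a possible starting point for the feature-based step in 1.1, here is a minimal sketch in Python. The three behavioral features (review count, maximum reviews in one day, mean deviation from the product's average rating) and the function names `user_features` and `low_rank_residual` are illustrative assumptions, not features prescribed by the project or taken from the papers above. The second function scores each user by how poorly a low-rank approximation reconstructs their standardized feature vector, which is only one of many ways to use matrix factorization here.

```python
from collections import Counter, defaultdict

import numpy as np


def user_features(reviews):
    """reviews: iterable of (user, product, rating, day) tuples.

    Returns (users, X) where row n of X holds, for users[n]:
    [number of reviews, max reviews in one day, mean |rating - product mean|].
    These features are hypothetical examples.
    """
    ratings_by_product = defaultdict(list)
    for _user, product, rating, _day in reviews:
        ratings_by_product[product].append(rating)
    product_mean = {p: float(np.mean(r)) for p, r in ratings_by_product.items()}

    per_day = defaultdict(Counter)
    deviations = defaultdict(list)
    for user, product, rating, day in reviews:
        per_day[user][day] += 1
        deviations[user].append(abs(rating - product_mean[product]))

    users = sorted(per_day)
    X = np.array([[sum(per_day[u].values()),
                   max(per_day[u].values()),
                   float(np.mean(deviations[u]))] for u in users], dtype=float)
    return users, X


def low_rank_residual(X, rank=1):
    """Per-user reconstruction error after a rank-`rank` truncated SVD."""
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # standardize columns
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    Z_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return np.linalg.norm(Z - Z_hat, axis=1)


# Toy data (made up): u1 posts three 5-star reviews on the same day.
toy_reviews = [
    ("u1", "p1", 5, 1), ("u1", "p2", 5, 1), ("u1", "p3", 5, 1),
    ("u2", "p1", 3, 2), ("u3", "p2", 3, 4), ("u2", "p3", 3, 7),
]
toy_users, toy_X = user_features(toy_reviews)
toy_scores = low_rank_residual(toy_X, rank=1)
```

Users with a large residual deviate from the dominant behavioral pattern; whether such deviation corresponds to spamming is exactly the question the project would need to evaluate.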
1.2. Weighted graphs over time
- Problem: How
can we model weighted graphs (for example, with network packets
flowing between nodes) for future prediction? Is there any pattern
with respect to the weights? What kind of patterns would we expect?
How are the weights distributed on the incident edges of a given
node? For a given edge, do weight arrivals show any interesting
behavior besides being bursty? How can we do a "microscopic"
analysis, so as to model a given weighted graph over time?
- Data:
call-graph (needs NDA), Campaign donations
- Introductory papers:
- Mary McGlohon, Leman Akoglu, and Christos Faloutsos.
Weighted graphs and disconnected components: Patterns and a
model. In ACM SIG-KDD, Las Vegas, Nev., USA, August 2008.
- Leman Akoglu, Mary McGlohon, and Christos Faloutsos. RTM: Laws
and a recursive generator for weighted time-evolving graphs. In
ICDM: International Conference on Data Mining, Pisa, Italy,
December 2008.
- Jure Leskovec, Lars Backstrom, Ravi Kumar, Andrew Tomkins.
Microscopic Evolution of Social Networks. ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (ACM KDD), 2008.
- For a good survey on link prediction, see Liben-Nowell
and Kleinberg 2003
- Phonecall patterns - duration: In [PKDD2010]
the duration seemed to be log-logistic. What can we say about
correlations: if I make a 30' phonecall now, what can you say about
the duration of my next one?
- Comments: There are
several generators in the literature, but the point is to model
the weights.
- Contact Person: instructor; Ms. Dhivya Eswaran and Mr. Hemank Lamba.
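The phonecall-duration question in 1.2 can be probed with a very small experiment. The sketch below is in Python and rests on assumptions: the scale and shape values are made up, and correlation between the logarithms of consecutive durations is only one possible way to quantify "what can you say about my next call". It samples log-logistic durations by inverting the CDF F(x) = 1 / (1 + (x/scale)^(-shape)) and measures the lag-1 correlation.

```python
import numpy as np


def sample_log_logistic(n, scale, shape, rng):
    """Draw n log-logistic samples by inverse-CDF sampling."""
    # Keep u strictly inside (0, 1) so the result is finite and positive.
    u = rng.uniform(low=1e-12, high=1.0 - 1e-12, size=n)
    return scale * (u / (1.0 - u)) ** (1.0 / shape)


def lag1_log_correlation(durations):
    """Pearson correlation between log-duration of call t and call t+1."""
    logs = np.log(np.asarray(durations, dtype=float))
    return float(np.corrcoef(logs[:-1], logs[1:])[0, 1])


# Synthetic, independent durations (parameter values are illustrative only).
demo_rng = np.random.default_rng(0)
demo_durations = sample_log_logistic(20000, scale=60.0, shape=2.0, rng=demo_rng)
demo_corr = lag1_log_correlation(demo_durations)
```

On independent synthetic samples the correlation should be close to zero; a clearly non-zero value on the sequence of calls of a real caller would suggest that successive durations are not independent.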
1.3. Tensor decomposition using RDBMS
- Problem: Can SQL be used to
manipulate temporally evolving graphs? We are particularly interested
in applying SQL to the tensor decomposition problem: given a 3-way
tensor (for instance, indicating if person i contacted
person j on day k),
we want to find heavy blocks in
the tensor. Using the previous example, we are looking for a set of
people that called a set of other people on a set of days (the
output would be a set of these 3 vectors). There are many
algorithms that can be applied to solve this problem, but can any
of them be implemented in SQL (and thus be easily
parallelized)? It turns out that for 3 modes, we can do that (see
contact person Mr. Shin, below).
- Can we extend it to N-modes (N>3)?
There are subtle optimizations that we need to consider.
- Can we extend it to other types of decompositions, like non-negative or Boolean tensor decomposition?
- Data: Any temporal
graph will do; we have phone network, computer communication
network and email network data available. We also have 4-mode data
(intrusion detection: (source-IP-address, destination-IP,
destination-port, timestamp)).
- Introductory material:
- The Pegasus paper with GIM-V
is a good starting point to understand how common matrix operations
can be applied in SQL.
- Navasca's
presentation is a simple introduction to CP decomposition and
the ALS method.
- Tamara Kolda and Brett Bader's survey is a
more detailed alternative to understand all the notation and the
most common algorithms.
- Boolean tensor decomposition
- non-negative tensor decomposition
- Comments: This project
combines implementation and mathematical problems,
and can definitely lead to a publication.
- Contact Persons:
Instructor; Mr. Kijung Shin.
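To make the "tensor operations in SQL" idea of 1.3 concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`tensor`, `b`, `c`, `idx`, `val`) and the toy data are assumptions made for illustration, and this is not necessarily the approach the contact person has in mind. It stores the sparse 3-way tensor as (i, j, k, val) rows and computes, with one join and one GROUP BY, the unnormalized mode-1 update a_i = sum over j,k of X_ijk * b_j * c_k for a single rank-1 component.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tensor (i INTEGER, j INTEGER, k INTEGER, val REAL);
    CREATE TABLE b (idx INTEGER PRIMARY KEY, val REAL);  -- mode-2 vector
    CREATE TABLE c (idx INTEGER PRIMARY KEY, val REAL);  -- mode-3 vector
""")

# Toy data: people {0,1} called people {0,1} on days {0,1} (a dense block),
# plus one isolated non-zero at (2, 2, 2).
block = [(i, j, k, 1.0) for i in (0, 1) for j in (0, 1) for k in (0, 1)]
conn.executemany("INSERT INTO tensor VALUES (?, ?, ?, ?)", block + [(2, 2, 2, 1.0)])
conn.executemany("INSERT INTO b VALUES (?, ?)", [(0, 1.0), (1, 1.0), (2, 0.0)])
conn.executemany("INSERT INTO c VALUES (?, ?)", [(0, 1.0), (1, 1.0), (2, 0.0)])


def update_mode1(connection):
    """Return {i: sum_{j,k} X[i,j,k] * b[j] * c[k]} computed inside SQL."""
    rows = connection.execute("""
        SELECT t.i, SUM(t.val * b.val * c.val)
        FROM tensor AS t
        JOIN b ON b.idx = t.j
        JOIN c ON c.idx = t.k
        GROUP BY t.i
        ORDER BY t.i
    """).fetchall()
    return dict(rows)


a = update_mode1(conn)
```

Only the non-zero entries of the tensor are stored and joined, so the cost depends on the number of non-zeros rather than on the full size of the tensor. The analogous queries for the other two modes, and the normalization step, follow the same pattern.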
1.4. Confidence-based ranking for graph classification
- Problem: Recent work has shown that incorporating the notion of confidence in node classification improves classification accuracy. However, many real-world applications (e.g., fraud detection) are precision-critical, and the ranking of nodes becomes important. The problem is this: Given the scores of each node ([p_1,...,p_k], c), where p_i is the probability for class i and c is the confidence of this classification, how would you rank the nodes to achieve maximum accuracy at top k (for any k)?
- Data: political blogs data (polblogs), DBLP coauthorship network (coauthor), a friendship network (pokec)
- Introductory papers:
- Comments: This project is fairly open-ended and a thorough literature survey for ranking mechanisms is recommended. The general problem for k classes is a bit hard; a good starting point would be to try out this ranking scheme for binary classification.
- Contact person(s): Instructor; Ms. Dhivya Eswaran
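For the binary case suggested in the comments of 1.4, one simple baseline can be sketched in Python as follows. The scoring rule (shrinking the class probability toward a prior in proportion to the lack of confidence), the names `shrunk_score` and `precision_at_k`, and the toy numbers are all assumptions for illustration; finding a better rule is the point of the project.

```python
def shrunk_score(p_pos, confidence, prior=0.5):
    """Hypothetical ranking score: trust p_pos only as far as the confidence allows."""
    return confidence * p_pos + (1.0 - confidence) * prior


def precision_at_k(scores, labels, k):
    """Fraction of positives among the k highest-scoring nodes."""
    order = sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)
    return sum(labels[n] for n in order[:k]) / k


# Toy nodes (made up): (probability of positive class, confidence, true label).
nodes = [(0.95, 0.2, 0), (0.85, 0.9, 1), (0.80, 0.8, 1), (0.60, 0.5, 0)]
labels = [label for _p, _c, label in nodes]
raw_scores = [p for p, _c, _label in nodes]
conf_scores = [shrunk_score(p, c) for p, c, _label in nodes]

raw_p_at_2 = precision_at_k(raw_scores, labels, 2)
conf_p_at_2 = precision_at_k(conf_scores, labels, 2)
```

In this toy example, the node with the highest probability has low confidence and is a false positive, so ranking by probability alone gives a lower precision at k = 2 than the confidence-aware score does.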
2. MODELING
2.1 'Brain in a box'
- Problem: Can you
design a neural network, to mimic the activity levels
of a real brain, when it is performing some tasks? Start with a
survey on "recurrent neural networks", and the pointers on
"system identification" in the introductory paper below. Design a
GUI, so that we can add/delete/modify neurons, and see the
reactions of the resulting "brain".
- Data: From
Vagelis.
- Introductory Material:
the paper below, and its citations
- Evangelos E. Papalexakis, Alona Fyshe, Nicholas
Sidiropoulos, Partha Pratim Talukdar, Tom Mitchell, Christos
Faloutsos.
Good-Enough Brain Model: Challenges, Algorithms and Discoveries in
Multi-Subject Experiments. ACM SIGKDD 2014, New York
City, USA.
- Comments: Hard problem
in general, but the GUI should be do-able within a
semester.
- Contact Persons:
Instructor; Ms. Hyun Ah Song.
3. TIME SERIES
3.1 Guess the next flu spike: Co-evolving time series
mining
- Problem: Given time series of patients (blood pressure
over time, etc.), and class labels ('healthy', 'unhealthy'), extract
features and do classification. Or, given a set of sequences of,
say, BGP updates, find correlations and anomalies (BGP = Border
Gateway Protocol, in computer networks). In yet another scenario,
consider monitoring a data center (like the Self-* system or the
Data Center Observatory, both at CMU/PDL). Another application is
monitoring environmental data, to spot, say, global warming,
deforestation, etc.; see the web page of Prof. Vipin Kumar.
- Data:
- Very interesting dataset from the Tycho project: epidemiology time
series, with the number of infected people per unit time per US city per
disease.
- Other data: from the physionet.org collection.
- Introductory paper(s): For spikes in epidemiology data,
check the 'spikeM' model [KDD'12]. For BGP, check [Prakash+,
KDD'09] (or here, for a more detailed version). For data center
monitoring, check the SPIRIT project and the corresponding
publication OSR06. Also the lag-correlation paper [Sakurai+
SIGMOD'05], and the DynaMMo method (Kalman filters for
missing values) [Li+ KDD'09].
- Comments: Start with Fourier and wavelets, for features.
For the 'tycho' data, try the 'spikeM' method. Check the 'DynaMMo'
and 'PLiF' methods. For the physionet data, one challenge is how to
handle the many erroneous recordings (e.g., blood pressure ~ 0).
Depending on the composition of the team, the project could focus
on any of the above settings (environment only; datacenter only;
etc.). There is a lot of code on the web site of Prof.
Yasuko Matsubara.
- Contact person: instructor; Mr. Bryan Hooi
bhooi@andrew
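The suggestion in 3.1 to start with Fourier features can be sketched in a few lines of Python with NumPy. The function names `fourier_features` and `dominant_period`, the choice of the low-frequency amplitudes as features, and the synthetic weekly series are assumptions for illustration only.

```python
import numpy as np


def fourier_features(series, n_coeffs=4):
    """Amplitudes of the n_coeffs lowest non-zero frequencies of the series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                       # remove the constant (DC) component
    amplitudes = np.abs(np.fft.rfft(x)) / len(x)
    return amplitudes[1:n_coeffs + 1]


def dominant_period(series):
    """Period (in samples) of the strongest non-constant frequency."""
    x = np.asarray(series, dtype=float)
    amplitudes = np.abs(np.fft.rfft(x - x.mean()))
    k = int(np.argmax(amplitudes[1:])) + 1  # skip the DC bin
    return len(x) / k


# Synthetic weekly counts with a yearly (52-week) cycle over 10 years.
weeks = np.arange(520)
weekly_counts = 10.0 + 3.0 * np.sin(2.0 * np.pi * weeks / 52.0)
period = dominant_period(weekly_counts)
features = fourier_features(weekly_counts, n_coeffs=10)
```

Each sequence then becomes a short feature vector that can be fed to any classifier. Real recordings with missing or erroneous values would need cleaning first, since the FFT assumes evenly spaced, complete samples.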
4. RICH / HETEROGENEOUS GRAPHS
4.1. Structural Correlation of Attributes
- Problem: If all my 'friends' on Facebook are 25-year-old males,
what can you say about my gender? My age?
How can we measure the strength of correlation (if any)
between the gender (age, salary, etc.) of two 'friends'?
How can we use that to spot anomalies?
Given a graph in which nodes have categorical attributes, how can we quantify the structural correlation of attributes? Structure-independent attribute correlations are well-studied, and so are homophily ('birds of a feather flock together') and heterophily ('opposites attract') for a single attribute, which have been successfully exploited for various tasks, e.g., node classification. Correlations between different attributes in the context of inter-related data, usually represented as graphs, have largely been neglected in the literature. This project aims to bridge this gap by answering the following questions: (1) How can we quantify inter-attribute correlation for 2 or more attributes? (2) How can we utilize this, e.g., for anomaly detection, link prediction, or ranking the attributes that are most correlated with the structure?
- Data: Facebook, Google+, or any graph data with 2 or more categorical attributes.
- Introductory papers:
- Comments: This project involves a good amount of math, data analysis and visualization; has high potential of leading to a publication.
- Contact person(s): Instructor; Ms. Dhivya Eswaran; Ms. Reihaneh Rabbany
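As a baseline for "how to measure the strength of correlation between the gender of two friends" in 4.1, the sketch below computes the standard assortativity coefficient for a single categorical attribute, r = (sum_i e_ii - sum_i a_i b_i) / (1 - sum_i a_i b_i), where e is the normalized mixing matrix over edge endpoints and a, b are its row and column sums. This is a well-known single-attribute measure, not the inter-attribute measure the project asks for; the function name and the toy graphs are assumptions for illustration.

```python
import numpy as np


def assortativity(edges, attr):
    """Assortativity of a categorical node attribute on an undirected graph.

    edges: iterable of (u, v) pairs; attr: dict mapping node -> category.
    Returns a value in [-1, 1]; undefined if all nodes share one category.
    """
    categories = sorted(set(attr.values()))
    pos = {c: n for n, c in enumerate(categories)}
    e = np.zeros((len(categories), len(categories)))
    for u, v in edges:
        # Undirected edge: count both orientations so that e is symmetric.
        e[pos[attr[u]], pos[attr[v]]] += 1
        e[pos[attr[v]], pos[attr[u]]] += 1
    e /= e.sum()
    a = e.sum(axis=1)
    b = e.sum(axis=0)
    expected = float(a @ b)  # agreement expected if endpoints were independent
    return (float(np.trace(e)) - expected) / (1.0 - expected)


gender = {1: "M", 2: "M", 3: "F", 4: "F"}
r_homophily = assortativity([(1, 2), (3, 4)], gender)    # same-gender edges only
r_heterophily = assortativity([(1, 3), (2, 4)], gender)  # cross-gender edges only
```

Values near +1 indicate homophily and values near -1 indicate heterophily. Extending the mixing matrix so that its rows index one attribute and its columns another (for example, gender against age group) is one possible first step toward question (1) above.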
4.2. Attribute and/or Link Prediction in Heterogeneous Networks
- Problem: Given an attributed network, the intuition is that one can predict characteristics of an entity, based on the known characteristics of its neighbors. In more detail, we can use the observed structural correlation between the attributes characterizing the nodes of a graph to guess the unknown values; e.g., if we know that females often befriend males, we can guess that a friend of a female is likely male. In the case of a general network, can we infer attributes of nodes that are not yet observed in the data, given how nodes interact with each other in the network and also their observed/known characteristics? Alternatively, can we predict the links that are not yet known to be part of the current network?
- Data: Facebook, Google+, or any other attributed graph data
- Introductory papers:
- R. Rabbany, D. Eswaran, A. Dubrawski, C. Faloutsos. Beyond Assortativity: Proclivity Index for Attributed Networks (ProNe). PAKDD 2017. Jeju, South Korea.
- Kipf, Thomas N., and Max Welling. Variational Graph Auto-Encoders. arXiv preprint arXiv:1611.07308 (2016)
- Neil Zhenqiang Gong, Ameet Talwalkar, Lester Mackey, Ling Huang, Eui Chul Richard Shin, Emil Stefanov, Elaine Runting Shi, and Dawn Song. Joint link prediction and attribute inference using a social-attribute network. ACM Transactions on Intelligent Systems and Technology (TIST), 5(2):27, 2014.
- P. Wang, B. Xu, Y. Wu, and X. Zhou. Link prediction in social networks: the state-of-the-art. Science China Information Sciences, 58(1):1-38, 2015. (general survey)
- Comments: This is closely related to the previous project, but here the focus is on the prediction task, not measurement and modeling. Any prediction method which uses both the attributes of nodes and their connections in a sensible way would suffice. High potential of resulting in a paper.
- Contact person(s): Instructor; Ms. Dhivya Eswaran; Ms. Reihaneh Rabbany
4.3. Clustering in Attributed Networks
- Problem: Given an attributed network, how can we find groups of nodes that are both internally well connected and have homogeneous attributes? What is a good objective function for this problem? How can we develop a scalable algorithm to detect such groupings (possibly non-disjoint)? How could such an algorithm be evaluated? Could the evaluation and/or definition of the problem depend on an inference task that builds on top of this model? There is a large body of work on clustering in networks, a.k.a. community detection. However, few works focus on attributed networks and overlapping communities, and there is much room for research.
- Data: Facebook, Google+, or any other attributed graph data
- Introductory papers:
- Bryan Perozzi, Leman Akoglu, Patricia Iglesias Sanchez, Emmanuel Muller. Focused clustering and outlier detection in large attributed graphs. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.
- Flavia Moser, Recep Colak, Arash Rafiey, and Martin Ester. Mining cohesive patterns from graphs with feature vectors. In SDM, volume 9, pages 593-604, 2009.
- Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Graph clustering based on structural/attribute similarities. Proceedings of the Very Large Data Bases Endowment, 2(1):718-729, 2009.
- Rabbany, Reihaneh, and Osmar R. Zaiane. Evaluation of community mining algorithms in the presence of attributes. Trends and Applications in Knowledge Discovery and Data Mining. Springer International Publishing, 2015. 152-163.
- Bothorel, C., Cruz, J. D., Magnani, M., & Micenkova, B. (2015). Clustering attributed graphs: models, measures and methods. Network Science, 3(03), 408-444.
- Comments: There is a large body of related work, with diverse techniques; potential of resulting in a paper.
- Contact person(s): Instructor; Ms. Reihaneh Rabbany
DATASETS
Unless explicitly mentioned, the datasets are either
'public' or 'owned' by the instructor; for the rest, we need to
discuss 'Non-disclosure agreements' (NDAs).
Time sequences
- Time series
repository at UCR.
- KURSK
dataset of multiple time sequences: time series from
seismological sensors by the explosion site of the 'Kursk'
submarine.
- Truck traffic data, from our Civil Engineering
Department. Number of trucks, weight etc. per day per highway lane.
Find patterns, outliers; do data cleansing.
- River-level / hydrology data: multiple,
correlated time series. Do data cleansing; find correlations
between these series. Excellent project for people that like
canoeing!
- Sunspots: number of sunspots per unit time. Some
data are here.
Sunspots seem to have an 11-year periodicity, with high
spikes.
- Time sequences from the
Santa Fe Institute forecasting competition (financial data,
laser-beam oscillation data, patients' apnea data etc.)
- Disk access
traces, from HP Labs (we have local copies at CMU).
For each disk access, we have the timestamp, the block-id, and the
type ('read'/'write'). Here is a
snippet of the data, aggregated per 30'.
- Network traffic data from datapository.net at CMU
- Motion-capture data from CMU mocap.cmu.edu
Spatial data
- Astrophysics data - thousands of galaxies, with
coordinates, red-shift, spectra, photographs.
Small snippet of the data. More data are in the 'skyserver' web
site, where you can ask SQL queries and
get data in html or csv format
- Synthetic astrophysics
data: 1K of (x,y,z, weight) tuples, from Prof. Rupert Croft
(CMU). The full dataset is 200Mb compressed - contact
instructor.
- Road segments: several datasets with line segments
(roads of U.S. counties, Montgomery MD, Long Beach CA, x-y
coordinates of stars in the sky from NASA, etc).
Snippet of data (roads from California, from TIGER).
Graph data - need NDA
- YahooWeb crawl
(120Gb, 1B nodes, 6B edges). Needs mild NDA
- Web-log and click-stream data (NDA: needed).
- call-graphs
Snapshots of anonymized (and anonymous) who-calls-whom graphs
(NDA)
Graph Data - public
Miscellaneous:
SOFTWARE
Notes for the software: Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.
- Readily available:
- ACCESS METHODS
- DR-tree: R-tree code; searches for range and nearest-neighbor
queries. In C.
- kd-tree code
- OMNI
trees - a faster version of metric trees.
- B-tree code, for text (should be easily changed to handle
numbers, too). In C.
- SVD AND TENSORS:
- FRACTALS
- GRAPHS
- the PEGASUS
package for graph mining on hadoop.
- the NetMine network topology analysis package
- GMine: interactive graph visualization package and graph manipulation
library (by Junio (Jose Fernandez Rodrigues Junior) and Jure
Leskovec)
- the 'crossAssociation' package for graph partitioning.
- Outside CMU:
- GiST package from
Hellerstein at UC Berkeley: A general spatial access method, which
is easy to customize. It is already customized to yield
R-trees.
- hadoop, PIG and hbase
- pajek, jung, graphviz, guess, cytoscape, for (small) graph
visualization
- METIS, for graph
partitioning
BIBLIOGRAPHICAL RESOURCES:
Last modified Jan. 24, 2017, by Christos Faloutsos.