project list

Carnegie Mellon University
15-826 Multimedia Databases and Data Mining
Spring 2017 - C. Faloutsos

List of suggested non-default projects, for PhD students

PRELIMINARIES

The projects are grouped according to their general theme. We also list the data and software available, to leverage your effort. More links and resources may be added in the future. Reminders:

URL for this very page (internal to CMU - please treat it 'confidentially'): www.cs.cmu.edu/~christos/courses/826.S17/CMU-ONLY/projlist.html
Please form groups of 2 people.
Please check the 'blackboard' system, where we will create one thread for each of the projects below. Please indicate your interest, by posting in the appropriate thread(s), so that you can find partners.

1. GRAPH / TENSOR MINING

1.1. Spam Detection for Review Data

Problem: Review data provides valuable information about products and services. Review data is ubiquitous on websites such as Amazon, Yelp, or Tripadvisor, and customers frequently use it to choose among competing products or services. Since reviews strongly affect the buying behaviour of customers, spammers try to mislead users by writing fake reviews. The goal of this project is to develop methods to detect users showing spamming behaviour. We want to start with feature-based detection of spammers: What are the characteristics of a spammer? Which features can be used to discriminate between spammers and non-spammers? Are these features useful for all users or only for a subset of users? Based on this feature representation, automatic methods to classify/rank the users according to their spamming behaviour should be developed, exploiting, e.g., the principles of subspace clustering/co-clustering or low-rank matrix factorization.
Data: The participants can test their methods on multiple review datasets such as Amazon (6M reviews) and Yelp (300K reviews).
Introductory material:
- Paper on review spam: Arjun Mukherjee, Abhinav Kumar, Bing Liu, Junhui Wang, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. Spotting Opinion Spammers using Behavioral Footprints. SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2013), August 11-14 2013 in Chicago, USA.
- Overview of subspace clustering techniques: Hans-Peter Kriegel, Peer Kroeger, Arthur Zimek: Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. TKDD 3(1) (2009)
Contact Person: Instructor; Mr. Neil Shah
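
As a possible starting point for the feature-based approach, here is a minimal sketch of per-user behavioral features. The tuple layout (user, product, rating, day) and the three features (review count, mean deviation from the product's average rating, maximum reviews in one day) are assumptions for illustration; they are not taken from the datasets or the papers above.

```python
from collections import defaultdict

def user_features(reviews):
    """reviews: iterable of (user, product, rating, day) tuples (hypothetical layout)."""
    reviews = list(reviews)
    # Average rating per product, used as the 'consensus' a user may deviate from.
    prod_sum = defaultdict(float)
    prod_cnt = defaultdict(int)
    for _, product, rating, _ in reviews:
        prod_sum[product] += rating
        prod_cnt[product] += 1
    prod_avg = {p: prod_sum[p] / prod_cnt[p] for p in prod_sum}

    count = defaultdict(int)
    deviation = defaultdict(float)
    per_day = defaultdict(lambda: defaultdict(int))
    for user, product, rating, day in reviews:
        count[user] += 1
        deviation[user] += abs(rating - prod_avg[product])
        per_day[user][day] += 1

    return {
        u: {
            "n_reviews": count[u],
            "mean_rating_deviation": deviation[u] / count[u],
            "max_reviews_per_day": max(per_day[u].values()),
        }
        for u in count
    }
```

The resulting user-by-feature table could then be fed to a classifier, a ranking method, or a co-clustering routine.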

1.2. Weighted graphs over time

Problem: How can we model weighted graphs (for example, with network packets flowing between nodes) for future prediction? Is there any pattern with respect to the weights? What kind of patterns would we expect? How are the weights distributed on the incident edges of a given node? For a given edge, do weight arrivals show any interesting behavior besides being bursty? How can we do a "microscopic" analysis, so as to model a given weighted graph over time?
Data: call-graph (needs NDA), Campaign donations
Introductory papers:
- Mary McGlohon, Leman Akoglu, and Christos Faloutsos. Weighted graphs and disconnected components: Patterns and a model. In ACM SIG-KDD, Las Vegas, Nev., USA, August 2008.
- Leman Akoglu, Mary McGlohon, and Christos Faloutsos. RTM: Laws and a recursive generator for weighted time-evolving graphs. In ICDM: International Conference on Data Mining, Pisa, Italy, December 2008.
- Jure Leskovec, Lars Backstrom, Ravi Kumar, Andrew Tomkins. Microscopic Evolution of Social Networks. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM KDD), 2008.
- For a good survey on link prediction, see Liben-Nowell and Kleinberg 2003
- Phonecall patterns - duration: In [PKDD2010] the duration seemed to be log-logistic. What can we say about correlations: if I make a 30-minute phone call now, what can you say about the duration of my next one?
Comments: There are several generators in the literature, but the point is to model the weights.
Contact Person: instructor; Ms. Dhivya Eswaran and Mr. Hemank Lamba.
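
One simple way to start looking for weight patterns is to compare, per node, the number of outgoing edges with the total outgoing weight, and to fit a line in log-log scale. The sketch below assumes an edge list of (src, dst, weight) tuples with positive weights; the function names are hypothetical, and a least-squares fit in log-log scale is only a crude first check, not a rigorous power-law test.

```python
import math
from collections import defaultdict

def degree_and_weight(edges):
    """edges: iterable of (src, dst, weight); returns {node: (out_degree, total_out_weight)}."""
    deg = defaultdict(int)
    wgt = defaultdict(float)
    for src, _dst, w in edges:
        deg[src] += 1
        wgt[src] += w
    return {n: (deg[n], wgt[n]) for n in deg}

def loglog_slope(points):
    """Least-squares slope of log(y) against log(x); all x and y must be positive."""
    lx = [math.log(x) for x, _ in points]
    ly = [math.log(y) for _, y in points]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx
```

A slope near 1 would mean that total weight grows linearly with degree; a slope above 1 would mean that nodes with more edges also carry more weight per edge. Whether any such relation holds in a given dataset is for the project to check.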

1.3. Tensor decomposition using RDBMS

Problem: Can SQL be used to manipulate temporally evolving graphs? We are particularly interested in applying SQL to the tensor decomposition problem: given a 3-way tensor (for instance, indicating whether person i contacted person j on day k), we want to find heavy blocks in the tensor. Using the previous example, we are looking for a set of people that called a set of other people on a set of days (the output would be a set of these 3 vectors). There are many algorithms that can be applied to solve this problem, but can any of them be implemented in SQL (and thus be easily parallelizable)? It turns out that for 3 modes, we can do that (see the contact person, Mr. Shin, below).

Can we extend it to N-modes (N>3)? There are subtle optimizations that we need to consider.
Can we extend it to other types of decompositions, like non-negative tensor decomposition, or boolean one?

Data: Any temporal graph will do; we have phone network, computer communication network, and email network data available. We also have 4-mode data (intrusion detection: (source-IP-address, destination-IP, destination-port, timestamp)).
Introductory material:
- The Pegasus paper with GIM-V is a good starting point to understand how common matrix operations can be applied in SQL.
- Navasca's presentation is a simple introduction to CP decomposition and the ALS method.
- Tamara Kolda and Brett Bader's survey is a more detailed alternative to understand all the notation and the most common algorithms.
- Boolean tensor decomposition
- non-negative tensor decomposition
Comments: This project combines both implementation as well as mathematical problems and can definitely lead to a publication.
Contact Persons: Instructor; Mr. Kijung Shin.
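
To make the idea of "tensor operations in SQL" concrete, here is a minimal sketch of a rank-1 alternating update on a 3-mode tensor stored as a table of nonzeros. It is not the method referred to above; the table names (X, A, B, C), the function name, and the use of SQLite through Python's sqlite3 module are all assumptions for illustration. Each update is a join plus GROUP BY; only the normalization is done outside SQL.

```python
import sqlite3

def rank1_in_sql(entries, n_iter=20):
    """entries: list of (i, j, k, value) nonzeros of a 3-mode tensor with integer indices."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE X (i INTEGER, j INTEGER, k INTEGER, v REAL)")
    con.executemany("INSERT INTO X VALUES (?, ?, ?, ?)", entries)
    # One factor table per mode, initialised to 1.0 on every observed index.
    modes = {"A": "i", "B": "j", "C": "k"}
    for table, col in modes.items():
        con.execute(f"CREATE TABLE {table} (idx INTEGER PRIMARY KEY, val REAL)")
        con.execute(f"INSERT INTO {table} SELECT DISTINCT {col}, 1.0 FROM X")

    def update(target):
        # target[idx] = SUM(X.v * other1.val * other2.val), grouped by target's mode.
        (o1, c1), (o2, c2) = [(t, c) for t, c in modes.items() if t != target]
        rows = con.execute(
            f"SELECT X.{modes[target]}, SUM(X.v * {o1}.val * {o2}.val) "
            f"FROM X JOIN {o1} ON X.{c1} = {o1}.idx "
            f"JOIN {o2} ON X.{c2} = {o2}.idx "
            f"GROUP BY X.{modes[target]}"
        ).fetchall()
        # Normalisation is done in Python to avoid relying on SQL math functions.
        norm = sum(s * s for _, s in rows) ** 0.5 or 1.0
        con.execute(f"DELETE FROM {target}")
        con.executemany(f"INSERT INTO {target} VALUES (?, ?)",
                        [(idx, s / norm) for idx, s in rows])

    for _ in range(n_iter):
        for target in modes:
            update(target)
    return {t: dict(con.execute(f"SELECT idx, val FROM {t}")) for t in modes}
```

Indices with large values in the three returned vectors point to a candidate heavy block. Extending the join to N modes, or adding non-negativity or boolean constraints, is where the open questions above begin.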

1.4. Confidence-based ranking for graph classification

Problem: Recent work has shown that incorporating the notion of confidence in node classification improves classification accuracy. However, many real-world applications (e.g., fraud detection) are precision-critical, and the ranking of nodes becomes important. The problem is this: Given scores of each node ([p_1,...,p_k], c), where p_i is the probability for class i and c is the confidence of this classification, how would you rank the nodes to achieve maximum accuracy at top k (for any k)?
Data: political blogs data (polblogs), DBLP coauthorship network (coauthor), a friendship network (pokec)
Introductory papers:
- Y. Yamaguchi, C. Faloutsos, and H. Kitagawa. Socnl: Bayesian label propagation with confidence. PAKDD 2015. Ho Chi Minh City, Vietnam.
- Y. Yamaguchi, C. Faloutsos, and H. Kitagawa. Camlp: Confidence-aware modulated label propagation. SDM 2016. Miami, USA.
- D. Eswaran, S. Guennemann, C. Faloutsos. The Power of Certainty: A Dirichlet-Multinomial Model for Belief Propagation. SDM 2017. Houston, USA.
Comments: This project is fairly open-ended and a thorough literature survey for ranking mechanisms is recommended. The general problem for k classes is a bit hard; a good starting point would be to try out this ranking scheme for binary classification.
Contact person(s): Instructor; Ms. Dhivya Eswaran
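
For the binary case, one naive baseline (an assumption for illustration, not a scheme from the papers above) is to shrink each probability toward an uninformative prior in proportion to the lack of confidence, and to rank by the shrunk score. The sketch assumes the confidence c has been scaled to [0, 1]; the function names are hypothetical.

```python
def rank_by_shrunk_score(nodes, prior=0.5):
    """nodes: {node: (p, c)} with p = P(positive class) and c = confidence in [0, 1] (assumed)."""
    shrunk = {n: c * p + (1.0 - c) * prior for n, (p, c) in nodes.items()}
    # Highest shrunk score first.
    return sorted(shrunk, key=shrunk.get, reverse=True)

def precision_at_k(ranking, positives, k):
    """Fraction of the top-k ranked nodes that are truly positive."""
    return sum(1 for n in ranking[:k] if n in positives) / k
```

With this baseline, a node with p = 0.9 but c = 0.1 ranks below a node with p = 0.8 and c = 0.9. Comparing precision at k against ranking by p alone gives a first sanity check before trying more principled schemes.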

2. MODELING

2.1 'Brain in a box'

Problem: Can you design a neural network to mimic the energy-activity levels of a real brain when it is performing some tasks? Start with a survey on "recurrent neural networks", and the pointers on "system identification" in the introductory paper below. Design a GUI, so that we can add/delete/modify neurons and see the reactions of the resulting "brain".
Data: From Vagelis.
Introductory Material: the paper below, and its citations
- Evangelos E. Papalexakis, Alona Fyshe, Nicholas Sidiropoulos, Partha Pratim Talukdar, Tom Mitchell, Christos Faloutsos. Good-Enough Brain Model: Challenges, Algorithms and Discoveries in Multi-Subject Experiments. ACM SIGKDD 2014, New York City, USA.
Comments: Hard problem in general, but the GUI should be do-able within a semester.
Contact Persons: Instructor; Ms. Hyun Ah Song.

3. TIME SERIES

3.1 Guess the next flu spike: Co-evolving time series mining

Problem: Given time series of patients (blood pressure over time, etc.) and class labels ('healthy', 'unhealthy'), extract features and do classification. Or, given a set of sequences of, say, BGP updates, find correlations and anomalies (BGP = Border Gateway Protocol, in computer networks). In yet another scenario, consider monitoring a data center (like the Self-* system or the Data Center Observatory, both at CMU/PDL). Another application is monitoring environmental data, to spot, say, global warming, deforestation, etc.; see the web page of Prof. Vipin Kumar.
Data:
- Very interesting dataset: from the tycho project - epidemiology time series, with # of infected people per unit time per US city per disease.
- Other data include the physionet.org collection.
- Check here for environmental data.
Introductory paper(s): For spikes in epidemiology data, check the 'spikeM' model [KDD'12]. For BGP, check [Prakash+, KDD'09] (or here, for a more detailed version). For data center monitoring, check the SPIRIT project and the corresponding publication, OSR06. Also check the lag-correlation paper [Sakurai+, SIGMOD'05] and the DynaMMo method (Kalman filters for missing values) [Li+, KDD'09].
Comments: Start with Fourier and wavelets, for features. For the 'tycho' data, try the 'spikeM' method. Check the 'DynaMMo' and 'PLiF' methods. For the physionet data, one challenge is how to handle the many erroneous recordings (e.g., blood pressure ~ 0). Depending on the composition of the team, the project could focus on any of the above settings (environment only; datacenter only; etc.). There is a lot of code on the web site of Prof. Yasuko Matsubara.
Contact person: instructor; Mr. Bryan Hooi bhooi@andrew
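
As an illustration of Fourier-based features, the sketch below computes the amplitudes of a naive discrete Fourier transform and returns the strongest frequencies, which could serve as a small feature vector per sequence. It is a quadratic-time sketch using only the standard library; the function names are hypothetical, and for real data an FFT routine would be the natural replacement.

```python
import cmath
import math

def dft_amplitudes(x):
    """Amplitudes |X_f| / n for f = 0 .. n//2, after removing the mean of x."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    amps = []
    for f in range(n // 2 + 1):
        coef = sum(v * cmath.exp(-2j * cmath.pi * f * t / n)
                   for t, v in enumerate(centered))
        amps.append(abs(coef) / n)
    return amps

def top_frequencies(x, k=3):
    """The k strongest nonzero frequencies as (frequency index, amplitude) pairs."""
    amps = dft_amplitudes(x)
    order = sorted(range(1, len(amps)), key=lambda f: amps[f], reverse=True)[:k]
    return [(f, amps[f]) for f in order]
```

For a series of n samples, frequency index f corresponds to a period of n / f samples, so a yearly cycle in weekly counts would show up near f = n / 52.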

4. RICH / HETEROGENEOUS GRAPHS

4.1. Structural Correlation of Attributes

Problem: If all my 'friends' on Facebook are 25-year-old males, what can you say about my gender? My age? How can we measure the strength of correlation (if any) between the gender (age, salary, etc.) of two 'friends'? How can we use that to spot anomalies? Given a graph in which nodes have categorical attributes, how can we quantify the structural correlation of attributes? Structure-independent attribute correlations, such as homophily ('birds of a feather flock together') and heterophily ('opposites attract'), are well studied and have been successfully exploited for various tasks, e.g., node classification. Their counterparts in the context of inter-related data, usually represented as graphs, have largely been neglected in the literature. This project aims to bridge this gap by answering the following questions: (1) How can we quantify inter-attribute correlation for 2 or more attributes? (2) How can we utilize this, e.g., for anomaly detection, link prediction, or ranking the attributes that are most correlated with the structure?
Data: Facebook, Google+, or any graph data with 2 or more categorical attributes.
Introductory papers:
- R. Rabbany, D. Eswaran, A. Dubrawski, C. Faloutsos. Beyond Assortativity: Proclivity Index for Attributed Networks (ProNe). PAKDD 2017. Jeju, South Korea.
- Pelechrinis, Konstantinos, and Dong Wei. VA-Index: Quantifying Assortativity Patterns in Networks with Multidimensional Nodal Attributes. PloS one 11.1 (2016): e0146188.
- Newman, Mark EJ. Mixing patterns in networks. Physical Review E 67.2 (2003): 026126.
Comments: This project involves a good amount of math, data analysis and visualization; has high potential of leading to a publication.
Contact person(s): Instructor; Ms. Dhivya Eswaran; Ms. Reihaneh Rabbany
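
As a baseline for a single categorical attribute, the assortativity coefficient from the Newman paper above can be computed from the mixing matrix e, where e_xy is the fraction of edge ends joining attribute value x to value y: r = (sum_x e_xx - sum_x a_x b_x) / (1 - sum_x a_x b_x), with a and b the row and column sums of e. The sketch below assumes an undirected edge list and a node-to-attribute dictionary; the function name is hypothetical.

```python
from collections import defaultdict

def assortativity(edges, attr):
    """edges: list of undirected (u, v) pairs; attr: {node: categorical value}."""
    e = defaultdict(float)
    for u, v in edges:
        # Count both directions so that the mixing matrix is symmetric.
        e[(attr[u], attr[v])] += 1
        e[(attr[v], attr[u])] += 1
    total = sum(e.values())
    a = defaultdict(float)  # row sums of the normalised mixing matrix
    b = defaultdict(float)  # column sums
    trace = 0.0
    for (x, y), cnt in e.items():
        frac = cnt / total
        a[x] += frac
        b[y] += frac
        if x == y:
            trace += frac
    expected = sum(a[x] * b[x] for x in a)
    # Undefined (division by zero) when every node has the same attribute value.
    return (trace - expected) / (1.0 - expected)
```

Values near 1 indicate homophily, values near 0 indicate no structural correlation, and negative values indicate heterophily. The project questions start where this stops: two or more attributes at once.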

4.2. Attribute and/or Link Prediction in Heterogeneous Networks

Problem: Given an attributed network, the intuition is that one can predict characteristics of an entity based on the known characteristics of its neighbors. In more detail, we can use the observed structural correlation between the attributes characterizing the nodes of a graph to guess the unknown values; e.g., if we know that females often befriend males, we can guess that a friend of a female is likely male. In the case of a general network, can we infer attributes of nodes that are not yet observed in the data, given how nodes interact with each other in the network and their observed/known characteristics? Alternatively, can we predict the links that are not yet known parts of the current network?
Data: Facebook, Google+, or any other attributed graph data
Introductory papers:
- R. Rabbany, D. Eswaran, A. Dubrawski, C. Faloutsos. Beyond Assortativity: Proclivity Index for Attributed Networks (ProNe). PAKDD 2017. Jeju, South Korea.
- Kipf, Thomas N., and Max Welling. Variational Graph Auto-Encoders. arXiv preprint arXiv:1611.07308 (2016)
- Neil Zhenqiang Gong, Ameet Talwalkar, Lester Mackey, Ling Huang, Eui Chul Richard Shin, Emil Stefanov, Elaine Runting Shi, and Dawn Song. Joint link prediction and attribute inference using a social-attribute network. ACM Transactions on Intelligent Systems and Technology (TIST), 5(2):27, 2014.
- P. Wang, B. Xu, Y. Wu, and X. Zhou. Link prediction in social networks: the state-of-the-art. Science China Information Sciences, 58(1):1-38, 2015. (general survey)
Comments: This is closely related to the previous project, but here the focus is on the prediction task, not measurement and modeling. Any prediction that uses both the attributes of nodes and their connections in a sensible way would suffice. High potential of resulting in a paper.
Contact person(s): Instructor; Ms. Dhivya Eswaran; Ms. Reihaneh Rabbany
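
The 'friend of a female is likely male' intuition can be sketched as a simple baseline: estimate, from edges whose two endpoints are both labeled, how often each attribute value neighbors each other value, then let the labeled neighbors of an unlabeled node vote with those frequencies. The function name and data layout are assumptions for illustration; this is a naive baseline, not a method from the papers above.

```python
from collections import Counter, defaultdict

def predict_attributes(edges, known):
    """edges: list of undirected (u, v) pairs; known: {node: attribute value}."""
    # compat[x][y] = number of times a node with value x neighbors a node with value y.
    compat = defaultdict(Counter)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
        if u in known and v in known:
            compat[known[u]][known[v]] += 1
            compat[known[v]][known[u]] += 1

    predictions = {}
    for node, neighbors in adj.items():
        if node in known:
            continue
        score = Counter()
        for n in neighbors:
            if n in known:
                row = compat[known[n]]
                total = sum(row.values())
                # Each labeled neighbor votes with P(value | neighbor's value).
                for value, cnt in row.items():
                    score[value] += cnt / total
        if score:
            predictions[node] = score.most_common(1)[0][0]
    return predictions
```

Unlike a plain majority vote among neighbors, this handles heterophily, since the vote goes to whatever value tends to sit next to the neighbor's value.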

4.3. Clustering in Attributed Networks

Problem: Given an attributed network, how can we find groups of nodes that are both internally well connected and homogeneous in their attributes? What is a good objective function for this problem? How can we develop a scalable algorithm to detect such groupings (possibly non-disjoint)? How could such an algorithm be evaluated? Could the evaluation and/or definition of the problem depend on an inference task that builds on top of this model? There is a large body of work on clustering in networks, a.k.a. community detection. However, few works focus on attributed networks and overlapping communities, and there is much room for research.
Data: Facebook, Google+, or any other attributed graph data
Introductory papers:
- Bryan Perozzi, Leman Akoglu, Patricia Iglesias Sanchez, Emmanuel Muller. Focused clustering and outlier detection in large attributed graphs. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014.
- Flavia Moser, Recep Colak, Arash Rafiey, and Martin Ester. Mining cohesive patterns from graphs with feature vectors. In SDM, volume 9, pages 593-604, 2009.
- Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. Graph clustering based on structural/attribute similarities. Proceedings of the Very Large Data Bases Endowment, 2(1):718-729, 2009.
- Rabbany, Reihaneh, and Osmar R. Zaiane. Evaluation of community mining algorithms in the presence of attributes. Trends and Applications in Knowledge Discovery and Data Mining. Springer International Publishing, 2015. 152-163.
- Bothorel, C., Cruz, J. D., Magnani, M., & Micenkova, B. (2015). Clustering attributed graphs: models, measures and methods. Network Science, 3(03), 408-444.
Comments: There is a large body of related work, with diverse techniques; potential of resulting in a paper.
Contact person(s): Instructor; Ms. Reihaneh Rabbany
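
To make the question about an objective function concrete, one naive candidate (an assumption for illustration, not a proposal from the papers above) is a weighted sum of modularity, which rewards internal connectivity, and attribute purity, which rewards homogeneous attributes. The sketch handles disjoint clusters and a single categorical attribute only; the function names and the weight alpha are hypothetical.

```python
from collections import Counter, defaultdict

def modularity(edges, cluster):
    """Q = sum over clusters of (internal edges / m) - (cluster degree / 2m)^2."""
    m = len(edges)
    inside = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        degree[cluster[u]] += 1
        degree[cluster[v]] += 1
        if cluster[u] == cluster[v]:
            inside[cluster[u]] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def purity(cluster, attr):
    """Fraction of nodes that carry the majority attribute value of their cluster."""
    by_cluster = defaultdict(Counter)
    for node, c in cluster.items():
        by_cluster[c][attr[node]] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(cluster)

def objective(edges, cluster, attr, alpha=0.5):
    """Weighted combination; alpha trades structure against attributes."""
    return alpha * modularity(edges, cluster) + (1 - alpha) * purity(cluster, attr)
```

How to set alpha, how to handle overlapping groups, and whether a downstream inference task should define the objective instead are exactly the open questions listed in the problem statement.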

DATASETS

Unless explicitly mentioned, the datasets are either 'public' or 'owned' by the instructor; for the rest, we need to discuss 'Non-disclosure agreements' (NDAs).

Time sequences

Time series repository at UCR.
KURSK dataset of multiple time sequences: time series from seismological sensors near the explosion site of the 'Kursk' submarine.
Truck traffic data, from our Civil Engineering Department. Number of trucks, weight, etc., per day per highway lane. Find patterns, outliers; do data cleansing.
River-level / hydrology data: multiple, correlated time series. Do data cleansing; find correlations between these series. Excellent project for people that like canoeing!
Sunspots: number of sunspots per unit time. Some data are here. Sunspots seem to have an 11-year periodicity, with high spikes.
Time sequences from the Santa Fe Institute forecasting competition (financial data, laser-beam oscillation data, patients' apnea data, etc.)
Disk access traces, from HP Labs (we have local copies at CMU). For each disk access, we have the timestamp, the block-id, and the type ('read'/'write'). Here is a snippet of the data, aggregated per 30'.
Network traffic data from datapository.net at CMU
Motion-capture data from CMU mocap.cmu.edu

Spatial data

Astrophysics data - thousands of galaxies, with coordinates, red-shift, spectra, photographs. Small snippet of the data. More data are in the 'skyserver' web site, where you can ask SQL queries and get data in html or csv format
Synthetic astrophysics data: 1K of (x,y,z, weight) tuples, from Prof. Rupert Croft (CMU). The full dataset is 200Mb compressed - contact instructor.
Road segments: several datasets with line segments (roads of U.S. counties, Montgomery MD, Long Beach CA, x-y coordinates of stars in the sky from NASA, etc). Snippet of data (roads from California, from TIGER).

Graph data - need NDA

YahooWeb crawl (120Gb, 1B nodes, 6B edges). Needs mild NDA
Web-log and click-stream data (NDA: needed).
Call-graphs: snapshots of anonymized (and anonymous) who-calls-whom graphs (NDA)

Graph Data - public

Enron email dataset (400 MB compressed)
Large collection of networks, from Stanford
Movie-actor data from imdb.com (we have a cleaned-up snapshot of it)
DBLP author-paper-conference data from the DBLP site of Mike Ley (records in XML, and their DTD). For 'ego-surfing', try this java app or the java applet at U. Alberta.
Graph datasets at U.Mass (Amherst), by Prof. Dave Jensen.
More graph datasets from Mark Newman (U. Michigan) - including popular test-beds like the Zachary's karate club social network etc.
patent information, from googlebooks (mirroring the U.S. Patent Office). Contact instructor for a who-cites-whom file.

Miscellaneous:

Several collections of training data from the UC-Irvine repository (check the larger ones) and from KDD-nuggets for machine learning algorithms.
Demographic data from the U.S. Bureau of Census

SOFTWARE

Notes for the software: Before you modify any code, please contact the instructor - ideally, we would like to use these packages as black boxes.

Readily available:
- ACCESS METHODS
  - DR-tree : R-tree code; searches for range and nearest-neighbor queries. In C.
  - kd-tree code
  - OMNI trees - a faster version of metric trees.
  - B-tree code, for text (should be easily changed to handle numbers, too). In C.
- SVD AND TENSORS:
  - Code for SVD in 'mathematica'.
  - Code for SPIRIT (incremental SVD on streams)
  - Tensor toolkit from Tamara Kolda
- FRACTALS
  - Code for computing the fractal dimension (simplified version in Perl; more elaborate, in Perl and C, by Leejay Wu)
  - Barnsley's algorithm for Iterated Function Systems in 'C'.
- GRAPHS
  - the PEGASUS package for graph mining on hadoop.
  - the NetMine network topology analysis package
  - GMine: interactive graph visualization package and graph manipulation library (by Junio (Jose Fernandez Rodrigues Junior) and Jure Leskovec)
  - the 'crossAssociation' package for graph partitioning.
Outside CMU:
- GiST package from Hellerstein at UC Berkeley: A general spatial access method, which is easy to customize. It is already customized to yield R-trees.
- hadoop, PIG and hbase
- pajek, jung, graphviz, guess, cytoscape, for (small) graph visualization
- METIS, for graph partitioning

BIBLIOGRAPHICAL RESOURCES:

Last modified Jan. 24, 2017, by Christos Faloutsos.
