15-826 Multimedia Databases and Data Mining

Fall 2012 - C. Faloutsos

- URL for this very page (internal to CMU - please treat it 'confidentially'): www.cs.cmu.edu/~christos/courses/826.F12/CMU-ONLY/projlist.html
- The default project is the 'UCR insect classification contest' - strongly recommended for the majority of the students.

- You may propose projects **outside** this list, as long as they have to do with mining and indexing **large** datasets. In that case, contact the instructor as early as possible.
- A **[P]** in the project title signifies that the project is related to the PhD dissertation of the contact person.
- Please form groups of 3-4 people.
- Please check the 'blackboard' system, where we will create one thread for each of the projects below. Please indicate your interest, by posting in the appropriate thread(s), so that you can find partners.

The rest of the projects are more open-ended, and they are more suitable for people who want to do research in data mining.

- Problem: See the project web site. The default is to do classification, but you are welcome to do visualization, feature extraction, clustering, etc
- Data: from the project web site. 500 sound clips, each with a class label.
- Introductory papers: Start from the award-winning paper of [Rakthanmanon+, KDD'12]; check the SIGMOD'07 tutorial by Keogh, or the SIGMOD'04 tutorial by the instructor.

- Comments: very well defined project - extremely suitable for the majority of people in the class.
- Contact person(s): instructor

- Problem: Do we need an additional language to do graph manipulations? Show that SQL is enough to answer all the questions we want. Given a graph of (source, destination) pairs on disk, write the SQL queries to answer numerous questions of interest, like 'which are the most important nodes', 'find the radius of each node', etc.

- Data: Any graph dataset - the emphasis of this project is on implementation.

- Introductory papers: The PEGASUS paper with GIM-V; the follow-up paper of GBASE; the rest of the papers on the pegasus project web site

- Comments: May lead to a publication. Degree distribution and pagerank have been implemented; radius and eigenvalues are still missing.

- Contact person: instructor.
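To give a feel for the SQL-only approach, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `edges`, the columns `src` and `dst`, and the toy edge list are assumptions made for illustration; they are not taken from the PEGASUS or GBASE code. It computes the out-degree distribution and, as a first building block toward the radius, the number of nodes reachable within two hops.

```python
import sqlite3

# Hypothetical toy graph; table and column names (edges, src, dst) are assumptions.
EDGES = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
conn.executemany("INSERT INTO edges VALUES (?, ?)", EDGES)

# Out-degree distribution: for each out-degree, the number of nodes that have it.
# Nodes with no outgoing edge do not appear in this simple version.
DEGREE_DIST_SQL = """
SELECT deg, COUNT(*) AS num_nodes
FROM (SELECT src, COUNT(*) AS deg FROM edges GROUP BY src) AS per_node
GROUP BY deg
ORDER BY deg
"""

# Nodes reachable within two hops, using one self-join.
# A node on a short cycle counts itself as reachable.
REACH2_SQL = """
SELECT src, COUNT(*) AS reach
FROM (
    SELECT src, dst FROM edges
    UNION
    SELECT e1.src, e2.dst FROM edges AS e1 JOIN edges AS e2 ON e1.dst = e2.src
) AS within_two
GROUP BY src
ORDER BY src
"""

degree_dist = conn.execute(DEGREE_DIST_SQL).fetchall()
reach2 = conn.execute(REACH2_SQL).fetchall()
print(degree_dist)
print(reach2)
```

Repeating the self-join (or iterating until the reach counts stop growing) is one possible route to the radius; whether that scales to disk-resident graphs is exactly what the project would have to measure.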

- Problem: How can we automatically find anomalies (e.g. spikes) in datasets? And more importantly, how can we do attribution? For instance, if there are too many nodes of degree 256, can we say something more about them? That is, if there is a spike in the count-vs-degree plot (assume a power-law-like distribution), what can we say about the nodes that are causing the spike in the plot? Do they belong to some specific structure, e.g. star or chain? This project aims at helping make sense of big graphs: we want to find automatically the properties that make some nodes in the graph anomalous, instead of just reporting that there is 'some type' of anomaly.
- Data: 'Stack Overflow' - The data is described briefly in the paper: OPAvion: Mining and Visualization in Large Graphs. Leman Akoglu, Duen Horng Chau, U Kang, Danai Koutra, and Christos Faloutsos. SIGMOD'12, Arizona, USA, May 2012. Any other graph dataset would be suitable, too.

- Introductory material: For automatically detecting spikes in power-law graphs, start from the 'median filtering' method (check wikipedia to get familiar with the denoising algorithm).
- Comments: There are several anomaly detection algorithms in the literature. Here we focus on spotting some types of anomalies in plots, and mainly on explaining them. There is no work on doing anomaly attribution automatically.
- Contact Person: Danai Koutra; instructor.
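As a starting point for the spike-detection step, the sketch below applies a sliding-window median filter to a count-vs-degree sequence and flags the degrees whose count is far above the local median. The window size, the threshold `ratio`, and the sample counts are illustrative assumptions; attribution (explaining which structures cause the spike) is the open part of the project and is not attempted here.

```python
from statistics import median

def median_filter(values, window=5):
    """Sliding-window median; the window is truncated at both ends."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def find_spikes(counts, window=5, ratio=3.0):
    """Return positions whose count exceeds `ratio` times the local median."""
    smooth = median_filter(counts, window)
    return [i for i, (count, local) in enumerate(zip(counts, smooth))
            if local > 0 and count > ratio * local]

# Hypothetical data: position i holds the number of nodes with the i-th degree value.
counts = [1000, 500, 330, 250, 200, 900, 140, 125, 110, 100]
spikes = find_spikes(counts)
print(spikes)
```

On a power-law-like plot the counts span several orders of magnitude, so it may work better to run the filter on log-counts; that choice is left open.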

- Problem: Belief Propagation is a powerful algorithm that has been used successfully in numerous fields, such as computer vision, fraud detection, malware detection, and LDPC codes. In this project, we will focus on a fast approximation of belief propagation (as presented in the first paper given below) which currently handles only two different classes (e.g., guilty/non-guilty people). The goal of the project is to extend the algorithm to multiple classes and derive the more general matrix multiplication equation (instead of the initial iterative equations of the method), so that the belief propagation approximation is more widely applicable (e.g., in the paper-citation graph we have 4 classes, i.e., areas of research: AI, DB, IR, DM).
- Data: DBLP network (annotated, with 4 different classes)
- Introductory material:
  - Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms. Danai Koutra, Tai-You Ke, U Kang, Duen Horng (Polo) Chau, Hsing-Kuo Kenneth Pao, and Christos Faloutsos. ECML PKDD, Athens, Greece, Sep. 2011
  - Understanding belief propagation and its generalizations. J. Yedidia, W. Freeman, and Y. Weiss. Exploring Artificial Intelligence in the New Millennium, 8:236-239, 2003.
  - Polonium: Tera-scale graph mining and inference for malware detection. D. Chau, C. Nachenberg, J. Wilhelm, A. Wright, and C. Faloutsos. SDM, 2011.
- Comments: There are multiple implementations of BP in the literature. Here we start from the formulas given in the second paper, and following similar (but trickier) analysis to the first paper, we will try to derive a generalized matrix formula for BP, which handles more than 2 classes.
- Contact Person: Danai Koutra; instructor.

- Problem: Given a large graph with billions of edges and tens of billions of nodes, and several shared-nothing machines, parallelize the typical graph mining algorithms to be as fast as you can. Our 'pegasus' system already computes the in- and out-degree distributions, the diameter of the graph, and the first several eigenvalues, and runs on top of hadoop. 'hadoop' allows relatively easy parallel execution, implementing the map-reduce system of Google [Dean + Ghemawat, OSDI'04]. 'Hadoop' is open source; we have a small cluster where we can give you an account, or make some other arrangement.
  - The first step is to time several possible architectures: with or without a relational DBMS; with or without replication of the data; using the PIG system; using 'hbase'.
  - Also, what is the best way to store the data (e.g., as <from,to> pairs in a flat file; as an adjacency list, hashed on the 'from' node-id; or as something else)?

- Data: We shall start with synthetic data, using an existing generator [Leskovec+, PAKDD'05]. Then, DBLP, IMDB etc. We could also get data on real CMU IP traffic (will need NDA). Finally, we also have a who-talks-to-whom social network with 270 million nodes and 8 billion edges (60Gb of data)
- Introductory paper(s): The generator above; the Gamma database machine papers [Dewitt+, IEEE TKDE'90]; papers on hash-joins [Kitsuregawa+, vldb'90]; the RMAT paper [Chakrabarti+ SIAM-DM'04]; the connection sub-graph paper [Faloutsos+, KDD'04]. If you plan to use 'hadoop', get the map-reduce paper [Dean + Ghemawat, OSDI'04] and the documentation about the add-ons to hadoop, PIG and hbase.
- Comments: Very high practical interest, with hard problems on both the algorithmic and the systems side. There is a lot of room, even for 4 or more people.
- Contact person: instructor.
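To make the map-reduce formulation and the storage question concrete, here is a small single-process sketch. It does not use hadoop; `run_mapreduce` is a hypothetical stand-in that only imitates the map, group-by-key, and reduce phases, and `partition_adjacency` illustrates one of the storage options mentioned above (adjacency lists hashed on the 'from' node-id). All function names and the toy edge list are assumptions for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for one map-reduce pass: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

def emit_source(edge):
    """Map phase of pass 1: emit (from-node, 1) for every edge."""
    src, _dst = edge
    yield src, 1

def emit_degree(node_and_degree):
    """Map phase of pass 2: emit (degree, 1) for every node."""
    _node, degree = node_and_degree
    yield degree, 1

def sum_values(key, values):
    """Reduce phase shared by both passes: sum the emitted ones."""
    return key, sum(values)

def partition_adjacency(edges, num_partitions):
    """Store edges as adjacency lists, hashed on the 'from' node-id."""
    parts = [defaultdict(list) for _ in range(num_partitions)]
    for src, dst in edges:
        parts[src % num_partitions][src].append(dst)
    return [dict(part) for part in parts]

edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]
out_degree = run_mapreduce(edges, emit_source, sum_values)
degree_dist = run_mapreduce(out_degree.items(), emit_degree, sum_values)
partitions = partition_adjacency(edges, num_partitions=2)
print(out_degree, degree_dist, partitions)
```

With the flat <from,to> layout every pass scans all edges; with the hashed adjacency layout all out-edges of a node sit in one partition, which may save a shuffle for some algorithms. Timing such trade-offs on the real cluster is the first step of the project.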

- Problem: Non-negative Matrix Factorization, and Matrix Factorizations in general, have proved useful for many data mining tasks such as matrix completion, concept discovery, and latent semantic indexing. How much can present state of the art algorithms scale? Is there a best choice among the existing algorithms (e.g. "Multiplicative Updates" or (Stochastic) Gradient Descent) in terms of parallelizability and, ultimately, scalability? In this project, your task will be to 1) investigate existing algorithms with respect to their scalability potential, 2) implement your choice in MapReduce/Hadoop and 3) experiment with one or more real world datasets (and possibly a synthetic one) in order to a) describe the findings of the algorithm, and b) demonstrate that your implementation scales.
- Data: IMDB dataset; DBLP dataset; also, come up with a way to generate synthetic data.
- Introductory papers:
  - DD Lee and HS Seung. Learning the parts of objects by non-negative matrix factorization, Nature 1999
  - DD Lee and HS Seung. Algorithms for Non-negative Matrix Factorization, Advances in Neural Information Processing Systems, 2001
  - Rainer Gemulla, Peter J Haas, Erik Nijkamp, and Yannis Sismanis. Large Scale Matrix Factorization with Distributed Stochastic Gradient Descent, ACM KDD 2011
  - Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, Yi-Min Wang. Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce, ACM WWW 2010
  - Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 1990
- Comments: The paper of Gemulla+ is especially interesting.

- Contact Person: Vagelis Papalexakis, instructor.
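For orientation, here is a single-machine sketch of the "Multiplicative Updates" rule of Lee and Seung for the squared-error objective, written in plain Python. It is meant only as a reference point before thinking about MapReduce: the small matrix `V`, the rank, the iteration count, and the `eps` guard against division by zero are all assumptions for illustration.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def squared_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2
               for v_row, wh_row in zip(v, wh) for x, y in zip(v_row, wh_row))

def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor V (n x m) into non-negative W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h

V = [[5, 3, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5], [1, 0, 0, 4], [0, 1, 5, 4]]
w_start, h_start = nmf(V, rank=2, iters=1)
w_fit, h_fit = nmf(V, rank=2, iters=200)
print(squared_error(V, w_start, h_start), squared_error(V, w_fit, h_fit))
```

Each update is built from matrix products, which is what makes the method a candidate for MapReduce; how the products are partitioned across machines is where the distributed variants in the papers above differ.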

- Problem: Given time series of patients (blood pressure over time, etc.) and class labels ('healthy', 'unhealthy'), extract features and do classification. Or, given a set of sequences of, say, BGP updates, find correlations and anomalies (BGP = Border Gateway Protocol, in computer networks). In yet another scenario, consider monitoring a data center (like the Self-* system or the Data Center Observatory, both at CMU/PDL). Another application is monitoring environmental data, to spot, say, global warming, deforestation, etc. - see the web page of Prof. Vipin Kumar.
- Data:
  - Very interesting dataset: from the tycho project - epidemiology time series, with # of infected people per unit time per US city per disease. Other data include:
  - From the physionet.org collection
  - For BGP, check the Datapository project.
  - Check here for environmental data
- Introductory paper(s): For spikes in epidemiology data, check the 'spikeM' model [KDD'12]. For BGP, check [Prakash+, KDD'09] (or here, for a more detailed version). For data center monitoring, check the SPIRIT project, and the corresponding publication OSR06. Also the lag-correlation paper [Sakurai+ SIGMOD'05], and the DynaMMo method (Kalman filters for missing values [Li+ KDD'09]).
- Comments: Start with Fourier and wavelets, for features. For the 'tycho' data, try the 'spikeM' method. Check the 'DynaMMo' and 'PLiF' methods. For the physionet data, one challenge is how to handle the several wrong recordings (e.g., blood pressure ~ 0). Depending on the composition of the team, the project could focus on any of the above settings (environment only; datacenter only; etc).
- Contact person: instructor.
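The suggestion to start with Fourier features can be sketched as follows: take the magnitudes of the first few DFT coefficients of each (mean-centered) series and use them as a fixed-length feature vector for classification. The function name `dft_features`, the number of coefficients, and the synthetic sine series are assumptions for illustration; a real pipeline would first have to deal with missing or clearly wrong readings.

```python
import cmath
import math

def dft_features(series, num_coeffs):
    """Magnitudes of DFT coefficients 1..num_coeffs of the mean-centered series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    feats = []
    for freq in range(1, num_coeffs + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * freq * t / n)
                    for t, x in enumerate(centered))
        feats.append(abs(coeff) / n)
    return feats

# Hypothetical series: a sine wave with 3 full cycles over 32 samples.
series = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
features = dft_features(series, num_coeffs=5)
dominant = max(range(len(features)), key=features.__getitem__) + 1
print(features, dominant)
```

This direct sum costs O(n) per coefficient, which is fine for a handful of low frequencies; for full spectra of long series an FFT routine would be the usual choice.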

Unless explicitly mentioned, the datasets are either 'public' or 'owned' by the instructor; for the rest, we need to discuss 'Non-disclosure agreements' (NDAs).

- Time series repository at UCR.
- **KURSK dataset** of multiple time sequences: time series from seismological sensors by the explosion site of the 'Kursk' submarine.
- **Truck traffic data**, from our Civil Engineering Department. Number of trucks, weight etc. per day per highway-lane. Find patterns, outliers; do data cleansing.
- **River-level** / hydrology data: multiple, correlated time series. Do data cleansing; find correlations between these series. Excellent project for people that like canoeing!
- **Sunspots**: number of sunspots per unit time. Some data are here. Sunspots seem to have an 11-year periodicity, with high spikes.
- **Time sequences** from the Santa Fe Institute forecasting competition (financial data, laser-beam oscillation data, patients' apnea data etc.)
- **Disk access traces**, from HP Labs (we have local copies at CMU). For each disk access, we have the timestamp, the block-id, and the type ('read'/'write'). Here is a snippet of the data, aggregated per 30'.
- Network traffic data from datapository.net at CMU
- Motion-capture data from CMU mocap.cmu.edu

- **Astrophysics data** - thousands of galaxies, with coordinates, red-shift, spectra, photographs. Small snippet of the data. More data are in the 'skyserver' web site, where you can ask SQL queries and get data in html or csv format.
- Synthetic astrophysics data: 1K of (x,y,z, weight) tuples, from Prof. Rupert Croft (CMU). The full dataset is 200Mb compressed - contact instructor.
- **Road segments**: several datasets with line segments (roads of U.S. counties, Montgomery MD, Long Beach CA, x-y coordinates of stars in the sky from NASA, etc). Snippet of data (roads from California, from TIGER).

- **YahooWeb** crawl (120Gb, 1B nodes, 6B edges). Needs mild NDA.
- **Web-log** and click-stream data (NDA needed).
- **Call-graphs**: snapshots of anonymized (and anonymous) who-calls-whom graphs (NDA).
- Enron email dataset (400 MB compressed)
- Large collection of networks, from Stanford
- Movie-actor data from imdb.com (we have a cleaned-up snapshot of it)
- DBLP author-paper-conference data from the DBLP site of Mike Ley (records in XML, and their DTD). For 'ego-surfing', try this java app or the java applet at U. Alberta.
- Graph datasets at U.Mass (Amherst), by Prof. Dave Jensen.
- More graph datasets from Mark Newman (U. Michigan) - including popular test-beds like the Zachary's karate club social network etc.
- patent information, from googlebooks (mirroring the U.S. Patent Office). Contact instructor for a who-cites-whom file.

- Several collections of training data from the UC-Irvine repository (check the larger ones) and from KDD-nuggets for machine learning algorithms.
- **Demographic** data from the U.S. Bureau of Census

**Notes for the software:** Before you modify any code,
please contact the instructor - ideally, we would like to use these
packages as black boxes.

- Readily available:
  - ACCESS METHODS
  - SVD AND TENSORS:
    - Code for SVD in 'mathematica'.
    - Code for SPIRIT (incremental SVD on streams)
    - Tensor toolkit from Tamara Kolda
  - FRACTALS
    - Code for computing the fractal dimension (simplified version in Perl; more elaborate, in Perl and C, by Leejay Wu)
    - Barnsley's algorithm for Iterated Function Systems in 'C'.
  - GRAPHS
    - the PEGASUS package for graph mining on hadoop.
    - the NetMine network topology analysis package
    - GMine: interactive graph visualization package and graph manipulation library (by Junio (Jose Fernandez Rodrigues Junior) and Jure Leskovec)
    - the 'crossAssociation' package for graph partitioning.

- Outside CMU: