# CSD

Recent progresses on a number of combinatorial and numerical problemsvbenefited from combining ideas and techniques from both fields to design faster and more powerful algorithms. A prime example is the field of spectral graph theory, which involves the interplay between combinatorial graph algorithms with numerical linear algebra, and this led to the first nearly linear time solvers for symmetric and diagonally dominant (SDD) linear systems.

In this thesis proposal we focus on a number of combinatorial building blocks of spectral graph theory as well as their applications in solving SDD linear systems. We give new (and often parallel) algorithms for low diameter decompositions, low stretch tree embeddings, graph spanners, and combinatorial sparsifiers. We propose to improve our techniques developed in solving the above problems to data sensitive metrics, in particular the nearest neighbor metric or equivalently the edge squared metric, when vertices are samples from some probability distribution. We

also propose to investigate spectral methods for partitioning probability distributions in Euclidean domains from random samples.

**Thesis Committee:**

Gary Miller (Chair)

Bernhard Haeupler

Daniel Sleator

Noel J. Walkington

Ioannis Koutis (University of Puerto Rico)

Given the ever-growing prevalence of online social services, usefully leveraging massive datasets has become an increasingly important challenge for businesses and end-users alike. Online services capture a wealth of information about user behavior and platform interactions, such as who-follows-whom relationships in social networks and who-rates-what-and-when relationships in e-commerce networks. Since many of these services rely on data-driven algorithms to recommend relevant content to their users, authenticity of user behavior is paramount to success. But given anonymity on the internet, how do we know which users and actions are real and which are fake? My thesis focuses on this problem and introduces new techniques to discern anomalous and fraudulent behavior in online social graphs. Specifically, I propose to work on three thrusts: plain, dynamic and rich graphs.

Firstly, we focus on plain graphs, in which only static connectivity information is known. We detail several proposed algorithms spanning the topics of spectral-based fraud detection in a single graph, blame attribution between graph snapshots, and evaluation of the descriptive power of various graph clustering and partitioning methods in identifying anomalous graph structures.

Next, we broaden our scope to dynamic graphs, in which we are able to leverage connectivity information over time. We describe multiple relevant works which describe how to identify and summarize anomalous temporal graph structures, model interarrival time patterns in user queries to find anomalous search behavior, and identify “group” anomalies comprising of users acting in lockstep.

Lastly, we expand our reach to rich graphs, in which connectivity information is supplemented by other attributes, such as time, rating, counts, etc. To this end, we propose a principled means for ranking abnormal users by leveraging statistical patterns in edge attributes.

The techniques described in this thesis span various disciplines including machine learning, graph theory, information theory and social science and are practically applicable to a number of real-world fraud and anomaly detection scenarios. They are carefully designed and scale to massive datasets, including social networks, telecommunication networks, e-commerce and collaboration networks with millions of nodes and billions of edges.

**Thesis Committee: **

Christos Faloutsos (Chair)

Roni Rosenfeld

Jaime Carbonell

Jiawei Han (University of Illinois, Urbana-Champaign)

Let P be a k-ary predicate. Consider a random instance of the constraint satisfaction problem CSP(P) on n variables with $\Delta n$ constraints, each being P applied to k randomly chosen literals. When $\Delta >>1$, such an instance is unsatisfiable with high probability. The refutation problem is to efficiently find a proof of unsatisfiability. The sum of squares (SOS) hierarchy is a powerful algorithmic tool in the study of CSPs. It is a sequence of semidefinite programming relaxations parameterized by the degree d. The relaxation becomes tighter as d increases, but it takes ${n}^{O\left(d\right)}$ time to solve. In this talk, we will discuss a lower bound on the SOS degree required to refute a random instance of CSP(P): If P supports a t-wise-uniform probability distribution on its satisfying assignments, the SOS algorithm of degree $d=\Theta \phantom{\rule{0.333em}{0ex}}(n/{\Delta}^{2/(t-1)})$ cannot refute. By recent algorithmic results, this bound is essentially tight. This is joint work with Pravesh K. Kothari, Ryuhei Mori, and Ryan O.Donnell.

*This is joint work with Pravesh K. Kothari, Ryuhei Mori, and Ryan O’Donnell.*

—

David Witmer is a PhD student in the Computer Science Department at CMU. His advisors are Anupam Gupta and Ryan O'Donnell.

*Supported in part by Yahoo! Labs and the Simons Foundation*

We show how to perform sparse approximate Gaussian elimination for Laplacian matrices. We present a simple, nearly linear time algorithm that approximates a Laplacian by a matrix with a sparse Choleskyfactorization – the version of Gaussian elimination for positive semi-definite matrices. We compute this factorization by subsampling standard Gaussian elimination.This is the first nearly linear time solver for Laplaciansystems that is based purely on random sampling, and does not use any graph theoretic constructions such as low-stretch trees, sparsifiers, or expanders. The crux of our proof is the use of matrix martingales to analyze the algorithm.

*Joint work with Sushant Sachdeva.*

We consider the fundamental derandomization problem of deterministically finding a satisfying assignment to a CNF formula that has many satisfying assignments. We give a deterministic algorithm which, given an n-variable poly(n)-clause CNF formula F that has $\mid {F}^{-1}\left(1\right)\mid \ge \epsilon {2}^{n}$, runs in time ${n}^{\stackrel{~}{O}(\mathrm{log}\mathrm{log}n{)}^{2}}$ for $\epsilon \ge 1/\backslash polylog\left(n\right)$ and outputs a satisfying assignment of F. Prior to our work the fastest known algorithm for this problem was simply to enumerate over all seeds of a pseudorandom generator for CNFs; using the best known PRGs for CNFs [DETT10], this takes time ${n}^{\stackrel{~}{\Omega}\left(\mathrm{log}n\right)}$ even for constant $\epsilon $.

*Joint work with Rocco Servedio. *

Ordered sets (and maps when data is associated with each key) are one of the most important and useful data types. The set-set functions union, intersection and difference are particularly useful in certain applications. Brown and Tarjan first described an algorithm for these functions, based on 2-3 trees, that meet the optimal O(mlog(n/m + 1)) time bounds in the comparison model (n and m n are the input sizes). Later Adams showed very elegant algorithms for the functions, and others, based on weight-balanced trees. They only require a single function that is specific to the balancing scheme--a function that joins two balanced trees--and hence can be applied to other balancing schemes. Furthermore the algorithms are naturally parallel. However, in the twenty-four years since, no one has shown that the algorithms, sequential or parallel are asymptotically work optimal.

In this paper we show that Adams' algorithms are both work efficient and highly parallel (polylog span) across four different balancing schemes--AVL trees, red-black trees, weight balanced trees and treaps. To do this we use careful, but simple, algorithms for JOIN that maintain certain invariants, and our proof is (mostly) generic across the schemes. To understand how the algorithms perform in practice we have also implemented them (all code except JOIN is generic across the balancing schemes). Interestingly the implementations on all four balancing schemes and three set functions perform similarly in time and speedup (more than 45x on 64 cores). We also compare the performance of our implementation to other existing libraries and algorithms.

—

Yihan Sun is a third-year PhD student in the computer science department advised by Guy Blelloch

In this talk, we'll discuss how to use observations on some vertices of a graph to draw inferences on the remaining vertices.

Given real-valued observations on some vertices, we seek a smooth extension of these observations to all vertices. We propose the absolutely minimal Lipschitz extension, which is the limit of p-Laplacian regularization for large p.

We'll present algorithmic results for computing these extensions efficiently. Our algorithms naturally apply to directed graphs, and run on graphs with millions of edges in a few minutes. These extensions are particularly suited for regularization and outlier removal, which is surprising since outlier removal is NP-hard for other natural extensions. We'll present some experimental results for detecting spam webpages using our algorithms.

Finally, we'll discuss how these extensions are particularly suited for regularization and outlier removal, which is surprising since outlier removal is NP-hard for other natural extensions.

*This is joint work with < Rasmus Kyng, Anup Rao, and Daniel Spielman .*

—

Sushant Sachdeva is a research scientist at Google. He is interested in Algorithms, and its connections to optimization, machine learning, and statistics. His recent research focus has been the design of fast algorithms for graph problems.

He completed his postdoc at Yale with Dan Spielman (2016), his PhD from Princeton (2013) under Sanjeev Arora, and his BTech from IIT Bombay (2008). He is the recipient of Simons Berkeley Research Fellowship (2013), and the IITB President of India Gold Medal (2008).

In Massively Open Online Courses (MOOCs) TA resources are limited. In order to deal with this, MOOCs use peer assessments to grade assignments and then use TA resource to check over some of these assignments. We present an extension to the Stackelberg game model that models the challenge involved in allocating the limited TA resource. We discuss what makes this new model difficult difficult, and then we present an algorithm that computes an approximate solution.

*Presented in Partial Fulfillment of the CSD Speaking Skills Requirement. *

JVuSched is a new end-to-end cluster scheduling system that robustly exploits job runtime predictions. Using runtime knowledge allows it to more effectively pack jobs with diverse time concerns (e.g., deadline vs. latency) and placement preferences on

heterogeneous cluster resources. A natural set of questions, then, ask how accurate runtime knowledge is, and what happens when it is imperfect. JVuSched introduces new techniques to mitigate the effects of real mis-prediction profiles. Experiments with workloads derived from a Google cluster trace show that JVuSched matches the end-to-end performance of a hypothetical oracular predictor with perfect job runtime information, while outperforming state-ofthe-art schedulers afforded the same prediction accuracy. JVuSched reduces SLO miss rate, increases cluster goodput, and

improves or matches latency for best effort jobs.

*Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.*

We study the amortized number of combinatorial changes (edge insertions and removals) needed to update the graph structure of the Voronoi diagram *\VD(S)*(and several variants thereof) of a set *S* of *n* sites in the plane as sites are added to the set. To that effect, we define a general update operation for planar graphs that can be used to model the incremental construction of several variants of Voronoi diagrams as well as the incremental construction of an intersection of halfspaces in 3 Dimensional space. We show that the amortized number of edge insertions and removals needed to add a new site to the Voronoi diagram is *O(\sqrt{n}).* A matching *Omega(\sqrt{n}) *combinatorial lower bound is shown, even in the case where the graph representing the Voronoi diagram is a tree. This contrasts with the *O(\log{n})* upper bound of Aronov et al. (2006) for farthest-point Voronoi diagrams in the special case where the points are inserted in clockwise order along their convex hull.

We then present a semi-dynamic data structure that maintains the Voronoi diagram of a set *S *of* n *sites in convex position. This data structure supports the insertion of a new site *p* (and hence the addition of its Voronoi cell) and finds the asymptotically minimal number *K *of edge insertions and removals needed to obtain the diagram of *S \cup \{p\}* from the diagram of *S*, in time *O(K\,\mathrm{polylog}\ n)* worst case, which is *O(\sqrt{n}\;\mathrm{polylog}\ n)* amortized by the aforementioned combinatorial result.

The most distinctive feature of this data structure is that the graph of the Voronoi diagram is maintained explicitly at all times and can be retrieved and traversed in the natural way; this contrasts with other structures supporting nearest neighbor queries in that they do not maintain the Voronoi diagram after each insertion. This structure supports general search operations on the current Voronoi diagram, which can, for example, be used to perform point location queries in the cells of the current Voronoi diagram in *O(\log n)* time, or to determine whether two given sites are neighbors in the Delaunay triangulation.

*Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.*