# CSD

Ordered sets (and maps when data is associated with each key) are one of the most important and useful data types. The set-set functions union, intersection and difference are particularly useful in certain applications. Brown and Tarjan first described an algorithm for these functions, based on 2-3 trees, that meet the optimal *O(mlog(n/m + 1))* time bounds in the comparison model (n and m n are the input sizes). Later Adams showed very elegant algorithms for the functions, and others, based on weight-balanced trees. They only require a single function that is specific to the balancing scheme--a function that joins two balanced trees--and hence can be applied to other balancing schemes. Furthermore the algorithms are naturally parallel. However, in the twenty-four years since, no one has shown that the algorithms, sequential or parallel are asymptotically work optimal.

In this paper we show that Adams' algorithms are both work efficient and highly parallel (polylog span) across four different balancing schemes--AVL trees, red-black trees, weight balanced trees and* O(mlog(n/m + 1))* treaps. To do this we use careful, but simple, algorithms for JOIN that maintain certain invariants, and our proof is (mostly) generic across the schemes. To understand how the algorithms perform in practice we have also implemented them (all code except JOIN is generic across the balancing schemes). Interestingly the implementations on all four balancing schemes and three set functions perform similarly in time and speedup (more than 45x on 64 cores). We also compare the performance of our implementation to other existing libraries and algorithms.

*Joint work with Guy Blelloch and Daniel Ferizovic *

*Presented in Partial Fulfillment of the CSD Speaking Skills Requirement. *

This thesis research presents content delivery optimization techniques in the eXpressive Internet Architecture (XIA). First, we propose an intradomain CID routing protocol that achieves performance objectives of reducing request latency for ISP’s customers by routing content requests off-path br />to nearby caches. This is beneficial for ISP since it helps them attract more customer through offering content delivery with lower latency. The challenges for the CID routing design are that 1) scalability of the protocol to a large number of CIDs in the domain and 2) resistance to cache churn.

To address the first challenge, we propose 1) scoping the advertisements and 2) advertise the delta of locally cached CIDs. We evaluate the scalability of the protocol by comparing it against traditional link-state and distance vector routing protocol. The metrics we used for comparison are: 1) size and the number of advertisements messages, 2) efficiency of route computations, and 3) space consumptions. For churn resistance, we show that advertising content that would stay in cache longer during the event of high cache churn can reduce the number of failed requests in a domain. We compare this benefit against the performance objective of CID routing and show that advertising popular content can be a good strategy to reduce the number of invalid CID routes during high cache churn.

The second part of the thesis discusses an interdomain cache management scheme. We discuss a cache sharing model that allow regional peering ISPs to collaboratively share their cache space to reduce cost of interdomain data exchange, and the tradeoffs between performance, availability, and bandwidth consumption. Our results show that replicating the top few popular content locally in the domain achieves a good balance between these criteria for coordinating ISPs. Finally, we present a cache sharing algorithm that distributes the request load evenly among coordinating domains without incurring explicit message exchange. We evaluate this approach against the oracle that would require coordination overhead but approximate the optimal solution very closely. The result shows that the caching decision output by our approach differs little from the oracle in terms of the cache hit rate for each domain.

**Thesis Committee:** Srinivasan Seshan, Peter Steenkiste

We believe that it is essential for robots that coexist with humans to be able to interact with their users in a seamless way. This thesis advocates the use of language as a rich and natural interface for human-robot interaction. Previous work on language-based human-robot interaction has extensively focused on enabling robots to understand spatial language in the context of users giving step-by-step directions to robots. We assume a mobile service robot, like our CoBot, is equipped with a map of its environment and is able to autonomously navigate to a desired goal position. This thesis will address the problem of user-to-robot interaction, but is going to assume users provide a high-level specification of the task (e.g., "Take the package to the small-size lab and then bring me tea") rather than step-by-step navigational instructions.

Furthermore the thesis will focus on a novel robot-to-user interaction where the robot is able to adapt to different users, to answer user queries about its state (present, past or future), and to proactively take information-providing actions (i.e., reporting on the outcome of a task after finishing its execution). Summing up, this thesis will contribute a novel language-based bidirectional interaction approach for mobile service robot: from users to the robot and vice-versa. We will evaluate the work in real and extensive real-data constructed simulated environments.

**Thesis Committee: **

Manuela Veloso (Chair)

Jaime Carbonell

Stephanie Rosenthal

Xiao-Ping Chen (University of Science and Technology of China)

Recent progresses on a number of combinatorial and numerical problemsvbenefited from combining ideas and techniques from both fields to design faster and more powerful algorithms. A prime example is the field of spectral graph theory, which involves the interplay between combinatorial graph algorithms with numerical linear algebra, and this led to the first nearly linear time solvers for symmetric and diagonally dominant (SDD) linear systems.

In this thesis proposal we focus on a number of combinatorial building blocks of spectral graph theory as well as their applications in solving SDD linear systems. We give new (and often parallel) algorithms for low diameter decompositions, low stretch tree embeddings, graph spanners, and combinatorial sparsifiers. We propose to improve our techniques developed in solving the above problems to data sensitive metrics, in particular the nearest neighbor metric or equivalently the edge squared metric, when vertices are samples from some probability distribution. We

also propose to investigate spectral methods for partitioning probability distributions in Euclidean domains from random samples.

**Thesis Committee:**

Gary Miller (Chair)

Bernhard Haeupler

Daniel Sleator

Noel J. Walkington

Ioannis Koutis (University of Puerto Rico)

Given the ever-growing prevalence of online social services, usefully leveraging massive datasets has become an increasingly important challenge for businesses and end-users alike. Online services capture a wealth of information about user behavior and platform interactions, such as who-follows-whom relationships in social networks and who-rates-what-and-when relationships in e-commerce networks. Since many of these services rely on data-driven algorithms to recommend relevant content to their users, authenticity of user behavior is paramount to success. But given anonymity on the internet, how do we know which users and actions are real and which are fake? My thesis focuses on this problem and introduces new techniques to discern anomalous and fraudulent behavior in online social graphs. Specifically, I propose to work on three thrusts: plain, dynamic and rich graphs.

Firstly, we focus on plain graphs, in which only static connectivity information is known. We detail several proposed algorithms spanning the topics of spectral-based fraud detection in a single graph, blame attribution between graph snapshots, and evaluation of the descriptive power of various graph clustering and partitioning methods in identifying anomalous graph structures.

Next, we broaden our scope to dynamic graphs, in which we are able to leverage connectivity information over time. We describe multiple relevant works which describe how to identify and summarize anomalous temporal graph structures, model interarrival time patterns in user queries to find anomalous search behavior, and identify “group” anomalies comprising of users acting in lockstep.

Lastly, we expand our reach to rich graphs, in which connectivity information is supplemented by other attributes, such as time, rating, counts, etc. To this end, we propose a principled means for ranking abnormal users by leveraging statistical patterns in edge attributes.

The techniques described in this thesis span various disciplines including machine learning, graph theory, information theory and social science and are practically applicable to a number of real-world fraud and anomaly detection scenarios. They are carefully designed and scale to massive datasets, including social networks, telecommunication networks, e-commerce and collaboration networks with millions of nodes and billions of edges.

**Thesis Committee: **

Christos Faloutsos (Chair)

Roni Rosenfeld

Jaime Carbonell

Jiawei Han (University of Illinois, Urbana-Champaign)

Let P be a k-ary predicate. Consider a random instance of the constraint satisfaction problem CSP(P) on n variables with $\Delta n$ constraints, each being P applied to k randomly chosen literals. When $\Delta >>1$, such an instance is unsatisfiable with high probability. The refutation problem is to efficiently find a proof of unsatisfiability. The sum of squares (SOS) hierarchy is a powerful algorithmic tool in the study of CSPs. It is a sequence of semidefinite programming relaxations parameterized by the degree d. The relaxation becomes tighter as d increases, but it takes ${n}^{O\left(d\right)}$ time to solve. In this talk, we will discuss a lower bound on the SOS degree required to refute a random instance of CSP(P): If P supports a t-wise-uniform probability distribution on its satisfying assignments, the SOS algorithm of degree $d=\Theta \phantom{\rule{0.333em}{0ex}}(n/{\Delta}^{2/(t-1)})$ cannot refute. By recent algorithmic results, this bound is essentially tight. This is joint work with Pravesh K. Kothari, Ryuhei Mori, and Ryan O.Donnell.

*This is joint work with Pravesh K. Kothari, Ryuhei Mori, and Ryan O’Donnell.*

—

David Witmer is a PhD student in the Computer Science Department at CMU. His advisors are Anupam Gupta and Ryan O'Donnell.

*Supported in part by Yahoo! Labs and the Simons Foundation*

We show how to perform sparse approximate Gaussian elimination for Laplacian matrices. We present a simple, nearly linear time algorithm that approximates a Laplacian by a matrix with a sparse Choleskyfactorization – the version of Gaussian elimination for positive semi-definite matrices. We compute this factorization by subsampling standard Gaussian elimination.This is the first nearly linear time solver for Laplaciansystems that is based purely on random sampling, and does not use any graph theoretic constructions such as low-stretch trees, sparsifiers, or expanders. The crux of our proof is the use of matrix martingales to analyze the algorithm.

*Joint work with Sushant Sachdeva.*

We consider the fundamental derandomization problem of deterministically finding a satisfying assignment to a CNF formula that has many satisfying assignments. We give a deterministic algorithm which, given an n-variable poly(n)-clause CNF formula F that has $\mid {F}^{-1}\left(1\right)\mid \ge \epsilon {2}^{n}$, runs in time ${n}^{\stackrel{~}{O}(\mathrm{log}\mathrm{log}n{)}^{2}}$ for $\epsilon \ge 1/\backslash polylog\left(n\right)$ and outputs a satisfying assignment of F. Prior to our work the fastest known algorithm for this problem was simply to enumerate over all seeds of a pseudorandom generator for CNFs; using the best known PRGs for CNFs [DETT10], this takes time ${n}^{\stackrel{~}{\Omega}\left(\mathrm{log}n\right)}$ even for constant $\epsilon $.

*Joint work with Rocco Servedio. *

Ordered sets (and maps when data is associated with each key) are one of the most important and useful data types. The set-set functions union, intersection and difference are particularly useful in certain applications. Brown and Tarjan first described an algorithm for these functions, based on 2-3 trees, that meet the optimal O(mlog(n/m + 1)) time bounds in the comparison model (n and m n are the input sizes). Later Adams showed very elegant algorithms for the functions, and others, based on weight-balanced trees. They only require a single function that is specific to the balancing scheme--a function that joins two balanced trees--and hence can be applied to other balancing schemes. Furthermore the algorithms are naturally parallel. However, in the twenty-four years since, no one has shown that the algorithms, sequential or parallel are asymptotically work optimal.

In this paper we show that Adams' algorithms are both work efficient and highly parallel (polylog span) across four different balancing schemes--AVL trees, red-black trees, weight balanced trees and treaps. To do this we use careful, but simple, algorithms for JOIN that maintain certain invariants, and our proof is (mostly) generic across the schemes. To understand how the algorithms perform in practice we have also implemented them (all code except JOIN is generic across the balancing schemes). Interestingly the implementations on all four balancing schemes and three set functions perform similarly in time and speedup (more than 45x on 64 cores). We also compare the performance of our implementation to other existing libraries and algorithms.

—

Yihan Sun is a third-year PhD student in the computer science department advised by Guy Blelloch

In this talk, we'll discuss how to use observations on some vertices of a graph to draw inferences on the remaining vertices.

Given real-valued observations on some vertices, we seek a smooth extension of these observations to all vertices. We propose the absolutely minimal Lipschitz extension, which is the limit of p-Laplacian regularization for large p.

We'll present algorithmic results for computing these extensions efficiently. Our algorithms naturally apply to directed graphs, and run on graphs with millions of edges in a few minutes. These extensions are particularly suited for regularization and outlier removal, which is surprising since outlier removal is NP-hard for other natural extensions. We'll present some experimental results for detecting spam webpages using our algorithms.

Finally, we'll discuss how these extensions are particularly suited for regularization and outlier removal, which is surprising since outlier removal is NP-hard for other natural extensions.

*This is joint work with < Rasmus Kyng, Anup Rao, and Daniel Spielman .*

—

Sushant Sachdeva is a research scientist at Google. He is interested in Algorithms, and its connections to optimization, machine learning, and statistics. His recent research focus has been the design of fast algorithms for graph problems.

He completed his postdoc at Yale with Dan Spielman (2016), his PhD from Princeton (2013) under Sanjeev Arora, and his BTech from IIT Bombay (2008). He is the recipient of Simons Berkeley Research Fellowship (2013), and the IITB President of India Gold Medal (2008).