For the first time in 25 years, a new non-volatile memory (NVM) category is being created that is two orders of magnitude faster than current durable storage media. The advent of NVM will fundamentally change the dichotomy between volatile memory and durable storage in database systems. These new NVM devices are almost as fast as DRAM, but all writes to them are potentially persistent even after power loss. Existing database systems are unable to take full advantage of this technology because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many of the components of legacy database systems are unnecessary and will degrade the performance of data-intensive applications.

This dissertation explores the implications of NVM for database systems. It presents the design and implementation of Peloton, a new database system tailored specifically for NVM. The dissertation focuses on three aspects of a database system: (1) logging and recovery, (2) storage management, and (3) indexing. Our primary contribution in this dissertation is the design of a new logging and recovery protocol, called write-behind logging, that improves the availability of the system by more than two orders of magnitude compared to the ubiquitous write-ahead logging protocol. Besides improving availability, we found that write-behind logging improves the space utilization of the NVM device and extends its lifetime. Second, we propose a new storage engine architecture that leverages the durability and byte-addressability properties of NVM to avoid unnecessary data duplication. Third, the dissertation presents the design of a latch-free range index tailored for NVM that supports near-instantaneous recovery without requiring special-purpose recovery code.
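
To make the contrast with write-ahead logging concrete, the following is a minimal, hypothetical sketch of the write-behind idea: because NVM writes are durable, modified tuples are flushed to the NVM-resident table heap before commit, and the log records only commit-timestamp metadata rather than before/after data images. All class and method names here are illustrative assumptions, not Peloton's actual API.

```python
# Illustrative sketch of write-behind logging (names are hypothetical).
# Data goes to durable NVM *first*; the log then records only which
# commit timestamps are durable, so no data images are ever logged.

class WriteBehindLog:
    def __init__(self):
        self.nvm_heap = {}   # stands in for the durable NVM table heap
        self.log = []        # tiny log: just committed timestamps
        self.next_ts = 0

    def commit(self, txn_writes):
        ts = self.next_ts
        self.next_ts += 1
        # 1. Write the data itself to durable storage ("write-behind").
        for key, value in txn_writes.items():
            self.nvm_heap[key] = (ts, value)
        # 2. Log only the commit timestamp -- no before/after images.
        self.log.append(ts)
        return ts

    def recover(self):
        # Recovery ignores tuples whose timestamps never reached the log;
        # no redo pass is needed because committed data is already durable.
        committed = set(self.log)
        return {k: v for k, (t, v) in self.nvm_heap.items() if t in committed}
```

Because the log carries no data images, recovery amounts to filtering out uncommitted writes, which is what enables the near-instantaneous restart the abstract describes.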

Thesis Committee:
Andy Pavlo (Chair)
Garth Gibson
Todd Mowry
Samuel Madden (Massachusetts Institute of Technology)
Donald Kossmann (Microsoft Research)

Copy of Thesis Summary

Tree data structures play an important role in almost every area of computer science. Nowadays the explosion of data places higher demands on tree performance, requiring efficient parallel algorithms for processing large-scale data, as well as a full-featured interface for achieving more functionality. Traditional algorithm design on balanced binary trees, based on insertions and deletions, is plagued with challenges such as being hard to parallelize, single-operand oriented, and varying among different balancing schemes. This thesis proposes a parallel algorithmic framework that overcomes these challenges, capturing all balancing criteria in a single function JOIN. The JOIN-based algorithms are then used for supporting sequences, ordered sets, ordered maps and augmented maps (formally defined in this thesis). A wide variety of functions form a hierarchical interface for these four data types, which are all implemented by join-based algorithms as part of a library PAM (Parallel Augmented Maps). This library is parallel, work-efficient, generic across balancing schemes, persistent, safe for concurrency and applicable to a wide range of applications and queries by proper augmentations. The augmented map structure itself is an abstraction befitting many real applications. The PAM interface greatly simplifies the programming of many applications over direct implementations, such as interval trees, range trees, inverted indices, segment trees, and database snapshot isolation. Experiments show that the implementations of the functions in PAM, as well as all the applications using PAM, are practically efficient. Sequentially the code achieves performance that matches or exceeds existing libraries designed specifically for a single application, and the parallel implementation gets speedups ranging from 40 to 90 on 72 cores with 2-way hyperthreading.
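
The JOIN-based approach can be sketched briefly: once JOIN(left, key, right) is available, set operations such as union fall out by recursive split-and-join, independent of the balancing scheme. The sketch below is a simplification in Python (the balancing work that a real JOIN performs is omitted, and the class names are illustrative, not PAM's interface).

```python
# Sketch of join-based set algorithms. In the real framework, join()
# also restores the balance invariant (AVL, red-black, weight-balanced,
# or treap); here it is a plain node constructor for brevity.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def join(left, key, right):
    return Node(key, left, right)  # balancing omitted

def split(t, key):
    """Split tree t by key into (subtree < key, key present?, subtree > key)."""
    if t is None:
        return None, False, None
    if key < t.key:
        l, found, r = split(t.left, key)
        return l, found, join(r, t.key, t.right)
    if key > t.key:
        l, found, r = split(t.right, key)
        return join(t.left, t.key, l), found, r
    return t.left, True, t.right

def union(t1, t2):
    """Union expressed purely via split and join; the two recursive
    calls are independent, which is what makes the algorithm parallel."""
    if t1 is None:
        return t2
    if t2 is None:
        return t1
    l, _, r = split(t1, t2.key)
    return join(union(l, t2.left), t2.key, union(r, t2.right))

def keys(t):
    return [] if t is None else keys(t.left) + [t.key] + keys(t.right)
```

Note that union never mentions rotations or colors: all scheme-specific logic hides inside join, which is the thesis's central observation.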

Thesis Committee:
Guy E. Blelloch (Chair)
Andrew Pavlo
Daniel D.K. Sleator
Michael T. Goodrich (University of California, Irvine)

Copy of Thesis Summary

There is an unknown n-bit string x. A "trace" is a random substring of x formed by deleting each bit with probability (say) 1/2. Suppose an algorithm has access to independent traces of x. How many does it need to reconstruct x? The previous best method needed about exp(n^{1/2}) traces. We give a simple "mean-based" algorithm that uses about exp(n^{1/3}) traces (and time). We also show that any algorithm working in the restricted "mean-based" framework requires exp(n^{1/3}) traces. The main tool in our work is elementary complex analysis.
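
The reconstruction itself relies on complex analysis, but the "mean-based" restriction is easy to state in code: the algorithm may only use per-position averages over the (zero-padded) traces. The following sketch, with illustrative function names, shows how traces are generated and what statistics a mean-based algorithm is allowed to see.

```python
import random

def trace(x, q=0.5):
    """Return a trace of bit-string x: delete each bit independently
    with probability q, keeping the survivors in order."""
    return [b for b in x if random.random() >= q]

def mean_trace(x, num_traces, q=0.5):
    """A 'mean-based' algorithm sees only statistics of this form:
    the average of the j-th received bit over many traces, with
    short traces implicitly zero-padded to length n."""
    n = len(x)
    sums = [0.0] * n
    for _ in range(num_traces):
        t = trace(x, q)
        for j, b in enumerate(t):
            sums[j] += b
    return [s / num_traces for s in sums]
```

The result of the talk is that exp(O(n^{1/3})) traces suffice to recover x from exactly this mean vector, and that no mean-based algorithm can do better.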

Joint work with Anindya De (Northwestern) and Rocco Servedio (Columbia).

This talk is available on the CMU YouTube Theory channel.

Wearable cognitive assistance applications can provide guidance for many facets of a user's daily life. In this presentation, I will talk about my recent work that enables a new genre of such applications that require both heavy computation and very low response time on inputs from mobile devices. The core of this work is the design, implementation, and evaluation of Gabriel, an application platform that simplifies the creation of and experimentation with this new genre of applications. An essential capability of this platform is its use of cloudlets for computation offloading to achieve low latency. Using several prototype applications implemented on top of Gabriel, I will evaluate the system's performance under various conditions. I will also show how Gabriel is capable of exploiting coarse-grained parallelism on cloudlets to improve system performance, and of conserving energy on mobile devices based on user context.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.

Formal verification methods are making headway into required safety verification policies for transportation. However, they often fail to account for the fact that controllers, whether digital or human, have imperfect knowledge of the world. Without addressing this shortcoming, we risk never bridging the gap between theoretical and practical safety.

In this talk, we discuss the sometimes fatal consequences of these shortcomings, and describe a new logic with belief as a first-class citizen, allowing us to model, reason about, and verify belief-aware cyber-physical systems.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.

DNA read mapping is one of the fundamental problems in bioinformatics. The general goal is to map billions of short pieces of DNA (reads) to a high-quality reference genome, while tolerating errors including genomic variations between individuals and sequencer imprecision.

The seed-and-extend strategy, a general mapping heuristic, efficiently maps erroneous DNA fragments to the reference by breaking the read into multiple non-overlapping segments, called "seeds", and assuming that at least one of the seeds is error-free. By using seeds to index into the reference genome, seed-and-extend mappers drastically reduce the search space and hence improve the run time.

However, a fundamental trade-off of seed-and-extend mappers is the balance between seed length and seed count. A greater seed length further reduces the search space and therefore improves performance. However, a greater seed length also yields fewer seeds, making the mapping process more error-prone.
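
The strategy and its trade-off can be sketched in a few lines. This is a deliberately simplified illustration (the extension/verification step is omitted, and the function names are my own, not VEST's): longer seeds shrink the candidate list returned per lookup, but leave fewer seeds per read to survive errors.

```python
def build_seed_index(reference, seed_len):
    """Index every length-seed_len substring of the reference genome."""
    index = {}
    for i in range(len(reference) - seed_len + 1):
        index.setdefault(reference[i:i + seed_len], []).append(i)
    return index

def seed_and_extend(read, index, seed_len):
    """Break the read into non-overlapping seeds; any error-free seed
    yields candidate start positions for the whole read, which a real
    mapper would then verify by full alignment ("extend")."""
    candidates = set()
    for s in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[s:s + seed_len]
        for pos in index.get(seed, []):
            candidates.add(pos - s)  # candidate start of the whole read
    return sorted(candidates)
```

With seed_len = 4, a read of length 8 produces only two seeds, so a single sequencing error can wipe out half of them; this is the error-proneness that VEST's error-resilient long seeds aim to avoid.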

In this talk, I will present ongoing work, VEST (Versatile Error-resilient Seed Technique). VEST aims to adapt long seeds in order to make them error-resilient, achieving both high error tolerance and high mapping speed.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.

Hybrid systems combine discrete and continuous dynamics, which makes them attractive as models for systems that combine computer control with physical motion. However, verification of hybrid systems is undecidable in general and challenging for many systems of practical interest. Even when verification results are obtainable, verification only provides guarantees when reality matches the verified model. Conversely, reinforcement learning-based controllers are lauded for their flexibility in unmodeled environments, but also do not typically provide guarantees of safe operation.

This talk describes a set of tooling and theory that allows system designers to obtain safety proofs for hybrid systems that incorporate reinforcement learning algorithms. Our main insight enabling efficient verification of hybrid systems is a proof development language, called Bellerophon, that provides a way to convey insights by programming hybrid systems proofs. Our approach toward safe reinforcement learning for hybrid systems leverages these verification results to provide safety guarantees for reinforcement learning controllers.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement

My dissertation will demonstrate some of the limitations of existing approaches to automated curriculum design and propose a new approach that leverages learner-generated resources to create new educational content.

Existing approaches to automated curriculum design, such as mastery learning and reinforcement learning-based approaches, largely focus on the adaptive sequencing of content and educational activities. I begin by describing three types of limitations that face these techniques: lack of empirical evidence, computational infeasibility, and incompatibility with constructivism. As part of this, I use a novel educational data mining approach to reexamine the assumption that knowledge can be decomposed into independent components---an assumption that underlies existing approaches to automated sequencing.

I then propose a more constructivist approach to automated curriculum design that focuses on the creation of new educational content. I will describe results from a series of experiments in online educational settings that demonstrate how to effectively use learner-generated resources to help future students learn and how algorithms can be combined with learning theory to determine which resources should be used when. The vision of this approach is that if we start with no educational resources to teach a subject, we can use the crowd of learners and data-driven algorithms to create new resources---and if we do start with existing expert resources, using learner-generated resources can still enhance the existing curriculum.

While my dissertation is rooted in constructivism and challenges notions from information processing psychology, my work also builds upon results from information processing psychology. In order to dispel the seeming cognitive dissonance and situate my work in a way that is meaningful to researchers from different epistemological bents, I undertake a qualitative inquiry to identify the meaningful distinctions between different epistemologies in the learning sciences.

Finally, throughout my dissertation, I will describe a number of insights that can be gained by reasoning about robustness to different models of student learning. I show how this emergent theme of model robustness can impact our understanding of learning, approaches to teaching, and approaches to education research.

Thesis Committee:
Emma Brunskill (Chair)
Vincent Aleven
Ken Koedinger
Chinmay Kulkarni
Eric Horvitz (Microsoft Research)

Copy of Thesis Summary

The rapid progress in artificial intelligence in recent years can largely be attributed to the resurgence of neural networks, which enable learning representations in an end-to-end manner. Although neural networks are powerful, they have many limitations. For example, neural networks are computationally expensive and memory-inefficient, and their training needs many labeled examples, especially for tasks that require reasoning and external knowledge. The goal of this thesis is to overcome some of these limitations by designing neural networks with the structural bias of the inputs taken into consideration.

This thesis aims to improve the efficiency of neural networks by exploiting structural properties of inputs in designing model architectures. Specifically, this thesis augments neural networks with designed modules to improve their computational and statistical efficiency. We instantiate those modules in a wide range of tasks, including supervised learning and unsupervised learning, and show that those modules not only make neural networks consume less memory, but also generalize better.

Thesis Committee:
Eric Xing (Co-Chair)
Taylor Berg-Kirkpatrick (Co-chair)
Ruslan Salakhutdinov
Alexander Smola (Amazon)
Li Deng (Citadel)

Copy of Thesis Summary

It is an open problem in static resource bound analysis to connect high-level resource bounds (order of complexity) with the actual execution time and memory usage of compiled machine code. We propose to use machine learning to derive a cost model for a high-level source language that approximates the execution cost of compiled programs, including garbage collection, on a specific hardware platform. The technique has been implemented for a subset of OCaml using Inria's OCaml compiler on Intel x86-64 and ARM 64-bit v8-A platforms, with execution time and heap allocations being the considered resources. In this talk, I will define cost semantics, motivate and describe our machine learning approach, and finally present the evaluation.
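
One simple instance of learning such a cost model is ordinary least squares over static operation counts. The sketch below is purely illustrative: the feature set, the numbers, and the linear form are my assumptions, not the actual model or data used in this work.

```python
# Illustrative-only sketch: fit per-operation costs from static feature
# counts to measured execution costs via least squares. The features
# (#allocations, #function calls, #arithmetic ops) and the measurements
# below are made up for demonstration.
import numpy as np

# Rows: benchmark programs; columns: (allocs, calls, arith ops).
features = np.array([[10, 5, 100],
                     [ 2, 1,  30],
                     [ 8, 4,  90],
                     [ 1, 2,  20]], dtype=float)
measured_cost = np.array([125.0, 35.0, 106.0, 27.0])  # e.g. microseconds

# Solve the over-determined system for per-operation cost coefficients.
coeffs, *_ = np.linalg.lstsq(features, measured_cost, rcond=None)

def predict_cost(allocs, calls, ops):
    """Predict execution cost of an unseen program from its counts."""
    return float(np.dot(coeffs, [allocs, calls, ops]))
```

A model of this shape is attractive because the learned coefficients are interpretable as hardware-specific costs per high-level operation, which is exactly the bridge between source-level bounds and machine-level behavior that the talk addresses.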

Joint work with Jan Hoffmann.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement

