
SSS Abstracts
Spring 2012


Duolingo

Tuesday, January 17th, 2012 from 12-1 pm in GHC 4303.

Presented by Severin Hacker, CSD

In this talk, I will introduce Duolingo, a website (duolingo.com) which aims to translate the Web into every major language: every webpage, every video and, yes, even Justin Bieber's tweets.

With its content split up into hundreds of languages—and with over 50% of it in English—most of the Web is inaccessible to most people in the world. This problem is pressing, now more than ever, with millions of people from China, Russia, Latin America and other quickly developing regions entering the Web. While machine translation systems are improving every day, their output is not yet good enough to be directly shown on sites like CNN.

With Duolingo, our goal is to encourage people to translate the Web into their native languages. With billions and billions of pages on the Web, this cannot be done with just a few volunteers, nor can we afford to pay professional translators. Instead, Luis von Ahn and I quickly realized that we needed a way to entice millions of people to help translate the Web. However, coordinating millions of contributors presents two major hurdles. First, finding enough people who are sufficiently bilingual to help with translation is difficult. Second, motivating them to do it for free makes the task seem next to impossible.

The idea behind Duolingo is to kill two birds with one stone by solving both of these problems simultaneously. We accomplish this by transforming language translation into something that anyone can do—not just bilinguals—and that millions of people want to do: learning a foreign language.

With Duolingo, people learn a foreign language while simultaneously translating text.

This is joint work with my advisor Luis von Ahn and the Duolingo Team.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.


Influence Propagation in Large Graphs

Tuesday, January 31st, 2012 from 12-1 pm in GHC 4303.

Presented by B. Aditya Prakash, CSD

Given a network of who-contacts-whom, will a contagious virus/product/meme take over (cause an epidemic) or die out quickly? What will change if nodes have partial, temporary, or permanent immunity? What if the underlying network changes over time (e.g., people have different connections during the day at work and during the night at home)? Propagation-style processes model many real-world scenarios well, including epidemiology, public health, and information diffusion. We present results on understanding the tipping-point behavior of epidemics (enabling, among other things, faster simulations), predicting who wins among competing viruses, and developing effective algorithms for immunization and marketing in several large-scale real-world settings.
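
The tipping-point behavior can be illustrated with a small sketch (my own Python, not the speaker's code): for an SIS-style virus on a static graph, this line of work shows that the epidemic dies out when the effective strength (beta/delta) * lambda_1 falls below 1, where lambda_1 is the largest eigenvalue of the graph's adjacency matrix.

```python
import numpy as np

def effective_strength(adj, beta, delta):
    """beta: infection rate along an edge; delta: recovery rate.
    Epidemic dies out when the returned value is below 1."""
    lambda1 = max(abs(np.linalg.eigvals(adj)))
    return (beta / delta) * lambda1

# A 4-node star graph: hubs raise lambda_1 (here sqrt(3)), making
# epidemics easier to sustain than on sparser topologies.
star = np.array([[0, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0],
                 [1, 0, 0, 0]], dtype=float)

weak = effective_strength(star, beta=0.1, delta=0.5)    # below threshold
strong = effective_strength(star, beta=0.4, delta=0.5)  # above threshold
print(weak < 1.0, strong > 1.0)  # → True True
```

The graph, rates, and function name are illustrative; the threshold condition itself is the published result.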

Joint work with: Christos Faloutsos, Lada Adamic, Roni Rosenfeld, Michalis Faloutsos and Hanghang Tong

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.


On the Duality of Data-intensive File System Design: Reconciling HDFS and PVFS

Friday, February 3rd, 2012 from 12-1 pm in GHC 4303.

Presented by Wittawat Tantisiriroj, CSD

Data-intensive applications fall into two computing styles: Internet services (cloud computing) or high-performance computing (HPC). In both categories, the underlying file system is a key component for scalable application performance. In this paper, we explore the similarities and differences between PVFS, a parallel file system used in HPC at large scale, and HDFS, the primary storage system used in cloud computing with Hadoop. We integrate PVFS into Hadoop and compare its performance to HDFS using a set of data-intensive computing benchmarks. We study how HDFS-specific optimizations can be matched using PVFS and how consistency, durability, and persistence tradeoffs made by these file systems affect application performance. We show how to embed multiple replicas into a PVFS file, including a mapping with a complete copy local to the writing client, to emulate HDFS's file layout policies. We also highlight implementation issues, including HDFS's dependence on disk bandwidth and the benefits of pipelined replication.
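
The replica-placement idea described above can be sketched as follows (a hypothetical policy of my own, not the actual PVFS mapping): one complete copy is placed on the writing client's own server, and the remaining replicas are spread across other servers, emulating HDFS's write-locally-first layout.

```python
def place_replicas(writer, servers, n_replicas=3):
    """Return the servers holding each replica: the first replica is
    local to the writing client, the rest go to other servers."""
    others = [s for s in servers if s != writer]
    return [writer] + others[:n_replicas - 1]

# Example: a 4-server cluster where "node3" is the writer.
layout = place_replicas("node3", ["node1", "node2", "node3", "node4"])
print(layout)  # → ['node3', 'node1', 'node2']
```

Server names and the round-robin choice of remote replicas are illustrative only; the point is that the writer always keeps a complete local copy.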

This is joint work with Swapnil Patil, Garth Gibson, Seung Son, Samuel Lang, and Robert Ross.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.


Duolingo

Tuesday, February 7th, 2012 from 12-1 pm in GHC 4303.

Presented by Severin Hacker, CSD

In this talk, I will introduce Duolingo, a website (duolingo.com) which aims to translate the Web into every major language: every webpage, every video and, yes, even Justin Bieber's tweets.

With its content split up into hundreds of languages—and with over 50% of it in English—most of the Web is inaccessible to most people in the world. This problem is pressing, now more than ever, with millions of people from China, Russia, Latin America and other quickly developing regions entering the Web. While machine translation systems are improving every day, their output is not yet good enough to be directly shown on sites like CNN.

With Duolingo, our goal is to encourage people to translate the Web into their native languages. With billions and billions of pages on the Web, this cannot be done with just a few volunteers, nor can we afford to pay professional translators. Instead, Luis von Ahn and I quickly realized that we needed a way to entice millions of people to help translate the Web. However, coordinating millions of contributors presents two major hurdles. First, finding enough people who are sufficiently bilingual to help with translation is difficult. Second, motivating them to do it for free makes the task seem next to impossible.

The idea behind Duolingo is to kill two birds with one stone by solving both of these problems simultaneously. We accomplish this by transforming language translation into something that anyone can do—not just bilinguals—and that millions of people want to do: learning a foreign language.

With Duolingo, people learn a foreign language while simultaneously translating text.

This is joint work with my advisor Luis von Ahn and the Duolingo Team.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.


A Type System for Borrowing Permissions

Tuesday, February 28th, 2012 from 12-1 pm in GHC 4303.

Presented by Karl Naden, CSD

In imperative object-oriented programming, programmers must consider the impacts of aliasing when reasoning about their programs and proving properties such as consistency of typestate and noninterference of concurrency. Access permission annotations are a lightweight mechanism for reasoning about aliasing in which references are characterized in terms of the access granted to the object and what other aliases to the object might exist. For example, a unique permission grants full access and guarantees no other aliases to the object exist, while an immutable permission grants non-modifying access and guarantees that all other aliases also only have non-modifying access.

In order to be useful in practice, tools based on access permissions must support common programming paradigms such as borrowing, where a unique permission is temporarily made immutable but regained at a later point, and extracting permissions from object fields. Existing tools support borrowing through heavyweight alias analyses or manual accounting with fractional permissions, and previous systems have modeled field reads as an awkward swap operation. This work proposes local permissions and implicit unpacking of fields as lightweight and natural abstractions for dealing with these problems. In each case, our type system does all the necessary accounting under the hood, leaving the exposed language simple.
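
As a rough illustration of borrowing (a toy runtime model I wrote, not the paper's static type system), a unique reference can be downgraded to immutable for the duration of a borrow, with the unique permission restored automatically once the last borrow ends:

```python
class PermRef:
    """Toy permission-tracked reference: 'unique' grants full access,
    'immutable' grants read-only access while borrows are outstanding."""

    def __init__(self, obj):
        self.obj = obj
        self.perm = "unique"
        self._borrows = 0

    def borrow_immutable(self):
        assert self.perm in ("unique", "immutable")
        self._borrows += 1
        self.perm = "immutable"   # unique temporarily given up

    def release(self):
        self._borrows -= 1
        if self._borrows == 0:
            self.perm = "unique"  # permission regained, as in borrowing

    def write(self, value):
        assert self.perm == "unique", "cannot write through a borrowed ref"
        self.obj = value

ref = PermRef("data")
ref.borrow_immutable()            # readers may alias ref here
ref.release()                     # borrow ends
ref.write("new data")             # full access is back
```

In the paper this accounting is done statically by the type checker, not at run time; the sketch only mirrors the permission life cycle.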

In this talk, I will motivate the need for tools to reason about aliases, particularly in the context of typestate in the Plaid programming language, and introduce our use of access permissions through a series of examples.

This is joint work with Jonathan Aldrich, Robert Bocchino, and Kevin Bierhoff.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.


Internally Deterministic Parallel Algorithms Can Be Fast

Friday, March 23rd, 2012 from 12-1 pm in GHC 4303.

Presented by Julian Shun, CSD

The virtues of deterministic parallelism have been argued for decades, and many forms of deterministic parallelism have been described and analyzed. Here we are concerned with one of the strongest forms, requiring that for any input there is a unique dependence graph representing a trace of the computation annotated with every operation and value. This has been referred to as internal determinism, and it implies a sequential semantics—i.e., considering any sequential traversal of the dependence graph is sufficient for analyzing the correctness of the code. In addition to returning deterministic results, internal determinism has many advantages, including ease of reasoning about the code, verifying correctness, debugging, defining invariants, defining good coverage for testing, and reasoning about performance formally, informally, and experimentally. On the other hand, one needs to consider the possible downsides of determinism, which might include making algorithms (i) more complicated, unnatural, or special purpose and/or (ii) slower or less scalable.

In this work we study the effectiveness of this strong form of determinism through a broad set of benchmark problems. Our main contribution is to demonstrate that for this wide body of problems, there exist efficient internally deterministic algorithms, and moreover that these algorithms are natural to reason about and not complicated to code. We leverage an approach to determinism suggested by Steele (1990), which is to use nested parallelism with commutative operations. Our algorithms apply several diverse programming paradigms that fit within the model including (i) a strict functional style (no shared state among concurrent operations), (ii) an approach we refer to as deterministic reservations, and (iii) the use of commutative, linearizable operations on data structures. We describe algorithms for the benchmark problems that use these deterministic approaches and present performance results on a 32-core machine. Perhaps surprisingly, for all problems, our internally deterministic algorithms achieve good speedup and good performance even relative to prior nondeterministic solutions.
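
The "deterministic reservations" paradigm can be illustrated with a toy sequential sketch (mine, not the authors' code) on greedy maximal matching: in each round, every pending edge reserves its two endpoints by writing its index, the lowest index wins each endpoint, and only edges holding both reservations commit. Because ties are broken by index, the committed matching equals the sequential greedy result regardless of how the "parallel" phases are scheduled.

```python
def maximal_matching(edges):
    """Greedy maximal matching via reserve/commit rounds."""
    matched = set()
    matching = []
    pending = list(range(len(edges)))
    while pending:
        reserved = {}
        for i in pending:                      # reserve phase (parallel in reality)
            u, v = edges[i]
            for x in (u, v):
                reserved[x] = min(i, reserved.get(x, i))
        winners = [i for i in pending
                   if reserved[edges[i][0]] == i and reserved[edges[i][1]] == i]
        for i in winners:                      # commit phase
            matching.append(edges[i])
            matched.update(edges[i])
        pending = [i for i in pending if i not in winners
                   and edges[i][0] not in matched and edges[i][1] not in matched]
    return matching

matching = maximal_matching([(0, 1), (1, 2), (2, 3), (0, 3)])
print(matching)  # → [(0, 1), (2, 3)]
```

The benchmark suite in the talk covers many more problems; this sketch only shows the reserve/commit structure on one of them.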

Joint work with Guy Blelloch, Jeremy Fineman and Phillip Gibbons.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.


ConcertOS, or: How I Learned to Stop Worrying and Love x64

Friday, March 30th, 2012 from 12-1 pm in GHC 4303.

Presented by Luke Zarko, CSD

We are building an operating system for x64 and ARM machines from the bits up, aiming for performance and safety through the use of an (in development) low-level logic that permits reasoning about program code at the instruction level. In this talk, I will discuss the project from a systems perspective, exploring some of the engineering design decisions we have made thus far; of particular importance is the garbage collector, which is based on the semi-realtime parallel and concurrent algorithm developed by Perry Cheng as part of his thesis work.

Joint work with Michael Arntzenius, Karl Crary, David Eckhardt, Salil Joshi, and Michael Sullivan.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.


Verifying Object-Oriented Code Using Object Propositions

Tuesday, April 3rd, 2012 from 12-1 pm in GHC 4303.

Presented by Ligia Nistor, CSD

The presence of aliasing makes modular verification of object-oriented code difficult. If multiple clients depend on the properties of an object, one client may break a property that others depend on.

We have developed a modular verification approach based on the novel abstraction of object propositions, which combine predicates and information about object aliasing. In our methodology, even if shared data is modified, we know that an object invariant specified by a client holds. Our permission system allows verification using a mixture of linear and nonlinear reasoning. This allows it to provide more modularity in some cases than competing separation logic approaches, because it can more effectively hide the exact aliasing relationships within a module. We validate our approach in extended examples including a simulation system that illustrates our system's modularity, and an instance of the composite pattern that illustrates our system's practicality.

Joint work with Jonathan Aldrich and Hannes Mehnert

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.


Verifying Object-Oriented Code Using Object Propositions

Friday, April 27th, 2012 from 12-1 pm in GHC 4303.

Presented by Ligia Nistor, CSD

The presence of aliasing makes modular verification of object-oriented code difficult. If multiple clients depend on the properties of an object, one client may break a property that others depend on.

We have developed a modular verification approach based on the novel abstraction of object propositions, which combine predicates and information about object aliasing. In our methodology, even if shared data is modified, we know that an object invariant specified by a client holds. Our permission system allows verification using a mixture of linear and nonlinear reasoning. This allows it to provide more modularity in some cases than competing separation logic approaches, because it can more effectively hide the exact aliasing relationships within a module. We validate our approach in extended examples including a simulation system that illustrates our system's modularity, and an instance of the composite pattern that illustrates our system's practicality.

Joint work with Jonathan Aldrich and Hannes Mehnert

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.


Leveraging Input and Output Structures For Joint Mapping of Epistatic and Marginal eQTLs

Friday, May 4th, 2012 from 12-1 pm in GHC 4303.

Presented by Seunghak Lee, CSD

Since many complex disease and expression phenotypes are the outcome of intricate perturbations of the molecular networks underlying gene regulation, resulting from interdependent genome variations, association mapping of causal QTLs or eQTLs must consider both additive and epistatic effects of multiple candidate genotypes. This problem poses a significant challenge to contemporary genome-wide association (GWA) mapping technologies because of its computational complexity. Fortunately, a plethora of recent developments in the biological network community, especially the availability of genetic interaction networks, make it possible to construct informative priors on complex interactions between genotypes, which can substantially reduce the complexity and increase the statistical power of GWA inference.

In this paper, we consider the problem of learning a multi-task regression model while taking advantage of prior information on structures on both the inputs (genetic variations) and outputs (expression levels). We propose a novel regularization scheme over multi-task regression called the jointly structured input/output lasso, based on an L1/L2 norm, which allows shared sparsity patterns for related inputs and outputs to be optimally estimated. Such patterns capture multiple related SNPs that jointly influence multiple related expression traits. In addition, we generalize this new multi-task regression to structurally regularized polynomial regression to detect epistatic interactions with manageable complexity by exploiting prior knowledge on candidate epistatic SNPs from biological experiments. We demonstrate our method on simulated and yeast eQTL datasets.
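
The L1/L2 (group-lasso-style) regularizer at the heart of the scheme can be sketched as follows (the grouping and function name are illustrative, not the paper's exact formulation): the penalty sums the L2 norms of groups of coefficients, so entire groups of related SNP/trait coefficients are driven to zero together, producing the shared sparsity patterns described above.

```python
import numpy as np

def l1_l2_penalty(B, groups):
    """B: (inputs x outputs) coefficient matrix.
    groups: lists of row indices for related inputs.
    Returns sum over groups of the L2 norm of each group's block."""
    return sum(np.linalg.norm(B[g, :]) for g in groups)

# Rows 0 and 2 form one group (zeroed out jointly), row 1 another.
B = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [0.0, 0.0]])
p = l1_l2_penalty(B, [[0, 2], [1]])
print(p)  # → 5.0: only the nonzero group contributes
```

Because the L2 norm is not separable within a group, shrinking the penalty forces all coefficients of a group toward zero at once, which is what lets structural priors over inputs and outputs be encoded in the groups.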

This is joint work with Eric Xing.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.


Exploring an Automated Resource-efficient Approach to Meeting Service Deadlines

Friday, May 11th, 2012 from 12-1 pm in GHC 4303.

Presented by Elie Krevat, CSD

Large services such as those at Amazon.com and Google are expected to respond quickly to customers, even when faced with thousands of simultaneous requests. If a service cannot respond in time to meet its service level objectives (SLOs), the company may suffer significant consequences (e.g., financial penalties, lost business, and disappointed customers). Companies that are successful in meeting their SLOs generally do so by designing their services to scale over as many machines as are available and throwing an abundance of computing resources at the problem. When there is only one service to manage, this task is fairly manageable. However, with the recent proliferation of data-intensive computing, many more essential component services (e.g., spell checking, ads generators) serve as building blocks that are incorporated into composite services (e.g., search engines, document managers). These component services can share the same pool of resources and infrastructure services (e.g., storage, databases), or they may be managed remotely by third-party entities.

Determining how many resources are needed for each of the component services, and avoiding wasteful and inefficient allocations on shared clusters, has become an increasingly complex problem. For third-party services, paying more for better response times may be an option, but when are those costs justified? This talk will explore the issues and challenges of an efficient and general approach to reducing SLO violations in these environments. We would like to automatically identify downstream services that could benefit from additional resources and to assess the effectiveness of adding them. Our proposal includes making use of simplified analytical models synthesized from end-to-end trace data to determine service dependencies and to find good candidates for additional resources.

Joint work with Greg Ganger.

Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.


Web contact: sss+www@cs