
SSS Abstracts
Fall 2010


Input Generation via Decomposition and Re-Stitching: Finding Bugs in Malware

Tuesday, September 28th, 2010 from 12-1 pm in GHC 8102.

Presented by Pongsin Poosankam, CSD

The technique of symbolic reasoning computes constraints on program variables in terms of the symbolic input and uses a constraint solver to reason about various security properties of those variables. Over the last few years, symbolic reasoning has enabled many security applications, such as vulnerability discovery. However, existing symbolic reasoning techniques work well on only a fraction of programs. In particular, they are ineffective in the presence of certain code constructs that are hard to analyze, such as decryption and checksums; we call these encoding functions. To extend symbolic reasoning to more classes of programs, we introduce a new technique that makes it possible to symbolically reason about programs that use encoding functions. The technique is based on decomposing the constraints induced by a program, solving only certain subsets, and then re-stitching the constraint solutions into a complete result. We implement the approach in a symbolic reasoning engine for x86 binaries and apply it to four prevalent families of malware that were previously difficult to analyze. We find six vulnerabilities that could be exploited to remotely terminate or subvert the malware. These vulnerabilities have persisted across malware revisions for months and even years. We discuss the possible applications and ethical considerations of this new finding.
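To illustrate the decomposition-and-re-stitching idea on a toy example (this is an illustrative sketch, not the authors' system), consider a message format guarded by a CRC32 checksum. A symbolic solver cannot feasibly invert the checksum, but constraints over the payload can be solved on their own, and the checksum side-condition can then be re-stitched by running the encoding function concretely on the solved payload:

```python
import zlib

def parse(msg: bytes) -> bytes:
    # Toy protocol: 4-byte big-endian CRC32 followed by the payload.
    crc, payload = msg[:4], msg[4:]
    if int.from_bytes(crc, "big") != zlib.crc32(payload):
        # A purely symbolic solver stalls on this branch: inverting
        # the checksum (the "encoding function") is infeasible.
        raise ValueError("bad checksum")
    return payload

def restitch(new_payload: bytes) -> bytes:
    # Re-stitching: after solving constraints over the payload alone,
    # recompute the encoding function concretely and prepend it.
    return zlib.crc32(new_payload).to_bytes(4, "big") + new_payload

msg = restitch(b"PING")
assert parse(msg) == b"PING"  # the re-stitched message passes the check
```

In the real system the new payload comes from a constraint solver rather than being written by hand; the key point is that the encoding function is executed concretely instead of being reasoned about symbolically.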

Joint work with Juan Caballero, Stephen McCamant, Domagoj Babic, and Dawn Song.

(Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.)

A Fast Solver for Quantified Boolean Formulas (QBF)

Friday, November 5th, 2010 from 12-1 pm in GHC 8102.

Presented by Will Klieber, CSD

The problem of Quantified Boolean Formulas (QBF) is a natural generalization of the problem of Boolean Satisfiability (SAT). SAT asks whether a propositional formula is satisfiable; in QBF, each variable is explicitly bound by either a universal or an existential quantifier. Many problems in Model Checking and Artificial Intelligence can be naturally expressed in the language of QBF, so there is a great deal of interest in developing QBF solvers that are fast for these types of QBF instances.
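As a concrete (and deliberately naive) illustration of that generalization, a closed QBF can be evaluated by recursing over the quantifier prefix; this sketch is exponential in the number of variables and is not the solver described in this talk:

```python
def eval_qbf(prefix, formula, env=None):
    """Evaluate a closed QBF by brute-force recursion.

    prefix:  list of (quantifier, variable) pairs, outermost first,
             where the quantifier is 'A' (forall) or 'E' (exists).
    formula: function from an assignment dict to bool (the matrix).
    """
    env = env or {}
    if not prefix:
        return formula(env)
    q, v = prefix[0]
    branches = [eval_qbf(prefix[1:], formula, {**env, v: b})
                for b in (False, True)]
    return all(branches) if q == 'A' else any(branches)

# forall x exists y. (x != y) is true: choose y = not x.
phi = lambda a: a['x'] != a['y']
assert eval_qbf([('A', 'x'), ('E', 'y')], phi)
# exists y forall x. (x != y) is false: no single y works for both x.
assert not eval_qbf([('E', 'y'), ('A', 'x')], phi)
```

The two assertions also show why quantifier order matters, which is precisely what makes QBF harder than SAT (PSPACE-complete rather than NP-complete).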

QBF solvers have inherited many of the techniques used by SAT solvers, including the use of conjunctive normal form (CNF). Although CNF works well for SAT solvers, it hinders the work of QBF solvers by impeding the ability to detect and learn from satisfying assignments. In fact, a family of problems that are trivially satisfiable in negation-normal form (NNF) was experimentally found to require exponential time for existing CNF-based solvers.

In this talk, I will present a method for solving QBF instances in non-CNF, non-prenex form. Our method makes two improvements over existing solvers. First, we reformulate clause/cube learning, extending it to non-prenex instances. Second, we introduce a propagation technique using ghost literals that exploits the structure of a non-CNF instance in a manner that is symmetric between the universal and existential variables.

This presentation is based on joint work with Samir Sapra, Sean Gao, and Edmund Clarke.

(Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.)

MoGUL: Detecting Common Insertions and Deletions in a Population

Tuesday, November 16th, 2010 from 12-1 pm in GHC 8102.

Presented by Seunghak Lee, CSD

While the discovery of structural variants in the human population is ongoing, most methods for this task assume that the genome is sequenced to high coverage (e.g., 40x) and use the combined power of the many sequenced reads and mate pairs to identify the variants. In contrast, the 1000 Genomes Project hopes to sequence hundreds of human genomes, but at low coverage (4-6x), and most of the current methods are unable to discover insertion/deletion and structural variants from this data.

In order to identify indels from multiple low-coverage individuals, we have developed the MoGUL (Mixture of Genotypes Variant Locator) framework. MoGUL identifies potential locations with indels by examining mate pairs generated from all sequenced individuals simultaneously, uses a Bayesian network with appropriate priors to explicitly model each individual as homozygous or heterozygous at each locus, and computes the expected Minor Allele Frequency (MAF) for all predicted variants. We have used MoGUL to identify variants in 1000 Genomes data, as well as in simulated genotypes, and show good accuracy at predicting indels, especially for MAF > 0.06 and indel size > 50 base pairs.
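The last step can be sketched as follows (a simplified stand-in for MoGUL's Bayesian model, with illustrative genotype posteriors, not its actual inference): the expected MAF follows from the expected alternate-allele count across diploid individuals.

```python
def expected_maf(genotype_posteriors):
    """Expected minor allele frequency from per-individual genotype posteriors.

    genotype_posteriors: list of (p_hom_ref, p_het, p_hom_alt) per individual.
    A diploid individual contributes 0, 1, or 2 alternate alleles, so the
    expected alt count is 0*p_hom_ref + 1*p_het + 2*p_hom_alt.
    """
    n = len(genotype_posteriors)
    exp_alt = sum(p_het + 2 * p_hom_alt
                  for _, p_het, p_hom_alt in genotype_posteriors)
    freq = exp_alt / (2 * n)        # 2 alleles per diploid individual
    return min(freq, 1 - freq)      # report the minor allele's frequency

# Three individuals: one confident het, two confident hom-ref -> MAF = 1/6.
maf = expected_maf([(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
assert abs(maf - 1 / 6) < 1e-9
```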

This is joint work with Eric Xing and Michael Brudno.

(Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.)

Click Chain Model in Web Search

Tuesday, November 30th, 2010 from 12-1 pm in GHC 6115.

Presented by Fan Guo, CSD

Both search engines and web search users benefit from more relevant results. Past user clicks collected from the same or similar queries form an extensive and informative, yet noisy, source of data that can be leveraged to improve search quality in the future and to provide insights for other search-related businesses.

Click models were developed as an early approach to characterizing, in a principled way, how users examine search results and make click decisions. This line of research has attracted strong interest from both academic researchers in data mining and web search and from industry practitioners seeking better solutions to search-related problems. I'll present a personal review of this line of research and its future directions, using the click chain model as the running example.
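For intuition, the simpler cascade model, an ancestor of the click chain model, derives per-position click probabilities from per-result relevance (the continuation parameter `alpha` below is an illustrative simplification, not the full click chain model presented in the talk):

```python
def click_probs(relevance, alpha=1.0):
    """Per-position click probabilities under a cascade-style model.

    The user examines position 1; if a position is examined, it is
    clicked with probability r. In the pure cascade model (alpha=1)
    examination continues only if there was no click; alpha < 1
    additionally damps the chance of continuing down the list.
    """
    probs, examine = [], 1.0
    for r in relevance:
        probs.append(examine * r)       # P(click) = P(examine) * relevance
        examine *= (1 - r) * alpha      # continue only if no click
    return probs

# Two equally relevant results: the second is examined only half the time.
assert click_probs([0.5, 0.5]) == [0.5, 0.25]
```

This already captures the core phenomenon that click models account for: lower positions receive fewer clicks even at equal relevance, so raw click counts are a biased relevance signal.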

Joint work with Chao Liu, Anitha Kannan, Tom Minka, Michael Taylor, Yi-Min Wang and Christos Faloutsos.

(Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.)

Applying Simple Performance Models to Understand Inefficiencies in Data-Intensive Computing

Tuesday, December 7th, 2010 from 12-1 pm in GHC 8102.

Presented by Elie Krevat, CSD

New programming frameworks for scale-out parallel analysis, such as MapReduce and Hadoop, have become a cornerstone for exploiting large datasets. However, there has been little analysis of how these systems perform relative to the capabilities of the hardware on which they run. This talk describes a simple analytical model that predicts the theoretical ideal performance of a parallel dataflow system. The model exposes the inefficiency of popular scale-out systems, which take 3-13 times longer to complete jobs than the hardware should allow, even in well-tuned systems used to achieve record-breaking benchmark results. Using a simplified dataflow processing tool called Parallel DataSeries, we show that the model's ideal can be approached (i.e., that it is not wildly optimistic), coming within 14-19% of the model's prediction. Moreover, guided by the model, we present an analysis of inefficiencies that exposes issues in both the disk and networking subsystems that will be faced by any DISC system built atop standard OS and networking services.
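A model of this flavor can be sketched as a back-of-the-envelope calculation (the resource set and all numbers below are illustrative assumptions, not the talk's model or measurements): a balanced phase runs no faster than its most heavily used resource allows.

```python
def ideal_time(bytes_per_node, disk_bw, net_bw, net_fraction=1.0):
    """Ideal phase time for one node of a parallel dataflow.

    Assumes each byte is read from and written to local disk once, and
    a net_fraction of the data crosses the network (e.g., a shuffle).
    The phase is bounded by its slowest resource, not their sum,
    because disk and network I/O can overlap.
    """
    disk_time = 2 * bytes_per_node / disk_bw        # read + write
    net_time = net_fraction * bytes_per_node / net_bw
    return max(disk_time, net_time)

def inefficiency(measured_time, *model_args):
    """Ratio of measured runtime to the model's ideal (1.0 = perfect)."""
    return measured_time / ideal_time(*model_args)

# 10 GB/node at 100 MB/s disk and 100 MB/s network: disk-bound, 200 s ideal.
assert ideal_time(10e9, 100e6, 100e6) == 200.0
# A job taking 800 s would be running at 4x the ideal time.
assert inefficiency(800.0, 10e9, 100e6, 100e6) == 4.0
```

A ratio in the 3-13x range, as the talk reports for popular systems, means most of the hardware's capability is being spent on framework overhead rather than moving data.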

Joint work with Tomer Shiran and Greg Ganger at CMU, and Eric Anderson, Joseph Tucek, and Jay Wylie at HP Labs.

(Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.)

GIGA+: File System Directories with Millions of Files

Friday, December 10th, 2010 from 12-1 pm in GHC 8102.

Presented by Swapnil Patil, CSD

In this talk, we examine the problem of scalable file system directories, motivated by data-intensive applications requiring millions to billions of small files to be ingested in a single directory at rates of hundreds of thousands of file creates every second. We present a POSIX-compliant scalable directory design, GIGA+, that distributes directory entries over a cluster of server nodes that make only local, independent decisions about migration. GIGA+ uses two tenets, asynchrony and inconsistency, to: (1) partition the index among all servers without any synchronization or serialization, and (2) minimize stale and inconsistent mapping state at the clients. Applications are provided traditional strong data consistency semantics, and cluster growth requires minimal directory entry migration.

We demonstrate that the prototype built using the GIGA+ approach scales better than existing distributed directory implementations, delivers a sustained throughput of more than 98,000 file creates per second on a 32-server cluster, and balances load more efficiently than consistent hashing.
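The incremental, independent splitting at the heart of the design can be sketched in miniature (a toy single-process model with a tiny split threshold; the class name and details are illustrative assumptions, not GIGA+'s implementation): each partition decides locally to split when it grows too large, doubling its hash depth without any global coordination.

```python
import hashlib

THRESHOLD = 2  # split a partition once it holds more than this many entries

def hval(name):
    """Stable hash of a filename (toy stand-in for GIGA+'s hashing)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Giga:
    """Toy extendible directory: at depth d, a name belongs to
    partition hval(name) % 2**d; splitting a partition moves half
    of its entries to a new sibling one level deeper."""

    def __init__(self):
        self.parts = {(0, 0): {}}           # (depth, index) -> entries

    def locate(self, name):
        d = max(dep for dep, _ in self.parts)
        while (d, hval(name) % 2 ** d) not in self.parts:
            d -= 1                          # fall back to the ancestor partition
        return (d, hval(name) % 2 ** d)

    def create(self, name):
        d, i = self.locate(name)
        self.parts[(d, i)][name] = None
        if len(self.parts[(d, i)]) > THRESHOLD:   # local, independent decision
            old = self.parts.pop((d, i))
            self.parts[(d + 1, i)] = {n: v for n, v in old.items()
                                      if hval(n) % 2 ** (d + 1) == i}
            self.parts[(d + 1, i + 2 ** d)] = {n: v for n, v in old.items()
                                               if hval(n) % 2 ** (d + 1) != i}

g = Giga()
for k in range(20):
    g.create(f"file{k}")
assert len(g.parts) > 1   # the directory split as it grew
# Every name is still found in exactly the partition locate() predicts.
assert all(f"file{k}" in g.parts[g.locate(f"file{k}")] for k in range(20))
```

In GIGA+ the partitions live on different servers and clients cache a possibly stale view of the split history; a client that contacts the wrong server is simply redirected, which is what keeps the mapping-state inconsistency cheap.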

(Joint work with Garth Gibson)

(Presented in Partial Fulfillment of the CSD Speaking Skills Requirement.)

Inferring genetic ancestry in admixed populations (CANCELED)

Tuesday, December 14th, 2010 from 12-1 pm in GHC 8102.

Presented by Kyung-Ah Sohn, CSD

Web contact: sss+www@cs