Girija Narlikar's Research

  1. Thesis Research: Space-Efficient Multithreading
     (a) Space-efficient scheduling algorithms
     (b) Multithreading runtime systems
  2. Distributed Systems
     (a) Bulk-synchronous distributed objects
     (b) Distributed shared memory
  3. Language Implementation
  4. Scalable Algorithms
     (a) N-body algorithms
     (b) A multithreaded decision tree builder
I am also interested in building and analyzing distributed applications, scalable servers, and scalable systems for data mining, web indexing, and searching.

1. Thesis Research: Space-Efficient Multithreading

Programs with irregular or dynamic parallelism benefit from the use of lightweight, fine-grained threads. Lightweight threads significantly simplify the programmer's job, allow for automatic load balancing, and dynamically adapt to a changing number of processors. However, high performance then depends heavily on the threads implementation: unless the thread scheduler is carefully designed, the program may use too much space, or suffer poor performance due to excessive memory allocation, high scheduling overheads, or poor locality. My thesis research has focused on two aspects of thread scheduling: the design and analysis of provably space-efficient scheduling algorithms for fine-grained multithreaded programs, and the engineering and evaluation of fast multithreading runtime systems based on these algorithms.

(a) Space-efficient scheduling algorithms

I designed two online, asynchronous scheduling algorithms for multithreading systems that support nested parallelism. The algorithms are provably space and time efficient; that is, they provide low upper bounds on the space and time requirements of a multithreaded computation. They guarantee that a nested parallel program with serial space requirement S1 and critical path length (depth) D requires at most S1 + O(p·D) space on p processors. This is a significant improvement over previous asynchronous, space-efficient schedulers, which guarantee an upper bound of p·S1, since D is small compared to S1 for most parallel programs. My algorithms maintain this low space bound by prioritizing threads according to their sequential execution order, and by temporarily stalling threads that are about to perform large allocations of space. To ensure that both algorithms scale, I also parallelized the scheduler itself, and analyzed the space and time requirements including scheduling overheads. Both algorithms provide a user-adjustable trade-off between space and time performance, which I analyzed and experimentally verified. The second algorithm improves upon the first by also trying to schedule threads that are close in the computation graph on the same processor, resulting in good locality and low scheduling contention.
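To make these two mechanisms concrete, the following C fragment is a minimal sketch, with hypothetical names and a degenerate single-queue setting (it is not the thesis implementation): ready threads are kept ordered by their sequential execution order, and a thread about to perform a large allocation is stalled once so that earlier threads can run first, with the threshold serving as the space/time knob.

    #include <stddef.h>

    #define ALLOC_THRESHOLD (64 * 1024)  /* user-adjustable space/time knob */

    typedef struct thr {
        unsigned long seq;      /* position in the sequential execution order */
        size_t pending_alloc;   /* bytes this thread wants to allocate next */
        int stalled;            /* set once a large allocation has been deferred */
        struct thr *next;
    } thr;

    static thr *ready = NULL;   /* ready list, lowest seq (highest priority) first */

    static void enqueue(thr *t) {
        thr **p = &ready;
        while (*p && (*p)->seq < t->seq)
            p = &(*p)->next;
        t->next = *p;
        *p = t;
    }

    /* Called by a worker processor to pick its next thread. */
    static thr *schedule_next(void) {
        thr *t = ready;
        if (!t) return NULL;
        ready = t->next;
        if (t->pending_alloc > ALLOC_THRESHOLD && !t->stalled) {
            t->stalled = 1;     /* defer the big allocation once, so that    */
            enqueue(t);         /* higher-priority threads get the processor */
            return schedule_next();
        }
        return t;
    }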

In joint work with Phillip Gibbons and Yossi Matias of Bell Labs, and my advisor Guy Blelloch, we designed the first asynchronous, space-efficient scheduling algorithm for parallel languages with synchronization variables. These include languages with futures, such as Multilisp and Cool, as well as other languages such as Id.
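The following C sketch shows what a synchronization variable provides, built here from Pthreads primitives purely for illustration (it is not the algorithm or code from the paper): a cell that is written once, and on which any number of consumers can block until the value arrives, much like touching a future.

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  filled;
        int   full;
        void *value;
    } sync_var;

    void sync_var_init(sync_var *v) {
        pthread_mutex_init(&v->lock, NULL);
        pthread_cond_init(&v->filled, NULL);
        v->full = 0;
        v->value = NULL;
    }

    /* Producer: write exactly once and wake all waiting readers. */
    void sync_var_put(sync_var *v, void *x) {
        pthread_mutex_lock(&v->lock);
        v->value = x;
        v->full = 1;
        pthread_cond_broadcast(&v->filled);
        pthread_mutex_unlock(&v->lock);
    }

    /* Consumer: block until the value is available (a "touch"). */
    void *sync_var_get(sync_var *v) {
        pthread_mutex_lock(&v->lock);
        while (!v->full)
            pthread_cond_wait(&v->filled, &v->lock);
        void *x = v->value;
        pthread_mutex_unlock(&v->lock);
        return x;
    }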

(b) Multithreading runtime systems

To determine how useful my space-efficient scheduling algorithms are in practice on realistic benchmarks, I implemented them in the context of two separate runtime systems. First, I built a runtime system that executes C programs, with the threads written in continuation-passing style, on an SGI Power Challenge. The system uses one of my asynchronous scheduling algorithms to schedule the threads. The results of executing parallel programs on this system show that my scheduling algorithm significantly reduces memory usage compared to previous space-efficient techniques, without compromising performance.
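As an illustration of this programming style (with hypothetical names, not the runtime's actual interface), a thread written in continuation-passing style does not return to its caller; it packages the rest of its work as a continuation and hands it back to the scheduler, which is what lets the scheduler choose the execution order.

    /* A thread is a function plus its saved environment. */
    typedef struct cont {
        void (*fn)(struct cont *self);   /* code to resume */
        void *env;                       /* state the continuation needs */
    } cont;

    extern void scheduler_enqueue(cont *c);  /* assumed runtime hook */

    static void step2(cont *self) { /* ... finish, using self->env ... */ }

    static void step1(cont *self) {
        /* ... do a bounded amount of work, possibly forking children ... */
        self->fn = step2;           /* rewrite self into its continuation */
        scheduler_enqueue(self);    /* the scheduler decides when it runs */
    }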

Although POSIX threads (Pthreads) have become a popular standard for shared-memory programming, I found the current Pthread schedulers unsuited to executing programs with fine-grained, dynamic parallelism. I therefore implemented both of my space-efficient scheduling algorithms as part of a popular Pthreads package on Solaris. Although the algorithms guarantee space upper bounds only for nested parallel programs, they also appear to work well in practice for programs with arbitrary synchronization, such as locks or condition variables. This is the first space-efficient system that supports functionality as general as that of Pthreads. I used several parallel programs, including numerical codes, physical simulations, and a data classifier, to evaluate the schedulers. The programs, written in a fine-grained style, were simpler to code than their hand-partitioned, coarse-grained counterparts, yet provided equivalent performance; however, my space-efficient schedulers were required in place of the Pthread library's original scheduler to achieve this performance. This work on space-efficient scheduling is part of the Scandal project.
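The flavor of this fine-grained style is easy to see in a toy example (illustrative, not one of the benchmarks used): a divide-and-conquer sum that simply forks a Pthread per recursive call and leaves all partitioning and load-balancing decisions to the scheduler.

    #include <pthread.h>

    typedef struct { double *a; long n; double sum; } job;

    /* Recursively sum an array, forking a thread for one half. A scheduler
     * that is not space-efficient can let the number of simultaneously live
     * threads, and hence memory use, grow very large on such programs. */
    static void *psum(void *arg) {
        job *j = (job *)arg;
        if (j->n < 1024) {                        /* serial cutoff */
            double s = 0;
            for (long i = 0; i < j->n; i++) s += j->a[i];
            j->sum = s;
            return NULL;
        }
        job left  = { j->a,            j->n / 2,        0 };
        job right = { j->a + j->n / 2, j->n - j->n / 2, 0 };
        pthread_t t;
        pthread_create(&t, NULL, psum, &left);    /* fine-grained fork */
        psum(&right);                             /* recurse locally */
        pthread_join(t, NULL);
        j->sum = left.sum + right.sum;
        return NULL;
    }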

Recently, I also studied in some detail the applicability of lightweight threads to parallelizing a data mining problem. In particular, I found that with a space-efficient thread scheduler, a high-level formulation of a decision tree builder (for in-core data) scales well with both the data size and the number of processors.


2. Distributed Systems

(a) Bulk-synchronous distributed objects

Bulk-synchronous libraries provide a simple and efficient model for parallel programming with predictable performance. However, the programmer has to manage all the data movement at a low level, much as in the message-passing style. In collaboration with Satish Rao (NEC Research Institute), I designed and implemented a distributed object layer over a bulk-synchronous library on a network of PCs. The object layer provides a global object name space that simplifies the task of programming, and uses software caching to minimize both communication and programming effort. Because it is built over a bulk-synchronous library, the object system retains efficient and predictable performance. We evaluated the object layer's utility and efficiency using a variety of applications.
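The resulting programming model can be sketched with a hypothetical C interface (for illustration only; the actual layer's API is not reproduced here): objects are named in a global space, reads are served from a local software cache, and writes become visible at the next bulk-synchronous step.

    typedef unsigned long obj_id;    /* name in the global object space */

    /* Assumed interface of the object layer (hypothetical): */
    extern void *bsp_obj_read(obj_id id);            /* may hit the local cache */
    extern void  bsp_obj_write(obj_id id, double v); /* buffered until the sync */
    extern void  bsp_sync(void);                     /* end of superstep: deliver
                                                        updates, refresh caches */

    /* One relaxation step over a node's neighbors. */
    void relax_step(obj_id me, const obj_id *nbr, int deg) {
        double acc = 0;
        for (int i = 0; i < deg; i++)        /* repeated reads are cached */
            acc += *(double *)bsp_obj_read(nbr[i]);
        bsp_obj_write(me, acc / deg);        /* visible after bsp_sync() */
        bsp_sync();
    }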

(b) Distributed shared memory

I worked with Chandramohan Thekkath (DEC Systems Research Center) on adding runtime optimizations to an existing distributed shared memory (DSM) library called SAM. SAM uses objects to share data between processors, and provides primitives to access these shared objects. For example, when a processor requests read access to an object, a copy of the object is asynchronously fetched into that processor's local cache. Although the programming model is fairly user-friendly, performance can suffer from a large number of asynchronous requests for remote data. We designed runtime optimizations to minimize communication and to overlap communication with computation, taking advantage of the fact that the data access patterns of many applications change slowly over time. We dynamically tracked the set of active readers for each DSM object, so that when the object was written, the copies in the active readers' caches could be updated before the next read request was made. Further, we kept track of the types of accesses (read vs. write) made to each object, so that we could dynamically adapt its cache coherence policy to the observed accesses; for example, an update policy is preferable to an invalidate policy when an object is read often but written rarely. The effectiveness of these optimizations was experimentally verified using existing parallel benchmarks written in SAM.
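A simplified C sketch of this bookkeeping (with illustrative names and an arbitrary threshold; SAM's actual interface is not shown here): each object records its read/write mix and its active readers, and a write either updates or invalidates the readers' copies depending on the policy chosen from that mix.

    #include <stddef.h>

    #define MAX_PROCS 64

    typedef enum { POLICY_INVALIDATE, POLICY_UPDATE } policy;

    typedef struct {
        unsigned reads, writes;
        unsigned char reader[MAX_PROCS];   /* active-reader set */
        policy pol;
    } dsm_meta;

    /* Assumed messaging hooks (hypothetical): */
    extern void send_update(int proc, const void *data, size_t len);
    extern void send_invalidate(int proc);

    void dsm_on_read(dsm_meta *m, int proc) {
        m->reads++;
        m->reader[proc] = 1;               /* remember this active reader */
    }

    void dsm_on_write(dsm_meta *m, int writer, const void *data, size_t len) {
        m->writes++;
        /* Reads dominating writes favors eagerly updating cached copies. */
        m->pol = (m->reads > 4 * m->writes) ? POLICY_UPDATE : POLICY_INVALIDATE;
        for (int p = 0; p < MAX_PROCS; p++) {
            if (p == writer || !m->reader[p]) continue;
            if (m->pol == POLICY_UPDATE)
                send_update(p, data, len);
            else {
                send_invalidate(p);
                m->reader[p] = 0;          /* reader must re-fetch to rejoin */
            }
        }
    }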


3. Language Implementation

The Scandal project at CMU is involved in the implementation of a high-level, applicative language called Nesl. I built a lightweight, portable interpreter for Nesl, making it more widely available on serial workstations for rapid prototyping of parallel algorithms. Nesl is well suited to coding algorithm animations; however, since the Nesl implementation was limited to X11-based Unix systems, an insecure X11 connection was required to make the animations available over the web. In joint work with Jonathan Hardwick and Jay Sipelstein (CMU), we designed and implemented an alternate system that compiles Nesl into Java. In addition to increasing the portability of Nesl, this system has enabled us to make existing simulations and algorithm animations available in applet form on the web. We measured the performance of the new system using a set of benchmarks on both PCs and workstations: current Java virtual machines running the generated code achieve about half the performance of a native implementation of Nesl. We concluded that using Java as an intermediate language is a viable way to improve the portability of existing high-level programming languages for scientific computation and visualization.


4. Scalable Algorithms

(a) N-body algorithms

This joint work with Guy Blelloch compares three popular algorithms for the three-dimensional N-body problem, namely the Barnes-Hut algorithm, Greengard's Fast Multipole Method (FMM), and the Parallel Multipole Tree Algorithm (PMTA), to determine which performs best in practice. Although the FMM has a better asymptotic running time (O(N) instead of O(N log N) for uniform distributions), it is more complicated, and it is not immediately clear above what value of N it performs better in practice. This was the first study to analyze and compare the relative performance of these three popular N-body algorithms. We first experimentally studied how accuracy depends on the adjustable parameters of the three algorithms. We then mathematically analyzed and compared the floating-point operation counts of the algorithms at similar levels of accuracy, for both charged and uncharged random distributions. At a high level of accuracy, we found that the FMM performed the fewest operations for N > 10^4, for both charged and uncharged distributions of points. However, at a lower level of accuracy, for uncharged distributions, the FMM did not outperform Barnes-Hut even for N > 10^8. For charged distributions of particles, the FMM and the PMTA were comparable at low accuracy. Nesl implementations of the N-body algorithms are also available.
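As an example of the parameters involved, the accuracy and operation count of Barnes-Hut are governed by its opening angle theta; the standard traversal below (a generic C sketch, not the implementation used in the study) shows how theta decides whether a cell is approximated by its center of mass or opened up.

    #include <math.h>

    typedef struct cell {
        double cm[3];            /* center of mass */
        double mass;
        double size;             /* side length of the cell's cube */
        int nchild;              /* 0 for a leaf */
        struct cell *child[8];
    } cell;

    /* Accumulate the acceleration at p due to a point mass at cm. */
    static void accumulate(const double p[3], const double cm[3],
                           double mass, double acc[3]) {
        double d[3], r2 = 1e-12;                 /* small softening term */
        for (int i = 0; i < 3; i++) { d[i] = cm[i] - p[i]; r2 += d[i] * d[i]; }
        double s = mass / (r2 * sqrt(r2));
        for (int i = 0; i < 3; i++) acc[i] += d[i] * s;
    }

    /* Barnes-Hut: a cell with size/distance < theta is far enough away to
     * be treated as a single mass; otherwise its children are visited. */
    void bh_force(const cell *c, const double p[3], double theta, double acc[3]) {
        double d[3], r2 = 0;
        for (int i = 0; i < 3; i++) { d[i] = c->cm[i] - p[i]; r2 += d[i] * d[i]; }
        if (c->nchild == 0 || c->size * c->size < theta * theta * r2)
            accumulate(p, c->cm, c->mass, acc);
        else
            for (int i = 0; i < c->nchild; i++)
                bh_force(c->child[i], p, theta, acc);
    }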

(b) A multithreaded decision tree builder

I have recently worked on a high-level, fine-grained parallel formulation of a decision-tree-based classifier for memory-resident datasets on SMPs. The algorithm exploits two levels of divide-and-conquer parallelism in the tree builder: at the outer level across the tree nodes, and at the inner level within each tree node. Lightweight Pthreads are used to express this highly irregular and dynamic parallelism in a natural manner, and the task of scheduling the threads and balancing the load is left to a space-efficient Pthreads scheduler. Experimental results on large datasets indicate that the space and time performance of the tree builder scales well with both the data size and the number of processors.
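Schematically, the builder has the following shape (a C sketch with hypothetical helper routines, not the actual code): the outer level forks a Pthread per child of a tree node, while the split-finding work within a node is itself parallel.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct node { int split; struct node *left, *right; } node;
    typedef struct { int dummy; /* stands in for the records at a node */ } dataset;

    /* Assumed helpers (hypothetical): */
    extern int  stop_splitting(const dataset *d);
    extern int  best_split_parallel(const dataset *d);  /* inner-level parallelism */
    extern void partition(const dataset *d, int split, dataset *l, dataset *r);

    typedef struct { dataset d; node *out; } task;

    static void *build(void *arg) {
        task *t = (task *)arg;
        node *n = calloc(1, sizeof *n);
        t->out = n;
        if (stop_splitting(&t->d)) return NULL;
        n->split = best_split_parallel(&t->d);   /* parallel within the node */
        task l, r;
        partition(&t->d, n->split, &l.d, &r.d);
        pthread_t child;                         /* parallel across tree nodes */
        pthread_create(&child, NULL, build, &l);
        build(&r);                               /* build the right child here */
        pthread_join(child, NULL);
        n->left = l.out;
        n->right = r.out;
        return NULL;
    }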


