The Laboratory for Experimental Software Systems (LESS) at the University of Texas at Austin's Department of Computer Sciences was formed in September 1996 by four new faculty members --- Lorenzo Alvisi, Robert Blumofe, Mike Dahlin, and Calvin Lin --- to aggregate resources and promote collaboration on research in experimental software systems, particularly in the areas of programming support and fault tolerance for cluster and web-based applications. This document gives a brief overview of research being conducted in the LESS lab.
Fault tolerant parallel computing with distributed shared memory
(Alvisi and Blumofe). Prior work has shown that the combination of
a "well structured" parallel programming model, the
randomized "work-stealing" scheduling algorithm, and the
"dag consistency" coherence model of distributed shared
memory (a combination that forms the basis for the Cilk system)
supports efficient parallel computing on clusters of workstations.
This project aims to make such computations fault tolerant as well.
We propose to use a combination of two new techniques --- "return transactions" and "causal logging of reconciles" --- that take advantage of the following key algorithmic property of the combination of well structuring, work stealing, and dag consistency. When a procedure activation is stolen, none of the modifications made to shared memory by the stolen activation or its descendants need to be seen by any other extant activation, except for the stolen activation's parent. Moreover, even the parent does not need to see these modifications until after the stolen activation returns.
Return transactions exploit this property by buffering the
shared-memory modifications of a stolen activation and committing them
to the parent only when the activation returns. To avoid potentially
huge amounts of buffering, we combine return transactions with causal
logging of reconciles.
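As a rough illustration of why this property helps (a sketch only, not the actual protocol, with all names invented for the example), the following Go program buffers a stolen child's shared-memory updates in a private write log and delivers them to its parent only at return, so no other activation can observe them early:

```go
package main

import "fmt"

// WriteLog buffers a stolen activation's updates to "shared memory"
// (modeled here as a map) until the activation returns to its parent.
type WriteLog map[string]int

// stolenChild runs as if on a thief processor: it computes locally,
// recording every shared-memory update in its private write log.
func stolenChild(done chan<- WriteLog) {
	log := WriteLog{}
	log["x"] = 42 // an update no other activation needs to see yet
	log["y"] = 7
	done <- log // the "return": deliver all updates to the parent at once
}

func main() {
	shared := map[string]int{"x": 0, "y": 0}
	done := make(chan WriteLog)
	go stolenChild(done)

	// The parent proceeds without seeing the child's updates...
	fmt.Println("before return:", shared)

	// ...and reconciles them only after the stolen child returns.
	for k, v := range <-done {
		shared[k] = v
	}
	fmt.Println("after return:", shared)
}
```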
Reliable parallel scientific subroutine libraries (Blumofe).
Traditionally, parallel scientific subroutine libraries, such as
various parallel implementations of the Basic Linear Algebra
Subprograms (BLAS), have been coded by statically partitioning work
among a static set of processes or threads. This approach has been
very successful for traditional parallel platforms in which each
program runs on a static set of (effectively) dedicated processors.
With the growing use and acceptance of SMPs and clusters for parallel
computation, however, this assumption of dedicated resources is no
longer valid, and it has been shown that applications and libraries
coded with static partitioning have very unreliable performance when
run on non-dedicated resources. On the other hand, it has been shown
that by using dynamic scheduling based on randomized work stealing,
applications and libraries can achieve reliable performance even on
non-dedicated resources. This project will develop parallel scientific
subroutine libraries that exploit such adaptive scheduling.
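To make the contrast concrete, here is a sketch in Go (illustrative only, not drawn from any BLAS implementation) of the dynamic alternative: instead of fixing each worker's share in advance, workers claim chunks of an axpy-style loop from a shared counter, so a worker slowed by competing jobs simply claims fewer chunks while the others pick up the slack:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// axpy computes y[i] += a*x[i] over one chunk of the vectors.
func axpy(a float64, x, y []float64, lo, hi int) {
	for i := lo; i < hi; i++ {
		y[i] += a * x[i]
	}
}

func main() {
	const n, workers, chunk = 1 << 20, 4, 1 << 12
	a := 2.0
	x := make([]float64, n)
	y := make([]float64, n)
	for i := range x {
		x[i] = float64(i)
	}

	// Dynamic partitioning: workers claim chunks from a shared
	// counter rather than owning a fixed slice of the iteration
	// space, so load balances itself on non-dedicated processors.
	var next int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				lo := int(atomic.AddInt64(&next, chunk)) - chunk
				if lo >= n {
					return
				}
				hi := lo + chunk
				if hi > n {
					hi = n
				}
				axpy(a, x, y, lo, hi)
			}
		}()
	}
	wg.Wait()
	fmt.Println("y[1] =", y[1]) // 2.0
}
```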
wFS: An adaptive data framework for web computing (Dahlin). Although an increasing amount of valuable data resides on the web, current "browser-centric" data-access protocols limit its use. This project seeks to provide stronger cache consistency and data update guarantees that will enable new classes of web-based applications. Because the physical characteristics of the Internet make it expensive to provide some of these guarantees, wFS will pursue an adaptive and application-specific approach. The system will provide a range of consistency and update options with different guarantees and different costs, and applications will pay for only the guarantees that they require. For example, a web browser may emphasize scalability and continue to use the current read-only and weak cache consistency approach. Conversely, a distributed parallel computation may require transactional updates and strict cache consistency even if these guarantees limit its scalability to a few hundred nodes. Two key aspects of the project will be providing a framework for instantiating different consistency and update algorithms under a common interface and providing quantitative criteria that applications can use to select appropriate algorithms.
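One way to picture "different consistency and update algorithms under a common interface" is as interchangeable policy objects. The Go sketch below is hypothetical (wFS's real interface is not specified in this document): it puts a weak, browser-style policy and a stricter, transactional-style policy behind the same interface, so an application pays only for the guarantees it selects:

```go
package main

import "fmt"

// ConsistencyPolicy is a hypothetical common interface under which
// different consistency and update algorithms could be instantiated.
type ConsistencyPolicy interface {
	Read(key string) ([]byte, error)
	Write(key string, val []byte) error
	Name() string
}

// weakCache serves possibly stale data from a local cache, as a web
// browser emphasizing scalability might.
type weakCache struct{ cache map[string][]byte }

func (w *weakCache) Name() string { return "weak" }
func (w *weakCache) Read(key string) ([]byte, error) {
	return w.cache[key], nil // may be stale; no validation cost
}
func (w *weakCache) Write(key string, val []byte) error {
	return fmt.Errorf("weak policy is read-only")
}

// strictStore models the stronger, costlier path that a distributed
// parallel computation might require.
type strictStore struct{ store map[string][]byte }

func (s *strictStore) Name() string { return "strict" }
func (s *strictStore) Read(key string) ([]byte, error) {
	// A real implementation would revalidate with the server here.
	return s.store[key], nil
}
func (s *strictStore) Write(key string, val []byte) error {
	s.store[key] = val // a real implementation would commit atomically
	return nil
}

func main() {
	var p ConsistencyPolicy = &strictStore{store: map[string][]byte{}}
	p.Write("/doc", []byte("v1"))
	v, _ := p.Read("/doc")
	fmt.Println(p.Name(), string(v))
}
```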
Lightweight fault-tolerance (Alvisi and Vin).
The objective of this research is to support and enable a new class of
truly distributed and fault-tolerant applications in which distributed
agents communicate through messages as well as files. Our proposed
solution pursues three goals: transparency, minimal use of dedicated
resources, and low overhead.
To achieve transparency, we plan to engineer our solution as a
middleware layer. To minimize dedicated resources, we plan to use
rollback recovery techniques. To minimize the impact on application
performance and to scale the cost of our solution with the number of
failures that need to be tolerated, we plan to use causal
message-logging protocols.
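The flavor of causal message logging can be suggested with a heavily simplified Go sketch (the types and protocol details here are illustrative assumptions, not the actual protocols we plan to use): each delivery produces a determinant recording its order, and determinants ride on outgoing messages instead of being synchronously written to stable storage:

```go
package main

import "fmt"

// Determinant records the nondeterministic outcome of one delivery:
// which message a process delivered, and in what order.
type Determinant struct {
	From, To   string
	DeliverSeq int
}

// Message piggybacks determinants not yet known to be replicated at
// enough processes to survive the tolerated number of failures.
type Message struct {
	From      string
	Payload   string
	Piggyback []Determinant
}

// Process keeps the determinants it must carry on behalf of others.
type Process struct {
	id       string
	seq      int
	unstable []Determinant
}

// Deliver records the delivery order and stores piggybacked
// determinants, making the sender's past deliveries recoverable.
func (p *Process) Deliver(m Message) {
	p.seq++
	d := Determinant{From: m.From, To: p.id, DeliverSeq: p.seq}
	p.unstable = append(p.unstable, d)
	p.unstable = append(p.unstable, m.Piggyback...)
}

// Send piggybacks all not-yet-stable determinants; no blocking write
// to stable storage sits on the critical path.
func (p *Process) Send(payload string) Message {
	return Message{From: p.id, Payload: payload, Piggyback: p.unstable}
}

func main() {
	a, b := &Process{id: "A"}, &Process{id: "B"}
	b.Deliver(a.Send("m1"))
	a.Deliver(b.Send("m2")) // m2 carries B's determinant for m1
	fmt.Printf("A now holds %d determinant(s) for recovery\n", len(a.unstable))
}
```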
Using current techniques, tolerating hardware-generated faults is
possible, but at the cost of potentially forcing the application to
block for every I/O operation while data critical to recovery are
logged to stable storage. Specifically, one cannot assume that a file
read during the execution will still be available in its original form
during recovery. Hence, input from the file system must be
synchronously logged to stable storage. Furthermore, since the file
system in general cannot roll back, the application must delay output
to the file system until it executes an output commit.
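These costs can be made concrete with a small Go sketch (the names and the log-file stand-in for stable storage are assumptions for illustration): every read blocks until its data are logged, and writes are buffered until an explicit output commit:

```go
package main

import (
	"fmt"
	"os"
)

// LoggedFS models the conventional approach: the file system is a
// detached part of the environment, so recovery data must be logged.
type LoggedFS struct {
	log     *os.File          // stands in for stable storage
	pending map[string][]byte // output delayed until commit
}

// Read synchronously logs the data before returning it, since the
// file may not exist in its original form during recovery.
func (fs *LoggedFS) Read(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if _, err := fs.log.Write(data); err != nil {
		return nil, err
	}
	if err := fs.log.Sync(); err != nil { // the blocking logging cost
		return nil, err
	}
	return data, nil
}

// Write is delayed: the file system cannot roll back, so output is
// buffered until the application executes an output commit.
func (fs *LoggedFS) Write(path string, data []byte) {
	fs.pending[path] = data
}

// Commit releases the delayed output to the real file system.
func (fs *LoggedFS) Commit() error {
	for path, data := range fs.pending {
		if err := os.WriteFile(path, data, 0o644); err != nil {
			return err
		}
	}
	fs.pending = map[string][]byte{}
	return nil
}

func main() {
	logFile, _ := os.CreateTemp("", "stable-log")
	fs := &LoggedFS{log: logFile, pending: map[string][]byte{}}
	fs.Write("out.txt", []byte("result"))
	fmt.Println("output delayed until commit:", len(fs.pending), "file(s)")
	fs.Commit()
}
```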
To address these problems, the middleware that we plan to build will present the file system to the application not as a detached component of the external environment, but as an integrated partner that can be trusted to provide the data needed during recovery. We expect that this will drastically reduce the costs incurred by the application in performing file I/O.
Parallel computing on the world-wide web with Java (Alvisi,
Blumofe, Dahlin, and Lin). This project will use Java as the basis
for a new parallel computing infrastructure.
Last modified: December 13, 1996