Andrew B. Hastings
School of Computer Science
Carnegie Mellon University
Copyright © 1992 Andrew B. Hastings
The union of transactions and distributed shared memory offers synergies in concurrency control and coherency control, but introduces challenges in transaction recovery. In this dissertation, I describe the design of a system that provides transactional distributed shared memory (TDSM) in the form of distributed recoverable virtual memory. Using the external pager interface of the Mach operating system, I implemented a prototype based on the Camelot distributed transaction facility. I analyze the prototype and its performance, offer techniques for improving the design of future TDSM systems, and characterize the applications for which TDSM is useful.
Transaction processing systems provide failure atomicity, permanence, and serializability guarantees for distributed systems. These guarantees simplify the application programmer's task by reducing the attention that must be paid to concurrency and failures.
Distributed shared memory provides a simple model for application programmers by giving the illusion of a single, global address space for all processes participating in a particular distributed application. This illusion simplifies the application programmer's task by eliminating the need for explicit communication operations to obtain data.
My thesis is that the combination of transactions and distributed shared memory is feasible and useful for a certain class of distributed applications, especially applications that have concurrent access to data that must not be corrupted and that need caching to provide adequate performance. In this dissertation, I motivate this idea and discuss the design and implementation of a prototype system providing transactional distributed shared memory (TDSM). I evaluate the implementation, reflect on the design, and suggest a direction for future designs.
Imagine an application developed for a centralized system that must be adapted for use on a distributed system. The programmer making such an adaptation faces a number of difficulties. Instead of a few processors running on a single clock, a distributed system has many processors operating independently. Which processor should operate on the data? How and when does a processor communicate its results to other processors? To the two-level memory hierarchy of main memory and disk storage, a distributed system adds a third level of network access to main memory and disks attached to other nodes. Where should the data be stored? How and when is data transferred between nodes of the distributed system?
A distributed program may need to meet several requirements.
Concurrent updates complicate the task of maintaining data integrity. Distributed systems exacerbate the problem of concurrent updates by increasing concurrency and increasing the possible delay between reads and writes. The problem that concurrency poses is that independent updates can become interleaved and introduce an inconsistent state.
Updates in the presence of failures complicate the task of maintaining data integrity. Failures arise when messages are lost, re-ordered, or corrupted, or when individual nodes and processes crash. The problem that failures pose is that an update may be partially performed, introducing an inconsistent state.
The application programmer could be overwhelmed by the complexity of these issues. The next subsection describes various programming models that can reduce the complexity of the programmer's task.
Several programming models aid in meeting application requirements on a distributed system. The key dichotomy in this subsection is between function shipping (the client/server model), and data sharing (e.g., distributed shared memory). Independent of this dichotomy are transactions, replication and partitioning, and caching. In most current systems, transactions are associated with function shipping, while caching, replication, and partitioning are part of efficient implementations of data sharing.
The client/server model offers a straightforward approach to constructing distributed applications by extending the familiar notion of procedure call. Data may be distributed among multiple nodes in the distributed system. To access data, a client makes a remote procedure call by sending a message to a server on the node where the data resides (possibly the same node as the client). The server examines the message to identify the procedure and its arguments, calls the appropriate procedure, and sends a message containing the results to the client. Because the client's message contains a request for the server to perform a particular function, the client/server model is also known as function shipping. The server can meet security requirements by refusing to perform some operations for a given client.
The client/server model does not directly address availability or data integrity requirements. Remote procedure calls are a relatively expensive mechanism for accessing remote data because the concept does not include any automatic caching. Thus, performance may suffer if remote procedure calls overwhelm the carrying capacity of the network or the processing capacity of a given server.
Transactions make it easier to meet data integrity requirements in the face of concurrency and failures. A transaction is a sequence of actions grouped into a unit. If the application's data is in a consistent state, a transaction must transform the data into a new consistent state. (The data may be transformed into an inconsistent state temporarily while a transaction is in progress as long as consistency is restored by the end of the transaction.) If each transaction executes as an atomic, indivisible unit, then data will never be left in an inconsistent state.
Transaction systems address the concurrency problem by preventing transactions from observing each other's partial updates. Transactions may execute concurrently, but locking or timestamp schemes order the accesses to shared data so that each transaction sees only the values stored by previously completed transactions.
Transaction systems address the failure problem by ensuring that the sequence of actions within a transaction succeed or fail as a unit. Transaction systems may undo or redo the actions of partially completed transactions to simulate atomic execution in the presence of failures. A transaction commits if it runs to completion; if it fails before completion, any changes it makes are undone, and it aborts.
Formally, a transaction has three properties: failure atomicity, permanence, and serializability. Failure atomicity ensures that either all of the operations within the transaction complete successfully, or none of them do (partial updates are undone). Permanence ensures that the effects of a committed transaction are not lost due to failures. Serializability states that there is a serial sequence of transactions that produces the same results as a given concurrent execution of a set of transactions. Thus, concurrently executing transactions cannot observe inconsistent states.
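These properties can be illustrated with a minimal sketch. The following hedged example (not the dissertation's implementation; class and method names are invented for illustration) shows failure atomicity via an undo log: before each write the old value is logged, and abort replays the log in reverse to erase partial updates.

```python
# Minimal sketch of failure atomicity via an undo log. Not a real
# transaction system: a single shared dict stands in for recoverable
# storage, and permanence and serializability are not modeled.

class Transaction:
    def __init__(self, store):
        self.store = store      # shared dict of named data items
        self.undo_log = []      # (key, old_value) pairs, oldest first

    def write(self, key, value):
        self.undo_log.append((key, self.store.get(key)))
        self.store[key] = value

    def commit(self):
        self.undo_log.clear()   # updates become permanent

    def abort(self):
        # Undo partial updates, newest first, to restore consistency.
        for key, old in reversed(self.undo_log):
            if old is None:
                self.store.pop(key, None)
            else:
                self.store[key] = old
        self.undo_log.clear()

store = {"balance": 100}
t = Transaction(store)
t.write("balance", 50)
t.abort()                       # failure before commit: undo the update
print(store["balance"])         # -> 100
```

A real transaction system would also force the undo log to stable storage before the data, so that the same replay works after a crash.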
Replication and partitioning can aid in meeting availability, growth, and performance requirements. If there is more data than one node can store, if there are more client requests than one server can process, or if the application requires better availability than one node can provide, the data must be distributed among several nodes instead of being stored on just one. To increase storage capacity, the data may be partitioned with no overlaps among the nodes. To improve availability, the data may be replicated on a set of nodes, so that other nodes may provide the data if one node goes down. (Partitioning can also improve availability in that part of the data is still available even if one node crashes.) To increase processing capacity, either replication or partitioning may be used, although a poor choice of partitions may still overload a node if the data stored there is accessed too frequently. With replication, a client may contact any server that contains a replica of the data, but extra communication is necessary to maintain a consistent state among the replicas.
Optimizing performance using replication and partitioning can be tricky. When the data is replicated, reading the data is inexpensive since a client may choose the closest or most lightly loaded server. Updates are expensive because updates must eventually be propagated to every replica. The cost of updates may be reduced by delaying the propagation until the data is read, but this increases the cost of reads, may reduce availability, and must be managed carefully to avoid inconsistencies. Thus, replication provides good performance when access to data is primarily read-only. When the data is partitioned, there is little cost difference between updates and reads. However, there is a cost difference associated with the location of the data. A data reference is less expensive when the data is local to the client's node. A data reference is more expensive when the client must contact a remote node. Thus, partitioning provides good performance when there is high locality of reference (i.e., most references are local rather than remote). If the frequently-referenced set of data changes over time, then good performance can be maintained only if partitioning changes with it.
Application performance can be improved through the use of caching. A cache reduces communication costs by remembering previous requests and corresponding replies. If a client wishes to make a request that matches a previous request, the cache may be able to replay the corresponding reply, eliminating the communication and processing necessary to repeat the request to the appropriate server. Offsetting this savings is extra communication and processing needed to remove stale data from the cache: the cache should not replay a previous reply if it knows that the server would return a different reply to the repeated request. A cache is coherent if it does not return stale data to the client.
A general-purpose cache of requests and replies is difficult for the communication system to provide because it has little knowledge of the internal semantics of requests and replies, and thus does not know when it is appropriate to delete stale data from the cache. Application programmers can build application-specific caches with knowledge of request/reply semantics, but the cache coherency problem is complicated enough to be the subject of current research, and many applications may need to be changed as new solutions become available.
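The coherency obligation described above can be made concrete with a small sketch. The example below (hypothetical names; explicit invalidation stands in for whatever application-specific knowledge determines staleness) shows why a cache must discard a stale reply rather than replay it.

```python
# Sketch of a request/reply cache with explicit invalidation. The
# invalidate() call is where application-specific semantic knowledge
# enters: a general-purpose communication layer cannot know when to
# make it.

class ReplyCache:
    def __init__(self, server_fn):
        self.server_fn = server_fn   # stands in for an RPC to the server
        self.replies = {}

    def request(self, req):
        if req not in self.replies:        # miss: contact the server
            self.replies[req] = self.server_fn(req)
        return self.replies[req]           # hit: replay the cached reply

    def invalidate(self, req):
        self.replies.pop(req, None)        # drop the stale entry

data = {"x": 1}
cache = ReplyCache(lambda key: data[key])
assert cache.request("x") == 1
data["x"] = 2                              # server state changes
cache.invalidate("x")                      # required to stay coherent
assert cache.request("x") == 2
```

Without the `invalidate` call, the second `request` would replay the stale reply 1, which is exactly the incoherence the text describes.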
Distributed shared memory can make it easier for the application programmer to obtain good performance. Shared memory provides processes with a shared address space; processes sharing an address space may access shared information directly without explicit communication operations such as messages or remote procedure calls. Distributed shared memory (DSM) provides a shared address space via software and/or hardware to processes on different nodes that do not physically share memory. A system with virtual memory hardware can use software to provide distributed shared virtual memory.
A well-designed distributed shared memory system can improve application performance through the use of caching. Caching at the memory system level can benefit all applications that use shared memory, and advances in cache coherency algorithms need be implemented in only one system rather than multiple applications. A distributed shared memory cache may be simpler than application-specific caches since it need recognize only four requests: read, write, lock, and unlock. Since read and write requests are built into the hardware, the distributed shared memory cache can often make use of hardware assists (such as virtual memory) to improve performance. Distributed shared memory will not necessarily benefit all applications, however, since it may transfer more data than is needed by the application, and the cost of keeping the cached data consistent may be too large relative to the cost of the processing performed on the data.
Distributed shared memory is an example of data shipping, or the data sharing model. In data sharing, data is sent to the node where it is to be processed, under the assumption that the data will be used there again. Data sharing should be contrasted with the client/server model, or function shipping, in which the function request is sent to the node where the data resides. Data sharing and function shipping are functionally equivalent since each can be expressed in terms of the other. Essentially, data sharing is a specialized implementation of a few remote procedure calls (in the case of distributed shared memory, read, write, lock, and unlock). The messages underlying function shipping can be implemented via message queues that are data-shared. The difference between the two approaches is one of emphasis: function shipping focuses on the flow of control, whereas data sharing focuses on the flow of data.
Data sharing offers the performance benefits of both replication and partitioning. When data is read at a node, it is stored in the cache at that node. Thus, the data sharing model automatically replicates data that is read-only. When data is updated at a node, it is stored at that node and becomes stale in the caches of all other nodes. Thus, the data sharing model automatically partitions data that is frequently updated at one node and not accessed at other nodes. As the frequently-referenced set of data changes over time, partitioning changes with it.
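The automatic replicate-on-read, partition-on-write behavior described above can be sketched as a simple write-invalidate protocol. The example below is a simplification (invented class names; a central directory, page granularity, no locking) rather than any particular DSM system's algorithm.

```python
# Write-invalidate caching sketch: reads replicate a page into the
# local cache; a write invalidates all other cached copies, so pages
# updated repeatedly at one node end up "partitioned" to that node.

class DSMPage:
    def __init__(self, value=0):
        self.value = value
        self.cached_at = set()   # nodes currently holding a copy

class Node:
    def __init__(self, directory):
        self.directory = directory   # page name -> DSMPage
        self.cache = {}

    def read(self, name):
        page = self.directory[name]
        if name not in self.cache:           # miss: fetch and replicate
            self.cache[name] = page.value
            page.cached_at.add(self)
        return self.cache[name]

    def write(self, name, value):
        page = self.directory[name]
        for node in page.cached_at - {self}: # invalidate other copies
            node.cache.pop(name, None)
        page.cached_at = {self}
        page.value = value
        self.cache[name] = value

directory = {"p": DSMPage(10)}
a, b = Node(directory), Node(directory)
assert a.read("p") == 10 and b.read("p") == 10  # replicated for reading
a.write("p", 20)                                # invalidates b's copy
assert b.read("p") == 20                        # b refetches fresh value
```

As the final reads show, read-mostly pages behave like replicated data, while a page written at one node migrates there, matching the replication/partitioning tradeoffs discussed earlier.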
This dissertation explores a new model for building distributed applications: transactional distributed shared memory. There are several reasons for exploring this model.
The client/server model, as provided by systems such as Sun RPC, NCS, and MIG, has several limitations. Concurrency and failures are hard to deal with, and distributing the data to obtain good performance can be difficult.
Data sharing has been used widely in distributed file systems (such as NFS, AFS, and DCE), and is a popular research topic in the form of distributed shared memory [1, 2, 3, 6, 7, 14, 18, 22].
Among the many systems that have demonstrated the value of transactions are CICS, Tuxedo, Camelot, and Encina.
Combining data sharing and transactions offers the benefits of both, and the combined system provides opportunities for optimizing performance. For example, the failure atomicity property of transactions may be provided in part by writing a log containing all of the changes made to a data structure. This log write could also serve to update the contents of the caches in a distributed shared memory system, eliminating the need for additional messages to keep caches coherent.
TDSM will not benefit all distributed applications. Transactions may not benefit applications that can tolerate weaker consistency guarantees but not the extra overhead of transactions. Distributed shared memory does not improve the performance of applications if the distributed shared memory system transfers more data than is needed by the application. If the cost of transferring the data is not amortized over repeated accesses, it may be less expensive to ship function requests to the original location of the data. Also, application-specific security restrictions are harder to provide with distributed shared memory. For example, given a set of salaries, the client/server model can easily restrict a particular user to viewing the sum of the salaries but not the individual salaries. Since distributed shared memory can only restrict requests it knows about (read and write), a user must be able to read individual salaries in order to view the sum that is computed from those salaries.
The major challenge in designing a TDSM system is transaction recovery. To simulate atomic execution, a transaction system may need to undo the modifications a partially completed transaction has made to an object in distributed shared memory. However, the DSM object may have migrated from the node where the modification was originally made, so the transaction system must somehow locate both the object and a description of the modification in order to abort the transaction.
To guarantee permanence, a transaction system may need to redo the modifications made by a committed transaction if the modified object is cached on a node that fails. The transaction system must either locate the modified object on another node, or locate an older version of the object, and redo the modifications made by the committed transaction. In a complicated distributed system with many transactions concurrently modifying many objects in distributed shared memory, the task of reliably locating the appropriate version of an object, and/or locating a description of the modifications made by a particular transaction can be very difficult.
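The undo/redo obligations described in the two preceding paragraphs can be sketched as log replay. The following example is illustrative only (it is not Camelot's recovery algorithm and assumes the whole log is available at one place, as it is with a home node): committed transactions are redone oldest-first, then uncommitted ones are undone newest-first.

```python
# Sketch of redo/undo recovery from a single log. Each record names a
# transaction, an object, and the object's old and new values.

def recover(log, committed, objects):
    """Restore `objects` to a transaction-consistent state."""
    for tid, obj, old, new in log:            # redo pass, oldest first
        if tid in committed:
            objects[obj] = new                # reinstall committed update
    for tid, obj, old, new in reversed(log):  # undo pass, newest first
        if tid not in committed:
            objects[obj] = old                # erase partial update
    return objects

log = [("t1", "x", 0, 1),   # t1 committed before the crash
       ("t2", "y", 0, 5)]   # t2 was still in progress at the crash
state = recover(log, committed={"t1"}, objects={"x": 0, "y": 5})
assert state == {"x": 1, "y": 0}
```

The hard part in a TDSM system is precisely what this sketch assumes away: reliably locating the log records and the appropriate version of each object when both may be scattered across nodes.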
The key issue in designing a recovery algorithm for TDSM is the tradeoff between availability and performance. For good performance, objects in DSM must remain cached at the site of use, and modifications to those objects should be recorded in as few places as possible. For high availability, objects in DSM, as well as records of modifications to those objects, should be stored in as many places as possible.
An example illustrates the utility of transactional distributed shared memory, and suggests the characteristics of applications that can benefit from TDSM.
A classic application that involves problems of concurrency and failures is an airline database. Reservations are entered into the database by travel agents and reservations agents, and may refer to several flights. Reservations are indexed by passenger name, flight number, and departure date. A flight database records the flight schedule, the cities and aircraft involved, and the seating capacities. For each flight on each date there is a record of the reservations for that flight. Reservations are updated as requested by passengers, as flight schedules are changed, and as flights are flown. Locality of reference arises as the data for a particular passenger or a particular flight is referenced. If two agents concurrently attempt to reserve a seat, the database should not lose either reservation and should not reserve more seats than are available. If a processor fails while storing a reservation, the database should not allow the indices to become inconsistent.
Transactional distributed shared memory supports a solution to the problems of failures and concurrency in the airline database, and provides good performance. To handle the volume of reservations, the data is distributed across several processors. Distributed shared memory automatically partitions the data among nodes to match the locality of reference, and migrates the data as references move. Transactions prevent failures or concurrency from creating inconsistencies during updates.
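The overbooking hazard in the airline example can be made concrete. The sketch below is hypothetical (invented names; a Python lock stands in for the locks a TDSM system would supply, and a boolean return stands in for commit/abort): the capacity check and the reservation insert execute as one serialized unit, so two concurrent agents cannot both claim the last seat.

```python
# Sketch of a serialized seat reservation. In a TDSM system the lock
# and the reservation data would live in transactional distributed
# shared memory; here threading.Lock models the serializability
# guarantee on a single machine.

import threading

class Flight:
    def __init__(self, capacity):
        self.capacity = capacity
        self.reservations = []
        self.lock = threading.Lock()

    def reserve(self, passenger):
        with self.lock:              # check and update are one unit
            if len(self.reservations) >= self.capacity:
                return False         # "abort": the flight is full
            self.reservations.append(passenger)
            return True              # "commit": seat recorded

flight = Flight(capacity=1)
results = []
threads = [threading.Thread(target=lambda n: results.append(flight.reserve(n)),
                            args=(name,))
           for name in ("agent1", "agent2")]
for t in threads: t.start()
for t in threads: t.join()
assert sorted(results) == [False, True]   # exactly one reservation wins
```

Without the lock, both agents could pass the capacity check before either append, which is exactly the lost-update interleaving that transactions rule out.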
An approach that has been used to construct the airline database is to store the data on a large, centralized server that can be queried and updated from remote terminals. The disadvantage of this approach is that the centralized server becomes a performance bottleneck as the number of clients increases. Higher throughput can be obtained by splitting the database among several systems. Function shipping and caching can keep the data consistent among systems. But TDSM offers a simpler approach for migrating the airline database to a distributed environment, since the centralized server can run on a TDSM system with few or no changes.
The background chapter of the dissertation addresses the issues involved in implementing transactions and distributed shared memory. After presenting an overview of distributed computing, the chapter discusses support for transactions in terms of seven major functions: transaction management, recovery management, communication management, configuration management, concurrency management, buffer management, and log management. Next, the chapter describes solutions to the cache coherency problem of data sharing that is central to distributed shared memory, and outlines techniques for distributed concurrency control in conjunction with data sharing. The chapter concludes with a survey of related work in the area of transactional distributed shared memory (TDSM).
The environment chapter of the dissertation introduces the Camelot distributed transaction facility and the Mach operating system on which Camelot runs. Camelot demonstrated that transactions could be efficiently layered on an operating system kernel as a general-purpose facility. Camelot is well-documented, and the source code is readily available for research use; thus, Camelot is a reasonable place from which to start work on transactional distributed shared memory.
The design chapter of the dissertation presents an architecture for implementing TDSM in a general purpose transaction processing facility. The key design decision underlying the architecture is the concept of the home node, which provides a simple, direct method for achieving transaction guarantees. The challenge of transaction recovery is addressed by storing log records and data blocks on the home node, and coordinating recovery from the home node. Caching of data and log records is used to improve performance and reduce the load on the home node.
The implementation chapter of the dissertation provides an overview of the implementation of this architecture in the Camelot distributed transaction facility. Camelot's concept of a recoverable virtual memory (RVM) segment is extended in two ways. First, each RVM segment can be shared by multiple Camelot servers. Second, Camelot services and pages of any RVM segment can be accessed over the network.
The major additions to Camelot for TDSM are the Lock Manager (to ensure serializability of accesses to shared data), the Remote Execution Manager (to assist in remote access to Camelot services and RVM segments), and the External Memory Manager (to maintain the consistency of RVM pages across nodes). To support TDSM, data structures in the Camelot Disk Manager, Recovery Manager, and Node Server are modified to separate RVM segments from the servers that access them. Forward processing and recovery algorithms are similarly affected.
The performance chapter of the dissertation describes the performance of the implementation, reporting the latency of individual operations and ET1 throughput figures in various configurations. The performance results show that reading an RVM page that is cached is no more expensive for remote access to an RVM segment than for local access. Writing a cached page and committing a transaction are somewhat more expensive because of the cost of sending log records and transaction commit messages across the network. Access to a page that is not cached is significantly more expensive for remote access because the underlying RPC system in the experimental environment performs poorly on large (page-size) messages.
The analysis chapter of the dissertation models the TDSM system's performance in terms of characteristics of the underlying hardware and software. The performance of the system is extrapolated to two different platforms. ET1 throughput is limited primarily by paging disk throughput, and thus the ET1 benchmark (as used in the performance measurements) is not really an appropriate application for TDSM. Even so, when the underlying platform is optimized for TDSM, an ET1 transaction that pages over the network can achieve a latency within 10% of the latency of an ET1 transaction that uses a local paging disk. Adapting the ET1 benchmark to TDSM to improve locality would further reduce the latency.
Possible optimizations to the TDSM implementation are also modeled. The failure atomicity property of transactions may be provided in part by writing a log containing all of the changes made to a data structure. This log write may also serve to update the contents of the caches in a distributed shared memory system, eliminating the need for additional messages to keep caches coherent. Another possible optimization is to combine locking with cache coherency: locks can be obtained automatically as an item is cached, and released automatically when an item leaves the cache. The model shows that these optimizations offer modest improvements in performance.
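The second optimization above can be sketched briefly. The example is a simplification (invented names; exclusive locks only, and a shared set standing in for a lock manager): the lock on an item is acquired as a side effect of caching it and released only on eviction, so repeated accesses need no separate lock messages.

```python
# Sketch of combining locking with cache coherency: lock acquisition
# piggybacks on the cache fill, and lock release piggybacks on
# eviction, so cache hits require no lock traffic at all.

class LockingCache:
    def __init__(self, lock_manager, fetch):
        self.lock_manager = lock_manager   # shared set of held locks
        self.fetch = fetch                 # stands in for a page fetch
        self.items = {}

    def get(self, key):
        if key not in self.items:
            assert key not in self.lock_manager, "lock held elsewhere"
            self.lock_manager.add(key)     # lock acquired with the fill
            self.items[key] = self.fetch(key)
        return self.items[key]             # later hits need no lock call

    def evict(self, key):
        self.items.pop(key, None)
        self.lock_manager.discard(key)     # lock released with eviction

locks = set()
cache = LockingCache(locks, fetch=lambda k: k.upper())
assert cache.get("a") == "A" and "a" in locks   # fill acquires the lock
cache.evict("a")
assert "a" not in locks                         # eviction releases it
```

The tradeoff, as the modeling in this chapter suggests, is that holding locks for the lifetime of a cache entry can delay other nodes, so the improvement is modest rather than dramatic.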
This chapter outlines some of the issues in designing future TDSM systems. During recovery, log records must be located in order to restore data blocks to a transaction-consistent state. Pages must be located during both recovery and forward processing. To improve availability, pages and log records should be replicated, complicating the algorithms that locate them, and the transaction commit protocols. Allowing a modified block to migrate before transaction commit can improve throughput, but complicates recovery.
The chapter concludes with a discussion of the deficiencies of the architecture and its implementation. Relatively simple changes to a few Camelot components can improve configurability and security. More serious deficiencies are limitations to availability and incremental growth. Failure of the home node effectively causes all data to be unavailable. Incremental growth in storage capacity is difficult, since only the home node stores data, and using multiple home nodes for a single application is non-trivial. Incremental growth in processing capacity is limited by the load of paging requests and transaction commit requests that the home node can process.
Some of these deficiencies can be addressed by hardening the home node through replicating hardware or software services. A more promising alternative is to eliminate the home node completely, distributing its functions among all nodes in the environment.
Transactional distributed shared memory (TDSM) offers an interesting alternative to the traditional technique of extending transactions to a distributed system via transactional RPC. Remote procedure call (RPC) is attractive because it uses the familiar paradigm of procedure call to hide the details of constructing a message, sending it, awaiting a reply, and unpacking the reply. Unfortunately, RPC cannot hide the time it takes to send and receive a message. As a result, programmers who wish to improve the performance of their applications may try to reduce the number of RPCs they make through the use of a cache of requests and replies. In a general-purpose RPC system where each operation may have complicated semantics, it is difficult to keep caches consistent. The beauty of TDSM is that the semantics of "memory" are simple, and memory is used by almost every application. Memory cache coherency is a popular research topic, and TDSM should benefit from continuing advances in the field.
The benefits of TDSM over transactional RPC include:
The performance advantages claimed for TDSM should not be considered in isolation. Knowing the semantics of a given application, a programmer could design an application-specific cache that transfers just the data that is needed, and thus outperforms the more general caching of TDSM. But TDSM offers performance gains in conjunction with ease of use: an application benefits from TDSM caching with no additional programming.
TDSM may not be suitable for all applications. TDSM suffers in several areas:
This dissertation argues that it is feasible and useful to provide TDSM as a tool for constructing distributed applications. The specific contributions of the dissertation are summarized below:
As applications become more sophisticated in their use of RPC, caching of requests and replies will become more common. This implies that a sophisticated application will not use RPC directly. The question for system designers is whether to concentrate on providing the tools needed to build RPC-based systems, or to look instead at providing shared access to data with a common caching mechanism for all applications. In this dissertation, I investigated the second option in the context of transactions, presenting evidence that transactional distributed shared memory is a feasible and useful tool for constructing distributed applications.