Disconnected Operation in a Distributed File System

James J. Kistler
May 1993

Abstract

Disconnected operation refers to the ability of a distributed system client to operate despite server inaccessibility by emulating services locally. The capability to operate disconnected is already valuable in many systems, and its importance is growing with two major trends: the increasing scale of distributed systems, and the proliferation of powerful mobile computers. The former makes clients vulnerable to more frequent and less controllable system failures, and the latter introduces an important class of clients which are disconnected frequently and for long durations -- often as a matter of choice.

This dissertation shows that it is practical to support disconnected operation for a fundamental system service: general purpose file management. It describes the architecture, implementation, and evaluation of disconnected file service in the Coda file system. The architecture is centered on the idea that the disconnected service agent should be one and the same with the client cache manager. The Coda cache manager prepares for disconnection by pre-fetching and hoarding copies of critical files; while disconnected it logs all update activity and otherwise emulates server behavior; upon reconnection it reintegrates by sending its log to the server for replay. This design achieves the goal of high data availability -- users can access many of their files while disconnected -- but it does not sacrifice the other positive properties of contemporary distributed file systems: scalability, performance, security, and transparency.
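The three phases described above (hoarding, emulation with logging, and reintegration by log replay) can be illustrated with a minimal sketch. This is not Coda's implementation. The names `Server`, `CacheManager`, `hoard`, `disconnect`, and `reconnect`, and the example paths, are hypothetical and chosen only for illustration. The sketch assumes a single client and an in-memory server, and it omits the conflict detection that a real reintegration would need when data is shared across a partition.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical in-memory stand-in for a file server."""
    files: dict = field(default_factory=dict)

    def fetch(self, path):
        return self.files[path]

    def replay(self, log):
        # Apply logged updates in the order the client performed them.
        # Assumption: no conflict checks, since only one client exists here.
        for op, path, data in log:
            if op == "store":
                self.files[path] = data


class CacheManager:
    """Toy client cache manager: hoard, emulate while disconnected, reintegrate."""

    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.log = []
        self.connected = True

    def hoard(self, paths):
        # Prefetch critical files while the server is still reachable.
        for path in paths:
            self.cache[path] = self.server.fetch(path)

    def disconnect(self):
        self.connected = False

    def read(self, path):
        if path in self.cache:
            return self.cache[path]
        if not self.connected:
            # A disconnected cache miss cannot be serviced locally.
            raise FileNotFoundError(f"cache miss while disconnected: {path}")
        self.cache[path] = self.server.fetch(path)
        return self.cache[path]

    def write(self, path, data):
        self.cache[path] = data
        if self.connected:
            self.server.files[path] = data
        else:
            # Emulation: record the update locally for later replay.
            self.log.append(("store", path, data))

    def reconnect(self):
        # Reintegration: send the log to the server for replay, then clear it.
        self.server.replay(self.log)
        self.log.clear()
        self.connected = True


server = Server({"/coda/notes.txt": "v1", "/coda/other.txt": "x"})
client = CacheManager(server)
client.hoard(["/coda/notes.txt"])
client.disconnect()
client.write("/coda/notes.txt", "v2")   # logged, not yet visible on the server
client.reconnect()                      # server now holds "v2"
```

In this sketch a file that was not hoarded, such as `/coda/other.txt`, cannot be read while disconnected, which corresponds to the disconnected cache misses that hoarding is meant to avoid.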

The system has been in serious use by more than twenty people over the course of two years. Both stationary and mobile workstations have been employed as clients, and disconnections have ranged up to about ten days in length. Usage experience has been extremely positive. The hoarding strategy has sufficed to avoid most disconnected cache misses, and partitioned data sharing has been rare enough to cause very few reintegration failures. Measurements and simulation results indicate that disconnected operation in Coda should be equally transparent and successful at much larger scale.

The main contributions of the thesis work and this dissertation are the following: a new, client-based approach to data availability that exploits existing system structure and has special significance for mobile computers; an implementation of the approach of sufficient robustness that it has been put to real use; and analysis which sheds further light on the scope and applicability of the approach.