Notes on Distributed Mutual Exclusion
15-440, Fall 2012
Carnegie Mellon University
Randal E. Bryant

Reading: Tanenbaum, Sect. 6.3

Goal: Maintain mutual exclusion among a set of n distributed processes.
Each process executes a loop of the form:

    while true:
        Perform local operations
        Acquire()
        Execute critical section
        Release()

Whereas multithreaded systems can use shared memory, we assume that
processes can only coordinate via message passing.

Terminology: Define a "cycle" as one round of the protocol, where some
process acquires the lock, completes its critical section, and then
releases it.

Requirements:

1. Safety.  At most one process holds the lock at a time.
2. Fairness.  Any process that requests the lock must eventually be granted it.
   A. Implies that the system must be deadlock-free.
   B. Assumes that no process will hold onto the lock indefinitely.
   C. Eventual fairness: a waiting process will not be excluded forever.
   D. Bounded fairness: a waiting process will get the lock within some
      bounded number of cycles (typically n).

Other possible goals:

1. Low message overhead
2. No bottlenecks
3. Tolerate out-of-order messages
4. Allow processes to join the protocol or to drop out
5. Tolerate failed processes
6. Tolerate dropped messages

For today, we will only consider goals 1-3.  I.e., assume:

* The total number of processes is fixed at n.
* No process fails or misbehaves.
* Communication never fails, but messages may be delivered out of order.

Scheme 1: Centralized Mutex Server

Assume there is a single server that acts as a lock manager.  It maintains
a queue Q containing lock requests that have not yet been granted, plus a
flag "held" recording whether some process currently holds the lock.

Operation on process i:

    Acquire():
        Send (Request, i) to manager
        Wait for reply

    Release():
        Send (Release) to manager

Operation at server:

    while true:
        m = Receive()
        if m == (Request, i):
            if !held:
                held = true
                Send (Grant) to i
            else:
                Add i to Q
        if m == (Release):
            if !empty(Q):
                Remove ID j from Q
                Send (Grant) to j        // lock passes directly to j
            else:
                held = false

Correctness:

* Clearly safe: the server grants the lock to only one process at a time.
* Fairness depends on the queuing policy.  E.g., if the server always gave
  priority to the lowest process ID, then processes 1 & 2 could keep making
  requests and thereby exclude process 3.  With a round-robin or FIFO
  policy, a response is guaranteed within n cycles.

Performance:

* 3 messages per cycle (1 request, 1 grant, 1 release).
* The lock server creates a bottleneck.
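The server loop above can be made concrete with a short sketch.  This is a
minimal single-threaded version, assuming hypothetical receive() and
send(msg, to=pid) transport helpers that are not part of these notes:

    from collections import deque

    def lock_server(receive, send):
        # Sketch of the centralized lock manager.
        # Q holds requesters that have not yet been granted the lock;
        # held records whether some process currently holds it.
        Q = deque()          # FIFO policy => a waiter is granted within n cycles
        held = False
        while True:
            m = receive()
            if m[0] == "Request":
                i = m[1]
                if not held:
                    held = True
                    send(("Grant",), to=i)
                else:
                    Q.append(i)
            elif m[0] == "Release":
                if Q:
                    j = Q.popleft()
                    send(("Grant",), to=j)   # lock passes directly to the next waiter
                else:
                    held = False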
Ricart & Agrawala's algorithm (1981)

Relies on Lamport totally ordered clocks, having the following properties:

1. For any events e, e' such that e --> e' (causality ordering), T(e) < T(e').
2. For any distinct events e, e', T(e) != T(e').

Notation: Ni = {1, 2, ..., i-1, i+1, ..., n}, where n is the number of processes.

General idea: When node i wants to enter its C.S., it sends a time-stamped
request to all other nodes.  These other nodes reply (eventually).  When i
has received n-1 replies, it can enter its C.S.

Trick: A node j that has an earlier request of its own does not reply to i
until after it has completed its C.S.

Message types:

    (Request, i, T): Process i requests the lock with timestamp T
    (Reply, j):      Process j responds to some request for the lock

For each node i, maintain the following values:

    Ti():         Function that returns the value of the local Lamport clock
    should_defer: Boolean.  Set when process i should defer replies to requests
    Tr:           Timestamp of the pending local request
    R:            Subset of Ni.  Set of processes from which i has received a reply
    D:            Subset of Ni.  Set of processes for which i has deferred the
                  reply to their requests
    lock(), unlock(): A local mutex lock, to keep the two threads from
                  interfering with each other

Process i consists of two threads: one servicing the application, and one
monitoring the network.

Application thread:

    Request()                 // Request global mutex
    Wait for Notification     // Wait until notified by network thread
    Critical Section          // Operate in exclusive mode
    Release()                 // Release mutex

Application functions:

    Request():
        lock()                // Don't want app & network functions to step on each other
        Tr = Ti()             // Get timestamp
        R = {}
        D = {}
        should_defer = true
        Send (Request, i, Tr) to each j in Ni
        unlock()

    Release():
        lock()
        should_defer = false
        Send (Reply, i) to each j in D
        unlock()

Network functions:

    while true:
        m = Receive()
        lock()
        if m == (Request, j, T):
            if should_defer && Tr < T:
                D = D U {j}                // Defer response to j
            else:
                Send (Reply, i) to j
        else if m == (Reply, j):
            R = R U {j}
            if R == Ni:
                Notify application
        unlock()

Performance:

Define a "cycle" to be a complete round of the protocol, with one process i
entering its critical section and then exiting.  Each cycle involves 2(n-1)
messages:

    n-1 requests by i
    n-1 replies to i

Correctness:

Mutual exclusion: Cannot have two nodes in their critical sections at the
same time.  Look at nodes A & B, and suppose both were allowed to be in
their critical sections at the same time.

* A must have sent (Request, A, Ta) and received B's reply (Reply, B).
* B must have sent (Request, B, Tb) and received A's reply (Reply, A).

Case 1: One received the other's request before sending its own request.
E.g., B received (Request, A, Ta) before sending (Request, B, Tb).  Then we
would have Ta < Tb, and A would not have replied to B until after leaving
its C.S.

Case 2: Both sent their requests before receiving the other's request.
But Ta & Tb must still be ordered; suppose Ta < Tb.  Then A would not send
its reply to B until after leaving its C.S.

Deadlock free: Cannot have a cycle where each node is waiting for some
other node.  Consider the two-node case: nodes A & B deadlock each other
only if A deferred its reply to B and B deferred its reply to A, but this
would require both Ta < Tb and Tb < Ta.  For the general case, we would
need a set of nodes A, B, C, ..., Z such that A is holding a deferred reply
to B, B to C, ..., Y to Z, and Z to A.  This would require
Ta < Tb < ... < Tz < Ta, which is not possible.

Starvation free: If a node makes a request, it will be granted eventually.

Claim: If node A makes a request with timestamp Ta, then eventually all
nodes will have local clocks > Ta.

Justification: From the request onward, every message A sends will have a
timestamp > Ta, and all nodes update their local clocks upon receiving
those messages.  So, eventually, A's request will have a lower timestamp
than any other node's request, and it will be granted.

Ricart & Agrawala Example

Processes 1, 2, 3.  Create totally ordered clocks by having process i
compute timestamps of the form T(e) = 10*L(e) + i, where L(e) is a regular
Lamport clock (the low-order digit breaks ties between processes).  A small
sketch of such a clock follows.
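This is a minimal sketch of the totally ordered clock used in the example.
The class name and method names are hypothetical (not part of the notes);
only the encoding T(e) = 10*L(e) + i and the usual Lamport update rules
come from the text above.

    class TotalOrderClock:
        # Totally ordered Lamport clock for the example: T(e) = 10*L(e) + i,
        # where L(e) is an ordinary Lamport clock and i is the process ID.
        def __init__(self, pid, initial_L):
            self.pid = pid
            self.L = initial_L

        def timestamp(self):
            return 10 * self.L + self.pid

        def local_event(self):
            # Advance the clock for a local event or a send
            self.L += 1
            return self.timestamp()

        def on_receive(self, T):
            # On receipt of a message stamped T, jump ahead of the sender
            self.L = max(self.L, T // 10) + 1
            return self.timestamp()

    # Reproducing the first few rows of the table below:
    p1, p2, p3 = TotalOrderClock(1, 42), TotalOrderClock(2, 11), TotalOrderClock(3, 14)
    T_req = p3.local_event()              # 153: P3 broadcasts (Request, 3, 153)
    assert p2.on_receive(T_req) == 162    # P2 receives the request
    assert p1.on_receive(T_req) == 431    # P1 receives the request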
Initial timestamps -- P1: 421, P2: 112, P3: 143

Action types:

    R m:       Receive message m
    B m:       Broadcast message m to all other processes
    S m to j:  Send message m to process j

    Process   T1    T2    T3    Action
              421   112   143
       3                  153   B (Request, 3, 153)
       2            162         R (Request, 3, 153)
       1      431               R (Request, 3, 153)
       1      441               S (Reply, 1) to 3
       2            172         S (Reply, 2) to 3
       3                  453   R (Reply, 1)
       3                  463   R (Reply, 2)
       3                  473   Enter critical section
       1      451               B (Request, 1, 451)
       2            182         B (Request, 2, 182)
       3                  483   R (Request, 1, 451)
       3                  493   R (Request, 2, 182)
       1      461               R (Request, 2, 182)
       2            462         R (Request, 1, 451)   # 2 has D = {1}
       1      471               S (Reply, 1) to 2     # 2 has higher priority
       2            482         R (Reply, 1)
       3                  503   S (Reply, 3) to 1     # Release lock
       3                  513   S (Reply, 3) to 2
       1      511               R (Reply, 3)          # 1 has R = {3}
       2            522         R (Reply, 3)          # 2 has R = {1, 3} = N2
       2            532         Enter critical section
       2            542         S (Reply, 2) to 1     # Release lock
       1      551               R (Reply, 2)          # 1 has R = {2, 3} = N1
       1      561               Enter critical section
      ...

Overall flow: P1 and P2 compete for the lock after it is released by P3.
P1's request has timestamp 451, while P2's request has timestamp 182.  P2
defers its reply to P1, but P1 replies to P2 immediately.  This allows P2
to proceed ahead of P1.
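To tie the pieces together, here is a minimal sketch of one Ricart &
Agrawala node in Python.  It is a sketch under assumptions: a hypothetical
send(msg, to=j) transport helper, a clock object like the TotalOrderClock
sketch above, and a threading.Condition standing in for the pseudocode's
lock()/unlock() and notification.

    import threading

    class RANode:
        # Sketch of one R&A node (process i); others = Ni.
        def __init__(self, i, others, clock, send):
            self.i, self.others, self.clock, self.send = i, set(others), clock, send
            self.cv = threading.Condition()   # protects state; wakes the app thread
            self.should_defer = False
            self.Tr = None
            self.R = set()                    # processes that have replied to us
            self.D = set()                    # processes whose replies we deferred

        def request(self):                    # called by the application thread
            with self.cv:
                self.Tr = self.clock.local_event()
                self.R, self.D = set(), set()
                self.should_defer = True
                for j in self.others:
                    self.send(("Request", self.i, self.Tr), to=j)
                while self.R != self.others:  # wait for all n-1 replies
                    self.cv.wait()
            # caller now enters its critical section

        def release(self):                    # called after the critical section
            with self.cv:
                self.should_defer = False
                for j in self.D:
                    self.send(("Reply", self.i), to=j)

        def on_message(self, m):              # called by the network thread
            with self.cv:
                if m[0] == "Request":
                    _, j, T = m
                    self.clock.on_receive(T)
                    if self.should_defer and self.Tr < T:
                        self.D.add(j)         # our request has priority: defer
                    else:
                        self.send(("Reply", self.i), to=j)
                elif m[0] == "Reply":
                    self.R.add(m[1])
                    if self.R == self.others:
                        self.cv.notify()      # wake the application thread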
Lamport's Distributed Mutual Exclusion (1978)

Also relies on Lamport totally ordered clocks, having the following properties:

1. For any events e, e' such that e --> e' (causality ordering), T(e) < T(e').
2. For any distinct events e, e', T(e) != T(e').

More complex than R&A:

* 3 rounds of messages:
    - Request message sent to all other nodes to ask for the lock
    - Reply messages sent back before the requester enters its C.S.
    - Release message sent after the requester leaves its C.S.
* Each node must maintain a local priority queue, ordered by timestamp.

An interesting demonstration of maintaining replicas of a data structure at
all locations.  The initial version (1978) assumed messages are received in
the same order as sent ("FIFO ordering").  Our version doesn't require this
assumption; it only assumes that any message sent is eventually received,
and that messages are never corrupted.

Message types:

    (Request, i, T): Process i requests the lock with timestamp T
    (Reply, j):      Process j responds to some request for the lock
    (Release):       Release the lock

For each node i, maintain the following values:

    Ti():     Function that returns the value of the local Lamport clock
    waiting:  Boolean.  Set when process i wants the lock
    Q:        Priority queue with entries of the form (j, T), indicating that
              process j has a request with timestamp T.  Ordered so that the
              entry with the lowest timestamp is at the head.
    Tr:       Timestamp of the pending local request
    R:        Subset of Ni.  Set of processes from which i has received a reply
              for its request
    D:        Subset of Ni.  Set of processes for which i has deferred the reply
              to their requests
    lock(), unlock(): A local mutex lock to synchronize the two threads

Process i consists of two threads: one servicing the application, and one
monitoring the network.

Application thread:

    Request()                 // Request global mutex
    Wait for Notification     // Wait until notified by network thread
    Critical Section          // Operate in exclusive mode
    Release()                 // Release mutex

Application functions:

    Request():
        lock()
        Tr = Ti()             // Get timestamp
        R = {}
        D = {}
        Send (Request, i, Tr) to each j in Ni
        Add (i, Tr) to Q
        waiting = true
        unlock()

    Release():
        lock()
        Send (Release) to each j in Ni
        Pop top element from Q
        unlock()

Network function:

    while true:
        m = Receive()
        lock()
        if m == (Request, j, T):
            Add (j, T) to Q
            if waiting && j !in R && Tr < T:
                D = D U {j}                // Defer response to j
            else:
                Send (Reply, i) to j
        else if m == (Reply, j):
            R = R U {j}
            if j in D:
                D = D - {j}
                Send (Reply, i) to j
            Check()
        else if m == (Release):
            Pop top element from Q
            Check()
        unlock()

    Check():                               // Check to see if i is now enabled
        if R == Ni && (i, Tr) at front of Q:
            waiting = false
            Notify application

Why does Lamport's algorithm work?

Key idea: When process x generates a request with timestamp Tx, and it has
received replies from all y in Nx, then its Q contains all pending requests
with timestamps <= Tx.  Expressed as follows:

Rule: If x receives message (Reply, y), this indicates that one of the
following must hold:

1. y does not have a pending request with timestamp Ty < Tx, or
2. x already has an entry of the form (y, Ty) in Q.

Let's see how this rule gets implemented.  When node i receives
(Request, j, T), it sends either an "immediate" reply or a "deferred" reply.

IMMEDIATE REPLY.  Happens when any of the following conditions hold:

    A. !waiting
    B. j in R
    C. Tr > T

This causes j to receive the message (Reply, i).  Letting x = j and y = i,
we can categorize the three cases as follows:

    A. !waiting:  Node i does not want access to its critical section.
       Rule 1 applies: i does not have a pending request.
    B. j in R:  Node j already replied to i's request.
       Rule 2 applies: j has an entry (i, Tr) in its queue.
    C. Tr > T:  Node j's request has an earlier timestamp.
       Rule 1 applies: i has a pending request, but its timestamp is later than T.

DEFERRED REPLY.  Occurs after node i receives (Reply, j).  That reply is an
acknowledgement that j has received i's request, and so Rule 2 applies.

The deferred reply is the key trick for dealing with out-of-order messages.
By holding back its reply, i will not let j "jump the gun", acting on its
own request even though i has an earlier request.

Performance issues:

Define a "cycle" to be a complete round of the protocol, with one process i
entering its critical section and then exiting.  Each cycle involves 3(n-1)
messages:

1. Process i sending n-1 request messages
2. Process i receiving n-1 reply messages
3. Process i sending n-1 release messages
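The replicated priority queue is the heart of the algorithm.  Below is a
minimal sketch of one node's queue handling in Python, mirroring the
network function and Check() above.  The class and helper names are
hypothetical; heapq stands in for the priority queue (entries stored as
(timestamp, pid) so the lowest timestamp is at the head), and send(msg,
to=j) / notify() are abstract transport and wake-up helpers.

    import heapq

    class LamportMutexNode:
        # Sketch of one node in Lamport's algorithm; others = Ni.
        def __init__(self, i, others, send, notify):
            self.i, self.others = i, set(others)
            self.send, self.notify = send, notify
            self.Q = []              # heap of (T, j): this node's replica of the queue
            self.waiting = False
            self.Tr = None
            self.R = set()           # replies received for our own request
            self.D = set()           # requests whose replies we have deferred

        def request(self, T):        # application side; T is a fresh totally ordered timestamp
            self.Tr, self.R, self.D = T, set(), set()
            self.waiting = True
            for j in self.others:
                self.send(("Request", self.i, T), to=j)
            heapq.heappush(self.Q, (T, self.i))

        def release(self):           # application side, after the critical section
            for j in self.others:
                self.send(("Release",), to=j)
            heapq.heappop(self.Q)    # our own entry is at the head

        def on_request(self, j, T):  # network side
            heapq.heappush(self.Q, (T, j))       # every request enters the replica
            if self.waiting and j not in self.R and self.Tr < T:
                self.D.add(j)                    # defer: our request is earlier
            else:
                self.send(("Reply", self.i), to=j)

        def on_reply(self, j):
            self.R.add(j)
            if j in self.D:                      # now safe to answer j's request
                self.D.discard(j)
                self.send(("Reply", self.i), to=j)
            self.check()

        def on_release(self):
            heapq.heappop(self.Q)                # head of Q has left its C.S.
            self.check()

        def check(self):
            # Enter the C.S. only when everyone has replied and our request
            # is at the head of the replicated queue.
            if self.R == self.others and self.Q and self.Q[0] == (self.Tr, self.i):
                self.waiting = False
                self.notify()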
Alternative organization: Token ring

Idea: Number the processes 0, 1, ..., n-1, and define next(i) = (i + 1) mod n.
The processes are logically connected in a ring, so that process i can send
a message to next(i).  Run two threads for each process: one to service the
application and one to manage the network connection.

Each process i maintains two local Boolean variables:

    havetoken: Initialized to true for process 0 and to false for all others
    waiting:   For the application thread to communicate with the network thread

We would also need a mutex to synchronize changes to these variables, but we
omit those details.

Application functions for process i:

    Request():
        if havetoken:
            Notify application
        else:
            waiting = true

    Release():
        havetoken = false
        Send (OK) to next(i)

Network functions for process i:

    // Starting up
    if havetoken:                  // True only for process 0
        Send (OK) to next(i)
        havetoken = false

    // Regular operation
    while true:
        When receive (OK):
            if waiting:
                havetoken = true
                waiting = false    // request satisfied; don't capture the token again
                Notify application
            else:
                Send (OK) to next(i)

Correctness:

* Clearly safe: only one process can hold the token.
* Fairness: the token will pass around the ring at most once before a
  waiting process gets access.

Performance:

* Each cycle requires between 0 and n-1 messages.
* Latency of the protocol is between 0 and n-1 message delays.

Final observations

1. Lamport's algorithm demonstrates how distributed processes can maintain
   consistent replicas of a data structure (the priority queue).
2. Lamport's and Ricart & Agrawala's algorithms demonstrate the utility of
   logical clocks.
3. The centralized and ring-based algorithms have much lower message counts.
4. None of these algorithms can tolerate failed processes or dropped messages.
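The per-cycle message counts derived in the individual sections can be
collected in one place.  This is a small illustrative snippet (the function
name is hypothetical; the formulas come directly from the performance
discussions above):

    def messages_per_cycle(n):
        # Messages needed for one process to acquire and release the lock,
        # for a system of n processes.
        return {
            "centralized server": 3,              # request + grant + release
            "Ricart & Agrawala":  2 * (n - 1),    # n-1 requests + n-1 replies
            "Lamport":            3 * (n - 1),    # requests + replies + releases
            "token ring":         (0, n - 1),     # best / worst case token hops
        }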