Notes on Distributed Mutual Exclusion
15-440, Fall 2011
Carnegie Mellon University
Randal E. Bryant

Reading: Tanenbaum, Sect. 6.3

Goal: Maintain mutual exclusion among a set of n distributed processes.
Each process executes a loop of the form:

  while true:
    Perform local operations
    Acquire()
    Execute critical section
    Release()

Whereas multithreaded systems can use shared memory, we assume that
processes can coordinate only via message passing.

Terminology:

Define a "cycle" as one round of the protocol, where some process
acquires the lock, completes its critical section, and then releases it.

Requirements:

1. Safety.  At most one process holds the lock at a time.

2. Fairness.  Any process that makes a request must be granted the lock.
   A. Implies that the system must be deadlock-free
   B. Assumes that no process will hold onto a lock indefinitely
   C. Eventual fairness: Waiting process will not be excluded forever
   D. Bounded fairness: Waiting process will get lock within some
      bounded number of cycles (typically n)

Other possible goals

1. Low message overhead
2. No bottlenecks
3. Tolerate out-of-order messages
4. Allow processes to join protocol or to drop out
5. Tolerate failed processes
6. Tolerate dropped messages

For today, we will only consider goals 1-3.  I.e., assume:
  * Total number of processes is fixed at n
  * No process fails or misbehaves
  * Communication never fails, but messages may be delivered out of order.

Scheme 1: Centralized Mutex Server

Assume there is a single server that acts as a lock manager.  It
maintains a Boolean held, recording whether the lock is currently
granted, and a queue Q containing lock requests that have not yet been
granted.

Operation on process i

  Acquire:
    Send (Request, i) to manager
    Wait for reply

  Release:
    Send (Release) to manager

Operation at server

  held = false
  while true:
    m = Receive()
    If m == (Request, i):
      if !held:
        held = true
        Send (Grant) to i
      else:
        Add i to Q       // Lock is taken; i must wait its turn
    If m == (Release):
      if empty(Q):
        held = false
      else:
        Remove ID j from Q
        Send (Grant) to j

Correctness:
 * Clearly safe
 * Fairness depends on queuing policy.
   E.g., if the server always gave priority to the lowest process ID,
   then processes 1 & 2 could keep making requests and thereby exclude
   process 3.  With a round-robin or FIFO policy, a response would be
   guaranteed within n cycles.

Performance:
 * 3 messages per cycle (1 request, 1 grant, 1 release)
 * Lock server creates bottleneck

Lamport's Distributed Mutual Exclusion

Relies on Lamport totally ordered clocks, having the following properties:

 1. For any events e, e' such that e --> e' (causality ordering),
    T(e) < T(e')
 2. For any distinct events e, e', T(e) != T(e').

Notation:
  Ni = {1, 2, ..., i-1, i+1, ..., n}, the set of all processes other
  than i (n is the number of processes)

Message types:
  (Request, i, T):  Process i requests lock with timestamp T
  (Reply, j):       Process j responds to some request for lock
  (Release):        Release lock

For each node i, maintain the following values:
  Ti():     Function that returns value of local Lamport clock
  waiting:  Boolean.  Set when process i wants lock
  Q:        Priority queue with entries of form (j, T), indicating
            that process j has a request with timestamp T.  Ordered
            so that the entry with the lowest timestamp is at the head.
  Tr:       Time stamp of pending local request
  R:        Subset of Ni.  Set of processes from which i has received
            a reply for its request
  D:        Subset of Ni.  Set of processes for which i has deferred
            the reply to their requests

Process i consists of two threads.  One servicing the application, and
one monitoring the network.
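
The clock function Ti() assumed above could be realized in several ways.
What follows is only a minimal sketch, not part of the original notes: the
class name LamportClock, its method names, and the choice of
(counter, process ID) pairs as timestamps are all illustrative
assumptions.  Pairs compare lexicographically, so ties between counters
are broken by process ID, which gives property 2.

```python
import threading


class LamportClock:
    """Hypothetical totally ordered logical clock for one process.

    Timestamps are (counter, pid) tuples; Python compares tuples
    lexicographically, so equal counters are ordered by process ID.
    """

    def __init__(self, pid, start=0):
        self.pid = pid
        self.counter = start
        # The application and network threads would share this clock.
        self._lock = threading.Lock()

    def now(self):
        """Advance the clock for a local or send event; return its timestamp."""
        with self._lock:
            self.counter += 1
            return (self.counter, self.pid)

    def observe(self, stamp):
        """Advance past the timestamp carried by a received message."""
        with self._lock:
            self.counter = max(self.counter, stamp[0]) + 1
            return (self.counter, self.pid)


a, b = LamportClock(1), LamportClock(2)
t_send = a.now()             # send event on process 1
t_recv = b.observe(t_send)   # causally later receive event on process 2
assert t_send < t_recv       # property 1: causal order is respected
```

Two processes that tick independently can reach the same counter value,
but their timestamps still differ in the second component.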
Application thread:

  Request()               // Request global mutex
  Wait for Notification   // Wait until notified by network thread
  Critical Section        // Operate in exclusive mode
  Release()               // Release mutex

Application Functions:

  Request():
    Tr = Ti()             // Get time stamp
    R = {}
    D = {}
    Send (Request, i, Tr) to each j in Ni
    Add (i, Tr) to Q
    waiting = true

  Release():
    Send (Release) to each j in Ni
    Pop top element from Q

Network Function

  while true:
    m = Receive()
    if m == (Request, j, T):
      Add (j, T) to Q
      if waiting && j !in R && Tr < T:
        D = D U {j}       // Defer response to j
      else
        Send (Reply, i) to j
    else if m == (Reply, j):
      R = R U {j}
      if j in D:
        D = D - {j}
        Send (Reply, i) to j
      Check()
    else if m == (Release):
      Pop top element from Q
      Check()

  Check():                // Check to see if i is now enabled
    if R == Ni && (i, Tr) at front of queue:
      waiting = false
      Notify application

Why does Lamport's algorithm work?

Key idea: When process i has received replies from all j in Ni, then
its Q contains all requests with time stamps <= Tr.

Expressed as follows:

Rule: If i receives message (Reply, j), this indicates that one of the
following must hold:
  1. j does not have a pending request with time stamp T < Tr, or
  2. i already has (j, T) in Q.

How do we know this is true?

Consider events
  a1. i sends request (Request, i, Ta).      Time stamp Ta1 = Ta
  a2. j receives request (Request, i, Ta).   Time stamp Ta2
  a3. j sends reply (Reply, j).              Time stamp Ta3
  a4. i receives reply (Reply, j).           Time stamp Ta4

Suppose there is another set of events:
  b1. j sends request (Request, j, Tb).      Time stamp Tb1 = Tb
  b2. i receives request (Request, j, Tb).   Time stamp Tb2
  b3. i sends reply (Reply, i).              Time stamp Tb3
  b4. j receives reply (Reply, i).           Time stamp Tb4

Each of these sequences has causal ordering constraints

  Ta = Ta1 < Ta2 < Ta3 < Ta4
  Tb = Tb1 < Tb2 < Tb3 < Tb4

Violating the rule above would require a scenario where Tb < Ta but
Ta4 < Tb2.
Assuming Tb < Ta, j must already have issued its request when a2 occurs
(otherwise j's clock would give Tb > Ta2 > Ta).  Unless j had already
received i's reply (in which case Tb4 < Ta2 directly), the network code
at j would have put i in D at time Ta2, and j would not have sent
(Reply, j) to i until after Tb4.  Either way we must have Tb4 < Ta3.
We can combine this with the other orderings to get
Tb2 < Tb4 < Ta3 < Ta4, and so our potential error cannot occur.

Performance issues:

Define a "cycle" to be a complete round of the protocol with one
process i entering its critical section and then exiting.  We can see
this cycle would involve 3(n-1) messages as follows:
  1. Process i sending n-1 request messages
  2. Process i receiving n-1 reply messages
  3. Process i sending n-1 release messages.

Ricart & Agrawala's algorithm

This algorithm is a refinement of Lamport's.  The main idea is to
combine the roles of the Reply and the Release messages.  That is,
process j will not send (Reply, j) to i until after it has completed
any critical section of its own that it requested with time stamp
T < Tr, where Tr is the time stamp of i's request.  We can dispense with
much of the machinery of Lamport's algorithm, including the priority
queue Q.

Message types:
  (Request, i, T):  Process i requests lock with timestamp T
  (Reply, j):       Process j responds to some request for lock

For each node i, maintain the following values:
  Ti():     Function that returns value of local Lamport clock
  waiting:  Boolean.  Set when process i wants lock
  Tr:       Time stamp of pending local request
  R:        Subset of Ni.  Set of processes from which i has received
            a reply
  D:        Subset of Ni.  Set of processes for which i has deferred
            the reply to their requests

Process i consists of two threads.  One servicing the application, and
one monitoring the network.
Application thread:

  Request()               // Request global mutex
  Wait for Notification   // Wait until notified by network thread
  Critical Section        // Operate in exclusive mode
  Release()               // Release mutex

Application Functions:

  Request():
    Tr = Ti()             // Get time stamp
    R = {}
    D = {}
    waiting = true
    Send (Request, i, Tr) to each j in Ni

  Release():
    waiting = false
    Send (Reply, i) to each j in D

Network Functions:

  while true:
    m = Receive()
    if m == (Request, j, T):
      if waiting && Tr < T:
        D = D U {j}       // Defer response to j
      else
        Send (Reply, i) to j
    else if m == (Reply, j):
      R = R U {j}
      if R == Ni:
        // waiting stays true until Release(), so that requests
        // arriving during the critical section are also deferred
        Notify application

Performance:

Each cycle involves 2(n-1) messages:
  n-1 requests by i
  n-1 replies to i

Ricart & Agrawala Example

Processes 1, 2, 3.  Create totally ordered clocks by having process i
compute timestamps of the form T(e) = 10*L(e) + i, where L(e) is a
regular Lamport clock.

Initial timestamps: P1: 421, P2: 112, P3: 143

Action types:
  R m:        Receive message m
  B m:        Broadcast message m to all other processes
  S m to j:   Send message m to process j

  Process   T1    T2    T3    Action
            421   112   143
     3                  153   B (Request, 3, 153)
     2            162         R (Request, 3, 153)
     1      431               R (Request, 3, 153)
     1      441               S (Reply, 1) to 3
     2            172         S (Reply, 2) to 3
     3                  453   R (Reply, 1)
     3                  463   R (Reply, 2)
     3                  473   Enter critical section
     1      451               B (Request, 1, 451)
     2            182         B (Request, 2, 182)
     3                  483   R (Request, 1, 451)     # 3 has D = {1}
     3                  493   R (Request, 2, 182)     # 3 has D = {1, 2}
     1      461               R (Request, 2, 182)
     2            462         R (Request, 1, 451)     # 2 has D = {1}
     1      471               S (Reply, 1) to 2       # 2 has higher priority
     2            482         R (Reply, 1)            # 2 has R = {1}
     3                  503   S (Reply, 3) to 1       # Release lock
     3                  513   S (Reply, 3) to 2
     1      511               R (Reply, 3)            # 1 has R = {3}, still needs 2
     2            522         R (Reply, 3)            # 2 has R = {1, 3} = N2
     2            532         Enter critical section
     2            542         S (Reply, 2) to 1       # Release lock
     1      551               R (Reply, 2)            # 1 has R = {2, 3} = N1
     1      561               Enter critical section
     ...

Overall flow:

P1 and P2 compete for the lock after it is released by P3.  P1's request
has timestamp 451, while P2's request has timestamp 182.  P2 defers its
reply to P1, but P1 replies to P2 immediately.  This allows P2 to
proceed ahead of P1.
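
The scenario in the trace can be replayed with a small single-threaded
simulation.  This is a hedged sketch rather than the notes' own code:
the names Node, run_example, net, and log are hypothetical, messages are
delivered in FIFO order from one shared queue, and request timestamps
are passed in by the caller instead of being read from Ti().  The sketch
also assumes that waiting stays set until release(), so that a process
in its critical section defers incoming requests, as P3 does in the
trace.

```python
from collections import deque


class Node:
    """One process in a toy Ricart & Agrawala simulation (illustrative only)."""

    def __init__(self, pid, peers, net, log):
        self.pid = pid
        self.peers = set(peers)   # plays the role of Ni
        self.net = net            # shared FIFO of (destination, message)
        self.log = log            # order in which processes enter
        self.waiting = False      # set from request() until release()
        self.tr = None            # timestamp of the pending request (Tr)
        self.replies = set()      # R
        self.deferred = set()     # D

    def request(self, ts):
        self.tr, self.waiting = ts, True
        self.replies, self.deferred = set(), set()
        for j in sorted(self.peers):
            self.net.append((j, ("Request", self.pid, ts)))

    def release(self):
        self.waiting = False
        for j in sorted(self.deferred):
            self.net.append((j, ("Reply", self.pid)))
        self.deferred = set()

    def receive(self, msg):
        if msg[0] == "Request":
            _, j, ts = msg
            if self.waiting and self.tr < ts:
                self.deferred.add(j)          # our request has priority
            else:
                self.net.append((j, ("Reply", self.pid)))
        else:                                 # ("Reply", j)
            self.replies.add(msg[1])
            if self.waiting and self.replies == self.peers:
                self.log.append(self.pid)     # enter critical section


def run_example():
    net, log = deque(), []
    nodes = {i: Node(i, {1, 2, 3} - {i}, net, log) for i in (1, 2, 3)}

    def drain():
        while net:
            dest, msg = net.popleft()
            nodes[dest].receive(msg)

    nodes[3].request(153)
    drain()                   # P3 collects both replies and enters
    nodes[1].request(451)
    nodes[2].request(182)
    drain()                   # P3 defers both; P2 defers P1; P1 replies to P2
    nodes[3].release()
    drain()                   # P2 now holds both replies and enters
    nodes[2].release()
    drain()                   # P1 receives P2's deferred reply and enters
    return log


print(run_example())          # [3, 2, 1]
```

The entry order matches the trace: P3 first, then P2 (timestamp 182),
then P1 (timestamp 451).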
Alternative organization: Token ring

Idea: Number processes 0, 1, ..., n-1.  Define next(i) = (i + 1) mod n.

Processes are logically connected in a ring, so that process i can send
a message to next(i).

Run two threads for each process, one to service the application and
one to manage the network connection.

Each process i maintains two local Boolean variables:
  havetoken:  Initialized to true for process 0 and to false for all
              others.
  waiting:    For application thread to communicate with network thread.

Would also need a mutex to synchronize changes to these variables, but
we will omit these details.

Application functions for process i:

  Request():
    if havetoken:
      Notify application
    else
      waiting = true

  Release():
    havetoken = false
    Send (OK) to next(i)

Network functions for process i:

  // Starting up
  if havetoken:           // True only for process 0
    Send (OK) to next(i)
    havetoken = false

  // Regular operation
  while true:
    When receive (OK):
      if waiting:
        waiting = false   // Request is now satisfied
        havetoken = true
        Notify application
      else
        Send (OK) to next(i)

Correctness:
 * Clearly safe: Only one process can hold the token
 * Fairness: Token will pass around the ring at most once before a
   waiting process gets access.

Performance:
  Each cycle requires between 0 & n-1 messages
  Latency of protocol is between 0 & n-1 message delays

Final observations

1. Lamport's algorithm demonstrates how distributed processes can
   maintain consistent replicas of a data structure (the priority queue).

2. Lamport's and Ricart & Agrawala's algorithms demonstrate the utility
   of logical clocks.

3. Centralized & ring-based algorithms have much lower message counts.

4. None of these algorithms can tolerate failed processes or dropped
   messages.
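
To make observation 3 concrete, the per-cycle message counts stated in
these notes can be tabulated for a few values of n.  The function name
messages_per_cycle and the dictionary keys are illustrative choices; the
formulas are the ones given in the Performance paragraphs, with the
token ring shown at its stated upper bound of n-1.

```python
def messages_per_cycle(n):
    """Per-cycle message counts for n processes, as stated in the notes."""
    return {
        "centralized": 3,                 # request, grant, release
        "lamport": 3 * (n - 1),           # requests + replies + releases
        "ricart_agrawala": 2 * (n - 1),   # requests + replies
        "token_ring_max": n - 1,          # upper bound; can be as low as 0
    }


for n in (3, 10, 100):
    print(n, messages_per_cycle(n))
```

For n = 10 this gives 3, 27, 18, and at most 9 messages respectively.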