Fall 2006 15-712 Midterm Exam

Closed book. 80 minutes. Answers should be written in separate blue books. Read the questions carefully and ANSWER ALL PARTS OF EACH QUESTION. Carefully mark your name on each blue book and number the blue books if you use more than one. I will grade the first 8 answers I find in the blue books submitted. All questions have the same weight.

************* Answer 8 questions ************

1. Contrast Thekkath's RDMA with Birrell's RPC. Give at least two ways they are similar and two ways they are different.

Answer:

RPC and RDMA are similar in several ways.
- They both are used as a way for local programs to manipulate state on remote machines.
- They both are frequently used as a data transfer mechanism.
- They both need some form of rendezvous allowing clients to find servers.
- They both use language bindings to make remote actions appear similar to local ones.

They differ in several ways as well.
- RPCs require a transfer of both data and control to a remote machine, while RDMA can transfer only data.
- RPC is designed for a looser coupling between machines than RDMA, employing mechanisms such as data marshalling (XDR) all the time, while RDMA leaves data representation differences to the application programmers.
- RPC requires scheduling of a thread on the destination machine, while RDMA does not.

------------------------------

2. Hauser's taxonomy for thread paradigms includes two types employing delayed execution: slack processes and deferring work. Contrast these two paradigms. Explain why Hauser found that slack processes were hard to get to work well. Explain how Hauser's task rejuvenation paradigm can encourage performance problems in deployed code.

Answer:

(a) Slack processes delay execution to collect a batch, optimize its execution, and achieve higher throughput. Deferring work delays the portion of execution that need not be done before responding to a caller or user, enabling lower response time. They are not mutually exclusive, in that one thread might defer work for responsiveness and then batch the deferred work in a slack process for throughput.

(b) Slack processes have to delay long enough to achieve some saving from the collective optimization, but if they delay too long the delay may show up to users of the system as very poor response time. Prioritizing the sender over the receiver will pile up arbitrarily large amounts of pending work, possibly leading to poor response time. Prioritizing the receiver over the sender can lead to no delay at all. Hauser resorted to programmer manipulation of the scheduler (YieldButNotToMe()) to compensate for this.

(c) Task rejuvenation uses a timeout to restart a function in the code as a simple error handling mechanism. But if the code had a simple bug in which it deterministically failed to complete something that a rejuvenated version of itself will do, then this bug is tolerated, manifesting as slower performance.

------------------------------

3. Lamport uses a space-time physics analogy in his discussion of simultaneity in distributed computing. Why does this analogy apply? How is multi-threading on a single processor equivalent to distributed computing in terms of defining what is and what is not simultaneous? How do logical clocks help a distributed or multi-threaded program developer?

Answer:

(a) Space-time physics orders two events in space-time according to the communication of the event by light leaving it. If event A's light reaches event B's position before B's event time, then A precedes B. If B's light reaches A's position before A's event time, then B precedes A. If neither of these occurs, which is not uncommon because light has a maximum speed, then A and B are unordered, or simultaneous. In distributed computing, the analog to light traveling from event A to event B is a message sent by A's thread after event A and received by B's thread before event B. If there is no communication between events, there is no basis for ordering them, and they are simultaneous.

(b) On a multi-threaded single processor, threads are (mostly) context switched at arbitrary times relative to their internal events/operations. An event A in one thread that happens at cycle N1, but does not precede an explicit communication received before an event B in another thread at cycle N2 > N1, cannot be ordered before B, because in a different execution (scheduling) A's thread may be context switched at a different time, delayed, and observe N1 > N2. Events that may be reordered merely by a different set of clock interrupts and scheduler decisions should be understood to be simultaneous.

(c) Logical clocks assign a time to each event in multiple threads so that any two events that should be understood to be ordered by communication will have logical clock values in the appropriate order, and all simultaneous events are given an order (not necessarily unique), consistent with the logical clocks of communicating events, so that all events have a possible total ordering. Programmers can then reason within this total ordering; any events that they wish to have ordered in some other way will require changes in the communications (and/or thread-local code) to enforce. Programmers can also tag threads and events with an on-the-fly logical clock value on which to base order-sensitive decisions (e.g., FIFO scheduling).
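To make (c) concrete, here is a minimal sketch of the logical clock rules in C. The function names, the message struct, and the single per-process counter are illustrative assumptions, not code from Lamport's paper or from any particular system.

    /* Minimal sketch of Lamport's logical clock rules (illustrative names only).
     * Each process keeps one counter: tick it on every local event, stamp
     * outgoing messages with it, and on receipt jump past the received stamp. */
    #include <stdint.h>

    struct message {
        uint64_t stamp;                /* sender's clock value at send time */
        /* ... payload ... */
    };

    static uint64_t lamport_clock = 0; /* one per process or thread */

    uint64_t lc_event(void) {          /* any local event */
        return ++lamport_clock;
    }

    void lc_send(struct message *m) {  /* sending is itself an event */
        m->stamp = ++lamport_clock;
    }

    uint64_t lc_receive(const struct message *m) {  /* receiving is an event */
        if (m->stamp > lamport_clock)
            lamport_clock = m->stamp;  /* never lag behind the sender */
        return ++lamport_clock;
    }

Ties between equal clock values are broken by an arbitrary but fixed rule (for example, process ID), which yields the total order that an order-sensitive policy such as FIFO scheduling can use.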
------------------------------

4. Lamport's physical clocks depend on message communication time to bound clock skew. The Network Time Protocol (NTP) uses message round trip time to manage clock skew too. What is the key insight in the basic NTP clock skew detection and correction algorithm? Lamport develops physical clocks in order to compensate for covert channels in his Logical Clocks scheme. Give an example of a covert channel that defeats Lamport logical clocks. Does NTP fix all of the covert channels that Lamport's physical clocks set out to fix, and why?

Answer:

(a) From a set of trials, the fastest round trip (probe-respond) message delay can be used to estimate the skew in machine clocks, because the receiver time at reception of the probe should be the send time at the sender plus the message transit time. By using the fastest of many round trips to estimate the message transit time, we can estimate the difference between the sender and receiver clocks. For example, half of the fastest round trip time, after subtracting the turnaround at the receiver, can be used to estimate the message transit time and define the timestamp that the receiver should have had at message receive.

(b) Any communication channel not using the logical clock tagging algorithm could defeat logical clocks. A thread printing to a screen, causing a user to call a friend on the phone, who inputs to a different computer an event as a result of the call, is actually communication between threads, but it won't have the appropriate tagging UNLESS the logical clock tagging is real physical time (with speed-of-light communication speeds), so that the phone call could not communicate information faster than the physical clock advances.

(c) NTP is about keeping clocks closely synchronized. It does not eliminate all covert channels because it cannot guarantee that the machines' clocks are so closely synchronized that no message might arrive fast enough for the clocks to be mis-ordered. In particular, there is no guarantee that the fastest sample of a set of message round trip trials is in fact the minimum transit time, so it may be possible for a particularly fast message to be received "before" it was sent, allowing a covert channel.
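To make the estimate in (a) concrete, here is a minimal sketch in C of the usual four-timestamp form of the calculation. The struct and function names are illustrative, and a symmetric one-way delay is assumed; this is a sketch of the idea, not NTP's actual implementation.

    /* Sketch of basic NTP-style skew estimation (illustrative names only).
     * t1 = client send, t2 = server receive, t3 = server send, t4 = client receive.
     * Keep the sample with the smallest round trip delay; its offset estimate is
     * the least polluted by queueing in the network. */
    struct sample { double t1, t2, t3, t4; };

    /* Round trip time minus the turnaround spent at the server. */
    static double delay(const struct sample *s)  { return (s->t4 - s->t1) - (s->t3 - s->t2); }

    /* How far the server clock appears to lead ours, assuming symmetric transit. */
    static double offset(const struct sample *s) { return ((s->t2 - s->t1) + (s->t3 - s->t4)) / 2.0; }

    double estimate_skew(const struct sample *trials, int n) {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (delay(&trials[i]) < delay(&trials[best]))
                best = i;                    /* fastest round trip of the set */
        return offset(&trials[best]);
    }

Choosing the sample with the smallest delay is the "fastest of many round trips" selection described above; as noted in (c), even that fastest sample is only an upper bound on the true transit time.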
------------------------------

5. Suppose there is a multi-threaded program (or distributed application) that modifies the records of a shared database using an application specific function "application_change()" using the following code in each thread.

    if (lots_of_threads_in_database()) {
        for each record in list_of_records {
            lock( rec_lock[record] );
            application_change(record);
            unlock( rec_lock[record] );
        }
    } else {
        lock( database );
        for each record in list_of_records {
            application_change(record);
        }
        unlock( database );
    }

What benefit is the code trying to achieve by conditioning on lots_of_threads_in_database? Explain why Eraser's dynamic data race detector would not like this code. How could this be fixed?

Answer:

(a) The code is trying to minimize the overhead involved in taking locks by taking a large lock when it thinks that there is a low chance of needlessly delaying other threads. It uses a count of the number of threads in the database as an indication that the increased overhead of taking a lock per changed record is worth the reduced conflict with other threads.

(b) Eraser would hate this code because sometimes a record is protected by the whole-database lock, and other times it is protected by a per-record lock. Since Eraser objects to code where a record is not protected on every use by the same lock, it will object to this code. In fact, Eraser could be right: if a thread executing lock( rec_lock[record] ) is not stopped by another thread holding lock( database ), then two threads could be modifying the same record at the same time.

(c) First, the code has to provide mutual exclusion for each record. The simplest solution is to implement "(un)lock( rec_lock[record] )" as a monitor whose monitor lock is "database" and to count the number of record locks granted at any time, and then to make "lock( database ); ...; unlock( database );" a monitor that waits on the condition that the number of record locks granted is zero. Then you'd have to fix Eraser to accept this lock hierarchy. A simple fix, in keeping with Eraser's specialization for read/write locks and initialization phases, would be an Eraser specialization for a lock hierarchy. A straightforward approach would be to annotate the code (with an Eraser-recognized statement in a comment) saying that the "database" lock dominates each and all "rec_lock[]" locks, meaning that acquiring the database lock implies that all rec_lock[]s are held.
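A minimal sketch, using POSIX threads, of the monitor-based fix described in (c). The names (database, rec_lock, record_locks_granted, NRECORDS) and the overall structure are illustrative assumptions about one way to realize the scheme, not code from Eraser or from the original program.

    /* Sketch of the fix in (c): per-record locks are granted under a monitor
     * whose lock is "database", and taking the whole-database lock waits until
     * no record locks are outstanding. Illustrative names only. */
    #include <pthread.h>

    #define NRECORDS 1024                        /* illustrative size */

    static pthread_mutex_t database = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  no_record_locks = PTHREAD_COND_INITIALIZER;
    static int record_locks_granted = 0;
    static pthread_mutex_t rec_lock[NRECORDS];   /* assume pthread_mutex_init elsewhere */

    void lock_record(int record) {
        pthread_mutex_lock(&database);           /* monitor entry */
        record_locks_granted++;
        pthread_mutex_unlock(&database);
        pthread_mutex_lock(&rec_lock[record]);
    }

    void unlock_record(int record) {
        pthread_mutex_unlock(&rec_lock[record]);
        pthread_mutex_lock(&database);
        if (--record_locks_granted == 0)
            pthread_cond_broadcast(&no_record_locks); /* let whole-database waiters recheck */
        pthread_mutex_unlock(&database);
    }

    void lock_whole_database(void) {
        pthread_mutex_lock(&database);           /* blocks new record lockers */
        while (record_locks_granted > 0)         /* drain outstanding record locks */
            pthread_cond_wait(&no_record_locks, &database);
        /* caller now excludes both whole-database and per-record users */
    }

    void unlock_whole_database(void) {
        pthread_mutex_unlock(&database);
    }

Even with this fix, Eraser's lockset for a record still alternates between {database} and {rec_lock[i]}, so the hierarchy annotation described above is still needed to silence the (now false) alarm.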
------------------------------

6. Data race detection tools which rely on Lamport's happens-before relation have two primary weaknesses when compared to Eraser. First, they require per-thread information about concurrent accesses to each shared memory location. Second, their effectiveness is highly dependent on the interleaving of threads by the scheduler. Savage claims that Eraser is not sensitive to differences in thread interleaving; evaluate this claim with a simple test. Explain the intuition behind why the Lockset algorithm is insensitive to thread interleaving.

Answer:

Given a candidate set of locks for a shared variable V and a sequence of accesses which refine the candidate set, any ordering of thread execution results in an equivalent refinement of the lockset. This is the case because any permutation of a fixed number of set intersections (refinements) gives the same result, since set intersection is associative and commutative, while this is not the case for the happens-before relation.
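A minimal sketch of the lockset refinement being described, with locksets represented as bitmasks; the names and representation are illustrative, not Eraser's implementation.

    /* Sketch of Lockset refinement (illustrative only, not Eraser's code).
     * C(v) starts as "all locks"; each access intersects C(v) with the locks
     * the accessing thread holds. Because intersection is commutative and
     * associative, the final C(v) is the same for any interleaving of the
     * same accesses. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t lockset_t;            /* bit i set => lock i is in the set */
    #define ALL_LOCKS ((lockset_t)~0u)

    lockset_t refine(lockset_t candidate, lockset_t locks_held) {
        return candidate & locks_held;     /* C(v) := C(v) intersect locks_held(t) */
    }

    int main(void) {
        lockset_t cv = ALL_LOCKS;
        lockset_t accesses[] = { 0x3,      /* access while holding {L0, L1} */
                                 0x1 };    /* access while holding {L0} */

        /* Any permutation of these refinements yields the same final set. */
        for (int i = 0; i < 2; i++)
            cv = refine(cv, accesses[i]);

        if (cv == 0)
            printf("warning: no lock consistently protects v\n");
        else
            printf("v consistently protected; C(v) = 0x%x\n", cv);
        return 0;
    }

Because the final value of cv is just the intersection of the same operands in any order, the warning (or its absence) does not depend on how the scheduler interleaved the accesses.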
------------------------------

7. Network Appliance's NFS filer server appliances use a "write-anywhere" file system very similar to Rosenblum's LFS layered on top of software-implemented RAID. They have recently implemented a variant of RAID level 6, a double-failure-tolerant RAID code, which in their case is organized like RAID level 4 but with two disks of redundancy codes (check disks). Network Appliance says that by using an LFS-like file system, RAID level 4 is as good as RAID level 5 (and maybe better), and RAID level 6 has no performance penalty relative to RAID level 4 or 5. Explain the general rationale under which these two statements are correct. Give two reasons why this rationale breaks down, and explain the penalty and/or cost that can occur when this rationale breaks down.

Answer:

(a) With a log-structured file system, writes are gathered together and written at the end of the log in a few large writes. In RAID 4 and 5 a large write is faster than a small write because all of the stripe is in memory and its new parity can be computed without reading anything from disk. For small writes the cheapest update is to pre-read the old data from disk, pre-read the old parity from disk, compute the new parity by flipping any bit of the old parity whose corresponding data bit changes between the old and new data, and over-write the data and parity. In RAID 4, when multiple independent small writes are concurrent, the data work is likely to be done in parallel, but there is only one parity disk, so all changes impact it and the parity disk is a bottleneck. RAID 5 rotates the parity so that in different stripes the parity update load is applied to different disks evenly in a random small write workload. With LFS doing no small writes, there is no difference between the work of the parity disk and the data disks, and no bottleneck to spread out in RAID 4. RAID 6 computes two check values for each stripe of data values and writes these to two different disks. If there is a dedicated second check disk (the cost) and all writes are large writes, then the RAID 6 writes will take the same amount of time as the RAID 4 or 5 writes with one less disk.

(b) This rationale depends on all writes being large writes, which means that all writes can be delayed until enough data has accumulated to overwrite at least one whole stripe. If the user issues an fsync() of the file or file system, no delaying is allowed and the actual disk write may not be large. Or if data is being written so slowly that the periodic timer is firing and the file system is writing whatever is waiting in order to limit its exposure to system crashes, then the actual disk writes may not be large. Or if the LFS system does not do cleaning, and instead threads the log through blocks of previously deleted data, then the actual disk writes may not be large. When the disk write is not large, the minimum work done in RAID 4 or 5 is two reads and two writes (with a possible bottleneck in RAID 4), and in RAID 6 it is three reads and three writes.
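To make the small-write parity arithmetic in (a) concrete, here is a worked sketch in C with blocks shrunk to a single byte; the function name and the values are illustrative only.

    /* Sketch of the RAID 4/5 small-write parity update from (a): the new parity
     * flips exactly the parity bits whose data bits changed, so a small write
     * costs two pre-reads (old data, old parity) and two writes (new data, new
     * parity). Blocks are one byte here purely for illustration. */
    #include <stdio.h>
    #include <stdint.h>

    uint8_t small_write_parity(uint8_t old_data, uint8_t new_data, uint8_t old_parity) {
        return old_parity ^ (old_data ^ new_data);   /* flip only the changed bits */
    }

    int main(void) {
        /* 4 data "disks" plus 1 parity "disk", one byte per disk. */
        uint8_t d[4] = { 0x0F, 0x33, 0x55, 0xF0 };
        uint8_t parity = d[0] ^ d[1] ^ d[2] ^ d[3];

        uint8_t new_d1 = 0x3C;                       /* small write to disk 1 */
        parity = small_write_parity(d[1], new_d1, parity);
        d[1] = new_d1;

        /* The updated parity still equals the XOR of all current data blocks. */
        printf("parity ok: %d\n", parity == (d[0] ^ d[1] ^ d[2] ^ d[3]));
        return 0;
    }

A full-stripe (large) write needs none of the pre-reads: all data blocks are in memory and the parity is computed directly, which is the case an LFS-like file system tries to make the common one.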
------------------------------

8. What kind of workload patterns is the Log-Structured File System well adapted to? What kind of workload is its pathological case, when compared to a traditional file system like BSD's Fast File System? A key issue here is when disk fragmentation hurts performance and when it does not. Rosenblum's answer to fragmentation problems has been disputed. What is his answer, why is it disputed, and what do other systems do about fragmentation?

Answer:

LFS writes are always efficient if large free blocks/regions are available. Its reads are indexed by the same inode structure, so the number of fetches is the same, but the positioning costs of reads can differ based on the relationship between read address sequences and write address sequences.

(a) LFS performs optimally under any access pattern where the ordering of reads matches the ordering of prior writes. This is because it organizes things written at about the same time to be adjacent on the media, and things written shortly after other things to be slightly further down the disk. If reads have the same pattern as writes, everything being read is next or not far away on the disk.

(b) LFS's pathological case is the opposite: an access pattern for reads different from that of the writes performs very badly. The read positioning times may be worse because the writes were bunched, while another file system may have done more seeking on writes to ensure data was laid out in a read-optimized order on disk. This worst case can be encountered using a database on LFS, in which random writes and sequential reads are not unlikely (online transactions for customer account records with daily reports summarizing all accounts, for example).

(c) LFS would argue that most files are read and written in their entirety, which is the same pattern, that reads rarely have to go to disk, and that cleaning deleted data out of contiguous regions will eventually both increase the availability of large free blocks/regions (reduce fragmentation) and merge pieces of files back together, sequential in file offset.

(d) Others argue that the amount of disk seconds consumed cleaning the disk of fragmentation costs the system more performance interference than is gained by cleaning, that the systems LFS compares to are much worse in performance than the current non-LFS state of the art, and that the bad cases (online databases) are not much improved by cleaning.

(e) Most other systems defragment disks as an administrative process, manually triggered, or never at all. NetApp's WAFL is a prime example, as are MacOS HFS+ and Windows NTFS.

------------------------------

9. The NASD interface is most closely related to the inode interface in most modern file systems. Explain this similarity. How does moving the inode interface from a module layer in the file system software (a local procedure call interface) to a wire protocol for storage devices (an RPC interface) provide benefit for both file system developers and storage device developers?

Answer:

(a) Inodes are small fixed-size records containing attributes like timestamps, logical and physical size, permissions, and a data structure of pointers (direct pointers, indirect pointers, doubly indirect pointers, etc.) to the data. Inodes do not contain the file name or its grouping into directories. They do not record where the file's data is being cached in clients or servers. The NASD interface assigns an object name to the data of a file that is not its pathname. The NASD interface includes the equivalent attributes and hides the actual storage location within the interface. So a NASD disk is a collection of variable-length objects and their attributes, much as a file system is a collection of inodes, the attributes they contain, and the variable-length data addressed by those inodes.

(b) File system developers no longer have to make block-by-block allocation decisions, reducing the complexity and total work in their code; they are able to provide a simple mapping of the file onto storage objects so that the amount of metadata needed for the client to directly address storage is small; they can provide fine-grained authorization to allow disks to validate that a client is entitled to directly access a specific file and not other files' data; and the existing file systems have approximately the right layering (the inode interface) already built in. Storage device developers are able to see allocation and free operations, so they know which blocks do not contain data, and they can identify which data is adjacent in the same file, so they can prefetch and cache more intelligently.

------------------------------

[3 more questions]