Fall 2006 15-712 Midterm Exam Closed book. 80 minutes. Answers should be written in separate blue books. Read the questions carefully and ANSWER ALL PARTS OF EACH QUESTION. Carefully mark your name on each blue book and number the blue books if you use more than one. I will grade the first 8 answers I find in the blue books submitted. All questions have the same weight. ************* Answer 8 questions ************ 1. Contrast Thekkath's RDMA with Birrell's RPC. Give at least two ways they are similar and two ways they are different. Answer: RPC and RDMA are similar in several ways. - They both are used as a way for local programs to manipulate state on remote machines. - They both are frequently used as a data transfer mechanism. - They both need some form of rendezvous allowing clients to find servers. - They both use language bindings to make remote actions appear similar to local ones. They differ in several ways as well. - RPCs require a transfer of both data and control to a remote machine, while RDMA can transfer only data. - RPC is designed for a looser coupling between machines than RDMA by employing mechanisms such as data marshalling (XDR) all the time, while RDMA leaves data representation differences to the application programmers. - RPC requires scheduling of a thread in the destination machine while RDMA does not. ------------------------------ 2. Hauser's taxonomy for thread paradigms includes two types employing delayed execution: slack processes and deferring work. Contrast these two paradigms. Explain why Hauser found that slack processes were hard to get to work well. Explain how Hauser's task rejuvenation paradigm can encourage performance problems in deployed code. Answer: (a) Slack processes delay execution to collect a batch, optimize its execution and achieve higher throughput. Deferring work delays the execution of the portion of execution that need not be done before responding to a caller or user, enabling lower response time. They are not mutually exclusive in that one thread might defer work for responsiveness and then batch the deferred work in a slack process for throughput. (b) Slack processes have to delay long enough to achieve some saving from the collective optimization. But if they delay too long it may show up to users of the system as very poor response time. Prioritizing the sender over the receiver will pile up arbitrarily large amounts of pending work, possibly leading to poor response time. Prioritizing the receiver over the sender can lead to no delay at all. Hauser resorted to programmer manipulation of the scheduler (YieldButNotToMe()) to compensate for this. (c) Task rejunvenation uses a timeout to restart a function in the code as a simple error handling mechanism. But if the code had a simple bug in which it deterministically failed to complete something that a rejuvenated version of itself will do, then this bug is tolerated, manifesting as slower performance. ------------------------------ 3. Lamport uses a space-time physics analogy in his discussion of simultaneity in distributed computing. Why does this analogy apply? How is multi-threading on a single processor equivalent to distributed computing in terms of defining what is and what is not simultaneous? How do logical clocks help a distributed or multi-threaded program developer? Answer: (a) Space-time physics orders two events in space-time according to the communication of the event by light leaving it. If event A's light reaches the event B's position before B's event time, then A precedes B. If B's light reaches A's position before A's event time, then B precedes A. If neither of these occur, which is not uncommon as light has a maximum speed, then A and B are unordered, or simultaneous. In distributed computing, the analog to light traveling from event A to event B is a message sent by A's thread after event A and received by B's thread before event B. If there is no communication between events, there is no basis for ordering them, and they are simultaneous. (b) On a multi-threaded single processors threads are (mostly) context switched at arbitrary times relative to their internal events/operations. An event A in one thread, that happens at cycle N1 but does not precede an explicit communication received before an event B in another thread, at cycle N2 > N1, cannot order itself, A, before B because in a different execution (scheduling) A's thread may be context switched at a different time, delayed and observe N1 > N2. Events that may be reordered merely by a different set of clock interrupts and scheduler decisions should be understood to be simultaneous. (c) Logical clocks assign a time to each event in multiple threads so that any two events that should be understood to be ordered by communication will have logical clock values in the appropriate order, and all simultaneous events are given an order (not necessarily unique), consistent with the logical clocks of communicating events, so that all events have a possible total ordering. Programmers can the reason within this total ordering that any events that they wish to have ordered in some other ordering will require changes in the communications (and/or thread local code) to enforce. Programmers can also tag threads and events with an on-the-fly logical clock value on which to base order-sensitive decisions (eg. FIFO scheduling). ------------------------------ 4. Lamport's physical clocks depend on message communication time to bound clock skew. The Network Time Protocol (NTP) uses message round trip time to manage clock skew too. What is the key insight in the basic NTP clock skew detection and correction algorithm? Lamport develops physical clocks in order to compensate for covert channels in his Logical Clocks scheme. Give an example of a covert channel that defeats Lamport logical clocks. Does NTP fix all of the covert channel that Lamport's physical clocks set out to fix and why? Answer: (a) From a set of trials, the fastest round trip (probe-respond) message delay can be used to estimate the skew in machine clocks, because the receiver time at reception of the probe should be the send time at the sender plus the message transit time. By using the fastest of many round trips to estimate the message transit time, we can estimate the difference in the sender and receiver clocks. For example, 1/2 the fastest round trip time, minus the turn around at the receiver, can be used to estimate the message transit time, and define the timestamp that the receiver should have had at message receive. (b) Any communication channel not using the logical clock tagging algorthim could defeat logical clocks. A thread printing to a screen, causing a user to call a friend on the phone, who inputs to a different computer an event as a result of the call, is actually communication between threads, but won't have the appropriate tagging UNLESS the logical clock tagging is real physical time (with speed of light communication speeds) so that the phone call could not communicate information faster than the physical clock advances. (c) NTP is about close to synchronized clocks. It does not eliminate all covert channels because it cannot guarantee that the machines' clocks are so closely synchronized that no message might arrive fast enough for the clocks to be mis-ordered. In particular, there is no guarantee that the fastest sample of a set of message round trip trials is in fact the minimum transit time, so it may be possible for a particularly fast message to be received "before" it was sent, allowing a covert channel. ------------------------------ 5. Suppose there is a multi-threaded program (or distributed application) that modifies the records of a shared database using an application specific function "application_change()" using the following code in each thread. if (lots_of_threads_in_database()) { for each record in list_of_records { lock( rec_lock[record] ); application_change(record); unlock( rec_lock[record] ); } } else { lock( database ); for each record in list_of_records { application_change(record); } unlock( database ); } What benefit is the code trying to achieve by conditioning on lots_of_threads_in_database? Explain why Eraser's dynamic data race detector would not like this code. How could this be fixed? Answer: (a) The code is trying to minimize the overhead involved in taking locks by taking a large lock when it thinks that there is a low chance of needlessly delaying other threads. It uses a count of the number of threads in the database as an indication that the increased overhead of taking a lock per record to change is worth the reduced conflict with other threads. (b) Eraser would hate this code because sometimes a record is locked by the whole database lock, and other times it is locked by a per-record lock. Since Eraser objects to code where a record is not protected on every use by the same lock, it will object to this code. In fact, Eraser could be right. If a thread executing lock( re_lock[record] ) is not stopped by another thread holding lock( database ), then two threads could be modifying the same record at the same time. (c) First, the code has to provide mutual exclusion for each record. The simplest solution is to implement "(un)lock( re_lock[record] )" as a monitor whose monitor lock is "database" and to count the number of record locks granted at any time. Then make "lock ( database ); ....; unlock( database ); a monitor that "waits" on the condition that the number of record locks granted is zero. Then you'd have to fix Eraser to accept this lock hierarchy. A simple fix, in keeping with Eraser's specialization for read/write locks and initialization phases, would be an Eraser specialization to a lock hierarchy. A straightforward approach would be to annotate (a comment in) the code with an Eraser recognized statement saying the "database" lock dominates each and all "rec_lock[]" locks, and meaning that acquiring the database lock implies that all rec_lock[]s are held. ------------------------------ 6. Data race detection tools which rely on Lamport's happens-before relation have two primary weaknesses when compared to Eraser. First, they require per-thread information about concurrent accesses to each shared memory location. Secondly, their effectiveness is highly dependent on the interleaving of threads by the scheduler. Savage claims that Eraser is not sensitive to differences in thread interleaving and evaluate their claim with a simple test. Explain the intuition behind why the Lockset algorithm is thread interleaving insensitive. Answer: Given a candidate set of locks for a shared variable V and a sequence of accesses which refine the candidate set, any ordering of thread execution results in an equivalent refinement of the lockset. This is the case because any permutation of a fixed number of set intersections (refinement) gives the same result (permutations have associativity and commutativity), while this is not the case for the happens-before relation. ------------------------------ 7. Network Appliance's NFS filer server appliances use a "write-anywhere" file system very similar to Rosenblum's LFS layered on top of software-implemented RAID. They have recently implemented a variant of RAID level 6, which is a double failure tolerant RAID code, which in their case is organized like RAID level 4 but with two disks of redundancy codes (check disks). Network Appliance says that by using an LFS-like file filesystem, RAID level 4 is as good as RAID level 5 (and maybe better), and RAID level 6 has no performance penalty relative to RAID level 4 or 5. Explain the general rationale under which these two statements are correct. Give two reasons why this rationale breaks down, and explain the penalty and/or cost that can occur when this rationale breaks down. Answer: (a) With a log-structured file system, writes are gathered together and written at the end of the log in a few large writes. In RAID 4 and 5 a large write is faster than a small write because all of the stripe is in memory and its new parity can be computed without reading anything from disk. For small writes the cheapest update is to pre-read the data from disk, pre-read the parity from disk, compute the new parity as a flip of any bit in the old parity that is flipped in the old data, and over-write data and parity. In RAID 4 when multiple independent small writes are concurrent the data work is likely to be done in parallel, but there is only one parity disk, so all changes impact it and the parity disk is a bottleneck. RAID 5 rotates the parity so that in different stripes, the parity update load is applied to different disks evenly in a random small write workload. With LFS doing no small writes, there is no difference work of the parity disk and the data disks, no bottleneck to spread out in RAID 4. RAID 6 computes two check values for each stripe of data values and writes these to two different disks. If there is a dedicated second check disk (the cost) and all writes are large writes, then the RAID 6 writes will take the same amount of time as the RAID 4 or 5 writes with one less disk. (b) This rationale depends on all writes being large writes, which means that all writes can be delayed until enough data has accumulated to overwrite at least one whole stripe. If the user issues a fsync() of the file or file system, no delaying is allowed and the actual disk write may not be large. Or if data is being written so slowly that the periodic timer is firing and the file system is writing whatever is waiting in order to limit its exposure to system crashes, then the actual disk writes may not be large. Or if the LFS system does not do cleaning, and instead threads the log through blocks of previously deleted data, then the actual disk writes may not be large. When the disk write is not large, the minimum work done in a RAID 4 or 5 is two reads and two writes (with a possible bottleneck in RAID 4) and in a RAID 6 it is three reads and three writes. ------------------------------ 8. What kind of workload patterns is the Log-Structured File System well adapted to? What kind of a workload is it's pathological case, when compared to a traditional file system like BSD's Fast File System. A key issue here is when disk fragmentation hurts performance and when it does not. Rosenblum's answer to fragmentation problems has been disputed. What is his answer, why is it disputed and what do other systems do about fragmentation? Answer: LFS writes are always efficient if large free blocks/regions are available. Its reads are indexed by the same inode structure so the number of fetches is the same, but the positioning costs of reads can differ based on the relationship between read address sequences and write address sequences. (a) LFS performs optimally under any access pattern where the ordering of reads matches the ordering of prior writes. This is because it organizes things written at about the same time to be adjacent on the media, and things written shortly after other things to be slightly further down the disk. If reads have the same pattern as writes, everything being read is next or not far away on the disk. (b) LFS' pathological case is the opposite; a different access pattern of writes than reads performs very badly. The read positioning times may be worse because the write were bunched, while another file system may have done more seeking on writes to ensure data was laid out in a read-optimized order on disk. This worst case can be encountered using a database on LFS, in which random writes and sequential reads is not unlikely (online transactions for customer account records with daily reports summarizing all accounts, for example). (c) LFS would argue that most files are read and written in their entirety, which is the same pattern, that reads rarely have to go to disk and that cleaning deleted data out of contiguous regions will eventually both increase the availability of large free blocks/regions (reduce fragmentation) and merge pieces of files back together and sequential in file offset. (d) Others argue that the amount of disk seconds consumed cleaning the disk of fragmentation costs the system more performance interference than is gained by cleaning, that the systems LFS compares to are much worse in performance than current non-LFS state of the art, and that the bad cases (online databases) are not much improved by cleaning. (e) Most other systems defragment disks as an administrative process, manually triggered, or never at all. NetApp's WAFL is a prime example, as is MacOS HFS+ and Windows NTFS. ------------------------------ 9. The NASD interface is most closely related to the inode interface in most modern file systems. Explain this similarity. How does moving the inode interface from a module layer in the file system software (a local procedure call interface) to a wire protocol for storage devices (an RPC interface) provide benefit for both file systems developers and storage device developers? Answer: (a) Inodes are small fixed sized records containing attributes like timestamps, logical and physical size, permissions, and a data structure of pointers (direct pointers, indirect pointers, doubly indirect pointers, etc ) to the data. Inodes do not contain the file name, or its grouping into directories. They do not record where its data is being cached in clients or servers. The NASD interface assigns an object name to the data of a file that is not its pathname. The NASD interface includes the equivalent attributes and hides the actual storage location within the interface. So a NASD disk is a collection of variable length objects and their attributes, like a file system is a collection of inodes, the attributes they contain and the variable length data addressed by these inodes. (b) File system developers no longer have to make block by block allocation decisions, reducing the complexity and total work in their code; they are able to provide a simple mapping for the file onto storage objects so that the amount of metadata needed for the client to directly address storage is small; they can provide fine grain authorization to allow disks to validate that a clients is entitled to directly accessing a specific file and not other file's data; and the existing file systems have approximately the right layering (inode interface) already built in. Storage device developers are able to identify allocation and free, so they know which blocks do not contain data, and they can identify which data is adjacent in the same file so they can prefetch and cache more intelligently. ------------------------------ 10. The Google File System designers set out to manage storage resources on a large number of relatively unreliable systems with as simple a system as they could achieve. They made tradeoffs that lead to high levels of replication and extra burden on the application programmer. Explain we they choose this replication strategy and the risks they take with it. Explain the extra burden on the application programmer and what benefit it provides for the Google file system. Answer: (a) Google wants to use the cheapest hardware they can find, and to use the disks in the servers as storage for other servers, to further reduce cost. This reduction in cost means that the equipment is more likely to fail than standard data center equipment. They also designed their system code to be simple. For example, delete file unlinks the file in the directory but does not clean up the file's blocks, and the failure of a node does not cause an immediate reconstruction of the data that was lost with it; instead, a background process wanders through the list of all files periodically, checking to see if there are too many (deleted) or too few copies (failed node) and correcting these differences. The risk in this strategy is that more failures are happening closer together and failure recovery is being spread over a longer period of time, so data loss is more likely unless they maintain a higher ratio of redundancy to original data (negating some of the benefits of less expensive equipment). Another benefit of high replication is load balancing work by selecting different servers for different requesters. A risk in this is that maintaining more copies might cost more than it benefits, unless reading is far more common than writing (which is the case in search). (b) Because Google writes both the system services and the applications, and they want to make both as simple as possible, they have shifted some of the corner cases typically handled by the file system into the application domain. For example, to eliminate the burden of each application determining which other applications it is collaborating and synchronizing with in a shared work queue model of parallelism, Google put an append() operator into the file system allowing concurrent writing of small records into one file to not require locking or membership among the applications. Instead the file system has to do the appropriate locking/racing to determine the order of the writers. Google also splits files into independent fixed-max-sized chunks, to simplify allocation onto servers by the file system. But this chunking means that an append that crossed chunks would want to be ordered on both chunks in the same order (to have the data contiguous across the chunk boundary). To simplify the file system, the cross chunk case allows the file system to pad the space between records with nulls or have two copies of a record show up in the file, and requires the programmer to detect and cope with these oddities. ------------------------------ 11. I was recently talking to an IT manager at a Bioinformatics company about their critical computing codes and distributed file systems. He described for me a coding style that some of his biologists were using: for i from 1 to N1 { for j from 1 to N2 { compute_awhile( i, j ); fd = open( "output_file" ); lseek( fd, record_address( i, j ) ); write( fd, record_value( i, j ), record_length ); close( fd ); } } I was appalled by this code. Propose two changes to this code that could be expected to improve performance and, for each, explain why. Can you think of any good reason that a reasonable programmer might not comply with your proposals? Answer: (a1) Since the file opened is the same in all iterations of the loop, pull the open() and close() outside of the loop. This eliminates the code path and state manipulation in the kernel for opening and closing the file with every iteration. (a2) Since the record_length is independent of the iteration, so that the location of the records in the file are fixed, pre-allocate an memory region as big as the N1 * N2 * record_length outside the loop, change the lseek() and write() in the inner loop to write into the memory region and add a single giant write of the whole memory region before the close() after the loops end. This will convert many small writes into one large sequential write, making the disk writing process much more efficient. (b) A programmer might not follow (a2) if the size of the file (N1 * N2 * recrod_length) was so big and the actual location of the record writing was very non-linear with (i, j) so that the memory writing would be thrashing the virtual memory paging file anyway. Or if the file system had an implicit guarantee with each close() that what was written until that point was stable on disk and that programmer had a restart-after-crash algorithm that depended on these partial writes being on stable storage. or if there really was another program looking for changes in this file as a covert channel or an indication of progress (when the compute was really long) or a visualization pipeline. Or if the compiler farmed each iteration out to a different thread or processor so that the open() outside the loop might not be done in almost all the instances unless it was inside the loop. ------------------------------ 12. DeBergalis uses Remote DMA, RDMA, primarily for large transfers, while Thekkath and most users of high-performance supercomputers with RDMA-capable networks (like Infiniband) use RDMA especially for small transfers. Explain how each is right for its needs and why? Answer: Thekkath and most high-performance programmers using RDMA are implementing shared memory parallel processing. In this type of programming there are frequent accesses to small data structures (single variables often) on other machines, and remote accesses are often symmetric, reaching into the memory of all other machines equally (though Thekkath's NFS example is not symmetric). The goal of this type of programming is to implement a very large shared memory with lots of computing power operating throughout the shared memory, and low latency on every remote memory access. DeBergalis is implementing an asymmetic file system protocol in which low latency is much less important because much larger amounts of data are being moved (at least 4KB and often many MBs) so the network latency is small compared to the network transfer time. Instead Debergalis is aiming to do high bandwidth data transfers that consume as little client CPU overhead as possible and DeBergalis' server machine is unwilling to trust any client machine to reach into the server's memory and make changes. DeBergalis will not allow the client to implement operation requests as RDMA writes, forcing the client machine to send requests as traditional RPCs. So DeBergalis only uses RDMA for the server to transfer data from the client to the server or from the server to the client, and sends the request reply back in a traditional RPC too. So if the transfer size is small, RDMA provides little benefit in reducing client CPU consumption, while if it is large, the client CPU consumption can be reduced a lot by the server driving data transfer with RDMA. ------------------------------