MIME-Version: 1.0 Server: CERN/3.0 Date: Monday, 16-Dec-96 23:49:58 GMT Content-Type: text/html Content-Length: 13893 Last-Modified: Saturday, 30-Mar-96 06:08:23 GMT Homework 4

Homework 4

Handed out: Tue, Mar 6th
Due: Thu, Mar 28th

You are to find one partner for the term project, select a projet, and place a "1-page" proposal of what you want to do on the web. The proposal must address the following points: title, 1/2 "page" project description, goals, plan of attack, expected kind of results, how to determine sucess/failure, critical steps to get done soon. Create a subdir in /vol/www/Info/Courses/Current/CS516/Projects and place the proposal there. Name the file "Welcome.html".

Here are some ideas that could lead to interesting projects. Some of these are short-term, self-contained projects that could easily be accomplished within the context of a term-project. Others could eventually expand into publications, or a Meng project. If you pick a problem with a wider scope, you must be sure to isolate a piece that can be adequately addressed within the semester.

You may work in teams of two, possibly three, but there should be a clearly identifiable role for each member of the team, especially for three person teams. 1) New

High bandwidth CU-SeeMe

Couple Split-C/U-Net and Split-C/SolarisMP

Build ATM-Fast Ethernet gateway using U-Net -- at IP level and/or at U-Net level

Implement U-Net kernel endpoints and run standard IP over U-Net

Add IP packet filter to SBA-200 firmware or to Fast Ethernet

Implement Active Messages flow-control for shared fast ethernet

Implement/port parallel gdb for Split-C

Splash benchmarks

Perfect fast RPC U-Net implementation (urpc) from last year

Perfect distributed shared memory from last year

USIT 2) Instruction Set Design and Measurement

Condition codes or not, that is the question: (Suggested by David Wood) Most architectures designed in the 1960s and 1970s employ condition codes. Several of the recent RISC architectures do not, yet others do. For example, the MIPS architecture does not have condition codes, yet it has SET instructions that put the result of a comparison into a general purpose register. The RS/6000 has 8 (virtual) condition code registers, which can be set either as a side-effect of an ALU instruction, or explicitly by compare instructions. Explore the similarities and differences of these two schemes. Is one clearly better than the other? Are the differences technology dependent? As processors become increasingly integrated, which scheme will lead to the best performance?

Analyze instruction issue strategies: Analyze and compare Scoreboarding, Tomasulo's algorithm and the RS/6000 scheme. Where's the beef? How much of the advantage of dynamic scheduling is achieved by static scheduling and delay slots? How do the trade-offs change with increased processor/memory speed ratio? Are lock-up free caches or write-buffers critical? (I can pull together some good tools that students built at Berkeley 1 1/2 years ago; they didn't have much time left to study the results, though, so you could leverage their work...)

Address spaces beyond 32 bits: According to some observers, the demand for virtual address space increases at the rate of 1 bit every 2 years. Thus, while 16 bit architectures were quite acceptable throughout the 60s and 70s, they eventually became too constraining, and have almost entirely been replaced. At this rate, we have just a few more years before 32-bit addresses begin to constrain our programs. Some architectures have introduced segments to extend the address space. Some computer architects suggest that only full 64-bit machines (integers and addresses) will solve the problems. Explore the cost, performance, programming, and compatibility issues of these approaches. (Most RISC architectures have already defined a 64-bit extension… But we don't have any tools for analyzing the use of large address spaces and several interesting methodology issues arise in doing so.)

Machine independent binaries: All RISC instruction sets look almost identical, but you sure cannot compile one program for all. Perhaps there is a middle ground, brand-X generic RISC that could be easily mapped to a variety of ISA with reasonable efficiency. (One such design has been developed pretty far by OSF.) This raises another interesting question of machine independent pipeline scheduling.

PowerPC vs. x86 architecture: Compare the Intel P6 processor to the PPC604 or PPC620. Dig up all the technical information you can and try to come to terms with the real differences. The P6 is a RISC processor at heart, so what's all this fuzz about old x86 code? Does it really matter? The point of this project is to go further than the trade rags which beleive all the marketing hype as long as it boosts their sales...

Validation of Cache Studies: Using tools like spy or spix and piping the trace directly into an optimized cache simulator it is possible to evaluate substantial workloads of billions of instructions. Review the literature and see whether the important published results on cache studies really hold up. In particular, what is the limit of Spec based caching studies? At some point, all the Spec programs fit into the cache. 3) Superscalar, Superpipeline

High-Bandwidth Data Cache for Superscalar Architectures: What does it take to services several hits per cycle? To continue servicing hits while misses are being processes in the next bigger/slower level of the memory hierarchy? How much does this buy? Try a sample design to get a feeling for the complexity. Build a simple simulator and run some memory traces through it to see how much performance improvement is possible.

Sophisticated Branching for Superscalar Machines: (Suggested by Steve Krueger, T.I.) In RISC processors a pipeline break is relatively costly. In superscalar RISC processors that cost is effectively multiplied by the number of instructions that execute simultaneously. It is therefore desirable to increase the runs of instructions without branches. Some old architectures had SKIP instructions that were used extensively. It seems that SKIP instructions (possibly with restrictions on what could be skipped) could give conditional execution without breaking the pipeline through the use of SQUASH or KILL hardware already in place due to the needs of exception processing. Study whether the number of cases where SKIP instructions could be effectively used is great enough to make them useful. Skip-n forms might extend the usefulness by allowing more than one instruction to be skipped.

Several variants of this idea could be considered, including multiway branches, conditional moves, operators to avoid branching (e.g., max, min, abs), and conditional operators.

Register organization for Superscalar Machines: Superscalar execution looks very attractive once the multiplier and floating point units fit on a chip, since at CPI>=1 these sit mostly idle. However, the cost increase comes in the memory system (see above) and the register file. To execute four instructions per cycle do we need a 12-port register file? Many values are forwarded between functional units and never touch the register file! Study register usage patterns in superscalar designs and see how the number of ports can be reduced. What are the cost/performance trade-offs? 4) Multiprocessors

Multiple processors on a Chip: If you allow yourself a transistor budget of 10 to 50 million transistors on a chip you have plenty of room for innovative designs with multiple processors on a chip. This raises a host of interesting questions. Should caches be shared or dedicated to individual processors? Should floating point units be shared or dedicated? What is the trade-off between having more simpler processors vs fewer more sophisticated ones. In some sense this is trading instruction level parallelism against process level parallelism. The most serious bottleneck is going to be pin bandwidth in and out of the chip. How can you minimize the bandwidth requirement? Are new protocols required? There are several ways to frame studies in this context. You may want to look at multiple independent processes, as in a workstation with many open windows, or a small shared-memory multiprocessor design, or you may want to look at this as a component in a massively parallel machine.

Characterizing communication and sharing in multiprocessors: In current bus-based multiprocessors, interprocessor communication takes the form of cache misses. Thus, several issues get folded into a single number: the miss rate. Some good work has been done to try to characterize sharing in terms of modern cache organization, but there remains many unanswered questions. Some of these may only be answered by inventing new analysis techniques. Certainly it will be hard to get useful data. The reference string generated my each of the processors depends on how work is scheduled onto processors. This is generally done by allowing the processors to contend for various scheduling data structures. Thus, the schedule is somewhat dependent on the memory system. Exploring design variations relative to a fixed trace ignores this feedback. A thorough study needs to be done on the sensitivity or robustness of multiprocessor address traces. (A few other concerns have been raised, such as the number of shared references in available traces.)

Communication with a co-processor using cache-cache transfers: Several new multiprocessor designs use a second processor for the network interface and communicate between the main processor and the co-processor using cache-to-cache transfers. Results indicate this is actually slower than one might at first expect. Why? What alternatives are there? One suggestion is to have the opposite of a cache on the network interface processor: buffers which push data away into the right place in the memory system (e.g., large blocks into DRAM and small things into the processor cache).

What is the minimum cache miss rate due to communication: As uniprocessor caches get larger, the miss rate approaches the initial load cost (compulsory miss rate). It would seem that multiprocessor caches should tend toward the compulsory miss rate plus a communication factor. How would an optimal cache perform?

Snoopy caches provide communication and replication of data. Replication is what causes the coherence headaches. How does the miss rate (or communication rate) decrease with degree of replication?

Distributed shared memory over ATM: distributed shared memory (or virtual shared memory) has been around for a long time, but it never caught on. To a large degree this is due to the fact that ethernet is too slow to make it interesting. Does ATM change the picture? Design and implement a DSM system over the ATM network available in the dept. Apparently Willi Zwanepoole at Rice developed a nice implementation which could be adapted. 5) Networking

The following projects make use of the ATM network in the department, in particular of the user-level network interface that we've developed. They all deal with experimental software, so you have to be willing to hack around the bugs and gotchas!

Ultra-fast TCP/IP implementation: normal implementations of TCP/IP are slow as hell. A big part of the problem is that TCP/IP normally sits in the kernel, but other problems have to do with the algorithms and implementations themselves. Implement a really fast TCP over user-level ATM and carefully analyze the performance. Werner Vogels has already a prototype ready. It needs some work on the connection set-up an tear-down, but othewise is already wuite fast. He has not done a lot of benchmarking and in particular no study of the performance under congestion has been done.

Ultra-fast RCP implementation: This is similar to the above. Review the literature on fast remote procedure call invocation implementations and implement one for the ATM cluster of workstations.

ATM multicast: The ATM network supports multicast in the switch hardware. Figure out how to control that and use it for some interesting application, e.g., Horus or maybe video broadcasting. 6) Applications

Optimized application kernel implementation: pick your favorite compute-intensive application and beat the hell out of the compute-intensive part. For example, what is the fastest MPEG encoder/decoder you can build? The fastest motion detection algorithm for video scene analysis? The fastest compression/decompression? ... The key here is not just to arrive at a fast implementation but to do the analysis so you can convince others that you've arrived at the best possible implementation.

Truth in SPECmarks: Prepare your own small suite of benchmarks and compare the performance on several workstations to that predicted by the SPECmarks. It would be particularly interesting to put together an application specific suite: for example, a multimedia benchmark suite.

Memory system behavior for large systems: Study the memory system behavior (and maybe other processor performance parameters) for a full-fledged application. ISIS comes to mind: how can you measure the path length from the reception of a message all the way through the 1000's of lines of code `til a reply is sent back? Or some other large system?