Project Suggestions -- Optimizing Compilers Sp'05

Exploiting local memories
Local arrays that are not aliased could be implemented in ASH (application-specific hardware) as local memory banks, improving latency and reducing competition for bandwidth to the main memory. Implement compiler analysis to recognize such cases and update the Pegasus simulator to evaluate its impact.
Dave's reg alloc suggestions
David Koes has suggested some interesting theory papers on register allocation; apply and evaluate these ideas.
Mikkel Thorup, "All structured programs have small tree width and good register allocation", Proves that structured (goto-free) C programs have control flow graphs with treewidth less than 6 making them 6-complex. Provides an approximation algorithm for coloring the intersection graphs of the CFG (register allocation) with optimality bounds within a factor floor(k/2) + 1 and running time O(wn+w^{2.5}n) where w is the most variables which conflict with each other at one point.;
Sampath Kannan and Todd Proebsting, "Register allocation in structured programs";
Hans Bodlaender and Jens Gustedt and Jan Arne Telle, "Linear-time register allocation for a fixed number of registers", A theory paper. It is possible to determine if it is possible to register allocate without spilling if a program is regular in linear time. (But exponential in the number of registers). A linear time algorithm exists even if you allow for reordering of instructions (in order to improve register allocation, not scheduling). If there are register classes, the complexity is linear without rescheduling but polynomial proportional to the number of registers (W-hard) if there is rescheduling. The way they go from local to global is to note that there are at most 2^k local solutions (at least at the boundary). Since this is "constant" they solve for all of them then combine them to get the global solution.
Profiling + opts
Implement profiling in Pegasus and a logical set of optimizations utilizing this information; evaluate performance benefits and how robust performance is in the face of changing execution characteristics (i.e. different dataset than that used during profiling).
Smart hyperblock formation
Hyperblock formation in Pegasus currently simply forms the largest hyperblocks possible. This project would make it more intelligent, likely incorporating profiling information (which currently has not been implemented), and evaluate its impact on ASH and/or C6X targets.
Dynamic pointer disambiguation in SW
Back in 1989, a pure software approach to dynamic pointer disambiguation was described by Nicolau "Run-Time Disambiguation: Coping with Statically Unpredictable Dependencies". It tried to utilize unused slots in a VLIW schedule to add code for the necessary checking and branching to fixup code. Could this be useful targeting the C6X?
Instruction Selection (including automatic utilization of pre/post increments)
Like many DSP architectures, the C6X has pre/post increment versions of memory instructions. This project would add analysis and transformations to our compiler to automatically generate these addressing modes and evaluate the improvement.
Semi-dynamic optimization: loop version selection
Vector compilers would often make two versions of a loop, one vectorized and the other not. If the loop count was too low to offset the vector startup overhead, it would branch to the non-vectorized version. There is a similar situation in reconfigurable computing, but the reconfiguration overhead for an accelerated loop is typically much more significant -- would a similar approach make sense? What about "while" loops where the iteration count is not known at entry? Could recent executions be used to predict future iteration counts?
Predicate Demotion
The Pegasus IR is aggressively speculative in that all 'safe' operations in a hyperblock are executed even when their output will end up being discarded. This is called "full predicate promotion" in some literature. When targeting the C6X, this unfortunately greatly increases register pressure in large hyperblocks where several control paths have been combined. But we are not yet utilizing the C6X's full predication support; if we start predicating all operations, then operations from alternative paths can in fact use the same set of registers since only one will be enabled any given hyperblock execution.
Simultaneous scheduling and register allocation
...for cluster architectures such as the C6X.
Adapt Aiken's BANSHEE pointer analysis
Prof. Aiken's group has developed a very fast and scalable points-to analysis for C; this project would find a way to incorporate it into the SUIF framework for use by existing and new Pegasus optimizations and evaluate its benefit for hardware and/or C6X targets. See http://banshee.sourceforge.net/.
Software Pipelining
Create a cyclic scheduler that can overlap the execution of successive iterations of a loop, being aware of cluster constraints and register pressure.
Implement if-conversion and target a non-predicated architecture (even the c6x without predicates)